Agents 11 min read

Evaluating agentic systems in production

An agent that scores well on a static eval set can still fail routinely in production, because production is not a static eval set. Tools time out, some inputs are adversarial, and a wrong action has a real cost.

Three signals we trust

  • Task completion against a graded rubric, sampled from live traffic and scored asynchronously.
  • Tool-call validity: did the agent call tools that actually exist, with well-formed arguments?
  • Cost and latency per resolved task, not per token. Tokens are an input; resolved tasks are the output.

We don’t ask “is the model good?” We ask “is this agent, with these tools, resolving these tasks, today?”
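
As a rough illustration of how the three signals could be computed from sampled sessions, here is a minimal Python sketch. Everything named in it is an assumption made for the example, not our production schema: the `TOOL_SCHEMAS` registry, the `Session` and `ToolCall` records, the tool names, and the 0.8 pass threshold are all hypothetical. The rubric score is assumed to arrive from an asynchronous grader as a number between 0 and 1.

```python
from dataclasses import dataclass

# Hypothetical tool registry: tool name -> required argument names and types.
TOOL_SCHEMAS = {
    "search_orders": {"customer_id": str, "limit": int},
    "issue_refund": {"order_id": str, "amount_cents": int},
}


@dataclass
class ToolCall:
    name: str
    args: dict


@dataclass
class Session:
    tool_calls: list      # list of ToolCall
    rubric_score: float   # 0.0 to 1.0, filled in asynchronously by a grader
    cost_usd: float       # total spend for the session
    latency_s: float      # wall-clock time for the session


def is_valid_call(call, schemas=TOOL_SCHEMAS):
    """True if the tool exists and the arguments match its schema exactly."""
    schema = schemas.get(call.name)
    if schema is None:
        return False  # the agent called a tool that does not exist
    if set(call.args) != set(schema):
        return False  # missing or unexpected arguments
    return all(isinstance(call.args[key], typ) for key, typ in schema.items())


def summarize(sessions, pass_threshold=0.8):
    """Compute the three signals over a batch of sampled sessions."""
    calls = [call for s in sessions for call in s.tool_calls]
    resolved = [s for s in sessions if s.rubric_score >= pass_threshold]
    total_cost = sum(s.cost_usd for s in sessions)
    return {
        "completion_rate": len(resolved) / len(sessions) if sessions else None,
        "tool_call_validity": (
            sum(is_valid_call(c) for c in calls) / len(calls) if calls else None
        ),
        # Failed sessions still cost money, so all spend is divided
        # by the number of resolved tasks.
        "cost_per_resolved_usd": total_cost / len(resolved) if resolved else None,
        # Mean latency over resolved sessions only (one possible definition).
        "latency_per_resolved_s": (
            sum(s.latency_s for s in resolved) / len(resolved) if resolved else None
        ),
    }


if __name__ == "__main__":
    batch = [
        Session([ToolCall("search_orders", {"customer_id": "c1", "limit": 5})],
                rubric_score=0.9, cost_usd=0.5, latency_s=10.0),
        Session([ToolCall("cancel_everything", {})],
                rubric_score=0.2, cost_usd=0.25, latency_s=30.0),
    ]
    print(summarize(batch))
```

Dividing total spend by resolved tasks, rather than averaging cost over all sessions, is what makes the metric punish an agent that burns tokens on tasks it never finishes. The exact-match schema check is deliberately strict; a real harness might instead validate arguments against whatever schema format its tool layer already uses.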

The harness runs continuously, samples a fraction of real sessions, and pages us on regression, just as a breached latency SLO would.
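
The sampling and paging loop can be sketched in a few lines of Python. This is a sketch under stated assumptions, not a description of our harness: `should_sample` and `check_regression` are hypothetical helpers, and the 5-point allowed drop and 50-sample minimum are placeholder values. Sampling hashes the session ID so that the same session always gets the same decision, and the regression check compares a recent window of rubric scores against a fixed baseline, much like an SLO threshold.

```python
import hashlib


def should_sample(session_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of sessions by hashing the ID."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < rate * 2**64


def check_regression(window_scores, baseline, max_drop=0.05, min_samples=50):
    """Return an alert message if the window mean fell too far below baseline.

    Returns None when there is no regression or too little data to judge.
    """
    if len(window_scores) < min_samples:
        return None  # avoid paging on a handful of noisy samples
    current = sum(window_scores) / len(window_scores)
    if baseline - current > max_drop:
        return (
            f"task completion regressed: {current:.3f} vs baseline "
            f"{baseline:.3f} over {len(window_scores)} sessions"
        )
    return None


if __name__ == "__main__":
    # Hypothetical wiring: in practice the alert would go to a pager,
    # and the scores would come from the asynchronous grader.
    sampled = [sid for sid in (f"session-{i}" for i in range(1000))
               if should_sample(sid, rate=0.1)]
    print(f"sampled {len(sampled)} of 1000 sessions")
    alert = check_regression([0.6] * 60, baseline=0.85)
    if alert:
        print("PAGE:", alert)
```

Hash-based sampling keeps the decision stable across retries and across services that see the same session ID, which random sampling does not. The minimum-sample guard trades a slower alert for fewer false pages; where to set it depends on traffic volume.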