Evaluating agentic systems in production

An agent that scores well on a static eval set can still fail constantly in production, because production is not a static eval set. Tools time out, inputs are adversarial, and the cost of a wrong action is real.

Three signals we trust

Task completion against a graded rubric, sampled from live traffic and scored asynchronously.
Tool-call validity — did the agent call real tools with well-formed arguments?
Cost and latency per resolved task, not per token. Tokens are an input; resolved tasks are the output.

We don’t ask “is the model good?” We ask “is this agent, with these tools, resolving these tasks, today?”

The harness runs continuously, samples a fraction of real sessions, and pages us on regression — the same way we’d treat a latency SLO.