all posts
Agents
Evaluating agentic systems in production
An agent that scores well on a static eval set can still fail constantly in production, because production is not a static eval set. Tools time out, inputs are adversarial, and the cost of a wrong action is real.
Three signals we trust
- Task completion against a graded rubric, sampled from live traffic and scored asynchronously.
- Tool-call validity — did the agent call real tools with well-formed arguments?
- Cost and latency per resolved task, not per token. Tokens are an input; resolved tasks are the output.
We don’t ask “is the model good?” We ask “is this agent, with these tools, resolving these tasks, today?”
The harness runs continuously, samples a fraction of real sessions, and pages us on regression — the same way we’d treat a latency SLO.