How to Build Evals That Actually Work for AI Agents

Anthropic's research team spent a year building evals for Claude Code and multi-agent systems. Here's what actually works: start with 20-50 real failures, combine deterministic and model-based graders, and read the transcripts. Teams that invest early ship faster.

TL;DR

  • Evals are the difference between shipping confidently and flying blind — teams without them get stuck fixing one bug while creating three others
  • Start with 20-50 real failure cases, not hundreds of synthetic tasks. Early agents need small, high-signal test suites
  • Combine deterministic graders (unit tests), model-based rubrics, and human calibration. No single layer catches everything
  • Read the transcripts. If you're not reviewing actual agent runs, you have no idea if your graders are measuring what matters

The Big Picture

Most teams building AI agents hit the same wall. Early prototypes work great. Manual testing feels sufficient. Then production happens.

Users report the agent "feels worse" after an update. You can't reproduce the issue. You fix one bug and break two others. You have no baseline to compare against. You're flying blind.

This is what happens when you skip evals. Anthropic's research team has spent the last year building evaluation systems for everything from Claude Code to multi-agent harnesses. They've learned what works across coding agents, conversational agents, research agents, and computer use agents in production.

The pattern is consistent: teams that invest in evals early ship faster. Teams that wait get bogged down in reactive loops, unable to distinguish real regressions from noise. The value compounds, but only if you treat evals as infrastructure, not an afterthought.

Here's what actually works.

Why Evals Break Down for Agents

Single-turn LLM evals are straightforward. Send a prompt, check the response, grade it. Done.

Agents are different. They operate over many turns, calling tools, modifying state, adapting based on intermediate results. The same capabilities that make agents useful — autonomy, intelligence, flexibility — make them harder to evaluate.

Mistakes propagate. An agent that misreads a file in turn 3 might spend the next 20 turns debugging the wrong problem. Frontier models find creative solutions that break static test assumptions. Opus 4.5 discovered a loophole in a flight-booking policy that technically "failed" the eval but actually solved the user's problem better than the intended solution.

Non-determinism compounds the challenge. The same agent on the same task might succeed 7 times out of 10. Is 70% good enough? It depends whether you need one working solution or consistent behavior every time.

The Anatomy of an Agent Eval

Before diving into implementation, get the terminology straight. Anthropic's teams use these definitions across all their eval work:

A task is a single test with defined inputs and success criteria. Each attempt at a task is a trial. Because model outputs vary, you run multiple trials to get a reliable signal.

A grader scores some aspect of performance. Tasks can have multiple graders, each containing multiple assertions. A transcript is the complete record of a trial — every API call, tool invocation, and response. The outcome is the final state in the environment. An agent might say "flight booked" in the transcript, but the outcome is whether a reservation actually exists in the database.

An evaluation harness runs evals end-to-end: provides instructions and tools, executes tasks concurrently, records steps, grades outputs, aggregates results. An agent harness is the system that enables a model to act as an agent — it processes inputs, orchestrates tool calls, returns results. When you evaluate "an agent," you're evaluating the harness and the model together.

An evaluation suite is a collection of tasks measuring specific capabilities. A customer support suite might test refunds, cancellations, escalations.

Get these building blocks right and the rest follows.
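As a rough sketch, these building blocks map naturally onto a few data structures. The names here are illustrative, not Anthropic's internal API; the key idea is that graders score the transcript and the outcome, and the outcome is real environment state, not what the agent claimed:

```python
from dataclasses import dataclass, field

@dataclass
class GraderResult:
    name: str
    passed: bool
    detail: str = ""

@dataclass
class Trial:
    transcript: list   # every API call, tool invocation, and response
    outcome: dict      # final environment state, e.g. {"reservation_exists": True}
    grades: list = field(default_factory=list)

@dataclass
class Task:
    task_id: str
    instructions: str
    graders: list      # callables: (transcript, outcome) -> GraderResult
    trials: list = field(default_factory=list)

    def run_trial(self, agent):
        # `agent` stands in for the agent harness: it takes instructions
        # and returns a transcript plus the final environment state.
        transcript, outcome = agent(self.instructions)
        trial = Trial(transcript, outcome)
        trial.grades = [g(transcript, outcome) for g in self.graders]
        self.trials.append(trial)
        return trial

# A grader checks the outcome, not the agent's claims: "flight booked" in the
# transcript means nothing unless the reservation actually exists.
def reservation_exists(transcript, outcome):
    return GraderResult("reservation_exists",
                        outcome.get("reservation_exists", False))
```

An evaluation suite is then just a list of such tasks, and the harness is whatever runs them concurrently and aggregates the grades.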

Three Types of Graders

Agent evals typically combine three grader types, each with different tradeoffs:

Code-based graders are deterministic. Unit tests, state checks, exact string matches. They're fast, reliable, and unambiguous. Use them wherever possible. Coding agents are natural fits — does the code run? Do the tests pass? SWE-bench Verified uses this approach: give agents GitHub issues, grade solutions by running the test suite. A solution passes only if it fixes failing tests without breaking existing ones.
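A minimal sketch of a code-based grader for a hypothetical password-validation task: execute the agent's code, then run deterministic unit checks against it. A solution passes only if it runs and every assertion holds:

```python
def grade_solution(candidate_code: str) -> bool:
    """Deterministic grader: does the code run, and do the tests pass?"""
    namespace = {}
    try:
        exec(candidate_code, namespace)           # does the code run at all?
        is_valid = namespace["is_valid_password"]
        assert is_valid("hunter2") is True        # normal password accepted
        assert is_valid("") is False              # empty password rejected
        assert is_valid(None) is False            # null password rejected
        return True
    except Exception:
        return False

# A correct solution and a trivially wrong one that "looks" plausible:
good = "def is_valid_password(pw):\n    return bool(pw)"
bad = "def is_valid_password(pw):\n    return True"
```

The grader is fast, unambiguous, and cannot be argued with, which is exactly why it should be the foundation wherever the task allows it.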

Model-based graders use LLMs to judge outputs against rubrics. They handle subjective tasks where multiple solutions are valid. A research agent's report might be comprehensive in different ways. A support agent's tone might be empathetic through different phrasings. Model graders need careful calibration against human judgment. Give them structured rubrics. Grade each dimension separately rather than asking one LLM to judge everything at once. Provide an escape hatch — let the model return "Unknown" when it lacks information.
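The rubric advice above can be sketched as follows. The `judge` callable is a placeholder for a real LLM call; the structure is what matters: one isolated judgment per rubric dimension, and an explicit "Unknown" escape hatch:

```python
# Hypothetical rubric for a support agent; each dimension is judged separately.
RUBRIC = {
    "empathy": "Did the reply acknowledge the customer's frustration?",
    "grounding": "Is every factual claim supported by a tool result in the transcript?",
}

def grade_dimension(judge, dimension, question, transcript) -> str:
    """Ask the judge about ONE dimension. `judge` is any (prompt) -> str
    callable; in production it would wrap an LLM API call."""
    prompt = (
        f"Rubric dimension: {dimension}\n"
        f"Question: {question}\n"
        f"Transcript:\n{transcript}\n\n"
        "Answer PASS, FAIL, or UNKNOWN. Use UNKNOWN if the transcript "
        "does not contain enough information to judge."   # escape hatch
    )
    verdict = judge(prompt).strip().upper()
    # Anything malformed is treated as UNKNOWN rather than a guess.
    return verdict if verdict in {"PASS", "FAIL", "UNKNOWN"} else "UNKNOWN"

def grade_rubric(judge, transcript):
    # One isolated LLM-as-judge per dimension, not one judge for everything.
    return {d: grade_dimension(judge, d, q, transcript)
            for d, q in RUBRIC.items()}
```

Grading dimensions in isolation keeps each judgment focused, and the forced UNKNOWN option reduces hallucinated verdicts on thin evidence.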

Human graders are the gold standard for subjective quality but don't scale. Use them to calibrate model graders, not to grade every trial. Systematic human studies work for validating LLM judges or evaluating outputs where human consensus is the reference standard. Reserve them for high-stakes decisions.

The best evals layer all three. Unit tests verify correctness. Model rubrics assess quality. Periodic human review catches what automation misses.

Coding Agents: When Tests Are Truth

Coding agents have a natural advantage: software is straightforward to evaluate. Does the code run? Do the tests pass?

SWE-bench Verified and Terminal-Bench both use deterministic graders as the foundation. Terminal-Bench tests end-to-end technical tasks like building a Linux kernel or training an ML model. Pass-or-fail tests validate key outcomes.

But passing tests isn't the whole story. You also want to grade the transcript. Heuristics-based code quality rules can evaluate generated code beyond correctness. Model-based graders with clear rubrics assess behaviors like tool usage or user interaction.

Consider a task where an agent must fix an authentication bypass vulnerability. You'd use:

  • Deterministic tests — required unit tests must pass (test_empty_pw_rejected.py, test_null_pw_rejected.py)
  • LLM rubric — assess overall code quality against a defined rubric
  • Static analysis — run ruff, mypy, bandit to catch issues
  • State check — verify security logs show auth_blocked events
  • Tool calls — confirm the agent read auth files, edited code, ran tests

In practice, most coding evals rely on unit tests for correctness and an LLM rubric for quality. Add other graders only as needed.

Conversational Agents: When the Interaction Matters

Conversational agents present a distinct challenge: the quality of the interaction itself is part of what you're evaluating.

Unlike coding agents where correctness is binary, conversational agents operate in a multidimensional space. Is the ticket resolved? Did it finish in under 10 turns? Was the tone appropriate? All three matter.

Benchmarks like τ-Bench and τ2-Bench simulate multi-turn interactions across domains like retail support and airline booking. One model plays a user persona while the agent navigates realistic scenarios. Success requires both task completion and interaction quality.

For a support task handling a frustrated customer's refund, you'd combine:

  • LLM rubric — verify the agent showed empathy, explained the resolution clearly, grounded responses in tool results
  • State check — confirm the ticket status is resolved and refund is processed
  • Tool calls — ensure the agent verified identity, processed the refund with correct amount, sent confirmation
  • Transcript constraint — complete the task in under 10 turns
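The deterministic layers of that refund task might look like the sketch below (the transcript schema and tool names are illustrative; the LLM rubric for empathy runs separately):

```python
def grade_support_trial(transcript, outcome, max_turns=10):
    """Deterministic checks for a refund task. Each transcript entry is
    assumed to be a dict like {"role": ..., "tool": ..., "args": ...}."""
    called = {e["tool"] for e in transcript if e.get("tool")}
    return {
        # State checks: the outcome in the environment, not the agent's claims.
        "ticket_resolved": outcome.get("ticket_status") == "resolved",
        "refund_correct": outcome.get("refund_amount") == outcome.get("expected_amount"),
        # Tool calls: the required actions actually happened.
        "verified_identity": "verify_identity" in called,
        "sent_confirmation": "send_confirmation" in called,
        # Transcript constraint: finished within the turn budget.
        "within_turn_limit": sum(1 for e in transcript
                                 if e.get("role") == "assistant") <= max_turns,
    }
```

Note that the tool-call checks verify *which* tools ran, not the exact sequence; demanding a rigid order would make the test brittle, as discussed later.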

Model-based graders dominate here because many tasks have multiple correct solutions. The key is calibrating those graders against expert human judgment.

Research Agents: When Ground Truth Shifts

Research agents gather, synthesize, and analyze information. Quality is relative to the task. What counts as "comprehensive" for a market scan differs from due diligence for an acquisition.

Research evals face unique challenges. Experts disagree on whether a synthesis is complete. Reference content changes constantly. Longer outputs create more room for mistakes.

The solution: combine grader types strategically. Groundedness checks verify claims are supported by retrieved sources. Coverage checks define key facts a good answer must include. Source quality checks confirm consulted sources are authoritative, not just first retrieved.

For objectively correct answers ("What was Company X's Q3 revenue?"), exact match works. For open-ended synthesis, LLM rubrics verify coherence and completeness.
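Groundedness and coverage checks can be approximated crudely with string matching, as in this sketch. A real system would use an LLM or entailment model for the support check; the naive substring version is only meant to show the shape of the metrics:

```python
def groundedness(claims, sources):
    """Fraction of extracted claims that appear (naive substring match)
    in at least one retrieved source."""
    if not claims:
        return 1.0
    supported = sum(1 for c in claims
                    if any(c.lower() in s.lower() for s in sources))
    return supported / len(claims)

def coverage(report_text, key_facts):
    """Fraction of must-include facts that the report actually mentions."""
    if not key_facts:
        return 1.0
    found = sum(1 for f in key_facts if f.lower() in report_text.lower())
    return found / len(key_facts)
```

Both return a score in [0, 1], which makes them easy to threshold per task ("groundedness must exceed 0.9") or to track as a trend across eval runs.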

Given the subjective nature of research quality, calibrate LLM-based rubrics frequently against expert human judgment.

Computer Use Agents: When the GUI Is the Interface

Computer use agents interact through screenshots, mouse clicks, keyboard inputs — the same interface humans use. They can use any application with a GUI, from design tools to legacy enterprise software.

Evaluation requires running the agent in a real or sandboxed environment and checking whether it achieved the intended outcome. WebArena tests browser-based tasks using URL and page state checks, plus backend state verification for tasks that modify data. OSWorld extends this to full OS control, with evaluation scripts that inspect file system state, application configs, database contents, UI element properties.

Browser use agents face a specific tradeoff: DOM-based interactions execute quickly but consume many tokens. Screenshot-based interactions are slower but more token-efficient. When asking Claude to summarize Wikipedia, extract text from the DOM. When finding a laptop case on Amazon, take screenshots — extracting the entire DOM is token-intensive.

Anthropic developed evals for Claude for Chrome to check that the agent selects the right tool for each context. This enabled faster, more accurate browser-based task completion.

The Non-Determinism Problem

Agent behavior varies between runs. A task that passed on one eval run might fail on the next. Sometimes what you want to measure is how often an agent succeeds.

Two metrics capture this nuance:

pass@k measures the likelihood that an agent gets at least one correct solution in k attempts. As k increases, pass@k rises. More shots on goal means higher odds of at least one success. A score of 50% pass@1 means the model succeeds at half the tasks on the first try. In coding, you're often most interested in pass@1. In other cases, proposing many solutions is valid as long as one works.

pass^k measures the probability that all k trials succeed. As k increases, pass^k falls since demanding consistency across more trials is harder. If your agent has a 75% per-trial success rate and you run 3 trials, the probability of passing all three is (0.75)³ ≈ 42%. This metric matters for customer-facing agents where users expect reliable behavior every time.

At k=1, both metrics are identical. By k=10, they tell opposite stories: pass@k climbs toward 100% while pass^k falls toward 0%.

Which to use depends on product requirements. Use pass@k for tools where one success matters. Use pass^k for agents where consistency is essential.
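Both metrics are a few lines of code. For pass@k, the standard unbiased estimator from the Codex paper (1 − C(n−c, k)/C(n, k), given c successes in n recorded trials) avoids the bias of naively resampling; pass^k follows directly from the per-trial success rate:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k trials passes,
    estimated from n recorded trials with c successes."""
    if n - c < k:          # fewer than k failures recorded -> guaranteed hit
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(p: float, k: int) -> float:
    """pass^k: probability that ALL k independent trials succeed,
    given per-trial success rate p."""
    return p ** k
```

With the numbers from the text, a 75% per-trial success rate gives pass^3 = 0.75³ ≈ 42%, while 7 successes out of 10 recorded trials gives pass@1 = 70%.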

From Zero to One: A Practical Roadmap

Start Early With Small Datasets

Teams delay building evals because they think they need hundreds of tasks. In reality, 20-50 simple tasks drawn from real failures is a great start.

In early agent development, each change has a clear, noticeable impact. This large effect size means small sample sizes suffice. More mature agents need larger, more difficult evals to detect smaller effects. Take the 80/20 approach in the beginning.

Evals get harder to build the longer you wait. Early on, product requirements naturally translate into test cases. Wait too long and you're reverse-engineering success criteria from a live system.

Start With What You Already Test Manually

Begin with the manual checks you run during development. The behaviors you verify before each release. Common tasks end users try.

If you're already in production, look at your bug tracker and support queue. Converting user-reported failures into test cases ensures your suite reflects actual usage. Prioritizing by user impact helps you invest effort where it counts.

Write Unambiguous Tasks With Reference Solutions

A good task is one where two domain experts would independently reach the same pass/fail verdict. Could they pass the task themselves? If not, the task needs refinement.

Ambiguity in task specifications becomes noise in metrics. The same applies to criteria for model-based graders: vague rubrics produce inconsistent judgments.

Each task should be passable by an agent that follows instructions correctly. Everything the grader checks should be clear from the task description. Agents shouldn't fail due to ambiguous specs.

With frontier models, a 0% pass rate across many trials (0% pass@100) is most often a signal of a broken task, not an incapable agent. Double-check your task specification and graders.

For each task, create a reference solution: a known working output that passes all graders. This proves the task is solvable and verifies graders are correctly configured.
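In practice this becomes a sanity check you run before the task ever sees an agent. A sketch, with hypothetical graders for a refund task:

```python
def validate_task(task_graders: dict, reference_solution: str) -> list:
    """The known-good reference solution must pass every grader.
    Any failure here means the task or a grader is broken, not the agent."""
    return [name for name, grader in task_graders.items()
            if not grader(reference_solution)]   # empty list -> well-configured

# Illustrative graders and reference output:
graders = {
    "mentions_refund": lambda out: "refund" in out.lower(),
    "under_100_words": lambda out: len(out.split()) <= 100,
}
reference = "Your refund of $50 has been processed and confirmed by email."
```

Running `validate_task` in CI for every task catches misconfigured graders before they silently depress your scores.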

Build Balanced Problem Sets

Test both cases where a behavior should occur and where it shouldn't. One-sided evals create one-sided optimization.

If you only test whether the agent searches when it should, you might end up with an agent that searches for almost everything. Avoid class-imbalanced evals.

Anthropic learned this building evals for web search in Claude.ai. The challenge was preventing the model from searching when it shouldn't while preserving its ability to do extensive research when appropriate. The team built evals covering both directions: queries where the model should search (finding the weather) and queries where it should answer from existing knowledge (who founded Apple). Striking the right balance took many rounds of refinements to both prompts and the eval.

Build a Robust Eval Harness With a Stable Environment

The agent in the eval must function roughly the same as the agent in production. The environment itself shouldn't introduce noise.

Each trial should start from a clean environment. Unnecessary shared state between runs (leftover files, cached data, resource exhaustion) causes correlated failures due to infrastructure flakiness rather than agent performance.

Shared state can also artificially inflate performance. In some internal evals, Anthropic observed Claude gaining an unfair advantage by examining git history from previous trials.

If multiple distinct trials fail because of the same limitation in the environment (like limited CPU memory), these trials aren't independent. The eval results become unreliable for measuring agent performance.
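One common pattern for trial isolation is to give every trial its own temporary copy of the task fixture and destroy it afterward, so leftover files, caches, or git history cannot leak between runs. A minimal sketch, assuming the agent is a callable that operates on a working directory:

```python
import shutil
import tempfile
from pathlib import Path

def run_trial_in_clean_env(task_fixture: Path, agent):
    """Each trial gets a fresh copy of the fixture in its own temp dir."""
    workdir = Path(tempfile.mkdtemp(prefix="eval-trial-"))
    try:
        repo = workdir / "repo"
        if task_fixture.exists():
            shutil.copytree(task_fixture, repo)  # pristine copy per trial
        else:
            repo.mkdir()
        return agent(repo)                       # agent sees only this trial's state
    finally:
        shutil.rmtree(workdir, ignore_errors=True)  # nothing survives the trial
```

Container- or VM-based sandboxes follow the same principle at a stronger isolation level; the invariant is identical: trial N must not be able to observe anything trial N−1 did.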

Design Graders Thoughtfully

Choose deterministic graders where possible, LLM graders where necessary, and use human graders judiciously for validation.

Avoid checking that agents followed very specific steps like a sequence of tool calls in the right order. This approach is too rigid and results in overly brittle tests. Agents regularly find valid approaches that eval designers didn't anticipate. Grade what the agent produced, not the path it took.

For tasks with multiple components, build in partial credit. A support agent that correctly identifies the problem and verifies the customer but fails to process a refund is meaningfully better than one that fails immediately.
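Partial credit can be as simple as a weighted average over the task's component checks. A sketch, with illustrative weights for the support example above:

```python
def partial_credit(checks: dict, weights: dict) -> float:
    """Weighted score across a task's components, instead of all-or-nothing."""
    total = sum(weights.values())
    earned = sum(weights[name] for name, passed in checks.items() if passed)
    return earned / total

# The agent identified the problem and verified the customer, but the
# refund itself (weighted highest) failed:
checks = {"identified_problem": True, "verified_customer": True,
          "processed_refund": False}
weights = {"identified_problem": 1, "verified_customer": 1,
           "processed_refund": 2}
```

This agent scores 0.5 rather than 0, so progress on the earlier components still registers in your metrics.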

Model grading takes careful iteration to validate accuracy. LLM-as-judge graders should be closely calibrated with human experts. To avoid hallucinations, give the LLM a way out — provide an instruction to return "Unknown" when it doesn't have enough information. Create clear, structured rubrics to grade each dimension of a task, then grade each dimension with an isolated LLM-as-judge rather than using one to grade all dimensions.

Some evaluations have subtle failure modes that result in low scores even with good agent performance. Opus 4.5 initially scored 42% on CORE-Bench until an Anthropic researcher found multiple issues: rigid grading that penalized "96.12" when expecting "96.124991…", ambiguous task specs, and stochastic tasks that were impossible to reproduce exactly. After fixing bugs and using a less constrained scaffold, Opus 4.5's score jumped to 95%.
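One defensive pattern against that first failure mode is to compare numbers within a tolerance instead of by exact string match. A sketch, assuming the agent returns a free-text answer containing the number:

```python
import math
import re

def grade_numeric(expected: float, answer_text: str,
                  rel_tol: float = 1e-3) -> bool:
    """Extract the first number from the answer and compare within a
    relative tolerance, so a rounded '96.12' is not penalized against
    an expected value of 96.124991."""
    match = re.search(r"-?\d+(?:\.\d+)?", answer_text)
    if match is None:
        return False                 # no number found at all
    return math.isclose(float(match.group()), expected, rel_tol=rel_tol)
```

The right tolerance is task-specific; the point is that the grader's notion of "correct" should match a domain expert's, not a string comparison's.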

Make your graders resistant to bypasses or hacks. The agent shouldn't be able to easily "cheat" the eval. Tasks and graders should be designed so that passing genuinely requires solving the problem rather than exploiting unintended loopholes.

Check the Transcripts

You won't know if your graders are working well unless you read the transcripts and grades from many trials.

At Anthropic, they invested in tooling for viewing eval transcripts and regularly take the time to read them. When a task fails, the transcript tells you whether the agent made a genuine mistake or whether your graders rejected a valid solution. It also surfaces key details about agent and eval behavior.

Failures should seem fair: it's clear what the agent got wrong and why. When scores don't climb, you need confidence that it's due to agent performance and not the eval. Reading transcripts is how you verify that your eval is measuring what actually matters.

Monitor for Capability Eval Saturation

An eval at 100% still tracks regressions but provides no signal for improvement. This is eval saturation: the agent passes every solvable task, leaving nothing left to measure.

SWE-bench Verified scores started the year at 30%. Frontier models are now nearing saturation at over 80%. As evals approach saturation, measured progress slows, because only the most difficult tasks remain. This can make results deceptive: large capability improvements appear as small increases in score.

The code review startup Qodo was initially unimpressed by Opus 4.5 because their one-shot coding evals didn't capture the gains on longer, more complex tasks. In response, they developed a new agentic eval framework, providing a much clearer picture of progress.

Do not take eval scores at face value until someone digs into the details of the eval and reads some transcripts. If grading is unfair, tasks are ambiguous, valid solutions are penalized, or the harness constrains the model, the eval should be revised.

Keep Evaluation Suites Healthy Long-Term

An eval suite is a living artifact that needs ongoing attention and clear ownership to remain useful.

At Anthropic, what proved most effective was establishing dedicated evals teams to own the core infrastructure, while domain experts and product teams contribute most eval tasks and run the evaluations themselves.

For AI product teams, owning and iterating on evaluations should be as routine as maintaining unit tests. Teams can waste weeks on AI features that "work" in early testing but fail to meet unstated expectations that a well-designed eval would have surfaced early.

Practice eval-driven development: build evals to define planned capabilities before agents can fulfill them, then iterate until the agent performs well. Internally, Anthropic often builds features that work "well enough" today but are bets on what models can do in a few months. Capability evals that start at a low pass rate make this visible. When a new model drops, running the suite quickly reveals which bets paid off.

The people closest to product requirements and users are best positioned to define success. With current model capabilities, product managers, customer success managers, or salespeople can use Claude Code to contribute an eval task as a PR. Let them. Or better yet, actively enable them.

Evals Are One Layer in a Larger System

Automated evaluations can run thousands of tasks without deploying to production or affecting real users. But this is just one way to understand agent performance.

A complete picture includes production monitoring, user feedback, A/B testing, manual transcript review, and systematic human evaluation.

Production monitoring reveals real user behavior at scale and catches issues that synthetic evals miss. It's reactive — problems reach users before you know about them — but it provides ground truth on how agents actually perform.

A/B testing measures actual user outcomes like retention and task completion. It controls for confounds and scales systematically. But it's slow — days or weeks to reach significance — and only tests changes you deploy.

User feedback surfaces problems you didn't anticipate and comes with real examples. But it's sparse, self-selected, skews toward severe issues, and users rarely explain why something failed.

Manual transcript review builds intuition for failure modes and catches subtle quality issues automated checks miss. But it's time-intensive, doesn't scale, and coverage is inconsistent.

Systematic human studies provide gold-standard quality judgments from multiple trained raters. They handle subjective or ambiguous tasks and provide signal for improving model-based graders. But they're expensive, slow, and complex domains require human experts.

These methods map to different stages of agent development. Automated evals are especially useful pre-launch and in CI/CD, running on each agent change and model upgrade as the first line of defense. Production monitoring kicks in post-launch to detect distribution drift and unanticipated real-world failures. A/B testing validates significant changes once you have sufficient traffic. User feedback and transcript review are ongoing practices to fill the gaps.

Like the Swiss Cheese Model from safety engineering, no single evaluation layer catches every issue. With multiple methods combined, failures that slip through one layer are caught by another.

The Bottom Line

Use evals if you're building any agent that will see production traffic. Skip them if you're prototyping for a week and throwing it away.

Start with 20-50 tasks drawn from real failures. Don't wait for the perfect suite. Source realistic tasks from bugs you've already seen. Define unambiguous success criteria. Combine deterministic graders, model rubrics, and periodic human review. Read the transcripts — if you're not reviewing actual agent runs, you have no idea if your graders work.

The real risk is waiting too long. Teams that invest early find development accelerates as failures become test cases and metrics replace guesswork. Teams that wait get bogged down in reactive loops, unable to ship confidently. The value compounds, but only if you treat evals as infrastructure, not an afterthought.

AI agent evaluation is still nascent and fast-evolving. As agents take on longer tasks, collaborate in multi-agent systems, and handle increasingly subjective work, techniques will need to adapt. But the fundamentals described here are constant: start early, source realistic tasks, design graders thoughtfully, iterate continuously, and read the transcripts.

Source: Anthropic