How Anthropic Solved the Long-Running Agent Problem

Anthropic cracked the long-running agent problem with a two-phase approach: an initializer that scaffolds the environment, and a coding agent that works incrementally across context windows using git, feature checklists, and mandatory browser testing.

TL;DR

  • Long-running agents fail because each new context window starts with zero memory of previous work
  • Anthropic's solution: an initializer agent that scaffolds the environment, plus a coding agent that works incrementally and leaves clean artifacts
  • Key innovations include a 200+ feature JSON checklist, git-based progress tracking, and mandatory browser automation testing
  • This approach enabled Claude to build production-quality web apps across multiple sessions without losing track or declaring victory too early

The Big Picture

AI agents are hitting a wall. Not a capability wall — a memory wall.

The problem is simple: complex projects take longer than a single context window. An agent working on a multi-day coding task has to stop, lose all its working memory, and restart fresh. It's like staffing a software project with engineers who work in shifts but can't talk to each other.

Anthropic's engineering team hit this wall hard when building the Claude Agent SDK. Even Opus 4.5 — a frontier model — would fail to build a production-quality web app when given a high-level prompt like "build a clone of claude.ai." The agent would either try to one-shot the entire app and run out of context mid-implementation, or it would look around after a few features were done and declare victory.

The core insight: agents need more than just context management. They need environment management. Anthropic's solution splits the work into two specialized prompts: an initializer agent that scaffolds the project on first run, and a coding agent that makes incremental progress while leaving clean artifacts for the next session.

This isn't just a Claude SDK trick. It's a blueprint for any long-running agent architecture. The techniques here — feature checklists, git-based handoffs, mandatory testing — are patterns you can lift into your own agent harnesses.

How It Works

The architecture has two phases. Phase one runs once. Phase two loops indefinitely.

Phase One: The Initializer Agent

The first context window gets a specialized prompt. Its job is to set up the scaffolding that every future agent will rely on. This includes:

  • A comprehensive feature list in JSON format, with every requirement marked as "failing"
  • An init.sh script that can spin up the development server
  • A claude-progress.txt file for session-to-session notes
  • An initial git commit showing the baseline state

For the claude.ai clone example, the initializer agent generated over 200 features. Each one was structured with a description, test steps, and a passes boolean. The format matters: JSON is harder for the model to accidentally corrupt than Markdown.

The prompt explicitly forbids removing or editing features. Only the passes field can change. This prevents the agent from gaming the system by deleting hard features.
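
The same rule can also be enforced by the harness itself, outside the prompt. Here's a minimal sketch, assuming the feature list is a JSON array of objects with a passes field (the function name and sample data are illustrative, not from Anthropic's code):

```python
import json

def only_passes_changed(old_json: str, new_json: str) -> bool:
    """True if the updated feature list differs from the original
    only in its `passes` flags -- the rule the prompt asks for."""
    old, new = json.loads(old_json), json.loads(new_json)
    if len(old) != len(new):
        return False  # a feature was added or deleted
    # Compare every field except the mutable `passes` flag.
    strip = lambda feat: {k: v for k, v in feat.items() if k != "passes"}
    return all(strip(a) == strip(b) for a, b in zip(old, new))

original = '[{"description": "New Chat button works", "passes": false}]'
honest   = '[{"description": "New Chat button works", "passes": true}]'
gamed    = '[{"description": "Trivial feature", "passes": true}]'

print(only_passes_changed(original, honest))  # True
print(only_passes_changed(original, gamed))   # False
```

A check like this turns "please don't delete hard features" from a request into a guarantee.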

Phase Two: The Coding Agent

Every subsequent session uses a different prompt. This one asks the agent to:

  1. Run pwd to confirm the working directory
  2. Read git logs and claude-progress.txt to understand recent work
  3. Read the feature list and pick the highest-priority incomplete feature
  4. Start the dev server using init.sh
  5. Run a basic end-to-end test to verify the app isn't broken
  6. Implement one feature
  7. Test it thoroughly using browser automation
  8. Commit the work to git with a descriptive message
  9. Update claude-progress.txt and mark the feature as passing
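
The outer loop around these sessions can be sketched in a few lines. Here run_coding_session is a stand-in for an actual agent invocation (not a real SDK call), and the stub intentionally does nothing so the hard session cap is what ends the loop:

```python
import json

def incomplete_features(path="feature_list.json"):
    """Features the initializer marked failing that still need work."""
    with open(path) as f:
        return [feat for feat in json.load(f) if not feat["passes"]]

def run_coding_session():
    """Placeholder for one context window. A real harness would invoke
    the agent here with the coding prompt (steps 1-9 above)."""

# Seed a one-feature checklist so the sketch is self-contained.
with open("feature_list.json", "w") as f:
    json.dump([{"description": "New chat button creates a fresh "
                "conversation", "passes": False}], f)

MAX_SESSIONS = 3  # hard cap so a stuck agent cannot loop forever
sessions = 0
while incomplete_features() and sessions < MAX_SESSIONS:
    run_coding_session()
    sessions += 1

print(sessions)  # 3: the stub never flips passes, so the cap fires
```

The loop's exit conditions mirror the article's two failure modes: the feature list prevents quitting early, and the session cap prevents running forever.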

The incremental approach is critical. Early experiments without it showed the agent trying to implement multiple features at once, running out of context, and leaving half-finished code for the next session to untangle.

Git serves as both a progress log and a recovery mechanism. If an agent breaks something, the next session can see the commit history and revert bad changes. This mirrors how real engineering teams work.

Testing as a First-Class Citizen

One major failure mode: Claude would mark features complete without proper end-to-end testing. It might write unit tests or run curl commands, but wouldn't verify the feature worked from a user's perspective.

The solution was mandatory browser automation. The coding agent was given access to the Puppeteer MCP server and explicitly prompted to test every feature as a human would. This caught bugs that weren't obvious from code inspection alone.

There are limits. Claude can't see browser-native alert modals through Puppeteer, so features relying on those tended to be buggier. But the overall improvement was dramatic.

The approach shares DNA with Anthropic's earlier multi-agent harness work, but simplifies the architecture by using prompt variation instead of separate agent types.

What This Changes For Developers

This isn't just an Anthropic internal pattern. It's a set of practices you can apply to any long-running agent system.

Feature lists beat vague goals. Instead of prompting an agent with "build a chat app," give it a structured JSON file with 200 testable features. The agent now has a clear definition of done and can't prematurely declare victory.

Git is your inter-session memory. Agents can't remember across context windows, but git commits can. Descriptive commit messages become the handoff documentation between sessions. This also gives you rollback capability when an agent breaks something.
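
That handoff can be as simple as a well-written commit message. An illustrative sketch using a throwaway repository (the file, message, and identity are made up for the example):

```python
import subprocess
import tempfile

# A throwaway repo standing in for the agent's project directory.
repo = tempfile.mkdtemp()

def git(*args: str) -> str:
    """Run a git command in the sketch repo and return its stdout."""
    result = subprocess.run(
        ["git", "-C", repo, "-c", "user.name=agent",
         "-c", "user.email=agent@example.com", *args],
        capture_output=True, text=True, check=True)
    return result.stdout

git("init", "-q")
with open(f"{repo}/chat.py", "w") as f:
    f.write("# new chat button implementation\n")
git("add", "-A")
# The commit message doubles as the handoff note for the next session.
git("commit", "-q", "-m",
    "Add New Chat button; verified end-to-end in browser; next: sidebar list")

print(git("log", "--oneline", "-1"))
```

The next session's first act is reading exactly this log line, which is why the coding prompt front-loads git log in its startup checklist.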

Testing must be mandatory, not optional. If your agent can mark a feature complete without running an end-to-end test, it will. Build testing into the workflow as a required step, not a suggestion.

Initialization is a separate job. The first context window has different responsibilities than subsequent ones. Use a different prompt for session one. Have it scaffold the environment, write the feature list, and set up the testing infrastructure.

The pattern also suggests when not to use long-running agents. If your task fits in a single context window, this overhead isn't worth it. But for multi-session work — building full apps, conducting research, financial modeling — these techniques become essential.

One open question: is a single general-purpose agent optimal, or would specialized agents (testing agent, QA agent, cleanup agent) perform better? Anthropic's approach uses prompt variation rather than separate agent architectures, but that's not the only way to solve this.

Try It Yourself

Anthropic published a quickstart repository with code examples implementing this pattern. The core structure looks like this:

{
  "category": "functional",
  "description": "New chat button creates a fresh conversation",
  "steps": [
    "Navigate to main interface",
    "Click the 'New Chat' button",
    "Verify a new conversation is created",
    "Check that chat area shows welcome state",
    "Verify conversation appears in sidebar"
  ],
  "passes": false
}

Each feature in the JSON list follows this structure. The agent reads this file at the start of every session, picks an incomplete feature, implements it, tests it, and flips passes to true.
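
In code, that per-session read, pick, and flip cycle looks roughly like this (the two-feature checklist is illustrative; it would normally be written by the initializer agent, not by hand):

```python
import json

# Seed an illustrative checklist so the sketch runs on its own.
with open("feature_list.json", "w") as f:
    json.dump([
        {"description": "New chat button creates a fresh conversation",
         "passes": False},
        {"description": "Conversation appears in sidebar",
         "passes": False},
    ], f)

# Session start: read the checklist and pick the first failing feature.
with open("feature_list.json") as f:
    features = json.load(f)
todo = next(feat for feat in features if not feat["passes"])
print("This session's feature:", todo["description"])

# ...implement the feature and verify it end to end in the browser...

# Only after the test passes: flip the flag and rewrite the file.
todo["passes"] = True
with open("feature_list.json", "w") as f:
    json.dump(features, f, indent=2)
```

Note that the flag flips only after testing, never before — the ordering is what keeps the checklist honest.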

The initializer prompt should explicitly forbid editing or removing features. Use language like "It is unacceptable to remove or edit tests because this could lead to missing or buggy functionality."

For the coding agent prompt, include a startup checklist:

# Run at the start of every session
pwd
cat claude-progress.txt
cat feature_list.json
git log --oneline -20
./init.sh
# Run basic smoke test before starting new work

This forces the agent to orient itself before diving into new features. It also catches bugs left by the previous session early, before they compound.

The Bottom Line

Use this pattern if you're building agents that need to work across multiple context windows on complex, multi-step tasks. The overhead of setting up feature lists, git tracking, and testing infrastructure pays off when your agent needs to maintain state across sessions.

Skip it if your tasks fit comfortably in a single context window. The initialization cost and structured handoffs aren't worth it for short-lived agents.

The real risk here is assuming context management alone solves long-running agent problems. It doesn't. Compaction helps, but without explicit environment scaffolding and inter-session artifacts, your agent will either try to do too much at once or quit too early. The breakthrough is treating each new context window like a new engineer joining the project — and giving them the documentation they need to be productive immediately.

The bigger opportunity: this pattern generalizes beyond coding. Scientific research, financial modeling, content creation — any domain where AI agents need to make progress over days or weeks will need some version of this architecture. Anthropic optimized for web app development, but the core principles (structured goals, persistent artifacts, mandatory verification) apply broadly.

Source: Anthropic