
How Do I Build a Research Pipeline with the Anthropic Agent SDK?

Jake McCluskey · Advanced · 75 min read

The Agent SDK is how you stop being the human loop between Claude and "the internet plus your tools." Instead of you typing "search this, then check that, then summarize," the SDK runs an agent that does all three itself — tool call, reason, tool call, reason, until the task is done. For research workflows, it's the difference between "this took me an afternoon" and "a cron job produced the report before I was awake." Here's a production-grade research pipeline you can have running tonight.

Why this matters

A research pipeline is the canonical agent use case. You give the agent a research brief, tools to search and read, and a writing model. It iterates: decide what to look up, look it up, notice gaps, look those up too, synthesize. The human isn't in the loop until the report is drafted.

The Agent SDK gives you the orchestration primitives — tool definitions, tool execution, the reasoning loop, stopping conditions, usage tracking — without you reinventing them. You get to focus on the brief and the tools; the SDK runs the loop.

Before you start

You need:

  • Python 3.10+ or Node 20+. This guide uses Python throughout; the Node SDK is equivalent.
  • An Anthropic API key with Claude access. Generate from console.anthropic.com.
  • A web search tool you trust — Brave Search API, Tavily, or SerpAPI all work. I'll use Tavily; swap as needed.
  • A research brief to test with. A good first one: "State of LLM evaluation tools as of Q2 2026." Narrow scope, public signal, checkable output.

Step 1: Install the SDK

bash
pip install anthropic
pip install tavily-python  # or brave-search / serpapi

Set env vars:

bash
export ANTHROPIC_API_KEY="sk-ant-..."
export TAVILY_API_KEY="tvly-..."

Step 2: Define your tools

The agent needs tools. For a research pipeline, minimum viable:

python
import os
from anthropic import Anthropic
from tavily import TavilyClient

client = Anthropic()
tavily = TavilyClient(api_key=os.environ["TAVILY_API_KEY"])

tools = [
    {
        "name": "web_search",
        "description": "Search the web for current information. Returns top results with title, URL, and snippet.",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "The search query"},
                "num_results": {"type": "integer", "default": 5},
            },
            "required": ["query"],
        },
    },
    {
        "name": "fetch_page",
        "description": "Fetch the content of a specific URL and return its text.",
        "input_schema": {
            "type": "object",
            "properties": {
                "url": {"type": "string", "description": "The URL to fetch"},
            },
            "required": ["url"],
        },
    },
]

def execute_tool(name, tool_input):
    if name == "web_search":
        results = tavily.search(
            tool_input["query"],
            max_results=tool_input.get("num_results", 5),
        )
        return results
    if name == "fetch_page":
        # Use tavily's extract, or requests + BeautifulSoup
        content = tavily.extract(tool_input["url"])
        return content
    raise ValueError(f"Unknown tool: {name}")
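
If you'd rather not route page fetches through Tavily, here is a minimal fetch_page alternative using requests and BeautifulSoup, as the comment above suggests. The max_chars truncation is deliberate; Step 6 and "Where this breaks" explain why:

python
import requests
from bs4 import BeautifulSoup

def fetch_page_text(url: str, max_chars: int = 5000) -> str:
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    # Drop non-content tags before extracting text
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()
    text = soup.get_text(separator="\n", strip=True)
    # Truncate so one large page can't flood the context window
    return text[:max_chars]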

Keep the tool set small at first. Two or three is enough for a research pipeline. More tools = more decisions for the agent = more chances to wander off the brief.

Step 3: Run the agent loop

The loop: send messages + tools; if the response comes back with stop_reason: tool_use, execute the tools, append the results, and send again. Repeat until stop_reason: end_turn.

python
def run_research(brief: str, max_iterations: int = 15):
    messages = [
        {
            "role": "user",
            "content": f"""You are a research assistant. Produce a thorough report
on the following brief. Use the web_search and fetch_page tools to gather
current information. Cite every specific claim with a URL.

Stop and produce the final report when you have enough material — don't
over-research. Aim for 800-1500 words.

Brief: {brief}""",
        }
    ]

    for i in range(max_iterations):
        response = client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=8192,
            tools=tools,
            messages=messages,
        )

        if response.stop_reason == "end_turn":
            # Agent is done; extract the final text
            final_text = ""
            for block in response.content:
                if block.type == "text":
                    final_text += block.text
            return final_text

        if response.stop_reason == "tool_use":
            # Execute tools, append results to messages
            messages.append({"role": "assistant", "content": response.content})
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    result = execute_tool(block.name, block.input)
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": str(result),
                    })
            messages.append({"role": "user", "content": tool_results})
            continue

        raise RuntimeError(f"Unexpected stop_reason: {response.stop_reason}")

    raise RuntimeError(f"Exceeded {max_iterations} iterations without completing")

Replace the model name with whichever Claude model your account has access to; the loop itself doesn't change between models.

Step 4: Run it on a real brief

python
if __name__ == "__main__":
    report = run_research("State of LLM evaluation tools as of Q2 2026")
    print(report)

Watch the agent work (add a print(f"iter {i}: {response.stop_reason}") inside the loop if you want the play-by-play). You'll see it search, fetch, search again, synthesize. It usually converges in 4-8 iterations.

Step 5: Add structured output

Freeform reports are fine for reading. For pipelines that feed into anything downstream (a CMS, a database, another agent), force structured output:

python
# Append to the initial prompt:
"""
Return the final report as valid JSON with this shape:

{
  "title": "string",
  "tldr": "string, max 2 sentences",
  "sections": [
    {"heading": "string", "body": "string", "sources": ["url1", "url2"]}
  ],
  "key_facts": ["fact 1 with inline citation", ...]
}

Do not include any prose outside the JSON block.
"""

Then parse the output in your pipeline. Schema validation catches hallucinated structure.
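
A minimal parse step; the required-keys check stands in for real schema validation (pydantic or jsonschema would be stricter, and both are assumptions here, not requirements):

python
import json

def parse_report(raw: str) -> dict:
    # Strip a markdown fence if the model wrapped the JSON anyway
    raw = raw.strip().removeprefix("```json").removesuffix("```").strip()
    report = json.loads(raw)  # raises on malformed JSON
    missing = {"title", "tldr", "sections", "key_facts"} - report.keys()
    if missing:
        raise ValueError(f"Report missing keys: {missing}")
    return report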

For stricter enforcement, use tool use for structured output: define a submit_report tool with the schema, force the agent to call it to end the task, and read the structured input from that tool call.
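
A sketch of that pattern; the submit_report name is illustrative, and the schema mirrors the JSON shape above:

python
report_tool = {
    "name": "submit_report",
    "description": "Submit the final research report. Call exactly once, when research is complete.",
    "input_schema": {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "tldr": {"type": "string"},
            "sections": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "heading": {"type": "string"},
                        "body": {"type": "string"},
                        "sources": {"type": "array", "items": {"type": "string"}},
                    },
                    "required": ["heading", "body", "sources"],
                },
            },
            "key_facts": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["title", "tldr", "sections", "key_facts"],
    },
}

def extract_report(response):
    # In the Step 3 loop, treat a submit_report call as the terminal state
    for block in response.content:
        if block.type == "tool_use" and block.name == "submit_report":
            return block.input  # already shaped by the schema
    return None

Add report_tool to the tools list. If the agent keeps ending in prose instead of calling it, a final request with tool_choice={"type": "tool", "name": "submit_report"} forces the call.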

Step 6: Wrap with safety rails

Production research agents need limits:

  • Iteration cap — as above, fail if it runs too long.
  • Token budget — sum usage.input_tokens + usage.output_tokens across iterations; bail if over budget (see the sketch after this list).
  • Rate limiting — if your search tool charges per call, cap calls per run.
  • Domain allowlist — reject fetch_page URLs outside a list of domains you trust. Prevents the agent from fetching adversarial pages.
  • Logging — log every tool call, its input, and its output. When a report goes off the rails, the log is how you debug.
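
A sketch of the token budget and domain allowlist, assuming the Step 3 loop; the domain list and budget number are placeholders:

python
from urllib.parse import urlparse

ALLOWED_DOMAINS = {"github.com", "arxiv.org", "docs.python.org"}  # placeholder list
TOKEN_BUDGET = 200_000  # placeholder; tune to your cost tolerance

def check_domain(url: str) -> None:
    host = urlparse(url).hostname or ""
    if not any(host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS):
        raise ValueError(f"Domain not allowlisted: {url}")

# Inside the Step 3 loop, after each client.messages.create call:
#     tokens_used += response.usage.input_tokens + response.usage.output_tokens
#     if tokens_used > TOKEN_BUDGET:
#         raise RuntimeError(f"Token budget blown: {tokens_used}")
# And in execute_tool, call check_domain(tool_input["url"]) before fetching.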

Skip these in a notebook; have them on by the time you schedule the agent or point it at real work.

Verify it worked

1. One complete run produces a report. On a narrow, testable brief, the final text should cover the topic with real citations. Spot-check two citations — they should exist and say roughly what the report claims.

2. Iteration count is sane. Most briefs should finish in 4-10 iterations. If yours is hitting 15, either the brief is too broad or the agent is over-researching. Tighten the stop criterion in the prompt.

3. Cost per run is predictable. Log usage each iteration. A typical research run is $0.05-$0.30 in input + output tokens. If yours is $5, something's wrong — check for tool results that are dumping huge raw HTML into context.

Where this breaks

  • Unbounded fetching. A page with 50KB of content eats context every iteration. Either truncate in the tool execution (content[:5000]), or have the tool extract just the main article text, not the whole HTML.
  • Looping on unproductive queries. Agent keeps searching the same thing with slight variations because results are mediocre. Fix: add a tool-use count per query prefix, or include in the system prompt "if a search returns low-quality results, pivot the query substantially instead of tweaking."
  • Citations that don't exist. The agent can fabricate a URL that looks right but isn't. Always validate URLs in post-processing — HEAD request each cited URL and flag 404s (see the sketch after this list).
  • The agent deciding the brief is "done" too early. Common if the initial search returns a lot of surface-level results. Mitigate by requiring a minimum section count or a minimum citation count in the output schema.
  • Tool result schema drift. Your search provider changes their response format. The agent gets unfamiliar data and behaves weirdly. Normalize tool results to a stable internal schema, regardless of provider.
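
The citation check from the list above, as a sketch. Some servers reject HEAD requests outright, so treat a False as "flag for review", not "definitely fabricated":

python
import requests

def validate_citations(urls: list[str]) -> dict[str, bool]:
    """HEAD each cited URL; False means flag for human review."""
    status = {}
    for url in set(urls):
        try:
            resp = requests.head(url, timeout=10, allow_redirects=True)
            status[url] = resp.status_code < 400
        except requests.RequestException:
            status[url] = False
    return status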

Frequently asked

What's the difference between the Agent SDK and just using tool use?

Tool use gives you the building blocks — a single request-response with a tool call. The Agent SDK (or your hand-rolled equivalent) is the orchestration around it: the loop, stopping conditions, tool dispatch, error handling, usage tracking. You can build the SDK pattern yourself in 50 lines; use the official SDK when you want the conveniences without the boilerplate.

How many iterations should a research agent need?

Most narrow briefs converge in 4-8 iterations. If yours is hitting 15+ regularly, the brief is too broad or the stop criterion is too loose. Add 'stop when you have 3-5 credible sources and a clear narrative' to the prompt.

Can I use this with the Batch API?

Yes for the Claude calls, with caveats. Agent loops are inherently sequential — iteration 2 depends on iteration 1's tool result — so you can't batch within a single run. But if you're running the same research agent across 100 briefs, you can batch each iteration step across all of them: submit iteration N for every in-flight brief as one batch, execute the resulting tool calls locally, then submit iteration N+1. The Batch API's 50% discount is what cuts the bill in half, in exchange for latency.
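
A compressed sketch of that orchestration, reusing client, tools, and execute_tool from Step 2. Polling and error handling are simplified, and batches can take minutes to hours to finish:

python
import time

def advance_runs(runs: dict[str, list]) -> None:
    """runs maps brief IDs to message histories; advances each by one iteration."""
    batch = client.messages.batches.create(
        requests=[
            {
                "custom_id": brief_id,
                "params": {
                    "model": "claude-sonnet-4-5",
                    "max_tokens": 8192,
                    "tools": tools,
                    "messages": messages,
                },
            }
            for brief_id, messages in runs.items()
        ]
    )
    while client.messages.batches.retrieve(batch.id).processing_status != "ended":
        time.sleep(60)
    for entry in client.messages.batches.results(batch.id):
        if entry.result.type != "succeeded":
            runs.pop(entry.custom_id, None)  # drop errored runs for simplicity
            continue
        message = entry.result.message
        if message.stop_reason == "end_turn":
            runs.pop(entry.custom_id)  # finished; collect the report here
            continue
        # Same bookkeeping as the Step 3 loop: assistant turn, then tool results
        runs[entry.custom_id].append({"role": "assistant", "content": message.content})
        runs[entry.custom_id].append({
            "role": "user",
            "content": [
                {"type": "tool_result", "tool_use_id": b.id,
                 "content": str(execute_tool(b.name, b.input))}
                for b in message.content if b.type == "tool_use"
            ],
        })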

How do I stop the agent from hallucinating citations?

Always validate cited URLs post-run — HEAD request each one, flag 404s. For stricter checks, have the agent echo back a verbatim quote from the source and compare to what it actually claimed. Hallucinated citations fail the quote match.

What's the fastest path from prototype to production?

Three steps: add structured JSON output so downstream code can consume it, add iteration and token caps so runaway costs are impossible, and wrap the whole thing in a retry-with-exponential-backoff for transient API errors. Everything else (observability, user permissions, multi-tenant routing) is scaffolding on top of that foundation.
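
A minimal backoff wrapper for the Claude call. The anthropic SDK already retries transient failures a couple of times by default; this sketch is for when you want explicit control on top of that:

python
import time
from anthropic import APIConnectionError, APIStatusError

def create_with_backoff(max_attempts: int = 5, **kwargs):
    for attempt in range(max_attempts):
        try:
            return client.messages.create(**kwargs)
        except (APIConnectionError, APIStatusError) as e:
            # Only retry transient failures: connection drops, 429s, 5xx overloads
            transient = isinstance(e, APIConnectionError) or e.status_code == 429 or e.status_code >= 500
            if not transient or attempt == max_attempts - 1:
                raise
            time.sleep(2 ** attempt)  # 1s, 2s, 4s, 8s between attempts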