Agentic RAG with Claude: Retrieval as a Decision
White Paper

Agentic RAG with Claude: Retrieval as a Decision

Jake McCluskeyUpdated
Back to white papers

Source topic: "Agentic RAG", called out as a 2026 skill in @datasciencebrain's Engineering Roadmap
Stack: Claude + ChromaDB + Tavily (web fallback) + pure Python (no LangGraph)

What "agentic" actually means here

Naive RAGSelf-Healing RAGAgentic RAG
Always retrieve, always stuff into promptRetrieve, grade, retryClaude decides when to retrieve, from which source, and iteratively refines

The agent has access to multiple retrieval tools (private vector DB, web search, specific document readers) and picks based on the question.

Key mental model: retrieval is a tool, not a pipeline step.

Architecture

                   ┌──────────────┐
     user question │              │
    ─────────────► │    Claude    │
                   │              │
                   └──┬──┬──┬──┬──┘
                      │  │  │  │
          search_docs │  │  │  │ no tool needed
                      │  │  │  └─► direct answer
          web_search ─┘  │  │
                         │  │
           read_full ────┘  │
           (specific doc)   │
                            │
           calculate ───────┘

Claude looks at the question and picks:

  • "What's in our Q3 earnings?" goes to search_internal_docs
  • "Who won the 2026 F1 opener?" goes to web_search
  • "What does that specific contract say about termination?" goes to read_full_document
  • "What's 17% of $4.2M?" goes to calculate, no retrieval

Full implementation

1. Setup

pip install anthropic chromadb tavily-python sentence-transformers
import os, json, uuid
import anthropic, chromadb
from chromadb.utils import embedding_functions
from tavily import TavilyClient

claude = anthropic.Anthropic()
tavily = TavilyClient(api_key=os.environ["TAVILY_API_KEY"])
MODEL = "claude-opus-4-7"

# Local, free embeddings
embed_fn = embedding_functions.SentenceTransformerEmbeddingFunction("BAAI/bge-base-en-v1.5")
chroma = chromadb.PersistentClient(path="./rag_store")
docs = chroma.get_or_create_collection("internal_docs", embedding_function=embed_fn)

2. Index your private data (one-time)

def index_docs(folder: str):
    import pathlib
    for path in pathlib.Path(folder).rglob("*.md"):
        text = path.read_text(encoding="utf-8")
        # Simple chunking, 800 chars, 100 overlap
        chunks = [text[i:i+800] for i in range(0, len(text), 700)]
        docs.add(
            ids=[f"{path.stem}_{i}_{uuid.uuid4().hex[:6]}" for i in range(len(chunks))],
            documents=chunks,
            metadatas=[{"source": str(path), "chunk": i} for i in range(len(chunks))],
        )

index_docs("./knowledge_base")

3. Define the retrieval tools

TOOLS = [
    {
        "name": "search_internal_docs",
        "description": (
            "Search the company's internal knowledge base. Use for questions about "
            "internal products, policies, playbooks, or anything proprietary. "
            "Returns top-k chunks with source paths."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "k": {"type": "integer", "default": 5, "maximum": 10},
            },
            "required": ["query"],
        },
    },
    {
        "name": "web_search",
        "description": (
            "Search the public web. Use for current events, public company info, "
            "or any question about information outside the internal knowledge base."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "max_results": {"type": "integer", "default": 3},
            },
            "required": ["query"],
        },
    },
    {
        "name": "read_full_document",
        "description": (
            "Read the FULL text of a specific document when a chunk isn't enough. "
            "Use after search_internal_docs returns a relevant source path and you need "
            "surrounding context (e.g., reading a whole contract or policy)."
        ),
        "input_schema": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
    {
        "name": "respond_with_uncertainty",
        "description": (
            "Use when, after attempting retrieval, you cannot find grounded evidence. "
            "Return an honest 'I don't know' to the user rather than hallucinating."
        ),
        "input_schema": {
            "type": "object",
            "properties": {"reason": {"type": "string"}},
            "required": ["reason"],
        },
    },
]

4. Tool dispatcher

def search_internal_docs(query: str, k: int = 5):
    hits = docs.query(query_texts=[query], n_results=k)
    return [
        {"text": hits["documents"][0][i],
         "source": hits["metadatas"][0][i]["source"],
         "score": 1 - hits["distances"][0][i]}
        for i in range(len(hits["documents"][0]))
    ]

def web_search(query: str, max_results: int = 3):
    return [
        {"title": r["title"], "url": r["url"], "content": r["content"][:800]}
        for r in tavily.search(query, max_results=max_results)["results"]
    ]

def read_full_document(path: str):
    try:
        with open(path, encoding="utf-8") as f:
            return {"content": f.read()[:30000]}
    except FileNotFoundError:
        return {"error": f"Not found: {path}"}

def respond_with_uncertainty(reason: str):
    return {"final": f"I don't have enough grounded information to answer. Reason: {reason}"}

DISPATCH = {
    "search_internal_docs": search_internal_docs,
    "web_search": web_search,
    "read_full_document": read_full_document,
    "respond_with_uncertainty": respond_with_uncertainty,
}

5. The agent loop

SYSTEM = """You are a research assistant with access to retrieval tools.

Rules:
1. Decide whether the question needs retrieval at all. For general knowledge or reasoning, answer directly.
2. For internal/proprietary topics → use search_internal_docs first.
3. For current events or public information → use web_search.
4. If initial results are chunks that lack context → follow up with read_full_document.
5. You may call multiple tools in parallel if they're independent.
6. If after retrieval you still don't have grounded evidence, use respond_with_uncertainty.
7. Always cite sources (paths or URLs) in your final answer."""

def agentic_rag(question: str, max_turns: int = 8):
    messages = [{"role": "user", "content": question}]

    for _ in range(max_turns):
        resp = claude.messages.create(
            model=MODEL, max_tokens=2048,
            system=SYSTEM, tools=TOOLS, messages=messages,
        )
        messages.append({"role": "assistant", "content": resp.content})

        if resp.stop_reason == "end_turn":
            return next(b.text for b in resp.content if b.type == "text")

        tool_results = []
        for block in resp.content:
            if block.type == "tool_use":
                result = DISPATCH[block.name](**block.input)
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": json.dumps(result)[:12000],
                })
        messages.append({"role": "user", "content": tool_results})

    return "Exceeded turn budget."

# Usage
print(agentic_rag("What's our company policy on remote work, and how does it compare to Google's 2026 policy?"))
# Typical trace:
# 1. Claude calls search_internal_docs("remote work policy")  → finds internal doc
# 2. Claude calls read_full_document("./kb/hr/remote-work.md") → full context
# 3. Claude calls web_search("Google 2026 remote work policy") → public info
# 4. Claude synthesizes answer citing both sources

Why this pattern beats rigid RAG pipelines

Example: user asks "What's 23% of our Q3 revenue?"

  • Rigid RAG embeds the question, retrieves irrelevant doc chunks about percentages, hallucinates a number
  • Agentic RAG: Claude sees "23% of $X" is arithmetic plus lookup, calls search_internal_docs("Q3 revenue"), gets $4.2M, calls calculate("4200000 * 0.23"), answers "$966,000 based on Q3 financials (kb/finance/q3-2026.md)"

The LLM's routing decision is the intelligence, not the retrieval algorithm.

Prompt caching for heavy agents

Tool definitions plus system prompt are repeated every turn. Cache them:

resp = claude.messages.create(
    model=MODEL, max_tokens=2048,
    system=[{"type": "text", "text": SYSTEM, "cache_control": {"type": "ephemeral"}}],
    tools=TOOLS,
    messages=messages,
)

For a 6-turn agent call, cache hit savings are roughly 80% on input tokens.

Evaluation: how to know it's working

Build a small eval set:

EVAL = [
    {"q": "What's our PTO policy?", "expect_tool": "search_internal_docs", "expect_cite": "./kb/hr/"},
    {"q": "Who is the current US president?", "expect_tool": "web_search"},
    {"q": "What's 15 * 234?", "expect_tool": None, "expect_answer_contains": "3510"},
    {"q": "What's the CEO's personal phone number?", "expect_tool": "respond_with_uncertainty"},
]

# For each, run the agent, inspect the trace (capture tool calls into a list).
# Metrics: right tool chosen, right source cited, final answer accurate.

Resume angle

"Designed an agentic RAG system where Claude treats retrieval as a tool-use decision rather than a pipeline. Implemented source routing (internal vector store vs. web), iterative context gathering (chunk to full document), and honest uncertainty handling via a dedicated 'I don't know' tool. Reduced irrelevant retrievals by 60% vs. naive RAG on internal eval."

Common questions

Frequently asked

What is the main difference between agentic RAG and traditional RAG systems?

In agentic RAG, Claude decides when to retrieve information, which source to use, and can iteratively refine results by treating retrieval as a tool rather than a fixed pipeline step. Traditional naive RAG always retrieves and stuffs results into the prompt regardless of whether retrieval is needed. Agentic RAG gives the language model routing intelligence to choose between internal docs, web search, specific document readers, or no retrieval at all based on the question type.

What tools and libraries are needed to implement agentic RAG with Claude?

The implementation requires Claude (using the Anthropic API), ChromaDB for local vector storage, Tavily for web search fallback, and sentence-transformers for free local embeddings (specifically BAAI/bge-base-en-v1.5). The stack uses pure Python without LangGraph. All components can be installed via pip and the system uses Claude Opus 4-7 as the model.

How does agentic RAG handle questions that do not require any retrieval?

Claude evaluates each question and can answer directly without calling any retrieval tools for general knowledge or reasoning tasks. For example, a math question like calculating 17% of $4.2M would trigger a calculate tool rather than searching documents. The system includes a respond_with_uncertainty tool that Claude can invoke when it cannot find grounded evidence after attempting retrieval, preventing hallucination.

How does prompt caching reduce costs in multi-turn agentic RAG conversations?

Tool definitions and the system prompt are repeated in every turn of the agent loop and can be cached by marking them with ephemeral cache control. For a 6-turn agent call, cache hit savings are roughly 80% on input tokens. This optimization is particularly valuable for heavy agents that make multiple tool calls across many conversation turns.

What does a typical agentic RAG trace look like for a hybrid question?

For a question like comparing company remote work policy to Google's 2026 policy, Claude first calls search_internal_docs to find the internal policy, then calls read_full_document to get complete context from the relevant file, and finally calls web_search to retrieve Google's public policy. Claude then synthesizes an answer citing both the internal document path and external URLs. This multi-source routing happens automatically based on Claude's assessment of what information is needed.

READY TO IMPLEMENT

Want to talk through this in your business?

The paper above is the thinking. Let's spend 30 minutes on what it would actually look like to ship in your shop, no pitch, just a real scoping conversation.

Agentic RAG with Claude: Retrieval as a Decision