Agentic RAG with Claude: Retrieval as a Decision

Source topic: "Agentic RAG", called out as a 2026 skill in @datasciencebrain's Engineering Roadmap
Stack: Claude + ChromaDB + Tavily (web fallback) + pure Python (no LangGraph)

What "agentic" actually means here

Naive RAG: always retrieve, always stuff into prompt
Self-Healing RAG: retrieve, grade, retry
Agentic RAG: Claude decides when to retrieve, from which source, and iteratively refines

The agent has access to multiple retrieval tools (private vector DB, web search, specific document readers) and picks based on the question.

Key mental model: retrieval is a tool, not a pipeline step.

Architecture

                   ┌──────────────┐
     user question │              │
    ─────────────► │    Claude    │
                   │              │
                   └──┬──┬──┬──┬──┘
                      │  │  │  │
          search_docs │  │  │  │ no tool needed
                      │  │  │  └─► direct answer
          web_search ─┘  │  │
                         │  │
           read_full ────┘  │
           (specific doc)   │
                            │
           calculate ───────┘

Claude looks at the question and picks:

"What's in our Q3 earnings?" goes to search_internal_docs
"Who won the 2026 F1 opener?" goes to web_search
"What does that specific contract say about termination?" goes to read_full_document
"What's 17% of $4.2M?" goes to calculate, no retrieval

Full implementation

1. Setup

pip install anthropic chromadb tavily-python sentence-transformers

import os, json, uuid
import anthropic, chromadb
from chromadb.utils import embedding_functions
from tavily import TavilyClient

claude = anthropic.Anthropic()
tavily = TavilyClient(api_key=os.environ["TAVILY_API_KEY"])
MODEL = "claude-opus-4-7"

# Local, free embeddings
embed_fn = embedding_functions.SentenceTransformerEmbeddingFunction("BAAI/bge-base-en-v1.5")
chroma = chromadb.PersistentClient(path="./rag_store")
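# Cosine space, so that 1 - distance (returned as "score" by the search tool) is a similarity.
# Chroma's default space is squared L2, where 1 - distance has no such meaning.
docs = chroma.get_or_create_collection(
    "internal_docs", embedding_function=embed_fn,
    metadata={"hnsw:space": "cosine"},
)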

2. Index your private data (one-time)

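import pathlib

def index_docs(folder: str):
    """Chunk every Markdown file under `folder` and add it to the collection."""
    for path in pathlib.Path(folder).rglob("*.md"):
        text = path.read_text(encoding="utf-8")
        # Simple chunking: 800-char windows with a 700-char stride (100-char overlap)
        chunks = [text[i:i+800] for i in range(0, len(text), 700)]
        if not chunks:
            continue  # empty file: nothing to add, and Chroma rejects an empty batch
        docs.add(
            # The random suffix keeps IDs unique, so re-running this adds duplicates
            ids=[f"{path.stem}_{i}_{uuid.uuid4().hex[:6]}" for i in range(len(chunks))],
            documents=chunks,
            metadatas=[{"source": str(path), "chunk": i} for i in range(len(chunks))],
        )

# Run once. The folder name matches the kb/... paths used in the examples below.
index_docs("./kb")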
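The slicing used for chunking above can be pulled into a small helper, which makes the overlap easy to check. This is only a sketch: `chunk_text`, `size`, and `overlap` are hypothetical names that do not appear in the original code.

```python
def chunk_text(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into fixed-size character windows that overlap by `overlap` chars."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    stride = size - overlap  # 700 with the defaults, the same stride as the loop above
    return [text[i:i + size] for i in range(0, len(text), stride)]


# 2000 characters of sample text (hypothetical data)
sample = "".join(str(i % 10) for i in range(2000))
chunks = chunk_text(sample)

# The last 100 characters of one chunk are the first 100 of the next
assert chunks[0][-100:] == chunks[1][:100]
```

Character windows can split a sentence or a table row in half; the overlap reduces, but does not remove, the chance that the relevant passage is cut at a chunk boundary.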

3. Define the retrieval tools

TOOLS = [
    {
        "name": "search_internal_docs",
        "description": (
            "Search the company's internal knowledge base. Use for questions about "
            "internal products, policies, playbooks, or anything proprietary. "
            "Returns top-k chunks with source paths."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "k": {"type": "integer", "default": 5, "maximum": 10},
            },
            "required": ["query"],
        },
    },
    {
        "name": "web_search",
        "description": (
            "Search the public web. Use for current events, public company info, "
            "or any question about information outside the internal knowledge base."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "max_results": {"type": "integer", "default": 3},
            },
            "required": ["query"],
        },
    },
    {
        "name": "read_full_document",
        "description": (
            "Read the FULL text of a specific document when a chunk isn't enough. "
            "Use after search_internal_docs returns a relevant source path and you need "
            "surrounding context (e.g., reading a whole contract or policy)."
        ),
        "input_schema": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
    {
        "name": "respond_with_uncertainty",
        "description": (
            "Use when, after attempting retrieval, you cannot find grounded evidence. "
            "Return an honest 'I don't know' to the user rather than hallucinating."
        ),
        "input_schema": {
            "type": "object",
            "properties": {"reason": {"type": "string"}},
            "required": ["reason"],
        },
    },
]

4. Tool dispatcher

def search_internal_docs(query: str, k: int = 5):
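    # Clamp k to the schema's stated maximum; the model's input is not guaranteed to respect it
    hits = docs.query(query_texts=[query], n_results=max(1, min(k, 10)))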
    return [
        {"text": hits["documents"][0][i],
         "source": hits["metadatas"][0][i]["source"],
         "score": 1 - hits["distances"][0][i]}
        for i in range(len(hits["documents"][0]))
    ]

def web_search(query: str, max_results: int = 3):
    return [
        {"title": r["title"], "url": r["url"], "content": r["content"][:800]}
        for r in tavily.search(query, max_results=max_results)["results"]
    ]

def read_full_document(path: str):
    try:
        with open(path, encoding="utf-8") as f:
            return {"content": f.read()[:30000]}
    except FileNotFoundError:
        return {"error": f"Not found: {path}"}

def respond_with_uncertainty(reason: str):
    return {"final": f"I don't have enough grounded information to answer. Reason: {reason}"}

DISPATCH = {
    "search_internal_docs": search_internal_docs,
    "web_search": web_search,
    "read_full_document": read_full_document,
    "respond_with_uncertainty": respond_with_uncertainty,
}
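One caveat about read_full_document as written: it opens whatever path the model supplies, so a confused or prompt-injected tool call could read files outside the knowledge base. A minimal sketch of a guarded variant follows. `read_full_document_safe` and `KB_ROOT` are hypothetical names, and the root folder is an assumption to be pointed at whatever folder was indexed.

```python
from pathlib import Path

KB_ROOT = Path("./kb")  # assumed knowledge-base folder; adjust to your layout


def read_full_document_safe(path: str, root: Path = KB_ROOT, limit: int = 30000) -> dict:
    """Read a document only if it resolves to a location inside `root`."""
    root_resolved = root.resolve()
    target = Path(path).resolve()
    # resolve() collapses ".." segments, so traversal attempts end up outside root
    if not target.is_relative_to(root_resolved):  # Python 3.9+
        return {"error": f"Path outside knowledge base: {path}"}
    try:
        return {"content": target.read_text(encoding="utf-8")[:limit]}
    except (FileNotFoundError, IsADirectoryError):
        return {"error": f"Not found: {path}"}
```

To use it, map "read_full_document" to this function in DISPATCH. Returning an error dict instead of raising keeps the behaviour consistent with the original tool, so Claude sees the refusal and can try something else.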

5. The agent loop

SYSTEM = """You are a research assistant with access to retrieval tools.

Rules:
1. Decide whether the question needs retrieval at all. For general knowledge or reasoning, answer directly.
2. For internal/proprietary topics → use search_internal_docs first.
3. For current events or public information → use web_search.
4. If initial results are chunks that lack context → follow up with read_full_document.
5. You may call multiple tools in parallel if they're independent.
6. If after retrieval you still don't have grounded evidence, use respond_with_uncertainty.
7. Always cite sources (paths or URLs) in your final answer."""

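def agentic_rag(question: str, max_turns: int = 8):
    messages = [{"role": "user", "content": question}]

    for _ in range(max_turns):
        resp = claude.messages.create(
            model=MODEL, max_tokens=2048,
            system=SYSTEM, tools=TOOLS, messages=messages,
        )
        messages.append({"role": "assistant", "content": resp.content})

        # Anything other than a tool request (end_turn, max_tokens, ...) ends the loop.
        # Continuing without tool results would send an empty user message.
        if resp.stop_reason != "tool_use":
            # A reply can contain several text blocks, or none
            return "".join(b.text for b in resp.content if b.type == "text")

        tool_results = []
        for block in resp.content:
            if block.type != "tool_use":
                continue
            try:
                result = DISPATCH[block.name](**block.input)
                is_error = False
            except Exception as exc:  # unknown tool, bad arguments, network failure
                # Report the failure so Claude can retry or change approach
                result, is_error = {"error": f"{type(exc).__name__}: {exc}"}, True
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": json.dumps(result)[:12000],  # cap the context cost per call
                "is_error": is_error,
            })
        messages.append({"role": "user", "content": tool_results})

    return "Exceeded turn budget."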

# Usage
print(agentic_rag("What's our company policy on remote work, and how does it compare to Google's 2026 policy?"))
# Typical trace:
# 1. Claude calls search_internal_docs("remote work policy")  → finds internal doc
# 2. Claude calls read_full_document("./kb/hr/remote-work.md") → full context
# 3. Claude calls web_search("Google 2026 remote work policy") → public info
# 4. Claude synthesizes answer citing both sources
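A trace like the one above has to be recorded somewhere. One way to do it, sketched here under assumptions, is to wrap each entry of the dispatch table so that every call is logged before it runs. `make_traced_dispatch` is a hypothetical helper, and the two stand-in tools exist only so the sketch runs without Chroma or Tavily.

```python
from typing import Any, Callable


def make_traced_dispatch(dispatch: dict[str, Callable[..., Any]], trace: list[dict]):
    """Return a copy of `dispatch` whose tools append each call to `trace` before running."""
    def wrap(name: str, fn: Callable[..., Any]) -> Callable[..., Any]:
        def traced(**kwargs: Any) -> Any:
            trace.append({"tool": name, "input": kwargs})
            return fn(**kwargs)
        return traced
    return {name: wrap(name, fn) for name, fn in dispatch.items()}


# Stand-in tools (hypothetical) with the same signatures as the real ones
fake_dispatch = {
    "search_internal_docs": lambda query, k=5: [{"text": "stub", "source": "kb/hr/pto.md"}],
    "web_search": lambda query, max_results=3: [],
}

trace: list[dict] = []
traced = make_traced_dispatch(fake_dispatch, trace)
traced["search_internal_docs"](query="remote work policy")
traced["web_search"](query="Google 2026 remote work policy")
print([t["tool"] for t in trace])  # ['search_internal_docs', 'web_search']
```

In the agent loop, the wrapped table would be built from DISPATCH once per question, and the loop would call the wrapped tools instead of the originals.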

Why this pattern beats rigid RAG pipelines

Example: user asks "What's 23% of our Q3 revenue?"

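Rigid RAG: embeds the question, retrieves irrelevant chunks about percentages, and hallucinates a number.
Agentic RAG: Claude sees that "23% of X" is a lookup plus arithmetic. It calls search_internal_docs("Q3 revenue") and gets $4.2M, calls calculate("4200000 * 0.23"), and answers "$966,000 based on Q3 financials (kb/finance/q3-2026.md)". The calculate step assumes the calculate tool from the architecture diagram has been added to TOOLS and DISPATCH, which the implementation above does not do.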

The LLM's routing decision is the intelligence, not the retrieval algorithm.
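The example relies on a calculate tool that appears in the architecture diagram but not in the TOOLS list. A minimal sketch of one follows, assuming a whitelist-based evaluator built on Python's ast module rather than eval(). The names `calculate`, `CALCULATE_TOOL`, and the `expression` parameter are illustrative and not part of the original implementation.

```python
import ast
import operator

# Whitelisted operators; anything else in the expression is rejected.
# Exponentiation is left out on purpose, since it makes huge computations easy to request.
_BIN_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.Mod: operator.mod,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def _eval(node: ast.AST):
    """Recursively evaluate numeric literals and whitelisted operators only."""
    if isinstance(node, ast.Constant):
        if isinstance(node.value, (int, float)) and not isinstance(node.value, bool):
            return node.value
    elif isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
        return _BIN_OPS[type(node.op)](_eval(node.left), _eval(node.right))
    elif isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_eval(node.operand))
    raise ValueError("unsupported expression")


def calculate(expression: str) -> dict:
    """Evaluate a plain arithmetic expression without calling eval()."""
    try:
        tree = ast.parse(expression, mode="eval")
        return {"result": _eval(tree.body)}
    except (ValueError, SyntaxError, ZeroDivisionError, OverflowError) as exc:
        return {"error": f"Could not evaluate {expression!r}: {exc}"}


# Tool definition in the same shape as the entries in TOOLS
CALCULATE_TOOL = {
    "name": "calculate",
    "description": "Evaluate an arithmetic expression such as '4200000 * 0.23'.",
    "input_schema": {
        "type": "object",
        "properties": {"expression": {"type": "string"}},
        "required": ["expression"],
    },
}
```

To wire it in, append CALCULATE_TOOL to TOOLS and add "calculate": calculate to DISPATCH. Names, function calls, and attribute access all fall through to the "unsupported expression" branch, so the tool cannot be used to run arbitrary code.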

Prompt caching for heavy agents

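Tool definitions and the system prompt are resent on every turn. A single cache breakpoint on the system block covers both, because tool definitions come before the system prompt in the cached prefix: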

resp = claude.messages.create(
    model=MODEL, max_tokens=2048,
    system=[{"type": "text", "text": SYSTEM, "cache_control": {"type": "ephemeral"}}],
    tools=TOOLS,
    messages=messages,
)

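For a 6-turn agent call, cache hits cut the input-token cost of the cached prefix (tool definitions plus system prompt) by roughly 70 to 80%. The message history, which grows every turn, is still billed at the full rate unless it gets its own cache breakpoint. The API also only caches prefixes above a model-dependent minimum length, so a very short prompt may not be cached at all.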
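The size of the saving can be checked with a little arithmetic. The sketch below assumes a cache write is billed at 1.25x the normal input price and a cache read at 0.10x, that every turn arrives before the cache entry expires, and that the prefix is 1,500 tokens long. All three are assumptions; check them against the current pricing page and your own token counts. `prefix_cost` is a hypothetical helper.

```python
def prefix_cost(turns: int, prefix_tokens: int,
                write_mult: float = 1.25, read_mult: float = 0.10) -> dict:
    """Compare the input cost of a repeated prefix with and without caching.

    Costs are in units of "one uncached input token". The multipliers are
    assumptions, not values taken from the original text.
    """
    uncached = turns * prefix_tokens
    # The first turn writes the cache; every later turn reads from it
    cached = prefix_tokens * (write_mult + (turns - 1) * read_mult)
    return {"uncached": uncached, "cached": cached, "saving": 1 - cached / uncached}


r = prefix_cost(turns=6, prefix_tokens=1500)
print(f"{r['saving']:.0%}")  # 71%
```

Under these assumptions the saving grows with the number of turns and approaches 90% for long sessions, but it only ever applies to the cached prefix, not to the conversation that follows it.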

Evaluation: how to know it's working

Build a small eval set:

EVAL = [
    # pathlib drops the leading "./" from stored source paths, so match on "kb/hr/"
    {"q": "What's our PTO policy?", "expect_tool": "search_internal_docs", "expect_cite": "kb/hr/"},
    {"q": "Who is the current US president?", "expect_tool": "web_search"},
    {"q": "What's 15 * 234?", "expect_tool": None, "expect_answer_contains": "3510"},
    {"q": "What's the CEO's personal phone number?", "expect_tool": "respond_with_uncertainty"},
]

# For each, run the agent, inspect the trace (capture tool calls into a list).
# Metrics: right tool chosen, right source cited, final answer accurate.
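The three metrics can be scored mechanically once the tool calls for a run have been recorded. The following is a sketch rather than a full harness: `score_case` is a hypothetical helper, it takes the list of tool names called and the final answer as plain arguments, and it uses simple substring checks, which is a deliberately crude notion of "accurate".

```python
def score_case(case: dict, tools_called: list[str], answer: str) -> dict:
    """Check one eval case against the recorded tool calls and the final answer."""
    expected = case.get("expect_tool")
    if expected is None:
        tool_ok = not tools_called          # no tool should have been used
    else:
        tool_ok = expected in tools_called  # expected tool appears somewhere in the trace
    cite = case.get("expect_cite")
    contains = case.get("expect_answer_contains")
    return {
        "tool_ok": tool_ok,
        "cite_ok": cite is None or cite in answer,
        # Strip thousands separators so "3,510" still matches "3510"
        "answer_ok": contains is None or contains in answer.replace(",", ""),
    }


# Hypothetical run: the agent answered directly without calling any tool
case = {"q": "What's 15 * 234?", "expect_tool": None, "expect_answer_contains": "3510"}
print(score_case(case, tools_called=[], answer="15 * 234 = 3,510"))
# {'tool_ok': True, 'cite_ok': True, 'answer_ok': True}
```

Averaging each field over the eval set gives one number per metric, which is enough to spot regressions when the system prompt or tool descriptions change.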

Resume angle

"Designed an agentic RAG system where Claude treats retrieval as a tool-use decision rather than a pipeline. Implemented source routing (internal vector store vs. web), iterative context gathering (chunk to full document), and honest uncertainty handling via a dedicated 'I don't know' tool. Reduced irrelevant retrievals by 60% vs. naive RAG on internal eval."