# Agentic RAG with Claude: Retrieval as a Decision

- Source topic: "Agentic RAG", called out as a 2026 skill in @datasciencebrain's Engineering Roadmap
- Stack: Claude + ChromaDB + Tavily (web fallback) + pure Python (no LangGraph)
What "agentic" actually means here
| Naive RAG | Self-Healing RAG | Agentic RAG |
|---|---|---|
| Always retrieve, always stuff into prompt | Retrieve, grade, retry | Claude decides when to retrieve, from which source, and iteratively refines |
The agent has access to multiple retrieval tools (private vector DB, web search, specific document readers) and picks based on the question.
Key mental model: retrieval is a tool, not a pipeline step.
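
A toy contrast to make that concrete (stub functions only; nothing below depends on this sketch): in naive RAG, retrieval is hard-coded into the control flow, while in agentic RAG it is an option the model may or may not take.

```python
# Illustrative stubs, not the implementation below
def retrieve(q): return ["<top-k chunks>"]
def llm(prompt, tools=None): return "<answer>"

# Naive RAG: retrieval is a fixed pipeline step, run for every question
def naive_rag(question):
    context = retrieve(question)              # always retrieve
    return llm(f"{context}\n\n{question}")    # always stuff into the prompt

# Agentic RAG: retrieval is one tool among several; the model decides
# per question whether to retrieve, from which source, and when to stop
def agentic(question):
    return llm(question, tools=["search_docs", "web_search", "calculate"])
```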
## Architecture
```
                   ┌──────────────┐
user question      │              │
─────────────────► │    Claude    │
                   │              │
                   └─┬──┬──┬──┬──┬┘
                     │  │  │  │  │
   search_docs ──────┘  │  │  │  └──► no tool needed
                        │  │  │       (direct answer)
   web_search ──────────┘  │  │
                           │  │
   read_full ──────────────┘  │
   (specific doc)             │
                              │
   calculate ─────────────────┘
```
Claude looks at the question and picks:

- "What's in our Q3 earnings?" goes to `search_internal_docs`
- "Who won the 2026 F1 opener?" goes to `web_search`
- "What does that specific contract say about termination?" goes to `read_full_document`
- "What's 17% of $4.2M?" goes to `calculate`, no retrieval
## Full implementation
### 1. Setup
```bash
pip install anthropic chromadb tavily-python sentence-transformers
```
```python
import os, json, uuid

import anthropic, chromadb
from chromadb.utils import embedding_functions
from tavily import TavilyClient

claude = anthropic.Anthropic()
tavily = TavilyClient(api_key=os.environ["TAVILY_API_KEY"])
MODEL = "claude-opus-4-7"

# Local, free embeddings
embed_fn = embedding_functions.SentenceTransformerEmbeddingFunction("BAAI/bge-base-en-v1.5")
chroma = chromadb.PersistentClient(path="./rag_store")
docs = chroma.get_or_create_collection("internal_docs", embedding_function=embed_fn)
```
### 2. Index your private data (one-time)
```python
import pathlib

def index_docs(folder: str):
    for path in pathlib.Path(folder).rglob("*.md"):
        text = path.read_text(encoding="utf-8")
        # Simple chunking: 800-char windows with 100-char overlap (step 700)
        chunks = [text[i:i + 800] for i in range(0, len(text), 700)]
        docs.add(
            ids=[f"{path.stem}_{i}_{uuid.uuid4().hex[:6]}" for i in range(len(chunks))],
            documents=chunks,
            metadatas=[{"source": str(path), "chunk": i} for i in range(len(chunks))],
        )

index_docs("./knowledge_base")
```
### 3. Define the retrieval tools
```python
TOOLS = [
    {
        "name": "search_internal_docs",
        "description": (
            "Search the company's internal knowledge base. Use for questions about "
            "internal products, policies, playbooks, or anything proprietary. "
            "Returns top-k chunks with source paths."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "k": {"type": "integer", "default": 5, "maximum": 10},
            },
            "required": ["query"],
        },
    },
    {
        "name": "web_search",
        "description": (
            "Search the public web. Use for current events, public company info, "
            "or any question about information outside the internal knowledge base."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "max_results": {"type": "integer", "default": 3},
            },
            "required": ["query"],
        },
    },
    {
        "name": "read_full_document",
        "description": (
            "Read the FULL text of a specific document when a chunk isn't enough. "
            "Use after search_internal_docs returns a relevant source path and you need "
            "surrounding context (e.g., reading a whole contract or policy)."
        ),
        "input_schema": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
    {
        "name": "respond_with_uncertainty",
        "description": (
            "Use when, after attempting retrieval, you cannot find grounded evidence. "
            "Return an honest 'I don't know' to the user rather than hallucinating."
        ),
        "input_schema": {
            "type": "object",
            "properties": {"reason": {"type": "string"}},
            "required": ["reason"],
        },
    },
]
```
### 4. Tool dispatcher
```python
def search_internal_docs(query: str, k: int = 5):
    hits = docs.query(query_texts=[query], n_results=k)
    return [
        {
            "text": hits["documents"][0][i],
            "source": hits["metadatas"][0][i]["source"],
            # Chroma returns a distance; convert to a rough similarity score
            "score": 1 - hits["distances"][0][i],
        }
        for i in range(len(hits["documents"][0]))
    ]

def web_search(query: str, max_results: int = 3):
    return [
        {"title": r["title"], "url": r["url"], "content": r["content"][:800]}
        for r in tavily.search(query, max_results=max_results)["results"]
    ]

def read_full_document(path: str):
    try:
        with open(path, encoding="utf-8") as f:
            return {"content": f.read()[:30000]}  # cap to protect the context window
    except FileNotFoundError:
        return {"error": f"Not found: {path}"}

def respond_with_uncertainty(reason: str):
    return {"final": f"I don't have enough grounded information to answer. Reason: {reason}"}

DISPATCH = {
    "search_internal_docs": search_internal_docs,
    "web_search": web_search,
    "read_full_document": read_full_document,
    "respond_with_uncertainty": respond_with_uncertainty,
}
```
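
The architecture diagram and the worked example later in the post also route arithmetic to a `calculate` tool, which the listings above never define. A minimal sketch of one plausible definition, using Python's `ast` module to evaluate basic arithmetic safely (the schema mirrors the tools above; the evaluator is an assumption, not part of the original code):

```python
import ast, operator

TOOLS.append({
    "name": "calculate",
    "description": (
        "Evaluate a basic arithmetic expression, e.g. '4200000 * 0.23'. "
        "Use for percentages, sums, and other math; never guess numbers."
    ),
    "input_schema": {
        "type": "object",
        "properties": {"expression": {"type": "string"}},
        "required": ["expression"],
    },
})

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul,
        ast.Div: operator.truediv, ast.Pow: operator.pow, ast.USub: operator.neg}

def _eval(node):
    # Walk the parsed expression, allowing only numbers and arithmetic ops
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp):
        return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
    if isinstance(node, ast.UnaryOp):
        return _OPS[type(node.op)](_eval(node.operand))
    raise ValueError("unsupported expression")

def calculate(expression: str):
    try:
        return {"result": _eval(ast.parse(expression, mode="eval").body)}
    except Exception as e:
        return {"error": str(e)}

DISPATCH["calculate"] = calculate
```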
### 5. The agent loop
SYSTEM = """You are a research assistant with access to retrieval tools.
Rules:
1. Decide whether the question needs retrieval at all. For general knowledge or reasoning, answer directly.
2. For internal/proprietary topics → use search_internal_docs first.
3. For current events or public information → use web_search.
4. If initial results are chunks that lack context → follow up with read_full_document.
5. You may call multiple tools in parallel if they're independent.
6. If after retrieval you still don't have grounded evidence, use respond_with_uncertainty.
7. Always cite sources (paths or URLs) in your final answer."""
def agentic_rag(question: str, max_turns: int = 8):
messages = [{"role": "user", "content": question}]
for _ in range(max_turns):
resp = claude.messages.create(
model=MODEL, max_tokens=2048,
system=SYSTEM, tools=TOOLS, messages=messages,
)
messages.append({"role": "assistant", "content": resp.content})
if resp.stop_reason == "end_turn":
return next(b.text for b in resp.content if b.type == "text")
tool_results = []
for block in resp.content:
if block.type == "tool_use":
result = DISPATCH[block.name](**block.input)
tool_results.append({
"type": "tool_result",
"tool_use_id": block.id,
"content": json.dumps(result)[:12000],
})
messages.append({"role": "user", "content": tool_results})
return "Exceeded turn budget."
# Usage
print(agentic_rag("What's our company policy on remote work, and how does it compare to Google's 2026 policy?"))
# Typical trace:
# 1. Claude calls search_internal_docs("remote work policy") → finds internal doc
# 2. Claude calls read_full_document("./kb/hr/remote-work.md") → full context
# 3. Claude calls web_search("Google 2026 remote work policy") → public info
# 4. Claude synthesizes answer citing both sources
## Why this pattern beats rigid RAG pipelines
Example: a user asks "What's 23% of our Q3 revenue?"

- Rigid RAG embeds the question, retrieves irrelevant doc chunks about percentages, and hallucinates a number.
- Agentic RAG: Claude sees that "23% of $X" is arithmetic plus a lookup, calls `search_internal_docs("Q3 revenue")`, gets $4.2M, calls `calculate("4200000 * 0.23")`, and answers "$966,000, based on Q3 financials (kb/finance/q3-2026.md)".
The LLM's routing decision is the intelligence, not the retrieval algorithm.
## Prompt caching for heavy agents
Tool definitions plus system prompt are repeated every turn. Cache them:
```python
resp = claude.messages.create(
    model=MODEL, max_tokens=2048,
    # A cache breakpoint covers the whole prompt prefix up to it, and tools
    # precede system in the prompt, so marking the system block caches both.
    system=[{"type": "text", "text": SYSTEM, "cache_control": {"type": "ephemeral"}}],
    tools=TOOLS,
    messages=messages,
)
```
For a 6-turn agent call, cache hit savings are roughly 80% on input tokens.
## Evaluation: how to know it's working
Build a small eval set:
```python
EVAL = [
    {"q": "What's our PTO policy?", "expect_tool": "search_internal_docs", "expect_cite": "./kb/hr/"},
    {"q": "Who is the current US president?", "expect_tool": "web_search"},
    {"q": "What's 15 * 234?", "expect_tool": None, "expect_answer_contains": "3510"},
    {"q": "What's the CEO's personal phone number?", "expect_tool": "respond_with_uncertainty"},
]

# For each case, run the agent and inspect the trace (capture tool calls into a list).
# Metrics: right tool chosen, right source cited, final answer accurate.
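```

One lightweight way to capture that trace without restructuring the agent loop: wrap every dispatcher entry so each call records its tool name before executing. A sketch, where `TRACE`, `_traced`, and `run_eval` are illustrative names rather than part of the implementation above:

```python
TRACE = []

def _traced(name, fn):
    def wrapper(**kwargs):
        TRACE.append(name)  # record which tool the agent chose
        return fn(**kwargs)
    return wrapper

# Wrap once; agentic_rag looks DISPATCH up at call time, so it sees this
DISPATCH = {name: _traced(name, fn) for name, fn in DISPATCH.items()}

def run_eval(eval_set):
    for case in eval_set:
        TRACE.clear()
        answer = agentic_rag(case["q"])
        first = TRACE[0] if TRACE else None
        ok = first == case.get("expect_tool")
        if "expect_cite" in case:
            ok = ok and case["expect_cite"] in answer
        if "expect_answer_contains" in case:
            ok = ok and case["expect_answer_contains"] in answer
        print(f"{'PASS' if ok else 'FAIL'}  {case['q']!r}  first tool: {first}")

run_eval(EVAL)
```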
## Resume angle
"Designed an agentic RAG system where Claude treats retrieval as a tool-use decision rather than a pipeline. Implemented source routing (internal vector store vs. web), iterative context gathering (chunk to full document), and honest uncertainty handling via a dedicated 'I don't know' tool. Reduced irrelevant retrievals by 60% vs. naive RAG on internal eval."