Recursive Language Models (RLMs) solve a critical problem in multi-agent systems: context window bloat. Traditional agentic architectures pass results by value, forcing every intermediate output back into the LLM's context window. RLMs use a scaffold that passes results by reference as Python variables, letting agents access data selectively rather than loading everything into every request. This approach achieves roughly 90% KV cache hit rates and enables unbounded outputs limited only by the memory available to the Python process, not by token limits.
What Are Recursive Language Models and How Do They Work
RLMs aren't a new model architecture or training method. They're a scaffolding pattern that changes how you structure agent workflows to manage context more efficiently.
In traditional agentic systems built with frameworks like LangChain or AutoGPT, when Agent A completes a task and passes results to Agent B, the entire output gets serialized into Agent B's prompt. If Agent B then passes to Agent C, you're now carrying Agent A's output plus Agent B's output in Agent C's context. This compounds rapidly in nested workflows.
RLMs flip this model. Instead of passing full outputs as text, the scaffold stores results in Python variables. Each agent receives a reference to these variables and can choose to read them only when needed. The LLM sees something like "previous_result available in context.agent_a_output" rather than the full 2,000-token response from Agent A.
The key architectural components include a FINAL() function that ships results directly to the calling code without forcing the LLM to regenerate output token by token, and context variables that agents explicitly read rather than having data force-fed into every prompt.
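As a rough illustration of what "a reference instead of the full output" can look like in a prompt, the sketch below renders one short line per stored variable. The helper name `describe_references` and the exact wording of each line are assumptions for illustration, not part of any specific RLM implementation.

```python
def describe_references(variables: dict) -> str:
    """Render one short line per stored variable instead of inlining its value."""
    lines = []
    for key, value in variables.items():
        # Report size so the model can judge whether a read is worth it.
        size = len(value) if hasattr(value, "__len__") else 1
        lines.append(
            f"{key} available in context.{key} ({type(value).__name__}, len={size})"
        )
    return "\n".join(lines)


variables = {"agent_a_output": "x" * 8000}
stub = describe_references(variables)
print(stub)
print(len(stub), "characters in the prompt instead of", len(variables["agent_a_output"]))
```

The prompt grows by the length of the stub, not the length of the stored value, which is the whole point of passing by reference.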
Recursive Language Models vs Traditional LLM Agents
Traditional agent patterns like ReAct or CodeAct pass information by value. When your agent uses a tool or calls a subagent, the complete response gets appended to the conversation history. A 500-token tool output becomes 500 tokens in your next LLM call, whether you need all that information or not.
Here's what that looks like in a typical framework:
# Traditional approach - passing by value
agent_a_result = agent_a.run("Analyze this dataset")
# agent_a_result is 2000 tokens
agent_b_prompt = f"""
Previous analysis: {agent_a_result}
Now perform the next step...
"""
agent_b_result = agent_b.run(agent_b_prompt)
# You just used 2000+ tokens even if agent_b only needed a summary
RLMs use passing by reference instead:
# RLM approach - passing by reference
context.agent_a_result = agent_a.run("Analyze this dataset")
# Result stored in Python variable
agent_b_prompt = """
Previous analysis available in context.agent_a_result
You can read it if needed using READ(context.agent_a_result)
Now perform the next step...
"""
agent_b_result = agent_b.run(agent_b_prompt, context)
# Agent B only loads data if it explicitly calls READ()
In production systems handling 50+ agent interactions per workflow, this difference reduces token consumption by approximately 60-70% compared to traditional architectures. That translates directly to lower API costs and faster response times.
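A back-of-the-envelope sketch shows where savings of this size can come from. Every number here is an illustrative assumption rather than a measurement: a five-agent chain, 2,000-token outputs, a 20-token reference stub, and agents that end up reading about 30% of each earlier output.

```python
def tokens_by_value(num_agents: int, output_tokens: int) -> int:
    """Total prompt tokens when each agent receives every earlier output in full."""
    # Agent k (0-indexed) carries k earlier outputs in its prompt.
    return sum(k * output_tokens for k in range(num_agents))


def tokens_by_reference(
    num_agents: int, output_tokens: int, stub_tokens: int, read_percent: int
) -> int:
    """Total prompt tokens when each earlier output is a stub plus a partial read."""
    per_output = stub_tokens + output_tokens * read_percent // 100
    return sum(k * per_output for k in range(num_agents))


by_value = tokens_by_value(num_agents=5, output_tokens=2000)
by_reference = tokens_by_reference(
    num_agents=5, output_tokens=2000, stub_tokens=20, read_percent=30
)
saved_percent = 100 * (by_value - by_reference) // by_value
print(by_value, by_reference, f"{saved_percent}% saved")
```

Under these assumptions the chain drops from 20,000 to 6,200 prompt tokens. Your own ratio depends almost entirely on how much stored data agents actually read.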
How to Prevent Context Window Overflow in AI Agents
Context window overflow happens when your accumulated conversation history, tool outputs, and system prompts exceed the model's context limit. GPT-4 Turbo supports 128k tokens, but hitting that ceiling mid-workflow breaks your agent.
RLMs prevent overflow through selective context loading. Instead of automatically including all previous outputs, you design your scaffold to make context access explicit. The LLM must actively choose to read stored data, which means it only loads what it determines is relevant.
Implementing Selective Context Loading
Start by creating a context object that holds all intermediate results:
class AgentContext:
    def __init__(self):
        self.variables = {}

    def store(self, key, value):
        self.variables[key] = value

    def read(self, key):
        return self.variables.get(key)

    def list_available(self):
        return list(self.variables.keys())
Your agent system prompt should explain the context mechanism:
system_prompt = """
You have access to a context object with stored results.
Available functions:
- LIST_CONTEXT() - shows available context keys
- READ(key) - loads a specific context variable
- FINAL(result) - returns your final output
Only read context when you need it. Don't load everything.
"""
This pattern works particularly well when combined with properly structured agentic workflows that define clear handoff points between agents.
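The scaffold also needs some way to turn the model's function calls into actions. Here is a minimal sketch, assuming the model emits exactly one call per turn as plain text such as `READ(analysis)`. That wire format, the `dispatch` function, and the `(kind, payload)` return shape are all hypothetical; a real system might use structured tool calls instead.

```python
import re

# Hypothetical wire format: the model emits one call per turn, e.g. READ(analysis).
CALL_PATTERN = re.compile(r"^(LIST_CONTEXT|READ|FINAL)\((.*)\)$")


def dispatch(model_output: str, variables: dict) -> tuple[str, object]:
    """Map one model-emitted call to an action; returns (kind, payload)."""
    match = CALL_PATTERN.match(model_output.strip())
    if match is None:
        return ("text", model_output)  # ordinary text, no function call
    name = match.group(1)
    arg = match.group(2).strip().strip("'\"")  # tolerate quoted keys
    if name == "LIST_CONTEXT":
        return ("observation", sorted(variables))
    if name == "READ":
        # Only at this point does the stored value enter the next prompt.
        return ("observation", variables.get(arg, f"unknown key: {arg}"))
    return ("final", arg)  # FINAL: hand the result back to the caller


store = {"analysis": "full analysis text", "summary": "short summary"}
print(dispatch("LIST_CONTEXT()", store))
print(dispatch("READ(summary)", store))
print(dispatch("FINAL(done)", store))
```

Returning an error string for unknown keys, rather than raising, gives the model a chance to correct itself on the next turn.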
Using the FINAL() Function Pattern
Traditional agents generate outputs token by token through the LLM. If Agent A produces a 5,000-word analysis, the LLM must regenerate all 5,000 words even if they're already computed and stored in memory.
The FINAL() function short-circuits this waste. When an agent determines it's completed its task, it calls FINAL() with a reference to the stored result:
# Instead of having the LLM regenerate the output as text:
# agent_output = "Here is my complete 5000 word analysis..."
# store the result once and return a reference to it with FINAL
def run_analysis_agent(context):
    context.store("analysis", perform_analysis())
    return FINAL("analysis")
Your scaffold intercepts FINAL() calls and returns the referenced data directly, bypassing token generation entirely. In systems with multiple nested agents, this reduces total token usage by roughly 40% compared to full regeneration approaches.
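One way the interception could work is for FINAL() to return a small marker object that the scaffold swaps for the stored data. The `FinalRef` class and `resolve` function below are hypothetical names for this sketch, and the plain dict stands in for the context store.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FinalRef:
    """Marker returned by FINAL(): names a stored result instead of copying it."""
    key: str


def FINAL(key: str) -> FinalRef:
    return FinalRef(key)


def resolve(agent_return, variables: dict):
    """Scaffold side: swap a FinalRef for the stored object, pass anything else through."""
    if isinstance(agent_return, FinalRef):
        return variables[agent_return.key]  # no tokens generated for this payload
    return agent_return


variables = {"analysis": "word " * 5000}
result = resolve(FINAL("analysis"), variables)
print(result is variables["analysis"])
```

The caller gets back the very same object that was stored, so a 5,000-word result costs the model only the few tokens needed to name its key.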
How to Improve KV Cache Hit Rate in Language Models
The key-value (KV) cache stores computed attention keys and values from previous tokens, letting the model reuse calculations instead of recomputing from scratch. Higher cache hit rates mean faster inference and lower compute costs.
RLMs achieve approximately 90% KV cache hit rates because system prompts and scaffolding instructions remain constant across agent calls. When you structure your architecture so that the first 1,000 tokens of every agent prompt are identical, the model reuses those cached computations.
Here's how to structure prompts for maximum cache efficiency:
# Bad - variable system prompt reduces cache hits
def create_prompt(agent_name, task, previous_results):
    return f"""
You are {agent_name}. Previous work: {previous_results}
Task: {task}
"""

# Good - constant prefix maximizes cache hits
CONSTANT_SYSTEM = """
You are an AI agent in a multi-agent system.
Available functions: READ(), WRITE(), FINAL()
Context variables listed below.
"""

def create_prompt(task, context):
    return f"""
{CONSTANT_SYSTEM}
Available context: {context.list_available()}
Current task: {task}
"""
The constant system prompt gets cached after the first call. Subsequent agents reuse those KV pairs, reducing inference latency by 30-50ms per request in typical deployments.
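Prefix caching can only reuse work up to the first point where two prompts differ, so a quick sanity check is to measure how much of a prompt is shared with the previous one. The sketch below compares characters as a rough proxy; real caches operate on tokens, often in fixed-size blocks, so treat the numbers as relative rather than exact. The sample prompts are made up for illustration.

```python
def shared_prefix_length(prompt_a: str, prompt_b: str) -> int:
    """Number of leading characters two prompts have in common."""
    length = 0
    for char_a, char_b in zip(prompt_a, prompt_b):
        if char_a != char_b:
            break
        length += 1
    return length


CONSTANT_SYSTEM = "You are an AI agent in a multi-agent system.\n"

# Variable content first: the prompts diverge almost immediately.
variable_first = ["You are analyzer. Task: analyze", "You are summarizer. Task: summarize"]
# Constant content first: the whole system block is shared.
constant_first = [CONSTANT_SYSTEM + "Task: analyze", CONSTANT_SYSTEM + "Task: summarize"]

print(shared_prefix_length(*variable_first))
print(shared_prefix_length(*constant_first))
```

Running a check like this over consecutive prompts in a workflow makes it easy to spot a timestamp or agent name that has crept into the top of the prompt.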
Benchmarking Cache Performance
You can measure cache efficiency by tracking time-to-first-token (TTFT) across agent calls. With effective RLM scaffolding, your TTFT should drop significantly after the first agent in a workflow:
- First agent call: 400-600ms TTFT (cold cache)
- Subsequent calls with matching system prompt: 150-250ms TTFT (warm cache)
- Calls with variable prompts: 350-500ms TTFT (partial cache miss)
Monitor these metrics in your production systems to verify your scaffold is actually achieving cache benefits. If you're not seeing TTFT improvements, your prompts likely vary too much between calls.
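A minimal way to collect TTFT is to time the arrival of the first chunk from a streaming response. The sketch below assumes your client exposes the response as an iterable of text chunks; `fake_stream` is a stand-in for that client, not a real API, and the 50 ms delay is arbitrary.

```python
import time
from typing import Iterable


def measure_ttft(stream: Iterable[str]) -> tuple[float, str]:
    """Consume a token stream; return (seconds until first token, full text)."""
    start = time.perf_counter()
    ttft = None
    pieces = []
    for token in stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # first chunk has arrived
        pieces.append(token)
    if ttft is None:
        raise ValueError("stream produced no tokens")
    return ttft, "".join(pieces)


def fake_stream(delay_s: float):
    """Stand-in for a streaming LLM client (hypothetical)."""
    time.sleep(delay_s)  # simulated prefill time before the first token
    yield "Hello"
    yield ", world"


ttft, text = measure_ttft(fake_stream(0.05))
print(f"TTFT: {ttft * 1000:.0f} ms, text: {text!r}")
```

Logging this value alongside the agent's position in the workflow lets you compare first calls against later ones directly.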
Agentic AI Architecture Best Practices for Developers
Building production-grade agentic systems requires more than just chaining LLM calls. These patterns emerge from real deployments handling thousands of agent interactions daily.
Design for Bounded Token Usage
Even with RLM scaffolding, you need guardrails. Set explicit limits on how much context any single agent can load:
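class ContextBudgetExceeded(Exception):
    """Raised when a read would push an agent past its token budget."""

class TokenBudgetContext(AgentContext):
    def __init__(self, max_tokens=2000):
        super().__init__()
        self.max_tokens = max_tokens  # total budget across all reads
        self.tokens_used = 0

    def read(self, key):
        value = self.variables.get(key)
        if value is None:
            return None  # unknown key: nothing to charge
        # Rough estimate: about 1.3 tokens per whitespace-separated word
        token_count = int(len(str(value).split()) * 1.3)
        if self.tokens_used + token_count > self.max_tokens:
            raise ContextBudgetExceeded(
                f"Reading {key} would exceed token budget"
            )
        self.tokens_used += token_count
        return value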
This prevents runaway context loading when an agent tries to read everything at once. You'll catch budget violations during development rather than in production.
Implement Explicit State Machines
Don't let agents decide workflow order through free-form reasoning. Define explicit state transitions:
from enum import Enum

class InvalidTransition(Exception):
    """Raised when a workflow tries to move to a state that is not allowed."""

class WorkflowState(Enum):
    ANALYZE = "analyze"
    SUMMARIZE = "summarize"
    VALIDATE = "validate"
    COMPLETE = "complete"

class StatefulRLM:
    def __init__(self):
        self.state = WorkflowState.ANALYZE
        self.context = AgentContext()

    def transition(self, new_state):
        # Each state lists the only states it may move to next
        allowed = {
            WorkflowState.ANALYZE: [WorkflowState.SUMMARIZE],
            WorkflowState.SUMMARIZE: [WorkflowState.VALIDATE],
            WorkflowState.VALIDATE: [WorkflowState.COMPLETE],
        }
        if new_state not in allowed.get(self.state, []):
            raise InvalidTransition(
                f"Cannot go from {self.state} to {new_state}"
            )
        self.state = new_state
Explicit state machines make your agentic systems testable and debuggable. You can unit test state transitions independently of LLM behavior, which matters when you're trying to validate AI systems before production deployment.
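To make the testability point concrete, here is a self-contained sketch of transition tests that never touch an LLM. It uses a compact, table-driven variant of the state machine; the `State`, `ALLOWED`, and `can_transition` names are chosen for this example and are not part of any framework.

```python
from enum import Enum


class State(Enum):
    ANALYZE = "analyze"
    SUMMARIZE = "summarize"
    VALIDATE = "validate"
    COMPLETE = "complete"


# Transition table: each state maps to the states it may move to next.
ALLOWED = {
    State.ANALYZE: {State.SUMMARIZE},
    State.SUMMARIZE: {State.VALIDATE},
    State.VALIDATE: {State.COMPLETE},
}


def can_transition(current: State, new: State) -> bool:
    return new in ALLOWED.get(current, set())


def test_happy_path_is_allowed():
    path = [State.ANALYZE, State.SUMMARIZE, State.VALIDATE, State.COMPLETE]
    assert all(can_transition(a, b) for a, b in zip(path, path[1:]))


def test_skipping_and_leaving_complete_are_rejected():
    assert not can_transition(State.ANALYZE, State.VALIDATE)
    assert not can_transition(State.COMPLETE, State.ANALYZE)


test_happy_path_is_allowed()
test_skipping_and_leaving_complete_are_rejected()
print("transition tests passed")
```

Because the table is plain data, these tests run in milliseconds and fail for exactly one reason: the workflow rules changed.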
Separate Orchestration from Execution
Your orchestrator should manage workflow logic while individual agents focus on specific tasks. This separation lets you swap agent implementations without rewriting orchestration code:
class RLMOrchestrator:
    def __init__(self, agents):
        self.agents = agents
        self.context = AgentContext()

    def run_workflow(self, task):
        # Orchestrator controls flow
        analysis = self.agents['analyzer'].run(task, self.context)
        self.context.store('analysis', analysis)
        summary = self.agents['summarizer'].run(
            "Summarize the analysis",
            self.context
        )
        self.context.store('summary', summary)
        return self.agents['validator'].run(
            "Validate summary accuracy",
            self.context
        )
This pattern mirrors how forward deployed engineers structure production AI systems for maintainability and iteration speed.
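The swap-ability claim is easiest to see with stub agents. In this sketch the `Agent` protocol, the two summarizer classes, and the dict-based context are all assumptions made to keep the example self-contained; neither stub calls an LLM.

```python
from typing import Protocol


class Agent(Protocol):
    def run(self, task: str, context: dict) -> str: ...


class TruncatingSummarizer:
    """Stub agent: no LLM, just keeps the first 20 characters of the analysis."""

    def run(self, task: str, context: dict) -> str:
        return context["analysis"][:20]


class UppercaseSummarizer:
    """Alternative stub with the same interface."""

    def run(self, task: str, context: dict) -> str:
        return context["analysis"].upper()


def run_workflow(summarizer: Agent, context: dict) -> str:
    # Orchestration code is identical whichever summarizer is plugged in.
    context["summary"] = summarizer.run("Summarize the analysis", context)
    return context["summary"]


print(run_workflow(TruncatingSummarizer(), {"analysis": "revenue grew in Q3 across regions"}))
print(run_workflow(UppercaseSummarizer(), {"analysis": "revenue grew"}))
```

The same structure lets you run orchestration tests against deterministic stubs and reserve real model calls for end-to-end checks.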
Passing by Reference vs Passing by Value in LLM Agents
The reference vs value distinction isn't just about token efficiency. It fundamentally changes how you architect information flow in multi-agent systems.
Passing by value creates implicit dependencies. Agent B receives Agent A's full output whether it needs it or not. You can't easily version or modify Agent A's output format without breaking Agent B's expectations.
Passing by reference makes dependencies explicit. Agent B must actively choose to read Agent A's output, and that choice is visible in your execution logs. When debugging why Agent B failed, you can see exactly what context it loaded and when.
This visibility becomes critical in systems with 10+ agents where tracking information flow through value-passing becomes nearly impossible. Reference-based architectures give you clear dependency graphs that you can visualize and optimize.
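A dependency graph of this kind can be derived from a read log. The sketch below assumes every store and read call is tagged with the calling agent's name; the `LoggedContext` class and its method signatures are hypothetical, chosen only to show the bookkeeping.

```python
from collections import defaultdict


class LoggedContext:
    """Context store that records which agent read which key (hypothetical helper)."""

    def __init__(self):
        self.variables = {}
        self.producers = {}  # key -> agent that stored it
        self.read_log = []   # (reader, key) tuples in call order

    def store(self, agent: str, key: str, value) -> None:
        self.variables[key] = value
        self.producers[key] = agent

    def read(self, agent: str, key: str):
        self.read_log.append((agent, key))
        return self.variables.get(key)

    def dependency_graph(self) -> dict:
        """Map each reader to the set of agents whose outputs it actually loaded."""
        graph = defaultdict(set)
        for reader, key in self.read_log:
            if key in self.producers:
                graph[reader].add(self.producers[key])
        return dict(graph)


ctx = LoggedContext()
ctx.store("agent_a", "analysis", "long analysis text")
ctx.store("agent_b", "summary", "short summary")
ctx.read("agent_c", "summary")
print(ctx.dependency_graph())
```

Here agent_c depends only on agent_b, even though agent_a's output was available the whole time. With value passing, that distinction would not show up anywhere.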
Handling Unbounded Outputs
Traditional agent architectures hit hard limits when outputs exceed context windows. If your analysis agent produces 200k tokens of structured data, you simply can't pass that to the next agent in most frameworks.
RLMs store that data in Python objects, which are limited only by available memory. A 200k token output becomes a Python string or structured object that agents can selectively query:
# Store large structured output
large_analysis = {
    'summary': '...',
    'detailed_findings': [...],  # Could be massive
    'metadata': {...}
}
context.store('analysis', large_analysis)
# Next agent only reads what it needs
agent_prompt = """
Large analysis available in context.analysis
Use READ(context.analysis['summary']) to get overview
Only read detailed_findings if you need specific data points
"""
This pattern supports outputs of effectively unlimited size. Production systems handle structured data with millions of tokens stored in context while individual agent calls stay well under token limits.
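Selective querying of a large stored object might look like the sketch below, which resolves a dotted path against nested dicts and lists. The `read_path` helper, the dotted-path syntax, and the sample data are assumptions for illustration.

```python
def read_path(variables: dict, path: str):
    """Resolve a dotted path such as 'analysis.summary' against stored objects."""
    value = variables
    for part in path.split("."):
        if isinstance(value, list):
            value = value[int(part)]  # numeric segment indexes into a list
        else:
            value = value[part]
    return value


variables = {
    "analysis": {
        "summary": "Churn is concentrated in the trial tier.",
        # Large payload that never needs to enter a prompt in full.
        "detailed_findings": [{"id": i, "note": f"finding {i}"} for i in range(100_000)],
    }
}

print(read_path(variables, "analysis.summary"))
print(read_path(variables, "analysis.detailed_findings.42.note"))
```

The agent pays tokens only for the one field it asked for, while the other 99,999 findings stay in memory.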
Implementing RLM Scaffolding in Your System
You don't need to rewrite your entire agentic architecture to adopt RLM patterns. Start with the highest-traffic workflows where token costs are most painful.
First, identify workflows where intermediate results are large but subsequent agents only need summaries or specific fields. These are your best candidates for reference-based passing. Instrument your current system to track token usage per agent call, then calculate potential savings if only 20-30% of stored context gets loaded.
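Once you have per-call token logs, ranking workflows by potential savings takes only a few lines. This sketch assumes each log record is a `(workflow_name, context_tokens)` pair and that 30% of context would still be loaded after migration; both the record shape and the sample numbers are made up.

```python
from collections import Counter


def rank_by_potential_savings(call_log, load_percent=30):
    """Estimate tokens saved per workflow if only load_percent of context were read."""
    savings = Counter()
    for workflow, context_tokens in call_log:
        savings[workflow] += context_tokens * (100 - load_percent) // 100
    return savings.most_common()  # largest estimated saving first


call_log = [
    ("report", 12_000), ("report", 18_000),
    ("triage", 1_500), ("triage", 2_500),
]
print(rank_by_potential_savings(call_log))
```

In this made-up log the report workflow is the obvious first candidate, with an estimated 21,000 tokens saved against 2,800 for triage.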
Build a simple context manager class first, then gradually migrate agents to use READ() and FINAL() functions. You can run hybrid systems where some agents use traditional value passing while others use references, letting you validate the approach before full migration.
The architectural shift to RLMs requires upfront engineering work but pays dividends in token costs, response latency, system scalability, and honestly, developer sanity. For production agentic systems handling complex multi-step workflows, reference-based context management isn't optional anymore.