Gemini 2.5: The 1M-Token Long-Context Agent Model
White Paper

Jake McCluskey

Source: Gemini 2.5 Technical Report (PDF) (Google DeepMind, 2025)
Series: The 10 Agent Whitepapers Every Builder Should Read

TL;DR

For many agent workloads, Gemini 2.5 Pro is the model where context length stops being the binding constraint. With a 1 million token window (2M in preview) and MCP support built into the SDK, it lets agents hold an entire codebase, a full day of logs, or a 200-page research corpus in one prompt, with no chunking and no retrieval tricks. On SWE-bench Verified it scores 63.8% with a custom agent. Deep Think mode adds parallel-hypothesis reasoning that scores state of the art on 2025 USAMO, currently one of the hardest public math benchmarks.

For agent builders, Gemini 2.5 is the "skip the vector DB" model. Its value prop is simple: if your data fits in 1M tokens, you can probably delete your entire RAG pipeline.

1. What it is

Gemini 2.5 ships three things that matter for agents.

1.1. A 1M-token context window (2M in preview)

That's roughly:

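  • About 750,000 English words
  • About 60,000 lines of Python
  • A mid-size repository in full (the whole Linux kernel, at tens of millions of lines, does not fit)
  • About 8 hours of audio passed in natively (a transcript of the same audio is far smaller)
  • A 1,500-page book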

The cost is non-zero (you pay per input token), but the engineering complexity of RAG for mid-size corpora disappears.
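
Before committing to a single-prompt design, it helps to check whether the corpus actually fits. The sketch below is a rough estimate only: it assumes about 4 characters per token (a common rule of thumb for English text, not a Gemini-specific figure), and the `reserve` default is a hypothetical allowance for instructions and output. For exact counts, the SDK's token-counting call is the better source.

```python
CONTEXT_WINDOW = 1_000_000   # window size stated in the text
CHARS_PER_TOKEN = 4          # assumption: rough rule of thumb for English text


def estimate_tokens(text: str) -> int:
    """Estimate token count from character length (heuristic, not exact)."""
    return len(text) // CHARS_PER_TOKEN + 1


def fits_in_window(texts: list[str], reserve: int = 50_000) -> bool:
    """Return True if all texts fit while leaving `reserve` tokens free.

    The reserve covers the instructions and the model's answer; its default
    value is a hypothetical choice, not a documented requirement.
    """
    total = sum(estimate_tokens(t) for t in texts)
    return total <= CONTEXT_WINDOW - reserve
```

Running `fits_in_window` over the files you plan to send gives a quick go/no-go before paying for a real request.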

1.2. Native MCP support

Gemini's SDK added first-class Model Context Protocol support in 2025. You point it at MCP servers much as you would with Claude Desktop, with no custom adapter layer, so MCP servers built for Claude generally work with Gemini, and vice versa.

1.3. Deep Think, parallel-hypothesis reasoning

Deep Think is Gemini's analog to GPT-5's thinking tier and Claude's extended thinking. It differs in one important way: instead of a single long chain of thought, it explores multiple hypotheses in parallel and picks the best. The public evidence it works: state-of-the-art on 2025 USAMO (the US Math Olympiad), which is brutally hard.
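
To make "explore multiple hypotheses in parallel and pick the best" concrete, here is a generic best-of-N sketch. It is not Google's implementation, whose internals are not described here at this level of detail; `propose` and `score` are hypothetical callables standing in for a candidate generator and a verifier.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def best_of_n(
    propose: Callable[[int], str],
    score: Callable[[str], float],
    n: int = 4,
) -> str:
    """Run `propose` n times in parallel and return the highest-scoring candidate."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(propose, range(n)))
    return max(candidates, key=score)


# Toy usage: each "hypothesis" is a guess at sqrt(2); the scorer rewards
# guesses whose square is closest to 2.
guesses = {0: "1.0", 1: "1.4", 2: "1.5", 3: "2.0"}
best = best_of_n(lambda i: guesses[i], lambda g: -abs(float(g) ** 2 - 2))
print(best)
```

The contrast with a single chain of thought is that a weak candidate costs only one of the parallel slots instead of derailing the whole answer.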

1.4. Benchmark snapshot (from the report and launch materials)

Benchmark                           | Score                              | Why it matters
SWE-bench Verified                  | 63.8% (custom agent)               | Real-world code-edit agent performance
2025 USAMO                          | State-of-the-art (Deep Think)      | Hard multi-step reasoning
Long-context needle-in-haystack @1M | Very high                          | Confirms the context window is usable, not just nominal
GAIA                                | Competitive with top closed models | General agentic task benchmark

2. Why it matters

Three architectural shifts that become possible or cheap with Gemini 2.5.

2.1. "Just put the repo in the prompt"

RAG was always a cost optimization. When your context was 8K or 32K tokens, you had to chunk, embed, retrieve, re-rank. With 1M tokens, most internal corpora fit whole. Which means:

  • No chunk-boundary bugs
  • No embedding model drift
  • No re-ranking heuristics
  • No "the relevant paragraph got split across two chunks" issues

You trade that complexity for a higher per-call input-token bill. For agent workloads (where reasoning quality dominates cost), the trade is almost always worth it.
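
The trade can be put in numbers with a few lines of arithmetic. The price below is a placeholder for illustration, not a quoted rate, and the two prompt sizes are hypothetical; substitute current pricing and your own token counts.

```python
def input_cost_usd(input_tokens: int, calls: int, usd_per_million: float) -> float:
    """Total input-token spend for `calls` requests of `input_tokens` each."""
    return input_tokens / 1_000_000 * usd_per_million * calls


# Hypothetical numbers for illustration only; check current pricing.
PRICE = 2.00  # USD per 1M input tokens (placeholder, not a quoted rate)
full_context = input_cost_usd(400_000, calls=100, usd_per_million=PRICE)
rag_style = input_cost_usd(8_000, calls=100, usd_per_million=PRICE)
print(f"full context: ${full_context:.2f}, retrieval-sized prompt: ${rag_style:.2f}")
```

The gap scales linearly with call volume, so the same comparison at your real request rate shows whether the simpler pipeline is worth the bill.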

2.2. Full conversation replay

For support agents, sales agents, or anything with a long history, you can keep the entire conversation in context. Memory MCP becomes an optimization, not a requirement. Users notice. The agent stops forgetting what they said ten turns ago.
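
A minimal sketch of that idea: keep every turn and trim only as a last resort. The `Conversation` class is hypothetical, and the token estimate assumes about 4 characters per token, which is a rule of thumb and not an exact count.

```python
from dataclasses import dataclass, field


@dataclass
class Conversation:
    """Keeps every turn; drops the oldest turns only if the budget is exceeded."""
    budget_tokens: int = 1_000_000
    turns: list[tuple[str, str]] = field(default_factory=list)  # (role, text)

    @staticmethod
    def _tokens(text: str) -> int:
        return len(text) // 4 + 1  # assumption: ~4 characters per token

    def add(self, role: str, text: str) -> None:
        self.turns.append((role, text))

    def as_context(self) -> list[tuple[str, str]]:
        """Return the full history, trimming from the front only when over budget."""
        kept = list(self.turns)
        while len(kept) > 1 and sum(self._tokens(t) for _, t in kept) > self.budget_tokens:
            kept.pop(0)
        return kept
```

With a 1M-token budget the trimming branch rarely runs, which is the point: the full history is the default and an external memory store becomes optional.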

2.3. Whole-session agent traces as context

For debugging agents in production, you can feed the model the entire failing trace (system prompt plus every tool call plus every result) and ask "where did this go wrong?" That kind of post-hoc analysis was infeasible at 32K.
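
One way to prepare such a trace is to flatten it into numbered, labeled blocks so the model can point at a specific step. The trace schema here (`tool`, `args`, `result` keys) is an assumption for illustration; adapt it to whatever your agent framework records.

```python
import json


def render_trace(system_prompt: str, steps: list[dict]) -> str:
    """Flatten a trace into numbered, labeled blocks the model can cite by step."""
    lines = ["## SYSTEM PROMPT", system_prompt]
    for i, step in enumerate(steps, start=1):
        lines.append(f"## STEP {i}: call {step['tool']}")
        lines.append("args: " + json.dumps(step["args"], sort_keys=True))
        lines.append("result: " + json.dumps(step["result"], sort_keys=True))
    lines.append("## QUESTION")
    lines.append("Where did this go wrong? Cite the step number.")
    return "\n".join(lines)
```

The returned string can be passed as the `contents` of a single request.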

2.4. MCP as the portable layer

Gemini, Claude, Cursor, and Zed all speak MCP. The same Slack, GitHub, Postgres, and Drive servers run against all of them. This is the biggest under-appreciated fact in the agent ecosystem right now: the connector layer is now multi-vendor.

3. How to do it

3.1. Install and authenticate

pip install google-genai
export GEMINI_API_KEY=...

3.2. The minimum long-context agent

from pathlib import Path

from google import genai

client = genai.Client()  # reads GEMINI_API_KEY from the environment

# Load an entire codebase into a single prompt, one labeled block per file
codebase = ""
for path in sorted(Path("./src").rglob("*.py")):
    codebase += f"\n\n# FILE: {path}\n" + path.read_text(encoding="utf-8")

resp = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=[
        codebase,
        "Find every function that opens a DB connection without closing it. "
        "Return file path + line number. Think hard before answering."
    ],
)
print(resp.text)

No vector DB. No chunking. No re-ranking. 1M tokens, one call.
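
The prompt above asks for line numbers, but the packed text carries none, so the model has to count lines itself. A small variation, sketched here with a hypothetical `pack_repo` helper, prefixes every line with its number so the answer can be checked against the files.

```python
from pathlib import Path


def pack_repo(root: str, pattern: str = "*.py") -> str:
    """Concatenate matching files, prefixing every line with its line number."""
    blocks = []
    for path in sorted(Path(root).rglob(pattern)):
        text = path.read_text(encoding="utf-8", errors="replace")
        numbered = "\n".join(
            f"{n:>5}: {line}" for n, line in enumerate(text.splitlines(), start=1)
        )
        blocks.append(f"# FILE: {path.as_posix()}\n{numbered}")
    return "\n\n".join(blocks)
```

Line prefixes add tokens to every line, so this is worth doing mainly when the task needs precise locations.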

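3.3. Raise the thinking budget for hard reasoning

resp = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=prompt,  # `prompt` is your task text
    config={
        # Upper bound on internal reasoning tokens; this is the standard
        # thinking control, not the Deep Think mode itself.
        "thinking_config": {"thinking_budget": 32000},
    },
)

thinking_budget caps how many tokens Gemini spends on internal reasoning before it answers. A higher budget tends to improve accuracy on hard problems at the cost of latency and spend. For agent planning, 8K to 32K is typical. Deep Think is a separate reasoning mode and is not switched on by this parameter; check the current model documentation for how it is exposed.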
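
One way to keep budgets consistent across an agent is to choose them per task type in a single place. The categories and values below are hypothetical, and the bounds are assumed limits for gemini-2.5-pro that should be verified against the current API documentation.

```python
# Assumed bounds for gemini-2.5-pro; verify against the current API docs.
MIN_BUDGET, MAX_BUDGET = 128, 32_768

# Hypothetical task categories; the planning value falls in the 8K to 32K
# range mentioned in the text.
BUDGETS = {"chat": MIN_BUDGET, "planning": 16_000, "proof": MAX_BUDGET}


def thinking_config(task_kind: str) -> dict:
    """Build a config dict with a thinking budget clamped to the assumed bounds."""
    budget = BUDGETS.get(task_kind, 8_000)
    budget = max(MIN_BUDGET, min(MAX_BUDGET, budget))
    return {"thinking_config": {"thinking_budget": budget}}
```

The returned dict has the same shape as the `config` argument used in the request above.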

3.4. Use MCP servers natively

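import asyncio

from google import genai
from google.genai import types
from mcp import ClientSession, StdioServerParameters  # pip install mcp
from mcp.client.stdio import stdio_client

client = genai.Client()
server = StdioServerParameters(
    command="npx", args=["-y", "@modelcontextprotocol/server-github"]
)

async def main():
    # Start the MCP server as a subprocess and open a client session to it
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            resp = await client.aio.models.generate_content(
                model="gemini-2.5-pro",
                contents="Summarize open PRs on anthropics/anthropic-sdk-python.",
                config=types.GenerateContentConfig(
                    tools=[session],  # pass the MCP session as a tool source (experimental in the SDK)
                ),
            )
            print(resp.text)

asyncio.run(main())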

The MCP servers in the 10 Best MCP Servers list should work here unchanged.

3.5. Multi-step tool use, the agent loop

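from google.genai import types

# Declare one tool; `q` is the SQL text the model wants to run.
tools = [types.Tool(function_declarations=[{
    "name": "run_sql",
    "description": "Read-only warehouse query.",
    "parameters": {"type": "object", "properties": {"q": {"type": "string"}}, "required": ["q"]},
}])]

# `task` (str) and `dispatch(name, args)` are supplied by your application.
contents = [types.Content(role="user", parts=[types.Part(text=task)])]
for _ in range(20):  # cap the steps so a confused model cannot loop forever
    resp = client.models.generate_content(
        model="gemini-2.5-pro",
        contents=contents,
        config=types.GenerateContentConfig(tools=tools),
    )
    calls = resp.function_calls  # empty or None when the model answers in text
    if not calls:
        print(resp.text)
        break
    contents.append(resp.candidates[0].content)  # keep the model's tool request in history
    # Answer every call in the turn, not only the first part of the response.
    contents.append(types.Content(role="user", parts=[
        types.Part.from_function_response(
            name=fn.name, response={"result": dispatch(fn.name, fn.args)}
        )
        for fn in calls
    ]))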
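
The loop calls a `dispatch` function that is not shown. A minimal sketch of one follows, assuming a name-to-function registry and an in-memory SQLite database standing in for the warehouse. The SELECT-prefix check is a naive guard for illustration, not a security boundary; a real deployment should rely on a read-only database role.

```python
import sqlite3

# Stand-in for the warehouse: an in-memory SQLite database (assumption).
_db = sqlite3.connect(":memory:")
_db.execute("CREATE TABLE orders (id INTEGER, total REAL)")
_db.execute("INSERT INTO orders VALUES (1, 9.5), (2, 20.0)")


def run_sql(q: str) -> dict:
    """Run a single SELECT statement and return rows as a JSON-friendly dict."""
    if not q.lstrip().lower().startswith("select"):
        return {"error": "only SELECT statements are allowed"}
    try:
        cur = _db.execute(q)
        cols = [c[0] for c in cur.description]
        return {"rows": [dict(zip(cols, row)) for row in cur.fetchall()]}
    except (sqlite3.Error, sqlite3.Warning) as exc:
        # Return errors as data so the model can read them and retry.
        return {"error": str(exc)}


TOOLS = {"run_sql": run_sql}


def dispatch(name: str, args: dict) -> dict:
    """Look up the tool by name and call it with the model-supplied arguments."""
    tool = TOOLS.get(name)
    if tool is None:
        return {"error": f"unknown tool: {name}"}
    return tool(**args)
```

Returning errors as ordinary results, instead of raising, lets the model see what failed and issue a corrected call on the next turn.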

3.6. When Gemini 2.5 is the right pick

Use Gemini 2.5 when:

  • Your corpus is between 200K and 2M tokens. Above that range, split the corpus or add retrieval; below it, a model such as Claude Sonnet is enough.
  • You have non-text content. Native video and audio understanding is a differentiator.
  • You need low latency from a thinking model. Gemini 2.5 Flash is the faster, cheaper sibling of Pro and trades some quality for speed.
  • You're building a Google Cloud-native agent (Vertex AI, BigQuery, GCS integration is first-class)

Use a different model when:

  • You need the strongest SWE-bench performance: Claude Opus or GPT-5
  • You need tight cache economics on small prompts: Claude Haiku
  • You need open weights: Kimi K2 or Llama
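
The size thresholds in the lists above can be written as a small routing function. This is a sketch of the rule of thumb only; the function name and return labels are hypothetical, and it ignores the other criteria (SWE-bench strength, caching, open weights).

```python
def pick_context_strategy(corpus_tokens: int, needs_av: bool = False) -> str:
    """Map corpus size (and modality) to a strategy, using the text's thresholds."""
    if corpus_tokens > 2_000_000:
        return "split-or-retrieve"      # too large for a single prompt
    if corpus_tokens >= 200_000 or needs_av:
        return "gemini-long-context"    # the 200K to 2M band, or audio/video input
    return "smaller-context-model"      # below 200K, long context is not the deciding factor
```

For example, a 400K-token repository snapshot lands in the long-context band, while a 50K-token text-only corpus does not.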

4. Architecture patterns made cheap by 1M context

4.1. "Entire codebase as context" code agent

┌──────────────────────────────────────────────┐
│  System prompt (2K tokens)                   │
│  Repo snapshot: all .py + .md (400K tokens)  │
│  Issue text (2K tokens)                      │
│  Prior reviewer notes (10K tokens)           │
│  → Gemini 2.5 Pro                            │
│  → Returns diff + rationale                  │
└──────────────────────────────────────────────┘

No retrieval. The model sees everything.

4.2. "Full transcript analysis" support agent

Feed every ticket, every reply, every internal note into the context. Ask "what's the root cause of this customer's churn risk?" The model has the whole timeline.

4.3. "Whole research corpus" analyst

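50 PDFs at about 20K tokens each comes to 1M tokens, which fills the window, so in practice you would drop a few papers to leave room for the instructions and the answer. The agent writes a cross-paper literature review with citations back to the source papers.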
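
The document count is simple integer arithmetic once a reserve is chosen. The 40K reserve below is a hypothetical figure for instructions plus output, not a recommendation from the report.

```python
def max_documents(window: int, tokens_per_doc: int, reserve: int) -> int:
    """How many equally sized documents fit once `reserve` tokens are set aside."""
    return max(0, (window - reserve) // tokens_per_doc)


# With no reserve, 20K-token documents divide a 1M window evenly.
no_reserve = max_documents(1_000_000, 20_000, reserve=0)
# With a hypothetical 40K reserve for instructions and output:
with_reserve = max_documents(1_000_000, 20_000, reserve=40_000)
print(no_reserve, with_reserve)  # prints: 50 48
```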

5. Key takeaways

  • 1M tokens kills RAG for most use cases. You'll still want retrieval for the 10M-plus token domain, but the 100K to 1M middle class just got simpler.
  • Native MCP means the connector layer is cross-vendor. Build servers once; use from Claude, Gemini, Cursor, Zed.
  • Deep Think for hard planning. Use it for multi-step agent plans and constraint-satisfaction problems; skip it for chit-chat.
  • Budget input tokens, not vector-DB ops. Your cost model shifts from "embeddings plus infra" to "per-token input". That's a product/finance decision, not just an engineering one.

Frequently asked questions

How many tokens can Gemini 2.5 Pro handle in a single context window?

Gemini 2.5 Pro supports a 1 million token context window, with a 2 million token preview version available. This is roughly equivalent to 750,000 English words, 60,000 lines of Python code, or a 1,500-page book. The large context window means you can often skip chunking and retrieval pipelines for mid-size corpora.

What is Deep Think mode in Gemini 2.5 and when should I use it?

Deep Think is Gemini 2.5's extended reasoning mode that explores multiple hypotheses in parallel rather than following a single chain of thought. It achieved state-of-the-art performance on the 2025 USAMO math benchmark. Use it for multi-step agent plans and constraint-satisfaction problems, and skip it for simple conversational tasks where the extra reasoning cost is unnecessary.

Does Gemini 2.5 support the Model Context Protocol natively?

Yes, Gemini 2.5's SDK includes first-class Model Context Protocol support. You can point it at MCP servers the same way you would with Claude Desktop, with no adapter layer required. This means every MCP server built for Claude also works with Gemini, making the connector layer portable across multiple vendors including Claude, Cursor, and Zed.

What score did Gemini 2.5 achieve on SWE-bench Verified?

Gemini 2.5 achieved 63.8 percent on SWE-bench Verified using a custom agent. SWE-bench Verified measures real-world code-edit agent performance, making this benchmark a practical indicator of how well the model handles actual software engineering tasks.

When should I use Gemini 2.5 instead of Claude or GPT models for agent workloads?

Use Gemini 2.5 when your corpus is between 200K and 2M tokens, when you need native video and audio understanding, when you need low latency from a thinking model (Gemini 2.5 Flash), or when building Google Cloud-native agents with Vertex AI and BigQuery integration. For the strongest SWE-bench performance choose Claude Opus or GPT-5, for tight cache economics on small prompts use Claude Haiku, and for open weights use Kimi K2 or Llama.
