GPT-5 System Card: The Router Era of AI Agents
White Paper

Jake McCluskey

Source: OpenAI GPT-5 System Card (PDF) (OpenAI, August 2025)
Series: The 10 Agent Whitepapers Every Builder Should Read

TL;DR

GPT-5 is not a single model but a routed system. A real-time router sits in front of two model families (gpt-5-main for fast answers, gpt-5-thinking for hard problems) and decides which one handles each turn. It is also the first OpenAI model where native tool use is a default rather than an optional feature. For agent builders, that means the router you were going to build yourself (fast vs. deep, tool-using vs. chatty) is now baked into the frontier API.

Benchmarks that matter: 74.9% SWE-bench Verified, 94.6% AIME 2025 (no tools), 88.4% GPQA (Pro). These were state-of-the-art results at release on the three evals most predictive of agent quality.

1. What it is

The GPT-5 system card describes a unified endpoint that transparently dispatches a single user request to one of four models:

Model | Role
gpt-5-main | Fast, high-throughput path. Answers most queries directly.
gpt-5-main-mini | Cheaper main model for when limits are hit.
gpt-5-thinking | Extended reasoning. Hard math, multi-step code, ambiguous tasks.
gpt-5-thinking-mini | Cheaper thinking model for limit overflow.

The router

The router is a learned online system fed by four signals:

  1. Conversation type: casual chat vs. technical problem
  2. Complexity: short factual query vs. multi-constraint puzzle
  3. Tool needs: does the request imply retrieval, code, or web access?
  4. Explicit intent: phrases like "think hard about this" steer the request toward the thinking path

It retrains on real usage signals: which model users switched to after an initial answer, thumbs-up rates, measured correctness. The implication is that the router gets better over time without you doing anything.
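
To make the four signals concrete, here is a toy, rule-based stand-in for the routing decision. Treat it as a sketch only: the real router is a learned system whose features and thresholds are not public, so the `Request` fields, the `route` function, and the scoring rule are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical feature bundle; the real router's inputs are not public."""
    technical: bool            # conversation type: technical problem vs. casual chat
    constraints: int           # complexity: rough count of constraints in the ask
    needs_tools: bool          # tool needs: retrieval, code, or web access implied
    asked_to_think: bool       # explicit intent, e.g. "think hard about this"
    over_limit: bool = False   # usage limits reached, so fall back to a mini model


def route(req: Request) -> str:
    """Toy rule-based stand-in for the learned router (illustrative only)."""
    # Explicit intent is treated as decisive here; otherwise count the weaker signals.
    score = int(req.technical) + int(req.constraints >= 3) + int(req.needs_tools)
    deep = req.asked_to_think or score >= 2
    model = "gpt-5-thinking" if deep else "gpt-5-main"
    return model + "-mini" if req.over_limit else model
```

A hand-written rule like this is exactly what goes stale; the point of a learned router is that the equivalent of `score` is refit from usage data instead of being tuned by hand.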

Native tool use

Where GPT-4 supported tool calling as an optional feature, GPT-5 treats tools as a first-class modality. The system card distinguishes "GPT-5 no tools" from "GPT-5 with tools" results on AIME (math), so that tool-assisted and unassisted scores are not compared as if they measured the same thing.

2. Why it matters

Every serious agent project before GPT-5 had to build its own router. The pattern was always the same:

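def pick_model(query: str) -> str:
    # Hand-rolled routing of the kind agent projects used to maintain.
    # The heuristics below are illustrative placeholders, not a recommendation.
    text = query.lower()
    if len(text.split()) < 8:  # crude "looks simple" check
        return "cheap_model"
    if any(word in text for word in ("code", "math")):
        return "reasoning_model"
    return "default_model"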

That routing logic is fragile, manual, and never kept up with the frontier. GPT-5 moves it inside the model boundary. Consequences for agent builders:

  1. One endpoint, no router code. You call gpt-5, the system figures it out. Your agent stops needing a dispatcher.
  2. Bill shrinks without code changes. Fast queries go to gpt-5-main automatically; you no longer overspend on Pro for chit-chat.
  3. Agent eval surface changes. You now eval the system end-to-end, not "model X with prompt Y." The router is part of your product whether you like it or not.
  4. "Think hard" becomes a usable lever. You can bias the router toward the reasoning path with intent phrasing in your system prompt ("Before acting, think hard."), without adding a new parameter. It is a bias, not a guarantee.
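
Because the serving model can differ from call to call, an end-to-end eval is easier to interpret if you record which model actually answered. A minimal sketch follows: `tally_served_models` is a hypothetical helper, and it assumes each response object exposes `model` and `usage.total_tokens`, as Chat Completions responses do.

```python
from collections import defaultdict


def tally_served_models(responses):
    """Group call counts and total tokens by the model that served each response."""
    stats = defaultdict(lambda: {"calls": 0, "tokens": 0})
    for resp in responses:
        entry = stats[resp.model]          # model ID reported on the response
        entry["calls"] += 1
        entry["tokens"] += resp.usage.total_tokens
    return dict(stats)
```

Run it over the responses from an eval batch and compare the breakdown between runs; a shift in the mix is a change in your product even if your code did not change.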

Why this is bigger than another model release: it's the first time a major lab has shipped a multi-model product under a single ID. The router is the same architectural idea you'll find in AgentKit, Magentic-UI, and Gemini Deep Think. Fast path for throughput, slow path for depth. GPT-5's system card is the clearest public description of how that idea ships in production.

3. How to do it

3.1. Call the unified endpoint

from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-5",  # routed automatically
    messages=[
        {"role": "system", "content": "You are a senior ops engineer. Think hard before acting."},
        {"role": "user",   "content": "Debug why the K8s deploy on prod-us-west is stuck in CrashLoopBackoff."},
    ],
    tools=[...],  # native tool use
)

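When the request is routed, the router sees the complexity, picks gpt-5-thinking, and you pay the thinking-tier rate. For a casual "summarize this email," the same endpoint picks gpt-5-main at a fraction of the cost. The system card describes the router as part of ChatGPT and says the API provides direct access to the thinking model and its smaller variants, so confirm how the gpt-5 model ID behaves in your account before budgeting around routed pricing.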

3.2. Force a tier when you need determinism

The card documents that explicit intent phrases bias the router; they do not guarantee a tier. Use them when cost or latency matters:

# Bias toward the fast path (cheap, low latency)
messages=[{"role":"user","content":"Quick: what's the capital of Peru?"}]

# Bias toward the thinking path (slower, more thorough)
messages=[{"role":"user","content":"Think hard about this. Prove ..."}]

# Or call a sub-model directly for a fixed tier, if your account exposes that ID
model="gpt-5-thinking"  # pinned reasoning tier; confirm the ID against your models list

For agent orchestrators, put "Think hard before calling any tool." in your system prompt. That biases every reasoning step toward the deep path.
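
If only some tasks deserve the deep path, you can add the phrase conditionally instead of hard-coding it. The helper below is a sketch under that assumption; `with_depth_hint` and `DEPTH_HINT` are hypothetical names, and the phrase remains a bias rather than a switch.

```python
DEPTH_HINT = "Think hard before calling any tool."


def with_depth_hint(messages, deep):
    """Return a copy of `messages` whose system prompt carries the hint when `deep` is true."""
    if not deep:
        return list(messages)
    out = [dict(m) for m in messages]      # copy so the caller's list is untouched
    for m in out:
        if m["role"] == "system":
            if DEPTH_HINT not in m["content"]:
                m["content"] = m["content"].rstrip() + " " + DEPTH_HINT
            return out
    # No system message yet: add one at the front.
    return [{"role": "system", "content": DEPTH_HINT}] + out
```

Pair this with whatever signal you already trust (task type, retry count, user tier) to decide when `deep` is true.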

3.3. Use native tool use, not function calling v1 style

The card emphasizes parallel, chained, long-horizon tool use. Pattern:

tools = [
    {"type": "function", "function": {
        "name": "run_sql",
        "description": "Runs a read-only SQL query against the warehouse.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    }},
    {"type": "function", "function": {
        "name": "send_slack",
        "description": "Posts a message to a Slack channel.",
        "parameters": {
            "type": "object",
            "properties": {
                "channel": {"type": "string"},
                "text":    {"type": "string"},
            },
            "required": ["channel", "text"],
        },
    }},
]

resp = client.chat.completions.create(
    model="gpt-5",
    messages=[...],
    tools=tools,
    parallel_tool_calls=True,   # GPT-5 is good at this
)

Rule: let GPT-5 decide parallelism. Don't serialize in your code what the model can run concurrently.
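
On your side of the boundary, honoring that rule means executing the calls from one assistant turn concurrently. The following is a minimal sketch using a thread pool. For self-containment it assumes each call is a plain dict with `id`, `name`, and `arguments` keys (the SDK returns objects instead), and `run_tool_calls` and `registry` are hypothetical names.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def run_tool_calls(tool_calls, registry, max_workers=4):
    """Run every tool call from one assistant turn concurrently.

    Returns "tool" messages in the same order as `tool_calls`.
    """
    def run_one(call):
        fn = registry[call["name"]]
        args = json.loads(call["arguments"])   # arguments arrive as a JSON string
        return {"role": "tool", "tool_call_id": call["id"], "content": str(fn(**args))}

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_one, tool_calls))   # map preserves input order
```

This is only safe when the tools in one turn are independent of each other; anything with ordering-sensitive side effects should still run serially.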

3.4. The agent loop with GPT-5

import json

MAX_STEPS = 20  # arbitrary guard against a loop that never settles

messages = [{"role": "system", "content": SYSTEM}, {"role": "user", "content": task}]
for _ in range(MAX_STEPS):
    resp = client.chat.completions.create(model="gpt-5", messages=messages, tools=tools)
    msg  = resp.choices[0].message
    messages.append(msg)
    if not msg.tool_calls:
        break  # plain assistant reply, done
    for call in msg.tool_calls:
        # Arguments arrive as a JSON string; parse before dispatching.
        args = json.loads(call.function.arguments)
        result = dispatch(call.function.name, args)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": str(result),  # tool message content must be a string
        })
    # loop again so the model sees the results and decides the next step

This is the same shape as Anthropic's gather-act-verify loop. GPT-5 just happens to be doing the tier selection inside the model boundary.
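
The loop leaves `dispatch` undefined. One possible shape is a small registry keyed by tool name, sketched below. Everything here is an assumption for illustration: the `TOOLS` registry, the `tool` decorator, and the placeholder `run_sql` body are hypothetical, and errors are returned to the model as JSON rather than raised.

```python
import json

TOOLS = {}  # hypothetical registry: tool name -> Python callable


def tool(fn):
    """Register a function under its own name so `dispatch` can find it."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def run_sql(query):
    # Placeholder body; a real implementation would query the warehouse.
    return {"rows": [], "query": query}


def dispatch(name, arguments):
    """Look up a tool, call it, and return a JSON string for the tool message.

    `arguments` may be the raw JSON string from the model or an already-parsed dict.
    """
    args = json.loads(arguments) if isinstance(arguments, str) else arguments
    if name not in TOOLS:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        return json.dumps(TOOLS[name](**args))
    except Exception as exc:  # report failures to the model instead of crashing the loop
        return json.dumps({"error": str(exc)})
```

Returning errors as tool output gives the model a chance to correct its own call on the next step, which is usually cheaper than aborting the run.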

3.5. Cost and latency budgeting

The router makes budgeting trickier because each call's tier isn't deterministic. Two practical tactics:

  1. Per-request tier pinning: for anything latency-critical, call gpt-5-main directly. For anything correctness-critical, call gpt-5-thinking directly. Leave the router for the gray middle.
  2. Caps, not guesses: set max_completion_tokens low and measure which prompts actually need more. The thinking tier will use more tokens; size caps on a per-prompt basis.
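
Both tactics fit in a few lines. The sketch below pins a tier for the clear cases and grows a token cap only when a reply was actually truncated. `pick_model` and `next_cap` are hypothetical helpers, the model IDs are the ones named in this article (check them against your models list), the `ceiling` default is arbitrary, and the truncation check assumes the Chat Completions `finish_reason` value `"length"`.

```python
def pick_model(latency_critical=False, correctness_critical=False):
    """Pin a tier for the clear cases; leave the rest to the routed default."""
    if correctness_critical:          # correctness wins if both flags are set
        return "gpt-5-thinking"
    if latency_critical:
        return "gpt-5-main"
    return "gpt-5"


def next_cap(current_cap, finish_reason, ceiling=8192):
    """Double the token cap only when the last reply was cut off by it."""
    if finish_reason == "length":
        return min(current_cap * 2, ceiling)
    return current_cap
```

Start each prompt type with a low `max_completion_tokens`, feed the observed `finish_reason` into `next_cap`, and keep the cap that stops growing.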

3.6. Safety considerations from the card

The GPT-5 card is stricter than its predecessors on three fronts you'll see in your own agent:

  • Jailbreak resistance: higher refusal accuracy on prompt-injection attempts. If you ran GPT-4 behind a quarantine pattern, GPT-5 may need it less, but keep the quarantine until your own tests show it is safe to remove.
  • Hallucination: lower but not zero. The card shows HealthBench Hard at 46.2%, a specialized benchmark; in domains like that, most production agents should still require human sign-off.
  • Tool-call hallucination: the model sometimes invents tool arguments even when it picks the right tool. Validate with JSON Schema server-side, always.
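
For the last point, validation can sit directly in front of your dispatcher. The checker below is a deliberately small, standard-library stand-in that covers only `required`, `properties`, and `type` on a flat object schema; a full JSON Schema validator handles far more. `validate_args` is a hypothetical helper, and rejecting unexpected arguments is a stricter choice than JSON Schema's default.

```python
import json

_TYPES = {"string": str, "integer": int, "number": (int, float),
          "boolean": bool, "object": dict, "array": list}


def validate_args(raw_arguments, schema):
    """Check model-produced tool arguments against a flat JSON Schema.

    Returns (args, errors); `args` is None when the input is not a JSON object.
    """
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc.msg}"]
    if not isinstance(args, dict):
        return None, ["arguments must be a JSON object"]
    errors = []
    props = schema.get("properties", {})
    for key in schema.get("required", []):
        if key not in args:
            errors.append(f"missing required argument: {key}")
    for key, value in args.items():
        if key not in props:
            errors.append(f"unexpected argument: {key}")   # stricter than the spec default
            continue
        expected = _TYPES.get(props[key].get("type"))
        if expected and not isinstance(value, expected):
            errors.append(f"wrong type for {key}")
    return args, errors
```

Call it with the `parameters` object from the tool definition, and only execute the tool when `errors` is empty; otherwise return the error list to the model as the tool result.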

4. Benchmarks that should change your architecture

Benchmark | GPT-5 score | What it says about agent design
SWE-bench Verified | 74.9% | You can stop stitching GPT-4 plus Claude. One model is enough for code-edit agents.
AIME 2025 (no tools) | 94.6% | Planning/reasoning is no longer the bottleneck; integration is.
GPQA (Pro) | 88.4% | Domain-expert agents (legal, med, finance) become viable.
Aider Polyglot | 88% | Multi-language refactor agents cross the "good enough to ship" line.
MMMU | 84.2% | Multimodal agents (screenshot to action) are production-viable.
HealthBench Hard | 46.2% | High-stakes verticals still need human-in-the-loop.

5. Key takeaways

  • The router is the product. You're not calling a model, you're calling a system. Design your agent to work with that fact.
  • Native tool use equals fewer glue layers. Remove the custom function-calling abstractions you wrote for GPT-4.
  • "Think hard" is now real. Use intent phrases in system prompts to bias the router toward depth when it matters.
  • SWE-bench 74.9% changes the build-vs-buy math for code agents. Off-the-shelf GPT-5 plus good tools beats most custom stacks.
  • Server-side validation is not optional. Tool-argument hallucination is the remaining failure mode. JSON Schema every tool.

Frequently asked

What models does the GPT-5 router choose between when handling a request?

The GPT-5 router dispatches requests to one of four models: gpt-5-main for fast, high-throughput queries, gpt-5-main-mini as a cheaper fallback, gpt-5-thinking for extended reasoning on hard problems, and gpt-5-thinking-mini as a cheaper thinking fallback. The router selects the appropriate model based on conversation type, complexity, tool needs, and explicit intent.

How does GPT-5 score on SWE-bench Verified and what does that mean for code agents?

GPT-5 achieves 74.9% on SWE-bench Verified. This score suggests that a single off-the-shelf GPT-5 system with good tooling is sufficient for production code-edit agents, eliminating the need to combine multiple models like GPT-4 and Claude.

How can I force GPT-5 to use the thinking tier instead of the fast tier?

You can bias the router toward the thinking tier by using explicit intent phrases like "Think hard about this" or "Think hard before acting" in your system prompt or user message. Alternatively, you can call gpt-5-thinking directly to deterministically use the reasoning tier for correctness-critical tasks.

What signals does the GPT-5 router use to decide which model tier to route a request to?

The router uses four signals: conversation type (casual chat versus technical problem), complexity (short factual query versus multi-constraint puzzle), tool needs (whether the request implies retrieval, code, or web access), and explicit intent (phrases that steer the request toward the thinking path). The router retrains on real usage signals including user model switches, thumbs-up rates, and measured correctness.

Does GPT-5 still have tool-call hallucination issues and how should I handle them?

Yes, GPT-5 can still hallucinate tool arguments even when it selects the correct tool. The system card explicitly warns about this failure mode. You should validate all tool calls with JSON Schema on the server side, always, regardless of confidence in the model output.
