White Paper

GPT-5 System Card: The Router Era of AI Agents

Jake McCluskey

Source: OpenAI GPT-5 System Card (PDF) (OpenAI, August 2025)
Series: The 10 Agent Whitepapers Every Builder Should Read

TL;DR

GPT-5 isn't a model, it's a routed system. A real-time router sits in front of two model families (gpt-5-main for fast answers, gpt-5-thinking for hard problems) and decides which handles each turn. It's also the first OpenAI model where native tool use is not a feature but a default. For agent builders, that means: the router you were going to build yourself (fast vs. deep, tool-using vs. chatty) is now baked into the frontier API.

Benchmarks that matter: 74.9% SWE-bench Verified, 94.6% AIME 2025 (no tools), 88.4% GPQA (Pro). State of the art on the three evals most predictive of agent quality.

1. What it is

The GPT-5 system card describes a unified endpoint that transparently dispatches a single user request to one of four models:

| Model | Role |
| --- | --- |
| gpt-5-main | Fast, high-throughput path. Answers most queries directly. |
| gpt-5-main-mini | Cheaper main model, used when usage limits are hit. |
| gpt-5-thinking | Extended reasoning. Hard math, multi-step code, ambiguous tasks. |
| gpt-5-thinking-mini | Cheaper thinking model for limit overflow. |

The router

The router is a learned online system fed by four signals:

  1. Conversation type: casual chat vs. technical problem
  2. Complexity: short factual query vs. multi-constraint puzzle
  3. Tool needs: does the request imply retrieval, code, or web access?
  4. Explicit intent: phrases like "think hard about this" force the thinking path

It retrains on real usage signals: which model users switched to after an initial answer, thumbs-up rates, measured correctness. The implication is that the router gets better over time without you doing anything.

Native tool use

Where GPT-4 supported tool calling as an optional feature, GPT-5 treats tools as a first-class modality. The system card reports a large gap between "GPT-5 no tools" and "GPT-5 with tools" on AIME (math), specifically so the field stops comparing apples to oranges.

2. Why it matters

Every serious agent project before GPT-5 had to build its own router. The pattern was always the same:

if query_looks_simple(query):        # builder-written heuristics
    model = cheap_model
elif mentions_code_or_math(query):
    model = reasoning_model
else:
    model = default_model

That routing logic is fragile, manual, and never kept up with the frontier. GPT-5 moves it inside the model boundary. Consequences for agent builders:

  1. One endpoint, no router code. You call gpt-5, the system figures it out. Your agent stops needing a dispatcher.
  2. Bill shrinks without code changes. Fast queries go to gpt-5-main automatically; you no longer overspend on Pro for chit-chat.
  3. Agent eval surface changes. You now eval the system end-to-end, not "model X with prompt Y." The router is part of your product whether you like it or not.
  4. "Think hard" becomes a real API switch. You can bias toward the reasoning path with intent phrasing in your system prompt ("Before acting, think hard."), no new parameter required.

Why this is bigger than another model release: it's the first time a major lab has shipped a multi-model product under a single ID. The router is the same architectural idea you'll find in AgentKit, Magentic-UI, and Gemini Deep Think. Fast path for throughput, slow path for depth. GPT-5's system card is the clearest public description of how that idea ships in production.

3. How to do it

3.1. Call the unified endpoint

from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-5",  # routed automatically
    messages=[
        {"role": "system", "content": "You are a senior ops engineer. Think hard before acting."},
        {"role": "user",   "content": "Debug why the K8s deploy on prod-us-west is stuck in CrashLoopBackOff."},
    ],
    tools=[...],  # native tool use
)

The router sees the complexity, picks gpt-5-thinking, and you pay the thinking-tier rate. For a casual "summarize this email," the same endpoint picks gpt-5-main at a fraction of the cost.

3.2. Force a tier when you need determinism

The card documents that explicit intent phrases bias the router. Use them when cost or latency matters:

# Force the fast path (cheap + low-latency)
messages=[{"role":"user","content":"Quick: what's the capital of Peru?"}]

# Force the thinking path (slow + correct)
messages=[{"role":"user","content":"Think hard about this. Prove ..."}]

# Or just call the sub-model directly (OpenAI exposes them)
model="gpt-5-thinking"  # deterministic reasoning tier

For agent orchestrators, put "Think hard before calling any tool." in your system prompt. That biases every reasoning step toward the deep path.
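Since the bias lives in the prompt text rather than an API parameter, it is easy to apply uniformly. A minimal sketch, assuming only the intent phrase from the card; the helper name is hypothetical:

```python
# Hypothetical helper: append the card's intent phrase to a system
# prompt so every turn is biased toward the reasoning tier.
THINK_HARD = "Think hard before calling any tool."

def with_deep_bias(system_prompt: str) -> str:
    """Return a system prompt biased toward gpt-5-thinking (idempotent)."""
    if THINK_HARD in system_prompt:
        return system_prompt
    return f"{system_prompt}\n\n{THINK_HARD}"
```

Applying it at orchestrator startup keeps the bias in one place instead of scattered across prompts.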

3.3. Use native tool use, not function calling v1 style

The card emphasizes parallel, chained, long-horizon tool use. Pattern:

tools = [
    {"type": "function", "function": {
        "name": "run_sql",
        "description": "Runs a read-only SQL query against the warehouse.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    }},
    {"type": "function", "function": {
        "name": "send_slack",
        "description": "Posts a message to a Slack channel.",
        "parameters": {
            "type": "object",
            "properties": {
                "channel": {"type": "string"},
                "text":    {"type": "string"},
            },
            "required": ["channel", "text"],
        },
    }},
]

resp = client.chat.completions.create(
    model="gpt-5",
    messages=[...],
    tools=tools,
    parallel_tool_calls=True,   # GPT-5 is good at this
)

Rule: let GPT-5 decide parallelism. Don't serialize in your code what the model can run concurrently.
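On the client side, honoring that parallelism means executing the batch of tool calls concurrently rather than one by one. A sketch under assumed names: the two stub tools mirror the schemas above, and tool calls are shown as plain dicts rather than SDK objects:

```python
import json
from concurrent.futures import ThreadPoolExecutor

# Stub implementations standing in for real tools (hypothetical bodies).
def run_sql(query: str) -> str:
    return json.dumps({"rows": 0, "query": query})

def send_slack(channel: str, text: str) -> str:
    return json.dumps({"ok": True, "channel": channel})

TOOLS = {"run_sql": run_sql, "send_slack": send_slack}

def execute_tool_calls(tool_calls):
    """Run the model's parallel tool calls concurrently; return
    tool-role messages in the order the model emitted them."""
    def run_one(call):
        args = json.loads(call["function"]["arguments"])
        result = TOOLS[call["function"]["name"]](**args)
        return {"role": "tool", "tool_call_id": call["id"], "content": result}

    with ThreadPoolExecutor() as pool:
        return list(pool.map(run_one, tool_calls))
```

`pool.map` preserves input order, so the tool messages can be appended back to the conversation without re-sorting.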

3.4. The agent loop with GPT-5

messages = [{"role": "system", "content": SYSTEM}, {"role": "user", "content": task}]
while True:
    resp = client.chat.completions.create(model="gpt-5", messages=messages, tools=tools)
    msg  = resp.choices[0].message
    messages.append(msg)
    if msg.tool_calls:
        for call in msg.tool_calls:
            result = dispatch(call.function.name, call.function.arguments)  # arguments is a JSON string
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": result,
            })
        continue  # give the model the results and let it decide next
    break  # plain assistant reply — done

This is the same shape as Anthropic's gather-act-verify loop. GPT-5 just happens to be doing the tier selection inside the model boundary.

3.5. Cost and latency budgeting

The router makes budgeting trickier because each call's tier isn't deterministic. Two practical tactics:

  1. Per-request tier pinning: for anything latency-critical, call gpt-5-main directly. For anything correctness-critical, call gpt-5-thinking directly. Leave the router for the gray middle.
  2. Caps, not guesses: set max_completion_tokens low and measure which prompts actually need more. The thinking tier will use more tokens; size caps on a per-prompt basis.
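Tactic 1 reduces to a small dispatch rule. A sketch, assuming the sub-model IDs (gpt-5-main, gpt-5-thinking) are directly callable as the card describes; the function name and flags are hypothetical:

```python
def pick_model(latency_critical: bool, correctness_critical: bool) -> str:
    """Pin the extremes to a tier; leave the gray middle to the router."""
    if latency_critical and not correctness_critical:
        return "gpt-5-main"       # fast path, cost- and latency-bounded
    if correctness_critical and not latency_critical:
        return "gpt-5-thinking"   # reasoning path, pinned for determinism
    return "gpt-5"                # ambiguous: let the router decide
```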

3.6. Safety considerations from the card

The GPT-5 card is stricter than its predecessors on three fronts you'll see in your own agent:

  • Jailbreak resistance: higher refusal accuracy on prompt-injection attempts. If you were using GPT-4 with a quarantine pattern, GPT-5 may be safe enough without it. Don't remove the quarantine without testing.
  • Hallucination: lower but not zero. The card reports 46.2% on HealthBench Hard, a specialized medical benchmark; at that accuracy, production agents in high-stakes domains should still require human sign-off.
  • Tool-call hallucination: the model sometimes invents tool arguments even when it picks the right tool. Validate with JSON Schema server-side, always.
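That last guard can stay lightweight. A minimal sketch of server-side argument validation against the tool's parameter schema, checking only required keys and basic types; a real service should use a full validator such as the jsonschema package instead:

```python
import json

# Map of basic JSON-Schema type names to Python types (sketch only).
_TYPES = {"string": str, "object": dict, "number": (int, float), "boolean": bool}

def validate_args(schema: dict, raw_arguments: str) -> dict:
    """Parse model-supplied tool arguments and reject hallucinated shapes."""
    args = json.loads(raw_arguments)
    for key in schema.get("required", []):
        if key not in args:
            raise ValueError(f"missing required argument: {key}")
    for key, spec in schema.get("properties", {}).items():
        if key in args and not isinstance(args[key], _TYPES[spec["type"]]):
            raise ValueError(f"wrong type for argument: {key}")
    return args
```

Run this before `dispatch` on every tool call; a ValueError becomes an error tool message back to the model instead of a crashed tool.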

4. Benchmarks that should change your architecture

| Benchmark | GPT-5 Score | What it says about agent design |
| --- | --- | --- |
| SWE-bench Verified | 74.9% | You can stop stitching GPT-4 plus Claude. One model is enough for code-edit agents. |
| AIME 2025 (no tools) | 94.6% | Planning/reasoning is no longer the bottleneck; integration is. |
| GPQA (Pro) | 88.4% | Domain-expert agents (legal, med, finance) become viable. |
| Aider Polyglot | 88% | Multi-language refactor agents cross the "good enough to ship" line. |
| MMMU | 84.2% | Multimodal agents (screenshot to action) are production-viable. |
| HealthBench Hard | 46.2% | High-stakes verticals still need human-in-the-loop. |

5. Key takeaways

  • The router is the product. You're not calling a model, you're calling a system. Design your agent to work with that fact.
  • Native tool use equals fewer glue layers. Remove the custom function-calling abstractions you wrote for GPT-4.
  • "Think hard" is now real. Use intent phrases in system prompts to bias the router toward depth when it matters.
  • SWE-bench 74.9% changes the build-vs-buy math for code agents. Off-the-shelf GPT-5 plus good tools beats most custom stacks.
  • Server-side validation is not optional. Tool-argument hallucination is the remaining failure mode. JSON Schema every tool.