GPT-5 System Card: The Router Era of AI Agents

Source: OpenAI GPT-5 System Card (PDF) (OpenAI, August 2025)
Series: The 10 Agent Whitepapers Every Builder Should Read
TL;DR
GPT-5 isn't a model; it's a routed system. A real-time router sits in front of two model families (gpt-5-main for fast answers, gpt-5-thinking for hard problems) and decides which handles each turn. It's also the first OpenAI model where native tool use is not a feature but a default. For agent builders, that means: the router you were going to build yourself (fast vs. deep, tool-using vs. chatty) is now baked into the frontier API.
Benchmarks that matter: 74.9% SWE-bench Verified, 94.6% AIME 2025 (no tools), 88.4% GPQA (Pro). State of the art on the three evals most predictive of agent quality.
1. What it is
The GPT-5 system card describes a unified endpoint that transparently dispatches a single user request to one of four models:
| Model | Role |
|---|---|
| gpt-5-main | Fast, high-throughput path. Answers most queries directly. |
| gpt-5-main-mini | Cheaper main-tier model, used when usage limits are hit. |
| gpt-5-thinking | Extended reasoning. Hard math, multi-step code, ambiguous tasks. |
| gpt-5-thinking-mini | Cheaper thinking-tier model for limit overflow. |
The router
The router is a learned online system fed by four signals:
- Conversation type: casual chat vs. technical problem
- Complexity: short factual query vs. multi-constraint puzzle
- Tool needs: does the request imply retrieval, code, or web access?
- Explicit intent: phrases like "think hard about this" force the thinking path
It retrains on real usage signals: which model users switched to after an initial answer, thumbs-up rates, measured correctness. The implication is that the router gets better over time without you doing anything.
Native tool use
Where GPT-4 supported tool calling as an optional feature, GPT-5 treats tools as a first-class modality. The system card reports a large gap between "GPT-5 no tools" and "GPT-5 with tools" on AIME (math), specifically so the field stops comparing apples to oranges.
2. Why it matters
Every serious agent project before GPT-5 had to build its own router. The pattern was always the same:
```python
if query_looks_simple:
    use(cheap_model)
elif query_mentions_code_or_math:
    use(reasoning_model)
else:
    use(default_model)
```
That routing logic is fragile, manual, and never kept up with the frontier. GPT-5 moves it inside the model boundary. Consequences for agent builders:
- One endpoint, no router code. You call `gpt-5`; the system figures it out. Your agent stops needing a dispatcher.
- Bill shrinks without code changes. Fast queries go to `gpt-5-main` automatically; you no longer overspend on Pro for chit-chat.
- Agent eval surface changes. You now eval the system end-to-end, not "model X with prompt Y." The router is part of your product whether you like it or not.
- "Think hard" becomes a real API switch. You can bias toward the reasoning path with intent phrasing in your system prompt ("Before acting, think hard."), no new parameter required.
Why this is bigger than another model release: it's the first time a major lab has shipped a multi-model product under a single ID. The router is the same architectural idea you'll find in AgentKit, Magentic-UI, and Gemini Deep Think. Fast path for throughput, slow path for depth. GPT-5's system card is the clearest public description of how that idea ships in production.
3. How to do it
3.1. Call the unified endpoint
```python
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-5",  # routed automatically
    messages=[
        {"role": "system", "content": "You are a senior ops engineer. Think hard before acting."},
        {"role": "user", "content": "Debug why the K8s deploy on prod-us-west is stuck in CrashLoopBackOff."},
    ],
    tools=[...],  # native tool use
)
```
The router sees the complexity, picks gpt-5-thinking, and you pay the thinking-tier rate. For a casual "summarize this email," the same endpoint picks gpt-5-main at a fraction of the cost.
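If you want visibility into what the router chose, here is a minimal sketch, assuming the served sub-model's id is surfaced in `resp.model` (the card does not guarantee this; `tier_of` is a hypothetical helper):

```python
def tier_of(model_id: str) -> str:
    """Classify a served model id into the fast or thinking tier."""
    return "thinking" if "thinking" in model_id else "main"

# Hypothetical usage after a routed call:
# resp = client.chat.completions.create(model="gpt-5", messages=...)
# log.info("served by %s tier", tier_of(resp.model))

print(tier_of("gpt-5-thinking"))   # thinking tier
print(tier_of("gpt-5-main-mini"))  # main tier
```

Logging the tier per request is the cheapest way to check whether your prompts are landing on the path you expect.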
3.2. Force a tier when you need determinism
The card documents that explicit intent phrases bias the router. Use them when cost or latency matter:
```python
# Force the fast path (cheap + low-latency)
messages = [{"role": "user", "content": "Quick: what's the capital of Peru?"}]

# Force the thinking path (slow + correct)
messages = [{"role": "user", "content": "Think hard about this. Prove ..."}]

# Or just call the sub-model directly (OpenAI exposes them)
model = "gpt-5-thinking"  # deterministic reasoning tier
```
For agent orchestrators, put "Think hard before calling any tool." in your system prompt. That biases every reasoning step toward the deep path.
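That advice can be wrapped in a small helper. A sketch under one assumption: `THINK_HARD` and `with_depth_bias` are our own names, not anything from the OpenAI API:

```python
# Hypothetical convention for biasing every orchestrator step toward depth.
THINK_HARD = "Think hard before calling any tool."

def with_depth_bias(system_prompt: str, messages: list[dict]) -> list[dict]:
    """Prepend a system message that nudges the router toward the deep path."""
    return [{"role": "system", "content": f"{system_prompt} {THINK_HARD}"}] + messages

msgs = with_depth_bias(
    "You are a senior ops engineer.",
    [{"role": "user", "content": "Roll back the failed deploy."}],
)
```

Centralizing the phrase in one constant means you can A/B it (or remove it) without touching every call site.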
3.3. Use native tool use, not function calling v1 style
The card emphasizes parallel, chained, long-horizon tool use. Pattern:
```python
tools = [
    {"type": "function", "function": {
        "name": "run_sql",
        "description": "Runs a read-only SQL query against the warehouse.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    }},
    {"type": "function", "function": {
        "name": "send_slack",
        "description": "Posts a message to a Slack channel.",
        "parameters": {
            "type": "object",
            "properties": {
                "channel": {"type": "string"},
                "text": {"type": "string"},
            },
            "required": ["channel", "text"],
        },
    }},
]

resp = client.chat.completions.create(
    model="gpt-5",
    messages=[...],
    tools=tools,
    parallel_tool_calls=True,  # GPT-5 is good at this
)
```
Rule: let GPT-5 decide parallelism. Don't serialize in your code what the model can run concurrently.
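Client-side, "let GPT-5 decide parallelism" means executing whatever batch of tool calls comes back in one turn concurrently, not one by one. A minimal sketch with stub tools; `TOOLS` and the dict-shaped calls are hypothetical stand-ins for your real implementations and the SDK's tool-call objects:

```python
import json
from concurrent.futures import ThreadPoolExecutor

# Hypothetical local tool implementations; swap in your real run_sql / send_slack.
TOOLS = {
    "run_sql": lambda args: f"rows for {args['query']}",
    "send_slack": lambda args: f"posted to {args['channel']}",
}

def run_tool_calls(tool_calls):
    """Execute all tool calls from one assistant turn concurrently."""
    def run_one(call):
        fn = TOOLS[call["name"]]
        return call["id"], fn(json.loads(call["arguments"]))

    with ThreadPoolExecutor() as pool:
        return dict(pool.map(run_one, tool_calls))

results = run_tool_calls([
    {"id": "a", "name": "run_sql", "arguments": '{"query": "SELECT 1"}'},
    {"id": "b", "name": "send_slack", "arguments": '{"channel": "#ops", "text": "done"}'},
])
```

A thread pool is enough for I/O-bound tools; reach for asyncio only if the rest of your agent loop is already async.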
3.4. The agent loop with GPT-5
```python
import json

messages = [{"role": "system", "content": SYSTEM}, {"role": "user", "content": task}]
while True:
    resp = client.chat.completions.create(model="gpt-5", messages=messages, tools=tools)
    msg = resp.choices[0].message
    messages.append(msg)
    if msg.tool_calls:
        for call in msg.tool_calls:
            # arguments arrive as a JSON string; parse before dispatching
            result = dispatch(call.function.name, json.loads(call.function.arguments))
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": result,
            })
        continue  # give the model the results and let it decide next
    break  # plain assistant reply — done
```
This is the same shape as Anthropic's gather-act-verify loop. GPT-5 just happens to be doing the tier selection inside the model boundary.
3.5. Cost and latency budgeting
The router makes budgeting trickier because each call's tier isn't deterministic. Two practical tactics:
- Per-request tier pinning: for anything latency-critical, call `gpt-5-main` directly. For anything correctness-critical, call `gpt-5-thinking` directly. Leave the router for the gray middle.
- Caps, not guesses: set `max_completion_tokens` low and measure which prompts actually need more. The thinking tier will use more tokens; size caps on a per-prompt basis.
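The second tactic can be made concrete: log `resp.usage.completion_tokens` per prompt template, then derive the cap from observed peaks. A sketch; `record`, `cap_for`, and the 1.25x headroom factor are our own conventions, not anything the card prescribes:

```python
from collections import defaultdict

# Observed completion-token counts, keyed by prompt template.
observed = defaultdict(list)

def record(prompt_key: str, completion_tokens: int) -> None:
    """Call after each response with resp.usage.completion_tokens."""
    observed[prompt_key].append(completion_tokens)

def cap_for(prompt_key: str, headroom: float = 1.25, floor: int = 256) -> int:
    """Size max_completion_tokens as observed peak * headroom, never below a floor."""
    samples = observed.get(prompt_key)
    if not samples:
        return floor  # no data yet: start conservative
    return max(floor, int(max(samples) * headroom))

for n in (120, 340, 280):
    record("summarize_email", n)

print(cap_for("summarize_email"))  # 340 * 1.25 = 425
```

A percentile (p95/p99) is a better statistic than the max once you have enough samples; the max is just the simplest safe start.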
3.6. Safety considerations from the card
The GPT-5 card is stricter than its predecessors on three fronts you'll see in your own agent:
- Jailbreak resistance: higher refusal accuracy on prompt-injection attempts. If you were using GPT-4 with a quarantine pattern, GPT-5 may be safe enough without it. Don't remove the quarantine without testing.
- Hallucination: lower but not zero. The card reports 46.2% on HealthBench Hard, a score low enough that production agents in high-stakes domains should still require human sign-off.
- Tool-call hallucination: the model sometimes invents tool arguments even when it picks the right tool. Validate with JSON Schema server-side, always.
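The last point deserves code. Below is a hand-rolled sketch of server-side argument validation against the declared schema; in production you would reach for the `jsonschema` package, this only shows the shape (`validate_args` is a hypothetical helper covering required keys and primitive types):

```python
import json

# Same schema you declared to the model for the run_sql tool.
RUN_SQL_SCHEMA = {
    "type": "object",
    "properties": {"query": {"type": "string"}},
    "required": ["query"],
}

_PY_TYPES = {"string": str, "object": dict, "number": (int, float), "boolean": bool}

def validate_args(raw_arguments: str, schema: dict) -> dict:
    """Parse tool arguments and check them against a subset of JSON Schema.

    Hand-rolled sketch; use the jsonschema package for full coverage.
    """
    args = json.loads(raw_arguments)
    for key in schema.get("required", []):
        if key not in args:
            raise ValueError(f"missing required argument: {key}")
    for key, spec in schema.get("properties", {}).items():
        if key in args and not isinstance(args[key], _PY_TYPES[spec["type"]]):
            raise ValueError(f"argument {key} is not a {spec['type']}")
    return args

ok = validate_args('{"query": "SELECT 1"}', RUN_SQL_SCHEMA)
```

Reject-and-retry is usually the right response to a validation failure: append the error as the tool result and let the model correct its own call.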
4. Benchmarks that should change your architecture
| Benchmark | GPT-5 Score | What it says about agent design |
|---|---|---|
| SWE-bench Verified | 74.9% | You can stop stitching GPT-4 plus Claude. One model is enough for code-edit agents. |
| AIME 2025 (no tools) | 94.6% | Planning/reasoning is no longer the bottleneck; integration is. |
| GPQA (Pro) | 88.4% | Domain-expert agents (legal, med, finance) become viable. |
| Aider Polyglot | 88% | Multi-language refactor agents cross the "good enough to ship" line. |
| MMMU | 84.2% | Multimodal agents (screenshot to action) are production-viable |
| HealthBench Hard | 46.2% | High-stakes verticals still need human-in-the-loop |
5. Key takeaways
- The router is the product. You're not calling a model, you're calling a system. Design your agent to work with that fact.
- Native tool use equals fewer glue layers. Remove the custom function-calling abstractions you wrote for GPT-4.
- "Think hard" is now real. Use intent phrases in system prompts to bias the router toward depth when it matters.
- SWE-bench 74.9% changes the build-vs-buy math for code agents. Off-the-shelf GPT-5 plus good tools beats most custom stacks.
- Server-side validation is not optional. Tool-argument hallucination is the remaining failure mode. JSON Schema every tool.