How Do I Decide When to Turn Claude's Extended Thinking On?

Extended thinking is Claude's "show its work" mode — the model spends tokens reasoning through a problem before answering. It's powerful, and it's the #1 way teams accidentally triple their Anthropic bill. Here's when to turn it on, when to leave it off, and the one settings change that cuts most wasted spend.
Why this matters
Extended thinking produces better answers on problems that actually need reasoning: multi-step math, complex planning, code where correctness depends on a chain of decisions, analyses that compare many options. On those, it's transformative.
It also produces better answers on trivial questions, at several times the token cost, for gains you can't measure. Asking "capitalize the word dog" with extended thinking on is technically fine and financially wasteful.
The decision isn't "use it" or "don't." It's "use it for the 20% of queries where it earns its keep." That requires understanding when reasoning tokens actually help and building your routing accordingly.
Before you start
You need:
- API access to Claude (extended thinking is API-exposed; Claude.ai has its own surface for this). Console-level access is enough.
- A handful of real prompts from your workload to test with. Synthetic tests don't reveal the real ROI.
- 15 minutes. This is a shorter guide because the answer is mostly "measure and decide."
Step 1: Understand the three thinking modes
The Claude API exposes thinking as a request parameter. Roughly:
- `thinking` off (parameter omitted) — normal request, no reasoning tokens. Fastest, cheapest.
- `thinking: { type: "enabled", budget_tokens: N }` — Claude may spend up to N tokens reasoning before answering. You pay for whatever thinking tokens it actually uses.
- `thinking: { type: "enabled", budget_tokens: 16000 }` (or higher) — deep thinking for genuinely hard problems. The model may take 30+ seconds.
Check Anthropic's current docs for the exact parameter names — they've stabilized but the API reference is the truth. The principles below don't change.
Step 2: Classify your prompts by reasoning need
Open a notebook (or notes file). List your last 20 production Claude prompts. For each, ask: if I gave this to a careful human, would they need to think before answering?
Rough buckets:
- No thinking needed: "Rewrite this paragraph in a friendlier tone." "Summarize this article in 3 bullets." "Extract the date from this email." — Fast pattern-matching tasks. Thinking adds cost with no accuracy gain.
- Moderate thinking: "Review this PR for logical bugs." "Plan the steps to migrate this database." — Helps sometimes, neutral other times. Worth testing.
- Deep thinking: "Given these five product strategies, which has the best risk-adjusted return and why?" "Debug this 200-line function that's producing a wrong result in edge case X." — Extended thinking meaningfully improves output.
Most production workloads are 70% bucket 1, 20% bucket 2, 10% bucket 3.
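To make that split concrete for your own workload, tag each prompt in your sample with a bucket and compute the share that would get thinking turned on (buckets 2 and 3). The labels below are hypothetical stand-ins for a hand-labeled 20-prompt sample:

```python
# Sketch: tally a hand-labeled prompt sample into the three buckets
# to estimate what fraction of traffic would get thinking enabled.
from collections import Counter

# Hypothetical labels for 20 recent production prompts.
labels = (
    ["no_thinking"] * 14   # bucket 1: rewrite, summarize, extract
    + ["moderate"] * 4     # bucket 2: PR review, migration planning
    + ["deep"] * 2         # bucket 3: strategy comparison, hard debugging
)

counts = Counter(labels)
thinking_share = (counts["moderate"] + counts["deep"]) / len(labels)
print(f"{thinking_share:.0%} of prompts are candidates for thinking")  # 30%
```

If that number comes out near 100%, your task labels are too coarse, not your workload too hard.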
Step 3: Run a side-by-side on a representative prompt from each bucket
Take one real prompt from each bucket. Run each twice: once with thinking off, once with a medium budget (e.g. 5000 tokens).
Compare:
- Output quality: is the thinking-on answer actually better on a metric you care about? Not "does it look more thorough" — better.
- Latency: thinking adds 10-30+ seconds typically. For interactive UX that's painful; for batch workloads it's fine.
- Cost: thinking tokens aren't free, and they're billed at output-token rates. Multiply the thinking tokens reported in the response's `usage` block by your output-token price to see the uplift.
You'll often find bucket-1 prompts show no quality improvement with thinking on, bucket-3 prompts show substantial improvement, and bucket-2 is mixed.
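Putting a number on the cost side takes a few lines of arithmetic. The prices here are illustrative placeholders, not current list prices, and the token counts are made-up examples; swap in your actual rates and the usage numbers from your side-by-side runs:

```python
# Sketch: estimate the cost uplift of enabling thinking on one prompt.
# Prices are illustrative placeholders -- use your actual per-token rates.
INPUT_PRICE = 3.00 / 1_000_000    # $ per input token (placeholder)
OUTPUT_PRICE = 15.00 / 1_000_000  # $ per output token (placeholder)

def request_cost(input_tokens: int, output_tokens: int) -> float:
    # Thinking tokens are billed at the output rate, so they show up
    # on the output side of the usage numbers.
    return input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE

# Same prompt, run twice as in Step 3 (hypothetical token counts):
cost_off = request_cost(input_tokens=800, output_tokens=400)
cost_on = request_cost(input_tokens=800, output_tokens=400 + 3500)  # +3500 thinking
uplift = cost_on / cost_off
```

On numbers like these the thinking-on request costs several times the thinking-off one, which is exactly why it has to buy a measurable quality gain.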
Step 4: Route prompts by classification
The right answer is almost never "thinking on everywhere." Build your app to pick per prompt:
```python
def needs_deep_thinking(prompt: str, task_type: str) -> bool:
    # Cheap rules first
    if task_type in ("formatting", "extraction", "summarization"):
        return False
    if task_type in ("planning", "debugging", "strategic_analysis"):
        return True
    # For ambiguous cases, a short classifier can decide
    return task_type == "analysis" and len(prompt) > 1000

def call_claude(prompt: str, task_type: str):
    request = {"model": "claude-sonnet-4-5", "max_tokens": 4096}
    if needs_deep_thinking(prompt, task_type):
        request["thinking"] = {"type": "enabled", "budget_tokens": 8000}
    # ... add messages and send
```

For consumer-facing chatbots, you can also let the user flag their own prompts ("think harder" button) — cheap way to route without building a classifier.
Step 5: Monitor and adjust
Log the thinking decision and the outcome. Useful metrics:
- % of requests with thinking on. Target: matches your bucket distribution, ~15-30% for most workloads.
- Avg thinking tokens used. If your budget is 8000 and usage is 2000, lower the budget — you're overpaying for ceiling.
- Quality deltas by bucket. If bucket-2 prompts aren't actually benefiting, move them to bucket-1 and save the spend.
Re-check monthly. Workloads shift. What was bucket 2 six months ago might be bucket 1 now as prompts and data get refined.
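Computing the first two metrics from your request logs is a few lines of Python. The record shape below is an assumption about what you chose to log per request, not an API response format:

```python
# Sketch: compute thinking rate and average thinking-token usage from logs.
# The record shape is an assumed logging format, not an API response.
logs = [
    {"thinking_enabled": False, "thinking_tokens": 0},
    {"thinking_enabled": True,  "thinking_tokens": 2100},
    {"thinking_enabled": True,  "thinking_tokens": 1900},
    {"thinking_enabled": False, "thinking_tokens": 0},
]

enabled = [r for r in logs if r["thinking_enabled"]]
thinking_rate = len(enabled) / len(logs)  # compare against your bucket split
avg_thinking = sum(r["thinking_tokens"] for r in enabled) / len(enabled)

# If avg_thinking sits far below your budget_tokens, lower the cap.
```

Run this over a week of logs, not four requests, before changing any budgets.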
Verify it worked
1. Thinking is off for simple extraction/formatting tasks. Check your logs — if format-rewrite prompts have `thinking: enabled`, your routing is wrong.
2. Thinking is on for genuinely complex ones. Same check in reverse — debug-the-function prompts should be routed to thinking on.
3. Cost line item makes sense. In the Anthropic usage dashboard, your thinking-token spend should be a real but minority fraction of total. If it's the dominant line, you have thinking on everywhere.
Where this breaks
- Using thinking for "safety" across the board. Some teams turn it on everywhere because "more thinking = safer answers." That's true for bucket-3 tasks; for bucket-1, it's not safer and it's 3-5x more expensive. Route.
- Not counting thinking tokens in your budget math. If you're forecasting API spend, remember that thinking tokens are billed at output-token rates — easy to miss. Always pull the full token counts from the `usage` block of each response, not just `input_tokens`.
- Users seeing latency spikes. A UI expecting 2-second responses now takes 20 seconds when thinking is on. Either set user expectations ("this one's complex, give me a minute") or cap the thinking budget.
- Over-reliance for problems Claude should just retrieve. Some "thinking" questions are better solved by a tool call to real data — no amount of reasoning replaces a fresh Ahrefs pull. Route to MCP tools first; use thinking for decisions that reason over tool output, not reasoning in place of tool output.
- Mixing thinking with the Batch API. Check the current docs — thinking mode may interact with batch in unexpected ways (supported or not, priced differently). Verify before assuming the two stack cleanly.
What to try next
- How Do I Cut My Anthropic Bill in Half Using the Batch API? — the cost lever that actually matters when thinking is routed right.
- How Do I Keep My Claude Prompt Cache Hit Rate High? — the other cost lever: cache the repeated system prompt so you're only paying full price for the dynamic tail.
- How Do I Build a Research Pipeline with the Agent SDK? — research agents are bucket-3 workloads; a great place to use thinking by default.
Let's talk about your AI + SEO stack
If you'd rather skip the how-to and have it shipped for you, that's what I do. Start a conversation and we'll figure out the fastest path to results.
Let's Talk