How Do I Decide When to Turn Claude's Extended Thinking On?

Extended thinking is Claude's "show its work" mode — the model spends tokens reasoning through a problem before answering. It's powerful, and it's the #1 way teams accidentally triple their Anthropic bill. Here's when to turn it on, when to leave it off, and the one settings change that cuts most wasted spend.
Why this matters
Extended thinking produces better answers on problems that actually need reasoning: multi-step math, complex planning, code where correctness depends on a chain of decisions, analyses that compare many options. On those, it's transformative.
It also produces better answers on trivial questions, at several times the token cost, for gains you can't measure. Asking "capitalize the word dog" with extended thinking on is technically fine and financially wasteful.
The decision isn't "use it" or "don't." It's "use it for the 20% of queries where it earns its keep." That requires understanding when reasoning tokens actually help and building your routing accordingly.
Before you start
You need:
- API access to Claude (extended thinking is API-exposed; Claude.ai has its own surface for this). Console-level access is enough.
- A handful of real prompts from your workload to test with. Synthetic tests don't reveal the real ROI.
- 15 minutes. This is a shorter guide because the answer is mostly "measure and decide."
Step 1: Understand the three thinking modes
The Claude API exposes thinking as a request parameter. Roughly:
- `thinking` off (parameter omitted) — normal request, no reasoning tokens. Fastest, cheapest.
- `thinking: { type: "enabled", budget_tokens: N }` — Claude may spend up to N tokens reasoning before answering. You pay for whatever thinking tokens it actually uses.
- `thinking: { type: "enabled", budget_tokens: 16000 }` (or higher) — deep thinking for genuinely hard problems. The model may take 30+ seconds.
Check Anthropic's current docs for the exact parameter names — they've stabilized but the API reference is the truth. The principles below don't change.
Step 2: Classify your prompts by reasoning need
Open a notebook (or notes file). List your last 20 production Claude prompts. For each, ask: if I gave this to a careful human, would they need to think before answering?
Rough buckets:
- No thinking needed: "Rewrite this paragraph in a friendlier tone." "Summarize this article in 3 bullets." "Extract the date from this email." — Fast pattern-matching tasks. Thinking adds cost with no accuracy gain.
- Moderate thinking: "Review this PR for logical bugs." "Plan the steps to migrate this database." — Helps sometimes, neutral other times. Worth testing.
- Deep thinking: "Given these five product strategies, which has the best risk-adjusted return and why?" "Debug this 200-line function that's producing a wrong result in edge case X." — Extended thinking meaningfully improves output.
Most production workloads are 70% bucket 1, 20% bucket 2, 10% bucket 3.
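To make that split concrete for your own workload, tag each prompt in your sample with a bucket and compute the share that would get thinking turned on (buckets 2 and 3). The labels below are hypothetical stand-ins for a hand-labeled 20-prompt sample:

```python
# Sketch: tally a hand-labeled prompt sample into the three buckets
# to estimate what fraction of traffic would get thinking enabled.
from collections import Counter

# Hypothetical labels for 20 recent production prompts.
labels = (
    ["no_thinking"] * 14   # bucket 1: rewrite, summarize, extract
    + ["moderate"] * 4     # bucket 2: PR review, migration planning
    + ["deep"] * 2         # bucket 3: strategy comparison, hard debugging
)

counts = Counter(labels)
thinking_share = (counts["moderate"] + counts["deep"]) / len(labels)
print(f"{thinking_share:.0%} of prompts are candidates for thinking")  # 30%
```

If that number comes out near 100%, your task labels are too coarse, not your workload too hard.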
Step 3: Run a side-by-side on a representative prompt from each bucket
Take one real prompt from each bucket. Run each twice: once with thinking off, once with a medium budget (e.g. 5000 tokens).
Compare:
- Output quality: is the thinking-on answer actually better on a metric you care about? Not "does it look more thorough" — better.
- Latency: thinking adds 10-30+ seconds typically. For interactive UX that's painful; for batch workloads it's fine.
- Cost: thinking tokens aren't free, and they're billed at output-token rates. Multiply the thinking tokens reported in the response's `usage` block by your output-token price to see the uplift.
You'll often find bucket-1 prompts show no quality improvement with thinking on, bucket-3 prompts show substantial improvement, and bucket-2 is mixed.
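Putting a number on the cost side takes a few lines of arithmetic. The prices here are illustrative placeholders, not current list prices, and the token counts are made-up examples; swap in your actual rates and the usage numbers from your side-by-side runs:

```python
# Sketch: estimate the cost uplift of enabling thinking on one prompt.
# Prices are illustrative placeholders -- use your actual per-token rates.
INPUT_PRICE = 3.00 / 1_000_000    # $ per input token (placeholder)
OUTPUT_PRICE = 15.00 / 1_000_000  # $ per output token (placeholder)

def request_cost(input_tokens: int, output_tokens: int) -> float:
    # Thinking tokens are billed at the output rate, so they show up
    # on the output side of the usage numbers.
    return input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE

# Same prompt, run twice as in Step 3 (hypothetical token counts):
cost_off = request_cost(input_tokens=800, output_tokens=400)
cost_on = request_cost(input_tokens=800, output_tokens=400 + 3500)  # +3500 thinking
uplift = cost_on / cost_off
```

On numbers like these the thinking-on request costs several times the thinking-off one, which is exactly why it has to buy a measurable quality gain.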
Step 4: Route prompts by classification
The right answer is almost never "thinking on everywhere." Build your app to pick per prompt:
```python
def needs_deep_thinking(prompt: str, task_type: str) -> bool:
    # Cheap rules first
    if task_type in ("formatting", "extraction", "summarization"):
        return False
    if task_type in ("planning", "debugging", "strategic_analysis"):
        return True
    # For ambiguous cases, a short classifier can decide
    return task_type == "analysis" and len(prompt) > 1000

def call_claude(prompt: str, task_type: str):
    request = {"model": "claude-sonnet-4-5", "max_tokens": 4096}
    if needs_deep_thinking(prompt, task_type):
        request["thinking"] = {"type": "enabled", "budget_tokens": 8000}
    # ... add messages and send
```

For consumer-facing chatbots, you can also let the user flag their own prompts ("think harder" button) — cheap way to route without building a classifier.
Step 5: Monitor and adjust
Log the thinking decision and the outcome. Useful metrics:
- % of requests with thinking on. Target: matches your bucket distribution, ~15-30% for most workloads.
- Avg thinking tokens used. If your budget is 8000 and usage is 2000, lower the budget — you're overpaying for ceiling.
- Quality deltas by bucket. If bucket-2 prompts aren't actually benefiting, move them to bucket-1 and save the spend.
Re-check monthly. Workloads shift. What was bucket 2 six months ago might be bucket 1 now as prompts and data get refined.
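Computing the first two metrics from your request logs is a few lines of Python. The record shape below is an assumption about what you chose to log per request, not an API response format:

```python
# Sketch: compute thinking rate and average thinking-token usage from logs.
# The record shape is an assumed logging format, not an API response.
logs = [
    {"thinking_enabled": False, "thinking_tokens": 0},
    {"thinking_enabled": True,  "thinking_tokens": 2100},
    {"thinking_enabled": True,  "thinking_tokens": 1900},
    {"thinking_enabled": False, "thinking_tokens": 0},
]

enabled = [r for r in logs if r["thinking_enabled"]]
thinking_rate = len(enabled) / len(logs)  # compare against your bucket split
avg_thinking = sum(r["thinking_tokens"] for r in enabled) / len(enabled)

# If avg_thinking sits far below your budget_tokens, lower the cap.
```

Run this over a week of logs, not four requests, before changing any budgets.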
Verify it worked
1. Thinking is off for simple extraction/formatting tasks. Check your logs — if format-rewrite prompts have `thinking: enabled`, your routing is wrong.
2. Thinking is on for genuinely complex ones. Same check in reverse — debug-the-function prompts should be routed to thinking on.
3. Cost line item makes sense. In the Anthropic usage dashboard, your thinking-token spend should be a real but minority fraction of total. If it's the dominant line, you have thinking on everywhere.
Where this breaks
- Using thinking for "safety" across the board. Some teams turn it on everywhere because "more thinking = safer answers." That's true for bucket-3 tasks; for bucket-1, it's not safer and it's 3-5x more expensive. Route.
- Not counting thinking tokens in your budget math. If you're forecasting API spend, remember that thinking tokens are billed at output-token rates — easy to miss. Always pull the full token counts from the `usage` block of each response, not just `input_tokens`.
- Users seeing latency spikes. A UI expecting 2-second responses now takes 20 seconds when thinking is on. Either set user expectations ("this one's complex, give me a minute") or cap the thinking budget.
- Over-reliance for problems Claude should just retrieve. Some "thinking" questions are better solved by a tool call to real data — no amount of reasoning replaces a fresh Ahrefs pull. Route to MCP tools first; use thinking for decisions that reason over tool output, not reasoning in place of tool output.
- Mixing thinking with the Batch API. Check the current docs — thinking mode may interact with batch in unexpected ways (supported or not, priced differently). Verify before assuming the two stack cleanly.
What to try next
- How Do I Cut My Anthropic Bill in Half Using the Batch API? — the cost lever that actually matters when thinking is routed right.
- How Do I Keep My Claude Prompt Cache Hit Rate High? — the other cost lever: cache the repeated system prompt so you're only paying full price for the dynamic tail.
- How Do I Build a Research Pipeline with the Agent SDK? — research agents are bucket-3 workloads; a great place to use thinking by default.
Let's talk about your AI + SEO stack
If you'd rather skip the how-to and have it shipped for you, that's what I do. Start a conversation and we'll figure out the fastest path to results.
Let's Talk