
How Do I Keep My Claude Prompt Cache Hit Rate Above 80%?

Jake McCluskey · Intermediate · 30 min read

Prompt caching is the "discount button" most Claude API users leave on the table. Turn it on and watch your bill drop — if you structure your prompts right. Leave the structure wrong and caching barely activates. Here's how to get your cache hit rate above 80%, which is where the 90% token discount on cached reads actually shows up on your invoice.

Why this matters

Prompt caching lets Anthropic reuse the computation of your prompt's prefix across requests. If your system prompt is 5000 tokens and it's identical across a million requests, cache it and pay for those tokens at ~10% of the normal rate after the first miss. On high-volume production workloads, that's the difference between a $2000/month bill and a $400 one.

But caching is fussy. Cache breakpoints must align. The prefix has to be byte-identical. Tool definitions and system prompts have to stack in the right order. Get any of those wrong and every request is a cache miss, and because cache writes carry a premium over normal input tokens, you end up paying more than if you had never enabled caching.

Before you start

You need:

  • The Anthropic API integrated in a real workload. Caching matters for apps making hundreds of requests; it's negligible for one-off chats.
  • Observability on token usage. Either the console dashboard or logs where you can see usage.cache_read_input_tokens and usage.cache_creation_input_tokens.
  • 30 minutes. Most of this is tuning.

Step 1: Understand the prefix rule

Caching happens on request prefixes, not arbitrary substrings. The API caches everything from the start of your prompt up to a cache breakpoint.

So a request structure like:

text
[system prompt — 5000 tokens] →
[tool definitions — 2000 tokens] →
[conversation so far — 3000 tokens] →
[new user message — 200 tokens]

…with a cache breakpoint after tool definitions, caches 7000 tokens on the first miss. Every subsequent request starting with the same 7000 tokens is a cache hit.

If you change the system prompt by even one character, the cache is invalidated and rebuilt. If you change tool definitions, same thing.
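The arithmetic above can be turned into a quick break-even check. The prices below are illustrative stand-ins, not quotes: a base input price, a ~25% cache-write premium, and cached reads at ~10% of base. Check current pricing before trusting the numbers.

```python
# Back-of-envelope break-even for caching a 7,000-token prefix.
# Prices are illustrative (dollars per million input tokens); verify
# against current published pricing before relying on them.
BASE = 3.00          # normal input price, $/MTok (assumed)
WRITE = BASE * 1.25  # cache-write premium, ~25% over base (assumed)
READ = BASE * 0.10   # cached-read price, ~10% of base (assumed)

def cost_with_cache(prefix_tokens, requests):
    """One cache write, then cached reads for the remaining requests."""
    return (prefix_tokens * WRITE + prefix_tokens * READ * (requests - 1)) / 1_000_000

def cost_without_cache(prefix_tokens, requests):
    """Every request pays full price for the prefix."""
    return prefix_tokens * BASE * requests / 1_000_000

for n in (1, 2, 10, 100):
    print(n, cost_with_cache(7000, n), cost_without_cache(7000, n))
```

At one request, caching loses (you paid the write premium for nothing); by the second identical-prefix request it's already ahead, and the gap widens from there.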

Step 2: Add cache_control to your request

The API parameter name is cache_control: {"type": "ephemeral"}, placed on a block to mark the cut point:

python
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=4096,
    system=[
        {
            "type": "text",
            "text": LARGE_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},
        }
    ],
    tools=TOOLS,  # Typically you cache tools too; see step 3
    messages=messages,
)

The breakpoint means "cache everything up to and including this block." For production, you usually want breakpoints at the end of the two stable sections: after system prompt, and after tool definitions.

Step 3: Cache tools too

Tool definitions are often more stable than the system prompt. Cache them:

python
# Anthropic lets you mark cache breakpoints on the last tool definition
tools_with_cache = [*TOOLS[:-1], {**TOOLS[-1], "cache_control": {"type": "ephemeral"}}]

Check the current SDK for the idiomatic API — it evolves. The concept is: the tools block, treated as a unit, gets cached as part of the prefix.

Step 4: Verify cache hits

Every response has a usage object with three token fields:

  • input_tokens — uncached input
  • cache_creation_input_tokens — tokens written to cache this request
  • cache_read_input_tokens — tokens read from cache this request

On the first request with a given prefix, cache_creation is high and cache_read is zero. On subsequent identical-prefix requests, cache_creation is zero and cache_read matches the prefix size.

Log and aggregate these. Your cache hit rate is:

text
cache_hit_rate = cache_read_tokens / (cache_read_tokens + cache_creation_tokens + input_tokens)

Target: 80%+ on production workloads with stable prompts. If you're under 30%, your prefix is drifting and caching is barely helping.
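The formula above is easy to wire into whatever logging you already have. A minimal sketch, assuming you log each response's usage fields as dicts:

```python
# Aggregate cache hit rate from logged usage objects.
# Each entry mirrors the three fields on response.usage.
def cache_hit_rate(usages):
    read = sum(u["cache_read_input_tokens"] for u in usages)
    created = sum(u["cache_creation_input_tokens"] for u in usages)
    uncached = sum(u["input_tokens"] for u in usages)
    total = read + created + uncached
    return read / total if total else 0.0

# First request: cold cache. Next nine: hits on a 7,000-token prefix.
log = [{"input_tokens": 200, "cache_creation_input_tokens": 7000,
        "cache_read_input_tokens": 0}]
log += [{"input_tokens": 200, "cache_creation_input_tokens": 0,
         "cache_read_input_tokens": 7000}] * 9

print(f"hit rate: {cache_hit_rate(log):.0%}")
```

Note the rate is weighted by tokens, not requests: a big stable prefix pulls the number up even though every request carries a small uncached user message.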

Step 5: Diagnose low hit rates

If your rate is below target, the cause is almost always prefix drift. Common culprits:

  • Dynamic data in the system prompt. Timestamp, user ID, session ID inlined into the system prompt means every request has a unique prefix. Move dynamic data out of the cached block.
  • Different tool sets per request. If request A has 5 tools and request B has 6, they have different cached prefixes. Either cache the superset of tools, or group requests by tool set.
  • Different system prompts per tenant. In multi-tenant apps, each tenant's system prompt is its own cache. That's fine — you're still getting hits within a tenant — but total hit rate looks low. Measure per-tenant if this applies.
  • Whitespace changes. A trailing newline added or removed mid-refactor invalidates every cache. Normalize whitespace at the boundary of the cached block.
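The first and last culprits share one fix: keep the cached block byte-stable and push anything per-request into the uncached user turn. A sketch, with illustrative names (`SYSTEM_PROMPT`, `build_messages` are placeholders, not API surface):

```python
# Keep the cached system block byte-stable; inject per-request data
# into the (uncached) user message instead.
from datetime import datetime, timezone

SYSTEM_PROMPT = "You are a support assistant for Acme Corp."  # placeholder

STABLE_SYSTEM = [
    {
        "type": "text",
        "text": SYSTEM_PROMPT.strip(),  # normalize whitespace at the boundary
        "cache_control": {"type": "ephemeral"},
    }
]

def build_messages(user_text, user_id):
    # Timestamp and user ID ride along with the new message,
    # so the cached prefix never changes between requests.
    now = datetime.now(timezone.utc).isoformat()
    context = f"[request context: user={user_id}, time={now}]\n"
    return [{"role": "user", "content": context + user_text}]
```

`STABLE_SYSTEM` is built once at import time and reused verbatim, so no code path can accidentally vary the prefix.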

Step 6: Use the right TTL

Anthropic offers a standard ephemeral cache with ~5-minute TTL and, for higher-volume workloads, extended durations. If your request volume is bursty (say, 1 request per minute per user), a 5-minute cache is fine. If requests come once per hour, you'll pay the cache-creation cost every time and save nothing.

Check the current documentation for TTL options. If your workload pattern doesn't match the default, it may be worth using a longer TTL — or restructuring your app to batch requests so they fall within one TTL window.
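If a longer TTL is available for your account, it is typically requested on the same breakpoint block. The `ttl` field shape below is an assumption based on the extended-cache option; confirm the exact field name and allowed values in the current docs before shipping it:

```python
# Requesting a longer cache TTL on the breakpoint block.
# The "ttl" key and "1h" value are assumptions -- verify against
# the current API reference before using.
LARGE_SYSTEM_PROMPT = "..."  # your stable system prompt

system = [
    {
        "type": "text",
        "text": LARGE_SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral", "ttl": "1h"},  # hypothetical value
    }
]
```

Longer TTLs usually carry a higher write premium, so only reach for them when your request cadence genuinely outlasts the default window.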

Step 7: Stack caching with the Batch API

Caching and the Batch API stack. Batch cuts the per-token price in half; caching cuts cached-input tokens to ~10%. The combined effect on a workload with a large stable prefix can exceed 70% cost savings.

The migration path from a plain sync loop:

  1. Add caching → hit rate up to 80%+, bill drops 30-50%.
  2. Move to Batch API for async work → another 30-50% drop on what remains.

The Batch API guide covers the other half of the stack.
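The stacking claim is simple multiplication, and worth sanity-checking against your own numbers. A rough model (it ignores the one-time cache-write premium, output tokens, and TTL misses, so treat it as an upper bound on savings):

```python
# Rough combined-savings model: Batch halves per-token prices; caching
# reads cached tokens at ~10% of base. Illustrative multipliers only.
def relative_cost(hit_rate, batch=False, cached=False):
    """Input cost per token relative to plain sync, uncached (= 1.0).
    Ignores cache-write premium and output tokens."""
    per_token = 1.0
    if cached:
        # hit_rate fraction of tokens read at 10%, the rest at full price
        per_token = hit_rate * 0.10 + (1 - hit_rate) * 1.0
    if batch:
        per_token *= 0.5  # Batch API halves the price
    return per_token

print(relative_cost(0.8, cached=True))               # caching alone
print(relative_cost(0.8, cached=True, batch=True))   # stacked with batch
```

At an 80% hit rate, caching alone brings relative input cost to 0.28; stacking batch halves that to 0.14, consistent with the 70%+ combined savings figure above.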

Verify it worked

1. Cache hit rate is above 80% on repeat workloads. Run your app for 100 requests, sum the three usage fields, and compute cache_read / (cache_read + cache_creation + input_tokens). Target: 0.8+.

2. Bill drops in the dashboard. In console.anthropic.com, the usage dashboard now shows cache usage separately. If the cache line is material and the plain input line is shrinking, caching is working.

3. Latency drops too. Cached reads are faster than fresh reads. If you're logging response time, you should see a 10-30% latency improvement on cache hits.

Where this breaks

  • Cache creation penalty. Writing to cache is more expensive than a normal input read. If your prefix only repeats twice, you can pay more than you would have without caching. Caching is a volume play — don't bother for one-off requests.
  • Short TTLs on slow workloads. A 5-minute TTL with one request per hour means every request is a miss. Workload shape and TTL must match.
  • Conditional system prompt. If you conditionally add a section to the system prompt based on the request, the prefix varies. Either always include the section (waste some tokens) or stop caching that block.
  • Model upgrades invalidating caches. Switching from claude-sonnet-4 to claude-sonnet-4-5 is a fresh cache — they're separate models. Plan cutover carefully; expect a period where caching hit rate drops during migration.
  • Developers copy-pasting prompts with drifting whitespace. Over time, engineers mutate the system prompt and unknowingly break the cache. Lock the canonical prompt in one file, load it everywhere, and add a prefix-hash log line so you can spot drift immediately.
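The prefix-hash log line suggested in the last bullet can be a few lines of stdlib code. A minimal sketch (`prefix_hash` is an illustrative name, not an SDK helper):

```python
# Prefix-hash drift detector: log this on every request. If the value
# changes between deploys without an intentional prompt change, someone
# broke the cache.
import hashlib
import json
import logging

logging.basicConfig(level=logging.INFO)

def prefix_hash(system_blocks, tools):
    """Stable short hash of the would-be-cached prefix."""
    canonical = json.dumps(
        {"system": system_blocks, "tools": tools},
        sort_keys=True, separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

h = prefix_hash([{"type": "text", "text": "You are..."}], [])
logging.info("prompt prefix hash: %s", h)
```

Because the hash covers the exact serialized bytes, even a trailing-newline mutation shows up as a new value, which is exactly the class of bug you want surfaced.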

What to try next

Want this built for you instead?

Let's talk about your AI + SEO stack

If you'd rather skip the how-to and have it shipped for you, that's what I do. Start a conversation and we'll figure out the fastest path to results.


Frequently asked

Does caching always save money?

Only on workloads with repeated prefixes. Writing to the cache is more expensive than a normal read; if your prefix repeats only once or twice, caching costs more than not caching. It's a volume play. Match TTL to your request cadence.

What's the cheapest hit-rate win?

Move dynamic data (timestamps, user IDs, session IDs) out of the cached block. That change alone often takes a 20% hit rate to 80%+. The prefix must be byte-identical across requests to hit.

Can I cache the tool definitions too?

Yes, and you usually should — tool definitions are often larger and more stable than the system prompt. Add cache_control on the last tool definition (or use the idiomatic SDK pattern for your version) to cap the cached prefix there.

How do I know my cache is actually being hit?

Every response has usage.cache_read_input_tokens and usage.cache_creation_input_tokens. On hits, cache_read is high and cache_creation is zero. On misses, the reverse. Log both and compute the ratio.

Does caching work across different Claude models?

No. Cache is per-model. When you upgrade from claude-sonnet-4 to claude-sonnet-4-5, hit rate drops to zero while the new cache warms. Plan upgrades around low-traffic windows to minimize the cost spike.