Prompt Caching for Claude: The 90% Cost Cut Most People Miss
White Paper

Prompt Caching for Claude: The 90% Cost Cut Most People Miss

Jake McCluskeyUpdated
Back to white papers

Source topic: Prompt caching is one of the highest-impact wins for any production agent. Cached tokens cost a fraction of standard input tokens and load faster.

Claim: Cache-hit tokens cost about 10% of standard input tokens and load in roughly 1/5 the latency. For agents with large system prompts, tool definitions, or RAG context, that's a 5-10x cost reduction.

The mental model

Every LLM call reprocesses your entire context from scratch. If your system prompt is 4,000 tokens and you make 100 calls, you pay for 400,000 input tokens, even though the system prompt never changed.

Prompt caching = tell Anthropic "this prefix is stable, remember its computed state for 5 minutes." Subsequent calls within that window pay 10% for the cached portion.

What's cachable (prefixes):

  • system prompt
  • tools definitions
  • Early messages (e.g. few-shot examples, RAG context)

What's NOT cachable:

  • The last user message (that's what varies)
  • Anything after a change point

The cache breakpoint is set with cache_control: {"type": "ephemeral"}.

Minimum viable caching

import anthropic
claude = anthropic.Anthropic()

SYSTEM_PROMPT = """You are a senior financial analyst. Answer questions about 10-K filings.

Rules:
- Always cite the section and page number
- Quote exact numbers, don't round
- If information is not in the provided context, say so explicitly
- Format tables with markdown
[... 3000 more tokens of instructions and examples ...]
"""

def ask(question: str):
    return claude.messages.create(
        model="claude-opus-4-7",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"},   # <-- THIS is the whole trick
            }
        ],
        messages=[{"role": "user", "content": question}],
    )

# First call: writes to cache. Slightly MORE expensive (~25% premium on cached tokens).
ask("What was the 2024 revenue?")

# Subsequent calls within 5 min: cache hit. 10% cost on the system prompt.
ask("What were the operating margins?")
ask("What's the cash position?")

Cache the tools too (bigger win for agents)

Tool definitions are often 2K-8K tokens and rarely change. Cache them together with the system prompt:

TOOLS = [ ... big list of tool definitions ... ]

# Caching works by cache_control on the LAST item of a cachable block.
# If you put cache_control on system, tools are automatically cached too
# as part of the same prefix, as long as they appear in the same order.

resp = claude.messages.create(
    model=MODEL,
    system=[{"type": "text", "text": SYSTEM_PROMPT,
             "cache_control": {"type": "ephemeral"}}],
    tools=TOOLS,
    messages=messages,
)

# Read the usage to confirm:
print(resp.usage)
# cache_creation_input_tokens: 4500  (first call, building cache)
# cache_read_input_tokens: 0
# input_tokens: 87   (only the user message)

# Next call:
# cache_creation_input_tokens: 0
# cache_read_input_tokens: 4500  (← paid at 10%)
# input_tokens: 120

Cache RAG context: biggest win of all

RAG typically stuffs retrieved chunks into the user message. But if the retrieval is stable across a conversation (a user asking follow-ups about the same doc), cache it:

# User uploaded a 50-page report. They'll ask ~10 follow-up questions.
retrieved_chunks = retrieve(...)   # 20K tokens

messages = [
    {"role": "user", "content": [
        {
            "type": "text",
            "text": f"Context documents:\n\n{retrieved_chunks}",
            "cache_control": {"type": "ephemeral"},
        },
        {
            "type": "text",
            "text": f"Question: {user_question}",
            # no cache_control, this varies per question
        },
    ]},
]

First question: pays full price for 20K tokens.
Questions 2-10: pay 10% for the cached 20K plus full price for the ~50-token new question.

Savings on 10 questions: about 8x cheaper than re-sending context each time.

The 5-minute TTL (and how to extend it)

Cache entries expire 5 minutes after the last hit. For long sessions:

  1. Keep the conversation active. Every call within 5 min of the last hit refreshes the timer.
  2. Request 1-hour caching (rolling out 2025): {"type": "ephemeral", "ttl": "1h"}. 2x the write cost but 12x the TTL.
system=[{"type": "text", "text": SYSTEM_PROMPT,
         "cache_control": {"type": "ephemeral", "ttl": "1h"}}]

Use 1h for stable multi-hour sessions (support agents, long code-review runs). Default 5m for short interactive loops.

Measuring the savings

Always inspect usage. Otherwise you won't know if caching is actually helping.

def call_with_metrics(**kwargs):
    resp = claude.messages.create(**kwargs)
    u = resp.usage
    # Rough cost (check current pricing for exact numbers)
    cost = (
        u.input_tokens * 0.015 / 1000 +
        (u.cache_creation_input_tokens or 0) * 0.01875 / 1000 +
        (u.cache_read_input_tokens or 0) * 0.0015 / 1000 +
        u.output_tokens * 0.075 / 1000
    )
    print(f"in={u.input_tokens}, cache_write={u.cache_creation_input_tokens}, "
          f"cache_read={u.cache_read_input_tokens}, out={u.output_tokens}, "
          f"cost=${cost:.4f}")
    return resp

Typical agent session without caching: $0.15/turn.
Same session with caching on system + tools + stable context: $0.03/turn.

Four cache breakpoints: advanced layout

You can set up to 4 cache breakpoints in a single call. Use them for layered stable content:

messages = [
    {"role": "user", "content": [
        # Breakpoint 1, very stable, weeks-long
        {"type": "text", "text": ORG_WIKI_DUMP,
         "cache_control": {"type": "ephemeral", "ttl": "1h"}},
        # Breakpoint 2, session-stable (this user's data)
        {"type": "text", "text": USER_PROFILE,
         "cache_control": {"type": "ephemeral"}},
        # Breakpoint 3, current-task context
        {"type": "text", "text": RETRIEVED_DOCS,
         "cache_control": {"type": "ephemeral"}},
        # Not cached, varies per message
        {"type": "text", "text": user_question},
    ]},
]

Each breakpoint caches independently. If ORG_WIKI_DUMP doesn't change but RETRIEVED_DOCS does, only breakpoint 3 invalidates.

Common mistakes

  1. Putting cache_control on a varying prefix. If you cache USER_PROFILE but it changes per user, you thrash the cache and pay WRITE cost every call. Only cache prefixes that are stable across multiple calls.
  2. Reordering tools or system between calls. The cache matches by exact prefix. Swapping tool order invalidates it.
  3. Not checking usage. You think you're saving money but you aren't. Always log cache_creation_input_tokens vs cache_read_input_tokens.
  4. Caching tiny prefixes. Minimum 1024 tokens to cache (Opus) or 2048 (Haiku). Below that, caching does nothing.
  5. Ignoring the 5-minute window. Batch-processing jobs with long gaps between calls get zero cache benefit. Run back-to-back or use 1h TTL.

Resume angle

"Engineered prompt caching into a production agent: layered cache breakpoints (wiki dump, user profile, retrieved context) with appropriate TTLs. Cut per-turn API cost by 80% (verified via usage metrics) while holding latency under 1s p50. Implemented cache-hit-rate alerting, a regression in cache hits is usually a bug, not a user-behavior shift."

Common questions

Frequently asked

How much do cached tokens cost compared to standard input tokens in Claude?

Cached tokens cost approximately 10% of standard input tokens. For example, if you have a 4,000-token system prompt and make 100 calls without caching, you pay for 400,000 input tokens, but with caching enabled, subsequent calls within the cache window pay only 10% for the cached portion, resulting in a 5 to 10 times cost reduction for agents with large system prompts, tool definitions, or RAG context.

What can be cached with Claude prompt caching?

You can cache system prompts, tools definitions, and early messages such as few-shot examples or RAG context. The cache breakpoint is set using cache_control with type ephemeral. The last user message and anything after a change point cannot be cached because they vary with each request.

How long do Claude cache entries last?

Cache entries expire 5 minutes after the last hit by default. Every call within 5 minutes of the last hit refreshes the timer. You can request 1-hour caching by setting ttl to 1h in the cache_control, which costs 2 times the write cost but provides 12 times the time-to-live, useful for stable multi-hour sessions like support agents or long code-review runs.

What is the minimum token count required for Claude prompt caching to work?

The minimum is 1024 tokens for Opus or 2048 tokens for Haiku. Below these thresholds, caching does nothing. This means tiny prefixes will not benefit from caching, so you should only cache sufficiently large stable content like extensive system prompts, tool definitions, or retrieved document chunks.

How can I verify that prompt caching is actually saving money in my Claude API calls?

Always inspect the usage object returned by the API call. Check the values for cache_creation_input_tokens (first call building the cache), cache_read_input_tokens (subsequent calls using the cache at 10% cost), and input_tokens (the varying portion). Log these metrics to calculate actual cost per turn, a typical agent session might drop from 15 cents per turn without caching to 3 cents per turn with caching on system, tools, and stable context.

READY TO IMPLEMENT

Want to talk through this in your business?

The paper above is the thinking. Let's spend 30 minutes on what it would actually look like to ship in your shop, no pitch, just a real scoping conversation.

Prompt Caching for Claude | Elite AI Advantage