White Paper

Prompt Caching for Claude: The 90% Cost Cut Most People Miss

Jake McCluskey

Prompt caching is one of the highest-impact wins for any production agent: cached tokens cost a fraction of standard input tokens and load faster.

The claim: cache-hit tokens cost about 10% of standard input tokens and are served in roughly a fifth of the latency. For agents with large system prompts, tool definitions, or RAG context, that works out to a 5-10x reduction in input cost.

The mental model

Every LLM call reprocesses your entire context from scratch. If your system prompt is 4,000 tokens and you make 100 calls, you pay for 400,000 input tokens, even though the system prompt never changed.

Prompt caching = tell Anthropic "this prefix is stable, remember its computed state for 5 minutes." Subsequent calls within that window pay 10% for the cached portion.
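To see why this matters, here's a back-of-the-envelope version of the claim above (illustrative Python only; it assumes the 10% cache-read rate and the ~25% cache-write premium described below):

SYSTEM_TOKENS = 4_000   # the stable prefix
CALLS = 100

uncached = SYSTEM_TOKENS * CALLS                  # 400,000 full-price input tokens
cached = (SYSTEM_TOKENS * 1.25                    # one cache write at a 25% premium
          + SYSTEM_TOKENS * (CALLS - 1) * 0.10)   # 99 cache reads at 10%

print(f"{uncached:,.0f} vs {cached:,.0f} token-equivalents, "
      f"~{uncached / cached:.0f}x cheaper on the prefix")
# -> 400,000 vs 44,600, ~9x cheaper on the prefix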

What's cachable (prefixes):

  • System prompt
  • Tool definitions
  • Early messages (e.g. few-shot examples, RAG context)

What's NOT cachable:

  • The last user message (that's what varies)
  • Anything after a change point

The cache breakpoint is set with cache_control: {"type": "ephemeral"}.

Minimum viable caching

import anthropic

claude = anthropic.Anthropic()
MODEL = "claude-opus-4-1"   # any Claude model that supports prompt caching

SYSTEM_PROMPT = """You are a senior financial analyst. Answer questions about 10-K filings.

Rules:
- Always cite the section and page number
- Quote exact numbers, don't round
- If information is not in the provided context, say so explicitly
- Format tables with markdown
[... 3000 more tokens of instructions and examples ...]
"""

def ask(question: str):
    return claude.messages.create(
        model="claude-opus-4-7",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"},   # <-- THIS is the whole trick
            }
        ],
        messages=[{"role": "user", "content": question}],
    )

# First call: writes to cache. Slightly MORE expensive (~25% premium on the tokens written to cache).
ask("What was the 2024 revenue?")

# Subsequent calls within 5 min: cache hit. 10% cost on the system prompt.
ask("What were the operating margins?")
ask("What's the cash position?")

Cache the tools too (bigger win for agents)

Tool definitions are often 2K-8K tokens and rarely change. Cache them together with the system prompt:

TOOLS = [ ... big list of tool definitions ... ]

# A cache breakpoint covers everything BEFORE it in the prompt prefix
# (the prefix order is: tools, then system, then messages).
# So cache_control on the system block caches the tool definitions too,
# as long as the tool list is identical and in the same order on every call.

resp = claude.messages.create(
    model=MODEL,
    system=[{"type": "text", "text": SYSTEM_PROMPT,
             "cache_control": {"type": "ephemeral"}}],
    tools=TOOLS,
    messages=messages,
)

# Read the usage to confirm:
print(resp.usage)
# cache_creation_input_tokens: 4500  (first call — building cache)
# cache_read_input_tokens: 0
# input_tokens: 87   (only the user message)

# Next call:
# cache_creation_input_tokens: 0
# cache_read_input_tokens: 4500  (← paid at 10%)
# input_tokens: 120

Cache RAG context: biggest win of all

RAG typically stuffs retrieved chunks into the user message. But if the retrieval is stable across a conversation (a user asking follow-ups about the same doc), cache it:

# User uploaded a 50-page report. They'll ask ~10 follow-up questions.
retrieved_chunks = retrieve(...)   # 20K tokens

messages = [
    {"role": "user", "content": [
        {
            "type": "text",
            "text": f"Context documents:\n\n{retrieved_chunks}",
            "cache_control": {"type": "ephemeral"},
        },
        {
            "type": "text",
            "text": f"Question: {user_question}",
            # no cache_control — this varies per question
        },
    ]},
]

First question: pays full price (plus the ~25% write premium) for the 20K context tokens.
Questions 2-10: pay 10% for the cached 20K plus full price for the ~50-token new question.

Savings on 10 questions: roughly 4-5x cheaper than re-sending the 20K context at full price each time.
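The arithmetic behind that number, under the same assumptions as above (10% cache-read rate, 25% write premium, per-question tokens negligible):

CONTEXT_TOKENS = 20_000
QUESTIONS = 10

uncached = CONTEXT_TOKENS * QUESTIONS                 # 200,000 full-price tokens
cached = (CONTEXT_TOKENS * 1.25                       # 25,000: one cache write
          + CONTEXT_TOKENS * (QUESTIONS - 1) * 0.10)  # 18,000: nine cache reads

print(f"~{uncached / cached:.1f}x cheaper on the context tokens")   # ~4.7x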

The 5-minute TTL (and how to extend it)

Cache entries expire 5 minutes after the last hit. For long sessions:

  1. Keep the conversation active. Every call within 5 min of the last hit refreshes the timer.
  2. Request 1-hour caching (in beta as of 2025; may require opting in via a beta header): {"type": "ephemeral", "ttl": "1h"}. 2x the write cost but 12x the TTL.
system=[{"type": "text", "text": SYSTEM_PROMPT,
         "cache_control": {"type": "ephemeral", "ttl": "1h"}}]

Use 1h for stable multi-hour sessions (support agents, long code-review runs). Default 5m for short interactive loops.
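If a session has natural gaps but you still want the 5-minute cache warm, one option is a periodic keep-alive that re-sends the same prefix before the window closes, paying the 10% read rate each time. The sketch below is a hypothetical pattern, not an official API feature; it reuses the claude, MODEL, and SYSTEM_PROMPT names from earlier and only pays off when the prefix is large relative to the pings:

import threading

def keep_cache_warm(interval_s: float = 240) -> threading.Timer:
    """Sketch: refresh the 5-minute cache by re-sending the cached prefix.

    Each ping pays the 10% cache-read rate on the prefix plus one output
    token, which beats a cold cache write when the prefix is large.
    """
    claude.messages.create(
        model=MODEL,
        max_tokens=1,
        system=[{"type": "text", "text": SYSTEM_PROMPT,
                 "cache_control": {"type": "ephemeral"}}],
        messages=[{"role": "user", "content": "ping"}],
    )
    # Schedule the next ping before the 5-minute window can close.
    timer = threading.Timer(interval_s, keep_cache_warm, args=(interval_s,))
    timer.daemon = True
    timer.start()
    return timer

Cancel the returned timer when the session ends. If the gaps are long or unpredictable, the 1h TTL above is the simpler choice.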

Measuring the savings

Always inspect usage. Otherwise you won't know if caching is actually helping.

def call_with_metrics(**kwargs):
    resp = claude.messages.create(**kwargs)
    u = resp.usage
    # Rough cost at Opus-class rates ($15/M input, $18.75/M cache write,
    # $1.50/M cache read, $75/M output); check current pricing for exact numbers
    cost = (
        u.input_tokens * 0.015 / 1000 +
        (u.cache_creation_input_tokens or 0) * 0.01875 / 1000 +
        (u.cache_read_input_tokens or 0) * 0.0015 / 1000 +
        u.output_tokens * 0.075 / 1000
    )
    print(f"in={u.input_tokens}, cache_write={u.cache_creation_input_tokens}, "
          f"cache_read={u.cache_read_input_tokens}, out={u.output_tokens}, "
          f"cost=${cost:.4f}")
    return resp

Typical agent session without caching: $0.15/turn.
Same session with caching on system + tools + stable context: $0.03/turn.

Four cache breakpoints: advanced layout

You can set up to 4 cache breakpoints in a single call. Use them for layered stable content:

messages = [
    {"role": "user", "content": [
        # Breakpoint 1: changes rarely (spans weeks), so it gets the 1h TTL
        {"type": "text", "text": ORG_WIKI_DUMP,
         "cache_control": {"type": "ephemeral", "ttl": "1h"}},
        # Breakpoint 2 — session-stable (this user's data)
        {"type": "text", "text": USER_PROFILE,
         "cache_control": {"type": "ephemeral"}},
        # Breakpoint 3 — current-task context
        {"type": "text", "text": RETRIEVED_DOCS,
         "cache_control": {"type": "ephemeral"}},
        # Not cached — varies per message
        {"type": "text", "text": user_question},
    ]},
]

Each breakpoint caches independently, but matching is still prefix-based: if ORG_WIKI_DUMP doesn't change and RETRIEVED_DOCS does, only breakpoint 3 invalidates. If ORG_WIKI_DUMP changes, everything after it invalidates too.

Common mistakes

  1. Putting cache_control on a varying prefix. If you cache USER_PROFILE but it changes per user, you thrash the cache and pay WRITE cost every call. Only cache prefixes that are stable across multiple calls.
  2. Reordering tools or system between calls. The cache matches by exact prefix. Swapping tool order invalidates it.
  3. Not checking usage. You think you're saving money but you aren't. Always log cache_creation_input_tokens vs cache_read_input_tokens (see the guard sketch after this list).
  4. Caching tiny prefixes. Minimum 1024 tokens to cache (Opus) or 2048 (Haiku). Below that, caching does nothing.
  5. Ignoring the 5-minute window. Batch-processing jobs with long gaps between calls get zero cache benefit. Run back-to-back or use 1h TTL.
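For mistake 3, a small guard like the one below catches cache thrash early. It's a hypothetical helper built on the usage fields shown earlier, not part of the SDK:

def warn_on_cache_miss(resp, expect_cached: bool = True):
    """Warn when a call that should have hit the cache didn't.

    A steady stream of cache writes with no reads usually means the prefix
    changes between calls (mistake 1 or 2) or calls fall outside the TTL (mistake 5).
    """
    u = resp.usage
    wrote = getattr(u, "cache_creation_input_tokens", 0) or 0
    read = getattr(u, "cache_read_input_tokens", 0) or 0
    if expect_cached and wrote > 0 and read == 0:
        print(f"WARNING: wrote {wrote} tokens to cache but read 0 back; "
              "the prefix is probably unstable or the 5-minute window lapsed")
    return resp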

Resume angle

"Engineered prompt caching into a production agent: layered cache breakpoints (wiki dump, user profile, retrieved context) with appropriate TTLs. Cut per-turn API cost by 80% (verified via usage metrics) while holding latency under 1s p50. Implemented cache-hit-rate alerting, a regression in cache hits is usually a bug, not a user-behavior shift."