
How Do I Cut My Anthropic Bill in Half Using the Batch API?

Jake McCluskey · Intermediate · 35 min read

If your Anthropic bill has a line item for "evaluations" or "content generation" or anything else that runs in bulk, you're almost certainly overpaying. Anthropic has a batch-processing endpoint that charges 50% of the regular API price. It's been live for more than a year. Most teams I audit have never turned it on.

Here's how to migrate a sync script to the Batch API in under an hour, cut that line item in half, and stack it with prompt caching for even deeper savings.

Why this matters

The Batch API is designed for workloads where you don't need the answer in the next second. Evaluations. Offline content generation. Classification sweeps. Data labeling. You submit requests in bulk, and results arrive within a 24-hour processing window, usually much sooner.

In exchange, Anthropic charges you exactly half. No setup fees, no minimum volume, no separate contract. You just change which endpoint you call.

For any non-user-facing workload — anything a human is not sitting and waiting for — there's no reason to pay the sync price. I've moved workflows from $1,200/month to $480/month with no behavior change beyond the latency rising from "seconds" to "minutes."

Before you start

You need:

  • An Anthropic API key from console.anthropic.com.
  • A sync script that's already working — we're migrating an existing one, not building from scratch. If you don't have one yet, see our Claude Code setup guide and write one first.
  • Python 3.10+ or Node 18+ — the examples below are Python, but the Batch API works identically from the Node SDK.
  • About 35 minutes if you've used the sync API before.

Step 1: Identify a candidate workload

Not every API call should move to Batch. If you're building a chatbot or a live code assistant, keep sync — your user is waiting.

Migrate calls that meet all three of these:

  1. Nobody's actively waiting on the response. It's a cron job, a backend worker, an overnight eval.
  2. You have more than one request at a time. Batch shines when you hand it 100+ requests in one submission.
  3. A 24-hour turnaround is acceptable. In practice most batches finish in minutes, but you can't count on that.

A typical match: "once a night, classify every support ticket that came in today." A typical mismatch: "when a user hits the API, return a classification." The first is batch; the second stays sync.

Step 2: Look at the sync version

Here's the sync script we'll be migrating. It classifies support tickets into categories:

python
import os
from anthropic import Anthropic

client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

CATEGORIES = ["billing", "technical", "account", "feature-request", "other"]

def classify(ticket_body: str) -> str:
    resp = client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=64,
        messages=[{
            "role": "user",
            "content": (
                f"Classify this support ticket into exactly one of "
                f"these categories: {', '.join(CATEGORIES)}\n\n"
                f"Ticket:\n{ticket_body}\n\n"
                f"Reply with just the category name."
            ),
        }],
    )
    return resp.content[0].text.strip().lower()


# Called once per ticket, a few thousand times a day.
# todays_tickets() and save_category() stand in for your own data layer.
for ticket in todays_tickets():
    category = classify(ticket.body)
    save_category(ticket.id, category)

Simple. Works. Expensive.

Step 3: Rewrite as a batch job

The Batch API takes a list of "requests" — each one is a dictionary that looks almost exactly like a regular messages.create call, just wrapped in a bit of metadata. You submit them all at once, poll for completion, and read back the results.

Here's the same script as a batch job:

python
import os
import json
import time
from anthropic import Anthropic
from anthropic.types.messages.batch_create_params import Request

client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

CATEGORIES = ["billing", "technical", "account", "feature-request", "other"]


def build_request(ticket) -> Request:
    return Request(
        custom_id=f"ticket-{ticket.id}",  # so we can match results back
        params={
            "model": "claude-sonnet-4-5-20250929",
            "max_tokens": 64,
            "messages": [{
                "role": "user",
                "content": (
                    f"Classify this support ticket into exactly one "
                    f"of these categories: {', '.join(CATEGORIES)}\n\n"
                    f"Ticket:\n{ticket.body}\n\n"
                    f"Reply with just the category name."
                ),
            }],
        },
    )


def submit_batch(tickets: list) -> str:
    requests = [build_request(t) for t in tickets]
    batch = client.messages.batches.create(requests=requests)
    print(f"Submitted batch {batch.id} with {len(requests)} requests.")
    return batch.id


def wait_for_batch(batch_id: str, poll_seconds: int = 30) -> None:
    while True:
        batch = client.messages.batches.retrieve(batch_id)
        if batch.processing_status == "ended":
            print(f"Batch {batch_id} finished: {batch.request_counts}")
            return
        print(f"Status: {batch.processing_status} — waiting {poll_seconds}s")
        time.sleep(poll_seconds)


def process_results(batch_id: str):
    for result in client.messages.batches.results(batch_id):
        ticket_id = result.custom_id.replace("ticket-", "")
        if result.result.type == "succeeded":
            category = (
                result.result.message.content[0].text.strip().lower()
            )
            save_category(ticket_id, category)
        else:
            # Log failures; common causes are rate limits or
            # individual-request validation errors.
            print(f"Ticket {ticket_id} failed: {result.result}")


if __name__ == "__main__":
    tickets = list(todays_tickets())
    if not tickets:
        print("No tickets today.")
        exit(0)
    batch_id = submit_batch(tickets)
    wait_for_batch(batch_id)
    process_results(batch_id)

Three things to notice.

custom_id is required. It's how you match each result back to the original request. Use the record ID from your own system — order of results is not guaranteed.

Individual requests can fail without failing the whole batch. Always check result.result.type per item and log failures separately.

You don't have to wait inline. The script above polls, but if your batch is large, store the batch ID and process results in a separate run triggered by a webhook or a later cron. Don't burn a long-running Python process on time.sleep.
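
Here's a minimal sketch of that split, reusing build_request and process_results from above and using a local batch_state.json file as a stand-in for whatever persistence layer you already have:

python
import json
import os

STATE_FILE = "batch_state.json"  # stand-in for your own DB


def submit_and_record(tickets: list) -> None:
    # Run 1 (e.g. the nightly cron): submit, persist the ID, exit.
    batch = client.messages.batches.create(
        requests=[build_request(t) for t in tickets]
    )
    with open(STATE_FILE, "w") as f:
        json.dump({"batch_id": batch.id}, f)


def check_and_process() -> None:
    # Run 2 (a later cron tick): check once, process if finished.
    if not os.path.exists(STATE_FILE):
        return  # nothing in flight
    with open(STATE_FILE) as f:
        batch_id = json.load(f)["batch_id"]
    batch = client.messages.batches.retrieve(batch_id)
    if batch.processing_status != "ended":
        return  # still running; the next tick will check again
    process_results(batch_id)
    os.remove(STATE_FILE)  # clear state so we never resubmit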

Step 4: Verify the pricing

Before you ship this, prove the savings. Run both versions on a small sample and check the Anthropic console.

Go to console.anthropic.com and open the Usage page. The batch rows are tagged separately. Compare the per-token cost of your sync runs against the batch runs on the same prompts. You should see exactly a 50% reduction on both input and output tokens.

If you don't see that reduction, you're hitting the sync endpoint somewhere else in your script — the Python SDK makes it easy to accidentally mix the two. Grep your code for messages.create (that's sync) and confirm the batch path is the only one your cron triggers.
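
If you'd rather verify in code than eyeball the dashboard, you can total the usage fields on each result yourself. A sketch: the prices below are Sonnet 4.5 list prices at the time of writing ($3/$15 per million input/output tokens) — treat them as an assumption and check the current pricing page before trusting the output:

python
# Rough cost check from batch results. Prices are assumptions
# (Sonnet 4.5 list prices, $/MTok); verify against the pricing page.
SYNC_INPUT, SYNC_OUTPUT = 3.00, 15.00
BATCH_DISCOUNT = 0.5


def estimate_batch_cost(batch_id: str) -> float:
    input_toks = output_toks = 0
    for result in client.messages.batches.results(batch_id):
        if result.result.type == "succeeded":
            usage = result.result.message.usage
            input_toks += usage.input_tokens
            output_toks += usage.output_tokens
    sync_cost = (input_toks * SYNC_INPUT + output_toks * SYNC_OUTPUT) / 1e6
    batch_cost = sync_cost * BATCH_DISCOUNT
    print(f"Sync would cost ~${sync_cost:.2f}; batch costs ~${batch_cost:.2f}")
    return batch_cost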

Step 5: Stack it with prompt caching

Batch discount is 50%. Prompt caching is up to 90% on the cached portion. Combine them and the savings compound.

Prompt caching tells Anthropic: "this prefix of my prompt is reusable across requests, charge me less for it after the first call." For a classification workload where the instruction text is identical across every ticket, this is a natural fit.

Add a cache_control breakpoint to the shared portion of your prompt:

python
def build_request(ticket) -> Request:
    return Request(
        custom_id=f"ticket-{ticket.id}",
        params={
            "model": "claude-sonnet-4-5-20250929",
            "max_tokens": 64,
            "system": [
                {
                    "type": "text",
                    "text": (
                        f"You are a support ticket classifier. "
                        f"Respond with exactly one of: "
                        f"{', '.join(CATEGORIES)}. "
                        f"No punctuation, no extra text."
                    ),
                    "cache_control": {"type": "ephemeral"},
                }
            ],
            "messages": [{
                "role": "user",
                "content": f"Ticket:\n{ticket.body}",
            }],
        },
    )

The system text — identical for every ticket — is now eligible for caching, and subsequent requests bill that chunk at cache-hit pricing. Two caveats: the prefix has to meet the model's minimum cacheable length (1,024 tokens on Sonnet models, so a one-line system prompt like the sketch above won't actually cache; put your full instructions, category definitions, and examples there), and cache hits inside a batch are best-effort because requests can process concurrently.

This is where tight prompt design pays for itself. The shorter your per-ticket text is and the larger your cached prefix is, the more of your bill lands in the cached (cheap) lane.
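
To confirm the cache is actually engaging, check the cache fields on each result's usage object. A minimal sketch; the fields can be None on requests where caching didn't apply:

python
# Count cache writes vs. cache reads across the batch. After the first
# few requests, reads should dominate if caching is working.
def report_cache_usage(batch_id: str) -> None:
    created = read = 0
    for result in client.messages.batches.results(batch_id):
        if result.result.type != "succeeded":
            continue
        usage = result.result.message.usage
        created += usage.cache_creation_input_tokens or 0
        read += usage.cache_read_input_tokens or 0
    print(f"Cache writes: {created} tokens, cache reads: {read} tokens")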

Verify it worked

Three checkpoints:

Correctness. Run the batch on a sample of 20 tickets and spot-check the categories against the sync output. They should match: same model, same prompt, different endpoint. Sampling isn't perfectly deterministic, so treat a rare one-off difference as noise, not a bug.

Cost. Open the Usage dashboard. Batch runs show at 50% of sync token pricing. Cached-input tokens show at roughly 10% of standard.

Latency. Track your batch end-to-end time. Most batches under 500 requests complete in under 15 minutes. If yours routinely take hours, you're probably submitting enormous batches — split them.
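
A simple chunking wrapper handles the split. The 500-request cap below is a working number, not an API limit; tune it to your own latency needs:

python
# Split a large ticket list into capped batches and submit each one.
def submit_in_chunks(tickets: list, chunk_size: int = 500) -> list[str]:
    batch_ids = []
    for i in range(0, len(tickets), chunk_size):
        chunk = tickets[i:i + chunk_size]
        batch = client.messages.batches.create(
            requests=[build_request(t) for t in chunk]
        )
        batch_ids.append(batch.id)
    return batch_ids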

Where this breaks

  • Mixing cache and batch wrong. Cache breakpoints that sit in the message (not the system) can still cache, but the rules are finicky. Put stable instruction text in system with cache_control; put variable user input in messages. Swap them and caching silently doesn't work.
  • Forgetting that custom_id must be unique. Anthropic enforces this per batch. If you reuse a ticket ID, the batch submission fails. Append a timestamp or a UUID if your IDs could repeat (see the sketch after this list).
  • Assuming batch is free. It's half-price, not no-price. A 10,000-request batch still costs real money. Cap batch size in code if your upstream data might spike.
  • Polling too aggressively. Every retrieve call is cheap but not free. A 30-second poll interval is fine. A 1-second poll interval on a 10-minute batch eats 600 requests for no reason.
  • Mixing batch results with sync logic. Batch results return in a streaming iterator. If your save logic expects a synchronous return, you'll get surprises. Design the save-side to be idempotent and stream-friendly.
  • Running the same batch twice after a crash. If your polling loop dies mid-batch, the work is already paid for on Anthropic's side. Resume from the batch_id stored in your DB, don't resubmit.
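
For the custom_id pitfall above, a cheap fix is a short random suffix. A sketch, with a matching helper to recover the ticket ID on the way back:

python
import uuid


def unique_custom_id(ticket) -> str:
    # Keep the ticket ID first for readability; the UUID suffix
    # guarantees uniqueness even if the same ticket appears twice.
    return f"ticket-{ticket.id}-{uuid.uuid4().hex[:8]}"


def ticket_id_from(custom_id: str) -> str:
    # Strip the "ticket-" prefix and the random suffix.
    return custom_id.removeprefix("ticket-").rsplit("-", 1)[0]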

What to try next

Want this built for you instead?

Let's talk about your AI + SEO stack

If you'd rather skip the how-to and have it shipped for you, that's what I do. Start a conversation and we'll figure out the fastest path to results.

Frequently asked

Is the Batch API slower than sync?

Yes, but usually not by much. In practice most batches under 500 requests finish in under 15 minutes. The processing window is 24 hours, so plan for the worst case but expect minutes. If you need a real-time response, stay on sync.

Can I use Batch and Prompt Caching together?

Yes, and you should. Batch cuts the token price in half; caching cuts the cached-input price by up to 90%. Stacking them on workloads with repeated instruction prefixes — evals, classifiers, content generators — is how the total saving climbs well past the 50% that batch delivers on its own.

What happens if my polling process dies mid-batch?

Anthropic keeps processing and your charges don't double. Persist the batch ID to your database as soon as you get it, and on the next run pick it up from there instead of resubmitting. Idempotent result-saving does the rest.

Does Batch work with tool use and streaming?

Tool use yes, streaming no — batch responses come back as a completed result, not as a stream of tokens. If a workflow needs streaming (live chat output, progressive rendering), it's a sync workload by definition.

How do I make sure I'm actually hitting the batch endpoint?

Check the Usage dashboard in console.anthropic.com — batch usage is tagged separately and priced at 50%. If you're not seeing batch rows, you still have a sync path somewhere in your code. Grep for `messages.create` — that's sync; only `messages.batches.create` is batch.