[BUG] Server writes additional cache entries past explicit cache_control breakpoint, billing 1.25x for tokens that are never read #1547

@salman-a-shah

Description

Summary

When using an explicit cache_control breakpoint on the system block, with no top-level cache_control field (i.e., not opted into automatic caching), the server writes an additional cache entry inside the user content on warm calls. These additional writes are billed at the 1.25x cache-write rate but are never read back on subsequent calls, since they live past the only explicit breakpoint and the trailing content varies per request.

This contradicts the documented behavior at https://platform.claude.com/docs/en/build-with-claude/prompt-caching:

Cache writes happen only at your breakpoint. Marking a block with cache_control writes exactly one cache entry: a hash of the prefix ending at that block. The system does not write entries for any earlier position.

I have confirmed this reproduces against both the latest SDK (0.102.0) and older versions (0.79.0), ruling out the Python SDK as the cause. The behavior is server-side.
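Why those extra writes can never produce a hit can be sketched with a toy prefix-hash model (illustrative only; the server's actual cache keying scheme is an assumption here):

```python
import hashlib

# Toy model of prefix caching: an entry is keyed by a hash of the full
# prompt prefix ending at a given block. (Illustrative only; the server's
# real keying scheme is an assumption.)
def cache_key(prefix_blocks):
    return hashlib.sha256("\x00".join(prefix_blocks).encode()).hexdigest()

SYSTEM = "FACT: The capital of France is Paris.\n" * 200   # stable across calls
USER_1 = "Ignore this gibberish\n" * 200                   # stable across calls
def tail(age):                                             # varies per call
    return f"{age} years old"

# The only explicit breakpoint is on the system block, so the one entry
# the caller asked for covers just [SYSTEM]:
requested = cache_key([SYSTEM])

# An extra entry written past that breakpoint necessarily covers the
# varying tail, so no later call can ever match it:
extra_call2 = cache_key([SYSTEM, USER_1, tail(30)])
extra_call3 = cache_key([SYSTEM, USER_1, tail(40)])

assert requested == cache_key([SYSTEM])   # the requested entry is reusable
assert extra_call2 != extra_call3         # the extra writes never match again
```

This is exactly the pattern in the repro below: the stable system prefix hits on warm calls, while anything cached past it is dead weight.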

Reproduction

import uuid
import anthropic

# Nonce keeps Call 1 cold across re-runs of this script.
SYSTEM_PROMPT = f"<nonce>{uuid.uuid4().hex}</nonce>\n" + "FACT: The capital of France is Paris.\n" * 200
USER_BLOCK_1 = (
    "Ignore this gibberish\n" * 200 + "\nGenerate some fictional JSON data about a person from france who is "
)
USER_BLOCK_2 = " years old"
STARTING_AGE = 20


def print_cache_usage(usage):
    print(
        f"cache_creation={usage.cache_creation_input_tokens} "
        f"cache_read={usage.cache_read_input_tokens} "
        f"uncached_input={usage.input_tokens}"
    )


client = anthropic.Anthropic()


# Call 1 (cold)
response = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=32,
    system=[{"type": "text", "text": SYSTEM_PROMPT, "cache_control": {"type": "ephemeral"}}],
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": USER_BLOCK_1},
                {"type": "text", "text": str(STARTING_AGE) + USER_BLOCK_2},
            ],
        },
        {"role": "assistant", "content": "```json"},
    ],
)
print("Call 1: ", end="")
print_cache_usage(response.usage)


# Call 2 (warm)
response = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=32,
    system=[{"type": "text", "text": SYSTEM_PROMPT, "cache_control": {"type": "ephemeral"}}],
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": USER_BLOCK_1},
                {"type": "text", "text": str(STARTING_AGE + 10) + USER_BLOCK_2},
            ],
        },
        {"role": "assistant", "content": "```json"},
    ],
)
print("Call 2: ", end="")
print_cache_usage(response.usage)


# Call 3 (warm)
response = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=32,
    system=[{"type": "text", "text": SYSTEM_PROMPT, "cache_control": {"type": "ephemeral"}}],
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": USER_BLOCK_1},
                {"type": "text", "text": str(STARTING_AGE + 20) + USER_BLOCK_2},
            ],
        },
        {"role": "assistant", "content": "```json"},
    ],
)
print("Call 3: ", end="")
print_cache_usage(response.usage)

Expected behavior

Per the documentation, since there is exactly one cache_control marker (on the system block), only one cache entry should be written. Warm calls should show cache_creation=0:

Call 1: cache_creation=2227 cache_read=0    uncached_input=1424
Call 2: cache_creation=0    cache_read=2227 uncached_input=1424
Call 3: cache_creation=0    cache_read=2227 uncached_input=1424

Actual behavior

Warm calls write an additional ~1416 tokens to cache, even though no second breakpoint exists. These writes are billed at 1.25x but never produce a cache read in subsequent calls:

Call 1: cache_creation=2227 cache_read=0    uncached_input=1424
Call 2: cache_creation=1416 cache_read=2227 uncached_input=8
Call 3: cache_creation=1416 cache_read=2227 uncached_input=8

There is currently no documented way to opt out.
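For scale, the net billing delta of the unrequested writes can be estimated. The pricing figure below is an assumption (Sonnet-class base input rate); verify against the current price sheet before quoting it:

```python
# Estimated billing delta of the unrequested warm-call writes.
# Assumption: $3.00 per million base input tokens (verify against the
# current Anthropic price sheet before relying on this figure).
BASE_PER_MTOK = 3.00
CACHE_WRITE_MULTIPLIER = 1.25

extra_write_tokens = 1416  # cache_creation on warm calls above

# Those 1416 tokens would otherwise be billed at the plain input rate
# (they appear as uncached_input in the expected output), so the net
# surcharge is the 0.25x write premium on them:
surcharge_per_call = (
    extra_write_tokens / 1_000_000 * BASE_PER_MTOK * (CACHE_WRITE_MULTIPLIER - 1.0)
)
print(f"surcharge per warm call: ${surcharge_per_call:.6f}")
```

Fractions of a cent per call, but it compounds linearly with call volume and with larger user-content prefixes, and it is pure overhead: the written entries are never read.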

Request

  1. Confirm whether this is intended behavior.
  2. If intended: please update the prompt-caching docs (the quoted paragraph is misleading) and provide an opt-out mechanism (header or request field) so users on explicit caching can prevent unrequested cache writes.
  3. If unintended: please fix the server-side behavior so that only explicitly requested breakpoints produce cache writes.

Environment

  • anthropic Python SDK: 0.102.0 (also reproduced on 0.79.0)
  • Model: claude-sonnet-4-5-20250929
  • Python: 3.12.3
  • OS: Ubuntu 24.04
  • API: Direct Anthropic API (not Bedrock/Vertex)
