Optimize Claude API latency and throughput via prompt caching, model selection, streaming, and request optimization. The biggest wins come from prompt caching (input tokens are ~90% cheaper on cache hits) and model selection (Haiku is roughly 4x faster than Sonnet).
import anthropic
client = anthropic.Anthropic()
# Mark long, reusable content with cache_control
# Cached content: 90% cheaper on subsequent requests, near-zero latency for cached portion
message = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=1024,
system=[
{
"type": "text",
"text": "You are an expert on the following 50-page document: ...<long document>...",
"cache_control": {"type": "ephemeral"} # Cache this block
}
],
messages=[{"role": "user", "content": "What does section 3.2 say?"}]
)
# Check cache performance
print(f"Cache read tokens: {message.usage.cache_read_input_tokens}") # Free/cheap
print(f"Cache creation tokens: {message.usage.cache_creation_input_tokens}") # First call only
print(f"Uncached input tokens: {message.usage.input_tokens}")
Cache requirements: minimum 1,024 tokens for Sonnet/Opus, 2,048 for Haiku; shorter blocks are processed normally even if marked. The cache has a 5-minute TTL, refreshed on every cache hit.
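Because every cache hit resets the 5-minute TTL, an active multi-turn conversation keeps its prefix cached for its whole lifetime. A minimal sketch of that pattern, assuming a LONG_DOCUMENT string and an in-memory history list (both illustrative, not part of the SDK):
# The system block is written to the cache once; every later turn reads it
# at the discounted rate and refreshes the 5-minute TTL.
history = []

def ask(question: str) -> str:
    history.append({"role": "user", "content": question})
    reply = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system=[{
            "type": "text",
            "text": LONG_DOCUMENT,  # assumed >= 1,024 tokens, the cache minimum
            "cache_control": {"type": "ephemeral"},
        }],
        messages=history,
    )
    history.append({"role": "assistant", "content": reply.content[0].text})
    return reply.content[0].text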
| Model | Speed | Cost (per MTok in/out) | Best For |
|---|---|---|---|
| Claude Haiku | Fastest | $0.80 / $4.00 | Classification, extraction, routing |
| Claude Sonnet | Balanced | $3.00 / $15.00 | General tasks, tool use, code |
| Claude Opus | Deepest | $15.00 / $75.00 | Complex reasoning, research |
# Route by task complexity
def select_model(task_type: str) -> str:
routing = {
"classify": "claude-haiku-4-20250514",
"extract": "claude-haiku-4-20250514",
"summarize": "claude-sonnet-4-20250514",
"code": "claude-sonnet-4-20250514",
"research": "claude-opus-4-20250514",
}
return routing.get(task_type, "claude-sonnet-4-20250514")
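Hypothetical usage, dispatching each incoming task to the cheapest capable model (the task dict is illustrative):
task = {"type": "classify", "prompt": "Is this ticket a bug report or a feature request? ..."}
msg = client.messages.create(
    model=select_model(task["type"]),  # resolves to Haiku for classification
    max_tokens=64,
    messages=[{"role": "user", "content": task["prompt"]}],
)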
# Streaming reduces time-to-first-token from seconds to ~200ms.
# yield is only valid inside a function, so wrap the stream in a generator.
def stream_response(prompt: str):
    with client.messages.stream(
        model="claude-sonnet-4-20250514",
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}]
    ) as stream:
        for text in stream.text_stream:
            yield text  # Tokens reach the user as they are generated
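To check the improvement on your own workload, a quick sketch that measures time-to-first-token with the standard library (the ~200ms figure above will vary with model, prompt size, and network):
import time

start = time.perf_counter()
with client.messages.stream(
    model="claude-sonnet-4-20250514",
    max_tokens=2048,
    messages=[{"role": "user", "content": "Explain prompt caching in one paragraph."}],
) as stream:
    ts = stream.text_stream
    first_chunk = next(ts)  # blocks until the first text chunk arrives
    print(f"TTFT: {time.perf_counter() - start:.3f}s")
    for _ in ts:  # drain the remainder of the response
        pass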
# 1. Set max_tokens to what you actually need (not max)
msg = client.messages.create(
model="claude-haiku-4-20250514",
max_tokens=128, # Not 4096 — smaller = faster generation
messages=[{"role": "user", "content": "Classify as positive/negative: 'Great product!'"}]
)
# 2. Use prefill to skip preamble
msg = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=64,
messages=[
{"role": "user", "content": "Classify sentiment: 'Great product!'"},
{"role": "assistant", "content": "Sentiment:"} # Skip "Sure, I'd be happy to..."
]
)
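The reply continues from the prefill rather than repeating it, so prepend the prefill when you need the full text:
# msg.content[0].text starts right after "Sentiment:", e.g. " positive"
full_reply = "Sentiment:" + msg.content[0].text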
# 3. Pre-check token count for large inputs
count = client.messages.count_tokens(
model="claude-sonnet-4-20250514",
messages=[{"role": "user", "content": large_document}]
)
if count.input_tokens > 100_000:
# Chunk or summarize first
pass
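A minimal chunking sketch for the oversized case, packing paragraphs greedily by character count (chunk_size and the paragraph-splitting rule are illustrative; tune both to your documents):
def chunk_text(text: str, chunk_size: int = 50_000) -> list[str]:
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) > chunk_size:
            chunks.append(current)
            current = ""
        current += para + "\n\n"
    if current:
        chunks.append(current)
    return chunks

for chunk in chunk_text(large_document):
    ...  # summarize each chunk, then synthesize the summaries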
import Anthropic from '@anthropic-ai/sdk';
import PQueue from 'p-queue';
const client = new Anthropic();
const queue = new PQueue({ concurrency: 10 });
// Process multiple prompts in parallel (within rate limits)
const results = await Promise.all(
prompts.map(p => queue.add(() =>
client.messages.create({
model: 'claude-haiku-4-20250514',
max_tokens: 256,
messages: [{ role: 'user', content: p }],
})
))
);
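The same pattern in Python, using the SDK's AsyncAnthropic client with a semaphore standing in for PQueue (a sketch; a concurrency of 10 assumes your rate-limit tier allows it):
import asyncio
import anthropic

async_client = anthropic.AsyncAnthropic()
semaphore = asyncio.Semaphore(10)  # same role as PQueue's concurrency: 10

async def run_one(prompt: str):
    async with semaphore:
        return await async_client.messages.create(
            model="claude-haiku-4-20250514",
            max_tokens=256,
            messages=[{"role": "user", "content": prompt}],
        )

async def run_all(prompts: list[str]):
    return await asyncio.gather(*(run_one(p) for p in prompts))

# results = asyncio.run(run_all(prompts))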
| Optimization | Latency Impact | Cost Impact |
|---|---|---|
| Prompt caching | -50% (cached portion) | -90% input cost |
| Haiku over Sonnet | -75% TTFT | -73% cost |
| Streaming | -80% TTFT (perceived) | Same cost |
| Lower max_tokens | -10% to -30% total time | Same cost |
| Prefill technique | -20% output tokens | Proportional savings |
For cost optimization, see anth-cost-tuning.