Claude API Performance Optimization Guide

v20260423
clade-performance-tuning
This guide outlines advanced strategies to optimize latency and throughput when integrating Anthropic's Claude API. Learn critical techniques such as streaming responses, prompt caching, model selection (Haiku, Sonnet, Opus), and request parallelization, ensuring your applications provide a low-latency, high-performance user experience.
Anthropic Performance Tuning

Overview

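Claude response time has two components: time to first token (TTFT), and generation time, which is governed by output tokens per second (TPS). Different strategies target each: streaming, caching, and model choice mainly reduce TTFT, while shorter outputs and faster models reduce generation time.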

Latency Benchmarks (approximate)

Model              TTFT (p50)   TTFT (p95)   Output TPS
Claude Haiku 4.5   200ms        600ms        ~150
Claude Sonnet 4    400ms        1.2s         ~90
Claude Opus 4      800ms        2.5s         ~40
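
To see how the two components combine, total response time can be approximated as TTFT plus output length divided by TPS. The sketch below applies that to the approximate p50 figures in the table. The `estimateTotalMs` helper, the `profiles` object, and the 500-token example are illustrative assumptions, not measurements.

```typescript
// Rough model: total ≈ TTFT + outputTokens / TPS.
// Figures are the approximate p50 benchmarks from the table above.
interface ModelProfile {
  ttftMs: number;    // p50 time to first token, in milliseconds
  outputTps: number; // approximate output tokens per second
}

const profiles = {
  haiku: { ttftMs: 200, outputTps: 150 },
  sonnet: { ttftMs: 400, outputTps: 90 },
  opus: { ttftMs: 800, outputTps: 40 },
};

function estimateTotalMs(profile: ModelProfile, outputTokens: number): number {
  return profile.ttftMs + (outputTokens / profile.outputTps) * 1000;
}

// Estimated time for a 500-token answer:
console.log(Math.round(estimateTotalMs(profiles.haiku, 500)));  // 3533
console.log(Math.round(estimateTotalMs(profiles.sonnet, 500))); // 5956
console.log(Math.round(estimateTotalMs(profiles.opus, 500)));   // 13300
```

For answers of this length, generation time dominates TTFT, which is why reducing output tokens matters as much as picking a fast model.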

Optimization Strategies

Instructions

Step 1: Always Stream

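import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// Streaming delivers the first token as soon as it is generated, so the user
// sees output immediately instead of waiting for the full response.
// `yield` is only valid inside a generator, so the loop is wrapped in an async
// generator (streamText is an illustrative name).
async function* streamText(messages: Anthropic.MessageParam[]) {
  const stream = client.messages.stream({
    model: 'claude-sonnet-4-20250514',
    max_tokens: 1024,
    messages,
  });

  // First token arrives in ~400ms (Sonnet).
  // The full response may take 5-10s, but the user sees progress immediately.
  for await (const event of stream) {
    // Only text deltas carry a `text` field; skip other delta types.
    if (event.type === 'content_block_delta' && event.delta.type === 'text_delta') {
      yield event.delta.text;
    }
  }
}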
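
To check that streaming actually improves perceived latency in your own application, it helps to record when the first chunk arrives. The following is a minimal sketch that works on any async iterable of text chunks. `collectWithTtft` is a hypothetical helper, not part of the SDK, and the injectable `now` clock exists only to make it easy to test.

```typescript
// Consumes an async iterable of text chunks, recording time to first chunk
// and total time. Works with any source that yields strings.
async function collectWithTtft(
  chunks: AsyncIterable<string>,
  now: () => number = Date.now,
): Promise<{ text: string; ttftMs: number | null; totalMs: number }> {
  const start = now();
  let ttftMs: number | null = null; // stays null if the stream yields nothing
  let text = '';
  for await (const chunk of chunks) {
    if (ttftMs === null) ttftMs = now() - start;
    text += chunk;
  }
  return { text, ttftMs, totalMs: now() - start };
}
```

Pass it a generator that yields the text deltas from the stream, then log `ttftMs` and `totalMs` to compare against the benchmark figures.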

Step 2: Prompt Caching — Faster TTFT

// Cached prompts skip re-processing — dramatically lower TTFT for large system prompts
const message = await client.messages.create({
  model: 'claude-sonnet-4-20250514',
  max_tokens: 1024,
  system: [{
    type: 'text',
    text: largeSystemPrompt, // 10K+ tokens
    cache_control: { type: 'ephemeral' },
  }],
  messages,
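}, {
  // Beta flags go in the 'anthropic-beta' header. Prompt caching is now generally
  // available, so this header is only needed on older API/SDK versions.
  headers: { 'anthropic-beta': 'prompt-caching-2024-07-31' },
});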
// TTFT drops from ~2s to ~500ms on cache hit with large prompts
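
One way to confirm that caching is taking effect is to inspect the `usage` block of each response, which reports `cache_read_input_tokens` and `cache_creation_input_tokens`. A minimal sketch, assuming a response shaped like the Messages API output. The `cacheStatus` helper and the `UsageLike` type are hypothetical names introduced here.

```typescript
// Structural subset of the `usage` object returned with each message.
interface UsageLike {
  input_tokens: number;
  cache_creation_input_tokens?: number | null;
  cache_read_input_tokens?: number | null;
}

type CacheStatus = 'hit' | 'write' | 'none';

// 'hit'   : part of the prompt was read from the cache
// 'write' : the prompt prefix was written to the cache on this request
// 'none'  : caching did not apply (for example, the prompt was too short)
function cacheStatus(usage: UsageLike): CacheStatus {
  if ((usage.cache_read_input_tokens ?? 0) > 0) return 'hit';
  if ((usage.cache_creation_input_tokens ?? 0) > 0) return 'write';
  return 'none';
}
```

Calling `cacheStatus(message.usage)` after each request and logging the result should show a `'write'` on the first call followed by `'hit'` on subsequent calls made before the cache entry expires.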

Step 3: Use Haiku for Speed-Critical Paths

// Haiku is about 2x faster to first token than Sonnet in the benchmarks above
// and is accurate enough for many simple tasks
// Use for: classification, extraction, simple Q&A, routing decisions

const route = await client.messages.create({
  model: 'claude-haiku-4-5-20251001', // 200ms TTFT
  max_tokens: 10,
  system: 'Classify the intent. Reply with exactly one word: search, create, update, delete.',
  messages: [{ role: 'user', content: userInput }],
});

// Then use Sonnet/Opus for the actual task
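
Because the classifier's reply is free text, it is worth validating it before acting on it. The sketch below normalizes the one-word reply and picks a model for the follow-up call. `parseIntent` and `modelForIntent` are hypothetical helpers, and the routing policy (Haiku for `search`, Sonnet otherwise) is purely an illustrative assumption.

```typescript
type Intent = 'search' | 'create' | 'update' | 'delete';
const INTENTS: readonly string[] = ['search', 'create', 'update', 'delete'];

// Normalizes the classifier's reply; returns null if it is not one of the
// four expected words (for example, if the model added extra text).
function parseIntent(raw: string): Intent | null {
  const word = raw.trim().toLowerCase().replace(/[^a-z]/g, '');
  return INTENTS.includes(word) ? (word as Intent) : null;
}

// Illustrative policy: keep cheap lookups on Haiku, send everything else
// (including unrecognized intents) to Sonnet.
function modelForIntent(intent: Intent | null): string {
  return intent === 'search'
    ? 'claude-haiku-4-5-20251001'
    : 'claude-sonnet-4-20250514';
}

console.log(parseIntent(' Search.\n'));        // search
console.log(parseIntent('I think: create'));   // null
console.log(modelForIntent(parseIntent('delete'))); // claude-sonnet-4-20250514
```

Treating an unparseable reply as `null` and routing it to the more capable model keeps a classification miss from degrading the final answer.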

Step 4: Reuse Client Instance

// BAD — creates new connection pool per request
app.get('/api/chat', async (req, res) => {
  const client = new Anthropic(); // DON'T
  // ...
});

// GOOD — single client shared across requests
const client = new Anthropic(); // Module-level singleton

app.get('/api/chat', async (req, res) => {
  const message = await client.messages.create({ ... });
  // ...
});

Step 5: Parallel Requests

// When you need multiple independent Claude calls, fire them in parallel
const [summary, sentiment, entities] = await Promise.all([
  client.messages.create({ model: 'claude-haiku-4-5-20251001', max_tokens: 200,
    messages: [{ role: 'user', content: `Summarize: ${text}` }] }),
  client.messages.create({ model: 'claude-haiku-4-5-20251001', max_tokens: 20,
    messages: [{ role: 'user', content: `Sentiment (positive/negative/neutral): ${text}` }] }),
  client.messages.create({ model: 'claude-haiku-4-5-20251001', max_tokens: 200,
    messages: [{ role: 'user', content: `Extract named entities from: ${text}` }] }),
]);
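
`Promise.all` works well for a handful of calls, but firing a large batch at once can run into rate limits. A bounded worker pool is one common way to keep parallelism while capping in-flight requests. `mapWithConcurrency` below is a generic sketch, not an SDK feature, and the right limit depends on your account's rate limits.

```typescript
// Runs `worker` over `items` with at most `limit` calls in flight at once.
// Results are returned in the same order as the input items.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0; // index of the next item to claim

  async function run(): Promise<void> {
    while (next < items.length) {
      const i = next++; // claim an index before awaiting
      results[i] = await worker(items[i] as T, i);
    }
  }

  const poolSize = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: poolSize }, () => run()));
  return results;
}
```

For example, `mapWithConcurrency(documents, 4, (doc) => client.messages.create({ ... }))` would process a list of documents four at a time. If any worker rejects, the returned promise rejects, mirroring `Promise.all`.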

Step 6: Minimize Output Tokens

// Fewer output tokens = faster response
system: 'Be extremely concise. Use bullet points, not paragraphs.',

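// Set a tight max_tokens as a safety cap. It truncates output at the limit;
// it does not make the model write more concisely, so pair it with the prompt above.
max_tokens: 256, // Don't use 4096 for short answers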
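
A tight cap risks cutting answers off mid-sentence. The Messages API reports `stop_reason: 'max_tokens'` when that happens, so it is straightforward to detect. A minimal sketch, assuming a response shaped like the API's message object. `wasTruncated`, `truncationRate`, and `MessageLike` are hypothetical names.

```typescript
// Structural subset of a Messages API response.
interface MessageLike {
  stop_reason: string | null;
  usage: { output_tokens: number };
}

// True when generation stopped because it hit the max_tokens cap.
function wasTruncated(message: MessageLike): boolean {
  return message.stop_reason === 'max_tokens';
}

// Fraction of responses that were truncated; useful for tuning the cap.
function truncationRate(messages: readonly MessageLike[]): number {
  if (messages.length === 0) return 0;
  return messages.filter(wasTruncated).length / messages.length;
}
```

If the truncation rate is noticeably above zero, either raise `max_tokens` or tighten the system prompt until answers fit within the cap.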

Output

  • Streaming enabled for all user-facing responses (first token in ~400ms with Sonnet)
  • Prompt caching reducing TTFT for large system prompts
  • Model routing to Haiku for speed-critical classification/routing tasks
  • Client instance reused across requests (no per-request connection overhead)
  • Independent Claude calls issued concurrently rather than sequentially
  • Output length kept short with a concise system prompt and a tight max_tokens

Error Handling

Issue            Cause                               Fix
TTFT > 3s        Large uncached prompt               Enable prompt caching
Slow output      Using Opus for simple tasks         Downgrade to Haiku/Sonnet
Timeouts         Long generation + default timeout   new Anthropic({ timeout: 120_000 })
529 overloaded   API capacity                        SDK auto-retries; add fallback model
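
The "add fallback model" fix for 529 responses can be sketched as a small wrapper that tries each model in order. This is an illustration under stated assumptions: `withModelFallback` is a hypothetical helper, and the default overload check assumes the thrown error exposes a numeric `status` property, as the SDK's API errors do.

```typescript
// Tries `call` with each model in order, moving to the next model only when
// the error looks like an overload (HTTP 529). Other errors are rethrown.
async function withModelFallback<T>(
  models: readonly string[],
  call: (model: string) => Promise<T>,
  isOverloaded: (err: unknown) => boolean = (err) =>
    typeof err === 'object' &&
    err !== null &&
    (err as { status?: number }).status === 529,
): Promise<T> {
  let lastError: unknown = new Error('no models provided');
  for (const model of models) {
    try {
      return await call(model);
    } catch (err) {
      if (!isOverloaded(err)) throw err; // do not mask unrelated failures
      lastError = err;
    }
  }
  throw lastError; // every model was overloaded
}
```

A typical use would be `withModelFallback(['claude-sonnet-4-20250514', 'claude-haiku-4-5-20251001'], (model) => client.messages.create({ model, max_tokens: 256, messages }))`. Since the SDK already retries overloaded responses, the fallback only triggers after those retries are exhausted.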

Examples

See Latency Benchmarks table and six numbered strategy sections above, each with complete TypeScript code examples.

Resources

Next Steps

See clade-deploy-integration for production deployment patterns.

Prerequisites

  • Completed clade-install-auth
  • User-facing application where latency matters
  • Understanding of streaming and async patterns
Info
Category Development
Name clade-performance-tuning
Version v20260423
Size 3.48KB
Updated At 2026-04-26
Language