Skills Development Anthropic API Incident Runbook

Anthropic API Incident Runbook

v20260423
clade-incident-runbook
A comprehensive runbook designed to guide developers through responding to Anthropic API incidents, including outages, sustained errors (like 529), and rate limit issues. It covers systematic steps like checking the official status page, classifying incident severity, implementing model fallback logic (e.g., downgrading Opus to Sonnet), and conducting post-incident impact assessments to ensure service resilience.
Get Skill
483 downloads
Overview

Anthropic Incident Runbook

Overview

Respond to Anthropic API incidents in production — outages, sustained 529 errors, authentication failures, and timeouts. Covers status page checking, severity classification, model fallback activation, communication, and post-incident review.

Step 1: Confirm the Issue

# Check Anthropic status
curl -s https://status.anthropic.com/api/v2/status.json | python3 -c "
import json, sys
d = json.load(sys.stdin)
print(f\"Status: {d['status']['description']} ({d['status']['indicator']})\")"

# Test API directly
curl -s -w "\nHTTP %{http_code} in %{time_total}s\n" \
  https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "claude-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{"model":"claude-haiku-4-5-20251001","max_tokens":5,"messages":[{"role":"user","content":"ping"}]}'

Step 2: Classify Severity

Symptom Severity Action
529 overloaded (intermittent) Low SDK auto-retries handle this
529 overloaded (sustained 5+ min) Medium Switch to fallback model
401/403 on all requests High API key issue — check console
All requests timing out High Check status page, activate fallback
Status page shows incident Varies Follow status page updates

Step 3: Activate Fallback

async function callWithFallback(params: Anthropic.MessageCreateParams) {
  try {
    return await client.messages.create(params);
  } catch (err) {
    if (err instanceof Anthropic.APIError && (err.status === 529 || err.status === 500)) {
      // Try a different model
      if (params.model.includes('opus')) {
        return await client.messages.create({ ...params, model: 'claude-sonnet-4-20250514' });
      }
      if (params.model.includes('sonnet')) {
        return await client.messages.create({ ...params, model: 'claude-haiku-4-5-20251001' });
      }
    }
    throw err;
  }
}

Step 4: Communicate

  • Update your status page if user-facing
  • Note: Anthropic incidents typically resolve in 15-60 minutes

Step 5: Post-Incident

  • Check your error logs for the incident window
  • Calculate impact (failed requests, user impact)
  • Verify all systems recovered

Output

  • Incident confirmed via status page and direct API test
  • Severity classified (Low/Medium/High) based on symptoms
  • Fallback activated if needed (downgrade model or queue requests)
  • Impact assessed and documented post-incident

Error Handling

Error Cause Solution
API Error Check error type and status code See clade-common-errors

Examples

See Step 1 (curl status check and API test), Step 2 (severity classification table), Step 3 (fallback code with model downgrade), and Step 5 (post-incident checklist) above.

Resources

Next Steps

See clade-reliability-patterns for building resilient integrations.

Prerequisites

  • Production Claude integration deployed
  • Fallback model configuration in place (see clade-reliability-patterns)
  • Monitoring/alerting configured (see clade-observability)

Instructions

Step 1: Review the patterns below

Each section contains production-ready code examples. Copy and adapt them to your use case.

Step 2: Apply to your codebase

Integrate the patterns that match your requirements. Test each change individually.

Step 3: Verify

Run your test suite to confirm the integration works correctly.

Info
Category Development
Name clade-incident-runbook
Version v20260423
Size 3.07KB
Updated At 2026-04-28
Language