
Anthropic API Incident Runbook

v20260423
clade-incident-runbook
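This runbook gives developers a complete process for handling Anthropic API failure scenarios such as service outages, persistent errors (for example, 529), and rate limits. It covers the key steps, from checking the official status page and judging incident severity to activating model fallback (for example, switching from Opus to Sonnet), so that the system stays robust and available while the API is unstable.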

Anthropic Incident Runbook

Overview

Respond to Anthropic API incidents in production — outages, sustained 529 errors, authentication failures, and timeouts. Covers status page checking, severity classification, model fallback activation, communication, and post-incident review.

Step 1: Confirm the Issue

# Check Anthropic status
curl -s https://status.anthropic.com/api/v2/status.json | python3 -c "
import json, sys
d = json.load(sys.stdin)
print(f\"Status: {d['status']['description']} ({d['status']['indicator']})\")"

# Test API directly
curl -s -w "\nHTTP %{http_code} in %{time_total}s\n" \
  https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{"model":"claude-haiku-4-5-20251001","max_tokens":5,"messages":[{"role":"user","content":"ping"}]}'
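
For automated checks, the status lookup above can also be done from application code. The following is a minimal sketch, assuming the status page returns the Statuspage-style payload that the curl command reads (`status.indicator` and `status.description`). The names `interpretStatus`, `checkAnthropicStatus`, and `StatusVerdict` are hypothetical, and the mapping from indicator to verdict is an illustrative policy, not something the status page defines.

```typescript
// Shape of the payload fields that the curl command above prints.
interface StatusPayload {
  status: { indicator: string; description: string };
}

// Hypothetical coarse verdict used by this sketch; adapt to your own policy.
type StatusVerdict = "operational" | "degraded" | "outage" | "unknown";

// Map the status indicator to a verdict. Unrecognized values map to "unknown".
function interpretStatus(payload: StatusPayload): StatusVerdict {
  switch (payload.status.indicator) {
    case "none":
      return "operational";
    case "minor":
      return "degraded";
    case "major":
    case "critical":
      return "outage";
    default:
      return "unknown";
  }
}

// Fetch the status page and interpret it. Assumes a runtime with a global
// fetch (for example, Node 18 or later). An HTTP failure yields "unknown".
async function checkAnthropicStatus(
  url: string = "https://status.anthropic.com/api/v2/status.json",
): Promise<StatusVerdict> {
  const res = await fetch(url);
  if (!res.ok) return "unknown";
  return interpretStatus((await res.json()) as StatusPayload);
}
```

A verdict of "unknown" should not be read as healthy; pair it with the direct API test before deciding on severity.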

Step 2: Classify Severity

Symptom                            | Severity | Action
529 overloaded (intermittent)      | Low      | SDK auto-retries handle this
529 overloaded (sustained 5+ min)  | Medium   | Switch to fallback model
401/403 on all requests            | High     | API key issue; check console
All requests timing out            | High     | Check status page, activate fallback
Status page shows incident         | Varies   | Follow status page updates
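
The table above can be encoded as a small classifier so that alerting code applies the same rules every time. This is a sketch under stated assumptions: the `Observation` shape, the `classifySeverity` name, and the reading of "sustained" as an unbroken trailing run of 529 responses spanning at least five minutes are all illustrative choices, not part of any SDK.

```typescript
type Severity = "Low" | "Medium" | "High";

// One observed request outcome. "timeout" means no HTTP response arrived.
interface Observation {
  status: number | "timeout";
  timestampMs: number;
}

// Classify a batch of recent observations using the table's rules.
// Returns null when none of the listed symptoms is present.
function classifySeverity(
  observations: Observation[],
  sustainedMs: number = 5 * 60 * 1000, // "sustained 5+ min" from the table
): Severity | null {
  if (observations.length === 0) return null;
  const sorted = [...observations].sort((a, b) => a.timestampMs - b.timestampMs);

  // 401/403 on all requests, or all requests timing out: High.
  if (sorted.every((o) => o.status === 401 || o.status === 403)) return "High";
  if (sorted.every((o) => o.status === "timeout")) return "High";

  // Walk backwards over the most recent unbroken run of 529 responses.
  let runEndMs: number | null = null;
  let runStartMs: number | null = null;
  for (const o of [...sorted].reverse()) {
    if (o.status !== 529) break;
    if (runEndMs === null) runEndMs = o.timestampMs;
    runStartMs = o.timestampMs;
  }
  if (runStartMs !== null && runEndMs !== null && runEndMs - runStartMs >= sustainedMs) {
    return "Medium";
  }

  // Any other 529 is treated as intermittent.
  if (sorted.some((o) => o.status === 529)) return "Low";
  return null;
}
```

The "Status page shows incident" row is left out on purpose, since its severity varies and depends on the status page rather than on request outcomes.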

Step 3: Activate Fallback

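import Anthropic from '@anthropic-ai/sdk';

// Reads ANTHROPIC_API_KEY from the environment by default.
const client = new Anthropic();

async function callWithFallback(params: Anthropic.MessageCreateParams) {
  try {
    return await client.messages.create(params);
  } catch (err) {
    // Only fall back on server-side failures: 529 (overloaded) or 500 (internal error).
    if (err instanceof Anthropic.APIError && (err.status === 529 || err.status === 500)) {
      // Retry once on the next smaller model: Opus -> Sonnet, Sonnet -> Haiku.
      if (params.model.includes('opus')) {
        return await client.messages.create({ ...params, model: 'claude-sonnet-4-20250514' });
      }
      if (params.model.includes('sonnet')) {
        return await client.messages.create({ ...params, model: 'claude-haiku-4-5-20251001' });
      }
    }
    // Any other error, or no smaller model to try: surface the original error.
    throw err;
  }
}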
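
The function above falls back by one step only. If a single downgrade is not enough, the same idea can be extended to walk an ordered chain of models. The sketch below is one possible variant, not the SDK's behavior: `FALLBACK_CHAIN`, `fallbackCandidates`, and `callWithFallbackChain` are hypothetical names, and the actual API call is passed in as a callback so that the retry logic stays independent of the client.

```typescript
// Ordered from larger to smaller; these are the model IDs used in this runbook.
const FALLBACK_CHAIN: readonly string[] = [
  "claude-sonnet-4-20250514",
  "claude-haiku-4-5-20251001",
];

// Models to try, in order: the primary first, then only the chain entries
// after it, so a request never "falls back" to a larger model.
function fallbackCandidates(
  primary: string,
  chain: readonly string[] = FALLBACK_CHAIN,
): string[] {
  const idx = chain.indexOf(primary);
  return [primary, ...chain.slice(idx + 1)];
}

// True for the same statuses the runbook falls back on: 529 and 500.
function isFallbackStatus(err: unknown): boolean {
  const status = (err as { status?: number } | null)?.status;
  return status === 529 || status === 500;
}

// Try each candidate in turn. Errors that are not 529/500 are rethrown
// immediately; if every candidate fails, the last error is thrown.
async function callWithFallbackChain<T>(
  primary: string,
  call: (model: string) => Promise<T>,
  chain: readonly string[] = FALLBACK_CHAIN,
): Promise<T> {
  let lastError: unknown;
  for (const model of fallbackCandidates(primary, chain)) {
    try {
      return await call(model);
    } catch (err) {
      if (!isFallbackStatus(err)) throw err;
      lastError = err;
    }
  }
  throw lastError;
}
```

With the client from this step, a call might look like `callWithFallbackChain(params.model, (model) => client.messages.create({ ...params, model }))`. Keeping the chain as data makes it easy to change during an incident without touching the retry logic.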

Step 4: Communicate

  • Update your status page if user-facing
  • Note: Anthropic incidents typically resolve in 15-60 minutes; treat this as a rough guide and follow the status page for the actual timeline

Step 5: Post-Incident

  • Check your error logs for the incident window
  • Calculate impact (failed requests, user impact)
  • Verify all systems recovered
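
The impact calculation in the checklist above can be sketched as a small function over your request logs. The `RequestLog` shape and the `summarizeImpact` name are assumptions for illustration; map them onto whatever your logging pipeline actually records. Here a request counts as failed when its status is 400 or higher.

```typescript
// Hypothetical log record; adapt the fields to your own logging schema.
interface RequestLog {
  timestampMs: number;
  status: number;
  userId: string;
}

interface ImpactSummary {
  total: number;         // requests inside the incident window
  failed: number;        // requests with status >= 400
  errorRate: number;     // failed / total, or 0 when total is 0
  affectedUsers: number; // distinct users with at least one failure
}

// Summarize failures within the inclusive window [startMs, endMs].
function summarizeImpact(
  logs: RequestLog[],
  startMs: number,
  endMs: number,
): ImpactSummary {
  const inWindow = logs.filter(
    (l) => l.timestampMs >= startMs && l.timestampMs <= endMs,
  );
  const failures = inWindow.filter((l) => l.status >= 400);
  const affectedUsers = new Set(failures.map((l) => l.userId)).size;
  return {
    total: inWindow.length,
    failed: failures.length,
    errorRate: inWindow.length === 0 ? 0 : failures.length / inWindow.length,
    affectedUsers,
  };
}
```

Running the same function over a window after the incident gives a quick recovery check: the error rate should return to its normal baseline.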

Output

  • Incident confirmed via status page and direct API test
  • Severity classified (Low/Medium/High) based on symptoms
  • Fallback activated if needed (downgrade model or queue requests)
  • Impact assessed and documented post-incident

Error Handling

Error     | Cause                             | Solution
API Error | Check error type and status code  | See clade-common-errors

Examples

See Step 1 (curl status check and API test), Step 2 (severity classification table), Step 3 (fallback code with model downgrade), and Step 5 (post-incident checklist) above.

Next Steps

See clade-reliability-patterns for building resilient integrations.

Prerequisites

  • Production Claude integration deployed
  • Fallback model configuration in place (see clade-reliability-patterns)
  • Monitoring/alerting configured (see clade-observability)

Instructions

Step 1: Review the patterns above

Each section contains production-ready code examples. Copy and adapt them to your use case.

Step 2: Apply to your codebase

Integrate the patterns that match your requirements. Test each change individually.

Step 3: Verify

Run your test suite to confirm the integration works correctly.

Info
Category: Programming & Development
Name: clade-incident-runbook
Version: v20260423
Size: 3.07KB
Updated: 2026-04-28