
Anthropic API Incident Runbook

v20260423
anth-incident-runbook
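This runbook is a guide to diagnosing and resolving outages, degraded performance, and rate limiting when using the Claude API. It provides a structured P1 to P4 severity response process, a decision tree, and mitigation steps so engineers can restore service quickly and keep the system stable.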

Anthropic Incident Runbook

Severity Classification

| Severity | Condition | Response Time |
| --- | --- | --- |
| P1 | API returning 500/529 for all requests | Immediate |
| P2 | Rate limiting (429) or high latency (>10s p99) | 15 minutes |
| P3 | Intermittent errors (<5% error rate) | 1 hour |
| P4 | Degraded quality (not errors) | Next business day |
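The severity table can be turned into a small classifier for alerting. The following is a minimal sketch, not part of the original runbook: the `Snapshot` fields and the `classify_severity` name are hypothetical, and the table does not say how to rate a partial outage with an error rate of 5% or more, so the sketch assumes P2 for that case.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Snapshot:
    """Hypothetical health snapshot; field names are illustrative only."""
    error_rate: float               # fraction of requests failing with 500/529 (0.0 to 1.0)
    rate_limited: bool              # True if 429 responses are being observed
    p99_latency_s: float            # p99 latency in seconds
    quality_degraded: bool = False  # responses succeed but output quality is worse


def classify_severity(s: Snapshot) -> Optional[str]:
    """Return "P1".."P4" following the severity table, or None when healthy."""
    if s.error_rate >= 1.0:
        return "P1"  # every request is failing
    if s.rate_limited or s.p99_latency_s > 10:
        return "P2"
    if s.error_rate >= 0.05:
        # Assumption: the table leaves partial outages (5% or more) unassigned;
        # this sketch escalates them to P2.
        return "P2"
    if s.error_rate > 0:
        return "P3"  # intermittent errors below 5%
    if s.quality_degraded:
        return "P4"
    return None
```

The order of the checks matters: the most severe condition that matches wins, so a rate-limited system with a 2% error rate is reported as P2, not P3.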

Immediate Triage (First 5 Minutes)

# 1. Check Anthropic status page
curl -s https://status.anthropic.com/api/v2/status.json | python3 -c \
  "import sys,json; d=json.load(sys.stdin); print(d['status']['indicator'], '-', d['status']['description'])"

# 2. Test API connectivity
#    Use a model ID that is enabled for your account; an unknown ID returns
#    404, which says nothing about whether the API itself is healthy.
curl -s -w "\nHTTP %{http_code} | Time: %{time_total}s\n" \
  https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{"model":"claude-haiku-4-5-20251001","max_tokens":8,"messages":[{"role":"user","content":"1"}]}'

# 3. Check rate limit headers (-D - dumps response headers to stdout)
curl -s -D - https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{"model":"claude-haiku-4-5-20251001","max_tokens":8,"messages":[{"role":"user","content":"1"}]}' \
  2>/dev/null | grep -iE "ratelimit|retry-after|request-id"
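When the same check runs inside application code rather than a shell, the header filtering in step 3 might look like the sketch below. The function names and the 30-second default are assumptions made for illustration (30s is the low end of the 30-60s wait this runbook suggests for 529), and the sample header values in the usage note are made up.

```python
from typing import Dict, Mapping

# Same substrings the grep in step 3 matches, compared case-insensitively
RATE_LIMIT_MARKERS = ("ratelimit", "retry-after", "request-id")


def extract_rate_limit_headers(headers: Mapping[str, str]) -> Dict[str, str]:
    """Keep only headers whose name contains one of the markers above."""
    return {
        name.lower(): value
        for name, value in headers.items()
        if any(marker in name.lower() for marker in RATE_LIMIT_MARKERS)
    }


def retry_after_seconds(headers: Mapping[str, str], default: float = 30.0) -> float:
    """Parse retry-after as seconds; fall back to `default` if absent or not numeric."""
    raw = extract_rate_limit_headers(headers).get("retry-after")
    if raw is None:
        return default
    try:
        return float(raw)
    except ValueError:
        # retry-after may also be an HTTP date; this sketch does not parse that form
        return default
```

For example, passing `{"Retry-After": "12", "request-id": "req_example", "Content-Type": "application/json"}` keeps the first two headers and `retry_after_seconds` returns `12.0`. Logging the `request-id` value is worthwhile because the postmortem template asks for request IDs.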

Decision Tree

API returning errors?
├── 401/403 → Key issue → Check ANTHROPIC_API_KEY is set and valid
├── 429 → Rate limited → Check headers, reduce traffic, wait for retry-after
├── 500 → Server error → Check status.anthropic.com, retry with backoff
├── 529 → Overloaded → Temporary, retry after 30-60s
└── Timeouts → Network or long generation → Increase timeout, check max_tokens
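The decision tree above maps directly onto a lookup that an on-call bot or log annotator could use. This is only a sketch: `triage_action` is a hypothetical name, `None` is used here to represent a timed-out request, and the final branch covers status codes the tree does not mention.

```python
from typing import Optional


def triage_action(status_code: Optional[int]) -> str:
    """Map an HTTP status (None means the request timed out) to the runbook's next step."""
    if status_code is None:
        return "Timeout: increase the client timeout and check max_tokens"
    if status_code in (401, 403):
        return "Key issue: check ANTHROPIC_API_KEY is set and valid"
    if status_code == 429:
        return "Rate limited: check headers, reduce traffic, wait for retry-after"
    if status_code == 500:
        return "Server error: check status.anthropic.com, retry with backoff"
    if status_code == 529:
        return "Overloaded: temporary, retry after 30-60s"
    # Anything else is outside the decision tree
    return "Not covered by the decision tree: inspect the response body and request-id"
```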

Mitigation Actions

Rate Limiting (429)

# Immediate: reduce traffic
# 1. Enable circuit breaker
# 2. Queue non-critical requests
# 3. Switch to Message Batches for bulk work
# 4. Reduce max_tokens to shorten generation time
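Step 1 mentions a circuit breaker without showing one. A minimal sketch follows, assuming a simple consecutive-failure policy; the class name, the default thresholds, and the injectable `clock` parameter are all illustrative choices, not something the runbook prescribes.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Opens after `max_failures` consecutive failures, then rejects calls
    until `cooldown_s` seconds have passed (defaults are illustrative)."""

    def __init__(
        self,
        max_failures: int = 5,
        cooldown_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow_request(self) -> bool:
        """Return True if a request may be sent right now."""
        if self._opened_at is None:
            return True  # breaker is closed
        # Once the cooldown has elapsed, trial requests are allowed again;
        # a failure re-opens the breaker and a success closes it.
        return self._clock() - self._opened_at >= self.cooldown_s

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.max_failures:
            self._opened_at = self._clock()  # open, or re-open, the breaker
```

A caller would check `allow_request()` before each API call, queue or drop the request when it returns `False`, and report the outcome with `record_success()` or `record_failure()`. Counting only 429 and 5xx responses as failures keeps caller bugs such as 400s from tripping the breaker.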

API Outage (500/529)

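# Graceful degradation: return a fallback message while the API is unavailable
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def get_response_with_fallback(prompt: str) -> str:
    try:
        msg = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        return msg.content[0].text
    except (anthropic.APIConnectionError, anthropic.APIStatusError) as err:
        # Only connection failures and 5xx responses (including 529) count as
        # an outage; 4xx errors are caller bugs and should surface.
        status = getattr(err, "status_code", None)
        if status is not None and status < 500:
            raise
        return "Our AI assistant is temporarily unavailable. Please try again shortly."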
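The decision tree recommends retrying 500s with backoff before giving up. One way to sketch that is a generic helper that wraps any callable; the helper name, its parameters, and the full-jitter policy below are assumptions chosen for illustration. The Anthropic Python SDK also retries some errors on its own (see the client's `max_retries` option), so a wrapper like this is mainly useful when finer control is needed.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    call: Callable[[], T],
    is_retryable: Callable[[Exception], bool],
    max_attempts: int = 4,
    base_delay_s: float = 1.0,
    max_delay_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call()` until it succeeds, the error is not retryable, or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception as err:
            if attempt == max_attempts or not is_retryable(err):
                raise
            # Exponential ceiling (1s, 2s, 4s, ...) capped at max_delay_s,
            # with a random wait in [0, ceiling] to avoid synchronized retries
            ceiling = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))
    raise ValueError("max_attempts must be at least 1")
```

With the SDK, `is_retryable` could be something like `lambda e: getattr(e, "status_code", 0) >= 500`, so that 4xx errors fail immediately. The `sleep` parameter exists only so the delay can be replaced in tests.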

Key Compromise

# 1. Immediately revoke key at console.anthropic.com
# 2. Generate new key
# 3. Deploy new key to all environments
# 4. Audit recent usage for unauthorized calls
# 5. File incident report

Postmortem Template

## Incident: [Title]
- **Duration:** [start] to [end]
- **Severity:** P[1-4]
- **Impact:** [what users experienced]
- **Root Cause:** [what went wrong]
- **Detection:** [how we found out]
- **Mitigation:** [what we did to fix it]
- **Request IDs:** [from debug logs]
- **Action Items:**
  - [ ] [preventive measure 1]
  - [ ] [preventive measure 2]

Error Handling

| Symptom | Likely Cause | Quick Fix |
| --- | --- | --- |
| All requests fail 401 | Key rotated/expired | Check Console for active keys |
| Sudden 429 spike | Traffic burst or tier change | Check rate limit headers |
| Slow responses (>10s) | Large max_tokens or complex prompt | Reduce max_tokens, use Haiku |
| Intermittent 500s | Upstream API issue | Check status.anthropic.com |

Resources

Next Steps

For data compliance, see anth-data-handling.

Info
Category Hardware Engineering
Name anth-incident-runbook
Version v20260423
Size 4.29KB
Updated 2026-04-28