技能 编程开发 Intercom故障应急处理手册

Intercom故障应急处理手册

v20260423
intercom-incident-runbook
本手册提供了一套完整的Intercom故障应急处理流程。它指导技术团队进行系统性的故障排查,涵盖了从初步分级、使用Shell脚本检查API状态码和限流情况,到实施降级模式,再到撰写详细的事件回顾报告(Postmortem),确保在Intercom服务中断时能够系统高效地恢复业务。
获取技能
269 次下载
概览

Intercom Incident Runbook

Overview

Rapid incident response procedures for Intercom integration failures, including triage by HTTP status code, mitigation steps, and postmortem template.

Severity Levels

Level Definition Response Time Example
P1 All Intercom API calls failing < 15 min 401 auth failures, API unreachable
P2 Degraded service < 1 hour High latency, rate limited (429)
P3 Partial impact < 4 hours Webhook delays, search timeouts
P4 No user impact Next business day Monitoring gaps, stale cache

Quick Triage (Copy-Paste)

#!/bin/bash
echo "=== Intercom Incident Triage ==="

# 1. Is Intercom's API responding?
echo -n "1. API reachable: "
curl -s -o /dev/null -w "%{http_code}" \
  -H "Authorization: Bearer $INTERCOM_ACCESS_TOKEN" \
  https://api.intercom.io/me
echo ""

# 2. Is there a platform-wide incident?
echo -n "2. Intercom status: "
curl -s https://status.intercom.com/api/v2/status.json | jq -r '.status.description'

# 3. Active incidents on Intercom's side?
echo -n "3. Active incidents: "
curl -s https://status.intercom.com/api/v2/incidents/unresolved.json | jq '.incidents | length'

# 4. Rate limit status
echo -n "4. Rate limit remaining: "
curl -s -D - -o /dev/null \
  -H "Authorization: Bearer $INTERCOM_ACCESS_TOKEN" \
  https://api.intercom.io/me 2>/dev/null | grep -i x-ratelimit-remaining | awk '{print $2}'

# 5. Our health check
echo -n "5. Our integration health: "
curl -s https://your-app.com/health | jq '.services.intercom.status' 2>/dev/null || echo "UNKNOWN"

Decision Tree

API returning errors?
├── YES ──▶ Check status.intercom.com
│           ├── Incident reported ──▶ Intercom's problem
│           │   → Enable graceful degradation
│           │   → Monitor for resolution
│           │   → No action needed on our side
│           └── No incident ──▶ Our integration issue
│               ├── 401 → Token expired/revoked → Rotate token
│               ├── 403 → Scope missing → Add OAuth scope
│               ├── 429 → Rate limited → Enable queue/backoff
│               └── 5xx → Server error → Retry with backoff
└── NO ──▶ Is our service healthy?
           ├── YES → Resolved or intermittent → Monitor
           └── NO → Our infrastructure issue
               → Check pods, memory, network, DNS

Mitigation by Error Type

401 - Authentication Failed

# Verify token is valid
curl -s -H "Authorization: Bearer $INTERCOM_ACCESS_TOKEN" \
  https://api.intercom.io/me | jq '.type'
# Expected: "admin"
# If error: Token is invalid or revoked

# IMMEDIATE: Regenerate token
# Developer Hub > Your App > Authentication > Generate new token
# Update in secret manager:
aws secretsmanager update-secret \
  --secret-id intercom/production/token \
  --secret-string "new_token_here"

# Restart application to pick up new token
kubectl rollout restart deployment/intercom-service

429 - Rate Limited

# Check rate limit headers
curl -s -D - -o /dev/null \
  -H "Authorization: Bearer $INTERCOM_ACCESS_TOKEN" \
  https://api.intercom.io/me 2>/dev/null | grep -i "x-ratelimit"

# Immediate: Reduce request volume
# - Pause any batch/sync jobs
# - Enable request queuing if available

# Check if multiple apps are consuming workspace quota
# Limit: 25,000 req/min per workspace across all apps

5xx - Intercom Server Errors

# 1. Check Intercom status
curl -s https://status.intercom.com/api/v2/status.json | jq

# 2. Enable graceful degradation
# Your app should serve cached data or fallback UI

# 3. Track request_id from error responses for Intercom support
# Error response includes: { "request_id": "req_abc123" }

Graceful Degradation Pattern

import { IntercomClient, IntercomError } from "intercom-client";
import { LRUCache } from "lru-cache";

const cache = new LRUCache<string, any>({ max: 10000, ttl: 3600000 }); // 1hr fallback

async function getContactWithFallback(contactId: string): Promise<any> {
  try {
    const contact = await client.contacts.find({ contactId });
    cache.set(contactId, contact); // Update cache on success
    return contact;
  } catch (err) {
    if (err instanceof IntercomError && (err.statusCode === 429 || (err.statusCode ?? 0) >= 500)) {
      // Return stale cached data during outages
      const cached = cache.get(contactId);
      if (cached) {
        console.warn(`[Intercom] Serving cached data for ${contactId} due to ${err.statusCode}`);
        return { ...cached, _stale: true };
      }
    }
    throw err;
  }
}

Communication Templates

Internal Slack

[P1] INCIDENT: Intercom Integration
Status: INVESTIGATING
Impact: [Customer conversations not loading / messages not sending]
Cause: [Intercom API returning 5xx / our token expired / rate limited]
Action: [Enabling fallback / rotating token / pausing sync jobs]
Next update: [Time]
Commander: @[name]

Postmortem Template

## Incident: Intercom [Type]
**Date:** YYYY-MM-DD HH:MM - HH:MM UTC
**Duration:** X hours Y minutes
**Severity:** P[1-4]
**Intercom request_ids:** [req_abc123, req_def456]

### Summary
[1-2 sentences describing what happened and user impact]

### Timeline
- HH:MM - First alert: [what triggered]
- HH:MM - Triage started: [findings]
- HH:MM - Mitigation: [action taken]
- HH:MM - Resolution: [what fixed it]

### Root Cause
[Technical explanation of why it happened]

### Impact
- Conversations affected: N
- Users unable to reach support: N
- Duration of degraded service: Xm

### Action Items
- [ ] [Preventive measure] - Owner - Due
- [ ] [Monitoring gap to fill] - Owner - Due
- [ ] [Documentation to update] - Owner - Due

Error Handling

Issue Cause Solution
Triage script fails Token not set Export INTERCOM_ACCESS_TOKEN
Status page unreachable DNS/network Try mobile network or VPN
Can't rotate token No Developer Hub access Escalate to workspace admin
Cache empty during outage No pre-warming Implement cache warming job

Resources

Next Steps

For data handling compliance, see intercom-data-handling.

信息
Category 编程开发
Name intercom-incident-runbook
版本 v20260423
大小 7.13KB
更新时间 2026-04-28
语言