Skills Development Vercel Incident Response Runbook

Vercel Incident Response Runbook

v20260423
vercel-incident-runbook
A comprehensive, step-by-step guide for responding to Vercel-related production outages, deployment failures, and platform issues. This runbook covers rapid triage using CLI commands, executing instant rollbacks, investigating root causes (log analysis, diffing), managing communication templates, and conducting thorough postmortem reviews.
Get Skill
59 downloads
Overview

Vercel Incident Runbook

Overview

Step-by-step incident response for Vercel deployment failures, function errors, and platform outages. Covers rapid triage, instant rollback, communication templates, and postmortem procedures.

Prerequisites

  • Access to Vercel dashboard and CLI
  • Access to Vercel status page (vercel-status.com)
  • Communication channels (Slack, PagerDuty) configured
  • Log drain or runtime log access

Instructions

Step 1: Rapid Triage (First 5 Minutes)

# 1. Check if it's a Vercel platform issue
curl -s "https://www.vercel-status.com/api/v2/summary.json" \
  | jq '.status.description, [.components[] | select(.status != "operational") | {name, status}]'

# 2. Check current production deployment status
vercel ls --prod
vercel inspect $(vercel ls --prod --json | jq -r '.[0].url')

# 3. Check recent deployments — did a deploy just happen?
curl -s -H "Authorization: Bearer $VERCEL_TOKEN" \
  "https://api.vercel.com/v6/deployments?target=production&limit=5&projectId=prj_xxx" \
  | jq '.deployments[] | {uid, state, createdAt: (.createdAt/1000 | todate), url}'

# 4. Check function logs for errors
vercel logs $(vercel ls --prod --json | jq -r '.[0].url') --level=error --limit=20

Step 2: Decision Tree

Is vercel-status.com showing an incident?
├── YES → Vercel platform issue
│   ├── Subscribe to updates on status page
│   ├── Post internal status: "Vercel platform incident — monitoring"
│   └── No action needed from us — wait for Vercel resolution
│
└── NO → Issue is in our deployment
    ├── Did a deployment happen in the last 30 minutes?
    │   ├── YES → Likely deployment regression
    │   │   └── ROLLBACK immediately (Step 3)
    │   └── NO → Application-level issue
    │       ├── Check function logs for new errors
    │       ├── Check external dependency status (DB, APIs)
    │       └── Investigate and hotfix (Step 4)
    │
    └── Is the issue region-specific?
        ├── YES → Check function regions, possible edge issue
        └── NO → Global issue, check code and env vars

Step 3: Instant Rollback (< 30 Seconds)

# Option A: Rollback to previous production deployment (fastest)
vercel rollback
# This instantly swaps production traffic — no rebuild needed

# Option B: Rollback to a specific known-good deployment
vercel rollback dpl_xxxxxxxxxxxx

# Option C: Via API (for automation/PagerDuty integration)
curl -X POST "https://api.vercel.com/v9/projects/my-app/promote" \
  -H "Authorization: Bearer $VERCEL_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"deploymentId": "dpl_known_good_id"}'

# Verify rollback succeeded
vercel ls --prod
curl -s https://yourdomain.com/api/health | jq .

Step 4: Investigate Root Cause

# Collect evidence while it's fresh
mkdir incident-$(date +%Y%m%d)
cd incident-$(date +%Y%m%d)

# Function logs around the incident time
vercel logs https://yourdomain.com --limit=200 > function-logs.txt

# Deployment diff — what changed?
curl -s -H "Authorization: Bearer $VERCEL_TOKEN" \
  "https://api.vercel.com/v13/deployments/dpl_broken" \
  | jq '.meta' > broken-deployment-meta.json

# Compare env vars between working and broken deployments
vercel env ls > env-vars.txt

# Check git diff between last good and broken commit
git log --oneline -10
git diff dpl_good_commit..dpl_broken_commit -- api/ src/

Step 5: Enable Maintenance Page (If Needed)

// vercel.json — temporary maintenance mode via rewrite
{
  "rewrites": [
    {
      "source": "/((?!_next|api/health).*)",
      "destination": "/maintenance.html"
    }
  ]
}
<!-- public/maintenance.html -->
<!DOCTYPE html>
<html>
<head><title>Maintenance</title></head>
<body>
  <h1>We'll be right back</h1>
  <p>We're performing scheduled maintenance. Please check back shortly.</p>
</body>
</html>

Step 6: Communication Templates

Internal — Slack (Incident Start)

:rotating_light: INCIDENT: [Project Name] production issue detected
Status: Investigating
Impact: [Description of user impact]
Start time: [UTC timestamp]
On-call: @[engineer]
Thread: replies here

Internal — Slack (Mitigation)

:white_check_mark: MITIGATED: [Project Name]
Action: Rolled back to deployment dpl_xxx
Impact duration: [X minutes]
Root cause: [Brief description]
Postmortem: [link] scheduled for [date]

External — Status Page

Title: Degraded performance on [service]
Body: We are investigating reports of [issue]. Some users may experience
[impact]. Our team is actively working on a resolution.
Update: The issue has been resolved. [Brief root cause].

Step 7: Postmortem Template

# Incident Postmortem: [Title]

## Summary
- Duration: [start] to [end] ([X minutes])
- Impact: [users/requests affected]
- Severity: [P1/P2/P3]

## Timeline (UTC)
- HH:MM — [event]
- HH:MM — Alert fired
- HH:MM — On-call acknowledged
- HH:MM — Root cause identified
- HH:MM — Rollback executed
- HH:MM — Service restored

## Root Cause
[What broke and why]

## Resolution
[What was done to fix it]

## Action Items
- [ ] [Preventive action] — Owner: @xxx — Due: [date]
- [ ] [Detection improvement] — Owner: @xxx — Due: [date]
- [ ] [Process improvement] — Owner: @xxx — Due: [date]

Incident Severity Levels

Severity Definition Response Time Rollback?
P1 Production down, all users affected < 5 min Immediate
P2 Degraded, some users affected < 15 min If not fixable in 30 min
P3 Minor issue, workaround exists < 1 hour No
P4 Cosmetic or non-urgent Next business day No

Output

  • Incident categorized and triaged within 5 minutes
  • Instant rollback executed if deployment regression detected
  • Communication sent to internal and external stakeholders
  • Postmortem scheduled with action items

Error Handling

Scenario Action
Vercel status page shows incident Monitor, communicate, no deployment changes
vercel rollback fails Use API promotion: POST to /v9/projects/.../promote
Rollback deployment also broken Deploy from a known-good git tag
Cannot access Vercel dashboard Use CLI with saved VERCEL_TOKEN
Log retention expired Check external log drain provider

Resources

Next Steps

For data handling and compliance, see vercel-data-handling.

Info
Category Development
Name vercel-incident-runbook
Version v20260423
Size 4.85KB
Updated At 2026-04-28
Language