Category: Engineering Team
Tier: POWERFUL
Author: Claude Skills Team
Version: 1.0.0
Last Updated: February 2026
Incident response framework for availability/reliability incidents (outages, degradations, failed deploys): severity classification, timeline reconstruction, and post-incident review.
This is NOT security incident triage. For security events (ransomware, intrusion, data exfiltration, IOC analysis, NIST SP 800-61 forensics), route to incident-response. Both skills use SEV1-SEV4 labels; this one scores operational impact (users, revenue, SLA), while incident-response classifies attack types and forensic handling.
Incident Classifier (incident_classifier.py)
Timeline Reconstructor (timeline_reconstructor.py)
PIR Generator (pir_generator.py)
Definition: Complete service failure affecting all users or critical business functions
Characteristics:
Response Requirements:
Communication Frequency: Every 15 minutes until resolution
Definition: Significant degradation affecting subset of users or non-critical functions
Characteristics:
Response Requirements:
Communication Frequency: Every 30 minutes during active response
Definition: Limited impact with workarounds available
Characteristics:
Response Requirements:
Communication Frequency: At key milestones only
Definition: Minimal impact, cosmetic issues, or planned maintenance
Characteristics:
Response Requirements:
Communication Frequency: Standard development cycle updates
Command and Control
Communication Hub
Process Management
Post-Incident Leadership
Emergency Decisions (SEV1/2):
Resource Allocation:
Technical Decisions:
Subject: [SEV{severity}] {Service Name} - {Brief Description}
Incident Details:
- Start Time: {timestamp}
- Severity: SEV{level}
- Impact: {user impact description}
- Current Status: {investigating/mitigating/resolved}
Technical Details:
- Affected Services: {service list}
- Symptoms: {what users are experiencing}
- Initial Assessment: {suspected root cause if known}
Response Team:
- Incident Commander: {name}
- Technical Lead: {name}
- SMEs Engaged: {list}
Next Update: {timestamp}
Status Page: {link}
War Room: {bridge/chat link}
---
{Incident Commander Name}
{Contact Information}
Subject: URGENT - Customer-Impacting Outage - {Service Name}
Executive Summary:
{2-3 sentence description of customer impact and business implications}
Key Metrics:
- Time to Detection: {X minutes}
- Time to Engagement: {X minutes}
- Estimated Customer Impact: {number/percentage}
- Current Status: {status}
- ETA to Resolution: {time or "investigating"}
Leadership Actions Required:
- [ ] Customer communication approval
- [ ] PR/Communications coordination
- [ ] Resource allocation decisions
- [ ] External vendor engagement
Incident Commander: {name} ({contact})
Next Update: {time}
---
This is an automated alert from our incident response system.
We are currently experiencing {brief description of issue} affecting {scope of impact}.
Our engineering team was alerted at {time} and is actively working to resolve the issue. We will provide updates every {frequency} until resolved.
What we know:
- {factual statement of impact}
- {factual statement of scope}
- {brief status of response}
What we're doing:
- {primary response action}
- {secondary response action}
Workaround (if available):
{workaround steps or "No workaround currently available"}
We apologize for the inconvenience and will share more information as it becomes available.
Next update: {time}
Status page: {link}
Internal Stakeholders:
External Stakeholders:
| Stakeholder | SEV1 | SEV2 | SEV3 | SEV4 |
|---|---|---|---|---|
| Engineering Leadership | Real-time | 30min | 4hrs | Daily |
| Executive Team | 15min | 1hr | EOD | Weekly |
| Customer Support | Real-time | 30min | 2hrs | As needed |
| Customers | 15min | 1hr | Optional | None |
| Partners | 30min | 2hrs | Optional | None |
Detection Playbooks
Response Playbooks
Recovery Playbooks
# {Service/Component} Incident Response Runbook
## Quick Reference
- **Severity Indicators:** {list of conditions for each severity level}
- **Key Contacts:** {on-call rotations and escalation paths}
- **Critical Commands:** {list of emergency commands with descriptions}
## Detection
### Monitoring Alerts
- {Alert name}: {description and thresholds}
- {Alert name}: {description and thresholds}
### Manual Detection Signs
- {Symptom}: {what to look for and where}
- {Symptom}: {what to look for and where}
## Initial Response (0-15 minutes)
1. **Assess Severity**
- [ ] Check {primary metric}
- [ ] Verify {secondary indicator}
- [ ] Classify as SEV{level} based on {criteria}
2. **Establish Command**
- [ ] Page Incident Commander if SEV1/2
- [ ] Create incident tracking ticket
- [ ] Join war room: {link/bridge info}
3. **Initial Investigation**
- [ ] Check recent deployments: {deployment log location}
- [ ] Review error logs: {log location and queries}
- [ ] Verify dependencies: {dependency check commands}
## Mitigation Strategies
### Strategy 1: {Name}
**Use when:** {conditions}
**Steps:**
1. {detailed step with commands}
2. {detailed step with expected outcomes}
3. {validation step}
**Rollback Plan:**
1. {rollback step}
2. {verification step}
### Strategy 2: {Name}
{similar structure}
## Recovery and Validation
1. **Service Restoration**
- [ ] {restoration step}
- [ ] Wait for {metric} to return to normal
- [ ] Validate end-to-end functionality
2. **Communication**
- [ ] Update status page
- [ ] Notify stakeholders
- [ ] Schedule PIR
## Common Pitfalls
- **{Pitfall}:** {description and how to avoid}
- **{Pitfall}:** {description and how to avoid}
## Reference Information
→ See references/reference-information.md for details
## Usage Examples
### Example 1: Database Connection Pool Exhaustion
```bash
# Classify the incident
echo '{"description": "Users reporting 500 errors, database connections timing out", "affected_users": "80%", "business_impact": "high"}' | python scripts/incident_classifier.py
# Reconstruct timeline from logs
python scripts/timeline_reconstructor.py --input assets/sample_timeline_events.json --output timeline.md
# Generate PIR after resolution
python scripts/pir_generator.py --incident assets/sample_incident_data.json --timeline timeline.md --output pir.md
# Quick classification from stdin
echo "API rate limits causing customer API calls to fail" | python scripts/incident_classifier.py --format text
# Build timeline from multiple sources
python scripts/timeline_reconstructor.py --input assets/simple_timeline_events.json --detect-phases --gap-analysis
# Generate comprehensive PIR
python scripts/pir_generator.py --incident assets/sample_incident_pir_data.json --rca-method fishbone --action-items
Maintain Calm Leadership
Document Everything
Effective Communication
Technical Excellence
Blameless Culture
Action Item Discipline
Knowledge Sharing
Continuous Improvement