故障事件指挥与管理框架

v20260612

incident-commander

本框架提供了一套完整的运营故障事件响应流程，覆盖从故障检测到根因分析和总结的全过程。它指导团队进行严重性分类、重建事件时间线，并明确了事件指挥官（IC）的角色职责。适用于服务中断、性能严重下降等高可靠性场景。

故障管理 SRE DevOps 宕机可靠性故障响应事后复盘运维

259 次下载

概览

Incident Commander Skill

Category: Engineering Team
Tier: POWERFUL
Author: Claude Skills Team
Version: 1.0.0
Last Updated: February 2026

Overview

Incident response framework for availability/reliability incidents (outages, degradations, failed deploys): severity classification, timeline reconstruction, and post-incident review.

This is NOT security incident triage. For security events (ransomware, intrusion, data exfiltration, IOC analysis, NIST SP 800-61 forensics), route to incident-response. Both skills use SEV1-SEV4 labels; this one scores operational impact (users, revenue, SLA), while incident-response classifies attack types and forensic handling.

Key Features

Automated Severity Classification - Intelligent incident triage based on impact and urgency metrics
Timeline Reconstruction - Transform scattered logs and events into coherent incident narratives
Post-Incident Review Generation - Structured PIRs with multiple RCA frameworks
Communication Templates - Pre-built templates for stakeholder updates and escalations
Runbook Integration - Generate actionable runbooks from incident patterns

Skills Included

Core Tools

Incident Classifier (incident_classifier.py)
- Analyzes incident descriptions and outputs severity levels
- Recommends response teams and initial actions
- Generates communication templates based on severity
Timeline Reconstructor (timeline_reconstructor.py)
- Processes timestamped events from multiple sources
- Reconstructs chronological incident timeline
- Identifies gaps and provides duration analysis
PIR Generator (pir_generator.py)
- Creates comprehensive Post-Incident Review documents
- Applies multiple RCA frameworks (5 Whys, Fishbone, Timeline)
- Generates actionable follow-up items

Incident Response Framework

Severity Classification System

SEV1 - Critical Outage

Definition: Complete service failure affecting all users or critical business functions

Characteristics:

Customer-facing services completely unavailable
Data loss or corruption affecting users
Security breaches with customer data exposure
Revenue-generating systems down
SLA violations with financial penalties

Response Requirements:

Immediate escalation to on-call engineer
Incident Commander assigned within 5 minutes
Executive notification within 15 minutes
Public status page update within 15 minutes
War room established
All hands on deck if needed

Communication Frequency: Every 15 minutes until resolution

SEV2 - Major Impact

Definition: Significant degradation affecting subset of users or non-critical functions

Characteristics:

Partial service degradation (>25% of users affected)
Performance issues causing user frustration
Non-critical features unavailable
Internal tools impacting productivity
Data inconsistencies not affecting user experience

Response Requirements:

On-call engineer response within 15 minutes
Incident Commander assigned within 30 minutes
Status page update within 30 minutes
Stakeholder notification within 1 hour
Regular team updates

Communication Frequency: Every 30 minutes during active response

SEV3 - Minor Impact

Definition: Limited impact with workarounds available

Characteristics:

Single feature or component affected
<25% of users impacted
Workarounds available
Performance degradation not significantly impacting UX
Non-urgent monitoring alerts

Response Requirements:

Response within 2 hours during business hours
Next business day response acceptable outside hours
Internal team notification
Optional status page update

Communication Frequency: At key milestones only

SEV4 - Low Impact

Definition: Minimal impact, cosmetic issues, or planned maintenance

Characteristics:

Cosmetic bugs
Documentation issues
Logging or monitoring gaps
Performance issues with no user impact
Development/test environment issues

Response Requirements:

Response within 1-2 business days
Standard ticket/issue tracking
No special escalation required

Communication Frequency: Standard development cycle updates

Incident Commander Role

Primary Responsibilities

Command and Control
- Own the incident response process
- Make critical decisions about resource allocation
- Coordinate between technical teams and stakeholders
- Maintain situational awareness across all response streams
Communication Hub
- Provide regular updates to stakeholders
- Manage external communications (status pages, customer notifications)
- Facilitate effective communication between response teams
- Shield responders from external distractions
Process Management
- Ensure proper incident tracking and documentation
- Drive toward resolution while maintaining quality
- Coordinate handoffs between team members
- Plan and execute rollback strategies if needed
Post-Incident Leadership
- Ensure thorough post-incident reviews are conducted
- Drive implementation of preventive measures
- Share learnings with broader organization

Decision-Making Framework

Emergency Decisions (SEV1/2):

Incident Commander has full authority
Bias toward action over analysis
Document decisions for later review
Consult subject matter experts but don't get blocked

Resource Allocation:

Can pull in any necessary team members
Authority to escalate to senior leadership
Can approve emergency spend for external resources
Make call on communication channels and timing

Technical Decisions:

Lean on technical leads for implementation details
Make final calls on trade-offs between speed and risk
Approve rollback vs. fix-forward strategies
Coordinate testing and validation approaches

Communication Templates

Initial Incident Notification (SEV1/2)

Subject: [SEV{severity}] {Service Name} - {Brief Description}

Incident Details:
- Start Time: {timestamp}
- Severity: SEV{level}
- Impact: {user impact description}
- Current Status: {investigating/mitigating/resolved}

Technical Details:
- Affected Services: {service list}
- Symptoms: {what users are experiencing}
- Initial Assessment: {suspected root cause if known}

Response Team:
- Incident Commander: {name}
- Technical Lead: {name}
- SMEs Engaged: {list}

Next Update: {timestamp}
Status Page: {link}
War Room: {bridge/chat link}

---
{Incident Commander Name}
{Contact Information}

Executive Summary (SEV1)

Subject: URGENT - Customer-Impacting Outage - {Service Name}

Executive Summary:
{2-3 sentence description of customer impact and business implications}

Key Metrics:
- Time to Detection: {X minutes}
- Time to Engagement: {X minutes} 
- Estimated Customer Impact: {number/percentage}
- Current Status: {status}
- ETA to Resolution: {time or "investigating"}

Leadership Actions Required:
- [ ] Customer communication approval
- [ ] PR/Communications coordination  
- [ ] Resource allocation decisions
- [ ] External vendor engagement

Incident Commander: {name} ({contact})
Next Update: {time}

---
This is an automated alert from our incident response system.

Customer Communication Template

We are currently experiencing {brief description of issue} affecting {scope of impact}. 

Our engineering team was alerted at {time} and is actively working to resolve the issue. We will provide updates every {frequency} until resolved.

What we know:
- {factual statement of impact}
- {factual statement of scope}
- {brief status of response}

What we're doing:
- {primary response action}
- {secondary response action}

Workaround (if available):
{workaround steps or "No workaround currently available"}

We apologize for the inconvenience and will share more information as it becomes available.

Next update: {time}
Status page: {link}

Stakeholder Management

Stakeholder Classification

Internal Stakeholders:

Engineering Leadership - Technical decisions and resource allocation
Product Management - Customer impact assessment and feature implications
Customer Support - User communication and support ticket management
Sales/Account Management - Customer relationship management for enterprise clients
Executive Team - Business impact decisions and external communication approval
Legal/Compliance - Regulatory reporting and liability assessment

External Stakeholders:

Customers - Service availability and impact communication
Partners - API availability and integration impacts
Vendors - Third-party service dependencies and support escalation
Regulators - Compliance reporting for regulated industries
Public/Media - Transparency for public-facing outages

Communication Cadence by Stakeholder

Stakeholder	SEV1	SEV2	SEV3	SEV4
Engineering Leadership	Real-time	30min	4hrs	Daily
Executive Team	15min	1hr	EOD	Weekly
Customer Support	Real-time	30min	2hrs	As needed
Customers	15min	1hr	Optional	None
Partners	30min	2hrs	Optional	None

Runbook Generation Framework

Dynamic Runbook Components

Detection Playbooks
- Monitoring alert definitions
- Triage decision trees
- Escalation trigger points
- Initial response actions
Response Playbooks
- Step-by-step mitigation procedures
- Rollback instructions
- Validation checkpoints
- Communication checkpoints
Recovery Playbooks
- Service restoration procedures
- Data consistency checks
- Performance validation
- User notification processes

Runbook Template Structure

# {Service/Component} Incident Response Runbook

## Quick Reference
- **Severity Indicators:** {list of conditions for each severity level}
- **Key Contacts:** {on-call rotations and escalation paths}
- **Critical Commands:** {list of emergency commands with descriptions}

## Detection
### Monitoring Alerts
- {Alert name}: {description and thresholds}
- {Alert name}: {description and thresholds}

### Manual Detection Signs
- {Symptom}: {what to look for and where}
- {Symptom}: {what to look for and where}

## Initial Response (0-15 minutes)
1. **Assess Severity**
   - [ ] Check {primary metric}
   - [ ] Verify {secondary indicator}
   - [ ] Classify as SEV{level} based on {criteria}

2. **Establish Command**
   - [ ] Page Incident Commander if SEV1/2
   - [ ] Create incident tracking ticket
   - [ ] Join war room: {link/bridge info}

3. **Initial Investigation**
   - [ ] Check recent deployments: {deployment log location}
   - [ ] Review error logs: {log location and queries}
   - [ ] Verify dependencies: {dependency check commands}

## Mitigation Strategies
### Strategy 1: {Name}
**Use when:** {conditions}
**Steps:**
1. {detailed step with commands}
2. {detailed step with expected outcomes}
3. {validation step}

**Rollback Plan:**
1. {rollback step}
2. {verification step}

### Strategy 2: {Name}
{similar structure}

## Recovery and Validation
1. **Service Restoration**
   - [ ] {restoration step}
   - [ ] Wait for {metric} to return to normal
   - [ ] Validate end-to-end functionality

2. **Communication**
   - [ ] Update status page
   - [ ] Notify stakeholders
   - [ ] Schedule PIR

## Common Pitfalls
- **{Pitfall}:** {description and how to avoid}
- **{Pitfall}:** {description and how to avoid}

## Reference Information
→ See references/reference-information.md for details

## Usage Examples

### Example 1: Database Connection Pool Exhaustion

```bash
# Classify the incident
echo '{"description": "Users reporting 500 errors, database connections timing out", "affected_users": "80%", "business_impact": "high"}' | python scripts/incident_classifier.py

# Reconstruct timeline from logs
python scripts/timeline_reconstructor.py --input assets/sample_timeline_events.json --output timeline.md

# Generate PIR after resolution
python scripts/pir_generator.py --incident assets/sample_incident_data.json --timeline timeline.md --output pir.md

Example 2: API Rate Limiting Incident

# Quick classification from stdin
echo "API rate limits causing customer API calls to fail" | python scripts/incident_classifier.py --format text

# Build timeline from multiple sources
python scripts/timeline_reconstructor.py --input assets/simple_timeline_events.json --detect-phases --gap-analysis

# Generate comprehensive PIR
python scripts/pir_generator.py --incident assets/sample_incident_pir_data.json --rca-method fishbone --action-items

Best Practices

During Incident Response

Maintain Calm Leadership
- Stay composed under pressure
- Make decisive calls with incomplete information
- Communicate confidence while acknowledging uncertainty
Document Everything
- All actions taken and their outcomes
- Decision rationale, especially for controversial calls
- Timeline of events as they happen
Effective Communication
- Use clear, jargon-free language
- Provide regular updates even when there's no new information
- Manage stakeholder expectations proactively
Technical Excellence
- Prefer rollbacks to risky fixes under pressure
- Validate fixes before declaring resolution
- Plan for secondary failures and cascading effects

Post-Incident

Blameless Culture
- Focus on system failures, not individual mistakes
- Encourage honest reporting of what went wrong
- Celebrate learning and improvement opportunities
Action Item Discipline
- Assign specific owners and due dates
- Track progress publicly
- Prioritize based on risk and effort
Knowledge Sharing
- Share PIRs broadly within the organization
- Update runbooks based on lessons learned
- Conduct training sessions for common failure modes
Continuous Improvement
- Look for patterns across multiple incidents
- Invest in tooling and automation
- Regularly review and update processes

Integration with Existing Tools

Monitoring and Alerting

PagerDuty/Opsgenie integration for escalation
Datadog/Grafana for metrics and dashboards
ELK/Splunk for log analysis and correlation

Communication Platforms

Slack/Teams for war room coordination
Zoom/Meet for video bridges
Status page providers (Statuspage.io, etc.)

Documentation Systems

Confluence/Notion for PIR storage
GitHub/GitLab for runbook version control
JIRA/Linear for action item tracking

Change Management

CI/CD pipeline integration
Deployment tracking systems
Feature flag platforms for quick rollbacks

信息

Category 编程开发

Name incident-commander

版本 v20260612

大小 95.12KB

Source alirezarezvani/claude-skills

更新时间 2026-06-13