Disaster Recovery Plan Generator

v20260618
disaster-recovery-plan
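Generates a complete disaster recovery (DR) plan for any service or system. It covers the key recovery objectives (RTO/RPO), runbooks for specific failure scenarios, backup validation procedures, and a testing cadence. Suited to building SRE practices and to system-level high-availability planning.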

Disaster Recovery Plan Skill

Produce a complete disaster recovery plan for a service or system — giving engineers, SREs, and on-call responders everything they need to recover from a disaster scenario in the shortest possible time. A good DR plan is tested regularly, has exact commands (not vague instructions), and makes RTO/RPO targets measurable so the team knows whether recovery succeeded.

Required Inputs

Ask for these if not already provided:

  • Service name and what it does (business function and technical role)
  • Criticality tier — business impact of extended downtime (e.g. Tier 1 = revenue-critical, Tier 2 = ops impact, Tier 3 = internal only)
  • Current infrastructure setup — cloud provider, regions/zones, deployment model (Kubernetes, ECS, VMs, serverless)
  • RPO/RTO requirements — Recovery Point Objective (how much data loss is acceptable) and Recovery Time Objective (how long can it be down)
  • Backup strategy — what is backed up, how often, where backups are stored, retention policy
  • On-call contacts — names and contact details for the responder chain

Output Format


Disaster Recovery Plan: [Service Name]

Team: [Team name] | Tech lead: [Name]
Criticality tier: [Tier 1 / Tier 2 / Tier 3] | Last tested: [Date]
Next DR test: [Date] | Document owner: [Name]
Last updated: [Date] | Review cycle: Quarterly

Emergency? Skip to Section 3 — Failure Scenario Runbooks. Find the scenario that matches your situation and follow the steps exactly.


1. Recovery Targets

| Target | Value | Rationale |
| --- | --- | --- |
| RPO (Recovery Point Objective) | [X minutes/hours] | [e.g. "Last committed transaction; database replication is synchronous"] |
| RTO (Recovery Time Objective) | [Y minutes/hours] | [e.g. "Revenue impact begins at 30 min; target recovery in 15 min"] |
| MTTR target (non-disaster) | [Z minutes] | [Operational incidents, not DR events] |
| Data retention (backups) | [N days/weeks] | [Compliance requirement or operational policy] |
| Backup frequency | [Every X hours] | [RPO-driven: backup interval must be ≤ RPO] |

What these mean in practice:

  • If a database is corrupted, we can lose at most [X minutes] of transactions before the business impact is unacceptable.
  • The service must be operational again within [Y minutes/hours] of declaring a DR event.
  • If either target cannot be met, escalate to [Engineering Manager] immediately.
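
The rule from the targets table (backup interval must be ≤ RPO) can be checked mechanically before the plan is signed off. The following is a minimal sketch, assuming both values are given in whole minutes; the function name `backup_interval_ok` is hypothetical and not part of any tool.

```shell
#!/bin/sh
# Hypothetical helper: checks that the backup interval does not exceed the RPO.
# Both arguments are whole minutes (an assumption made for this sketch).
backup_interval_ok() {
  interval_min=$1
  rpo_min=$2
  if [ "$interval_min" -le "$rpo_min" ]; then
    echo "ok: backup every ${interval_min} min satisfies RPO of ${rpo_min} min"
    return 0
  fi
  echo "violation: backup every ${interval_min} min exceeds RPO of ${rpo_min} min"
  return 1
}
```

For example, `backup_interval_ok 120 60` reports a violation and returns a non-zero status, which makes it usable as a gate in a CI job that lints the plan.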

2. Failure Scenario Inventory

| Scenario | Likelihood | Impact | RTO target | RPO target | Runbook |
| --- | --- | --- | --- | --- | --- |
| Single availability zone failure | Medium | [Partial / Full outage] | [15 min] | [0 (no data loss)] | Section 3.1 |
| Full region failure | Low | Full outage | [60 min] | [5 min] | Section 3.2 |
| Database corruption / data loss | Low | Full outage | [90 min] | [RPO value] | Section 3.3 |
| Critical dependency outage | High | [Partial degradation] | [30 min] | [N/A] | Section 3.4 |
| Security breach / ransomware | Very low | Full outage + investigation | [4 hours] | [Last clean backup] | Section 3.5 |
| Accidental bulk data deletion | Low | Partial or full data loss | [60 min] | [RPO value] | Section 3.6 |

3. Failure Scenario Runbooks

3.1 Single Availability Zone Failure

Trigger: One AZ becomes unreachable; pods/instances in that zone stop responding.
Detection: PagerDuty alert [AlertName] fires, or cloud provider status page shows AZ degradation.
Expected RTO: [15 minutes] | Expected RPO: Zero (no data loss if multi-AZ replication is working)

Step 1 — Confirm the failure

# Check pod/instance health across zones
kubectl get pods -o wide -n [namespace] | grep -v Running

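# Check which nodes are affected: prints the header plus every node whose status is not exactly "Ready"
# (a plain `grep -v Ready` would also hide NotReady nodes, because "NotReady" contains "Ready")
kubectl get nodes -o wide | awk 'NR == 1 || $2 != "Ready"'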

# Verify cloud provider AZ status
# AWS: https://health.aws.amazon.com/health/status
# GCP: https://status.cloud.google.com

Step 2 — Assess whether auto-recovery has occurred

# If using auto-scaling, check if replacement instances launched
kubectl get pods -n [namespace] --watch

# Check deployment replica count
kubectl get deployment [service-name] -n [namespace]

# Verify load balancer health checks are passing
[cloud provider CLI command to check target group health]

Step 3 — Force rescheduling if auto-recovery stalled

# Cordon the affected node so no new pods schedule on it
kubectl cordon [node-name]

# Drain the node — moves all pods to healthy nodes
kubectl drain [node-name] --ignore-daemonsets --delete-emptydir-data

# Verify pods have rescheduled successfully
kubectl get pods -o wide -n [namespace]

Step 4 — Verify service health

# Smoke test key endpoints
curl -s -o /dev/null -w "%{http_code}" https://[service-url]/health
curl -s -o /dev/null -w "%{http_code}" https://[service-url]/[critical-endpoint]

# Check error rate in monitoring
[dashboard link or query]

Recovery confirmed when: All pods are Running, health check returns 200, error rate is at baseline.
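
The "health check returns 200" criterion can be polled rather than checked by hand. Below is a minimal sketch, assuming a single health URL is a sufficient signal; the function name `wait_for_healthy`, the default of 30 attempts, and the 10-second interval are all hypothetical choices, not values from this plan.

```shell
#!/bin/sh
# Hypothetical helper: polls a health URL until it returns HTTP 200
# or the attempt budget runs out. Attempt count and interval are assumptions.
wait_for_healthy() {
  url=$1
  max_attempts=${2:-30}
  sleep_seconds=${3:-10}
  attempt=1
  code="none"
  while [ "$attempt" -le "$max_attempts" ]; do
    # curl prints 000 when the connection itself fails; "|| true" keeps polling
    code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 5 "$url") || true
    if [ "$code" = "200" ]; then
      echo "healthy after ${attempt} attempt(s)"
      return 0
    fi
    if [ "$attempt" -lt "$max_attempts" ]; then
      sleep "$sleep_seconds"
    fi
    attempt=$((attempt + 1))
  done
  echo "still unhealthy after ${max_attempts} attempt(s), last status: ${code}"
  return 1
}
```

A call such as `wait_for_healthy "https://[service-url]/health" 18 10` would give the service roughly three minutes to recover before the responder moves on to the next step.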


3.2 Full Region Failure

Trigger: The primary region is entirely unavailable.
Detection: All service health checks failing, cloud provider status page confirms region-wide event.
Expected RTO: [60 minutes] | Expected RPO: [5 minutes, based on cross-region replication lag]

Step 1 — Confirm regional failure (5 minutes)

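# Confirm the primary region is unreachable (-c bounds the probe; without it, ping on Linux runs until interrupted)
ping -c 3 [primary-region-endpoint] || echo "Primary region unreachable"

# Check replication lag on standby region database
[command to check replica lag, e.g. for RDS: the CloudWatch ReplicaLag metric for the replica in [dr-region]; aws rds describe-db-instances reports instance status, not lag]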

Step 2 — Declare DR event and notify (2 minutes)

Post to #incidents:

🔴 DR EVENT — [Service Name] — Region Failure
Primary region: [region] — UNREACHABLE
Activating failover to: [dr-region]
Incident commander: [Name]
Next update: 15 minutes

Page [Engineering Manager] and [CTO/VP Eng] via PagerDuty.

Step 3 — Promote DR database (10 minutes)

# AWS RDS — promote read replica to primary
aws rds promote-read-replica \
  --db-instance-identifier [dr-replica-identifier] \
  --region [dr-region]

# Wait for promotion to complete
aws rds wait db-instance-available \
  --db-instance-identifier [dr-replica-identifier] \
  --region [dr-region]

# Record the new database endpoint
aws rds describe-db-instances \
  --db-instance-identifier [dr-replica-identifier] \
  --region [dr-region] \
  --query 'DBInstances[0].Endpoint.Address'

Step 4 — Deploy service in DR region (20 minutes)

# Update service configuration to point at DR database
kubectl set env deployment/[service-name] \
  DATABASE_URL=[new-dr-database-url] \
  -n [namespace] \
  --context [dr-region-context]

# Scale up the DR deployment
kubectl scale deployment/[service-name] --replicas=[N] \
  -n [namespace] \
  --context [dr-region-context]

# Verify all pods are running
kubectl get pods -n [namespace] --context [dr-region-context]

Step 5 — Cut over DNS / load balancer (5 minutes)

# Update DNS to point to DR region load balancer
# AWS Route 53:
aws route53 change-resource-record-sets \
  --hosted-zone-id [zone-id] \
  --change-batch file://dr-failover-dns.json

# Verify DNS propagation (may take up to [TTL] seconds)
dig [service-domain] @8.8.8.8
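
The `dr-failover-dns.json` file referenced in the Route 53 command has to exist before the incident. As an illustration only, the sketch below prints a change batch that repoints a plain CNAME record; the function name `render_failover_batch`, the CNAME record type, and the 60-second default TTL are assumptions. An alias record would need an `AliasTarget` block instead of `ResourceRecords`.

```shell
#!/bin/sh
# Hypothetical helper: prints a Route 53 change batch that repoints a CNAME
# at the DR load balancer. Record type and TTL are assumptions for this sketch.
render_failover_batch() {
  record_name=$1
  dr_target=$2
  ttl=${3:-60}
  cat <<EOF
{
  "Comment": "DR failover for ${record_name}",
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "${record_name}",
        "Type": "CNAME",
        "TTL": ${ttl},
        "ResourceRecords": [{ "Value": "${dr_target}" }]
      }
    }
  ]
}
EOF
}
```

Running `render_failover_batch [service-domain] [dr-load-balancer-dns] > dr-failover-dns.json` during plan preparation, and committing the result next to the runbook, avoids hand-writing JSON during an outage.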

Step 6 — Verify end-to-end

# Full smoke test against DR endpoint
curl -s https://[service-url]/health
[run automated smoke test suite if available]

Recovery confirmed when: DNS resolves to DR region, smoke tests pass, error rate is at baseline.

Post-failover actions (not urgent — after service is stable):

  • Do not fail back to primary until root cause is confirmed resolved
  • Document data loss window (check replication lag at time of failure)
  • Begin post-incident review — see [incident-postmortem skill]

3.3 Database Corruption or Data Loss

Trigger: Data in the database is corrupted, deleted, or otherwise incorrect due to a software bug, operator error, or hardware fault.
Detection: Application errors referencing missing/invalid data, monitoring alerts on query error rate, user reports.
Expected RTO: [90 minutes] | Expected RPO: [Backup interval, e.g. 1 hour]

Step 1 — Stop the bleeding immediately

# Put the service into maintenance mode to prevent further writes to corrupted data
[command to enable maintenance mode — e.g. kubectl set env deployment/[name] MAINTENANCE_MODE=true]

# Or: scale down the service to zero to prevent writes
kubectl scale deployment/[service-name] --replicas=0 -n [namespace]

Step 2 — Assess scope of corruption

# Identify which tables/records are affected
[SQL query to check data integrity — e.g.]
# psql $DATABASE_URL -c "SELECT COUNT(*) FROM [table] WHERE [integrity check condition]"

# Determine when corruption started (cross-reference with deploy times and error logs)
[log query to find earliest error — e.g. in Datadog:]
# service:[service-name] status:error "[corruption error message]" | sort by timestamp asc

Step 3 — Identify the correct restore point

# List available backups
[command to list backups — e.g. for RDS:]
aws rds describe-db-snapshots \
  --db-instance-identifier [db-identifier] \
  --query 'DBSnapshots[*].[SnapshotCreateTime,DBSnapshotIdentifier]' \
  --output table

# Choose the most recent backup BEFORE corruption started
# Record the chosen snapshot ID: [snapshot-id]
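
Picking "the most recent backup before corruption started" by eye is error-prone under pressure. The sketch below automates that selection under stated assumptions: input lines have the form `timestamp snapshot-id` (as produced by the listing command with `--output text`), and every timestamp, including the cutoff, is UTC ISO 8601 in the same format so that string order equals time order. The function name `latest_snapshot_before` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads "timestamp<whitespace>snapshot-id" lines on stdin
# and prints the ID of the newest snapshot created strictly before the cutoff.
# Assumes all timestamps are UTC ISO 8601 in one format, so plain string
# comparison matches chronological order.
latest_snapshot_before() {
  cutoff=$1
  awk -v cutoff="$cutoff" '
    BEGIN { best_time = ""; best_id = "" }
    $1 < cutoff && $1 > best_time { best_time = $1; best_id = $2 }
    END { if (best_id == "") exit 1; print best_id }
  '
}
```

The function exits non-zero when no snapshot predates the cutoff, which signals that recovery has to fall back to an older backup source or to transaction logs.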

Step 4 — Restore from backup

# Restore to a NEW database instance (never overwrite production directly)
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier [service-name]-restored-[date] \
  --db-snapshot-identifier [snapshot-id] \
  --region [region]

# Wait for restore to complete (same --region as the restore call)
aws rds wait db-instance-available \
  --db-instance-identifier [service-name]-restored-[date] \
  --region [region]

# Get the restored instance endpoint
aws rds describe-db-instances \
  --db-instance-identifier [service-name]-restored-[date] \
  --region [region] \
  --query 'DBInstances[0].Endpoint.Address'

Step 5 — Validate restored data

# Connect to restored database and verify integrity
# (-h is required: a bare first argument is read by psql as the database name, not the host)
psql -h [restored-db-endpoint] -U [user] -d [database] -c "[data integrity query]"

# Confirm record counts match expectations
psql -h [restored-db-endpoint] -U [user] -d [database] -c "SELECT COUNT(*) FROM [critical-table]"
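
"Match expectations" is easier to apply if the acceptable gap is written down. A minimal sketch follows, assuming a percentage tolerance is an adequate proxy for the rows written during the RPO window; the function name `row_count_within_tolerance` and the 5% default are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when the restored row count is within a
# percentage tolerance of the expected count. The 5% default is an assumption.
row_count_within_tolerance() {
  expected=$1
  actual=$2
  tolerance_pct=${3:-5}
  diff=$((expected - actual))
  if [ "$diff" -lt 0 ]; then
    diff=$((0 - diff))
  fi
  # Integer arithmetic: compare diff*100 against expected*tolerance
  if [ $((diff * 100)) -le $((expected * tolerance_pct)) ]; then
    echo "ok: restored ${actual} rows, expected about ${expected}"
    return 0
  fi
  echo "mismatch: restored ${actual} rows, expected about ${expected}"
  return 1
}
```

The expected value would come from the last known production count, for example a figure recorded by monitoring shortly before the incident.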

Step 6 — Point service at restored database

kubectl set env deployment/[service-name] \
  DATABASE_URL=postgres://[user]:[pass]@[restored-endpoint]/[db] \
  -n [namespace]

kubectl scale deployment/[service-name] --replicas=[N] -n [namespace]

Recovery confirmed when: Service is running against restored database, data integrity checks pass, error rate is at baseline.


3.4 Critical Dependency Outage

Trigger: A service that [service name] depends on is unavailable or degraded.
Detection: Increased error rate or latency on endpoints that call [dependency], alerts from dependency owner.
Expected RTO: Depends on dependency: [30 minutes for mitigation, resolution depends on dependency owner]

Dependency map:

| Dependency | Criticality | Degraded behaviour | Mitigation |
| --- | --- | --- | --- |
| [Database] | Critical: all writes fail | Full outage | Activate DR database (Section 3.3) |
| [Cache, e.g. Redis] | High: latency increases | Performance degradation | Bypass cache, serve from DB |
| [Auth service] | Critical: auth fails | All authenticated endpoints fail | Return cached tokens (if implemented) |
| [Message queue] | Medium: async processing delays | Writes succeed, async jobs queue | Queue backlog (see on-call runbook) |
| [External API: name] | Low: feature X unavailable | Graceful degradation | Feature flag to disable feature X |

Mitigation steps:

# Enable circuit breaker / fallback for [dependency] if implemented
kubectl set env deployment/[service-name] [DEPENDENCY]_CIRCUIT_BREAKER=open -n [namespace]

# Enable feature flag to disable [dependency-backed feature]
[feature flag CLI command or dashboard link]

# Check if dependency has a status page
# [Dependency status URL]

Escalation: Contact [dependency] on-call via [PagerDuty / Slack #[channel]]. Share your service's error rate and the time dependency errors started.


3.5 Security Breach or Ransomware

Trigger: Evidence of unauthorized access, data exfiltration, or encryption of service data.
Detection: Security tooling alert, unusual access patterns, user reports of data exposure.
Expected RTO: [4+ hours; prioritise containment over speed] | Expected RPO: [Last verified clean backup]

Step 1 — Isolate immediately

# Take the service offline — do not attempt to recover while breach is active
kubectl scale deployment/[service-name] --replicas=0 -n [namespace]

# Revoke all API keys and service account credentials immediately
[command to rotate secrets — e.g. via Vault or cloud provider]

# Block all external access at network level
[firewall/security group command to deny all inbound traffic]

Step 2 — Notify security team immediately

Page [Security lead] via PagerDuty. Do NOT attempt to remediate without security team involvement.

Post to #security-incidents (private channel, not #incidents):

🔴 SECURITY INCIDENT — [Service Name]
Time detected: [Time]
Evidence: [One sentence — what was observed]
Actions taken: Service isolated, credentials revoked
Awaiting: Security team guidance

Step 3 — Preserve evidence

# Export current logs before any remediation
[log export command — preserve evidence for forensics]

# Snapshot the current state of all infrastructure
[snapshot/image command]

Steps 4+ — Follow security team guidance. Do not restore from backup until security team confirms the attack vector is closed.


3.6 Accidental Bulk Data Deletion

Trigger: An operator, script, or application bug has deleted records in bulk.
Detection: Sudden drop in record counts, user reports of missing data, application errors.
Expected RTO: [60 minutes] | Expected RPO: [Backup interval]

# Step 1 — Stop further writes immediately
kubectl scale deployment/[service-name] --replicas=0 -n [namespace]

# Step 2 — Determine what was deleted and when
psql $DATABASE_URL -c "
  SELECT schemaname, relname,
         n_dead_tup, last_autovacuum
  FROM pg_stat_user_tables
  ORDER BY n_dead_tup DESC LIMIT 10;
"

# Step 3 — Check if deletion is recoverable via MVCC (PostgreSQL)
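# A normal SELECT cannot see rows whose DELETE has already committed, even before VACUUM runs.
# Visible rows with a non-zero xmax were locked, or deleted by a transaction that is still
# open or was rolled back. If the deleting transaction is still open, ROLLBACK recovers the rows.
psql $DATABASE_URL -c "
  SELECT * FROM [table]
  WHERE xmax != 0  -- locked, or deleted by an uncommitted or aborted transaction
  LIMIT 100;
"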

# Step 4 — If not recoverable via MVCC, restore from backup
# Follow Section 3.3 (Database Corruption runbook) from Step 3 onward

4. Backup and Restore Procedures

Backup Configuration

| Data store | Backup type | Frequency | Retention | Location |
| --- | --- | --- | --- | --- |
| [Primary database] | Automated snapshots | Every [N] hours | [N] days | [S3 bucket / cloud storage path] |
| [Primary database] | Transaction log backups | Continuous | [N] days | [Location] |
| [Secondary store, e.g. Redis] | RDB dump | Daily | [N] days | [Location] |
| [Blob/object storage] | Cross-region replication | Continuous | [N] days | [DR region bucket] |
| [Config / secrets] | Terraform state + Vault backup | On change | Indefinite | [Location] |

Backup Validation (Run Weekly)

# Name the throwaway instance once so the restore, wait, and delete steps target the same instance
# (re-evaluating $(date +%Y%m%d) after midnight would produce a different name)
TEST_INSTANCE="[service-name]-backup-test-$(date +%Y%m%d)"

# Test restore of latest database backup to a throwaway instance
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier "$TEST_INSTANCE" \
  --db-snapshot-identifier "$(aws rds describe-db-snapshots \
    --db-instance-identifier [db-id] \
    --query 'sort_by(DBSnapshots, &SnapshotCreateTime)[-1].DBSnapshotIdentifier' \
    --output text)"

# Wait for the restore to finish, then run integrity checks
aws rds wait db-instance-available --db-instance-identifier "$TEST_INSTANCE"
psql -h [test-instance-endpoint] -U [user] -d [database] -c "[integrity check query]"

# Confirm row counts match recent production values (allow ≤ RPO difference)
psql -h [test-instance-endpoint] -U [user] -d [database] -c "SELECT COUNT(*) FROM [critical-table]"

# Destroy the test instance
aws rds delete-db-instance \
  --db-instance-identifier "$TEST_INSTANCE" \
  --skip-final-snapshot

5. DR Testing Cadence

Regular testing is mandatory. An untested DR plan is not a DR plan.

| Test type | Frequency | Who runs it | Pass criteria |
| --- | --- | --- | --- |
| Backup restore validation | Weekly (automated) | On-call rotation | Restore completes, integrity checks pass |
| Zone failover drill | Monthly | Engineering team | RTO target met, zero data loss |
| Region failover drill | Quarterly | Engineering + SRE | RTO/RPO targets met |
| Full DR game day | Annually | Engineering + stakeholders | All scenarios exercised, gaps documented |
| Chaos engineering (infra failures) | Weekly (automated) | Chaos engineering tooling | Service degrades gracefully, recovers automatically |

Game Day Procedure

  1. Pre-game day (1 week before): Notify all stakeholders, freeze production changes for the day, prepare DR environment.
  2. Scope definition: Choose 2–3 scenarios from Section 2. Document expected outcomes before the test.
  3. Execute: One person acts as incident commander, others execute the runbook steps, and a dedicated observer times each step.
  4. Measure: Record actual RTO and RPO against targets for each scenario.
  5. Debrief (same day): Document gaps, runbook inaccuracies, and automation opportunities.
  6. Action items: File tickets for every gap found. Priority: P1 items must be fixed before next game day.
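
The "Measure" step depends on consistent timing across scenarios. One possible approach, sketched below, is to record epoch seconds at declaration and at recovery and convert the difference to minutes; the function name `elapsed_minutes` is hypothetical, and rounding up is an assumption chosen so a partial minute counts against the target.

```shell
#!/bin/sh
# Hypothetical helper: converts two epoch timestamps (seconds) into elapsed
# whole minutes, rounding up so a partial minute counts against the target.
elapsed_minutes() {
  start_epoch=$1
  end_epoch=$2
  echo $(( (end_epoch - start_epoch + 59) / 60 ))
}
```

During a drill the observer could run `start=$(date +%s)` when the DR event is declared and `elapsed_minutes "$start" "$(date +%s)"` once recovery is confirmed, then log the result next to the scenario's RTO target.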

6. Communication Plan

Internal Communication During DR Event

Incident commander responsibilities:

  • Declare the DR event and open the incident channel
  • Post updates every 15 minutes minimum
  • Make the call to fail over (do not let the team decide by committee)
  • Notify business stakeholders of expected recovery time

Notify these people at DR event start:

| Role | Name | Contact | When to notify |
| --- | --- | --- | --- |
| Engineering manager | [Name] | [Slack / Phone] | Immediately |
| CTO / VP Engineering | [Name] | [Phone] | Tier 1 services: immediately |
| Customer success lead | [Name] | [Slack] | If customer-facing impact |
| Security lead | [Name] | [Slack / PagerDuty] | If breach suspected |
| Legal / compliance | [Name] | [Email / Phone] | If data loss involves PII |

Communication Templates

DR event declared:

🔴 DR EVENT — [Service Name]
Time: [HH:MM UTC]
Scenario: [Zone failure / Region failure / Data loss / etc.]
Impact: [Who is affected and how]
RTO target: [X minutes]
Incident commander: [Name]
War room: [Slack channel / call link]
Next update: [Time + 15 min]

Status update (every 15 minutes):

🔴 DR UPDATE — [Service Name] — [HH:MM UTC]
Status: [Investigating / Executing recovery / Verifying]
Progress: [One sentence on current step]
Blockers: [Any — or "None"]
Updated RTO estimate: [Time]
Next update: [Time + 15 min]

Recovery confirmed:

✅ DR RESOLVED — [Service Name] — [HH:MM UTC]
Total downtime: [X minutes]
Data loss: [None / X minutes of transactions]
RTO target: [X min] — Actual: [Y min] — [MET / MISSED]
RPO target: [X min] — Actual: [Y min] — [MET / MISSED]
Root cause: [One sentence]
Post-incident review: [Scheduled for / Link when created]
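
The MET / MISSED fields in the resolution template are a direct comparison of target and actual values. A minimal sketch, assuming both are whole minutes; the function name `target_verdict` and the plain-hyphen output layout are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: prints one line for the "DR RESOLVED" message,
# comparing a target against the measured value (both in whole minutes).
target_verdict() {
  label=$1
  target_min=$2
  actual_min=$3
  if [ "$actual_min" -le "$target_min" ]; then
    verdict="MET"
  else
    verdict="MISSED"
  fi
  echo "${label} target: ${target_min} min - Actual: ${actual_min} min - ${verdict}"
}
```

Calling it once for RTO and once for RPO yields the two comparison lines of the message without anyone doing arithmetic at the end of a long incident.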

7. DR Readiness Checklist

Run this checklist quarterly and before any major infrastructure change:

Backups:

  • Automated backups are running and alerts fire if they fail
  • Most recent backup restore was tested within the last 7 days
  • Backup retention meets RPO and compliance requirements
  • Backups are stored in a separate region / account from primary

Failover infrastructure:

  • DR region / environment exists and is provisioned (not just documented)
  • DNS failover procedure is documented with exact commands
  • DR database replica is current (replication lag is within RPO)
  • Service can be deployed in DR region with a single command or automated pipeline

Runbooks:

  • All runbooks in Section 3 have been tested within the last quarter
  • Runbook commands have been verified against current infrastructure (no stale references)
  • Contact list is current (no departed employees)

Access:

  • On-call engineers have access to DR region console / CLI
  • Service account credentials for DR region are provisioned and tested
  • Break-glass accounts exist for emergency access if SSO is unavailable

Monitoring:

  • Monitoring exists in DR region (not just primary)
  • Alerts fire correctly when DR environment has issues
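
Several checklist items are recency checks ("within the last 7 days", "within the last quarter"). These could be automated along the following lines. This is a sketch under stated assumptions: timestamps are passed as epoch seconds to sidestep GNU/BSD `date` differences, and the function name `tested_recently` and the 7-day default are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when the last test ran within max_days.
# Takes epoch seconds so the sketch avoids GNU/BSD `date` parsing differences.
tested_recently() {
  last_test_epoch=$1
  now_epoch=$2
  max_days=${3:-7}
  age_seconds=$((now_epoch - last_test_epoch))
  if [ "$age_seconds" -le $((max_days * 86400)) ]; then
    echo "fresh: last test $((age_seconds / 86400)) day(s) ago"
    return 0
  fi
  echo "stale: last test $((age_seconds / 86400)) day(s) ago, limit is ${max_days}"
  return 1
}
```

Passing `90` as the third argument covers the quarterly runbook check with the same function.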

Quality Checks

  • RPO and RTO targets are specific numbers, not ranges, and are agreed with the business
  • Every command in every runbook has been run by a human in the last quarter, not copied untested from documentation
  • DR database exists in the DR region and replication lag is monitored
  • Backup restore has been tested end-to-end within the last 7 days
  • The game day schedule is on the team calendar — not just documented here
  • Contact list contains current phone numbers, not just Slack handles (Slack may be down during a DR event)
  • Security breach runbook (3.5) explicitly names the security team contact and does not attempt self-remediation
  • All thresholds (RTO/RPO) are visible in the monitoring dashboard so actual vs. target is measurable in real time
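
The phone-number requirement is one of the few quality checks that can be linted. The sketch below assumes the contact list is kept as CSV with a header row and the columns `role,name,slack,phone`; that layout and the function name `check_contact_phones` are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads a contact CSV (role,name,slack,phone) on stdin,
# prints the name of every contact whose phone field contains no digits,
# and fails if any such contact exists. The column layout is an assumption.
check_contact_phones() {
  awk -F',' '
    NR > 1 && $4 !~ /[0-9]/ { print $2; missing = 1 }
    END { exit (missing ? 1 : 0) }
  '
}
```

Run as `check_contact_phones < contacts.csv`, it lists the people who can only be reached through Slack.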

Anti-Patterns

  • Do not write runbook commands without testing them — an untested command in a runbook is actively dangerous during a real disaster when cognitive load is highest
  • Do not set RTO/RPO targets without business sign-off — technical teams often set aspirational targets that do not reflect actual business cost tolerance for downtime
  • Do not include only the "happy path" of each failover scenario — runbooks must explicitly cover what to do when the recovery step itself fails
  • Do not list Slack handles as the only escalation contact — Slack may be unavailable during a region-wide failure; phone numbers are mandatory
  • Do not schedule DR game days without pre-committing to fix the gaps found — a game day that produces action items no one owns is theater, not preparedness
Information
Category: Programming & Development
Name: disaster-recovery-plan
Version: v20260618
Size: 22.59KB
Updated: 2026-06-19