Produces operational runbooks for services, incident types, and deployment procedures — structured so an on-call engineer who's never touched the system can follow them under pressure.
Ask for these if not provided:
Runbook: [Runbook Title]
Service: [Service Name]
Type: [Deployment / Incident Response / Maintenance / DR]
Last Updated: [Insert today's date in YYYY-MM-DD format]
Owner: [Team or person]
Severity: [P1 / P2 / P3, incident-type runbooks only]
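The header fields above can be checked mechanically before a runbook is published. The following is a minimal sketch, assuming the header has already been parsed into a dictionary keyed by field name; `validate_header` and the dictionary layout are hypothetical and not part of the template.

```python
import re
from datetime import date

# Allowed values copied from the header template; everything else is illustrative.
RUNBOOK_TYPES = {"Deployment", "Incident Response", "Maintenance", "DR"}
SEVERITIES = {"P1", "P2", "P3"}
REQUIRED_FIELDS = ("Runbook", "Service", "Type", "Last Updated", "Owner")


def validate_header(header: dict) -> list:
    """Return a list of problems found in a runbook header (empty if none)."""
    problems = []
    for name in REQUIRED_FIELDS:
        if not header.get(name):
            problems.append(f"missing field: {name}")

    runbook_type = header.get("Type")
    if runbook_type and runbook_type not in RUNBOOK_TYPES:
        problems.append(f"unknown type: {runbook_type}")

    updated = header.get("Last Updated")
    if updated:
        # The regex enforces the YYYY-MM-DD shape; fromisoformat rejects
        # impossible calendar dates such as 2024-02-30.
        shape_ok = re.fullmatch(r"\d{4}-\d{2}-\d{2}", updated) is not None
        try:
            if shape_ok:
                date.fromisoformat(updated)
        except ValueError:
            shape_ok = False
        if not shape_ok:
            problems.append("Last Updated is not YYYY-MM-DD")

    # Severity only applies to incident-type runbooks.
    if runbook_type == "Incident Response" and header.get("Severity") not in SEVERITIES:
        problems.append("incident runbooks need Severity P1, P2, or P3")
    return problems
```

An empty list means the header is complete; otherwise each entry names one field to fix.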
What this runbook covers: [1–2 sentences on the scenario this runbook handles]
When to use this runbook:
- [… high-error-rate-payment-service]
- [… main]
Estimated time to complete: [X minutes / X–Y minutes depending on outcome]
Impact if not completed correctly: [e.g. Payment processing degraded / Data loss risk / Users locked out]
Access required:
- [… production-account]
- [… vault read secret/payment-service]
Tools required:
- [… kubectl v1.28+]
Before you start:
- [… #ops-live that you're starting]
Number every step. Use exact commands. Do not paraphrase tool names or flags.
Step 1: [Action name]
[What you're doing and why, in one sentence]
# Exact command
[command here]
Expected output: [what should appear if this worked]
If this fails: [Exact error message to look for] → [What to do, or see Troubleshooting]
Step 2: [Action name]
[Same structure as Step 1]
Step 3: Verify
Always include a verification step after the main procedure:
[verification command]
Expected state: [What a healthy system looks like after this runbook completes]
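The step structure above (exact command, expected output, documented failure) also lends itself to a machine-checkable form. The following is a rough sketch, assuming the command output has already been captured as text; the `Step` class, the `evaluate` function, and the sample strings are hypothetical illustrations rather than part of the template.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    number: int
    action: str
    command: str
    expected_output: str  # substring that signals success
    known_failures: dict = field(default_factory=dict)  # error text -> what to do


def evaluate(step: Step, output: str) -> str:
    """Classify captured command output against the step's documented outcomes."""
    # Check documented failures first so an error is never masked by a
    # partial success message in the same output.
    for error_text, next_action in step.known_failures.items():
        if error_text in output:
            return f"Step {step.number} FAILED: {next_action}"
    if step.expected_output in output:
        return f"Step {step.number} OK"
    return f"Step {step.number} FAILED: unrecognised output, see Troubleshooting"


# Hypothetical step; the command and messages are placeholders.
example = Step(
    number=1,
    action="Check service health",
    command="example-tool status",
    expected_output="status: healthy",
    known_failures={"connection refused": "check VPN access, then retry"},
)
print(evaluate(example, "status: healthy"))
```

Anything that matches neither the expected output nor a documented failure is routed to Troubleshooting, which keeps the engineer from guessing under pressure.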
How to undo this procedure if something went wrong:
Step R1: [Rollback action]
[rollback command]
Verify rollback: [command to confirm rollback succeeded]
| Symptom | Likely Cause | Resolution |
|---|---|---|
| [Error message or observable symptom] | [Why this happens] | [Exact fix or next step] |
| [Another symptom] | [Cause] | [Resolution] |
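A troubleshooting table in this shape can double as a lookup keyed on the symptom text. The sketch below assumes symptoms are matched as case-insensitive substrings of whatever the engineer observed; the rows and the `lookup` function are hypothetical placeholders, not real entries.

```python
# Each tuple mirrors one table row: (symptom, likely cause, resolution).
# These rows are made up for illustration only.
TROUBLESHOOTING = [
    ("connection refused", "Service not listening yet", "Wait 30 seconds and re-run the step"),
    ("permission denied", "Missing production access", "Request access, then restart from Step 1"),
]


def lookup(observed: str) -> list:
    """Return (cause, resolution) pairs whose symptom appears in the observed text."""
    text = observed.lower()
    return [(cause, fix) for symptom, cause, fix in TROUBLESHOOTING if symptom in text]


for cause, fix in lookup("Error: Permission denied (publickey)"):
    print(f"{cause} -> {fix}")
```

Substring matching only works when the Symptom column quotes the error message exactly, which is one more reason to avoid paraphrasing it.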
If this runbook does not resolve the issue:
| Condition | Who to Contact | How |
|---|---|---|
| [e.g. DB unavailable after 10 min] | [DBA on-call] | [PagerDuty policy: db-oncall] |
| [e.g. Payment provider unresponsive] | [Vendor contact] | [Contact in 1Password: vendor-escalation] |
Always update the incident timeline in [tool] before escalating.
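The escalation table pairs a condition with a contact and, in some rows, a time threshold. One way to make that explicit is sketched below, assuming each condition's duration is tracked in minutes; `EscalationRule` and `due_escalations` are hypothetical names, and the two rules reuse the example rows from the table.

```python
from dataclasses import dataclass


@dataclass
class EscalationRule:
    condition: str      # condition name from the table
    after_minutes: int  # how long the condition must persist before escalating
    contact: str
    how: str


def due_escalations(rules: list, active_minutes: dict) -> list:
    """Return rules whose condition has persisted for at least after_minutes.

    active_minutes maps a condition name to how long it has been observed;
    conditions that are absent from the mapping are treated as not occurring.
    """
    return [
        rule for rule in rules
        if active_minutes.get(rule.condition, -1) >= rule.after_minutes
    ]


RULES = [
    EscalationRule("DB unavailable", 10, "DBA on-call", "PagerDuty policy: db-oncall"),
    # The table gives no threshold for this row, so 0 (escalate at once) is an assumption.
    EscalationRule("Payment provider unresponsive", 0, "Vendor contact",
                   "Contact in 1Password: vendor-escalation"),
]

for rule in due_escalations(RULES, {"DB unavailable": 12}):
    print(f"Escalate to {rule.contact} via {rule.how}")
```

Writing the threshold as a number removes any doubt about when "after 10 min" has been reached.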
After completing the runbook:
- … #ops-live with outcome