
Runbook Writing

v20260618
runbook-writer
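Used to write structured, executable operations runbooks. Whether for service deployment, incident response, or routine maintenance, it produces detailed guides covering steps, prerequisites, rollback, and escalation paths, so that even junior engineers can resolve issues quickly under pressure.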

Runbook Writer Skill

Produces operational runbooks for services, incident types, and deployment procedures — structured so an on-call engineer who's never touched the system can follow them under pressure.

Required Inputs

Ask for these if not provided:

  • What the runbook is for (e.g. deploying the payment service, responding to a database failover, rotating API keys)
  • Runbook type (Deployment / Incident Response / Maintenance / Disaster Recovery)
  • System/service name and what it does (brief description)
  • Audience (new on-call engineers / experienced SREs / DevOps team)
  • Tech stack (where relevant — e.g. Kubernetes, AWS RDS, Node.js)
  • Monitoring tools (e.g. Grafana, Datadog, CloudWatch, Splunk — used to name specific dashboards and alert links in the steps)
  • Key environment details (e.g. Kubernetes cluster name, AWS account/region, relevant namespaces or resource names — paste what's relevant for exact commands)
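
The "ask if not provided" rule above can be sketched as a small check. This is a minimal illustration, assuming the inputs arrive as a dictionary; the field names and the `missing_inputs` helper are hypothetical and not part of the skill itself. Treating the tech stack as always required is also an assumption, since the list marks it "where relevant".

```python
# Hypothetical field names mirroring the Required Inputs list; not defined by the skill.
REQUIRED_INPUTS = {
    "purpose": "What the runbook is for",
    "runbook_type": "Runbook type (Deployment / Incident Response / Maintenance / Disaster Recovery)",
    "service": "System/service name and what it does",
    "audience": "Audience",
    "tech_stack": "Tech stack",  # assumption: treated as required here
    "monitoring_tools": "Monitoring tools",
    "environment": "Key environment details",
}


def missing_inputs(provided: dict[str, str]) -> list[str]:
    """Return the prompt for every required input that is absent or blank."""
    return [
        prompt
        for key, prompt in REQUIRED_INPUTS.items()
        if not provided.get(key, "").strip()
    ]


if __name__ == "__main__":
    # Example request with only two of the seven inputs supplied.
    request = {"purpose": "Rotate API keys", "service": "payment-service: handles card payments"}
    for prompt in missing_inputs(request):
        print(f"Please provide: {prompt}")
```

Anything returned by `missing_inputs` would be asked for before drafting begins.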

Output Format


Runbook: [Runbook Title]
Service: [Service Name]
Type: [Deployment / Incident Response / Maintenance / DR]
Last Updated: [Insert today's date in YYYY-MM-DD format]
Owner: [Team or person]
Severity: [P1 / P2 / P3, if incident-type]
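
The header fields can also be filled in programmatically. The sketch below assumes a hypothetical `render_header` helper (not part of the skill) and placeholder values in the usage example; it relies on `datetime.date.isoformat()`, which yields the YYYY-MM-DD format the template asks for.

```python
from datetime import date
from typing import Optional


def render_header(title: str, service: str, runbook_type: str, owner: str,
                  severity: Optional[str] = None, today: Optional[date] = None) -> str:
    """Render the runbook header, one field per line."""
    today = today or date.today()
    lines = [
        f"Runbook: {title}",
        f"Service: {service}",
        f"Type: {runbook_type}",
        f"Last Updated: {today.isoformat()}",  # isoformat() gives YYYY-MM-DD
        f"Owner: {owner}",
    ]
    # Severity only applies to incident-type runbooks, so it is optional.
    if severity is not None:
        lines.append(f"Severity: {severity}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(render_header("Rotate API keys", "payment-service", "Maintenance", "Platform team"))
```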


Overview

What this runbook covers: [1–2 sentences on the scenario this runbook handles]

When to use this runbook:

  • [Specific trigger condition 1 — e.g. PagerDuty alert: high-error-rate-payment-service]
  • [Specific trigger condition 2 — e.g. Deploy needed after PR merged to main]

Estimated time to complete: [X minutes / X–Y minutes depending on outcome]

Impact if not completed correctly: [e.g. Payment processing degraded / Data loss risk / Users locked out]


Prerequisites

Access required:

  • [System/tool access — e.g. AWS Console: production-account]
  • [Credential — e.g. vault read secret/payment-service]
  • [VPN / bastion access if needed]

Tools required:

  • [Tool name and version — e.g. kubectl v1.28+]
  • [CLI or dashboard name]

Before you start:

  • [Prerequisite check — e.g. Verify current deployment is healthy in Grafana]
  • [Prerequisite action — e.g. Announce in #ops-live that you're starting]

Procedure

Number every step. Use exact commands. Do not paraphrase tool names or flags.

Step 1: [Action name]
[What you're doing and why, in one sentence]

# Exact command
[command here]

Expected output: [what should appear if this worked]
If this fails: [Exact error message to look for] → [What to do, or see Troubleshooting]

Step 2: [Action name]
[Same structure as Step 1]

Step 3: Verify
Always include a verification step after the main procedure:

[verification command]

Expected state: [What a healthy system looks like after this runbook completes]
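
The step structure above (exact command, expected output, explicit failure path) can be sketched as a small data type. This is an illustration only: the `Step` class and `run_step` helper are hypothetical, and the Python one-liner in the usage example stands in for a real operational command.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    command: list[str]    # exact argv, never a paraphrase like "run the deploy script"
    expected_output: str  # substring that should appear on success
    on_failure: str       # explicit next action if the step does not succeed


def run_step(step: Step) -> bool:
    """Run the command and report whether it exited cleanly with the expected output."""
    result = subprocess.run(step.command, capture_output=True, text=True, check=False)
    ok = result.returncode == 0 and step.expected_output in result.stdout
    if not ok:
        print(f"{step.name} failed. Next action: {step.on_failure}")
    return ok


if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere; a real step would call the actual tool.
    verify = Step(
        name="Step 3: Verify",
        command=[sys.executable, "-c", "print('deployment healthy')"],
        expected_output="deployment healthy",
        on_failure="See Troubleshooting",
    )
    print(run_step(verify))
```

Keeping the failure action on the step itself means the reader never has to decide what "investigate" means mid-incident.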


Rollback

How to undo this procedure if something went wrong:

Step R1: [Rollback action]

[rollback command]

Verify rollback: [command to confirm rollback succeeded]


Troubleshooting

Symptom | Likely Cause | Resolution
[Error message or observable symptom] | [Why this happens] | [Exact fix or next step]
[Another symptom] | [Cause] | [Resolution]

Escalation

If this runbook does not resolve the issue:

Condition | Who to Contact | How
[e.g. DB unavailable after 10 min] | [DBA on-call] | [PagerDuty policy: db-oncall]
[e.g. Payment provider unresponsive] | [Vendor contact] | [Contact in 1Password: vendor-escalation]

Always update the incident timeline in [tool] before escalating.


Post-Procedure Checklist

After completing the runbook:

  • Announce completion in #ops-live with outcome
  • Update the incident ticket / deploy log
  • Verify alerts have resolved in monitoring dashboard
  • If this revealed a gap in this runbook — update it now (link to edit process)

Quality Checks

  • Every step has an exact command (no "run the deploy script")
  • Expected output is specified for each step so the engineer knows whether it worked
  • Failure path is explicit for each step (not "if it fails, investigate")
  • Rollback procedure is complete and independently testable
  • Escalation table has no cells containing only "[Team name]" — every row must either have a real contact or be explicitly flagged as [FILL IN: on-call rotation link]
  • Rollback section contains at least one concrete command (not left as "[rollback command]" placeholder)
  • Runbook can be followed by someone who has never touched this system
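
Some of these checks can be partly automated. The sketch below is a minimal, hypothetical linter (the `lint_runbook` name and both lists are assumptions, not part of the skill) that flags leftover template placeholders and vague phrasing in a draft. An explicit flag such as "[FILL IN: on-call rotation link]" matches nothing in either list, so it passes.

```python
# Template placeholders that must not survive into a finished runbook.
UNFILLED_PLACEHOLDERS = [
    "[command here]",
    "[rollback command]",
    "[verification command]",
    "[Team name]",
]

# Vague phrasings named in the checks; illustrative, not exhaustive.
VAGUE_PHRASES = ["run the deploy script", "if it fails, investigate"]


def lint_runbook(text: str) -> list[str]:
    """Return human-readable problems found in a runbook draft."""
    problems = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        lowered = line.lower()
        for placeholder in UNFILLED_PLACEHOLDERS:
            if placeholder.lower() in lowered:
                problems.append(f"line {lineno}: unfilled placeholder {placeholder}")
        for phrase in VAGUE_PHRASES:
            if phrase in lowered:
                problems.append(f"line {lineno}: vague instruction '{phrase}'")
    return problems


if __name__ == "__main__":
    draft = "Step 1: run the deploy script\n[rollback command]"
    for problem in lint_runbook(draft):
        print(problem)
```

A check like this only catches mechanical gaps; whether someone new to the system can actually follow the runbook still needs a human read-through.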

Usage Examples

  • "Write a runbook for [service] deployment"
  • "Create an incident response runbook for [alert type]"
  • "I need a runbook for [procedure]"
  • "Document the operational procedure for [X]"
  • "Write an ops playbook for [scenario]"

Anti-Patterns

  • Do not write steps as vague actions like "run the deploy script" — every step must include the exact command
  • Do not leave the rollback section as a placeholder — a runbook without a tested rollback procedure is incomplete and dangerous
  • Do not omit expected output for each step — without it, the on-call engineer cannot tell if the step succeeded
  • Do not write escalation contacts as "[Team name]" — every escalation row must have a real contact or an explicit flag to fill in
  • Do not assume the reader knows the system — write for someone who has never touched it before
Info
Category Hardware Engineering
Name runbook-writer
Version v20260618
Size 5.81KB
Updated 2026-06-19