
CI/CD Playbook Generation

v20260618
cicd-playbook
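A skill for creating complete, actionable CI/CD pipeline playbooks. It documents the full software lifecycle, from code build, unit and integration tests, and security scanning through staged deployment to staging and production environments. The playbook gives clear process guidance, including required inputs, automated gates, manual approvals, and rollback procedures, and is a key tool for improving engineering documentation and new-hire onboarding.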

CI/CD Playbook Skill

Produce a complete, actionable CI/CD playbook for a service or team — covering everything a new engineer needs to understand, contribute to, and operate the pipeline safely.

A good playbook is not a diagram. It is a document that answers: what runs, when, why, who owns it, and what to do when it breaks.

Required Inputs

Ask for these if not already provided:

  • Service name and brief description
  • Tech stack — language, framework, containerisation (Docker, etc.)
  • Source control — GitHub / GitLab / Bitbucket, branching strategy
  • CI platform — GitHub Actions / CircleCI / Jenkins / BuildKite / other
  • CD platform / deployment target — Kubernetes, ECS, Lambda, Heroku, VMs, etc.
  • Environments — e.g. dev, staging, production (and any canary / feature environments)
  • Deployment frequency — how often does the team ship?
  • Any existing gates — manual approvals, smoke tests, feature flags
  • On-call setup — who's responsible during deploys?

Output Format


CI/CD Playbook: [Service Name]

Service: [Name] | Team: [Team name]
Last updated: [Date] | Owner: [Name / role]
Pipeline platform: [CI tool] → [CD tool / platform]


Overview

[2–3 sentences describing what this service does and why the CI/CD pipeline is structured the way it is. Include the deployment target and how frequently the team ships.]

Deployment frequency: [Multiple times per day / Daily / Weekly / On-demand]
Average pipeline duration: [X minutes]
Rollback time (p95): [X minutes]


Pipeline Stages

[Branch push]
    │
    ▼
[1. Build & Lint] ──fail──▶ ❌ Block PR
    │
    ▼
[2. Unit Tests] ──fail──▶ ❌ Block PR
    │
    ▼
[3. Integration Tests] ──fail──▶ ❌ Block PR
    │
    ▼
[4. Security Scan] ──fail──▶ ⚠️ [Block / Warn — specify]
    │
    ▼
[5. Build Artefact / Container Image]
    │
    ▼
[6. Deploy to Staging] ──fail──▶ ❌ Block promotion
    │
    ▼
[7. Smoke Tests (Staging)]
    │
    ▼
[8. Manual Approval Gate] ──(if required)
    │
    ▼
[9. Deploy to Production] ──fail──▶ 🔁 Auto-rollback (if configured)
    │
    ▼
[10. Post-deploy checks]
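
The flow above can be read as an ordered list of stages in which the first blocking failure stops the run, while a non-blocking stage (such as a security scan set to "Warn") only records a warning. The following is a minimal sketch of that control flow, not tied to any CI platform; the `Stage` class, the `run_pipeline` function, and the log strings are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Stage:
    name: str
    run: Callable[[], bool]  # returns True on success
    blocking: bool = True    # non-blocking stages only warn


def run_pipeline(stages: List[Stage]) -> Tuple[bool, List[str]]:
    """Run stages in order and stop at the first blocking failure."""
    log: List[str] = []
    for stage in stages:
        if stage.run():
            log.append(f"PASS {stage.name}")
        elif stage.blocking:
            log.append(f"FAIL {stage.name} (blocked)")
            return False, log
        else:
            log.append(f"WARN {stage.name}")
    return True, log
```

Later stages never run after a blocking failure, which mirrors the "Block PR" and "Block promotion" arrows in the diagram.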

Stage Definitions

Stage 1 — Build & Lint

What runs: [Build command] + [Linter, e.g. ESLint, golangci-lint, flake8]
Trigger: Every commit to any branch
Blocking: Yes. The PR cannot be merged if this fails
Typical duration: [X minutes]
Owner if it fails: PR author

Common failure causes:

  • [e.g. Missing dependency — run npm install locally before pushing]
  • [e.g. Lint rule violation: run npm run lint -- --fix to auto-fix most issues (the extra -- forwards --fix to the lint script)]

Stage 2 — Unit Tests

What runs: [Test command, e.g. npm test, go test ./..., pytest]
Coverage gate: [X]% minimum; the pipeline fails below this threshold
Trigger: Every commit
Blocking: Yes
Typical duration: [X minutes]

Coverage report: [Where to find it — e.g. uploaded to Codecov, available in CI artifacts]
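
As an illustration of the coverage gate, the check reduces to comparing a measured percentage against the configured minimum. This is a minimal sketch under the assumption that the coverage tool reports covered and total line counts; `coverage_gate` and its parameters are hypothetical names, and real tools usually provide an equivalent built-in threshold option.

```python
def coverage_gate(covered_lines: int, total_lines: int, minimum_percent: float) -> bool:
    """Return True when line coverage meets or exceeds the minimum threshold."""
    if total_lines <= 0:
        raise ValueError("total_lines must be positive")
    percent = 100.0 * covered_lines / total_lines
    return percent >= minimum_percent
```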


Stage 3 — Integration Tests

What runs: [Test suite description, e.g. "API integration tests against a test database using Docker Compose"]
Environment: [Ephemeral test environment / shared test DB / etc.]
Trigger: Every commit to main and feature branches targeting main
Blocking: Yes
Typical duration: [X minutes]

If slow: [e.g. "Integration tests can be skipped locally with SKIP_INTEGRATION=true — never skip in CI"]


Stage 4 — Security Scan

Tools: [e.g. Snyk, Trivy, OWASP Dependency Check, Semgrep]
What it checks: [Dependency vulnerabilities / SAST / secrets detection; list what applies]
Blocking on: Critical and High severity findings
Non-blocking on: Medium and Low (flagged, not blocking)
Trigger: Every commit to main

How to handle a flagged vulnerability:

  1. Check if a fix is available — upgrade the dependency
  2. If no fix available, open a security ticket and add a suppression with justification
  3. Never suppress without a ticket and owner
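
The blocking policy and the suppression rule above can be expressed as one filter over scan findings. The sketch below assumes findings have already been parsed from the scanner's report into simple records. The `Finding` fields and the `blocking_findings` function are hypothetical and do not correspond to any particular scanner's output format. Treating an unjustified suppression as a pipeline failure is also an assumption about how a team might enforce rule 3.

```python
from dataclasses import dataclass
from typing import List, Optional

BLOCKING_SEVERITIES = {"critical", "high"}


@dataclass
class Finding:
    id: str
    severity: str                 # "critical", "high", "medium" or "low"
    suppressed: bool = False
    ticket: Optional[str] = None  # required when suppressed
    owner: Optional[str] = None   # required when suppressed


def blocking_findings(findings: List[Finding]) -> List[str]:
    """Return the ids of findings that should fail the pipeline."""
    blocked: List[str] = []
    for finding in findings:
        if finding.suppressed:
            # A suppression without both a ticket and an owner is itself a failure.
            if not (finding.ticket and finding.owner):
                blocked.append(finding.id)
            continue
        if finding.severity.lower() in BLOCKING_SEVERITIES:
            blocked.append(finding.id)
    return blocked
```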

Stage 5 — Build Artefact

What is produced: [Docker image / binary / zip; be specific]
Registry: [ECR / GCR / Docker Hub / Artifactory, with URL]
Tagging convention: [service-name]:[git-sha] (also tagged :latest on main)
Trigger: Commits to main only (not feature branches)
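
The tagging convention can be sketched as a small helper, shown here only to make the rule concrete. The `image_tags` function and the `payments-api` service name in the usage note are hypothetical; in practice the CI platform's own variables would supply the SHA and branch.

```python
from typing import List


def image_tags(service_name: str, git_sha: str, branch: str) -> List[str]:
    """Build image tags as [service-name]:[git-sha], plus :latest, for main only."""
    if branch != "main":
        return []  # artefacts are built from main only, not feature branches
    return [f"{service_name}:{git_sha}", f"{service_name}:latest"]
```

For example, `image_tags("payments-api", "3f2a9c1", "main")` yields `payments-api:3f2a9c1` and `payments-api:latest`. The SHA tag is immutable, so it is the one to use for rollbacks.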


Stage 6 — Deploy to Staging

Deployment method: [e.g. Helm upgrade / kubectl apply / ecs deploy / Terraform apply]
Staging URL: [URL]
Trigger: Automatic on successful artefact build from main
Who can deploy to staging: Any engineer (automatic)

Environment variables: Managed in [Vault / AWS SSM / GitHub Secrets / etc.]
Staging is not production: [Any differences in config, scale, or data; state them here]


Stage 7 — Smoke Tests (Staging)

What runs: [Description, e.g. "10 critical path tests covering login, core API endpoints, and payment flow"]
Tool: [e.g. Playwright / Postman / custom script]
Pass criteria: All smoke tests pass within [X seconds] timeout
Blocking: Yes. The production deploy will not proceed if smoke tests fail

Smoke test suite location: [Link to test files or folder]


Stage 8 — Manual Approval Gate

Required for: [Production deploys / deploys affecting >X% of traffic / deploys to specific regions]
Who can approve: [e.g. Any engineer on the team / Lead engineer / On-call engineer]
Approval timeout: [e.g. 24 hours, auto-cancelled if no approval]
How to approve: [GitHub Actions approve step / Slack command / other, with link]

When to withhold approval:

  • Active incident in production
  • Deploy is outside the deployment window (see below)
  • On-call engineer has not been notified
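
The three withholding conditions above lend themselves to a checklist that an approver, or a bot posting to the approval thread, could evaluate. This is a minimal sketch. `ApprovalContext` and `reasons_to_withhold` are hypothetical names, and how each flag is populated (incident tracker, calendar, paging tool) is left as an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ApprovalContext:
    active_incident: bool
    inside_deploy_window: bool
    on_call_notified: bool


def reasons_to_withhold(ctx: ApprovalContext) -> List[str]:
    """Return every reason approval should be withheld; empty means approve."""
    reasons: List[str] = []
    if ctx.active_incident:
        reasons.append("active incident in production")
    if not ctx.inside_deploy_window:
        reasons.append("deploy is outside the deployment window")
    if not ctx.on_call_notified:
        reasons.append("on-call engineer has not been notified")
    return reasons
```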

Stage 9 — Deploy to Production

Deployment method: [Same as staging or different; specify]
Deployment window: [e.g. Monday–Thursday 09:00–16:00 UTC, with no deploys on Fridays or before bank holidays]
Canary / progressive rollout: [Yes, X% initial traffic, full rollout after Y minutes / No, full deploy]
Deployment notifications: [Slack channel, e.g. #deployments]

Who is on-call during deploy: Deploying engineer is responsible until post-deploy checks pass.
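
A deployment window is easier to enforce when it is written as an explicit check rather than left to memory. The sketch below encodes the example window from the template (Monday to Thursday, 09:00 to 16:00 UTC). The constants and the `in_deploy_window` function are hypothetical, the end of the window is treated as exclusive by assumption, and bank holidays are not handled.

```python
from datetime import datetime, time, timezone

WINDOW_DAYS = {0, 1, 2, 3}  # Monday=0 through Thursday=3
WINDOW_START = time(9, 0)
WINDOW_END = time(16, 0)    # exclusive


def in_deploy_window(moment: datetime) -> bool:
    """Return True if a timezone-aware moment falls inside the UTC deploy window."""
    if moment.tzinfo is None:
        raise ValueError("moment must be timezone-aware")
    utc = moment.astimezone(timezone.utc)
    return utc.weekday() in WINDOW_DAYS and WINDOW_START <= utc.time() < WINDOW_END
```

Requiring a timezone-aware value avoids silently interpreting a local timestamp as UTC.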


Stage 10 — Post-Deploy Checks

Automated checks (run for [X minutes] after deploy):

  • Error rate: <[X]% (baseline: [Y]%)
  • P99 latency: <[X]ms (baseline: [Y]ms)
  • [Key business metric]: within [X]% of baseline

Where to watch: [Datadog / Grafana / CloudWatch dashboard — link]

If a check fails: See Rollback Procedure below.
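
The automated checks above come in two shapes: an upper limit (error rate, P99 latency) and a tolerance band around a baseline (the business metric). The following is a minimal sketch of both comparisons. The function names and metric keys are hypothetical, the thresholds must come from real baseline data, and treating a missing metric as a failure is an assumption.

```python
from typing import Dict, List


def failed_checks(observed: Dict[str, float], limits: Dict[str, float]) -> List[str]:
    """Return metrics that are missing or not strictly below their limit."""
    failures: List[str] = []
    for name, limit in limits.items():
        value = observed.get(name)
        if value is None or value >= limit:
            failures.append(name)
    return failures


def within_baseline(value: float, baseline: float, tolerance_percent: float) -> bool:
    """Return True if value deviates from baseline by at most tolerance_percent."""
    # Cross-multiplied form avoids dividing by a zero baseline.
    return abs(value - baseline) * 100.0 <= tolerance_percent * abs(baseline)
```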


Environments

| Environment | Purpose | Deploy trigger | URL | Data |
| --- | --- | --- | --- | --- |
| Dev | Local development | Manual | localhost | Seeded test data |
| Staging | Pre-production validation | Automatic (main) | [URL] | Anonymised prod copy |
| Production | Live traffic | Manual approval | [URL] | Live data |

Branching Strategy

Model: [Trunk-based / GitFlow / GitHub Flow — describe briefly]

| Branch | Purpose | Who merges | Deploy target |
| --- | --- | --- | --- |
| main | Production-ready code | PR + review | Staging → Production |
| feature/* | Feature development | Author | None (CI only) |
| hotfix/* | Critical production fixes | Lead engineer | Can bypass staging gate with approval |

Hotfix process: [Describe when and how to use a hotfix branch — what level of incident justifies bypassing the standard process]
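
The branch-to-target mapping can be made explicit with glob patterns, which is roughly how most CI platforms filter branches. This is a minimal sketch using Python's standard `fnmatch` module. The rule table, the target labels, and the fallback for unmatched branches are assumptions introduced for illustration.

```python
from fnmatch import fnmatchcase

# Ordered rules: the first matching pattern wins.
BRANCH_RULES = [
    ("main", "staging_then_production"),
    ("feature/*", "ci_only"),
    ("hotfix/*", "hotfix_with_approval"),
]


def deploy_target(branch: str) -> str:
    """Return the deploy target label for a branch name."""
    for pattern, target in BRANCH_RULES:
        if fnmatchcase(branch, pattern):
            return target
    return "ci_only"  # assumption: unknown branches run CI but never deploy
```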


Rollback Procedure

Automated rollback: [Yes — triggered if post-deploy error rate exceeds [X]% / No — manual only]

Manual rollback steps:

# 1. Identify the last known good image tag
[command to list recent deployments]

# 2. Deploy the previous version
[deployment command with previous tag]

# 3. Confirm rollback is live
[smoke test command or health check URL]

# 4. Notify the team
[Slack command or template]

Rollback decision authority: Any engineer on-call can initiate a rollback without waiting for approval.
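
Step 1 of the manual procedure, identifying the last known good image tag, can be sketched as a search over deployment history. The example assumes the history is available as records ordered oldest to newest with a health flag. The `Deployment` record and the `last_known_good` function are hypothetical, and the real source of this data depends on the deployment platform.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Deployment:
    tag: str
    healthy: bool  # True if post-deploy checks passed for this release


def last_known_good(history: List[Deployment], current_tag: str) -> Optional[str]:
    """Return the newest healthy tag other than the current one, or None."""
    for deployment in reversed(history):
        if deployment.healthy and deployment.tag != current_tag:
            return deployment.tag
    return None
```

A `None` result means there is no safe version to roll back to, which should be escalated rather than guessed at.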

After a rollback:

  1. Create a post-deploy incident report (see [incident-postmortem skill])
  2. Do not re-deploy the same commit without fixing the root cause
  3. Notify [stakeholder / support team] of the rollback and expected fix timeline

Secrets and Configuration Management

Secret store: [Vault / AWS SSM / GitHub Secrets / Doppler; specify]
How to add a new secret:

  1. [Step 1]
  2. [Step 2]

Who has access: [Role or team]
Rotation policy: [How often secrets are rotated and who owns it]

Never do: Commit secrets to source control, even in .env files. The pipeline includes secret scanning (Stage 4) which will flag this.
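
To illustrate what secret scanning looks for, the sketch below matches two well-known shapes: AWS access key IDs (the `AKIA` prefix followed by 16 uppercase alphanumerics) and PEM private key headers. It is a toy example, not a substitute for a dedicated scanner. The `PATTERNS` table and the `scan_text` function are hypothetical, and real tools cover far more formats.

```python
import re
from typing import List, Tuple

PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_header": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
}


def scan_text(text: str) -> List[Tuple[int, str]]:
    """Return (line number, pattern name) for each suspected secret."""
    hits: List[Tuple[int, str]] = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, name))
    return hits
```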


Common Failures and Fixes

| Failure | Likely cause | Fix |
| --- | --- | --- |
| Build fails with "module not found" | Dependency not installed | Run [install command] and commit lock file |
| Integration tests time out | Test DB not seeded / external service down | Check [service] status; re-run pipeline |
| Smoke tests fail after staging deploy | Environment variable missing | Check [config location]; compare staging and prod env vars |
| Production deploy stuck at approval | Approver not notified | Tag @[on-call handle] in #deployments |
| Post-deploy error rate spike | Bad deploy / upstream dependency | Check [dashboard]; initiate rollback if the spike persists for more than 5 minutes |

On-Call Responsibilities During Deploy

  • The deploying engineer is responsible for monitoring post-deploy checks for [X minutes] after a production deploy
  • If you cannot monitor after deploying, hand off explicitly to another engineer in #deployments
  • For deploys outside business hours: only hotfixes — always page the on-call engineer before deploying

Anti-Patterns

  • Do not describe a rollback procedure that has never been tested — a theoretical rollback is not a rollback plan; test it in staging before production
  • Do not allow deploys on Fridays or before holidays without an explicit on-call engineer who will monitor through the weekend
  • Do not commit secrets to source control even in non-production branches — secret scanning in the pipeline catches this, but prevention is the standard
  • Do not skip post-deploy monitoring after a production deploy — the deploying engineer must watch error rates and latency for the specified observation window
  • Do not suppress a security scan finding without a linked ticket and a named owner — suppressions without accountability accumulate into unmanaged risk

Quality Checks

  • Every stage has a clear owner when it fails
  • Rollback procedure is tested — not theoretical
  • Secrets management section names the actual tool used (not "use secrets management")
  • Deployment window is specific — not "during business hours"
  • Post-deploy check thresholds are calibrated to actual baseline metrics
Info
Category: Programming & Development
Size: 11.25KB
Updated: 2026-06-19