Speak Incident Runbook

v20260311
speak-incident-runbook
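Centralized incident response procedures for Speak language learning service outages, covering rapid triage, mitigation, communication, and review, so the operations team can restore service quickly, notify stakeholders, and document the post-incident analysis.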

Speak Incident Runbook

Overview

Rapid incident response procedures for outages affecting the Speak language learning integration.

Prerequisites

  • Access to Speak dashboard and status page
  • kubectl access to production cluster
  • Prometheus/Grafana access
  • Communication channels (Slack, PagerDuty)
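
Before an incident it can help to confirm that the command-line side of these prerequisites is in place. The following is a minimal sketch, not part of the original runbook: the helper name missing_tools and the tool list (taken from the commands used in the Examples section) are assumptions.

```shell
set -eu

# Print every CLI tool from the argument list that is missing from PATH.
# Returns 0 when all tools are present, 1 otherwise.
missing_tools() {
  local missing=0 tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# Example preflight (run manually; needs cluster credentials):
#   missing_tools kubectl curl jq && kubectl auth can-i get deployments
```

The kubectl auth can-i call in the comment only asks the API server whether the current credentials may read deployments, so it is safe to run against production.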

Instructions

  1. Severity Levels
  2. Quick Triage
  3. Decision Tree
  4. Immediate Actions by Error Type
  5. Communication Templates
  6. Fallback Modes
  7. Post-Incident

For full implementation details, load: Read(${CLAUDE_SKILL_DIR}/references/implementation-guide.md)

Output

  • Issue identified and categorized
  • Mitigation applied
  • Stakeholders notified
  • Evidence collected for postmortem
  • Fallback modes enabled if needed

Error Handling

Issue                    Cause            Solution
Can't reach status page  Network issue    Use mobile or VPN
kubectl fails            Auth expired     Re-authenticate
Metrics unavailable      Prometheus down  Check backup metrics
Fallback not working     Cache empty      Pre-warm cache
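
The first three rows of the table above can be probed in one pass. This is a sketch under stated assumptions: STATUS_PAGE_URL and PROMETHEUS_URL are placeholder values to replace with your own, and run_check and check_access are hypothetical helper names. The /-/healthy path is Prometheus's standard health endpoint.

```shell
set -eu

# Placeholder endpoints (assumptions): override via the environment.
STATUS_PAGE_URL="${STATUS_PAGE_URL:-https://status.example.com}"
PROMETHEUS_URL="${PROMETHEUS_URL:-http://prometheus.example.internal:9090}"

# Run one named check and report PASS or FAIL without aborting the script.
run_check() {
  local name="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS $name"
  else
    echo "FAIL $name"
  fi
}

# One probe per access path the runbook depends on.
check_access() {
  run_check "status page" curl -sf --max-time 5 "$STATUS_PAGE_URL"
  run_check "kubectl auth" kubectl auth can-i get deployments
  run_check "prometheus" curl -sf --max-time 5 "$PROMETHEUS_URL/-/healthy"
}
```

Calling check_access prints one line per probe, so a FAIL line points at the matching row in the table.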

Examples

One-Line Health Check

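set -euo pipefail
# -e makes jq exit non-zero when the status is null or false, so a missing field also reports UNHEALTHY
curl -sf https://api.yourapp.com/health | jq -e '.services.speak.status' || echo "UNHEALTHY"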
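
A single failed probe can be a transient blip, so one option is to retry the health check a few times before declaring an outage. The sketch below assumes the same health endpoint and JSON path as the one-liner above; retry and speak_health are hypothetical helper names, and the attempt count and delay are illustrative values.

```shell
set -eu

# Run a command up to $1 times, sleeping $2 seconds between attempts.
# Returns 0 on the first success, 1 if every attempt fails.
retry() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Same probe as the one-line check, with a 5 second timeout.
# jq -e exits non-zero when the status is null, false, or absent.
speak_health() {
  curl -sf --max-time 5 https://api.yourapp.com/health \
    | jq -e '.services.speak.status' >/dev/null
}

# Usage: retry 3 2 speak_health || echo "UNHEALTHY"
```

Keeping retry generic means the same wrapper can guard other flaky probes, for example a kubectl call during an API server brownout.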

Quick Fallback Toggle

set -euo pipefail
# Enable fallback
kubectl set env deployment/speak-integration SPEAK_FALLBACK_MODE=true

# Disable fallback (restore normal)
kubectl set env deployment/speak-integration SPEAK_FALLBACK_MODE-
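
Changing an environment variable on a deployment triggers a rollout, so the toggle only takes effect once the new pods are up. A possible wrapper, assuming the same deployment name as above, waits for the rollout and reads the value back. The helper names fallback_mode and set_fallback and the 120 second timeout are assumptions.

```shell
set -eu

DEPLOYMENT="deployment/speak-integration"

# Print the current value of SPEAK_FALLBACK_MODE on the deployment, or "unset".
fallback_mode() {
  kubectl set env "$DEPLOYMENT" --list \
    | sed -n 's/^SPEAK_FALLBACK_MODE=//p' \
    | grep . || echo "unset"
}

# Toggle fallback, then block until the resulting rollout finishes.
set_fallback() {
  case "$1" in
    on)  kubectl set env "$DEPLOYMENT" SPEAK_FALLBACK_MODE=true ;;
    off) kubectl set env "$DEPLOYMENT" SPEAK_FALLBACK_MODE- ;;
    *)   echo "usage: set_fallback on|off" >&2; return 2 ;;
  esac
  kubectl rollout status "$DEPLOYMENT" --timeout=120s
}

# Usage: set_fallback on && fallback_mode
```

Reading the value back with fallback_mode gives a quick confirmation to paste into the incident channel.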

Next Steps

For data handling, see speak-data-handling.

Info
Category Programming & Development
Name speak-incident-runbook
Version v20260311
Size 4.71KB
Updated 2026-03-12