OpenEvidence Incident Runbook

v20260311
openevidence-incident-runbook
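Step-by-step guidance for severity triage, troubleshooting, fallback, and postmortems when OpenEvidence clinical AI fails, covering rapid assessment, error-specific fixes, graceful degradation, and notifying clinical teams.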

OpenEvidence Incident Runbook

Overview

Rapid incident response procedures for OpenEvidence clinical AI integration outages in healthcare environments. Includes severity classification, triage steps, error-specific remediation, fallback procedures, and postmortem templates.

Prerequisites

  • Access to OpenEvidence status page
  • kubectl access to production cluster
  • Prometheus/Grafana access
  • PagerDuty or on-call system access

Severity Levels

| Level | Definition | Response Time | Examples |
| --- | --- | --- | --- |
| P1 | Complete outage | < 15 min | API unreachable, all queries failing |
| P2 | Degraded service | < 1 hour | High latency, partial failures |
| P3 | Minor impact | < 4 hours | DeepConsult delays, webhook issues |
| P4 | No user impact | Next business day | Alert noise, logging issues |
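
The severity table can be encoded as a small lookup so that scripts and responders read the same response targets. This is a minimal sketch in POSIX shell; the function name `response_target` is hypothetical and not part of any OpenEvidence tooling.

```shell
# Hypothetical helper: map a severity level from the table above
# to its response-time target.
response_target() {
  case "${1:-}" in
    P1) echo "< 15 min" ;;
    P2) echo "< 1 hour" ;;
    P3) echo "< 4 hours" ;;
    P4) echo "Next business day" ;;
    *)  echo "unknown severity: ${1:-}" >&2; return 1 ;;
  esac
}

# Example call: print the target for a complete outage.
response_target P1
```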

Critical Note

OpenEvidence outages may affect clinical decision-making. Always communicate clearly with clinical staff, ensure fallback procedures are known, and document any clinical impact.

Instructions

Step 1: Initial Assessment (2 minutes)

Check OpenEvidence status page, your integration health endpoint, error rate metrics, and recent error logs.
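
For the "recent error logs" part of this check, a rough error rate can be computed from log lines. The sketch below assumes each log line carries the upstream HTTP status as `status=NNN`; that format and the function name `error_rate` are assumptions, so adjust the pattern to whatever your integration actually logs.

```shell
# Assumption: log lines contain the HTTP status as "status=NNN".
# Reads log lines on stdin and prints "<errors> <total> <percent>",
# counting any status >= 400 as an error. Lines without a status are ignored.
error_rate() {
  awk '
    match($0, /status=[0-9][0-9][0-9]/) {
      total++
      code = substr($0, RSTART + 7, 3) + 0   # skip the 7 characters of "status="
      if (code >= 400) errors++
    }
    END {
      pct = (total > 0) ? int(errors * 100 / total) : 0
      printf "%d %d %d\n", errors, total, pct
    }
  '
}
```

One possible use is piping recent pod logs into it, for example `kubectl logs deployment/clinical-evidence-api --since=5m | error_rate`, where the deployment name is taken from the fallback example in this runbook.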

Step 2: Follow Decision Tree

  • API errors + OpenEvidence status incident -> Enable fallback, wait for resolution
  • API errors + no status incident -> Check credentials, config, network
  • No API errors + unhealthy service -> Infrastructure issue (pods, memory, network)
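
The three branches above can be written down as a function so the on-call engineer gets the same answer every time. This is a sketch only: the name `triage_decision`, the yes/no argument convention, and the final "keep monitoring" branch are assumptions added for illustration.

```shell
# Hypothetical helper encoding the decision tree above.
# Arguments: api_errors (yes|no), status_incident (yes|no), service_healthy (yes|no)
triage_decision() {
  api_errors="${1:-}"
  status_incident="${2:-}"
  service_healthy="${3:-}"
  if [ "$api_errors" = yes ] && [ "$status_incident" = yes ]; then
    echo "Enable fallback, wait for resolution"
  elif [ "$api_errors" = yes ]; then
    echo "Check credentials, config, network"
  elif [ "$service_healthy" = no ]; then
    echo "Infrastructure issue (pods, memory, network)"
  else
    # Not in the runbook: assumed default when no branch matches.
    echo "No branch matched; keep monitoring"
  fi
}
```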

Step 3: Apply Error-Specific Remediation

  • 401/403: Verify API key, rotate if needed, restart pods
  • 429: Enable request queuing, contact OpenEvidence for limit increase
  • 500/503: Enable graceful degradation, notify clinical staff
  • Timeout: Increase timeout temporarily, check network latency
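
A small classifier can turn an observed HTTP status (or the literal word `timeout`) into the error class and remediation listed above. The function name `classify_error` and the class labels are hypothetical; only the status codes and remediation text come from this runbook.

```shell
# Hypothetical helper: print "<class>: <remediation>" for a status code.
classify_error() {
  case "${1:-}" in
    401|403) echo "auth: Verify API key, rotate if needed, restart pods" ;;
    429)     echo "rate_limit: Enable request queuing, contact OpenEvidence for limit increase" ;;
    500|503) echo "server: Enable graceful degradation, notify clinical staff" ;;
    timeout) echo "timeout: Increase timeout temporarily, check network latency" ;;
    *)       echo "unknown: no documented remediation for '${1:-}'" >&2; return 1 ;;
  esac
}
```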

Step 4: Enable Fallback

Return a helpful message that directs users to UpToDate, DynaMed, or the clinical guidelines directly.
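
As an illustration of how a fallback switch might behave, the sketch below checks the `OPENEVIDENCE_FALLBACK` variable used elsewhere in this runbook and returns a static message when it is `true`. The function names, the message wording, and the placeholder for the real API call are assumptions, not the actual integration code.

```shell
# Assumed wording for the fallback message shown to clinicians.
fallback_message() {
  echo "OpenEvidence is temporarily unavailable. Please consult UpToDate, DynaMed, or clinical guidelines directly."
}

# Hypothetical request handler: short-circuits to the fallback message
# when OPENEVIDENCE_FALLBACK is "true"; unset is treated as "false".
answer_query() {
  if [ "${OPENEVIDENCE_FALLBACK:-false}" = "true" ]; then
    fallback_message
    return 0
  fi
  # Placeholder: the real OpenEvidence call is outside this sketch.
  echo "forwarding query to OpenEvidence: ${1:-}"
}
```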

Step 5: Communicate

Notify clinical staff via Slack/Teams and email. Update status page.
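
A consistent notice makes it easier for clinical staff to act quickly. The sketch below only builds the message text; the function name `incident_notice` and the wording are assumptions, and delivery through Slack, Teams, or email is left to whatever tooling you already use.

```shell
# Hypothetical helper: build a notification for clinical staff.
# Arguments: severity, short summary, fallback state (on|off)
incident_notice() {
  printf '[%s] OpenEvidence incident: %s\n' "${1:-}" "${2:-}"
  if [ "${3:-off}" = on ]; then
    printf 'Fallback is enabled. Use UpToDate, DynaMed, or clinical guidelines directly.\n'
  fi
  printf 'Please report any clinical impact to the on-call engineer.\n'
}
```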

Step 6: Post-Incident

Collect evidence (logs, metrics, alerts), run postmortem with clinical impact assessment, create action items.
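
To keep postmortems uniform, a skeleton can be generated at the start of the review. The section headings below are assumptions based on the items this step names (evidence, clinical impact, action items), and `postmortem_template` is a hypothetical helper.

```shell
# Hypothetical generator for a postmortem skeleton.
# Arguments: incident title, severity level
postmortem_template() {
  cat <<EOF
Postmortem: ${1:-untitled incident}
Severity: ${2:-unknown}

Timeline:
Clinical impact assessment:
Evidence (logs, metrics, alerts):
Root cause:
Action items:
EOF
}
```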

Output

  • Quick triage procedure completed
  • Issue identified and categorized
  • Remediation applied
  • Clinical staff notified
  • Evidence collected for postmortem

Error Handling

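| Error Type | Quick Fix |
| --- | --- |
| 401/403 Auth | kubectl create secret with the new key, then kubectl rollout restart deployment/clinical-evidence-api |
| 429 Rate Limit | kubectl set env deployment/clinical-evidence-api RATE_LIMIT_MODE=queue |
| 500/503 Server | kubectl set env deployment/clinical-evidence-api OPENEVIDENCE_FALLBACK=true |
| Timeout | kubectl set env deployment/clinical-evidence-api OPENEVIDENCE_TIMEOUT=60000 |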
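
During an incident it can help to print the exact command for review before running it. The sketch below is a dry-run helper that only echoes commands; it assumes the deployment is named `clinical-evidence-api`, as in the fallback example in this runbook, and the function name `quick_fix` and its error-type labels are hypothetical.

```shell
# Assumption: target deployment name, taken from the fallback example.
DEPLOYMENT="deployment/clinical-evidence-api"

# Hypothetical dry-run helper: prints the quick-fix command, never executes it.
quick_fix() {
  case "${1:-}" in
    # For auth errors, create the secret with the new key first, then restart.
    auth)       echo "kubectl rollout restart $DEPLOYMENT" ;;
    rate_limit) echo "kubectl set env $DEPLOYMENT RATE_LIMIT_MODE=queue" ;;
    server)     echo "kubectl set env $DEPLOYMENT OPENEVIDENCE_FALLBACK=true" ;;
    timeout)    echo "kubectl set env $DEPLOYMENT OPENEVIDENCE_TIMEOUT=60000" ;;
    *)          echo "unknown error type: ${1:-}" >&2; return 1 ;;
  esac
}
```

Printing rather than executing keeps a human in the loop, which seems a reasonable default when the change affects a clinical system.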

Examples

One-Line Health Check

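set -euo pipefail
# Prints the reported status, or UNHEALTHY if the endpoint is unreachable, returns an HTTP error, or takes longer than 10 seconds
curl -sf --max-time 10 https://api.yourhealthcare.com/health/openevidence | jq -r '.status' || echo "UNHEALTHY"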

Enable/Disable Fallback

set -euo pipefail
kubectl set env deployment/clinical-evidence-api OPENEVIDENCE_FALLBACK=true   # Enable
kubectl set env deployment/clinical-evidence-api OPENEVIDENCE_FALLBACK=false  # Disable

See detailed implementation for advanced patterns.

Info
Category: Programming & Development
Name: openevidence-incident-runbook
Version: v20260311
Size: 3.24KB
Updated: 2026-03-12