Deepgram Incident Runbook

v20260311
deepgram-incident-runbook
Guides engineers through triage, severity classification, fallback activation, and post-incident review for Deepgram outages, with steps for API checks, retries, and communications during production failures.

Overview

Standardized procedures for responding to Deepgram-related incidents: an initial triage script, severity-based response (SEV1-SEV4), fallback activation, degradation investigation, and a post-incident review template.

Prerequisites

  • Monitoring and alerting configured
  • On-call rotation established
  • Fallback/queueing system available
  • Communication channels defined

Instructions

Step 1: Run Initial Triage (First 5 Minutes)

Run the triage script. It checks the Deepgram status page, queries the error rate from Prometheus, checks P95 latency, and tests API connectivity with curl.
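The triage checks could be sketched in Python with only the standard library, as below. This is a minimal sketch, not the skill's own script. The Prometheus address and the metric names (`deepgram_requests_total`, `deepgram_request_duration_seconds_bucket`) are hypothetical placeholders for whatever your instrumentation exports. The connectivity check stands in for the curl test and assumes a `GET` on `https://api.deepgram.com/v1/projects` with an `Authorization: Token <key>` header.

```python
import json
import urllib.error
import urllib.parse
import urllib.request

# Hypothetical values: adjust to your own environment and metric names.
PROMETHEUS_URL = "http://prometheus.internal:9090"
ERROR_RATE_QUERY = (
    'sum(rate(deepgram_requests_total{status="error"}[5m]))'
    " / sum(rate(deepgram_requests_total[5m]))"
)
P95_LATENCY_QUERY = (
    "histogram_quantile(0.95, "
    "sum(rate(deepgram_request_duration_seconds_bucket[5m])) by (le))"
)


def prometheus_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def parse_instant_value(payload):
    """Return the first sample of an instant-query response, or None if empty."""
    if payload.get("status") != "success":
        return None
    result = payload.get("data", {}).get("result", [])
    if not result:
        return None
    # Each vector sample is [unix_timestamp, "value-as-string"].
    return float(result[0]["value"][1])


def query_prometheus(promql, base_url=PROMETHEUS_URL, timeout_s=10):
    """Run one instant query and return its first value (None if no data)."""
    url = prometheus_query_url(base_url, promql)
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return parse_instant_value(json.load(resp))


def check_api_connectivity(api_key, timeout_s=10):
    """Return the HTTP status from Deepgram, or None if it is unreachable."""
    request = urllib.request.Request(
        "https://api.deepgram.com/v1/projects",
        headers={"Authorization": f"Token {api_key}"},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout_s) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # reachable, but the request was rejected (e.g. 401)
    except (urllib.error.URLError, TimeoutError):
        return None  # DNS, TLS, or network failure
```

A status of 401 from `check_api_connectivity` points at the API key rather than at Deepgram, while `None` points at the network path; both are worth ruling out before declaring an outage.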

Step 2: Classify Severity

  • SEV1 (respond immediately): 100% failure, 5xx errors
  • SEV2 (respond within 15 min): error rate of 50% or more
  • SEV3 (respond within 1 hr): elevated latency
  • SEV4 (respond within 24 hr): single feature affected
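The severity thresholds above can be encoded as a small classifier so that triage output maps directly to a level. This is a sketch under assumptions: the runbook gives no numeric definition of "elevated latency", so the `latency_factor` of 2x a baseline P95 is a hypothetical choice, as are the function and parameter names.

```python
from typing import Optional


def classify_severity(
    error_rate: float,
    p95_latency_s: float,
    baseline_p95_s: float,
    broken_features: int = 0,
    latency_factor: float = 2.0,  # assumption: "elevated" means > 2x baseline
) -> Optional[str]:
    """Map triage measurements to SEV1-SEV4, or None when nothing is wrong."""
    if error_rate >= 1.0:
        return "SEV1"  # 100% failure
    if error_rate >= 0.5:
        return "SEV2"  # 50%+ error rate
    if p95_latency_s > latency_factor * baseline_p95_s:
        return "SEV3"  # elevated latency
    if broken_features >= 1:
        return "SEV4"  # a feature is affected, core traffic is healthy
    return None
```

Checks run from most to least severe, so an incident that matches several conditions is always reported at its highest level.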

Step 3: Respond to SEV1 (Complete Outage)

Acknowledge the incident in PagerDuty or Slack. Verify that the API key is valid and check network connectivity. Activate the fallback: queue requests for later replay, or switch to a backup STT provider. Notify affected customers.
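The "queue requests for later replay" fallback might look like the following in-memory sketch. The `FallbackQueue` class and the `send` callback are hypothetical names; a production system would more likely use a durable queue so that pending requests survive a restart.

```python
from collections import deque
from typing import Callable, Deque


class FallbackQueue:
    """Holds transcription requests while the API is down, for later replay."""

    def __init__(self) -> None:
        self._pending: Deque[dict] = deque()

    def __len__(self) -> int:
        return len(self._pending)

    def enqueue(self, request: dict) -> None:
        """Store a request instead of sending it to the failing API."""
        self._pending.append(request)

    def replay(self, send: Callable[[dict], bool]) -> int:
        """Try each pending request once; keep the ones that still fail.

        `send` returns True on success. Returns the number replayed.
        """
        replayed = 0
        for _ in range(len(self._pending)):
            request = self._pending.popleft()
            if send(request):
                replayed += 1
            else:
                self._pending.append(request)  # retry on the next replay pass
        return replayed
```

Replaying in a single bounded pass avoids spinning forever on a request that keeps failing while the API is still recovering.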

Step 4: Respond to SEV2 (Major Degradation)

Test transcription across multiple samples and models. Identify whether a specific model, feature, or audio type is affected. Mitigate: reduce the request rate, disable non-critical features, switch models, and enable retries.
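One way to narrow down what is affected is to run every sample against every model and compare failure rates per dimension. The sketch below assumes a caller-supplied `transcribe(sample, model)` function that raises on failure; that function and the helper names are hypothetical and not part of any Deepgram SDK.

```python
from collections import defaultdict
from itertools import product


def run_matrix(samples, models, transcribe):
    """Call transcribe(sample, model) for every combination and record success."""
    results = []
    for sample, model in product(samples, models):
        try:
            transcribe(sample, model)
            ok = True
        except Exception:
            ok = False
        results.append({"sample": sample, "model": model, "ok": ok})
    return results


def failure_rate_by(results, key):
    """Group results by `key` ("sample" or "model") and return failure rates."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for result in results:
        totals[result[key]] += 1
        if not result["ok"]:
            failures[result[key]] += 1
    return {name: failures[name] / totals[name] for name in totals}
```

If `failure_rate_by(results, "model")` shows one model near 1.0 and the rest near 0.0, switching models is the likely mitigation; a uniform failure rate suggests the problem is not model-specific.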

Step 5: Respond to SEV3 (Minor Degradation)

Increase timeouts to 60 s, enable aggressive retries (5 attempts), switch to a simpler model (Nova), and disable diarization. Monitor for improvement.
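The retry policy above can be sketched as a small helper with exponential backoff. The backoff schedule (1 s doubling up to 30 s) is an assumption, since the runbook specifies only the attempt count; `DEGRADED_SETTINGS` is a hypothetical container for the 60 s timeout and the `model` and `diarize` request options.

```python
import time

# Hypothetical settings bundle for degraded mode; values follow the runbook.
DEGRADED_SETTINGS = {
    "timeout_s": 60,
    "params": {"model": "nova", "diarize": "false"},
}


def call_with_retries(call, attempts=5, base_delay_s=1.0, max_delay_s=30.0,
                      sleep=time.sleep):
    """Invoke `call()` up to `attempts` times, backing off between failures.

    Re-raises the last exception if every attempt fails.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return call()
        except Exception as exc:
            last_error = exc
            if attempt < attempts - 1:
                # Delays grow as 1s, 2s, 4s, 8s, capped at max_delay_s.
                sleep(min(max_delay_s, base_delay_s * 2 ** attempt))
    raise last_error
```

Passing `sleep` as a parameter keeps the helper testable without real delays. Note that aggressive retries add load, so they fit a latency problem better than an overloaded upstream.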

Step 6: Conduct Post-Incident Review

Document timeline, root cause, impact (duration, failed requests, revenue). List what went well and areas for improvement. Create action items with owners and due dates.
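The review fields listed above could be captured in a small structure that renders a plain-text report. This is an illustrative sketch; the class and field names are hypothetical, and teams that already have a review template should keep using it.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class ActionItem:
    description: str
    owner: str
    due: str  # due date as text, e.g. an ISO date


@dataclass
class IncidentReview:
    title: str
    started: datetime
    resolved: datetime
    root_cause: str
    failed_requests: int
    revenue_impact: str = "unknown"
    timeline: List[str] = field(default_factory=list)
    went_well: List[str] = field(default_factory=list)
    improvements: List[str] = field(default_factory=list)
    actions: List[ActionItem] = field(default_factory=list)

    def duration_minutes(self) -> int:
        """Whole minutes between start and resolution."""
        return int((self.resolved - self.started).total_seconds() // 60)

    def render(self) -> str:
        """Return the review as plain text, one section per heading."""
        lines = [
            f"Post-Incident Review: {self.title}",
            f"Duration: {self.duration_minutes()} min",
            f"Failed requests: {self.failed_requests}",
            f"Revenue impact: {self.revenue_impact}",
            f"Root cause: {self.root_cause}",
        ]
        for heading, items in (
            ("Timeline", self.timeline),
            ("What went well", self.went_well),
            ("Areas for improvement", self.improvements),
        ):
            lines.append(heading)
            lines.extend(f"  - {item}" for item in items)
        lines.append("Action items")
        lines.extend(
            f"  - {a.description} (owner: {a.owner}, due: {a.due})"
            for a in self.actions
        )
        return "\n".join(lines)
```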

Output

  • Automated triage script
  • Severity classification guide
  • Fallback activation procedures
  • Degradation investigation playbook
  • Post-incident review template

Error Handling

Issue | Cause | Solution
All transcriptions failing | API outage | Activate fallback queue
50%+ error rate | Partial degradation | Test models, reduce features
Elevated latency | Overload | Increase timeouts, reduce rate
Single feature broken | API regression | Disable feature, report to Deepgram

Examples

Quick Reference

Resource | URL
Deepgram Status | https://status.deepgram.com
Deepgram Console | https://console.deepgram.com
Support | support@deepgram.com

Severity Levels

Level | Definition | Response Time
SEV1 | Complete outage | Immediate
SEV2 | Major degradation | < 15 min
SEV3 | Minor degradation | < 1 hour
SEV4 | Minor issue | < 24 hours

Escalation Contacts

Level | Contact | When
L1 | On-call engineer | First response
L2 | Team lead | 15 min without resolution
L3 | Deepgram support | Confirmed Deepgram issue
L4 | Engineering director | SEV1 > 1 hour
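The escalation rules in this table can be expressed as a lookup that tells the on-call engineer who should be involved at a given point. This is a sketch with hypothetical function and parameter names; it assumes the incident is still unresolved when called.

```python
from typing import List


def escalation_levels(severity: str, minutes_elapsed: float,
                      deepgram_confirmed: bool = False) -> List[str]:
    """Return the escalation levels that should be engaged right now."""
    levels = ["L1"]  # on-call engineer is always the first responder
    if minutes_elapsed >= 15:
        levels.append("L2")  # team lead after 15 min without resolution
    if deepgram_confirmed:
        levels.append("L3")  # Deepgram support once their side is confirmed
    if severity == "SEV1" and minutes_elapsed > 60:
        levels.append("L4")  # engineering director for SEV1 past 1 hour
    return levels
```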

Info
Category: Development
Name: deepgram-incident-runbook
Version: v20260311
Size: 4.11KB
Updated At: 2026-03-12