Standardized procedures for responding to Deepgram-related incidents: an initial triage script, severity-based response (SEV1-SEV4), fallback activation, degradation investigation, and post-incident review templates.
Execute triage script: check Deepgram status page, query error rate from Prometheus, check P95 latency, and test API connectivity with curl.
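
The Prometheus and connectivity checks above could be sketched as follows. This is a minimal illustration, not the original triage script: the Prometheus address and the metric names (`deepgram_requests_total`, `deepgram_request_duration_seconds_bucket`) are assumptions to replace with whatever your instrumentation exports, and the connectivity probe stands in for the curl check by issuing an authenticated GET against Deepgram's projects endpoint.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

# Assumptions: the Prometheus address and metric names are placeholders.
PROMETHEUS_URL = "http://localhost:9090"
ERROR_RATE_QUERY = (
    'sum(rate(deepgram_requests_total{status="error"}[5m]))'
    " / sum(rate(deepgram_requests_total[5m]))"
)
P95_LATENCY_QUERY = (
    "histogram_quantile(0.95, "
    "sum(rate(deepgram_request_duration_seconds_bucket[5m])) by (le))"
)


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a Prometheus HTTP API instant-query URL."""
    query = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


def first_sample_value(payload: dict) -> Optional[float]:
    """Return the first sample of an instant-query response, or None if empty."""
    result = payload.get("data", {}).get("result", [])
    return float(result[0]["value"][1]) if result else None


def query_prometheus(promql: str, timeout: float = 5.0) -> Optional[float]:
    """Run one instant query against PROMETHEUS_URL."""
    url = instant_query_url(PROMETHEUS_URL, promql)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return first_sample_value(json.load(resp))


def deepgram_reachable(api_key: str, timeout: float = 5.0) -> bool:
    """Connectivity probe: True if an authenticated GET returns HTTP 200."""
    req = urllib.request.Request(
        "https://api.deepgram.com/v1/projects",
        headers={"Authorization": f"Token {api_key}"},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # URLError, HTTPError, and socket timeouts are OSError subclasses
        return False
```

During triage, `query_prometheus(ERROR_RATE_QUERY)` and `query_prometheus(P95_LATENCY_QUERY)` would supply the two numbers, while the status page is still checked by hand.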
SEV1 (respond immediately): 100% failure or 5xx errors. SEV2 (respond within 15 min): error rate of 50% or more. SEV3 (respond within 1 hour): elevated latency. SEV4 (respond within 24 hours): single feature affected.
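
As a rough illustration, the severity rules above can be encoded as a small classifier. The function name, its parameters, and the idea of passing "latency elevated" as a precomputed flag are assumptions made for this sketch; what counts as elevated latency is left to your own alert thresholds.

```python
from typing import Optional

# Response-time targets in minutes, taken from the severity definitions.
RESPONSE_TIME_MINUTES = {"SEV1": 0, "SEV2": 15, "SEV3": 60, "SEV4": 24 * 60}


def classify_severity(
    error_rate: float,
    latency_elevated: bool = False,
    single_feature_affected: bool = False,
) -> Optional[str]:
    """Map observed symptoms to a severity level, most severe first.

    error_rate is a fraction between 0.0 and 1.0. Returns None when no
    rule matches, meaning no incident needs to be declared.
    """
    if error_rate >= 1.0:
        return "SEV1"
    if error_rate >= 0.5:
        return "SEV2"
    if latency_elevated:
        return "SEV3"
    if single_feature_affected:
        return "SEV4"
    return None
```

Checking the most severe condition first means an incident that matches several rules is always reported at its highest level.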
Acknowledge in PagerDuty/Slack. Verify API key validity. Check network. Activate fallback: queue requests for later replay, or switch to backup STT provider. Notify affected customers.
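
The two fallback options above (queue for later replay, or switch to a backup STT provider) might be wired together as in the sketch below. `FallbackTranscriber` and its callables are hypothetical names introduced for illustration; the primary and backup functions stand in for whatever client code actually calls Deepgram and the backup provider.

```python
from collections import deque
from typing import Callable, Deque, List, Optional

Transcribe = Callable[[bytes], str]


class FallbackTranscriber:
    """Route requests to a backup provider, or queue them, while fallback is active."""

    def __init__(self, primary: Transcribe, backup: Optional[Transcribe] = None):
        self.primary = primary
        self.backup = backup
        self.fallback_active = False
        self.pending: Deque[bytes] = deque()

    def transcribe(self, audio: bytes) -> Optional[str]:
        if not self.fallback_active:
            return self.primary(audio)
        if self.backup is not None:
            return self.backup(audio)
        # No backup configured: hold the request for later replay.
        self.pending.append(audio)
        return None

    def replay(self) -> List[str]:
        """Replay queued requests through the primary once it has recovered.

        If the primary still raises, the failing request stays queued and
        fallback remains active.
        """
        results: List[str] = []
        while self.pending:
            results.append(self.primary(self.pending[0]))
            self.pending.popleft()
        self.fallback_active = False
        return results
```

Setting `fallback_active = True` is the manual "activate fallback" step. An in-memory queue is lost on restart, so a durable queue would be a more realistic choice in production.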
Test transcription across multiple samples and models. Identify whether a specific model, feature, or audio type is affected. Mitigate: reduce request rate, disable non-critical features, switch models, enable retries.
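
One way to narrow down what is affected is to record each test transcription as a probe and compute failure rates per model, feature, and audio type. The sketch below assumes the probes have already been run; `Probe`, `failure_rates`, and `suspects` are hypothetical names, and the model and feature values in the usage note are only examples.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Probe(NamedTuple):
    """Outcome of one test transcription."""
    model: str
    feature: str
    audio_type: str
    ok: bool


def failure_rates(probes: Iterable[Probe]) -> Dict[Tuple[str, str], float]:
    """Failure rate for every (dimension, value) pair seen in the probes."""
    totals: Dict[Tuple[str, str], int] = defaultdict(int)
    failures: Dict[Tuple[str, str], int] = defaultdict(int)
    for probe in probes:
        for dimension in ("model", "feature", "audio_type"):
            key = (dimension, getattr(probe, dimension))
            totals[key] += 1
            if not probe.ok:
                failures[key] += 1
    return {key: failures[key] / totals[key] for key in totals}


def suspects(probes: Iterable[Probe], threshold: float = 1.0) -> List[Tuple[str, str]]:
    """Dimension values whose failure rate is at or above the threshold."""
    rates = failure_rates(probes)
    return sorted(key for key, rate in rates.items() if rate >= threshold)
```

For example, if every probe with `feature="diarize"` fails while each model and audio type fails only half the time, `suspects` returns `[("feature", "diarize")]`, which points at disabling that feature as the mitigation.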
Increase timeouts to 60s, enable aggressive retry (5 attempts), switch to simpler model (Nova), disable diarization. Monitor for improvement.
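
A minimal sketch of these degraded-mode settings and the five-attempt retry follows. The settings keys, the exponential backoff schedule, and the `call_with_retries` helper are assumptions for illustration; the model name is copied from the text and should be replaced with the identifier your account actually uses.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Degraded-mode values from the text; the key names are hypothetical.
DEGRADED_SETTINGS = {
    "timeout_seconds": 60,
    "max_attempts": 5,
    "model": "nova",  # as named in the text; check the exact model identifier
    "diarize": False,
}


def call_with_retries(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn up to max_attempts times with exponential backoff.

    Waits base_delay, then 2x, 4x, and so on between attempts, and
    re-raises the last exception once attempts are exhausted.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Retrying every exception is deliberately blunt. In practice you would likely retry only timeouts and 5xx responses, since retrying authentication or validation errors only adds load during an incident.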
Document timeline, root cause, impact (duration, failed requests, revenue). List what went well and areas for improvement. Create action items with owners and due dates.
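
If the review is kept as structured data, the fields above can be captured in a small record that also derives the incident duration. This is only a sketch: the class and field names are hypothetical, and the rendered layout is one possible plain-text form rather than the original template.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List


@dataclass
class ActionItem:
    description: str
    owner: str
    due: str


@dataclass
class PostIncidentReview:
    started: datetime
    resolved: datetime
    root_cause: str
    failed_requests: int
    revenue_impact: str = "unknown"
    timeline: List[str] = field(default_factory=list)
    went_well: List[str] = field(default_factory=list)
    improvements: List[str] = field(default_factory=list)
    action_items: List[ActionItem] = field(default_factory=list)

    @property
    def duration(self) -> timedelta:
        """Incident duration, derived from the start and resolution times."""
        return self.resolved - self.started

    def render(self) -> str:
        """Render the review as plain text, one section per field."""
        lines = [
            f"Duration: {self.duration}",
            f"Root cause: {self.root_cause}",
            f"Failed requests: {self.failed_requests}",
            f"Revenue impact: {self.revenue_impact}",
            "Timeline:",
            *[f"- {entry}" for entry in self.timeline],
            "What went well:",
            *[f"- {item}" for item in self.went_well],
            "Areas for improvement:",
            *[f"- {item}" for item in self.improvements],
            "Action items:",
            *[f"- {a.description} (owner: {a.owner}, due: {a.due})"
              for a in self.action_items],
        ]
        return "\n".join(lines)
```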
See detailed implementation for advanced patterns.

| Issue | Cause | Solution |
|---|---|---|
| All transcriptions failing | API outage | Activate fallback queue |
| 50%+ error rate | Partial degradation | Test models, reduce features |
| Elevated latency | Overload | Increase timeouts, reduce rate |
| Single feature broken | API regression | Disable feature, report to Deepgram |

| Resource | URL |
|---|---|
| Deepgram Status | https://status.deepgram.com |
| Deepgram Console | https://console.deepgram.com |
| Support | support@deepgram.com |

| Level | Definition | Response Time |
|---|---|---|
| SEV1 | Complete outage | Immediate |
| SEV2 | Major degradation | < 15 min |
| SEV3 | Minor degradation | < 1 hour |
| SEV4 | Minor issue | < 24 hours |

| Level | Contact | When |
|---|---|---|
| L1 | On-call engineer | First response |
| L2 | Team lead | 15 min without resolution |
| L3 | Deepgram support | Confirmed Deepgram issue |
| L4 | Engineering director | SEV1 > 1 hour |
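
The escalation table above can be read as a set of independent conditions. A sketch of that reading follows; the function name and parameters are hypothetical, and it treats "15 min without resolution" as 15 minutes or more and "SEV1 > 1 hour" as strictly more than 60 minutes.

```python
from typing import List


def escalation_contacts(
    severity: str,
    minutes_elapsed: float,
    deepgram_issue_confirmed: bool = False,
) -> List[str]:
    """Contacts to engage for an unresolved incident, in escalation order."""
    contacts = ["On-call engineer"]  # L1: always the first responder
    if minutes_elapsed >= 15:
        contacts.append("Team lead")  # L2: 15 min without resolution
    if deepgram_issue_confirmed:
        contacts.append("Deepgram support")  # L3: confirmed Deepgram issue
    if severity == "SEV1" and minutes_elapsed > 60:
        contacts.append("Engineering director")  # L4: SEV1 lasting over 1 hour
    return contacts
```

For example, a SEV1 that has run for 90 minutes with a confirmed Deepgram issue engages all four levels, while a SEV2 at 5 minutes involves only the on-call engineer.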