Optimize Deepgram integration performance through audio preprocessing (16kHz mono PCM), connection pooling, model selection, streaming for large files, parallel processing, and result caching.
Preprocess audio to 16-bit PCM, mono channel, 16kHz sample rate WAV format using ffmpeg. This is optimal for Deepgram's speech models.
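A minimal sketch of that conversion step: a helper that builds the ffmpeg argument list for 16-bit PCM, mono, 16 kHz output (the `ffmpegArgs` helper and file names are illustrative, not part of any SDK):

```typescript
// Build the ffmpeg argument list that converts any input file to
// Deepgram's preferred format: 16-bit signed PCM, mono, 16 kHz WAV.
function ffmpegArgs(inputPath: string, outputPath: string): string[] {
  return [
    "-i", inputPath,
    "-acodec", "pcm_s16le", // 16-bit signed little-endian PCM
    "-ac", "1",             // mono
    "-ar", "16000",         // 16 kHz sample rate
    "-y",                   // overwrite output if it exists
    outputPath,
  ];
}

// Usage (Node): run the conversion before uploading to Deepgram, e.g.
//   import { execFile } from "node:child_process";
//   execFile("ffmpeg", ffmpegArgs("meeting.mp3", "meeting.wav"), cb);
```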
Create a pool of Deepgram clients (min 2, max 10) with acquire timeout and idle timeout. Use execute() pattern to auto-acquire and release connections.
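One way to sketch such a pool, generic over the client type (idle-timeout reaping is omitted for brevity; the factory would be whatever your Deepgram SDK setup provides):

```typescript
type Factory<T> = () => T;

// Minimal connection-pool sketch: pre-creates `min` clients, grows to
// `max`, and times out acquires when every client is busy.
class ClientPool<T> {
  private idle: T[] = [];
  private inUse = 0;
  private waiters: Array<(c: T) => void> = [];

  constructor(
    private factory: Factory<T>,
    private opts = { min: 2, max: 10, acquireTimeoutMs: 5000 },
  ) {
    for (let i = 0; i < opts.min; i++) this.idle.push(factory());
  }

  private acquire(): Promise<T> {
    if (this.idle.length > 0) {
      this.inUse++;
      return Promise.resolve(this.idle.pop()!);
    }
    if (this.inUse < this.opts.max) {
      this.inUse++;
      return Promise.resolve(this.factory());
    }
    // All clients busy: wait for a release, or fail after the timeout.
    return new Promise((resolve, reject) => {
      const timer = setTimeout(() => {
        const i = this.waiters.indexOf(handoff);
        if (i >= 0) this.waiters.splice(i, 1);
        reject(new Error("acquire timeout"));
      }, this.opts.acquireTimeoutMs);
      const handoff = (c: T) => { clearTimeout(timer); resolve(c); };
      this.waiters.push(handoff);
    });
  }

  private release(client: T): void {
    const waiter = this.waiters.shift();
    if (waiter) { waiter(client); return; } // hand off directly
    this.inUse--;
    this.idle.push(client);
  }

  // execute() auto-acquires a client and guarantees release, even on error.
  async execute<R>(fn: (client: T) => Promise<R>): Promise<R> {
    const client = await this.acquire();
    try {
      return await fn(client);
    } finally {
      this.release(client);
    }
  }
}
```

Callers only ever see `pool.execute((client) => client.transcribe(...))`, so a leaked connection requires actively working around the API.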
Choose Nova-2 for best accuracy/speed balance. Use Base model for cost-sensitive batch jobs. Match model to priority: accuracy, speed, or cost.
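A sketch of that mapping as a helper (model identifiers are assumptions based on the names above; verify them against Deepgram's current model list):

```typescript
type Priority = "accuracy" | "speed" | "cost";

// Pick a Deepgram model by job priority. Nova-2 covers both the
// accuracy and speed cases; Base trades accuracy for the lowest cost.
function pickModel(priority: Priority): string {
  switch (priority) {
    case "accuracy":
    case "speed":
      return "nova-2";
    case "cost":
      return "base";
  }
}
```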
Use the live transcription WebSocket for files over 60 seconds. Stream file data in 1MB chunks and collect the final transcripts.
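The chunking side of this is simple to sketch; in the real loop each chunk would be sent over the Deepgram live WebSocket and final transcript events collected as they arrive:

```typescript
const CHUNK_SIZE = 1024 * 1024; // 1 MB per WebSocket send

// Yield fixed-size views over the audio buffer for streaming.
function* chunked(
  data: Uint8Array,
  size: number = CHUNK_SIZE,
): Generator<Uint8Array> {
  for (let offset = 0; offset < data.length; offset += size) {
    yield data.subarray(offset, offset + size);
  }
}

// Streaming is worth the switch once the audio exceeds ~60 seconds.
function shouldStream(durationSeconds: number): boolean {
  return durationSeconds > 60;
}
```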
Use p-limit to process multiple audio files concurrently (default 5). Track per-file timing and total throughput.
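With p-limit the pattern is `const limit = pLimit(5); await Promise.all(files.map((f) => limit(() => transcribe(f))))`. If you'd rather avoid the dependency, the same worker-pool idea fits in a few lines (sketch; `fn` stands in for your transcription call):

```typescript
// Run fn over items with at most `limit` in flight, preserving order.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  // Each worker pulls the next unclaimed index until items run out.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  }
  await Promise.all(
    Array.from({ length: Math.min(limit, items.length) }, worker),
  );
  return results;
}
```

Per-file timing drops in naturally by wrapping `fn` with `performance.now()` before and after each call; total throughput is files divided by wall-clock time for the whole batch.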
Hash audio URL + options as cache key. Store in Redis with configurable TTL. Return cached results for repeated requests.
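A sketch of the key derivation, with option keys sorted so that insertion order doesn't split the cache (the `dg:transcript:` prefix is an arbitrary namespace choice):

```typescript
import { createHash } from "node:crypto";

// Build a deterministic cache key from the audio URL plus the
// transcription options, so repeated requests map to the same entry.
function cacheKey(audioUrl: string, options: Record<string, unknown>): string {
  // Canonicalize options: sorted keys => order-insensitive hashing.
  const canonical = Object.keys(options)
    .sort()
    .map((k) => `${k}=${JSON.stringify(options[k])}`)
    .join("&");
  const digest = createHash("sha256")
    .update(`${audioUrl}|${canonical}`)
    .digest("hex");
  return `dg:transcript:${digest}`;
}
```

With node-redis, storage with a configurable TTL would look roughly like `await redis.set(cacheKey(url, opts), JSON.stringify(result), { EX: ttlSeconds })`.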
See detailed implementation for advanced patterns.

| Issue | Cause | Solution |
|---|---|---|
| Slow transcription | Wrong audio format | Preprocess to 16kHz mono WAV |
| Connection exhaustion | No pooling | Use connection pool |
| High latency | Large files | Switch to streaming |
| Redundant API calls | No caching | Enable transcription cache |

| Factor | Impact | Optimization |
|---|---|---|
| Audio Format | High | 16-bit PCM, mono, 16kHz |
| File Size | High | Stream large files |
| Model Choice | High | Balance accuracy vs speed |
| Concurrency | Medium | Pool connections |
| Network Latency | Medium | Use closest region |