Deployment architectures for Retell AI voice agents at different scales. Voice AI systems require real-time processing with strict latency budgets -- architecture choices directly impact call quality.
Best for: Prototyping, < 10 concurrent calls, single agent.
```
Retell Platform -> WebSocket -> Your Webhook Server -> LLM API
                                        |
                               Local State (memory)
```
```typescript
import express from 'express';

const app = express();
app.use(express.json()); // required, or req.body is undefined

// Per-call conversation state, keyed by Retell call_id (in-memory only)
const callState = new Map();

app.post('/retell-webhook', async (req, res) => {
  const { call_id, transcript } = req.body;
  const state = callState.get(call_id) || { history: [] };
  state.history.push(transcript);
  const response = await generateResponse(state); // your LLM call
  callState.set(call_id, state);
  res.json({ response }); // Must respond in < 1 second
});
```
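The in-memory `Map` above grows without bound: state for ended calls lives until the process restarts. A minimal cleanup sketch, assuming an end-of-call webhook event is routed to `endCall` plus a periodic sweep as a safety net (`touch`, `endCall`, and `sweep` are illustrative names, not part of the Retell API):

```typescript
// In-memory call state with last-activity tracking for eviction.
const callState = new Map<string, { history: string[]; lastSeen: number }>();

// Record a turn and refresh the call's last-activity timestamp.
function touch(callId: string, transcript: string) {
  const state = callState.get(callId) ?? { history: [], lastSeen: 0 };
  state.history.push(transcript);
  state.lastSeen = Date.now();
  callState.set(callId, state);
  return state;
}

// Call this from the end-of-call webhook event.
function endCall(callId: string) {
  callState.delete(callId);
}

// Safety net: evict calls idle longer than maxIdleMs (run on an interval).
function sweep(maxIdleMs = 15 * 60 * 1000, now = Date.now()) {
  for (const [id, state] of callState) {
    if (now - state.lastSeen > maxIdleMs) callState.delete(id);
  }
}
```

The sweep covers the case where the end-of-call event never arrives (dropped webhook, crashed upstream), which otherwise leaks state.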
Best for: 10-100 concurrent calls, multiple agents, production.
```
Retell Platform -> Load Balancer -> Webhook Server 1
                                 -> Webhook Server 2
                                 -> Webhook Server 3
                                          |
                                 Redis (shared state)
                                          |
                                   LLM API (cached)
```
```typescript
class DistributedCallHandler {
  constructor(private redis: Redis, private llm: LLMClient) {}

  async handleTurn(callId: string, transcript: string) {
    const state = await this.redis.get(`call:${callId}`);
    const context = JSON.parse(state || '{"history":[]}');
    context.history.push(transcript);

    // Cache common responses for < 100ms latency
    const cacheKey = `response:${this.hash(transcript)}`;
    let response = await this.redis.get(cacheKey);
    if (!response) {
      response = await this.llm.generate(context);
      await this.redis.setex(cacheKey, 3600, response); // TTL: 1 hour
    }

    await this.redis.setex(`call:${callId}`, 3600, JSON.stringify(context)); // TTL: 1 hour
    return response;
  }
}
```
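The handler above assumes a `hash()` helper for cache keys. One possible sketch (the normalization rules are an assumption, not prescribed by the handler): normalize the transcript first so trivial variations such as "Yes." and "yes" hit the same cached response, then take a short digest with Node's built-in `crypto` module.

```typescript
import { createHash } from 'crypto';

// Hypothetical hash() helper: normalize, then digest.
// Normalization: trim, lowercase, strip punctuation, collapse whitespace.
function hashTranscript(transcript: string): string {
  const normalized = transcript
    .trim()
    .toLowerCase()
    .replace(/[^\w\s]/g, '')
    .replace(/\s+/g, ' ');
  // 16 hex chars keeps Redis keys short while staying collision-resistant
  // at this cache's scale.
  return createHash('sha256').update(normalized).digest('hex').slice(0, 16);
}
```

Over-aggressive normalization risks serving a cached answer to a semantically different utterance, so tune these rules to your domain.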
Best for: 100+ concurrent calls, complex flows, analytics.
```
Retell Platform -> API Gateway -> Webhook Service -> Redis (state)
                                        |
                                Event Bus (Kafka)
                                        |
                      +-----------------+----------------+
                      |                 |                |
                 Analytics        Transcription      Escalation
                  Service            Archive           Handler
```
```typescript
class VoicePipeline {
  async handleCall(event: RetellEvent) {
    // Fast response path (< 500ms budget)
    const response = await this.generateFast(event);

    // Async: emit events for downstream processing
    await this.eventBus.emit('call.turn', {
      callId: event.call_id,
      transcript: event.transcript,
      response: response,
    });

    return response;
  }
}
```
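The fan-out contract behind `eventBus` can be sketched in-process: one `call.turn` event reaches every subscribed consumer (analytics, transcription archive, escalation) independently. In production the bus would be Kafka with consumer groups; this minimal `EventBus` is an illustration of the pattern, not a production substitute.

```typescript
type Handler = (payload: unknown) => void;

// Minimal in-process event bus: topic -> list of subscribers.
class EventBus {
  private handlers = new Map<string, Handler[]>();

  on(topic: string, handler: Handler) {
    const list = this.handlers.get(topic) ?? [];
    list.push(handler);
    this.handlers.set(topic, list);
  }

  // Deliver the payload to every subscriber of the topic.
  emit(topic: string, payload: unknown) {
    for (const handler of this.handlers.get(topic) ?? []) handler(payload);
  }
}
```

The key property the architecture relies on is decoupling: the webhook path returns as soon as the event is published, and slow consumers (e.g. the transcription archive) never add latency to the live call.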
| Factor | Single Server | Distributed | Event-Driven |
|---|---|---|---|
| Concurrent Calls | < 10 | 10-100 | 100+ |
| Latency Budget | 800ms | 500ms | 300ms |
| State | In-memory | Redis | Redis + Events |
| Scaling | Vertical | Horizontal | Auto-scaling |

| Issue | Cause | Solution |
|---|---|---|
| Calls drop under load | Single server bottleneck | Scale to distributed architecture |
| Lost call state | Server restart | Move state to Redis |
| High latency | LLM response too slow | Pre-cache common responses |
Basic usage: start with the single-server architecture and default configuration for prototypes and low-volume agents; it keeps the feedback loop short.
Advanced scenario: move to the distributed or event-driven architectures for production, where concurrency limits, latency budgets, and team-specific requirements (analytics, transcript archiving, escalation) drive the extra moving parts.