Incident response procedures for Lindy AI agent failures. Covers platform outages, individual agent failures, integration breakdowns, credit exhaustion, and webhook endpoint failures.
| Severity | Description | Response Time | Examples |
|---|---|---|---|
| SEV1 | All agents failing, customer impact | 15 minutes | Lindy platform outage, all webhooks failing |
| SEV2 | Critical agent down | 30 minutes | Support bot offline, phone agent unreachable |
| SEV3 | Degraded performance | 2 hours | High latency, intermittent failures |
| SEV4 | Minor issue | 24 hours | Non-critical agent misconfigured |
# Is Lindy up?
curl -s -o /dev/null -w "Lindy API: HTTP %{http_code}\n" \
"https://public.lindy.ai" --max-time 5
# Check status page
echo "Status page: https://status.lindy.ai"
# Is your webhook receiver up?
curl -s -o /dev/null -w "Our endpoint: HTTP %{http_code}\n" \
"https://api.yourapp.com/health" --max-time 5
# Is the webhook auth working?
curl -s -o /dev/null -w "Webhook auth: HTTP %{http_code}\n" \
-X POST "https://api.yourapp.com/lindy/callback" \
-H "Authorization: Bearer $LINDY_WEBHOOK_SECRET" \
-H "Content-Type: application/json" \
-d '{"test": true}' --max-time 5
Log in at https://app.lindy.ai > Settings > Billing
Symptoms: All agents failing, status.lindy.ai shows incident Impact: All Lindy-dependent workflows halted
Runbook:
Fallback code:
async function triggerLindyWithFallback(payload: any) {
try {
const response = await fetch(WEBHOOK_URL, {
method: 'POST',
headers: {
'Authorization': `Bearer ${SECRET}`,
'Content-Type': 'application/json',
},
body: JSON.stringify(payload),
signal: AbortSignal.timeout(10000), // 10s timeout
});
if (!response.ok) throw new Error(`HTTP ${response.status}`);
return { routed: 'lindy' };
} catch (error) {
console.error('Lindy unreachable, activating fallback:', error);
await queueForReplay(payload); // Store for later
await notifyTeam(`Lindy trigger failed: ${error}`);
return { routed: 'fallback' };
}
}
Symptoms: Specific agent tasks showing "Failed" status Impact: One workflow affected, others may be fine
Runbook:
Symptoms: Actions failing with "Not authorized" or "Token expired" Impact: All tasks using that integration fail
Runbook:
Symptoms: Agents stop running, no new tasks created Impact: All agents paused until credits refill
Runbook:
Symptoms: Lindy agent runs but your callback never receives data Impact: Agent completes but results are lost
Runbook:
curl -s https://api.yourapp.com/health
| Level | Contact | When |
|---|---|---|
| L1 | On-call engineer | Initial response, diagnostics |
| L2 | Engineering lead | After 30 min SEV1, 1 hour SEV2 |
| L3 | VP Engineering | After 1 hour SEV1 |
| Lindy Support | support@lindy.ai | Confirmed Lindy platform issue |
## Incident Report
**Date**: YYYY-MM-DD
**Severity**: SEV[1-4]
**Duration**: [start time] to [end time] ([total minutes])
**Impact**: [what was affected, customer impact]
### Timeline
- HH:MM — Issue detected via [monitoring/user report]
- HH:MM — On-call paged, diagnostics started
- HH:MM — Root cause identified: [cause]
- HH:MM — Fix applied: [what was done]
- HH:MM — Service restored, monitoring confirmed
### Root Cause
[Technical description of what failed and why]
### Resolution
[What was done to fix it]
### Prevention
- [ ] [Action item 1]
- [ ] [Action item 2]
- [ ] [Action item 3]
| Incident Type | Detection | Automated Response |
|---|---|---|
| Platform outage | Health check fails | Queue events, notify team |
| Agent failure | Task Completed trigger | Slack alert to #ops |
| Auth expiry | Action step fails | Alert + re-auth link |
| Credit exhaustion | Billing check | Pause non-critical agents |
| Endpoint down | Health check | Redirect to fallback |
Proceed to lindy-data-handling for data security and compliance.