Step-by-step procedures for responding to MaintainX integration incidents, from detection through resolution and post-mortem.
| Severity | Definition | Response Time |
|---|---|---|
| SEV-1 | Complete integration failure, no work orders processing | 15 min |
| SEV-2 | Partial failure, some endpoints degraded | 1 hour |
| SEV-3 | Performance degradation, slow responses | 4 hours |
| SEV-4 | Non-critical feature broken, workaround available | Next business day |
#!/bin/bash
echo "=== MaintainX Incident Triage ==="
echo "Time: $(date -u)"
# Check MaintainX API status
echo -e "\n--- API Health ---"
for endpoint in users workorders assets locations; do
CODE=$(curl -s -o /dev/null -w "%{http_code}" \
"https://api.getmaintainx.com/v1/$endpoint?limit=1" \
-H "Authorization: Bearer $MAINTAINX_API_KEY")
echo " /$endpoint: HTTP $CODE"
done
# Check your integration service
echo -e "\n--- Integration Service ---"
curl -s http://localhost:3000/health | jq . 2>/dev/null || echo " Service unreachable"
# Check recent error logs
echo -e "\n--- Recent Errors (last 10 min) ---"
# Adjust for your log system:
# journalctl -u maintainx-sync --since "10 min ago" --no-pager | grep -i error | tail -10
| Symptom | Likely Cause | Check |
|---|---|---|
| All endpoints return 401 | API key expired | echo ${#MAINTAINX_API_KEY} and test with curl |
| All endpoints return 5xx | MaintainX platform outage | Check status.getmaintainx.com |
| 429 on all requests | Rate limit exceeded | Review request volume in last hour |
| Specific endpoint 404 | API path changed | Check MaintainX changelog |
| Timeouts | Network issue | curl -w "Total: %time_total seconds" ... |
| Your service crashes | Application error | Check container logs, OOM, disk space |
API Key Expired (SEV-1):
# Generate new key: MaintainX > Settings > Integrations > New Key
# Update in production:
# GCP Secret Manager:
echo -n "NEW_KEY_HERE" | gcloud secrets versions add maintainx-api-key --data-file=-
# Restart service to pick up new key:
gcloud run services update maintainx-integration --region us-central1 --no-traffic
Rate Limited (SEV-2):
// Immediately reduce request volume
// 1. Enable emergency rate limiting
process.env.MAINTAINX_MAX_REQUESTS_PER_SEC = '1';
// 2. Disable non-critical sync jobs
await disableScheduledJobs(['asset-sync', 'report-generator']);
// 3. Keep only critical work order processing
MaintainX Platform Outage (SEV-1):
// Switch to queue-based processing
// Buffer all outgoing requests for replay after recovery
const queue: Array<{ method: string; path: string; body: any }> = [];
function bufferRequest(method: string, path: string, body?: any) {
queue.push({ method, path, body });
console.log(`Buffered: ${method} ${path} (queue size: ${queue.length})`);
}
// When MaintainX recovers, replay buffered requests
async function replayQueue(client: MaintainXClient) {
console.log(`Replaying ${queue.length} buffered requests...`);
for (const req of queue) {
await withRetry(() => client.request(req.method, req.path, req.body));
}
queue.length = 0;
}
# Run full health check
curl -s http://localhost:3000/health | jq .
# Verify data flow
echo "Work orders created in last hour:"
curl -s "https://api.getmaintainx.com/v1/workorders?createdAtGte=$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)&limit=5" \
-H "Authorization: Bearer $MAINTAINX_API_KEY" | jq '.workOrders | length'
# Check for data gaps
echo "Checking sync state..."
cat .maintainx-sync-state.json 2>/dev/null || echo "No sync state file found"
## Incident Report Template
**Date**: YYYY-MM-DD
**Severity**: SEV-X
**Duration**: X hours Y minutes
**Impact**: [What was affected - e.g., "work order sync halted for 2 hours"]
### Timeline
- HH:MM - Alert triggered
- HH:MM - Triage started
- HH:MM - Root cause identified
- HH:MM - Mitigation applied
- HH:MM - Full recovery confirmed
### Root Cause
[Technical explanation]
### Resolution
[What was done to fix it]
### Action Items
- [ ] Implement [specific improvement]
- [ ] Add monitoring for [gap found]
- [ ] Update runbook with [lesson learned]
| Scenario | Immediate Action |
|---|---|
| Total API failure | Buffer requests, check status page, escalate |
| Intermittent 500s | Enable retry logic, reduce request rate |
| Data sync gap | Note gap window, schedule backfill after recovery |
| Webhook delivery failure | Fall back to polling, queue missed events |
For data handling patterns, see maintainx-data-handling.
Automated alerting on integration health:
// Check health every 5 minutes, alert on failure
import cron from 'node-cron';
cron.schedule('*/5 * * * *', async () => {
try {
const res = await fetch('http://localhost:3000/health');
const health = await res.json();
if (health.status !== 'healthy') {
await sendPagerDutyAlert({
severity: 'critical',
summary: `MaintainX integration degraded: ${JSON.stringify(health.checks)}`,
});
}
} catch {
await sendPagerDutyAlert({
severity: 'critical',
summary: 'MaintainX integration service unreachable',
});
}
});