Run controlled chaos engineering experiments to test system resilience, fault tolerance, and recovery capabilities. Inject failures such as network latency, service crashes, resource exhaustion, and dependency outages to verify that systems degrade gracefully and recover automatically.

| Error | Cause | Solution |
|---|---|---|
| Experiment caused production outage | Blast radius larger than expected or missing safeguards | Always run in staging first; reduce scope; add automatic abort triggers; require approval |
| System did not recover after experiment | Auto-healing mechanisms not configured or too slow | Add health-check-based restarts; configure auto-scaling; implement circuit breaker patterns |
| Monitoring missed the failure | Alerting thresholds too lenient or wrong metrics monitored | Tighten alert thresholds; add specific alerts for the failure mode tested; verify alert channels |
| Chaos tool cannot access target | Network segmentation or security policies blocking the tool | Deploy chaos agent inside the target network; add security group rules for the chaos controller |
| Data corruption persists after rollback | Stateful failure injection without transaction protection | Use read-only chaos first; snapshot databases before stateful experiments; implement compensating transactions |
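
The "automatic abort triggers" safeguard from the table can be sketched as a small guard loop. This is only an illustration under stated assumptions: `run_with_abort_guard`, the probe command, and the rollback command are hypothetical names supplied by the caller, not part of any chaos tool.

```shell
#!/bin/bash
# Sketch of an automatic abort trigger. All names here are illustrative
# assumptions, not part of any chaos tool.
#
# run_with_abort_guard MAX_FAILS CHECKS INTERVAL PROBE_CMD ROLLBACK_CMD
#   Runs PROBE_CMD up to CHECKS times, INTERVAL seconds apart. If it fails
#   MAX_FAILS times in a row, runs ROLLBACK_CMD and returns 1.
run_with_abort_guard() {
  local max_fails=$1 checks=$2 interval=$3 probe_cmd=$4 rollback_cmd=$5
  local fails=0 i
  for ((i = 1; i <= checks; i++)); do
    if $probe_cmd; then
      fails=0                       # a healthy probe resets the streak
    else
      fails=$((fails + 1))
      echo "guard: probe failed ($fails/$max_fails)" >&2
    fi
    if ((fails >= max_fails)); then
      echo "guard: aborting experiment and rolling back" >&2
      $rollback_cmd
      return 1
    fi
    sleep "$interval"
  done
  return 0
}
```

A call such as `run_with_abort_guard 3 60 1 probe_health remove_toxic` would watch a one-minute experiment and roll back after three consecutive failed probes, where `probe_health` and `remove_toxic` are functions you define for your own system.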
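
toxiproxy network latency injection:

set -euo pipefail
# Create a proxy in front of the database connection
# (listens on 15432 and forwards to PostgreSQL on postgres-host:5432).
# Flags are placed before the proxy name, an order that older and newer toxiproxy-cli releases both accept.
toxiproxy-cli create -l 0.0.0.0:15432 -u postgres-host:5432 postgres_proxy
# Inject 500 ms latency with 100 ms jitter on the downstream direction
toxiproxy-cli toxic add -t latency -n latency_downstream -a latency=500 -a jitter=100 postgres_proxy
# Remove the toxic on exit, even if the tests fail and set -e aborts the script
trap 'toxiproxy-cli toxic remove -n latency_downstream postgres_proxy' EXIT
# Run tests while latency is active (the application must connect through port 15432)
npm test -- --grep "handles slow database"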
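
Before trusting the test results, it can help to confirm that the latency toxic is actually in effect. The following is a minimal sketch, assuming GNU `date` (for `%N`) and using hypothetical helper names `elapsed_ms` and `assert_min_latency_ms`.

```shell
#!/bin/bash
# elapsed_ms CMD [ARGS...]
#   Runs the command and prints how long it took in milliseconds.
#   Assumes GNU date (supports %N); the helper name is illustrative.
elapsed_ms() {
  local start end
  start=$(date +%s%N)
  "$@" >/dev/null 2>&1 || true     # only the duration matters here
  end=$(date +%s%N)
  echo $(((end - start) / 1000000))
}

# assert_min_latency_ms MIN_MS CMD [ARGS...]
#   Succeeds only if the command took at least MIN_MS milliseconds.
assert_min_latency_ms() {
  local min_ms=$1 took
  shift
  took=$(elapsed_ms "$@")
  echo "took ${took} ms (expected >= ${min_ms} ms)"
  ((took >= min_ms))
}
```

With the toxic above (500 ms latency, 100 ms jitter), a query sent through the proxy should take at least 400 ms, so a check might look like `assert_min_latency_ms 400 psql -h localhost -p 15432 -U app -c 'SELECT 1'`, where the user `app` is a placeholder for your own credentials.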
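
Kubernetes pod kill experiment (Litmus Chaos):

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: api-pod-kill
spec:
  # engineState is part of the ChaosEngine spec; "active" starts the experiment
  engineState: active
  appinfo:
    appns: default
    applabel: "app=api-server"
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            # Total duration of the experiment, in seconds
            - name: TOTAL_CHAOS_DURATION
              value: "60"
            # Seconds between successive pod deletions
            - name: CHAOS_INTERVAL
              value: "10"
            # Delete pods forcefully, without a graceful termination period
            - name: FORCE
              value: "true"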
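
After applying the manifest, the outcome can be read from the ChaosResult resource that Litmus creates. The sketch below rests on several assumptions that you should check against your Litmus version: that `kubectl` is configured for the cluster, that the result is named `<engine>-<experiment>` (here `api-pod-kill-pod-delete`), and that the verdict is stored at `.status.experimentStatus.verdict`. The function name `wait_for_verdict` and the `POLL_INTERVAL` variable are illustrative.

```shell
#!/bin/bash
# wait_for_verdict RESULT_NAME NAMESPACE MAX_POLLS
#   Polls the ChaosResult until its verdict is Pass or Fail.
#   Returns 0 on Pass, 1 on Fail, 2 if no final verdict appears in time.
#   POLL_INTERVAL (seconds between polls, default 5) is a hypothetical knob.
wait_for_verdict() {
  local result=$1 ns=$2 max_polls=$3 verdict="" i
  for ((i = 0; i < max_polls; i++)); do
    # Field path is an assumption; a missing resource yields an empty verdict
    verdict=$(kubectl get chaosresult "$result" -n "$ns" \
      -o jsonpath='{.status.experimentStatus.verdict}' 2>/dev/null || true)
    case "$verdict" in
      Pass) echo "Pass"; return 0 ;;
      Fail) echo "Fail"; return 1 ;;
    esac
    sleep "${POLL_INTERVAL:-5}"
  done
  echo "Timeout (last verdict: ${verdict:-none})"
  return 2
}
```

A typical run would be `kubectl apply -f chaosengine.yaml` followed by `wait_for_verdict api-pod-kill-pod-delete default 24`, where the manifest file name is a placeholder. Any verdict other than Pass or Fail is treated as still pending.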
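
Custom chaos script (process kill and verify recovery):

#!/bin/bash
set -euo pipefail
echo "=== Chaos Experiment: API server kill ==="
echo "Hypothesis: System recovers within 30 seconds"
# Probe the health endpoint. "|| true" keeps set -e from aborting the script
# when curl itself fails during the outage (it then reports HTTP code 000).
probe() { curl -s -o /dev/null -w '%{http_code}' --max-time 2 http://app.test/health || true; }
# Record baseline and refuse to start from an unhealthy state
BASELINE=$(probe)
echo "Baseline health: $BASELINE"
if [ "$BASELINE" != "200" ]; then
  echo "ABORT: system is not healthy before the experiment" >&2
  exit 1
fi
# Kill one API instance
docker kill api-server-1
# Monitor recovery, polling roughly once per second
RECOVERED=false
for i in $(seq 1 30); do
  STATUS=$(probe)
  echo "T+${i}s: HTTP $STATUS"
  if [ "$STATUS" = "200" ]; then # HTTP 200 OK
    echo "RECOVERED at T+${i}s"
    RECOVERED=true
    break
  fi
  sleep 1
done
# Fail the experiment if the hypothesis did not hold
if [ "$RECOVERED" != "true" ]; then
  echo "FAILED: no recovery within 30 seconds" >&2
  exit 1
fi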
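
The loop counter in the script only approximates elapsed time, because each probe can itself take up to two seconds. One possible refinement, sketched here with hypothetical names (`wait_for_recovery`, `PROBE_INTERVAL`), is to measure against the shell's built-in `SECONDS` counter and take the probe as a parameter so the logic can be exercised without Docker.

```shell
#!/bin/bash
# wait_for_recovery DEADLINE_S PROBE_CMD [ARGS...]
#   Re-runs the probe until it succeeds or DEADLINE_S seconds have elapsed.
#   Prints the elapsed whole seconds on success and returns 1 on timeout.
#   The probe should be quiet on stdout. PROBE_INTERVAL (default 1 second)
#   is a hypothetical knob for the pause between attempts.
wait_for_recovery() {
  local deadline=$1
  shift
  local start=$SECONDS
  while ((SECONDS - start <= deadline)); do
    if "$@"; then
      echo $((SECONDS - start))
      return 0
    fi
    sleep "${PROBE_INTERVAL:-1}"
  done
  return 1
}
```

For the experiment above, the probe could be a function such as `probe_health() { [ "$(curl -s -o /dev/null -w '%{http_code}' --max-time 2 http://app.test/health)" = "200" ]; }`, called as `wait_for_recovery 30 probe_health`.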