Design experiments that surface real weaknesses in production systems — without becoming outages. Most "chaos engineering" attempts skip steady-state measurement, define no abort criteria, and have no blast-radius bound. This skill enforces the discipline that makes chaos experiments safe and useful.
The 4 Principles of Chaos Engineering (Netflix, 2016):
Add a fifth: Define abort criteria up front. A chaos experiment with no abort criteria is an outage by another name.
SKILL=engineering/chaos-engineering/skills/chaos-engineering
# 1. Design an experiment
python "$SKILL/scripts/experiment_designer.py" --target "checkout-svc" --hypothesis "p99 latency stays <500ms" --attack latency --duration-min 15
# 2. Calculate blast radius
python "$SKILL/scripts/blast_radius_calculator.py" --traffic-share 0.05 --user-pop 1000000 --duration-min 15
# 3. Generate postmortem after the experiment
python "$SKILL/scripts/experiment_postmortem.py" --plan experiment.json --result-log results.txt
All three scripts use only the Python standard library. Run any of them with --help to see its options.
experiment_designer.py

Generates a structured experiment plan from inputs. Enforces the required sections (hypothesis, steady-state metric, blast radius, abort criteria, rollback).
python scripts/experiment_designer.py \
--target "checkout-svc" \
--hypothesis "p99 latency stays <500ms when payment-svc is slow" \
--attack latency \
--magnitude "+200ms" \
--duration-min 15 \
--blast-radius "5% of US traffic" \
--abort-if "p99 > 1000ms OR error_rate > baseline + 1pp"
Outputs a markdown plan with: hypothesis, steady-state, attack, magnitude, duration, blast radius, abort criteria, rollback procedure, monitoring dashboards, and learning question.
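The "required sections" rule can be illustrated with a small validation sketch. This is not the script's actual implementation: the `ExperimentPlan` class, its field names, and `missing_sections` are hypothetical, chosen to mirror the sections listed above.

```python
from dataclasses import dataclass, fields


# Hypothetical plan structure; field names mirror the required sections
# listed above, not the real script's internals.
@dataclass
class ExperimentPlan:
    target: str
    hypothesis: str
    steady_state_metric: str
    attack: str
    magnitude: str
    duration_min: int
    blast_radius: str
    abort_criteria: str
    rollback: str


def missing_sections(plan: ExperimentPlan) -> list[str]:
    """Return the names of sections that are blank or invalid."""
    missing = []
    for f in fields(plan):
        value = getattr(plan, f.name)
        if isinstance(value, str) and not value.strip():
            missing.append(f.name)
    if plan.duration_min <= 0:
        missing.append("duration_min")
    return missing


plan = ExperimentPlan(
    target="checkout-svc",
    hypothesis="p99 latency stays <500ms when payment-svc is slow",
    steady_state_metric="p99 latency",
    attack="latency",
    magnitude="+200ms",
    duration_min=15,
    blast_radius="5% of US traffic",
    abort_criteria="",  # deliberately left blank to show the check
    rollback="remove the injected delay",  # illustrative value
)
print(missing_sections(plan))  # ['abort_criteria']
```

A plan with any entry in that list would be rejected before the experiment is scheduled, which is the point of enforcing the sections up front.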
blast_radius_calculator.py

Computes the blast radius of a planned experiment. Given traffic share + user population + duration, calculates expected affected users, expected error budget burn, and a risk score.
python scripts/blast_radius_calculator.py \
--traffic-share 0.05 \
--user-pop 1000000 \
--duration-min 15 \
--baseline-availability 0.999 \
--expected-impact-availability 0.95
Outputs:
Risk score: GREEN = <1% of the error budget; YELLOW = 1-10%; RED = >10%.
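The arithmetic behind such a calculation can be sketched as follows. The formulas are assumptions, not the script's documented behavior: traffic is assumed to be spread evenly across users, the error budget is assumed to cover a 30-day window, and only the availability drop below baseline is counted as burn. The function name `blast_radius` and the `budget_window_days` parameter are hypothetical.

```python
def blast_radius(traffic_share, user_pop, duration_min,
                 baseline_availability, expected_impact_availability,
                 budget_window_days=30):
    """Estimate exposed users, error-budget burn (%), and a risk colour.

    Assumptions (not taken from the real script): even traffic spread
    across users, and a 30-day error budget window by default.
    """
    if baseline_availability >= 1:
        raise ValueError("baseline availability must be below 1.0")

    exposed_users = round(traffic_share * user_pop)

    # Budget: minutes of full unavailability allowed in the window.
    window_min = budget_window_days * 24 * 60
    budget_min = (1 - baseline_availability) * window_min

    # Burn: availability drop below baseline, scaled by the share of
    # traffic exposed and by the experiment duration.
    drop = baseline_availability - expected_impact_availability
    burn_min = traffic_share * drop * duration_min
    burn_pct = 100 * burn_min / budget_min

    if burn_pct < 1:
        risk = "GREEN"
    elif burn_pct <= 10:
        risk = "YELLOW"
    else:
        risk = "RED"
    return exposed_users, burn_pct, risk


users, burn_pct, risk = blast_radius(0.05, 1_000_000, 15, 0.999, 0.95)
print(users, f"{burn_pct:.3f}%", risk)  # 50000 0.085% GREEN
```

Under these assumptions, the example inputs expose 50,000 users and burn well under 1% of a monthly budget, so the experiment would score GREEN.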
experiment_postmortem.py

Produces a structured postmortem from an experiment plan + results. Catches the common postmortem failure modes: no learning recorded, no follow-up actions, blame-laden language.
python scripts/experiment_postmortem.py --plan experiment.json --result-log results.txt
Outputs markdown with: summary, hypothesis (was it confirmed/refuted?), what we learned, what surprised us, follow-up actions with owners, and link to next experiment.
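The three failure modes named above lend themselves to simple automated checks. The sketch below is illustrative only: `postmortem_problems`, the input shapes, and the `BLAME_PHRASES` list are all hypothetical and do not describe how the script actually works.

```python
# Illustrative phrase list only; a real check would need a curated one.
BLAME_PHRASES = ["fault of", "should have known", "careless"]


def postmortem_problems(learnings, follow_ups, text):
    """Return a list of problems found in a draft postmortem.

    learnings: list of strings; follow_ups: list of dicts with
    'title' and 'owner' keys; text: the full draft as a string.
    """
    problems = []
    if not learnings:
        problems.append("no learning recorded")
    if not follow_ups:
        problems.append("no follow-up actions")
    for action in follow_ups:
        if not action.get("owner"):
            problems.append(f"follow-up without owner: {action['title']}")
    lowered = text.lower()
    for phrase in BLAME_PHRASES:
        if phrase in lowered:
            problems.append(f"blame-laden language: '{phrase}'")
    return problems


draft_problems = postmortem_problems(
    learnings=["Retries amplified the injected delay"],
    follow_ups=[{"title": "Add jitter to retries", "owner": ""}],
    text="The slowdown was the fault of the payments team.",
)
for problem in draft_problems:
    print(problem)
```

Substring matching is crude and will miss most blame-laden phrasing; it is shown here only to make the category of check concrete.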
Different attacks reveal different weaknesses. See references/attack_taxonomy.md for full detail.
| Attack | What it tests | Tooling |
|---|---|---|
| Latency | Timeouts, retries, circuit breakers | tc, Chaos Mesh NetworkChaos |
| Error | Error handling, fallback paths | Chaos Mesh HTTPChaos, Toxiproxy |
| Resource (CPU, memory, disk) | Saturation handling, autoscaling | Chaos Mesh StressChaos, stress-ng |
| Network partition | Split-brain, consensus, failover | Chaos Mesh NetworkChaos partition |
| Dependency failure | Graceful degradation, fallback | Service mesh fault injection |
| Time | Clock skew, NTP issues | libfaketime, Chaos Mesh TimeChaos |
| Infrastructure (kill instance) | Auto-recovery, failover | AWS FIS, Chaos Monkey |
Pick the attack that matches the hypothesis. "What happens if X is slow?" → latency. "What happens if X loses network?" → partition.
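To make the latency row concrete, here is a minimal in-process sketch of what a latency attack exercises: a caller with a timeout and a fallback. It does not use any of the tools in the table; `with_latency`, `call_with_timeout`, and `get_price` are hypothetical names, and real experiments would inject the delay at the network or proxy layer instead.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def with_latency(func, delay_s):
    """Wrap func so every call is delayed by delay_s seconds (the attack)."""
    def delayed(*args, **kwargs):
        time.sleep(delay_s)
        return func(*args, **kwargs)
    return delayed


def call_with_timeout(func, timeout_s, fallback):
    """Call func; return fallback if no result arrives within timeout_s."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(func)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        return fallback
    finally:
        # Do not block on the slow call; let the worker finish on its own.
        pool.shutdown(wait=False)


def get_price():
    """Hypothetical dependency call."""
    return "live-price"


healthy = call_with_timeout(get_price, timeout_s=0.5, fallback="cached-price")
slow = call_with_timeout(with_latency(get_price, 0.2), timeout_s=0.05,
                         fallback="cached-price")
print(healthy, slow)  # live-price cached-price
```

The hypothesis being tested is that the caller degrades to the fallback instead of hanging when the dependency slows down.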
| Tool | Best for | Pricing | Stack |
|---|---|---|---|
| Chaos Toolkit | Lightweight, language-agnostic, JSON experiments | OSS | Any |
| Chaos Mesh | Kubernetes-native, rich CRDs, in-cluster | OSS | Kubernetes |
| Litmus | Kubernetes, Argo-integrated, large library | OSS + Enterprise | Kubernetes |
| Gremlin | Enterprise SaaS, multi-cloud, audit | Paid | Any |
| AWS FIS | AWS-native, IAM-integrated, EC2/ECS/EKS | Paid (AWS) | AWS |
| Custom | Niche needs, single-cloud, low budget | None | Any |
Decision rules:
See references/tooling_landscape.md for trade-offs.
1. State a hypothesis: "When [fault], steady-state metric X stays within Y."
2. Identify the steady-state metric — must be measurable BEFORE the experiment.
3. Run blast_radius_calculator.py — confirm GREEN before proceeding.
4. Run experiment_designer.py to produce the plan.
5. Get a peer review of the plan; confirm abort criteria are concrete.
6. Notify the on-call team in #incidents (or your team's equivalent channel).
7. Run the experiment with monitoring open.
8. If abort criteria are hit, abort immediately; record what happened.
9. Run experiment_postmortem.py to capture learnings.
10. File follow-up actions; link to next experiment.
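Steps 7 and 8 of the workflow above amount to checking each metric sample against the abort criteria and stopping at the first breach. A minimal sketch, using the example criteria "p99 > 1000ms OR error_rate > baseline + 1pp"; the `Sample` class and both function names are hypothetical, and the samples are made-up values standing in for a real metrics feed.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    p99_ms: float
    error_rate: float  # fraction of requests, e.g. 0.002 = 0.2%


def should_abort(sample, baseline_error_rate,
                 p99_limit_ms=1000.0, error_margin=0.01):
    """Evaluate 'p99 > 1000ms OR error_rate > baseline + 1pp'."""
    return (sample.p99_ms > p99_limit_ms
            or sample.error_rate > baseline_error_rate + error_margin)


def run_experiment(samples, baseline_error_rate):
    """Consume samples in order and stop at the first abort condition.

    Returns (aborted, samples_seen).
    """
    seen = 0
    for sample in samples:
        seen += 1
        if should_abort(sample, baseline_error_rate):
            return True, seen
    return False, seen


# Made-up samples; the third breaches the p99 limit.
samples = [Sample(420, 0.002), Sample(610, 0.004),
           Sample(1150, 0.006), Sample(480, 0.002)]
print(run_experiment(samples, baseline_error_rate=0.002))  # (True, 3)
```

Writing the criteria as a function before the experiment starts forces them to be concrete, which is what the peer review in step 5 is meant to confirm.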
1. Pick a scenario (e.g., "primary database fails over").
2. Identify all dependent services that should keep working.
3. Build a multi-experiment plan covering each layer.
4. Schedule with stakeholders; on-call coverage required.
5. Run with a facilitator who manages the scenario.
6. Capture observations in a shared doc as they happen.
7. Write a single combined postmortem covering all observations.
8. Track follow-up actions in a board with owners.
1. Start: weekly Game Day in staging.
2. Move to: weekly Game Day in production with limited blast radius.
3. Mature to: continuous chaos via scheduled experiments (Litmus chaos schedule, Gremlin scenarios).
4. Wire to deployment: every prod deploy triggers a baseline chaos sweep.
5. Track: experiments per week, weaknesses discovered, MTTR trend.
This skill explicitly composes with three others in this library:
| Skill | Composition |
|---|---|
| feature-flags-architect | Kill switches defined there are the abort triggers here |
| kubernetes-operator | Operators are common chaos targets (test reconcile under fault) |
| incident-response | Chaos experiments that escalate become incidents |
- references/chaos_principles.md: the 4 principles, history, when to start
- references/experiment_design.md: hypothesis structure, steady-state metrics, abort criteria
- references/attack_taxonomy.md: 7 attack types with examples and tooling
- references/tooling_landscape.md: Chaos Toolkit / Mesh / Litmus / Gremlin / FIS / DIY

- /chaos-experiment: interactive experiment design wizard that runs all 3 tools.
- assets/experiment_template.md: fill-in plan template
- assets/postmortem_template.md: structured postmortem template

A team using this skill should achieve: