Define SLOs that mean something. Most "SLOs" in the wild are arbitrary numbers no one believes — 99.9% on every endpoint, no SLI definition, no error budget, no policy for what happens when budget burns. This skill enforces the discipline from Google's SRE Workbook: pick the right SLI, set a target users actually care about, calculate the error budget, wire multi-window burn-rate alerts, and have a written policy for when budget runs out.
observability-designer
performance-profiler
incident-response
SLI ⟶ measurable signal of user-perceived health (e.g., HTTP 2xx rate)
SLO ⟶ target for the SLI over a window (e.g., 99.9% over 30 days)
SLA ⟶ customer-facing commitment with consequences (separate concern)
EB ⟶ error budget: 100% − SLO target = how much "bad" you can spend
BR ⟶ burn rate: how fast you're consuming the error budget
The four cardinal mistakes:
The 3 tools below catch each of these.
SKILL=engineering/slo-architect/skills/slo-architect
# 1. Design an SLO
python "$SKILL/scripts/slo_designer.py" \
--service checkout-svc \
--sli-type request-success-rate \
--target 99.9 \
--window-days 30
# 2. Compute error budget + multi-window burn-rate alerts
python "$SKILL/scripts/error_budget_calculator.py" \
--target 99.9 --window-days 30
# 3. Review existing SLO definitions for common bugs
python "$SKILL/scripts/slo_review.py" --slo-doc docs/slos/
All three scripts use only the Python standard library.

slo_designer.py

Generates a structured SLO definition with required fields. Refuses to render if any required field is missing (exit 1).
python scripts/slo_designer.py \
--service checkout-svc \
--sli-type request-success-rate \
--target 99.9 \
--window-days 30 \
--owner team-checkout
SLI types supported:
request-success-rate — (total_requests - bad_requests) / total_requests
request-latency — count(requests < threshold) / total_requests
availability-time — (window - downtime) / window
data-freshness — count(data_age < threshold) / total_data_points
correctness — count(correct_outputs) / total_outputs
Output is markdown by default with all required fields filled or marked <must define>. JSON output (--format json) is consumed by slo_review.py.
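As a rough illustration of the first two ratio definitions above, the following sketch computes them over an in-memory sample. The `Request` record, the function names, and the choice to count only 5xx responses as bad are assumptions made for this example; they are not taken from slo_designer.py.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code
    latency_ms: float  # observed response time in milliseconds


def request_success_rate(requests: list[Request]) -> float:
    """(total_requests - bad_requests) / total_requests.

    Assumption: a request is "bad" when the status is 5xx.
    """
    if not requests:
        raise ValueError("no requests in window")
    bad = sum(1 for r in requests if r.status >= 500)
    return (len(requests) - bad) / len(requests)


def request_latency_sli(requests: list[Request], threshold_ms: float) -> float:
    """count(requests < threshold) / total_requests."""
    if not requests:
        raise ValueError("no requests in window")
    fast = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return fast / len(requests)


sample = [
    Request(200, 120.0),
    Request(200, 480.0),
    Request(503, 30.0),
    Request(201, 90.0),
]
print(request_success_rate(sample))        # 0.75
print(request_latency_sli(sample, 300.0))  # 0.75
```

Note that the latency SLI counts individual requests under the threshold; it does not compare an aggregate percentile to the threshold.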
error_budget_calculator.py

Given a target availability and a window, computes the error budget and multi-window burn-rate alerts.
python scripts/error_budget_calculator.py --target 99.9 --window-days 30
python scripts/error_budget_calculator.py --target 99.95 --window-days 7 --format json
slo_review.py

Audits a directory of SLO definitions (markdown or JSON) for the common bugs.
python scripts/slo_review.py --slo-doc docs/slos/
Checks:
- target_too_high: target ≥ 99.99% (sustainable only with massive engineering investment)
- target_too_low: target ≤ 99.0% (probably wrong SLI; users will notice)
- window_too_short: window < 7 days (statistical noise dominates)
- window_too_long: window > 90 days (slow feedback)
- no_sli_definition: SLI section missing or vague ("everything OK")
- no_error_budget_policy: no documented action when budget burns
- cpu_as_sli: CPU/memory used as user-experience proxy (wrong signal)

| User experience | SLI type | What you measure |
|---|---|---|
| "Did the request succeed?" | request-success-rate | 2xx / total |
| "Was the response fast?" | request-latency | count(requests < threshold) / total |
| "Was the service up?" | availability-time | (window - downtime) / window |
| "Is the data current?" | data-freshness | count(data_age < threshold) / total |
| "Was the answer correct?" | correctness | count(correct) / total |
See references/sli_design.md for examples and anti-patterns.
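The review checks that slo_review.py reports (target_too_high, window_too_short, cpu_as_sli, and so on) could be approximated along the following lines. This is a minimal sketch, not the script's actual logic: the dictionary keys (`target`, `window_days`, `sli`, `error_budget_policy`) and the keyword match for CPU/memory are hypothetical.

```python
def review_slo(slo: dict) -> list[str]:
    """Return the names of the checks that a single SLO definition fails."""
    findings = []

    target = slo.get("target")
    if target is not None:
        if target >= 99.99:
            findings.append("target_too_high")
        if target <= 99.0:
            findings.append("target_too_low")

    window = slo.get("window_days")
    if window is not None:
        if window < 7:
            findings.append("window_too_short")
        if window > 90:
            findings.append("window_too_long")

    sli = (slo.get("sli") or "").strip().lower()
    if not sli:
        findings.append("no_sli_definition")
    elif "cpu" in sli or "memory" in sli:
        # Crude keyword match, assumed here only for illustration.
        findings.append("cpu_as_sli")

    if not slo.get("error_budget_policy"):
        findings.append("no_error_budget_policy")

    return findings


bad = {"target": 99.999, "window_days": 3,
       "sli": "CPU utilization below 80%", "error_budget_policy": ""}
good = {"target": 99.9, "window_days": 28,
        "sli": "2xx responses / total responses",
        "error_budget_policy": "freeze deploys"}

print(review_slo(bad))
print(review_slo(good))  # []
```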
For a 99.9% SLO over 30 days (43,200 minutes):
Error budget: 0.1% × 30 × 24 × 60 = 43.2 minutes
Fast burn (2% of budget in 1 hour): 2% × 43,200 / 60 = 14.4× burn rate
Slow burn (5% of budget in 6 hours): 5% × 43,200 / 360 = 6× burn rate
error_budget_calculator.py does this math for you and emits ready-to-paste alert rules.
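For readers who want to check the arithmetic by hand, here is a small sketch under the SRE Workbook convention that a burn-rate threshold equals the fraction of budget consumed multiplied by the SLO window and divided by the alert window. The function names are hypothetical and are not the internals of error_budget_calculator.py.

```python
def error_budget_minutes(target_pct: float, window_days: int) -> float:
    """Minutes of "bad" allowed in the window: (100% - target) x window."""
    budget_fraction = (100.0 - target_pct) / 100.0
    return budget_fraction * window_days * 24 * 60


def burn_rate_threshold(budget_fraction_consumed: float,
                        alert_window_minutes: int,
                        window_days: int) -> float:
    """Burn rate that consumes the given budget fraction within the alert window."""
    window_minutes = window_days * 24 * 60
    return budget_fraction_consumed * window_minutes / alert_window_minutes


print(round(error_budget_minutes(99.9, 30), 2))       # 43.2
print(round(burn_rate_threshold(0.02, 60, 30), 2))    # 14.4
print(round(burn_rate_threshold(0.05, 360, 30), 2))   # 6.0
```

A burn rate of 1 means the budget is spent exactly at the end of the window; 14.4 means it would be gone in roughly two days.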
This skill explicitly composes with three others:
| Skill | Composition |
|---|---|
| feature-flags-architect | Rollout abort criteria reference SLO burn-rate thresholds |
| chaos-engineering | Blast-radius calculator already takes monthly error budget as input; define it here |
| kubernetes-operator | Operator capability L4 (Deep Insights) requires SLOs + Prometheus rules |
The error_budget_calculator.py output is in the same shape chaos-engineering/scripts/blast_radius_calculator.py expects on stdin.
1. Pick the user journey to protect (e.g., "checkout completion").
2. Choose SLI type (request-success-rate, latency, availability, freshness, correctness).
3. Define the SLI precisely: numerator/denominator with concrete labels.
4. Pick a target by measuring 30 days of historical SLI value:
target = floor(p50 of last 30 days × 100) / 100
This avoids targets the system has never sustained.
5. Pick a window (28 days = 4 calendar weeks, recommended).
6. Run slo_designer.py to render the SLO definition.
7. Run error_budget_calculator.py to get burn-rate alerts.
8. Write the error budget policy (what happens when budget burns).
9. Run slo_review.py — must pass before the SLO is "live".
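Step 4 of the list above can be sketched as follows, assuming the historical SLI is available as one percentage value per day. The function name and the sample values are made up for illustration.

```python
import math
import statistics


def suggest_target(daily_sli_pct: list[float]) -> float:
    """floor(p50 of the daily SLI values x 100) / 100, in percent."""
    if not daily_sli_pct:
        raise ValueError("need at least one day of history")
    p50 = statistics.median(daily_sli_pct)
    # Round down to two decimals so the target never exceeds the observed median.
    return math.floor(p50 * 100) / 100


# Hypothetical daily SLI values (percent) for a short history.
history = [99.97, 99.91, 99.956, 99.88, 99.99]
print(suggest_target(history))  # 99.95
```

Rounding down rather than to nearest keeps the suggested target at or below what the service has actually delivered.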
1. For every active SLO, run slo_review.py — fix any FAIL findings.
2. Look at last quarter's data:
- Was the SLO too easy (never burned budget)? Tighten target.
- Was it too hard (frequently burned)? Loosen target OR fix the system.
- Did burn-rate alerts fire usefully (not too noisy, not too late)? Adjust thresholds.
3. Audit error budget policies — were they actually followed when budget burned?
4. Commit revised SLOs; archive old versions with date stamps.
1. New deploy starts burning error budget faster than baseline.
2. Burn-rate alert fires (from error_budget_calculator.py thresholds).
3. Auto-rollback via feature flag (kill switch from feature-flags-architect).
4. Postmortem feeds into next SLO revision.
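The rollback decision in steps 2 and 3 above might look like the sketch below. It assumes the multi-window pattern from the SRE Workbook, where an alert fires only when both a long window (for example 1 hour) and a short window (for example 5 minutes) exceed the same burn-rate threshold. The function names and the default threshold of 14.4 are illustrative; the actual kill switch belongs to feature-flags-architect and is not shown.

```python
def burn_rate(error_ratio: float, target_pct: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    budget_fraction = (100.0 - target_pct) / 100.0
    return error_ratio / budget_fraction


def should_roll_back(long_window_error_ratio: float,
                     short_window_error_ratio: float,
                     target_pct: float,
                     threshold: float = 14.4) -> bool:
    """Fire only if both windows burn faster than the threshold.

    The short window stops the alert from firing after the problem
    has already ended; the long window filters brief spikes.
    """
    return (burn_rate(long_window_error_ratio, target_pct) >= threshold
            and burn_rate(short_window_error_ratio, target_pct) >= threshold)


# 2% errors over the last hour and 3% over the last 5 minutes, 99.9% target.
print(should_roll_back(0.02, 0.03, 99.9))   # True
# The hour still looks bad, but the last 5 minutes have recovered.
print(should_roll_back(0.02, 0.001, 99.9))  # False
```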
references/slo_principles.md: SLI vs SLO vs SLA, Google SRE Workbook canon
references/sli_design.md: picking the right SLI; 5 types with examples
references/error_budget.md: error budget math, burn-rate alerts, budget policy
references/composition.md: how SLOs feed feature flags, chaos, operators

/slo-design: interactive SLO design wizard that runs all 3 tools.

assets/slo_template.yaml: fillable SLO YAML
assets/error_budget_policy.md: fillable policy template

A team using this skill should achieve:
slo_review.py with 0 FAIL findings