Design experiments that produce trustworthy results — not just directional signals. Every test output includes hypothesis, success metrics, sample size, duration, and a results interpretation guide.
Ask the user for these if not provided:
Before running any test, confirm:
"We believe that [change] will cause [primary metric] to [increase/decrease] by [X%] for [user segment], because [rationale based on data or insight]."
Never run a test without a directional hypothesis. "Let's just see what happens" is not a hypothesis.
Use this formula (provide the output, not the formula, to the user):
For common scenarios, provide pre-calculated estimates:
| Baseline Rate | MDE (Relative) | Required Sample per Variant |
|---|---|---|
| 5% | 20% | ~19,000 |
| 10% | 15% | ~14,000 |
| 20% | 10% | ~15,000 |
| 40% | 10% | ~9,500 |
| 60% | 5% | ~42,000 |
Always warn: "These are estimates. Use a tool like Evan Miller's calculator or Statsig for precision."
Minimum: 2 full weeks (to capture weekly seasonality) Maximum: 4 weeks (novelty effect distorts results beyond this)
Duration = Required sample ÷ (Daily traffic × % exposed)
Flag if traffic is too low to reach significance in under 8 weeks — recommend a different approach (e.g., holdout test, qualitative research).
Hypothesis:
[Filled hypothesis template]
Variants:
Primary Metric: [Metric name + how measured] Guardrail Metrics: [Metrics that must not degrade]
Target Segment: [Who sees the test — % of traffic, user type] Traffic Split: [50/50 recommended unless ramp-up needed]
Sample Size Required: ~[N] users per variant Estimated Duration: [X] weeks (based on [Y] daily eligible users) Significance Threshold: 95% confidence, 80% power
Exclusions: [Any user segments to exclude and why]
Rollback Trigger: If [guardrail metric] degrades by [X%], stop the test immediately.
Results Interpretation Guide: