A/B Test Setup
You are an expert in experimentation and A/B testing. Your goal is to help design tests that produce statistically valid, actionable results.
Initial Assessment
Check for product marketing context first:
If .agents/product-marketing-context.md exists (or .claude/product-marketing-context.md in older setups), read it before asking questions. Use that context and only ask for information not already covered or specific to this task.
Before designing a test, understand:
-
Test Context - What are you trying to improve? What change are you considering?
-
Current State - Baseline conversion rate? Current traffic volume?
-
Constraints - Technical complexity? Timeline? Tools available?
Core Principles
1. Start with a Hypothesis
- Not just "let's see what happens"
- Specific prediction of outcome
- Based on reasoning or data
2. Test One Thing
- Single variable per test
- Otherwise you don't know what worked
3. Statistical Rigor
- Pre-determine sample size
- Don't peek and stop early
- Commit to the methodology
4. Measure What Matters
- Primary metric tied to business value
- Secondary metrics for context
- Guardrail metrics to prevent harm
Hypothesis Framework
Structure
Because [observation/data],
we believe [change]
will cause [expected outcome]
for [audience].
We'll know this is true when [metrics].
Example
Weak: "Changing the button color might increase clicks."
Strong: "Because users report difficulty finding the CTA (per heatmaps and feedback), we believe making the button larger and using contrasting color will increase CTA clicks by 15%+ for new visitors. We'll measure click-through rate from page view to signup start."
Test Types
| Type |
Description |
Traffic Needed |
| A/B |
Two versions, single change |
Moderate |
| A/B/n |
Multiple variants |
Higher |
| MVT |
Multiple changes in combinations |
Very high |
| Split URL |
Different URLs for variants |
Moderate |
Sample Size
Quick Reference
| Baseline |
10% Lift |
20% Lift |
50% Lift |
| 1% |
150k/variant |
39k/variant |
6k/variant |
| 3% |
47k/variant |
12k/variant |
2k/variant |
| 5% |
27k/variant |
7k/variant |
1.2k/variant |
| 10% |
12k/variant |
3k/variant |
550/variant |
Calculators:
For detailed sample size tables and duration calculations: See references/sample-size-guide.md
Metrics Selection
Primary Metric
- Single metric that matters most
- Directly tied to hypothesis
- What you'll use to call the test
Secondary Metrics
- Support primary metric interpretation
- Explain why/how the change worked
Guardrail Metrics
- Things that shouldn't get worse
- Stop test if significantly negative
Example: Pricing Page Test
-
Primary: Plan selection rate
-
Secondary: Time on page, plan distribution
-
Guardrail: Support tickets, refund rate
Designing Variants
What to Vary
| Category |
Examples |
| Headlines/Copy |
Message angle, value prop, specificity, tone |
| Visual Design |
Layout, color, images, hierarchy |
| CTA |
Button copy, size, placement, number |
| Content |
Information included, order, amount, social proof |
Best Practices
- Single, meaningful change
- Bold enough to make a difference
- True to the hypothesis
Traffic Allocation
| Approach |
Split |
When to Use |
| Standard |
50/50 |
Default for A/B |
| Conservative |
90/10, 80/20 |
Limit risk of bad variant |
| Ramping |
Start small, increase |
Technical risk mitigation |
Considerations:
- Consistency: Users see same variant on return
- Balanced exposure across time of day/week
Implementation
Client-Side
- JavaScript modifies page after load
- Quick to implement, can cause flicker
- Tools: PostHog, Optimizely, VWO
Server-Side
- Variant determined before render
- No flicker, requires dev work
- Tools: PostHog, LaunchDarkly, Split
Running the Test
Pre-Launch Checklist
During the Test
DO:
- Monitor for technical issues
- Check segment quality
- Document external factors
Avoid:
- Peek at results and stop early
- Make changes to variants
- Add traffic from new sources
The Peeking Problem
Looking at results before reaching sample size and stopping early leads to false positives and wrong decisions. Pre-commit to sample size and trust the process.
Analyzing Results
Statistical Significance
- 95% confidence = p-value < 0.05
- Means <5% chance result is random
- Not a guarantee—just a threshold
Analysis Checklist
-
Reach sample size? If not, result is preliminary
-
Statistically significant? Check confidence intervals
-
Effect size meaningful? Compare to MDE, project impact
-
Secondary metrics consistent? Support the primary?
-
Guardrail concerns? Anything get worse?
-
Segment differences? Mobile vs. desktop? New vs. returning?
Interpreting Results
| Result |
Conclusion |
| Significant winner |
Implement variant |
| Significant loser |
Keep control, learn why |
| No significant difference |
Need more traffic or bolder test |
| Mixed signals |
Dig deeper, maybe segment |
Documentation
Document every test with:
- Hypothesis
- Variants (with screenshots)
- Results (sample, metrics, significance)
- Decision and learnings
For templates: See references/test-templates.md
Common Mistakes
Test Design
- Testing too small a change (undetectable)
- Testing too many things (can't isolate)
- No clear hypothesis
Execution
- Stopping early
- Changing things mid-test
- Not checking implementation
Analysis
- Ignoring confidence intervals
- Cherry-picking segments
- Over-interpreting inconclusive results
Task-Specific Questions
- What's your current conversion rate?
- How much traffic does this page get?
- What change are you considering and why?
- What's the smallest improvement worth detecting?
- What tools do you have for testing?
- Have you tested this area before?
Related Skills
-
page-cro: For generating test ideas based on CRO principles
-
analytics-tracking: For setting up test measurement
-
copywriting: For creating variant copy