Root Cause Analysis
You are performing systematic root cause analysis to find the true source of a bug. Do not apply fixes until you understand WHY the bug exists.
Core Principle
Never fix a symptom. Always find and fix the root cause.
The Five Whys Method
Ask "Why?" repeatedly to drill down to the root cause:
- Why did the API return an error? → The database query failed
- Why did the database query fail? → The connection pool was exhausted
- Why was the pool exhausted? → ROOT CAUSE: Missing finally block to close connections
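The chain above ends at a missing finally block. As a minimal sketch of what that leak and its fix might look like, assuming a hypothetical in-memory `Pool` class (an illustration, not a real database driver API):

```typescript
// Hypothetical fixed-size pool, for illustration only; not a real driver API.
class Pool {
  private inUse = 0;
  private readonly size: number;

  constructor(size: number) {
    this.size = size;
  }

  acquire(): void {
    if (this.inUse >= this.size) {
      throw new Error("connection pool exhausted");
    }
    this.inUse += 1;
  }

  release(): void {
    if (this.inUse > 0) this.inUse -= 1;
  }

  get active(): number {
    return this.inUse;
  }
}

// Leaky version: if `work` throws, release() is never reached.
function leakyQuery<T>(pool: Pool, work: () => T): T {
  pool.acquire();
  const result = work();
  pool.release();
  return result;
}

// Fixed version: `finally` releases the connection on success and on failure.
function safeQuery<T>(pool: Pool, work: () => T): T {
  pool.acquire();
  try {
    return work();
  } finally {
    pool.release();
  }
}

// Stand-in for a query that always fails.
function failing(): never {
  throw new Error("query failed");
}

// Counts how many connections stay checked out after `attempts` failing queries.
function leakedAfterFailures(
  query: (pool: Pool, work: () => void) => void,
  attempts: number,
): number {
  const pool = new Pool(10);
  for (let i = 0; i < attempts; i++) {
    try {
      query(pool, failing);
    } catch {
      // The query error is expected here; only the pool state matters.
    }
  }
  return pool.active;
}
```

Under these assumptions, `leakedAfterFailures(leakyQuery, 3)` leaves three connections checked out, while `leakedAfterFailures(safeQuery, 3)` leaves none. Raising the pool size would only delay the exhaustion, which is why it counts as treating a symptom.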
Investigation Phases
Phase 1: Reproduce the Bug
Before investigating:
- Reproduce consistently - If you can't reproduce it, you can't verify a fix
- Document reproduction steps - Exact sequence of actions
- Note environment details - OS, versions, configuration
- Identify minimal reproduction - Smallest case that shows the bug
Questions to answer:
- Does it happen every time or intermittently?
- Does it happen in all environments?
- When did it start happening? (recent changes)
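To answer the first question with numbers rather than impressions, one option is to run the reproduction many times and record how often it fails. A minimal sketch, assuming the reproduction can be wrapped in a hypothetical synchronous function `repro` that throws when the bug appears:

```typescript
interface ReproStats {
  runs: number;
  failures: number;
  failureRate: number; // failures / runs, between 0 and 1
  firstError: string | undefined; // message of the first failure, if any
}

// Runs `repro` the given number of times and summarizes how often it throws.
function measureRepro(repro: (run: number) => void, runs: number): ReproStats {
  if (runs <= 0) throw new RangeError("runs must be positive");
  let failures = 0;
  let firstError: string | undefined = undefined;
  for (let run = 0; run < runs; run++) {
    try {
      repro(run);
    } catch (err) {
      failures += 1;
      if (firstError === undefined) {
        firstError = err instanceof Error ? err.message : String(err);
      }
    }
  }
  return { runs, failures, failureRate: failures / runs, firstError };
}

// Classifies the result in the terms used by the questions above.
function classify(stats: ReproStats): "never" | "intermittent" | "always" {
  if (stats.failures === 0) return "never";
  return stats.failures === stats.runs ? "always" : "intermittent";
}
```

A result of "never" after many runs usually means the reproduction steps or environment details are still incomplete, not that the bug is gone.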
Phase 2: Gather Evidence
Collect information before forming theories:
- Error messages and stack traces
- Log files (application, system, database)
- Recent code changes (git log, blame)
- User reports and reproduction steps
- Monitoring data (metrics, APM)
- Related issues (search issue tracker)
Do NOT:
- Make changes while gathering evidence
- Assume you know the cause without evidence
- Ignore related symptoms
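Gathering evidence from logs can start with counting what the logs already say, before any theory is formed. A sketch that groups error lines by message, assuming a hypothetical line format of `<timestamp> <LEVEL> <message>` (real log formats will differ):

```typescript
// Assumed line shape: "<timestamp> <LEVEL> <message>"; adjust for the real format.
const LOG_LINE = /^(\S+)\s+(DEBUG|INFO|WARN|ERROR)\s+(.*)$/;

interface ErrorSummary {
  message: string;
  count: number;
  firstSeen: string; // timestamp of the first matching line, in input order
}

// Groups ERROR lines by message, most frequent first. Lines that do not match
// the assumed format are skipped rather than guessed at.
function summarizeErrors(lines: string[]): ErrorSummary[] {
  const byMessage = new Map<string, ErrorSummary>();
  for (const line of lines) {
    const match = LOG_LINE.exec(line);
    if (match === null) continue;
    const timestamp = match[1] ?? "";
    const level = match[2] ?? "";
    const message = match[3] ?? "";
    if (level !== "ERROR") continue;
    const existing = byMessage.get(message);
    if (existing !== undefined) {
      existing.count += 1;
    } else {
      byMessage.set(message, { message, count: 1, firstSeen: timestamp });
    }
  }
  return Array.from(byMessage.values()).sort((a, b) => b.count - a.count);
}
```

The function only reads its input, which keeps it consistent with the rule against making changes while gathering evidence.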
Phase 3: Form Hypotheses
Based on evidence, create ranked hypotheses:
| Priority | Hypothesis | Evidence | Test Plan |
|----------|------------|----------|-----------|
| 1 | Connection leak in UserService | Stack trace shows connection pool | Add logging, check usage |
| 2 | Query timeout too short | Occurs under load | Test with longer timeout |
| 3 | Database server overload | Correlates with peak hours | Check DB metrics |
For each hypothesis:
- What evidence supports it?
- What evidence contradicts it?
- How can we test it?
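Recording the answers to these three questions per hypothesis makes the ranking explicit instead of a gut call. A minimal sketch, where the `Hypothesis` shape and the scoring rule (supporting minus contradicting evidence) are illustrative assumptions, not a prescribed method:

```typescript
interface Hypothesis {
  description: string;
  supporting: string[]; // evidence that supports the hypothesis
  contradicting: string[]; // evidence that contradicts it
  testPlan: string; // how it can be tested
}

// Assumed scoring rule: each supporting item adds one, each contradicting item subtracts one.
function score(h: Hypothesis): number {
  return h.supporting.length - h.contradicting.length;
}

// Returns a new array ordered from highest to lowest score; the input is not modified.
function rankHypotheses(hypotheses: Hypothesis[]): Hypothesis[] {
  return hypotheses.slice().sort((a, b) => score(b) - score(a));
}

// Hypotheses without a test plan cannot be confirmed or rejected yet.
function untestable(hypotheses: Hypothesis[]): Hypothesis[] {
  return hypotheses.filter((h) => h.testPlan.trim() === "");
}
```

Counting evidence items is a crude weighting; a single strong contradiction can reasonably outweigh several weak supporting observations, so treat the score as a starting order, not a verdict.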
Phase 4: Test Hypotheses
Test each hypothesis systematically:
- Start with highest probability
- Design a definitive test - Should clearly confirm or reject
- Make ONE change at a time
- Document results
If a hypothesis is rejected:
- Cross it off the list
- Re-evaluate remaining hypotheses
- Consider if new evidence suggests new hypotheses
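The procedure above amounts to a loop over the ranked list that records every outcome and stops at the first confirmation. A sketch, assuming each hypothesis carries a hypothetical `test` callback that returns `true` when the hypothesis is confirmed:

```typescript
interface TestableHypothesis {
  description: string;
  test: () => boolean; // true = confirmed, false = rejected
}

interface TestRecord {
  description: string;
  outcome: "confirmed" | "rejected";
}

// Tests hypotheses in the given order (highest probability first), one at a
// time, and stops at the first confirmation. Every result is recorded.
function testInOrder(ranked: TestableHypothesis[]): TestRecord[] {
  const records: TestRecord[] = [];
  for (const hypothesis of ranked) {
    const confirmed = hypothesis.test();
    records.push({
      description: hypothesis.description,
      outcome: confirmed ? "confirmed" : "rejected",
    });
    if (confirmed) break;
  }
  return records;
}

// True when every tested hypothesis was rejected, i.e. it is time to go back
// to the evidence and look for new hypotheses.
function allRejected(records: TestRecord[]): boolean {
  return records.every((r) => r.outcome === "rejected");
}
```

Keeping the rejected entries in the record matters as much as the confirmed one: they document what has already been ruled out and why.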
Phase 5: Verify Root Cause
Before declaring root cause found:
Common Root Cause Categories
- Code Defects: logic errors, boundary conditions, race conditions, resource leaks, null/undefined handling
- Design Issues: missing error handling, inadequate validation, poor state management, coupling
- Environment: configuration errors, resource constraints, version mismatches, network issues
- Data Issues: invalid input, data corruption, schema mismatches, encoding problems
Evidence Collection Commands
```bash
# Recent changes to relevant files
git log --oneline -20 -- path/to/file

# Who changed this line
git blame path/to/file

# Changes since last working version
git diff v1.2.3..HEAD -- src/

# Search for related error handling
grep -r "catch\|error\|throw" --include="*.ts" src/
```
Red Flags - You Haven't Found Root Cause
- "I'm not sure why, but this fix works"
- "The bug went away after I restarted"
- "I added a check to prevent this case"
- "It's probably a race condition somewhere"
These suggest symptom treatment, not root cause resolution.
Documentation Template
When root cause is found, document:
```markdown
## Bug: [Description]

### Root Cause
[Clear explanation of why the bug occurred]

### Evidence
- [Evidence 1]
- [Evidence 2]

### Causal Chain
1. [Initial trigger]
2. [Intermediate cause]
3. [Root cause]
4. [Observed symptom]

### Fix
[Description of the fix and why it addresses root cause]

### Prevention
[How to prevent similar issues in the future]
```
Integration with Other Skills
After finding root cause:
- Use testing/red-green-refactor to write a test that exposes the bug
- Use planning/verification-gates to validate the fix
- Consider collaboration/structured-review for complex fixes