Parallel Root Cause Analysis Coordination

v20260407

parallel-investigation

A structured methodology for coordinating multiple, simultaneous investigation threads to determine the root cause of complex system failures, performance regressions, or integration bugs. This skill decomposes a large problem into independent tracks (e.g., database, application code, infrastructure), assigns ownership, and manages structured sync points (Continue/Pivot/Converge) to converge on a solution faster than serial debugging. Essential for high-stakes production incident response.

Debugging Root-Cause-Analysis Parallel Collaboration Problem-Solving Hypothesis-Testing

Get Skill

477 downloads

Overview

Parallel Investigation

Coordinate parallel investigation threads to explore multiple hypotheses simultaneously. Most effective for production incidents, performance regressions, or integration failures where the root cause is unclear.

Core Principle

When uncertain, explore multiple paths in parallel. Converge when evidence points to an answer.

Parallel investigation reduces time-to-solution by eliminating serial bottlenecks.

Investigation Structure

Phase 1: Problem Decomposition

Break the problem into independent investigation threads:

Problem: API responses are slow

Investigation Threads:
├── Thread A: Database performance
│   └── Check slow queries, indexes, connection pool
├── Thread B: Application code
│   └── Profile endpoint handlers, check for N+1
├── Thread C: Infrastructure
│   └── Check CPU, memory, network latency
└── Thread D: External services
    └── Check third-party API response times

Each thread should be independent (no blocking dependencies), focused (clear scope), and time-boxed.

Phase 2: Thread Assignment

Assign threads with clear ownership:

## Thread A: Database Performance
**Investigator:** [Name/Agent A]
**Duration:** 30 minutes
**Scope:**
- Query execution times
- Index utilization
- Connection pool metrics
**Report Format:** Summary + evidence

Phase 3: Parallel Execution

Each thread follows this pattern:

Gather evidence specific to thread scope
Document findings as you go
Identify if thread is a lead or dead end
Prepare summary for sync point

Thread Log Template:

## Thread: [Name]
**Start:** [Time]

### Findings
- [Timestamp] [Finding]

### Evidence
- [Log/Metric/Screenshot]

### Preliminary Conclusion
[What this thread suggests about the problem]

Phase 4: Sync Points

Regular convergence to share findings:

Sync Point Agenda:
1. Each thread report (2 min each)
2. Discussion & correlation (5 min)
3. Decision: Continue, Pivot, or Converge (3 min)

Sync Point Decisions:

Continue: Threads are progressing, maintain parallel execution
Pivot: Redirect threads based on new evidence
Converge: One thread found the answer, others join to validate

Phase 5: Convergence

When a thread identifies the likely cause:

Validate — Other threads verify the finding
Deep dive — Focused investigation on identified cause
Document — Compile findings from all threads

Coordination Patterns

Hub and Spoke: One coordinator assigns threads, tracks progress, calls sync points, and makes convergence decisions. Best when one person has the most context.

Peer Network: Equal investigators post findings to a shared channel and self-organize convergence when a pattern emerges. Best when investigators have similar expertise.

Communication Protocol

During Investigation

[Thread A] [Status] Starting query analysis
[Thread B] [Finding] No N+1 patterns in user endpoint
[Thread A] [Finding] Slow query: SELECT * FROM orders WHERE...
[Thread C] [Dead End] CPU and memory within normal
[Thread A] [Hot Lead] Missing index on orders.user_id

At Sync Point

## Thread A Summary

**Status:** Hot Lead
**Key Finding:** Missing index on orders.user_id
**Evidence:** Query taking 3.2s, explain shows full table scan
**Recommendation:** Likely root cause — suggest converge

Decision Framework

Thread Status	Action
All exploring	Continue parallel
One hot lead	Validate lead, others support
Multiple leads	Prioritize by evidence strength
All dead ends	Reframe problem, new threads
Confirmed cause	Converge, begin fix

Time Management

A typical two-hour investigation:

0:00  Problem decomposition & thread assignment
0:15  Parallel investigation begins
0:45  Sync point #1 → Continue/Pivot/Converge decision
1:30  Sync point #2 (if continuing)
1:35  Final convergence & documentation

Adjust sync point cadence based on incident severity — every 20 minutes for critical outages, every 45 minutes for lower-urgency investigations.

Documentation

Final Report Structure

# Investigation: [Problem]

## Summary
[Brief description and resolution]

## Threads Explored

### Thread A: [Area]
- Investigator: [Name]
- Findings: [Summary]
- Outcome: [Lead / Dead End / Root Cause]

## Root Cause
[Detailed explanation of what was found]

## Evidence
- [Evidence 1]
- [Evidence 2]

## Resolution
[What was done to fix]

## Lessons Learned
- [Learning 1]

Integration with Other Skills

debugging/root-cause-analysis: Each thread follows RCA principles
debugging/hypothesis-testing: Threads test specific hypotheses
handoff-protocols: When passing a thread to another person

Info

Category Development

Name parallel-investigation

Version v20260407

Size 5.72KB

Source rohitg00/skillkit

Updated At 2026-07-13