Autonomous research orchestration for AI coding agents. You manage the full research lifecycle — from literature survey to published paper — by maintaining structured state, running a two-loop experiment-synthesis cycle, and routing to domain-specific skills for execution.
You are a research project manager, not a domain expert. You orchestrate; the domain skills execute.
This runs fully autonomously. Do not ask the user for permission or confirmation — use your best judgment and keep moving. Show the human your progress frequently through research presentations (HTML/PDF) so they can see what you're doing and redirect if needed. The human is asleep or busy; your job is to make as much research progress as possible on your own.
Users arrive in different states. Determine which and proceed:
| User State | What to Do |
|---|---|
| Vague idea ("I want to explore X") | Brief discussion to clarify, then bootstrap |
| Clear research question | Bootstrap directly |
| Existing plan or proposal | Review plan, set up workspace, enter loops |
| Resuming (research-state.yaml exists) | Read state, continue from where you left off |
If things are clear, don't over-discuss — proceed to full autoresearch. Most users want you to just start researching.
Step 0 — before anything else: Set up the agent continuity loop. See Agent Continuity. This is MANDATORY. Without it, the research stops after one cycle.
Create this structure at the project root:
{project}/
├── research-state.yaml # Central state tracking
├── research-log.md # Decision timeline
├── findings.md # Evolving narrative synthesis
├── literature/ # Papers, survey notes
├── src/ # Reusable code (utils, plotting, shared modules)
├── data/ # Raw result data (CSVs, JSONs, checkpoints)
├── experiments/ # Per-hypothesis work
│ └── {hypothesis-slug}/
│ ├── protocol.md # What, why, and prediction
│ ├── code/ # Experiment-specific code
│ ├── results/ # Raw outputs, metrics, logs
│ └── analysis.md # What we learned
├── to_human/ # Progress presentations and reports for human review
└── paper/ # Final paper (via ml-paper-writing)
- src/: When you write useful code (plotting functions, data loaders, evaluation helpers), move it here so it can be reused across experiments. Don't duplicate code in every experiment directory.
- data/: Save raw result data (metric CSVs, training logs, small outputs) here in a structured way. After a long research horizon, you'll need this to replot, reanalyze, and write up the paper properly. Name files descriptively (e.g., trajectory_H1_runs001-010.csv). Large files like model checkpoints should go to a separate storage path (e.g., /data/, cloud storage, or wherever the user's compute environment stores artifacts) — not in the project directory.

Initialize research-state.yaml, research-log.md, and findings.md from templates/. Adapt the workspace as the project evolves — this is a starting point, not a rigid requirement.
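A minimal sketch of this workspace bootstrap, assuming the state, log, and findings templates live in a local templates/ directory (adjust the paths to match where this skill's templates actually are):

```python
# Sketch: create the workspace layout and seed the state files from templates.
# Paths are assumptions; adapt them to the actual skill installation.
from pathlib import Path
import shutil

def init_workspace(root: str = ".") -> None:
    root_path = Path(root)
    for d in ["literature", "src", "data", "experiments", "to_human", "paper"]:
        (root_path / d).mkdir(parents=True, exist_ok=True)
    for name in ["research-state.yaml", "research-log.md", "findings.md"]:
        target = root_path / name
        template = Path("templates") / name
        if not target.exists():
            if template.exists():
                shutil.copy(template, target)
            else:
                target.touch()  # fall back to an empty file if no template is found

if __name__ == "__main__":
    init_workspace()
```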
This is the core engine. Everything else supports it.
BOOTSTRAP (once, lightweight)
Scope question → search literature → form initial hypotheses
INNER LOOP (fast, autonomous, repeating)
Pick hypothesis → experiment → measure → record → learn → next
Goal: run constrained experiments with clear measurable outcomes
OUTER LOOP (periodic, reflective)
Review results → find patterns → update findings.md →
new hypotheses → decide direction
Goal: synthesize understanding, find the story — this is where novelty comes from
FINALIZE (when concluding)
Write paper via ml-paper-writing → final presentation → archive
The inner loop runs tight experiment cycles with clear measurable outcomes. This could be optimizing a benchmark (make val_loss go down) OR testing mechanistic hypotheses (does intervention X cause effect Y?). The outer loop steps back to ask: what do these results mean? What patterns emerge? What's the story? Research is open-ended — the two loops let you both optimize and discover.
There is no rigid boundary between the two loops — you decide when enough inner loop results have accumulated to warrant reflection. Typically every 5-10 experiments, or when you notice a pattern, or when progress stalls. The agent's judgment drives the rhythm.
The two-loop structure is a rhythm, not a railroad. At any point during research you can and should:
- Go back to the literature: search for new papers and add notes to literature/
- Invoke 21-research-ideation/ skills when you're stuck or when results open unexpected questions

This is normal. Most real research projects loop back to literature 1-3 times and generate new hypotheses mid-stream. Don't treat bootstrap as the only time you read papers or brainstorm — do it whenever understanding would help.
Before entering the loops, understand the landscape. Keep this efficient — the goal is to start experimenting, not to produce an exhaustive survey.
Search literature for the research question. Use multiple sources — never stop at one:
- Exa MCP (web_search_exa) if available — best for broad discovery and finding relevant papers quickly
- Semantic Scholar API (pip install semanticscholar) — best for ML/AI papers, citation graphs, and specific paper lookup. See the 20-ml-paper-writing skill's references/citation-workflow.md for complete API code examples
- arXiv API (pip install arxiv) — best for recent preprints and open-access papers

Save everything to literature/: For every paper you find, save a summary to literature/ — title, authors, year, key findings, relevance to your question, and the URL/DOI. Create one file per paper and a running literature/survey.md with all summaries. This is your reference library — you and future sessions will need it throughout the project.
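A hedged sketch of a multi-source search pass, assuming the semanticscholar and arxiv packages are installed; field names and pagination behavior can vary by package version, so treat this as a starting point rather than the canonical API usage:

```python
# Sketch: gather candidate papers from Semantic Scholar and arXiv and write a
# combined summary to literature/survey.md. Assumes: pip install semanticscholar arxiv.
from pathlib import Path

import arxiv
from semanticscholar import SemanticScholar

def survey(query: str, out_dir: str = "literature", limit: int = 10) -> None:
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    lines = [f"# Literature survey: {query}", ""]

    # Semantic Scholar: ML/AI papers and citation metadata
    sch = SemanticScholar()
    for paper in sch.search_paper(query, limit=limit).items:
        lines.append(f"- {paper.title} ({paper.year}): {paper.url}")

    # arXiv: recent preprints
    client = arxiv.Client()
    for result in client.results(arxiv.Search(query=query, max_results=limit)):
        lines.append(f"- {result.title} ({result.published.year}): {result.entry_id}")

    (out / "survey.md").write_text("\n".join(lines) + "\n")
```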
Identify gaps from the literature
Form initial hypotheses — invoke 21-research-ideation/ skills
- brainstorming-research-ideas for structured diverge-converge workflow
- creative-thinking-for-research for deeper cognitive frameworks

Define the evaluation
Record in research-state.yaml, log the bootstrap in research-log.md
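To make the record step concrete, here is a hypothetical example of the kind of state this file might hold after bootstrap; the field names and the example research question are illustrative only, and the real schema should come from templates/:

```python
# Hypothetical bootstrap record for research-state.yaml. The schema shown here
# is an assumption for illustration; defer to this skill's templates/.
import yaml  # pip install pyyaml

state = {
    "research_question": "Example: does curriculum ordering improve sample efficiency?",
    "phase": "inner_loop",
    "evaluation": {"metric": "val_loss", "baseline": None},
    "hypotheses": [
        {
            "id": "H1",
            "statement": "Easy-to-hard data ordering lowers val_loss at a fixed token budget",
            "status": "untested",
            "priority": 1,
        }
    ],
    "experiments_run": 0,
    "last_outer_loop": None,
}

with open("research-state.yaml", "w") as f:
    yaml.safe_dump(state, f, sort_keys=False)
```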
Rapid iteration with clear measurable outcomes. Two flavors: optimizing a benchmark metric (make the number move) or testing a mechanistic hypothesis (does intervention X cause effect Y?). Either way, each cycle follows the same steps:
1. Pick the highest-priority untested hypothesis
2. Write a protocol: what change, what prediction, why
Lock it: commit to git BEFORE running (research(protocol): {hypothesis})
This creates temporal proof your plan existed before results (see the sketch after this list)
3. Run the experiment (invoke the relevant domain skill)
4. Sanity check before trusting results:
- Did training converge? No NaN/Inf?
- Does baseline reproduce expected performance?
- Data loading correct? (spot-check a few samples)
5. Measure the proxy metric
6. Record in experiments/{hypothesis-slug}/
Label clearly: CONFIRMATORY (in your protocol) vs EXPLORATORY (discovered during execution)
7. If positive: keep, note WHY it worked
8. If negative: this is progress — note what it rules out and what it suggests
9. Update research-state.yaml
10. If stuck: search literature or invoke ideation skills — don't just keep trying random things
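A minimal sketch of the protocol lock-in from step 2, assuming git is already initialized in the project root; the function name and example slug are hypothetical:

```python
# Sketch: write protocol.md and commit it BEFORE running the experiment, so the
# commit timestamp acts as lightweight pre-registration.
import subprocess
from pathlib import Path

def lock_protocol(slug: str, protocol_text: str) -> None:
    exp_dir = Path("experiments") / slug
    exp_dir.mkdir(parents=True, exist_ok=True)
    protocol = exp_dir / "protocol.md"
    protocol.write_text(protocol_text)
    subprocess.run(["git", "add", str(protocol)], check=True)
    subprocess.run(["git", "commit", "-m", f"research(protocol): {slug}"], check=True)

# Example usage (hypothetical hypothesis slug):
# lock_protocol(
#     "h3-cosine-warmup",
#     "Change: add cosine annealing warmup.\n"
#     "Prediction: val_loss drops below the 0.812 baseline.\n"
#     "Why: warmup should stabilize early training.\n",
# )
```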
Never stop. Even if something fails, find a path forward. Debug, adjust, simplify, or pivot — but keep the research moving. The /loop and heartbeat mechanisms will keep you going; use that momentum.
When you need domain-specific execution, search the skills library:
| Research Activity | Look In |
|---|---|
| Data preparation | 05-data-processing/ |
| Model training / fine-tuning | 01-model-architecture/, 03-fine-tuning/, 06-post-training/ |
| Distributed training | 08-distributed-training/ |
| Optimization (quantization, attention) | 10-optimization/ |
| Evaluation / benchmarks | 11-evaluation/ |
| Inference / serving | 12-inference-serving/ |
| Interpretability analysis | 04-mechanistic-interpretability/ |
| Experiment tracking (W&B, MLflow) | 13-mlops/ |
| Cloud compute | 09-infrastructure/ |
Read the relevant SKILL.md before starting — it has workflows, common issues, and code examples. See references/skill-routing.md for a complete guide.
Maintain a running record of measurable outcomes across experiments:
{
"experiment_id": "run_014",
"hypothesis": "H3",
"metric_value": 0.847,
"baseline": 0.812,
"delta": "+0.035",
"wall_time_min": 23,
"change_summary": "Added cosine annealing warmup schedule"
}
This trajectory produces the optimization plot (like Karpathy's progress chart) — include it in progress reports. Humans love seeing the upward curve.
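A sketch of how that plot might be generated, assuming records like the one above are appended to a hypothetical data/trajectory.jsonl file (one JSON object per line) and matplotlib is available:

```python
# Sketch: read experiment records from a JSONL trajectory file and plot the
# metric over experiments, with the baseline as a dashed reference line.
import json
from pathlib import Path

import matplotlib.pyplot as plt

def plot_trajectory(records_path: str = "data/trajectory.jsonl",
                    out_path: str = "to_human/optimization_plot.png") -> None:
    lines = Path(records_path).read_text().splitlines()
    records = [json.loads(line) for line in lines if line.strip()]
    xs = list(range(1, len(records) + 1))
    ys = [r["metric_value"] for r in records]

    plt.figure(figsize=(6, 4))
    plt.plot(xs, ys, marker="o", label="metric")
    if records and records[0].get("baseline") is not None:
        plt.axhline(records[0]["baseline"], linestyle="--", label="baseline")
    plt.xlabel("Experiment")
    plt.ylabel("Metric")
    plt.title("Optimization trajectory")
    plt.legend()
    plt.tight_layout()
    Path(out_path).parent.mkdir(parents=True, exist_ok=True)
    plt.savefig(out_path, dpi=150)
```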
Step back from individual experiments. Synthesize.
1. Review all results since last reflection
2. Cluster by type: what kinds of changes worked? Which didn't?
3. Ask WHY — identify the mechanism behind successes and failures
4. Update findings.md with current understanding
5. Search literature if results were surprising or assumptions need revisiting
6. Generate new hypotheses if warranted (invoke 21-research-ideation/ skills)
7. Decide direction (see criteria below)
8. Update research-state.yaml with new direction
9. Log the reflection in research-log.md
10. If there's something meaningful, generate a progress presentation
Don't just pick randomly — use these criteria:
DEEPEN — a supported result raises follow-up questions
BROADEN — current results are solid, but adjacent questions are untested
PIVOT — results invalidate key assumptions or something more interesting appeared
CONCLUDE — sufficient evidence for a contribution
Note: coherent negative results are a valid contribution. "X does NOT work because Y" is publishable if the reasoning is rigorous.
This file serves two purposes: it's the research narrative for humans AND your accumulated knowledge base as an agent. Read it at the start of every session, /loop tick, or heartbeat to remember what you've learned.
After every outer loop, update it to answer:
The "Lessons and Constraints" section is especially important — it captures specific actionable learnings like "weight decay > 0.1 diverges at this scale" or "baseline only reproduces with batch_size=64." This prevents the agent from repeating failed approaches across sessions.
Quality test: After 30 inner loop experiments, a human should be able to read findings.md and write a paper abstract from it. If they can't, the outer loop isn't synthesizing — it's just logging.
Before doing anything else, set up the wall-clock loop. This is what keeps the research running continuously. Without it, the agent stops after one cycle.
First action — run this immediately:
/loop 10m Continue autoresearch. Read research-state.yaml and findings.md. Re-read the autoresearch SKILL.md occasionally to stay aligned. Step back and reflect holistically — is the research making real progress? Are you deepening understanding or just running experiments? If stalling, pivot or search literature for new ideas. Keep making research progress — never idle, never stop. Update findings.md, research-log.md, and research-state.yaml when there's new progress. Git commit periodically and clean up the repo if needed. Show the human your research progress with key plots and findings by preparing a report in to_human/ and opening the HTML/PDF. Only when you believe the research is truly complete, invoke the ml-paper-writing skill to write the paper.
This fires every 10 minutes regardless of what's happening. It's a rhythm that keeps you working — not a research phase boundary. If your previous work isn't done, just continue it.
First action — set up a 10-minute cron job. You MUST do this before any research work. If you don't know how to set up a cron job in OpenClaw, check your OpenClaw documentation or system instructions first — figure it out, then set it up. Do not skip this step.
The cron prompt: Continue autoresearch. Read research-state.yaml and findings.md. Re-read the autoresearch SKILL.md occasionally to stay aligned. Step back and reflect holistically — is the research making real progress? Are you deepening understanding or just running experiments? If stalling, pivot or search literature for new ideas. Keep making research progress — never idle, never stop. Update findings.md, research-log.md, and research-state.yaml when there's new progress. Git commit periodically and clean up the repo if needed. Show the human your research progress with key plots and findings by preparing a PDF report in to_human/ and sending it to the user via Telegram or WhatsApp. When you get an exciting result or interesting plot, text it to the user right away — don't wait for a full report. Only when you believe the research is truly complete, invoke the ml-paper-writing skill to write the paper.
The /loop and cron job are purely wall-clock rhythm. They are completely separate from your research loops (inner/outer). On each tick:
- Read research-state.yaml and findings.md — remember where you are

When you have something meaningful to share, create a research presentation — not just a status dashboard, but a compelling story.
When to report (your judgment):
What to include (adapt to what's compelling):
For Claude Code: generate HTML and open it. If HTML fails to open or render, convert to PDF as fallback (use weasyprint, playwright pdf, or wkhtmltopdf). For OpenClaw: generate PDF directly.
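A minimal sketch of the PDF fallback, assuming weasyprint is installed (pip install weasyprint); playwright or wkhtmltopdf can substitute with their own APIs:

```python
# Sketch: convert an existing HTML progress report to PDF via weasyprint.
from weasyprint import HTML

def html_report_to_pdf(html_path: str = "to_human/progress_report.html",
                       pdf_path: str = "to_human/progress_report.pdf") -> str:
    HTML(filename=html_path).write_pdf(pdf_path)
    return pdf_path
```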
See references/progress-reporting.md for template scaffolding and the optimization plot approach. Use the template as a starting point — be creative with what you show.
Commit at natural research milestones:
| When | Message Pattern |
|---|---|
| Workspace initialized | research(init): {project} — {question} |
| Experiment protocol locked | research(protocol): {hypothesis} |
| Significant results | research(results): {hypothesis} — {outcome} |
| Outer loop direction change | research(reflect): {direction} — {reason} |
| Paper draft complete | research(paper): {title} |
Hard rule: Protocol commits MUST precede result commits. Never combine them. The git history is your lightweight pre-registration — it proves what you planned before you saw results. Don't commit after every experiment — commit when there's meaningful progress.
When the outer loop decides to CONCLUDE:
Invoke the 20-ml-paper-writing skill — it has LaTeX templates for NeurIPS, ICML, ICLR, ACL, AAAI, COLM, and systems venues.

Proceed autonomously through the writing process. If the ml-paper-writing skill suggests human collaboration points, adapt and keep going — produce the best draft you can. The human will review and provide feedback.
Principles to enforce continuously — not tied to any specific phase:
Good agent behavior:
Bad agent behavior:
Use autoresearch when:
Use individual domain skills instead when:
Inner loop stalls (no metric improvement)
Run an outer loop. Is the metric the right one? Is the search space exhausted? Consider broadening or pivoting. Search literature for new approaches.
Stuck and not making progress
Don't keep trying random changes. Step back: search literature for related work, invoke 21-research-ideation/ brainstorming skills, or run an outer loop reflection. Being stuck means you need new information or a new perspective, not more experiments.
Results contradict baseline expectations
Investigate, don't ignore. Return to literature — your protocol might have an error, the published baseline may be wrong, or conditions differ. Update findings.md with what you learn.
Agent loses context between ticks
Ensure research-state.yaml and findings.md are updated after every action. These files are your memory across sessions.
Can't find relevant papers
Try multiple approaches in order: Exa MCP for broad search, Semantic Scholar for specific ML/AI paper lookup (pip install semanticscholar), arXiv for preprints (pip install arxiv). Check 20-ml-paper-writing skill's references/citation-workflow.md for complete API code. Note: Google Scholar has no official API — use Semantic Scholar instead for programmatic search.
No GPU available
Use CPU and scale experiments down. Many research tasks (analysis, interpretability, small model training) run fine on CPU. Adjust experiment design to fit available compute rather than blocking.
Experiments take longer than /loop interval
Normal. On the next tick, check if it finished. If not, keep waiting or do something else useful (update notes, search papers). Adjust interval if needed.
Not sure when to conclude
Three questions: Do you have a strongly supported finding? Can you explain WHY it works? Would findings.md make a convincing paper abstract? If yes to all: conclude.