This skill provides Python tools for searching and retrieving preprints from arXiv.org via its public Atom API. It supports keyword search, author search, category filtering, arXiv ID lookup, and PDF download. Results are returned as structured JSON with titles, abstracts, authors, categories, and links.
Use this skill when:
- Looking up specific papers by arXiv ID (e.g., 2309.10668)
- Filtering by arXiv category (e.g., cs.LG, cs.CL, stat.ML)

Consider alternatives when:
Search for papers by keywords in titles, abstracts, or all fields.
python scripts/arxiv_search.py \
--keywords "sparse autoencoders" "mechanistic interpretability" \
--max-results 20 \
--output results.json
With category filter:
python scripts/arxiv_search.py \
--keywords "transformer" "attention mechanism" \
--category cs.LG \
--max-results 50 \
--output transformer_papers.json
Search specific fields:
# Title only
python scripts/arxiv_search.py \
--keywords "GRPO" \
--search-field ti \
--max-results 10
# Abstract only
python scripts/arxiv_search.py \
--keywords "reward model" "RLHF" \
--search-field abs \
--max-results 30
Search by author:
python scripts/arxiv_search.py \
--author "Anthropic" \
--max-results 50 \
--output anthropic_papers.json
Author with category filter:
python scripts/arxiv_search.py \
--author "Ilya Sutskever" \
--category cs.LG \
--max-results 20
Retrieve metadata for specific papers:
python scripts/arxiv_search.py \
--ids 2309.10668 2406.04093 2310.01405 \
--output sae_papers.json
Full arXiv URLs also accepted:
python scripts/arxiv_search.py \
--ids "https://arxiv.org/abs/2309.10668"
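Since the CLI accepts both bare IDs and full URLs, callers that mix the two may want to normalize them first. The sketch below is one way to do that. `normalize_arxiv_id` is a hypothetical helper, not part of `scripts/arxiv_search.py`, and the regular expression assumes the two public arXiv ID formats (new-style `YYMM.NNNNN` and old-style `archive/YYMMNNN`).

```python
import re

# Hypothetical helper for illustration; not part of scripts/arxiv_search.py.
# New-style IDs look like 2309.10668; old-style IDs look like hep-th/9901001
# or math.GT/0309136. Either may carry a version suffix such as v2.
_ARXIV_ID = re.compile(
    r"(?P<id>\d{4}\.\d{4,5}|[a-z\-]+(?:\.[A-Z]{2})?/\d{7})(?P<version>v\d+)?"
)


def normalize_arxiv_id(value, keep_version=False):
    """Extract the arXiv ID from a bare ID or an arxiv.org abs/pdf URL."""
    text = value.strip()
    if text.lower().endswith(".pdf"):
        text = text[:-4]  # PDF links may end in ".pdf"
    match = _ARXIV_ID.search(text)
    if match is None:
        raise ValueError(f"No arXiv ID found in {value!r}")
    arxiv_id = match.group("id")
    if keep_version and match.group("version"):
        arxiv_id += match.group("version")
    return arxiv_id


ids = [normalize_arxiv_id(v) for v in (
    "2309.10668",
    "https://arxiv.org/abs/2309.10668",
    "http://arxiv.org/pdf/2309.10668v1",
)]
print(ids)  # all three inputs reduce to the same bare ID
```

By default the version suffix is dropped, so lookups resolve to the latest version; pass `keep_version=True` to pin a specific one.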
List recent papers in a category:
python scripts/arxiv_search.py \
--category cs.AI \
--max-results 100 \
--sort-by submittedDate \
--output recent_cs_ai.json
Download the PDF for a specific paper:
python scripts/arxiv_search.py \
--ids 2309.10668 \
--download-pdf papers/
Batch download from search results:
from scripts.arxiv_search import ArxivSearcher
searcher = ArxivSearcher()
# Search first (double quotes make the title match a phrase)
results = searcher.search(query='ti:"sparse autoencoder"', max_results=5)
# Download all; old-style IDs contain "/", so replace it to keep filenames valid
for paper in results:
    arxiv_id = paper["arxiv_id"]
    searcher.download_pdf(arxiv_id, f"papers/{arxiv_id.replace('/', '_')}.pdf")
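When downloading many PDFs in a row, it can help to pause between requests and to keep going if one download fails. The following is a minimal sketch under stated assumptions: `download_all` is a hypothetical helper, the `download` argument stands in for `searcher.download_pdf`, and the 3-second default follows arXiv's guidance for API clients of at most one request every three seconds, which may or may not match how the script fetches PDFs.

```python
import time


def download_all(papers, download, delay_seconds=3.0, sleep=time.sleep):
    """Call download(arxiv_id, path) for each paper, pausing between calls.

    Hypothetical helper: `download` is any callable taking the same
    (arxiv_id, output_path) arguments as searcher.download_pdf above.
    Returns the paths that were written successfully.
    """
    saved = []
    for index, paper in enumerate(papers):
        if index > 0:
            sleep(delay_seconds)  # be polite to the server between requests
        arxiv_id = paper["arxiv_id"]
        # Old-style IDs contain "/", which is not safe in a filename.
        path = f"papers/{arxiv_id.replace('/', '_')}.pdf"
        try:
            download(arxiv_id, path)
        except Exception as error:  # keep going if one paper fails
            print(f"Skipping {arxiv_id}: {error}")
            continue
        saved.append(path)
    return saved
```

Assuming the `searcher` and `results` from the snippet above, it would be called as `download_all(results, searcher.download_pdf)`. The `sleep` argument is injectable only so the pause can be replaced in tests.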
| Category | Description |
|---|---|
| cs.AI | Artificial Intelligence |
| cs.CL | Computation and Language (NLP) |
| cs.CV | Computer Vision |
| cs.LG | Machine Learning |
| cs.NE | Neural and Evolutionary Computing |
| cs.RO | Robotics |
| cs.CR | Cryptography and Security |
| cs.DS | Data Structures and Algorithms |
| cs.IR | Information Retrieval |
| cs.SE | Software Engineering |

| Category | Description |
|---|---|
| stat.ML | Machine Learning (Statistics) |
| stat.ME | Methodology |
| math.OC | Optimization and Control |
| math.ST | Statistics Theory |

| Category | Description |
|---|---|
| q-bio.BM | Biomolecules |
| q-bio.GN | Genomics |
| q-bio.QM | Quantitative Methods |
| q-fin.ST | Statistical Finance |
| eess.SP | Signal Processing |
| physics.comp-ph | Computational Physics |

Full list: see references/api_reference.md.
The arXiv API uses prefix-based field searches combined with Boolean operators.
Field prefixes:
- ti: Title
- au: Author
- abs: Abstract
- cat: Category
- all: All fields (default)
- co: Comment
- jr: Journal reference
- id: arXiv ID

Boolean operators (must be UPPERCASE):
ti:transformer AND abs:attention
au:bengio OR au:lecun
cat:cs.LG ANDNOT cat:cs.CV
Grouping with parentheses:
(ti:sparse AND ti:autoencoder) AND cat:cs.LG
au:anthropic AND (abs:interpretability OR abs:alignment)
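
The prefix, operator, and grouping rules above can also be composed programmatically. A minimal sketch, assuming hypothetical helpers `field`, `all_of`, and `any_of` (they are not part of `scripts/arxiv_search.py`) and assuming multi-word values should be wrapped in double quotes so they match as a phrase:

```python
def field(prefix, value):
    """Format one term such as ti:transformer or ti:"sparse autoencoder"."""
    value = value.strip()
    if " " in value:
        value = f'"{value}"'  # quote multi-word values as a phrase
    return f"{prefix}:{value}"


def _join(operator, terms):
    """Join terms with an uppercase operator, parenthesizing groups."""
    terms = [t for t in terms if t]
    if not terms:
        raise ValueError("at least one term is required")
    if len(terms) == 1:
        return terms[0]
    return "(" + f" {operator} ".join(terms) + ")"


def all_of(*terms):
    """Combine terms so that every one must match (AND)."""
    return _join("AND", terms)


def any_of(*terms):
    """Combine terms so that at least one must match (OR)."""
    return _join("OR", terms)


query = all_of(
    field("au", "anthropic"),
    any_of(field("abs", "interpretability"), field("abs", "alignment")),
)
print(query)
# (au:anthropic AND (abs:interpretability OR abs:alignment))
```

Parenthesizing every multi-term group keeps operator precedence explicit, at the cost of a redundant outer pair of parentheses.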
Examples:
from scripts.arxiv_search import ArxivSearcher
searcher = ArxivSearcher()
# Papers about SAEs in ML (double quotes make the title match a phrase)
results = searcher.search(
    query='ti:"sparse autoencoder" AND cat:cs.LG',
    max_results=50,
    sort_by="submittedDate",
)
# Specific author in specific field
results = searcher.search(
    query='au:"neel nanda" AND cat:cs.LG',
    max_results=20,
)
# Complex boolean query
results = searcher.search(
    query='(abs:RLHF OR abs:"reinforcement learning from human feedback") AND cat:cs.CL',
    max_results=100,
)
All searches return structured JSON:
{
  "query": "ti:sparse autoencoder AND cat:cs.LG",
  "result_count": 15,
  "results": [
    {
      "arxiv_id": "2309.10668",
      "title": "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning",
      "authors": ["Trenton Bricken", "Adly Templeton", "..."],
      "abstract": "Full abstract text...",
      "categories": ["cs.LG", "cs.AI"],
      "primary_category": "cs.LG",
      "published": "2023-09-19T17:58:00Z",
      "updated": "2023-10-04T14:22:00Z",
      "doi": "10.48550/arXiv.2309.10668",
      "pdf_url": "http://arxiv.org/pdf/2309.10668v1",
      "abs_url": "http://arxiv.org/abs/2309.10668v1",
      "comment": "42 pages, 30 figures",
      "journal_ref": ""
    }
  ]
}
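For orientation, the fields above correspond closely to elements of the Atom feed that the arXiv API returns. The sketch below shows one way such a mapping could work; it is not the script's actual parser. `parse_feed` and `SAMPLE_FEED` are hypothetical, and the sample entry uses placeholder values rather than a real paper.

```python
import re
import xml.etree.ElementTree as ET

# XML namespaces used by the arXiv Atom feed.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "arxiv": "http://arxiv.org/schemas/atom",
}

# Placeholder feed for illustration only; not a real paper.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:arxiv="http://arxiv.org/schemas/atom">
  <entry>
    <id>http://arxiv.org/abs/0000.00000v1</id>
    <title>An Illustrative
      Title</title>
    <summary>Placeholder abstract.</summary>
    <published>2024-01-01T00:00:00Z</published>
    <author><name>First Author</name></author>
    <author><name>Second Author</name></author>
    <category term="cs.LG"/>
    <category term="cs.AI"/>
    <arxiv:primary_category term="cs.LG"/>
    <link title="pdf" href="http://arxiv.org/pdf/0000.00000v1"/>
  </entry>
</feed>"""


def _text(element, path):
    """Return a child element's text with runs of whitespace collapsed."""
    return " ".join(element.findtext(path, default="", namespaces=NS).split())


def parse_feed(xml_text):
    """Convert Atom entries into dicts shaped like the JSON records above."""
    records = []
    for entry in ET.fromstring(xml_text).findall("atom:entry", NS):
        abs_url = _text(entry, "atom:id")
        primary = entry.find("arxiv:primary_category", NS)
        pdf_links = [
            link.get("href")
            for link in entry.findall("atom:link", NS)
            if link.get("title") == "pdf"
        ]
        records.append({
            # Strip the URL prefix and the version suffix (e.g. v1).
            "arxiv_id": re.sub(r"v\d+$", "", abs_url.rsplit("/abs/", 1)[-1]),
            "title": _text(entry, "atom:title"),
            "authors": [_text(a, "atom:name") for a in entry.findall("atom:author", NS)],
            "abstract": _text(entry, "atom:summary"),
            "categories": [c.get("term") for c in entry.findall("atom:category", NS)],
            "primary_category": primary.get("term") if primary is not None else "",
            "published": _text(entry, "atom:published"),
            "pdf_url": pdf_links[0] if pdf_links else "",
            "abs_url": abs_url,
        })
    return records


print(parse_feed(SAMPLE_FEED)[0]["title"])  # An Illustrative Title
```

Collapsing whitespace matters because titles and abstracts in the feed are often wrapped across several lines.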
Survey a topic, save the results, and summarize them:
import json
import pandas as pd
from scripts.arxiv_search import ArxivSearcher
searcher = ArxivSearcher()
# 1. Broad search (double quotes make the abstract match a phrase)
results = searcher.search(
    query='abs:"mechanistic interpretability" AND cat:cs.LG',
    max_results=200,
    sort_by="submittedDate",
)
# 2. Save results
with open("interp_papers.json", "w") as f:
    json.dump({"result_count": len(results), "results": results}, f, indent=2)
# 3. Filter and analyze
df = pd.DataFrame(results)
print(f"Total papers: {len(df)}")
print(f"Date range: {df['published'].min()} to {df['published'].max()}")
print("\nTop categories:")
print(df["primary_category"].value_counts().head(10))
Track recent papers from several groups:
from scripts.arxiv_search import ArxivSearcher
searcher = ArxivSearcher()
groups = {
    "anthropic": "au:anthropic AND (cat:cs.LG OR cat:cs.CL)",
    "openai": "au:openai AND cat:cs.CL",
    "deepmind": "au:deepmind AND cat:cs.LG",
}
for name, query in groups.items():
    results = searcher.search(query=query, max_results=50, sort_by="submittedDate")
    print(f"{name}: {len(results)} recent papers")
List the most recent papers in a category:
from scripts.arxiv_search import ArxivSearcher
searcher = ArxivSearcher()
# Most recent ML papers
results = searcher.search(
    query="cat:cs.LG",
    max_results=50,
    sort_by="submittedDate",
    sort_order="descending",
)
for paper in results[:10]:
    print(f"[{paper['published'][:10]}] {paper['title']}")
    print(f" {paper['abs_url']}\n")
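Because `published` is an ISO 8601 UTC timestamp, results can also be narrowed to a time window on the client side. A minimal sketch, assuming a hypothetical helper `published_within` and records shaped like the JSON output documented earlier:

```python
from datetime import datetime, timedelta, timezone


def published_within(results, days, now=None):
    """Keep papers whose 'published' timestamp falls in the last `days` days."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days)
    recent = []
    for paper in results:
        # Timestamps look like 2023-09-19T17:58:00Z (UTC).
        published = datetime.strptime(
            paper["published"], "%Y-%m-%dT%H:%M:%SZ"
        ).replace(tzinfo=timezone.utc)
        if published >= cutoff:
            recent.append(paper)
    return recent


# Placeholder records for illustration only.
sample = [
    {"title": "Newer placeholder", "published": "2024-01-09T12:00:00Z"},
    {"title": "Older placeholder", "published": "2023-12-01T12:00:00Z"},
]
reference = datetime(2024, 1, 10, tzinfo=timezone.utc)
print([p["title"] for p in published_within(sample, days=7, now=reference)])
# ['Newer placeholder']
```

The `now` argument exists so the window can be fixed to a reference date; in normal use it can be omitted.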
from scripts.arxiv_search import ArxivSearcher
searcher = ArxivSearcher(verbose=True)
# Free-form query (uses arXiv query syntax)
results = searcher.search(query="...", max_results=50)
# Lookup by ID
papers = searcher.get_by_ids(["2309.10668", "2406.04093"])
# Download PDF
searcher.download_pdf("2309.10668", "paper.pdf")
# Build query from components
query = ArxivSearcher.build_query(
    title="sparse autoencoder",
    author="anthropic",
    category="cs.LG",
)
results = searcher.search(query=query, max_results=20)
- cs.LG is where most ML papers live.
- Use sort_by=submittedDate for recent papers, relevance for keyword searches.
- For large result sets, page through results with the start parameter.
- Use bare arXiv IDs (e.g., 2309.10668), not full URLs, in programmatic code.
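
Paging with `start` can be sketched against the raw API as follows. This assumes the public endpoint `http://export.arxiv.org/api/query` and its `search_query`, `start`, and `max_results` parameters. `page_urls` is a hypothetical helper, and whether `ArxivSearcher.search` exposes `start` directly is not shown in this document.

```python
from urllib.parse import urlencode

API_URL = "http://export.arxiv.org/api/query"


def page_urls(search_query, total, page_size=100):
    """Yield one request URL per page until `total` results are covered."""
    for start in range(0, total, page_size):
        params = {
            "search_query": search_query,
            "start": start,  # zero-based offset of the first result
            "max_results": min(page_size, total - start),
        }
        yield f"{API_URL}?{urlencode(params)}"


for url in page_urls("cat:cs.LG", total=250, page_size=100):
    print(url)
```

The last page requests only the remaining results, so the total is never overshot. When fetching the pages, it is advisable to pause between requests rather than issue them back to back.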