Design complete, robust data-extraction pipelines with intelligent routing, validation, and token-budget tracking — not brittle one-off scripts.
Dependency Notice: BYOK (Bring Your Own Key) pattern for Firecrawl; API keys must only be loaded via environment variables. Per-script dependencies:
| Script | Dependencies | Exact CLI |
|---|---|---|
scripts/validate_extraction.py |
stdlib only | python3 scripts/validate_extraction.py output.json --json |
scripts/firecrawl_example.py |
firecrawl, requests (template; --sample runs offline) |
python3 scripts/firecrawl_example.py --sample |
scripts/local_bs4_example.py |
beautifulsoup4, pandas (template; --sample runs offline) |
python3 scripts/local_bs4_example.py --sample |
Check for context first:
If project-context.md exists, read it before asking questions. Determine the target data format, scale of extraction, and deployment environment before writing any code.
This skill supports 3 extraction modes based on intelligent routing:
Use when the source is a public URL, heavily dynamic (JS/SPA), requires search-first discovery, or involves bulk crawling across a domain.
Use when extracting from local files (PDF, Excel, CSV), the data is private/sensitive, or the target is a simple static HTML page where Firecrawl is overkill.
Use when Firecrawl handles URL discovery/web extraction, but local Python (Pandas) is required to clean, normalize, and structure the output before saving.
When executing a scraping task, always follow this sequence:
scripts/firecrawl_example.py (Mode 1) or scripts/local_bs4_example.py (Mode 2); run each with --sample first to see the expected summary shape without network access.python3 scripts/validate_extraction.py extracted_output.json --json on every extraction result before delivering it. It exits 0 only on {"status": "ok"}; warning (empty output) or error (malformed JSON) exit 1 — fix and re-extract, never ship unvalidated data. Beyond this structural gate, also check required fields and duplicates against the pipeline spec before delivering.Surface these issues WITHOUT being asked when you notice them in context:
os.getenv('FIRECRAWL_API_KEY').| When you ask for... | You get... |
|---|---|
| "Scrape this site" | A fully validated Python extraction script with routing logic and error handling. |
| "Get data from this table" | A clean CSV/JSON dataset with a summary log of row counts and empty values. |
| "Crawl these docs" | A Markdown deliverable chunked for LLM token limits. |
div > span > ul > li:nth-child(3)). Use data attributes or robust structural anchors.robots.txt or implementing sensible rate limits.