
HasData Web Data Scraping and Extraction Platform

v20260606
hasdata
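HasData is a comprehensive cloud platform for extracting structured public web data. It offers three core modes: web scraping of arbitrary pages (with JS rendering), pre-parsed APIs for known platforms (such as Google and Amazon), and asynchronous scraper jobs for bulk or recursive crawling. It suits market research, ecommerce data collection, job listing retrieval, and local business data collection.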
Overview

HasData

Cloud platform for extracting public web data. One API key, three execution modes. All endpoints sit under https://api.hasdata.com and authenticate with the x-api-key request header.

curl -G 'https://api.hasdata.com/scrape/google/serp' \
  --data-urlencode 'q=coffee' \
  -H 'x-api-key: <your-api-key>'

Error codes: 401 invalid key, 403 quota exhausted, 429 concurrency cap exceeded, 500 server error (retry).
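The same request and status handling can be sketched in Python. This is a minimal illustration using only the standard library; the helper names `build_serp_request` and `classify_status` and the returned labels are hypothetical, not part of HasData.

```python
import urllib.parse
import urllib.request

BASE_URL = "https://api.hasdata.com"  # base URL stated in the text


def build_serp_request(query: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the GET request from the curl example."""
    url = f"{BASE_URL}/scrape/google/serp?" + urllib.parse.urlencode({"q": query})
    return urllib.request.Request(url, headers={"x-api-key": api_key}, method="GET")


def classify_status(status: int) -> str:
    """Map an HTTP status to a label. The labels are this sketch's own."""
    if status == 200:
        return "check-body"  # 200 still needs the requestMetadata check
    if status == 401:
        return "invalid-key"
    if status == 403:
        return "quota-exhausted"
    if status == 429:
        return "concurrency-cap"
    if status >= 500:
        return "server-error-retry"
    return "unexpected"
```

To send the request, pass the result to `urllib.request.urlopen(req, timeout=300)` with the key read from the environment rather than written into the source.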

When to Use

Use this skill when:

  • The user needs web scraping.
  • The user needs search engine results.
  • The user needs structured data extraction.
  • The user needs ecommerce, travel, jobs, or local business data.
  • The user explicitly asks about HasData.

Three execution modes

| Mode | Latency | When | Endpoint |
|---|---|---|---|
| Web Scraping API | seconds | Arbitrary URL: JS rendering, CSS/AI extraction, screenshots | POST /scrape/web |
| Scraper APIs (sync) | seconds | Pre-parsed JSON for known platforms (Google, Amazon, Zillow, …) | GET /scrape/<vertical>/<resource> |
| Scraper Jobs (async) | minutes–hours | Bulk extraction, recursive crawling, webhook fan-out | POST /scrapers/<slug>/jobs |

Decision rule. Default to a Scraper API when one exists for the platform (pre-parsed JSON, no selector maintenance). Use Web Scraping for arbitrary URLs not covered by an API. Reach for a Scraper Job only when no API equivalent exists (crawler, contacts, sec-edgar, amazon-bestsellers, amazon-product-reviews) or when async fan-out + webhooks save engineering time over a paginated client loop.
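The decision rule can be written as a small function. This is only a sketch of the rule as stated above; the function name, its parameters, and the returned strings (which echo the endpoint patterns listed for each mode) are illustrative assumptions.

```python
from typing import Optional

# Slugs the text lists as having no synchronous API equivalent.
JOB_ONLY_SLUGS = {
    "crawler",
    "contacts",
    "sec-edgar",
    "amazon-bestsellers",
    "amazon-product-reviews",
}


def choose_mode(
    has_scraper_api: bool,
    job_slug: Optional[str] = None,
    wants_async_fanout: bool = False,
) -> str:
    """Return the endpoint pattern the decision rule points to."""
    if wants_async_fanout:
        # Async fan-out + webhooks beat a paginated client loop.
        return "POST /scrapers/<slug>/jobs"
    if has_scraper_api:
        # Default: pre-parsed JSON, no selector maintenance.
        return "GET /scrape/<vertical>/<resource>"
    if job_slug in JOB_ONLY_SLUGS:
        # No API equivalent exists for these scrapers.
        return "POST /scrapers/<slug>/jobs"
    # Arbitrary URL not covered by an API.
    return "POST /scrape/web"
```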

Always-true response shape

{ "requestMetadata": { "id": "…", "status": "ok", "url": "…" }, "...": "endpoint-specific" }

Treat data as valid only if requestMetadata.status === "ok". HTTP 200 alone isn't enough.
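In Python the check might look like the following minimal sketch. The helper name `is_ok` is hypothetical; the field names come from the response shape above.

```python
from collections.abc import Mapping
from typing import Any


def is_ok(body: Any) -> bool:
    """True only when the parsed body reports requestMetadata.status == "ok"."""
    if not isinstance(body, Mapping):
        return False
    meta = body.get("requestMetadata")
    return isinstance(meta, Mapping) and meta.get("status") == "ok"
```

Call it on the parsed JSON body after the HTTP status check, and discard or retry the payload when it returns False.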

High-leverage patterns

  • SERP-first enrichment. Google SERP can surface public snippets for company and professional-profile lookup. Use it for business or authorized research, avoid unnecessary direct scraping, and treat personal email/phone lookup as allowed only with a legitimate purpose and user authorization.
  • AI Mode + verify. /scrape/google/ai-mode for the answer + references → /scrape/web (markdown) on each reference URL → cited RAG context, no vector DB.
  • Maps → leads. /scrape/google-maps/search returns business websites and phones; collect contact details only from public, permitted sources and apply opt-out, rate, and privacy-law constraints before any outreach use.
  • Crawler → corpus. crawler Scraper Job with outputFormat: ["markdown"] + includePaths: "/docs/.+" produces an LLM-ready corpus in one submission.
  • Pre-extracted via SERP rich snippets. knowledgeGraph, localResults, inlineShoppingResults, relatedQuestions carry pre-parsed public facts. Always check them before considering direct page access.
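The last pattern, checking the rich snippets before fetching any page, can be sketched as below. The helper name is hypothetical, and the sketch assumes the four keys appear at the top level of the SERP response, which the text does not state explicitly.

```python
# Keys named in the text; their top-level placement is an assumption.
RICH_KEYS = (
    "knowledgeGraph",
    "localResults",
    "inlineShoppingResults",
    "relatedQuestions",
)


def rich_snippets(serp_body: dict) -> dict:
    """Return only the non-empty pre-parsed blocks found in a SERP response."""
    return {key: serp_body[key] for key in RICH_KEYS if serp_body.get(key)}
```

If the returned dict already answers the question, there is no need for a follow-up call to /scrape/web.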

When to call from code (the wiring)

  • Auth: x-api-key header on every request. Read from HASDATA_API_KEY env. Never hardcode, never log.
  • Timeouts: set client timeout ≥ 300 s. HasData's own deadline is 300 s; a shorter client timeout reports failures for requests that later complete server-side and are still billed.
  • Retries: 429 and 5xx only, with exponential backoff and jitter. Never retry other 4xx responses (auth, validation).
  • Concurrency: cap at your plan limit. The free tier is 1; anything higher just generates 429s.
  • Async jobs: the submit response handle is body.id (integer), not jobId. Persist it immediately. Poll GET /scrapers/jobs/<id> every 10–30 s with backoff; treat webhooks as best-effort and always pair with polling. On finished the status carries data: {csv, json, xlsx} short-lived URLs — download immediately.
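The retry policy above can be sketched as follows. This is an illustrative outline, not the packaged client: `send` is a hypothetical callable that performs one request and returns `(status, body)`, and the base delay, cap, and attempt count are assumed values.

```python
import random
import time
from typing import Callable, Tuple

CLIENT_TIMEOUT_S = 300  # match the 300 s server deadline when sending


def should_retry(status: int) -> bool:
    """Retry only 429 and 5xx; other 4xx responses will not succeed on retry."""
    return status == 429 or 500 <= status < 600


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff with full jitter: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * 2 ** attempt))


def call_with_retries(
    send: Callable[[], Tuple[int, dict]],
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, dict]:
    """Call send() until it returns a non-retryable status or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, body = send()
        if not should_retry(status):
            break
        if attempt < max_attempts - 1:
            sleep(backoff_delay(attempt))
    return status, body
```

Passing `sleep` as a parameter keeps the loop testable without real waiting. The API key would be read inside `send` from `os.environ["HASDATA_API_KEY"]`.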

See references/code-recipes.md for ready-to-paste Python and TypeScript clients with retry, backoff, bounded concurrency, and the full job lifecycle.
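As a rough illustration of the polling half of the job lifecycle, and not a substitute for the referenced recipes, the loop might look like this. `fetch_status` is a hypothetical callable wrapping GET /scrapers/jobs/<id>; the `status` field name, the 1.5× growth factor, and `max_polls` are assumptions. Failure states are not handled because the text does not name them.

```python
import time
from typing import Callable


def poll_job(
    job_id: int,
    fetch_status: Callable[[int], dict],
    sleep: Callable[[float], None] = time.sleep,
    first_delay: float = 10.0,
    max_delay: float = 30.0,
    max_polls: int = 200,
) -> dict:
    """Poll until the job reports "finished", then return its download URLs."""
    delay = first_delay
    for _ in range(max_polls):
        body = fetch_status(job_id)
        if body.get("status") == "finished":
            # Short-lived URLs keyed by format (csv, json, xlsx): download now.
            return body["data"]
        sleep(delay)
        delay = min(max_delay, delay * 1.5)  # back off within the 10-30 s window
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

The `job_id` argument is the integer `id` from the submit response, persisted before polling starts.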

Common gotchas

  • 300 s server deadline. Match client timeout.
  • Disable jsRendering first, enable only if the page needs it — most static pages parse fine without a headless browser.
  • No cookies parameter — cookies go through headers["Cookie"].
  • includePaths regex is case-sensitive. /blog/.+ won't match /Blog/....
  • Scraper Job data is double-wrapped. Each row is body.data[i].data; outer wraps with id, jobId, dataId, createdAt, updatedAt.
  • requestMetadata.status === "ok" is the only success signal. HTTP 200 alone isn't enough.
  • Webhooks are best-effort with 3 retries. Always have a polling fallback.
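Three of these gotchas are easy to encode as helpers. The sketch below is illustrative: the function names are hypothetical, the `url` payload field is an assumption, and Python's `re.search` only demonstrates case sensitivity, since the text does not say how HasData applies the pattern.

```python
import re
from typing import Any, Dict, List


def unwrap_rows(body: Dict[str, Any]) -> List[Any]:
    """Scraper Job rows are double-wrapped: each row is body["data"][i]["data"]."""
    return [wrapper["data"] for wrapper in body.get("data", [])]


def web_scrape_payload(
    url: str, cookies: Dict[str, str], js_rendering: bool = False
) -> Dict[str, Any]:
    """Build a /scrape/web payload; cookies travel in headers["Cookie"]."""
    payload: Dict[str, Any] = {"url": url, "jsRendering": js_rendering}
    if cookies:
        cookie_header = "; ".join(f"{name}={value}" for name, value in cookies.items())
        payload["headers"] = {"Cookie": cookie_header}
    return payload


def matches_include_path(pattern: str, path: str) -> bool:
    """Case-sensitive regex check, mirroring the includePaths gotcha."""
    return re.search(pattern, path) is not None
```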

Limitations

  • Requires access to HasData services and valid credentials.
  • Data quality and available fields depend on the target website and extraction method used.
  • JavaScript-heavy websites may require rendering, which can affect performance and cost.
  • Use only for public data or content the user is authorized to access; respect site terms, robots/access controls, privacy law, and rate limits.
  • Rate limits, quotas, and account restrictions may apply depending on the endpoint and subscription plan.
Info
Category: Data Science
Name: hasdata
Version: v20260606
Size: 27.77KB
Updated: 2026-06-07