Scraper Migration: From Puppeteer to Firecrawl

v20260423
firecrawl-migration-deep-dive
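This guide provides a complete framework for migrating traditional scraper code that depends on a browser or CSS selectors (such as Puppeteer or Playwright) to the Firecrawl API. It shows how to simplify the scraping workflow with single-page scrapes, LLM-based structured data extraction, and full-site crawls, so you no longer have to manage browsers or anti-bot workarounds yourself.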

Firecrawl Migration Deep Dive

Current State

!npm list puppeteer playwright cheerio 2>/dev/null | grep -E "puppeteer|playwright|cheerio" || echo 'No scraping libs found'

Overview

Migrate from custom scraping (Puppeteer, Playwright, Cheerio) or competing APIs to Firecrawl. Firecrawl eliminates browser management, anti-bot handling, and JS rendering infrastructure. This skill shows equivalent code for common scraping patterns.

Migration Comparison

| Feature | Puppeteer/Playwright | Cheerio | Firecrawl |
|---|---|---|---|
| JS rendering | Manual browser | No | Automatic |
| Anti-bot bypass | DIY (stealth plugin) | No | Built-in |
| Output format | Raw HTML | Parsed HTML | Markdown/JSON/HTML |
| Infrastructure | Browser instances | None | API call |
| Concurrent scraping | Manage browser pool | Simple | Managed by Firecrawl |
| Cost model | Compute (CPU/RAM) | Free | Credits per page |

Instructions

Step 1: Replace Puppeteer Single-Page Scrape

// BEFORE: Puppeteer (manual browser lifecycle)
import puppeteer from "puppeteer";

async function scrapePuppeteer(url: string) {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: "networkidle2" });
    const html = await page.content();
    const title = await page.title();
    return { html, title };
  } finally {
    // Always release the browser, even if navigation throws
    await browser.close();
  }
}

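// AFTER: Firecrawl (no browser needed)
import FirecrawlApp from "@mendable/firecrawl-js";

const firecrawl = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY! });

async function scrapeFirecrawl(url: string) {
  const result = await firecrawl.scrapeUrl(url, {
    formats: ["markdown"],
    onlyMainContent: true,
    waitFor: 2000, // ms to wait for client-side rendering before capture
  });
  // scrapeUrl can return an error response; narrow before reading fields
  if (!result.success) {
    throw new Error(`Scrape failed for ${url}: ${result.error}`);
  }
  return { markdown: result.markdown, title: result.metadata?.title };
}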
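
Downstream code written against the Puppeteer version expects `html`, while the Firecrawl version returns `markdown`. One way to bridge the two during migration is to request both formats and return a superset shape. The sketch below is an illustration under assumptions: `ScrapedDoc`, `ScrapeFn`, and `scrapeCompat` are hypothetical names, and the scrape call is injected so the wrapper does not depend on a particular SDK version.

```typescript
// Hypothetical minimal view of a scrape result: only the fields this sketch reads.
interface ScrapedDoc {
  markdown?: string;
  html?: string;
  metadata?: { title?: string };
}

// Any function that scrapes a URL with the requested formats, e.g. a thin
// wrapper around firecrawl.scrapeUrl that has already checked for failure.
type ScrapeFn = (url: string, opts: { formats: string[] }) => Promise<ScrapedDoc>;

interface CompatResult {
  html: string;
  markdown: string;
  title: string;
}

// Request both formats so legacy consumers (html) and new ones (markdown) both work.
async function scrapeCompat(scrape: ScrapeFn, url: string): Promise<CompatResult> {
  const doc = await scrape(url, { formats: ["markdown", "html"] });
  return {
    html: doc.html ?? "",
    markdown: doc.markdown ?? "",
    title: doc.metadata?.title ?? "",
  };
}
```

Once every consumer reads `markdown`, the `html` format can be dropped from the request again.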

Step 2: Replace Cheerio HTML Parsing

// BEFORE: fetch + cheerio (manual parsing)
import * as cheerio from "cheerio";

async function scrapeCheerio(url: string) {
  const html = await fetch(url).then(r => r.text());
  const $ = cheerio.load(html);
  return {
    title: $("h1").first().text(),
    content: $("main").text(),
    links: $("a").map((_, el) => $(el).attr("href")).get(),
  };
}

// AFTER: Firecrawl with extract (LLM-powered, no CSS selectors)
async function extractFirecrawl(url: string) {
  const result = await firecrawl.scrapeUrl(url, {
    formats: ["extract", "links"],
    extract: {
      schema: {
        type: "object",
        properties: {
          title: { type: "string" },
          content: { type: "string" },
        },
        required: ["title", "content"],
      },
    },
  });
  // scrapeUrl can return an error response; narrow before reading fields
  if (!result.success) {
    throw new Error(`Extract failed for ${url}: ${result.error}`);
  }
  return {
    title: result.extract?.title,
    content: result.extract?.content,
    links: result.links,
  };
}
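
Selector-based code tends to fail loudly when a page changes, whereas LLM-based extraction may quietly return a missing or empty field. A small runtime guard before handing data downstream can catch that. This is a minimal sketch; `Article`, `isArticle`, and `parseArticle` are hypothetical names that mirror the schema in the example.

```typescript
// Hypothetical target shape, mirroring the title/content schema used for extraction.
interface Article {
  title: string;
  content: string;
}

// Runtime type guard: true only when both fields are non-empty strings.
function isArticle(value: unknown): value is Article {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.title === "string" && v.title.trim().length > 0 &&
    typeof v.content === "string" && v.content.trim().length > 0
  );
}

// Narrow an untrusted extract payload, or fail with a descriptive error.
function parseArticle(extract: unknown, url: string): Article {
  if (!isArticle(extract)) {
    throw new Error(`Extraction for ${url} is missing title or content`);
  }
  return extract;
}
```

A call such as `parseArticle(result.extract, url)` would then replace the optional-chaining reads in the return statement.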

Step 3: Replace Crawl Pipeline

// BEFORE: Playwright crawler (100+ lines, queue, browser pool)
// - launch browser pool
// - manage visited URLs set
// - extract links, enqueue
// - handle errors per page
// - close browsers on exit

// AFTER: Firecrawl crawl (single call, no queue or browser pool)
async function crawlSite(baseUrl: string) {
  const result = await firecrawl.crawlUrl(baseUrl, {
    limit: 100,
    maxDepth: 3,
    includePaths: ["/docs/*", "/api/*"],
    excludePaths: ["/blog/*"],
    scrapeOptions: {
      formats: ["markdown"],
      onlyMainContent: true,
    },
  });
  // crawlUrl can return an error response; narrow before reading data
  if (!result.success) {
    throw new Error(`Crawl failed for ${baseUrl}: ${result.error}`);
  }

  return (result.data ?? []).map(page => ({
    url: page.metadata?.sourceURL,
    title: page.metadata?.title,
    content: page.markdown,
  }));
}
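
For comparison, the bookkeeping that the BEFORE comments summarize (visited set, queue, depth and page limits, per-page error handling) looks roughly like the sketch below. Page loading and link extraction are abstracted behind a hypothetical `fetchLinks` callback. This is not the original crawler, only an illustration of the logic that `crawlUrl` takes over, and its depth semantics are not guaranteed to match Firecrawl's `maxDepth`.

```typescript
// Hypothetical callback: loads a page and returns the absolute URLs it links to.
type FetchLinks = (url: string) => Promise<string[]>;

interface CrawlLimits {
  limit: number;    // maximum number of pages to crawl
  maxDepth: number; // maximum link distance from the start URL
}

// Breadth-first crawl with a visited set, a queue, and depth/page limits.
async function legacyCrawl(
  start: string,
  fetchLinks: FetchLinks,
  { limit, maxDepth }: CrawlLimits,
): Promise<string[]> {
  const visited = new Set<string>([start]);
  const queue: Array<{ url: string; depth: number }> = [{ url: start, depth: 0 }];
  const crawled: string[] = [];

  while (queue.length > 0 && crawled.length < limit) {
    const { url, depth } = queue.shift()!;
    let links: string[] = [];
    try {
      links = await fetchLinks(url);
      crawled.push(url);
    } catch {
      continue; // per-page error handling: skip the page, keep crawling
    }
    if (depth >= maxDepth) continue; // do not follow links past the depth limit
    for (const link of links) {
      if (!visited.has(link)) {
        visited.add(link);
        queue.push({ url: link, depth: depth + 1 });
      }
    }
  }
  return crawled;
}
```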

Step 4: Gradual Migration with Adapter Pattern

// Adapter interface for gradual migration
interface ScrapeAdapter {
  scrape(url: string): Promise<{ title: string; content: string }>;
  crawl(url: string, maxPages: number): Promise<Array<{ url: string; content: string }>>;
}

class FirecrawlAdapter implements ScrapeAdapter {
  private client: FirecrawlApp;

  constructor() {
    this.client = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY! });
  }

  async scrape(url: string) {
    const result = await this.client.scrapeUrl(url, {
      formats: ["markdown"],
      onlyMainContent: true,
    });
    // Surface API failures instead of returning empty content
    if (!result.success) {
      throw new Error(`Scrape failed for ${url}: ${result.error}`);
    }
    return {
      title: result.metadata?.title || "",
      content: result.markdown || "",
    };
  }

  async crawl(url: string, maxPages: number) {
    const result = await this.client.crawlUrl(url, {
      limit: maxPages,
      scrapeOptions: { formats: ["markdown"], onlyMainContent: true },
    });
    if (!result.success) {
      throw new Error(`Crawl failed for ${url}: ${result.error}`);
    }
    return (result.data || []).map(page => ({
      url: page.metadata?.sourceURL || url,
      content: page.markdown || "",
    }));
  }
}

// Feature flag controlled migration.
// LegacyPuppeteerAdapter is your existing Puppeteer code wrapped in the
// same ScrapeAdapter interface (not shown here).
function getScrapeAdapter(): ScrapeAdapter {
  if (process.env.USE_FIRECRAWL === "true") {
    return new FirecrawlAdapter();
  }
  return new LegacyPuppeteerAdapter();
}
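
A hard switch behind a feature flag means that an API outage or quota error on the new path fails the request outright. During rollout, one possible refinement is to wrap both adapters so the new one is tried first and the legacy one serves as a fallback. The sketch below assumes a hypothetical `FallbackAdapter` class and trims the interface to `scrape` for brevity (named `PageScraper` here to keep the snippet self-contained).

```typescript
// Trimmed version of the adapter interface (scrape only, for brevity).
interface PageScraper {
  scrape(url: string): Promise<{ title: string; content: string }>;
}

// Hypothetical wrapper: try the primary adapter, fall back to the secondary on failure.
class FallbackAdapter implements PageScraper {
  public fallbackCount = 0;
  private primary: PageScraper;
  private secondary: PageScraper;

  constructor(primary: PageScraper, secondary: PageScraper) {
    this.primary = primary;
    this.secondary = secondary;
  }

  async scrape(url: string) {
    try {
      return await this.primary.scrape(url);
    } catch {
      this.fallbackCount += 1; // track how often the new path fails
      return this.secondary.scrape(url);
    }
  }
}
```

Watching `fallbackCount` over a trial period gives a rough signal of whether the legacy adapter is still needed.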

Step 5: Remove Old Dependencies

set -euo pipefail
# After migration is complete and verified

# Remove browser downloads first, while the playwright CLI is still installed
npx playwright uninstall --all 2>/dev/null || true

npm uninstall puppeteer puppeteer-core
npm uninstall playwright @playwright/test
npm uninstall cheerio

# Verify no lingering references
grep -rE "puppeteer|playwright|cheerio" src/ --include="*.ts" || echo "Clean!"
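
The `grep` check only covers `.ts` files under `src/`. As a complementary check, a small function can confirm that `package.json` no longer declares any of the removed packages. This is a sketch; `findLegacyDeps` and the `PackageJson` shape are hypothetical, and the package list simply mirrors the `npm uninstall` commands.

```typescript
// Package names removed by the npm uninstall commands in this step.
const LEGACY_PACKAGES = [
  "puppeteer",
  "puppeteer-core",
  "playwright",
  "@playwright/test",
  "cheerio",
];

// Hypothetical minimal view of package.json.
interface PackageJson {
  dependencies?: Record<string, string>;
  devDependencies?: Record<string, string>;
}

// Return the legacy packages still declared in either dependency section.
function findLegacyDeps(pkg: PackageJson): string[] {
  const declared = new Set<string>([
    ...Object.keys(pkg.dependencies ?? {}),
    ...Object.keys(pkg.devDependencies ?? {}),
  ]);
  return LEGACY_PACKAGES.filter((name) => declared.has(name));
}
```

In a Node script this could be fed the parsed contents of `package.json`; an empty array means the manifest is clean.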

Migration Checklist

  • Install @mendable/firecrawl-js
  • Create adapter layer wrapping Firecrawl
  • Replace single-page scrapes with scrapeUrl
  • Replace crawl loops with crawlUrl
  • Replace HTML parsing with extract or markdown
  • Feature flag to switch between old and new
  • Run both in parallel, compare outputs
  • Remove old scraping dependencies
  • Delete browser management code

Error Handling

| Issue | Cause | Solution |
|---|---|---|
| Different output format | Puppeteer returns HTML, Firecrawl returns markdown | Adjust downstream consumers |
| Missing CSS selector data | Firecrawl doesn't use selectors | Use extract with JSON schema |
| Higher latency for single pages | API call vs local browser | Acceptable trade-off for zero infra |
| Content differences | Different JS wait timing | Tune waitFor parameter |
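
Tuning `waitFor` by hand can be tedious. A hedged sketch of one approach: try a few increasing delays and keep the first one that yields enough content. `findWaitFor`, the `WaitScrape` callback, the candidate delays, and the length heuristic are all assumptions for illustration. Each attempt is a separate scrape, so it is best run once per site rather than on every request.

```typescript
// Hypothetical injected scraper: returns markdown for a URL given a waitFor delay (ms).
type WaitScrape = (url: string, waitFor: number) => Promise<string>;

// Try increasing waitFor values until the content reaches a minimum length.
async function findWaitFor(
  scrape: WaitScrape,
  url: string,
  minLength: number,
  delays: number[] = [0, 1000, 3000, 5000],
): Promise<{ waitFor: number; markdown: string } | null> {
  for (const waitFor of delays) {
    const markdown = await scrape(url, waitFor);
    if (markdown.length >= minLength) return { waitFor, markdown };
  }
  return null; // no candidate delay produced enough content
}
```

The length of the legacy Puppeteer output for the same page is one reasonable source for `minLength`.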

Next Steps

For advanced troubleshooting, see firecrawl-advanced-troubleshooting.

Info
Category Programming and Development
Name firecrawl-migration-deep-dive
Version v20260423
Size 6.92KB
Updated 2026-04-28