Firecrawl Policy Guardrails

Overview

Policy enforcement for Firecrawl web scraping pipelines: policy linting, domain filtering, credit budgets, content validation, and per-domain rate limiting, so that CI jobs and pipelines respect legal and cost constraints. Web scraping raises legal (robots.txt, ToS), ethical (rate limiting, attribution), and cost (credit burn) concerns that need automated guardrails.

Prerequisites

  • Firecrawl API configured
  • Understanding of web scraping legal considerations
  • Credit monitoring setup

Instructions

Step 1: Enforce Domain-Level Scraping Policies

Block scraping of sensitive or prohibited domains.

class PolicyViolation extends Error {}   // thrown on any guardrail breach

const SCRAPE_POLICY = {
  blockedDomains: [
    'facebook.com', 'linkedin.com',   // ToS prohibit scraping
    'bank*.com', 'healthcare*.com',   // sensitive data
  ],
  maxPagesPerDomain: 500,   // hard cap on pages scraped per target domain
  requireRobotsTxt: true,
};

function validateScrapeTarget(url: string): void {
  const domain = new URL(url).hostname;
  for (const blocked of SCRAPE_POLICY.blockedDomains) {
    // Escape literal dots, then expand '*' wildcards; a plain string
    // replace('*', ...) would only rewrite the first wildcard.
    const pattern = new RegExp(
      '^' + blocked.replace(/\./g, '\\.').replace(/\*/g, '.*') + '$'
    );
    if (pattern.test(domain)) {
      throw new PolicyViolation(`Domain ${domain} is blocked by scraping policy`);
    }
  }
}
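
SCRAPE_POLICY.requireRobotsTxt is declared above but not enforced by the snippet. One way to honor it is sketched below, assuming a Node 18+ runtime (for the global fetch) and a deliberately conservative parse: any blanket Disallow under the wildcard user-agent blocks the crawl. A production setup should use a full robots.txt parser instead.

// Sketch only: conservative robots.txt gate, not a spec-complete parser.
async function assertRobotsAllows(url: string): Promise<void> {
  if (!SCRAPE_POLICY.requireRobotsTxt) return;
  const { origin } = new URL(url);
  const res = await fetch(`${origin}/robots.txt`);
  if (!res.ok) return;   // no robots.txt: treat as allowed (a policy choice)
  const lines = (await res.text()).split('\n').map((l) => l.trim().toLowerCase());
  let appliesToUs = false;
  for (const line of lines) {
    if (line.startsWith('user-agent:')) {
      appliesToUs = line.includes('*');          // track the wildcard agent block
    } else if (appliesToUs && line === 'disallow: /') {
      throw new PolicyViolation(`robots.txt at ${origin} disallows crawling`);
    }
  }
}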

Step 2: Credit Budget Enforcement

Prevent crawls from exceeding allocated credit budgets.

class CrawlBudget {
  private dailyLimit: number;
  private usage: Map<string, number> = new Map();

  constructor(dailyLimit = 5000) { this.dailyLimit = dailyLimit; }  // default daily page/credit budget

  authorize(estimatedPages: number): boolean {
    const today = new Date().toISOString().split('T')[0];
    const used = this.usage.get(today) || 0;
    if (used + estimatedPages > this.dailyLimit) {
      throw new PolicyViolation(
        `Daily limit exceeded: ${used} + ${estimatedPages} > ${this.dailyLimit}`
      );
    }
    return true;
  }

  record(pagesScraped: number) {
    const today = new Date().toISOString().split('T')[0];
    this.usage.set(today, (this.usage.get(today) || 0) + pagesScraped);
  }
}
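
Usage is two calls per crawl: authorize before launching, record after results land. A minimal sketch (the 120-page estimate and 95-page result are illustrative):

const budget = new CrawlBudget(5000);

budget.authorize(120);   // throws PolicyViolation if today's budget would be exceeded
// ... run the crawl ...
budget.record(95);       // charge only the pages actually scraped

Note that the usage map lives in process memory, so counts reset on restart; persist them (for example in Redis or a database) if crawls run across multiple workers or days.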

Step 3: Content Type Filtering

Only retain scraped content that matches expected types; discard binary files, media, and error pages.

function validateScrapedContent(result: any): boolean {
  if (!result.markdown || result.markdown.length < 50) return false;
  const lower = result.markdown.toLowerCase();
  // Reject error pages
  if (lower.includes('403 forbidden') || lower.includes('access denied')) return false;
  // Reject login walls
  if (lower.includes('sign in to continue') || lower.includes('create an account')) return false;
  return true;
}
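
The check above only inspects the markdown body. If your Firecrawl responses include per-page metadata, a stricter filter can reject error and non-HTML responses before reading the text. The field names below (metadata.statusCode, metadata.contentType) are assumptions to verify against your SDK version:

// Assumes each result carries a metadata object with statusCode and
// contentType fields; verify these names against your Firecrawl SDK version.
function validateByMetadata(result: any): boolean {
  const meta = result.metadata ?? {};
  if (typeof meta.statusCode === 'number' && meta.statusCode >= 400) {
    return false;                      // error page, regardless of body text
  }
  const type: string = meta.contentType ?? 'text/html';
  return type.startsWith('text/');     // drop PDFs, images, and other binaries
}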

Step 4: Rate Limiting Per Target Domain

Respect target site capacity even when Firecrawl allows faster crawling.

const DOMAIN_RATE_LIMITS: Record<string, number> = {
  'docs.example.com': 2,    // 2 pages/second
  'blog.example.com': 1,    // 1 page/second
  'default': 5              // default rate
};

function getCrawlDelay(domain: string): number {
  const rate = DOMAIN_RATE_LIMITS[domain] || DOMAIN_RATE_LIMITS['default'];
  return 1000 / rate;  // delay in ms between requests (1000 ms / pages-per-second)
}
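
To apply these delays, space out page requests with a sleep between them. A minimal sketch, where scrapePage is a hypothetical stand-in for whatever single-page scrape call your pipeline uses:

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

// scrapePage is a placeholder for your single-page scrape call.
async function crawlWithDelay(urls: string[], scrapePage: (u: string) => Promise<any>) {
  const results: any[] = [];
  for (const url of urls) {
    const domain = new URL(url).hostname;
    results.push(await scrapePage(url));
    await sleep(getCrawlDelay(domain));   // respect the per-domain rate limit
  }
  return results;
}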

Error Handling

Issue                     | Cause                       | Solution
Legal risk from scraping  | Blocked domain not filtered | Enforce domain blocklist
Credit overrun            | No budget tracking          | Implement daily credit caps
Junk data in pipeline     | Error pages scraped         | Validate content quality
Target site blocking IP   | Too aggressive crawling     | Enforce per-domain rate limits

Examples

Policy-Checked Crawl

validateScrapeTarget(url);
budget.authorize(estimatedPages);
const crawl = await firecrawl.crawlUrl(url, { limit: estimatedPages });
// Recent SDK versions return a status object with a `data` array of documents;
// older versions returned the array directly. Adjust to your SDK version.
const pages = Array.isArray(crawl) ? crawl : (crawl.data ?? []);
const valid = pages.filter(validateScrapedContent);
budget.record(valid.length);
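
PolicyViolation is thrown by both the target check and the budget check, so a pipeline entry point can catch it in one place and skip the offending target instead of failing the whole run. A sketch, assuming a targets list and the budget and estimatedPages values from above:

for (const url of targets) {
  try {
    validateScrapeTarget(url);
    budget.authorize(estimatedPages);
  } catch (err) {
    if (err instanceof PolicyViolation) {
      console.warn(`Skipping ${url}: ${err.message}`);
      continue;                        // skip this target, keep the queue moving
    }
    throw err;                         // unexpected errors still fail loudly
  }
  // ... crawl, filter, and record as shown above ...
}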

Output

  • Configuration files or code changes applied to the project
  • Validation report confirming correct implementation
  • Summary of changes made and their rationale