技能 编程开发 LangChain生产事故处理手册

LangChain生产事故处理手册

v20260423
langchain-incident-runbook
本手册提供了LangChain及大型语言模型(LLM)生产环境的标准化操作流程(SOP)。它详细指导运维工程师如何处理常见的突发故障,包括服务商中断、错误率飙升、延迟增加和成本超支等问题,并提供了相应的检测、诊断和缓解措施,是维护LLM应用的关键运维工具。
获取技能
185 次下载
概览

LangChain Incident Runbook

Overview

Standard operating procedures for LangChain production incidents: provider outages, error rate spikes, latency degradation, memory issues, and cost overruns.

Severity Classification

Level Description Response Time Example
SEV1 Complete outage 15 min All LLM calls failing
SEV2 Major degradation 30 min >50% error rate, >10s latency
SEV3 Minor degradation 2 hours <10% errors, slow responses
SEV4 Low impact 24 hours Intermittent issues, warnings

Runbook 1: LLM Provider Outage

Detect

# Check provider status pages
curl -s https://status.openai.com/api/v2/status.json | jq '.status'
curl -s https://status.anthropic.com/api/v2/status.json | jq '.status'

Diagnose

async function diagnoseProviders() {
  const results: Record<string, string> = {};

  try {
    const openai = new ChatOpenAI({ model: "gpt-4o-mini", timeout: 10000 });
    await openai.invoke("ping");
    results.openai = "OK";
  } catch (e: any) {
    results.openai = `FAIL: ${e.message.slice(0, 100)}`;
  }

  try {
    const anthropic = new ChatAnthropic({ model: "claude-sonnet-4-20250514" });
    await anthropic.invoke("ping");
    results.anthropic = "OK";
  } catch (e: any) {
    results.anthropic = `FAIL: ${e.message.slice(0, 100)}`;
  }

  console.table(results);
  return results;
}

Mitigate

// Enable fallback — switch to healthy provider
const primary = new ChatOpenAI({
  model: "gpt-4o-mini",
  maxRetries: 1,
  timeout: 5000,
});

const fallback = new ChatAnthropic({
  model: "claude-sonnet-4-20250514",
  maxRetries: 1,
});

const resilientModel = primary.withFallbacks({
  fallbacks: [fallback],
});

// All chains using resilientModel auto-failover

Recover

  1. Monitor provider status page for resolution
  2. Verify primary provider works: await diagnoseProviders()
  3. Remove fallback config (or keep it for resilience)
  4. Document incident timeline for post-mortem

Runbook 2: High Error Rate

Detect

# Check LangSmith for error spike
# https://smith.langchain.com/o/YOUR_ORG/projects/YOUR_PROJECT/runs?filter=error:true

# Check application logs
grep -c "Error\|error\|ERROR" /var/log/app/langchain.log | tail -5

Diagnose

// Common error patterns
const ERROR_CAUSES: Record<string, string> = {
  "RateLimitError":     "API quota exceeded -> reduce concurrency",
  "AuthenticationError": "API key invalid -> check secrets",
  "Timeout":            "Provider slow -> increase timeout",
  "OutputParserException": "LLM output format changed -> check prompts",
  "ValidationError":    "Schema mismatch -> update Zod schemas",
  "ContextLengthExceeded": "Input too long -> truncate or chunk",
};

Mitigate

// 1. Reduce load
// Lower maxConcurrency on batch operations

// 2. Enable caching for repeated queries
const cache = new Map();
async function withCache(chain: any, input: any) {
  const key = JSON.stringify(input);
  if (cache.has(key)) return cache.get(key);
  const result = await chain.invoke(input);
  cache.set(key, result);
  return result;
}

// 3. Enable fallback model
const model = primary.withFallbacks({ fallbacks: [fallback] });

Runbook 3: Latency Spike

Detect

# Prometheus query
histogram_quantile(0.95, rate(langchain_llm_latency_seconds_bucket[5m])) > 5

Diagnose

// Measure per-component latency
const tracer = new MetricsCallback();
await chain.invoke({ input: "test" }, { callbacks: [tracer] });
console.table(tracer.getReport());
// Check: is it the LLM, retriever, or tool that's slow?

Mitigate

  1. Switch to faster model: gpt-4o-mini (200ms TTFT) vs gpt-4o (400ms)
  2. Enable streaming to reduce perceived latency
  3. Enable caching for repeated queries
  4. Reduce context length (shorter prompts)

Runbook 4: Cost Overrun

Detect

# Check OpenAI usage dashboard
# https://platform.openai.com/usage

Mitigate

// 1. Emergency model downgrade
// gpt-4o ($2.50/1M) -> gpt-4o-mini ($0.15/1M) = 17x cheaper

// 2. Enable budget enforcement
const budget = new BudgetEnforcer(50.0); // $50 daily limit
const model = new ChatOpenAI({
  model: "gpt-4o-mini",
  callbacks: [budget],
});

// 3. Enable aggressive caching
// (see langchain-cost-tuning skill)

Runbook 5: Memory/OOM Issues

Detect

# Check process memory
ps aux --sort=-%mem | head -5

# Node.js heap stats
node -e "console.log(process.memoryUsage())"

Mitigate

  1. Clear caches: reset in-memory caches
  2. Reduce batch sizes: lower maxConcurrency
  3. Use streaming instead of accumulating full responses
  4. Restart pods: kubectl rollout restart deployment/langchain-api

Incident Response Checklist

During Incident

  • Acknowledge in incident channel
  • Classify severity (SEV1-4)
  • Check provider status pages
  • Run diagnostic script
  • Apply mitigation (fallback/cache/throttle)
  • Communicate status to stakeholders
  • Document timeline

Post-Incident

  • Verify full recovery
  • Schedule post-mortem (within 48h)
  • Write incident report
  • Create follow-up tickets
  • Update monitoring/alerting rules
  • Update this runbook if needed

Resources

Next Steps

Use langchain-debug-bundle for detailed evidence collection during incidents.

信息
Category 编程开发
Name langchain-incident-runbook
版本 v20260423
大小 6.1KB
更新时间 2026-04-28
语言