
AI API Rate Limiting and Retry Handling

v20260423
together-rate-limits
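This guide covers how to handle API rate limits when integrating with AI services such as Together AI. It shows a client-side rate limiter based on a token bucket and a complete retry strategy (including exponential backoff), so that high-throughput batch jobs keep running reliably when they hit rate-limit (429) or server-overload (5xx) errors.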

Together AI Rate Limits

Overview

Together AI's OpenAI-compatible inference API enforces per-key rate limits that vary by model tier and operation type. Chat completions and embeddings share a global request quota, while fine-tuning jobs and batch inference have separate concurrency caps. High-throughput workloads, such as embedding entire document corpora or running evaluations across 100+ prompts, need client-side rate limiting to stay under these limits; this guide uses a token bucket. Together's batch inference endpoint offers 50% cost savings, but its queue depth limits differ from the real-time inference limits.

Rate Limit Reference

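| Endpoint                    | Limit                     | Window   | Scope       |
| --------------------------- | ------------------------- | -------- | ----------- |
| Chat completions            | 600 req                   | 1 minute | Per API key |
| Embeddings                  | 300 req                   | 1 minute | Per API key |
| Image generation (FLUX)     | 60 req                    | 1 minute | Per API key |
| Fine-tune jobs (concurrent) | 3 jobs                    | Rolling  | Per API key |
| Batch inference             | 100 req/batch, 10 batches | Rolling  | Per API key |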
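
As a rough worked example, the per-minute limits above translate into an average spacing between requests. The sketch below copies the three per-minute figures from the table; the `headroom` parameter and the names `LIMITS_PER_MINUTE` and `minSpacingMs` are illustrative assumptions, not part of any Together SDK.

```typescript
// Limits copied from the reference table above; adjust them to your account's tier.
const LIMITS_PER_MINUTE = {
  chat: 600,
  embeddings: 300,
  image: 60,
} as const;

type Endpoint = keyof typeof LIMITS_PER_MINUTE;

// Average gap between requests (ms) that keeps a steady stream under the limit.
// `headroom` is a hypothetical safety factor: 0.5 means "use half of the quota".
function minSpacingMs(endpoint: Endpoint, headroom = 1): number {
  const budget = LIMITS_PER_MINUTE[endpoint] * headroom;
  return 60_000 / budget;
}

console.log(minSpacingMs("chat"));       // 100
console.log(minSpacingMs("embeddings")); // 200
console.log(minSpacingMs("image"));      // 1000
```

A steady stream of chat requests spaced 100 ms apart sits exactly at the limit, so in practice some headroom below 1 is a reasonable choice.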

Rate Limiter Implementation

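class TogetherRateLimiter {
  private tokens: number;
  private lastRefill: number;
  private readonly max: number;
  private readonly refillRate: number; // tokens per millisecond
  private queue: Array<{ resolve: () => void }> = [];
  private timer: ReturnType<typeof setTimeout> | null = null;

  constructor(maxPerMinute: number) {
    this.max = maxPerMinute;
    this.tokens = maxPerMinute; // starts full, so a cold start can burst up to `max` requests
    this.lastRefill = Date.now();
    this.refillRate = maxPerMinute / 60_000;
  }

  // Resolves once a token is available; waiters are served in FIFO order.
  async acquire(): Promise<void> {
    this.refill();
    if (this.tokens >= 1) { this.tokens -= 1; return; }
    return new Promise(resolve => {
      this.queue.push({ resolve });
      this.scheduleDrain();
    });
  }

  // Adds tokens for the elapsed time, then hands them to queued waiters.
  private refill() {
    const now = Date.now();
    this.tokens = Math.min(this.max, this.tokens + (now - this.lastRefill) * this.refillRate);
    this.lastRefill = now;
    while (this.tokens >= 1 && this.queue.length) {
      this.tokens -= 1;
      this.queue.shift()!.resolve();
    }
  }

  // Wakes queued waiters when the next token is due. Without this timer,
  // waiters would only resolve if some later acquire() call triggered refill().
  private scheduleDrain() {
    if (this.timer !== null || this.queue.length === 0) return;
    const waitMs = Math.max(1, Math.ceil((1 - this.tokens) / this.refillRate));
    this.timer = setTimeout(() => {
      this.timer = null;
      this.refill();
      this.scheduleDrain();
    }, waitMs);
  }
}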

const chatLimiter = new TogetherRateLimiter(500);  // buffer under 600
const embedLimiter = new TogetherRateLimiter(250);
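
One detail worth checking with simple arithmetic: a bucket that starts full can admit its stored tokens plus everything it refills during the window. The sketch below is an upper-bound calculation only; `maxAdmitted` is a hypothetical helper, and whether the server counts fixed or sliding windows is not stated in this guide.

```typescript
// Upper bound on requests a token bucket can admit in a window that starts
// with a full bucket: the stored tokens plus the tokens refilled meanwhile.
function maxAdmitted(capacity: number, refillPerMinute: number, windowMs: number): number {
  return Math.floor(capacity + (refillPerMinute * windowMs) / 60_000);
}

// Bucket sized like chatLimiter above: capacity 500, refill 500 per minute.
console.log(maxAdmitted(500, 500, 60_000)); // 1000, above a 600 req/min limit

// Hypothetical alternative: small burst capacity, same refill rate.
console.log(maxAdmitted(50, 500, 60_000));  // 550
```

If cold-start bursts cause 429s, one option is to give the bucket a burst capacity smaller than its per-minute refill budget, so the steady rate stays the same while the initial spike shrinks.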

Retry Strategy

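async function togetherRetry<T>(
  limiter: TogetherRateLimiter, fn: () => Promise<Response>, maxRetries = 4
): Promise<T> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    await limiter.acquire();
    const res = await fn();
    if (res.ok) return (await res.json()) as T;
    const canRetry = attempt < maxRetries;
    if (res.status === 429 && canRetry) {
      // Honor Retry-After (in seconds); fall back to 5 s if it is missing or not a number.
      const parsed = parseInt(res.headers.get("Retry-After") ?? "", 10);
      const retryAfter = Number.isFinite(parsed) ? parsed : 5;
      const jitter = Math.random() * 2000; // spread out clients that were limited together
      await new Promise(r => setTimeout(r, retryAfter * 1000 + jitter));
      continue;
    }
    if (res.status >= 500 && canRetry) {
      // Exponential backoff for server errors: 1 s, 2 s, 4 s, 8 s.
      await new Promise(r => setTimeout(r, Math.pow(2, attempt) * 1000));
      continue;
    }
    // Non-retryable status, or retries exhausted: surface the last response.
    throw new Error(`Together API ${res.status}: ${await res.text()}`);
  }
  throw new Error("Max retries exceeded");
}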
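
The delay calculations in the retry loop can be pulled out into small pure functions, which makes them easy to test. This is a sketch under assumptions: `retryAfterMs` and `backoffMs` are hypothetical helper names, the HTTP-date branch covers the second `Retry-After` format allowed by HTTP (this guide does not say which format Together sends), and "full jitter" is a swapped-in variant of the fixed exponential backoff used above.

```typescript
// Parse a Retry-After header value into milliseconds.
// Accepts delta-seconds ("5") or an HTTP-date; otherwise returns `fallbackMs`.
function retryAfterMs(value: string | null, nowMs: number, fallbackMs = 5_000): number {
  if (value === null || value.trim() === "") return fallbackMs;
  const trimmed = value.trim();
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000;
  const dateMs = Date.parse(trimmed);
  if (isNaN(dateMs)) return fallbackMs;
  return Math.max(0, dateMs - nowMs); // a date in the past means "retry now"
}

// Exponential backoff with full jitter: a random delay in
// [0, min(capMs, baseMs * 2^attempt)). `rand` is injectable for testing.
function backoffMs(
  attempt: number,
  baseMs = 1_000,
  capMs = 30_000,
  rand: () => number = Math.random
): number {
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt));
  return Math.floor(rand() * ceiling);
}

console.log(retryAfterMs("5", Date.now()));          // 5000
console.log(backoffMs(3, 1_000, 30_000, () => 0.5)); // 4000
```

Capping the backoff keeps a long outage from producing multi-minute sleeps, and randomizing the whole delay avoids many clients retrying at the same instant.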

Batch Processing

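// Assumption: the API key is provided via the TOGETHER_API_KEY environment variable.
const headers = {
  Authorization: `Bearer ${process.env.TOGETHER_API_KEY}`,
  "Content-Type": "application/json",
};

async function batchEmbedDocuments(texts: string[], model: string, batchSize = 20) {
  const results: any[] = [];
  for (let i = 0; i < texts.length; i += batchSize) {
    const batch = texts.slice(i, i + batchSize);
    // One request embeds the whole batch, so it costs a single limiter token.
    const result = await togetherRetry(embedLimiter, () =>
      fetch("https://api.together.xyz/v1/embeddings", {
        method: "POST", headers,
        body: JSON.stringify({ model, input: batch }),
      })
    );
    results.push(result);
    // Extra 3 s pause between batches, on top of the limiter.
    if (i + batchSize < texts.length) await new Promise(r => setTimeout(r, 3000));
  }
  return results; // one raw response body per batch, in input order
}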
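
Since `batchEmbedDocuments` returns one response body per batch, callers usually want a single list of vectors aligned with the input texts. The sketch below assumes the OpenAI-compatible embeddings response shape (`data[].index`, `data[].embedding`); that shape and the `flattenEmbeddings` helper are assumptions here, so verify them against the API documentation.

```typescript
// Assumed response shape (OpenAI-compatible embeddings endpoint).
interface EmbeddingResponse {
  data: Array<{ index: number; embedding: number[] }>;
}

// Flatten per-batch responses into one vector per input text, in input order.
// Items are sorted by `index` within each batch in case they arrive out of order.
function flattenEmbeddings(responses: EmbeddingResponse[]): number[][] {
  const vectors: number[][] = [];
  for (const res of responses) {
    const ordered = res.data.slice().sort((a, b) => a.index - b.index);
    for (const item of ordered) vectors.push(item.embedding);
  }
  return vectors;
}

// Two batches; the first arrives with its items out of order.
const sample: EmbeddingResponse[] = [
  { data: [{ index: 1, embedding: [0.2] }, { index: 0, embedding: [0.1] }] },
  { data: [{ index: 0, embedding: [0.3] }] },
];
console.log(JSON.stringify(flattenEmbeddings(sample))); // [[0.1],[0.2],[0.3]]
```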

Error Handling

| Issue                   | Cause                           | Fix                                       |
| ----------------------- | ------------------------------- | ----------------------------------------- |
| 429 on chat completions | Exceeded 600 req/min key limit  | Use token bucket, avoid burst patterns    |
| 429 on embeddings       | Embedding limit is half of chat | Batch inputs (up to 20 texts per request) |
| Model not found         | Wrong model ID string           | Verify with GET /v1/models endpoint       |
| 503 model overloaded    | Popular model at peak demand    | Retry with backoff, or use fallback model |
| Fine-tune 409           | 3 concurrent job limit reached  | Wait for running job to complete first    |
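
The table above can be condensed into a small decision function that a caller consults before retrying. This is a minimal sketch: the `Action` labels and `classifyStatus` are hypothetical names, and the status code returned for an unknown model is not given in this guide, so every other 4xx response is treated as a request problem to fix rather than retry.

```typescript
type Action = "retry-after-header" | "retry-backoff" | "wait-for-job" | "fix-request";

// Map an HTTP status to the handling suggested in the table above.
function classifyStatus(status: number): Action {
  if (status === 429) return "retry-after-header"; // rate limited: wait as instructed
  if (status === 409) return "wait-for-job";       // fine-tune concurrency cap reached
  if (status >= 500 && status <= 599) return "retry-backoff"; // includes 503 overloaded
  return "fix-request"; // e.g. a wrong model ID; retrying will not help
}

console.log(classifyStatus(429)); // retry-after-header
console.log(classifyStatus(503)); // retry-backoff
console.log(classifyStatus(404)); // fix-request
```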

Next Steps

See together-performance-tuning.

Info
Category Programming & Development
Name together-rate-limits
Version v20260423
Size 4.59KB
Updated 2026-04-28