
Implementing Robust API Rate Limit Handling

This guide details patterns for managing API rate limits for external AI services such as Cohere: exponential backoff with jitter for transient 429 errors, concurrency-limited request queues, and proactive rate limit trackers. Together, these techniques keep API throughput reliable and predictable in production systems.

Cohere Rate Limits

Overview

Handle Cohere rate limits with exponential backoff, request queuing, and proactive throttling. The limits quoted below are taken from Cohere's documentation.

Prerequisites

  • cohere-ai SDK installed
  • Understanding of async/await patterns

Actual Cohere Rate Limits

| Key Type | Endpoint | Rate Limit | Monthly Limit |
|---|---|---|---|
| Trial | Chat | 20 calls/min | 1,000 total |
| Trial | Embed | 5 calls/min | 1,000 total |
| Trial | Rerank | 5 calls/min | 1,000 total |
| Trial | Classify | 5 calls/min | 1,000 total |
| Production | All endpoints | 1,000 calls/min | Unlimited |

Trial keys are free. Production keys require billing at dashboard.cohere.com.
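The table above can be captured as a small config object so the queue and tracker in the steps below can be seeded from one place. The names here are our own convention, not part of the cohere-ai SDK:

```typescript
// Per-minute trial limits from the table above, keyed by endpoint.
const TRIAL_LIMITS: Record<string, number> = {
  chat: 20,
  embed: 5,
  rerank: 5,
  classify: 5,
};

const PRODUCTION_LIMIT = 1_000; // all endpoints

// Resolve the per-minute budget for a given endpoint and key type,
// defaulting to the most conservative trial limit for unknown endpoints.
function callsPerMinute(endpoint: string, keyType: 'trial' | 'production'): number {
  if (keyType === 'production') return PRODUCTION_LIMIT;
  return TRIAL_LIMITS[endpoint] ?? 5;
}
```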

Instructions

Step 1: Exponential Backoff with Jitter

import { CohereError, CohereTimeoutError } from 'cohere-ai';

interface RetryConfig {
  maxRetries: number;
  baseDelayMs: number;
  maxDelayMs: number;
}

const DEFAULT_RETRY: RetryConfig = {
  maxRetries: 5,
  baseDelayMs: 1000,
  maxDelayMs: 60_000,
};

async function withBackoff<T>(
  operation: () => Promise<T>,
  config = DEFAULT_RETRY
): Promise<T> {
  for (let attempt = 0; attempt <= config.maxRetries; attempt++) {
    try {
      return await operation();
    } catch (err) {
      if (attempt === config.maxRetries) throw err;

      // Only retry on rate limits (429) and server errors (5xx)
      let shouldRetry = false;
      let retryAfterMs: number | undefined;

      if (err instanceof CohereError) {
        if (err.statusCode === 429) {
          shouldRetry = true;
          // Cohere returns a Retry-After header (in seconds); some SDK versions
          // expose response headers on the error — honor the hint if present
          const hint = Number((err as any).headers?.['retry-after']);
          if (Number.isFinite(hint)) retryAfterMs = hint * 1000;
        } else if (err.statusCode && err.statusCode >= 500) {
          shouldRetry = true;
        }
      } else if (err instanceof CohereTimeoutError) {
        shouldRetry = true;
      }

      if (!shouldRetry) throw err;

      // Exponential delay with jitter
      const exponential = config.baseDelayMs * Math.pow(2, attempt);
      const jitter = Math.random() * config.baseDelayMs;
      const delay = Math.min(exponential + jitter, config.maxDelayMs);

      const waitMs = retryAfterMs ?? delay;
      console.warn(`Cohere retry ${attempt + 1}/${config.maxRetries} in ${waitMs.toFixed(0)}ms`);
      await new Promise(r => setTimeout(r, waitMs));
    }
  }
  throw new Error('Unreachable');
}
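Whether the SDK surfaces response headers varies by version. If you do have access to the raw Retry-After value, a small parser covers both forms HTTP allows (delta-seconds or an HTTP-date). This helper is a sketch of our own, not part of cohere-ai:

```typescript
// Parse an HTTP Retry-After header value into milliseconds.
// Accepts delta-seconds ("2") or an HTTP-date; returns undefined if unparseable.
function parseRetryAfterMs(
  value: string | undefined,
  now: number = Date.now()
): number | undefined {
  if (!value) return undefined;
  const seconds = Number(value);
  if (Number.isFinite(seconds)) return Math.max(0, seconds * 1000);
  const date = Date.parse(value);
  return Number.isNaN(date) ? undefined : Math.max(0, date - now);
}
```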

Step 2: Request Queue (Concurrency-Limited)

import PQueue from 'p-queue';

// Match rate limits: trial=20/min for chat, production=1000/min
function createCohereQueue(callsPerMinute: number) {
  return new PQueue({
    concurrency: 5,
    interval: 60_000,
    intervalCap: callsPerMinute,
  });
}

// Trial key queue
const trialChatQueue = createCohereQueue(20);
const trialEmbedQueue = createCohereQueue(5);

// Production key queue
const prodQueue = createCohereQueue(1000);

// Usage — assumes a client configured elsewhere, e.g.:
//   import { CohereClient } from 'cohere-ai';
//   const cohere = new CohereClient({ token: process.env.CO_API_KEY });
async function queuedChat(params: any) {
  return trialChatQueue.add(() =>
    withBackoff(() => cohere.chat(params))
  );
}
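When sizing a batch job against a trial key, it helps to estimate how long the queue needs to drain. With `intervalCap` calls allowed per `interval`, a lower bound on wall-clock time is (a sketch with our own helper name, matching the PQueue settings above):

```typescript
// Lower bound on wall-clock time (ms) to issue `calls` requests through a
// window that allows `intervalCap` calls per `intervalMs`.
function minDrainTimeMs(calls: number, intervalCap: number, intervalMs = 60_000): number {
  if (calls <= 0) return 0;
  // The first window starts immediately; each additional full window adds intervalMs.
  return (Math.ceil(calls / intervalCap) - 1) * intervalMs;
}
```

For example, 100 chat calls on a trial key (20/min) span five windows, so the queue needs at least four extra minutes beyond the first burst.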

Step 3: Proactive Rate Tracking

class RateLimitTracker {
  private windows: Map<string, number[]> = new Map();

  constructor(private limitsPerMinute: Record<string, number>) {}

  canProceed(endpoint: string): boolean {
    const limit = this.limitsPerMinute[endpoint] ?? 1000;
    const now = Date.now();
    const window = this.windows.get(endpoint) ?? [];

    // Remove entries older than 1 minute
    const active = window.filter(t => now - t < 60_000);
    this.windows.set(endpoint, active);

    return active.length < limit;
  }

  record(endpoint: string): void {
    const window = this.windows.get(endpoint) ?? [];
    window.push(Date.now());
    this.windows.set(endpoint, window);
  }

  waitTime(endpoint: string): number {
    const limit = this.limitsPerMinute[endpoint] ?? 1000;
    const window = this.windows.get(endpoint) ?? [];
    const now = Date.now();
    const active = window.filter(t => now - t < 60_000);

    if (active.length < limit) return 0;
    return 60_000 - (now - active[0]); // Wait until oldest entry expires
  }
}

// Trial key tracker
const tracker = new RateLimitTracker({
  chat: 20,
  embed: 5,
  rerank: 5,
  classify: 5,
});

// Use before each call
async function trackedEmbed(params: any) {
  const wait = tracker.waitTime('embed');
  if (wait > 0) {
    console.log(`Throttling embed: waiting ${wait}ms`);
    await new Promise(r => setTimeout(r, wait));
  }
  tracker.record('embed');
  return withBackoff(() => cohere.embed(params));
}

Step 4: Batch-Aware Embedding

// Embed supports up to 96 texts per call — maximize batch size to reduce calls
async function efficientEmbed(
  texts: string[],
  inputType: 'search_document' | 'search_query' = 'search_document'
): Promise<number[][]> {
  const BATCH_SIZE = 96; // Cohere max per request
  const allVectors: number[][] = [];

  for (let i = 0; i < texts.length; i += BATCH_SIZE) {
    const batch = texts.slice(i, i + BATCH_SIZE);

    const response = await trackedEmbed({
      model: 'embed-v4.0',
      texts: batch,
      inputType,
      embeddingTypes: ['float'],
    });

    allVectors.push(...response.embeddings.float);
  }

  return allVectors;
}

// 960 texts = 10 API calls (not 960)
const vectors = await efficientEmbed(largeTextArray);
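The call-count arithmetic above generalizes to a one-liner, useful for estimating monthly-quota consumption before kicking off a large embedding job:

```typescript
// Number of embed API calls needed for `total` texts at `batchSize` texts
// per call (96 is Cohere's documented maximum per request).
function embedCallCount(total: number, batchSize = 96): number {
  return Math.ceil(total / batchSize);
}
```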

Cost-Aware Rate Limiting

For production keys, rate limits are per-minute but costs are per-token:

class TokenBudget {
  private tokensUsed = 0;
  private readonly resetInterval: ReturnType<typeof setInterval>;

  constructor(
    private maxTokensPerMinute: number,
    private alertCallback?: (used: number) => void
  ) {
    // Reset every minute
    this.resetInterval = setInterval(() => { this.tokensUsed = 0; }, 60_000);
  }

  canAfford(estimatedTokens: number): boolean {
    return this.tokensUsed + estimatedTokens <= this.maxTokensPerMinute;
  }

  record(actualTokens: number): void {
    this.tokensUsed += actualTokens;
    if (this.tokensUsed > this.maxTokensPerMinute * 0.8) {
      this.alertCallback?.(this.tokensUsed);
    }
  }

  dispose(): void {
    clearInterval(this.resetInterval);
  }
}
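`canAfford` needs a token estimate before the call is made. A rough character-based heuristic (about 4 characters per token for English text — an approximation, not Cohere's actual tokenizer) is usually good enough for budgeting; reconcile with the billed token counts the API returns:

```typescript
// Rough pre-call token estimate: ~4 characters per token for English text.
// For exact accounting, record the usage fields from the API response instead.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}
```

Typical usage: `if (budget.canAfford(estimateTokens(prompt))) { ... }`, then `budget.record(actualTokens)` once the response arrives.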

Output

  • Automatic retry with exponential backoff + jitter
  • Concurrency-limited request queue matching Cohere rate limits
  • Proactive throttling before hitting limits
  • Batch-optimized embedding to minimize API calls

Error Handling

| Scenario | Detection | Action |
|---|---|---|
| 429 from trial key | CohereError.statusCode === 429 | Wait 60s, retry |
| 429 from production key | Same | Back off, check concurrency |
| Monthly limit hit (trial) | 429 with monthly-limit message | Upgrade to production key |
| Burst of requests | Queue depth > threshold | Add backpressure |
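The table's decision logic can be sketched as a pure function. Note that the monthly-limit check matches on the error message text, which is an assumption about Cohere's wording rather than a documented contract:

```typescript
type RateLimitAction = 'backoff' | 'upgrade-key' | 'rethrow';

// Map an error to the action in the table above (sketch; message matching is heuristic).
function classifyRateLimitError(
  statusCode: number | undefined,
  message: string
): RateLimitAction {
  if (statusCode !== 429) return 'rethrow';
  if (/monthly/i.test(message)) return 'upgrade-key'; // trial monthly cap exhausted
  return 'backoff'; // per-minute limit: retry with exponential backoff
}
```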


Next Steps

For security configuration, see cohere-security-basics.

Info
Category Development
Name cohere-rate-limits
Version v20260423
Size 7.4KB
Updated At 2026-04-28