Mistral AI速率限制与重试机制

v20260423

mistral-rate-limits

本教程详细介绍了如何为Mistral AI API实现高效的速率限制（RPM/TPM）管理。它提供了基于Token消耗和请求频率的限制器实现，并演示了如何利用指数退避和`Retry-After`头部进行重试，确保应用在高并发和API限制场景下的稳定性和吞吐量。

Mistral API 速率限制退避重试大型语言模型 TypeScript Python

获取技能

196 次下载

概览

Mistral Rate Limits

Overview

Rate limit management for Mistral AI API. Mistral enforces per-workspace RPM (requests/minute) and TPM (tokens/minute) limits that vary by usage tier (Experiment free tier vs Scale pay-as-you-go). View your workspace limits at admin.mistral.ai/plateforme/limits.

Prerequisites

Mistral API key configured
Understanding of workspace tier (Experiment vs Scale)
Application with retry infrastructure

Mistral Rate Limit Architecture

Limits are set at the workspace level, not per key. All API keys in a workspace share the same RPM/TPM budget.

Endpoint	What's limited
`/v1/chat/completions`	RPM + TPM (input + output)
`/v1/embeddings`	RPM + TPM (input only)
`/v1/fim/completions`	RPM + TPM
`/v1/moderations`	RPM

Headers returned on every response:

x-ratelimit-limit-requests — your RPM cap
x-ratelimit-remaining-requests — remaining RPM
x-ratelimit-limit-tokens — your TPM cap
x-ratelimit-remaining-tokens — remaining TPM
Retry-After — seconds to wait (on 429 only)

Instructions

Step 1: Token-Aware Rate Limiter

class MistralRateLimiter {
  private requestTimes: number[] = [];
  private tokenBuckets: Array<{ time: number; tokens: number }> = [];
  private readonly rpm: number;
  private readonly tpm: number;

  constructor(rpm: number, tpm: number) {
    this.rpm = rpm;
    this.tpm = tpm;
  }

  async waitIfNeeded(estimatedTokens: number): Promise<void> {
    const now = Date.now();
    const windowStart = now - 60_000;

    // Prune old entries
    this.requestTimes = this.requestTimes.filter(t => t > windowStart);
    this.tokenBuckets = this.tokenBuckets.filter(b => b.time > windowStart);

    // Check RPM
    if (this.requestTimes.length >= this.rpm) {
      const waitMs = this.requestTimes[0] - windowStart + 100;
      console.warn(`RPM limit (${this.rpm}), waiting ${waitMs}ms`);
      await new Promise(r => setTimeout(r, waitMs));
    }

    // Check TPM
    const currentTPM = this.tokenBuckets.reduce((sum, b) => sum + b.tokens, 0);
    if (currentTPM + estimatedTokens > this.tpm) {
      const waitMs = this.tokenBuckets[0].time - windowStart + 100;
      console.warn(`TPM limit (${this.tpm}), waiting ${waitMs}ms`);
      await new Promise(r => setTimeout(r, waitMs));
    }

    this.requestTimes.push(Date.now());
  }

  recordUsage(tokens: number): void {
    this.tokenBuckets.push({ time: Date.now(), tokens });
  }
}

Step 2: Retry with Retry-After Header

import { Mistral } from '@mistralai/mistralai';

async function chatWithRetry(
  client: Mistral,
  params: { model: string; messages: any[] },
  maxRetries = 5,
): Promise<any> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await client.chat.complete(params);
    } catch (error: any) {
      if (error.status !== 429 || attempt === maxRetries) throw error;

      // Respect Retry-After header from Mistral
      const retryAfter = error.headers?.get?.('retry-after');
      const waitSec = retryAfter ? parseInt(retryAfter) : Math.min(2 ** attempt, 60);
      console.warn(`429 — retrying in ${waitSec}s (attempt ${attempt + 1}/${maxRetries})`);
      await new Promise(r => setTimeout(r, waitSec * 1000));
    }
  }
}

Step 3: Rate-Limited Client Wrapper

const limiter = new MistralRateLimiter(100, 500_000);
const client = new Mistral({ apiKey: process.env.MISTRAL_API_KEY });

async function rateLimitedChat(messages: any[], model = 'mistral-small-latest') {
  const estimatedTokens = messages.reduce(
    (sum, m) => sum + Math.ceil((m.content?.length ?? 0) / 4), 0
  );

  await limiter.waitIfNeeded(estimatedTokens);
  const response = await client.chat.complete({ model, messages });

  if (response.usage) {
    limiter.recordUsage(
      (response.usage.promptTokens ?? 0) + (response.usage.completionTokens ?? 0)
    );
  }
  return response;
}

Step 4: Model Fallback for Throughput

class ModelRouter {
  private limiters: Record<string, MistralRateLimiter>;

  constructor() {
    this.limiters = {
      'mistral-large-latest': new MistralRateLimiter(30, 200_000),
      'mistral-small-latest': new MistralRateLimiter(120, 500_000),
    };
  }

  async chat(messages: any[], preferred = 'mistral-large-latest') {
    try {
      return await rateLimitedChat(messages, preferred);
    } catch (error: any) {
      if (error.status === 429 && preferred !== 'mistral-small-latest') {
        console.warn(`Falling back to mistral-small-latest`);
        return rateLimitedChat(messages, 'mistral-small-latest');
      }
      throw error;
    }
  }
}

Step 5: Batch Embedding with Rate Awareness

import time
from mistralai import Mistral

def batch_embed(client: Mistral, texts: list[str], batch_size: int = 32) -> list:
    """Batch embed with automatic rate limiting."""
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        try:
            response = client.embeddings.create(
                model="mistral-embed", inputs=batch
            )
            all_embeddings.extend([d.embedding for d in response.data])
        except Exception as e:
            if hasattr(e, "status_code") and e.status_code == 429:
                time.sleep(10)
                response = client.embeddings.create(
                    model="mistral-embed", inputs=batch
                )
                all_embeddings.extend([d.embedding for d in response.data])
            else:
                raise
    return all_embeddings

Step 6: Usage Dashboard

function rateLimitStatus(limiter: MistralRateLimiter) {
  const now = Date.now();
  const windowStart = now - 60_000;
  const activeRequests = limiter['requestTimes'].filter(t => t > windowStart).length;
  const activeTokens = limiter['tokenBuckets']
    .filter(b => b.time > windowStart)
    .reduce((sum, b) => sum + b.tokens, 0);

  return {
    rpm: { used: activeRequests, limit: limiter['rpm'], pct: (activeRequests / limiter['rpm'] * 100).toFixed(1) },
    tpm: { used: activeTokens, limit: limiter['tpm'], pct: (activeTokens / limiter['tpm'] * 100).toFixed(1) },
  };
}

Error Handling

Issue	Cause	Solution
`429` errors	Exceeded RPM or TPM	Use rate limiter + exponential backoff
Inconsistent limits	All keys share workspace budget	Coordinate across services
Batch failures	Too many tokens per batch	Reduce batch size for embeddings
Spike traffic blocked	No request smoothing	Queue requests, spread over window

Resources

Rate Limits & Usage Tiers
Pricing
Batch Inference — 50% cheaper, no rate limits

Output

Token-aware rate limiter with RPM + TPM tracking
Retry logic respecting Retry-After headers
Model fallback routing for throughput
Rate limit dashboard for monitoring

信息

Category 编程开发

Name mistral-rate-limits

版本 v20260423

大小 7.59KB

Source jeremylongshore/claude-code-plugins-plus-skills

更新时间 2026-04-28