Together AI's OpenAI-compatible inference API enforces per-key rate limits that vary by model tier and operation type. Chat completions and embeddings share a global request quota, while fine-tuning jobs and batch inference have separate concurrency caps. High-throughput workloads, such as embedding entire document corpora or running evaluations across 100+ prompts, require client-side token-bucket limiting. Together's batch inference endpoint offers 50% cost savings but has its own queue-depth limits that differ from real-time inference.
| Endpoint | Limit | Window | Scope |
|---|---|---|---|
| Chat completions | 600 requests | 1 minute | Per API key |
| Embeddings | 300 requests | 1 minute | Per API key |
| Image generation (FLUX) | 60 requests | 1 minute | Per API key |
| Fine-tune jobs | 3 concurrent jobs | Rolling | Per API key |
| Batch inference | 100 requests per batch, 10 concurrent batches | Rolling | Per API key |
A client-side token bucket keeps request rates under these caps:

```typescript
// Token-bucket limiter: capacity refills continuously at maxPerMinute / 60s.
class TogetherRateLimiter {
  private tokens: number;
  private lastRefill: number;
  private readonly max: number;
  private readonly refillRate: number; // tokens per millisecond
  private queue: Array<{ resolve: () => void }> = [];
  private drainTimer: ReturnType<typeof setTimeout> | null = null;

  constructor(maxPerMinute: number) {
    this.max = maxPerMinute;
    this.tokens = maxPerMinute;
    this.lastRefill = Date.now();
    this.refillRate = maxPerMinute / 60_000;
  }

  async acquire(): Promise<void> {
    this.refill();
    if (this.tokens >= 1) { this.tokens -= 1; return; }
    // No capacity: queue the caller and schedule a drain, otherwise a queued
    // request would only wake up when the next acquire() call triggers a refill.
    return new Promise(resolve => {
      this.queue.push({ resolve });
      this.scheduleDrain();
    });
  }

  // Wake up roughly when the next whole token has accrued.
  private scheduleDrain() {
    if (this.drainTimer) return;
    this.drainTimer = setTimeout(() => {
      this.drainTimer = null;
      this.refill();
    }, Math.ceil(1 / this.refillRate));
  }

  private refill() {
    const now = Date.now();
    this.tokens = Math.min(this.max, this.tokens + (now - this.lastRefill) * this.refillRate);
    this.lastRefill = now;
    // Release queued callers while capacity is available.
    while (this.tokens >= 1 && this.queue.length) {
      this.tokens -= 1;
      this.queue.shift()!.resolve();
    }
    if (this.queue.length) this.scheduleDrain();
  }
}
```
Wrap every call in a retry helper that honors `Retry-After` on 429 and backs off exponentially on 5xx:

```typescript
// Conservative caps leave headroom below the documented per-key limits.
const chatLimiter = new TogetherRateLimiter(500);  // buffer under 600 req/min
const embedLimiter = new TogetherRateLimiter(250); // buffer under 300 req/min

// Auth header shared by all requests; reads the key from the environment.
const headers = {
  "Authorization": `Bearer ${process.env.TOGETHER_API_KEY}`,
  "Content-Type": "application/json",
};

async function togetherRetry<T>(
  limiter: TogetherRateLimiter, fn: () => Promise<Response>, maxRetries = 4
): Promise<T> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    await limiter.acquire();
    const res = await fn();
    if (res.ok) return res.json() as Promise<T>;
    if (res.status === 429) {
      // Retry-After is in seconds; fall back to 5s if it is missing or unparsable.
      const retryAfter = parseInt(res.headers.get("Retry-After") ?? "", 10) || 5;
      const jitter = Math.random() * 2000; // spread retries to avoid a thundering herd
      await new Promise(r => setTimeout(r, retryAfter * 1000 + jitter));
      continue;
    }
    if (res.status >= 500 && attempt < maxRetries) {
      await new Promise(r => setTimeout(r, Math.pow(2, attempt) * 1000)); // exponential backoff
      continue;
    }
    throw new Error(`Together API ${res.status}: ${await res.text()}`);
  }
  throw new Error("Max retries exceeded");
}
```
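As a usage sketch, a chat completion can go through both helpers; the model ID and prompt below are placeholders (verify what your key can access via `GET /v1/models`):

```typescript
// Sketch: gate a chat completion behind the chat limiter and retry helper.
const completion = await togetherRetry<any>(chatLimiter, () =>
  fetch("https://api.together.xyz/v1/chat/completions", {
    method: "POST",
    headers,
    body: JSON.stringify({
      model: "meta-llama/Llama-3.3-70B-Instruct-Turbo", // placeholder model ID
      messages: [{ role: "user", content: "Summarize the release notes." }],
      max_tokens: 256,
    }),
  })
);
console.log(completion.choices?.[0]?.message?.content);
```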
For corpus embedding, send up to 20 texts per request and pace the chunks:

```typescript
// Embed a corpus in chunks, staying under the embeddings request limit.
async function batchEmbedDocuments(texts: string[], model: string, batchSize = 20) {
  const results: any[] = [];
  for (let i = 0; i < texts.length; i += batchSize) {
    const batch = texts.slice(i, i + batchSize);
    const result = await togetherRetry<any>(embedLimiter, () =>
      fetch("https://api.together.xyz/v1/embeddings", {
        method: "POST",
        headers,
        body: JSON.stringify({ model, input: batch }),
      })
    );
    results.push(result);
    // A brief pause between chunks smooths out bursts against the 300 req/min cap.
    if (i + batchSize < texts.length) await new Promise(r => setTimeout(r, 3000));
  }
  return results;
}
```
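A usage sketch with a hypothetical corpus; the embedding model ID is illustrative, not prescriptive:

```typescript
// Hypothetical corpus; chunked into requests of up to 20 texts each.
const docs = ["First document...", "Second document...", "Third document..."];
const responses = await batchEmbedDocuments(
  docs,
  "BAAI/bge-large-en-v1.5" // illustrative embedding model ID; confirm via GET /v1/models
);
console.log(`Embedded ${docs.length} documents in ${responses.length} request(s)`);
```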
| Issue | Cause | Fix |
|---|---|---|
| 429 on chat completions | Exceeded 600 req/min key limit | Use token bucket, avoid burst patterns |
| 429 on embeddings | Embedding limit is half of chat | Batch inputs (up to 20 texts per request) |
| Model not found | Wrong model ID string | Verify with GET /v1/models endpoint |
| 503 model overloaded | Popular model at peak demand | Retry with backoff, or use fallback model |
| Fine-tune 409 | 3 concurrent job limit reached | Wait for running job to complete first |
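For the 503 case, one pattern is to exhaust retries against the primary model and then repeat the request against a fallback. A rough sketch, with both model IDs as placeholders:

```typescript
// Sketch: try the primary model first, then fall back if its retries are exhausted.
// Both model IDs are placeholders for illustration.
async function chatWithFallback(messages: Array<{ role: string; content: string }>) {
  const models = [
    "meta-llama/Llama-3.3-70B-Instruct-Turbo", // primary (placeholder)
    "mistralai/Mixtral-8x7B-Instruct-v0.1",    // fallback (placeholder)
  ];
  let lastError: unknown;
  for (const model of models) {
    try {
      return await togetherRetry<any>(chatLimiter, () =>
        fetch("https://api.together.xyz/v1/chat/completions", {
          method: "POST",
          headers,
          body: JSON.stringify({ model, messages }),
        })
      );
    } catch (err) {
      lastError = err; // primary overloaded or failing; try the next model
    }
  }
  throw lastError;
}
```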
See together-performance-tuning.