Cohere Cost Tuning
Overview
Optimize Cohere costs through model selection, token budgets, embedding compression, and usage monitoring. Cohere pricing is token-based with separate input/output rates.
Prerequisites
Cohere Pricing Model
Key principle: Cohere charges per token. Input tokens and output tokens have different rates. Embed, Rerank, and Classify have separate pricing based on search units.
| Tier |
Access |
Rate Limits |
Cost |
| Trial |
Free |
5-20 calls/min, 1000/month |
$0 |
| Production |
Metered |
1000 calls/min, unlimited |
Per-token |
Model Cost Comparison
| Model |
Input (per 1M tokens) |
Output (per 1M tokens) |
Best For |
command-r7b-12-2024 |
Lowest |
Lowest |
High-volume, simple tasks |
command-r-08-2024 |
Low |
Low |
RAG, cost-effective |
command-r-plus-08-2024 |
Medium |
Medium |
Complex reasoning |
command-a-03-2025 |
Higher |
Higher |
Best quality |
Non-Chat Pricing
| Endpoint |
Pricing Unit |
Notes |
| Embed |
Per input token |
Batch 96 texts to minimize calls |
| Rerank |
Per search unit |
1 query + N docs = 1 search unit |
| Classify |
Per classification |
Charges per input classified |
Instructions
Strategy 1: Model Tiering
type CostTier = 'economy' | 'standard' | 'premium';
function selectModel(tier: CostTier): string {
switch (tier) {
case 'economy': return 'command-r7b-12-2024'; // ~5x cheaper
case 'standard': return 'command-r-08-2024'; // Good balance
case 'premium': return 'command-a-03-2025'; // Best quality
}
}
// Route by use case
function routeModel(task: string): string {
// High-volume, simple tasks → cheapest model
if (['classify', 'extract', 'summarize-short'].includes(task)) {
return selectModel('economy');
}
// RAG, moderate complexity
if (['rag', 'search', 'qa'].includes(task)) {
return selectModel('standard');
}
// Complex reasoning, user-facing
return selectModel('premium');
}
Strategy 2: Token Budget Controls
import { CohereClientV2 } from 'cohere-ai';
const cohere = new CohereClientV2();
// Set maxTokens to prevent runaway generation costs
async function budgetedChat(message: string, maxOutputTokens = 500) {
const response = await cohere.chat({
model: 'command-r-08-2024',
messages: [{ role: 'user', content: message }],
maxTokens: maxOutputTokens, // Hard limit on output tokens
});
// Track actual usage
const usage = response.usage?.billedUnits;
console.log(`Tokens: in=${usage?.inputTokens} out=${usage?.outputTokens}`);
return response;
}
Strategy 3: Embedding Cost Reduction
// 1. Use int8 embeddings (same quality, cheaper storage)
const response = await cohere.embed({
model: 'embed-v4.0',
texts: documents,
inputType: 'search_document',
embeddingTypes: ['int8'], // 75% less storage than float
});
// 2. Batch to 96 per call (minimize API calls)
// 3. Cache embeddings (they're deterministic — embed once, use forever)
// 4. Use embed-multilingual-v3.0 if you don't need v4 features
Strategy 4: Usage Monitoring
class CohereUsageTracker {
private usage: Record<string, { inputTokens: number; outputTokens: number; calls: number }> = {};
private dailyBudget: number;
constructor(dailyBudgetUSD: number) {
this.dailyBudget = dailyBudgetUSD;
}
track(endpoint: string, billedUnits: { inputTokens?: number; outputTokens?: number }) {
if (!this.usage[endpoint]) {
this.usage[endpoint] = { inputTokens: 0, outputTokens: 0, calls: 0 };
}
this.usage[endpoint].inputTokens += billedUnits.inputTokens ?? 0;
this.usage[endpoint].outputTokens += billedUnits.outputTokens ?? 0;
this.usage[endpoint].calls++;
}
getReport(): string {
return Object.entries(this.usage)
.map(([ep, u]) =>
`${ep}: ${u.calls} calls, ${u.inputTokens} in, ${u.outputTokens} out`
)
.join('\n');
}
estimateDailyCost(): number {
// Rough estimate — check cohere.com/pricing for exact rates
const chatIn = (this.usage['chat']?.inputTokens ?? 0) / 1_000_000;
const chatOut = (this.usage['chat']?.outputTokens ?? 0) / 1_000_000;
const embedIn = (this.usage['embed']?.inputTokens ?? 0) / 1_000_000;
// Multiply by per-million-token rates from pricing page
return (chatIn * 0.5) + (chatOut * 1.5) + (embedIn * 0.1); // example rates
}
}
// Wrap all API calls
const tracker = new CohereUsageTracker(10); // $10/day budget
async function trackedChat(params: any) {
const response = await cohere.chat(params);
tracker.track('chat', response.usage?.billedUnits ?? {});
if (tracker.estimateDailyCost() > tracker['dailyBudget'] * 0.8) {
console.warn('WARNING: Approaching daily Cohere budget limit');
}
return response;
}
Strategy 5: Rerank Before RAG (Skip Embed for Small Corpora)
// If you have < 1000 documents, skip embedding entirely
// Rerank is cheaper than Embed + vector search for small collections
async function cheapRAG(query: string, corpus: string[]) {
// 1 search unit instead of N embed calls
const ranked = await cohere.rerank({
model: 'rerank-v3.5',
query,
documents: corpus,
topN: 3,
});
const docs = ranked.results.map((r, i) => ({
id: `doc-${i}`,
data: { text: corpus[r.index] },
}));
// Use cheaper model for generation
return cohere.chat({
model: 'command-r-08-2024', // Not command-a (cheaper)
messages: [{ role: 'user', content: query }],
documents: docs,
maxTokens: 300,
});
}
Cost Optimization Checklist
Output
- Model tiering by cost/quality
- Token budget controls preventing runaway costs
- Usage tracking with daily budget alerts
- Cost-effective RAG with rerank pre-filtering
Error Handling
| Issue |
Cause |
Solution |
| Unexpected bill spike |
No maxTokens |
Set maxTokens on all chat calls |
| High embed costs |
Individual texts |
Batch to 96 per call |
| Budget exceeded |
No monitoring |
Track billedUnits per response |
| Over-provisioned model |
Using premium everywhere |
Tier models by task complexity |
Resources
Next Steps
For architecture patterns, see cohere-reference-architecture.