Optimize Together AI costs with model selection, batching, and caching.
| Model Category | Price (per 1M tokens) | Example Models |
|---|---|---|
| Small (< 10B) | $0.10-0.30 | Llama-3.2-3B, Qwen-2.5-7B |
| Medium (10-70B) | $0.60-1.20 | Mixtral-8x7B, Llama-3.3-70B-Turbo |
| Large (> 70B) | $2.00-5.00 | Llama-3.1-405B, DeepSeek-V3 |
| Image gen | $0.003-0.05/image | FLUX.1-schnell, SDXL |
| Embeddings | $0.008/1M tokens | M2-BERT |
| Fine-tuning | ~$5-25/hour | Depends on model + GPU |
| Batch inference | 50% off | Same models, async |
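Given the per-1M-token rates above, a quick back-of-the-envelope estimate makes the tier choice concrete before committing to a model. A minimal sketch: the rates below are the midpoints of the price ranges in the table, not exact per-model pricing, so check the Together AI pricing page for real numbers.

```python
# Rough cost estimate per tier, using the midpoint of each price range above.
# Rates are illustrative, in USD per 1M tokens.
PRICE_PER_1M = {
    "small": 0.20,   # midpoint of $0.10-0.30
    "medium": 0.90,  # midpoint of $0.60-1.20
    "large": 3.50,   # midpoint of $2.00-5.00
}

def estimate_cost(tier: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate USD cost of one request at the tier's midpoint rate."""
    rate = PRICE_PER_1M[tier]
    return (input_tokens + output_tokens) / 1_000_000 * rate

# 10M requests of 500 input + 200 output tokens, small vs. large tier:
small = estimate_cost("small", 500, 200) * 10_000_000
large = estimate_cost("large", 500, 200) * 10_000_000
print(f"small: ${small:,.0f}, large: ${large:,.0f}")
# → small: $1,400, large: $24,500
```

At this volume the tier choice alone is a ~17x cost difference, which is why "use the smallest model that works" is tip 4 below.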
```python
from functools import lru_cache

from together import Together

client = Together()  # reads TOGETHER_API_KEY from the environment

# 1. Use Turbo variants (faster, cheaper, similar quality)
# meta-llama/Llama-3.3-70B-Instruct-Turbo vs Llama-3.1-70B-Instruct

# 2. Batch inference (50% cost reduction); file_id references a previously
# uploaded JSONL file (see the Batch API docs for the exact method signature)
batch_response = client.batch.create(
    input_file_id=file_id,
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo",
    completion_window="24h",
)

# 3. Cache responses for identical prompts
# (lru_cache is in-memory and per-process; arguments must be hashable)
@lru_cache(maxsize=1000)
def cached_completion(prompt: str, model: str) -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# 4. Use the smallest model that works:
# test with 3B first, upgrade to 70B only if quality is insufficient.
```
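Tip 4 can be automated with an escalation wrapper: call the cheapest model first and retry on a larger one only when a quality check fails. A minimal sketch; `call_model` and `good_enough` are hypothetical placeholders for your actual API call and quality heuristic.

```python
from typing import Callable

def tiered_completion(
    prompt: str,
    models: list[str],
    call_model: Callable[[str, str], str],  # (model, prompt) -> answer; hypothetical stub
    good_enough: Callable[[str], bool],     # quality heuristic; hypothetical stub
) -> str:
    """Try models cheapest-first; escalate only when the answer fails the check."""
    answer = ""
    for model in models:
        answer = call_model(model, prompt)
        if good_enough(answer):
            return answer
    return answer  # fall back to the largest model's answer

# Example with stubbed calls: escalate from 3B to 70B when the answer is too short.
answer = tiered_completion(
    "Summarize quicksort in one paragraph.",
    ["meta-llama/Llama-3.2-3B-Instruct-Turbo",
     "meta-llama/Llama-3.3-70B-Instruct-Turbo"],
    call_model=lambda m, p: "too short" if "3B" in m else "a sufficiently detailed answer",
    good_enough=lambda a: len(a) > 10,
)
```

The quality check is the hard part in practice; length thresholds, regex checks on expected structure, or an occasional LLM-as-judge pass are common choices.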
| Issue | Cause | Solution |
|---|---|---|
| High costs | Wrong model tier | Downsize model |
| Batch failures | Invalid input format | Validate JSONL |
| Fine-tuning expensive | Too many epochs | Start with 1-2 epochs |
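The "Validate JSONL" fix above takes only a few lines: each line of a batch input file must be a standalone JSON object. A minimal sketch; the required keys checked here (`custom_id`, `body`) are an assumption about the batch request schema, so verify them against the Batch API docs.

```python
import json

def validate_jsonl(path: str,
                   required_keys: tuple[str, ...] = ("custom_id", "body")) -> list[str]:
    """Return human-readable errors; an empty list means the file looks valid."""
    errors = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                errors.append(f"line {lineno}: blank line")
                continue
            try:
                obj = json.loads(line)
            except json.JSONDecodeError as exc:
                errors.append(f"line {lineno}: invalid JSON ({exc.msg})")
                continue
            if not isinstance(obj, dict):
                errors.append(f"line {lineno}: expected a JSON object")
                continue
            for key in required_keys:
                if key not in obj:  # assumed required key; check the Batch API schema
                    errors.append(f"line {lineno}: missing key {key!r}")
    return errors
```

Running this before upload catches format errors locally instead of after a failed (and slow) batch submission.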
For architecture patterns, see together-reference-architecture.