OpenRouter大模型缓存与成本优化

v20260423

openrouter-caching-strategy

本技能提供了全面的OpenRouter API调用缓存策略，旨在降低成本和延迟。涵盖了内存缓存、基于Redis的持久化缓存，以及Anthropic等提供商的特殊缓存技巧。通过实现确定性缓存键和管理TTL，开发者可以显著优化重复或相似的LLM查询，特别适用于RAG系统和高并发场景，实现成本和性能双重优化。

OpenRouter API 缓存大模型成本优化 Redis Python RAG

获取技能

316 次下载

概览

OpenRouter Caching Strategy

Overview

OpenRouter charges per token, so caching identical or similar requests can dramatically cut costs. Deterministic requests (temperature=0) with the same model and messages produce identical outputs -- these are safe to cache. This skill covers in-memory caching, persistent caching with TTL, and Anthropic prompt caching via OpenRouter.

In-Memory Cache

import os, hashlib, json, time
from typing import Optional
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
    default_headers={"HTTP-Referer": "https://my-app.com", "X-Title": "my-app"},
)

class LLMCache:
    def __init__(self, ttl_seconds: int = 3600):
        self._cache: dict[str, tuple[dict, float]] = {}
        self._ttl = ttl_seconds
        self.hits = 0
        self.misses = 0

    def _key(self, model: str, messages: list, **kwargs) -> str:
        blob = json.dumps({"model": model, "messages": messages, **kwargs}, sort_keys=True)
        return hashlib.sha256(blob.encode()).hexdigest()

    def get(self, model: str, messages: list, **kwargs) -> Optional[dict]:
        k = self._key(model, messages, **kwargs)
        if k in self._cache:
            data, ts = self._cache[k]
            if time.time() - ts < self._ttl:
                self.hits += 1
                return data
            del self._cache[k]
        self.misses += 1
        return None

    def set(self, model: str, messages: list, response: dict, **kwargs):
        k = self._key(model, messages, **kwargs)
        self._cache[k] = (response, time.time())

cache = LLMCache(ttl_seconds=1800)

def cached_completion(messages, model="anthropic/claude-3.5-sonnet", **kwargs):
    """Only cache deterministic requests (temperature=0)."""
    kwargs.setdefault("temperature", 0)
    kwargs.setdefault("max_tokens", 1024)

    cached = cache.get(model, messages, **kwargs)
    if cached:
        return cached

    response = client.chat.completions.create(model=model, messages=messages, **kwargs)
    result = {
        "content": response.choices[0].message.content,
        "model": response.model,
        "usage": {"prompt": response.usage.prompt_tokens, "completion": response.usage.completion_tokens},
    }
    cache.set(model, messages, result, **kwargs)
    return result

Persistent Cache with Redis

import redis, json, hashlib

r = redis.Redis(host="localhost", port=6379, db=0)

def redis_cached_completion(messages, model="openai/gpt-4o-mini", ttl=3600, **kwargs):
    """Cache in Redis with automatic TTL expiry."""
    kwargs["temperature"] = 0  # Must be deterministic
    key = f"or:{hashlib.sha256(json.dumps({'m': model, 'msgs': messages, **kwargs}, sort_keys=True).encode()).hexdigest()}"

    cached = r.get(key)
    if cached:
        return json.loads(cached)

    response = client.chat.completions.create(model=model, messages=messages, **kwargs)
    result = {
        "content": response.choices[0].message.content,
        "model": response.model,
        "tokens": response.usage.prompt_tokens + response.usage.completion_tokens,
    }
    r.setex(key, ttl, json.dumps(result))
    return result

Anthropic Prompt Caching via OpenRouter

Anthropic models on OpenRouter support prompt caching -- large system prompts are cached server-side, reducing input cost by 90% on cache hits.

# Mark large static content blocks with cache_control
response = client.chat.completions.create(
    model="anthropic/claude-3.5-sonnet",
    messages=[
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "You are an expert. Here is the full source:\n" + large_context,
                    "cache_control": {"type": "ephemeral"},  # Cache this block
                }
            ],
        },
        {"role": "user", "content": "What does the main() function do?"},
    ],
    max_tokens=1024,
)
# First call: cache_creation_input_tokens charged at 1.25x
# Subsequent: cache_read_input_tokens charged at 0.1x (90% savings)

Cache Key Design

def cache_key(model: str, messages: list, **params) -> str:
    """Deterministic cache key. Include everything that affects output.

    Include: model ID (with variant like :floor), messages, temperature,
    max_tokens, top_p, transforms, provider routing.
    Exclude: stream (doesn't affect content), HTTP-Referer, X-Title.
    """
    canonical = json.dumps({
        "model": model, "messages": messages,
        "temperature": params.get("temperature", 0),
        "max_tokens": params.get("max_tokens"),
        "top_p": params.get("top_p"),
    }, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

Cache Invalidation

Trigger	Action	Why
Model version update	Flush keys for that model	New version may give different outputs
System prompt change	Flush all keys	Output semantics changed
TTL expiry	Automatic eviction	Prevents stale data
Manual purge	`r.delete(key)` or clear by prefix	Debugging or policy change

Error Handling

Error	Cause	Fix
Stale cache response	TTL too long	Reduce TTL or version cache keys
Cache miss storm	Cold start or invalidation	Warm cache with common queries at deploy
Redis connection error	Redis down	Fall through to direct API call
Non-deterministic cache	`temperature > 0` cached	Only cache when `temperature=0`

Enterprise Considerations

Only cache deterministic requests (temperature=0) -- non-zero temperatures produce different outputs each time
Use Anthropic prompt caching for large system prompts (RAG context) -- 90% cost reduction on cache hits
Set TTL based on content freshness needs (30 min for dynamic, 24h for reference data)
Track cache hit rate to justify caching infrastructure cost
Use Redis or Memcached for multi-instance deployments; in-memory only works for single-process
Version cache keys when updating system prompts or switching model versions

References

Examples | Errors
Prompt Caching | Models API

信息

Category 编程开发

Name openrouter-caching-strategy

版本 v20260423

大小 10.38KB

Source jeremylongshore/claude-code-plugins-plus-skills

更新时间 2026-04-28