Skill UI
Browse and discover 6,191+ curated skills
Search: "Throughput" (6 results)
Enterprise Miles RL (miles-rl-training)
Orchestra-Research/AI-Research-SKILLs · 303 downloads
Guidance for training large multimodal MoE models with miles, covering FP8/INT4 quantization-aware regimes, train-inference alignment, speculative RL throughput tricks, and enterprise stability practices for production deployments.
Mistral Performance Tuning (mistral-performance-tuning)
jeremylongshore/claude-code-plugins-plus-skills · 134 downloads
Guides teams on reducing latency and improving throughput when integrating Mistral AI by selecting low-latency models, enabling streaming, caching deterministic requests, trimming prompts, and managing request concurrency.
Flash Attention Optimization (optimizing-attention-flash)
Orchestra-Research/AI-Research-SKILLs · 399 downloads
Speeds up transformer attention with Flash Attention techniques, offering 2-4× throughput gains and 10-20× memory reduction. Ideal for long-context models on PyTorch (native SDPA or the flash-attn library), H100 FP8, and sliding-window scenarios, to resolve GPU memory bottlenecks and improve inference.
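The memory reduction this entry cites comes from the online-softmax trick at the core of Flash Attention: scores are consumed one block at a time with a running max and normalizer, so the full score vector is never materialized. A dependency-free sketch for a single query (real implementations tile this over blocks on the GPU):

```python
import math

def online_softmax_attention(q, keys, values):
    """One-pass attention for a single query using the online-softmax
    recurrence behind Flash Attention: running max m, running
    normalizer l, and a rescaled accumulator."""
    scale = 1.0 / math.sqrt(len(q))
    m = float("-inf")              # running max of scores seen so far
    l = 0.0                        # running sum of exp(score - m)
    acc = [0.0] * len(values[0])   # unnormalized weighted sum of values
    for k, v in zip(keys, values):
        s = sum(qi * ki for qi, ki in zip(q, k)) * scale
        m_new = max(m, s)
        # Rescale previous partial sums when the running max changes.
        corr = math.exp(m - m_new) if m != float("-inf") else 0.0
        w = math.exp(s - m_new)
        l = l * corr + w
        acc = [a * corr + w * vi for a, vi in zip(acc, v)]
        m = m_new
    return [a / l for a in acc]
```

The result is numerically identical to ordinary softmax attention; only the order of operations changes, which is what lets the kernel keep O(1) extra state per query instead of O(sequence length).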
High-Throughput LLMs (serving-llms-vllm)
Orchestra-Research/AI-Research-SKILLs · 332 downloads
Deploy LLMs with vLLM to maximize throughput and minimize latency through PagedAttention, continuous batching, quantization, and tensor parallelism for production APIs or batch inference on constrained GPUs.
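Continuous batching, one of the vLLM techniques this entry lists, means sequences join and leave the batch at every decode step instead of waiting for the whole batch to finish. A toy scheduler illustrating just the scheduling idea (request names and slot counts are invented for the sketch):

```python
from collections import deque

def continuous_batching(requests, max_batch=2):
    """Toy continuous-batching loop: at every decode step, finished
    sequences free their slot immediately and waiting requests are
    admitted, so the batch stays as full as possible.
    Each request is (name, tokens_to_generate)."""
    waiting = deque(requests)
    running = {}    # name -> tokens still to generate
    trace = []      # (step, batch members), for illustration
    step = 0
    while waiting or running:
        # Admit waiting requests into free slots at the token boundary.
        while waiting and len(running) < max_batch:
            name, n = waiting.popleft()
            running[name] = n
        trace.append((step, sorted(running)))
        # One decode step: every running sequence emits one token.
        for name in list(running):
            running[name] -= 1
            if running[name] == 0:
                del running[name]   # slot freed without waiting for others
        step += 1
    return trace
```

With static batching, a short request stuck next to a long one would hold its slot idle until the whole batch drained; here "b" finishes at step 0 and "c" takes its slot at step 1.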
Speculative Decoding Acceleration (speculative-decoding)
Orchestra-Research/AI-Research-SKILLs · 418 downloads
Speeds up LLM inference by combining speculative decoding with Medusa multi-heads and lookahead Jacobi techniques, letting production services hit 1.5-3.6× throughput while keeping quality intact on limited hardware.
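The core draft-then-verify loop behind this entry's speedup can be shown with deterministic toy models: a cheap draft model proposes k tokens, the target model checks them, and the agreeing prefix plus one target token is kept, so output is always exactly what greedy decoding with the target alone would produce. A minimal sketch (both models are hypothetical next-token functions, not real LMs):

```python
def speculative_decode(target, draft, prompt, max_new, k=4):
    """Toy greedy speculative decoding. In a real system the k
    verification checks are a single batched target forward pass,
    which is where the throughput gain comes from; here we just
    count verification rounds."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < max_new:
        rounds += 1
        # Draft proposes k tokens autoregressively (cheap).
        ctx = list(out)
        proposal = []
        for _ in range(k):
            t = draft(ctx)
            proposal.append(t)
            ctx.append(t)
        # Verify: keep the agreeing prefix, then emit one target token
        # (a correction on mismatch, a free bonus token if all k match).
        ctx = list(out)
        accepted = []
        for t in proposal:
            if target(ctx) == t:
                accepted.append(t)
                ctx.append(t)
            else:
                accepted.append(target(ctx))
                break
        else:
            accepted.append(target(ctx))
        out.extend(accepted)
    return out[len(prompt):][:max_new], rounds
```

When the draft agrees with the target, each round yields k+1 tokens for one verification pass; when it always disagrees, the loop degrades gracefully to one (still correct) token per round.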
TensorRT-LLM Optimizer (tensorrt-llm)
Orchestra-Research/AI-Research-SKILLs · 307 downloads
Optimizes LLM inference on NVIDIA GPUs, delivering 10-100× faster throughput, sub-10ms latency, and multi-GPU scaling with FP8/INT4 quantization, in-flight batching, and production-ready serving for real-time deployments.