Skill UI
Browse and discover 6,191+ curated skills
Search: "Throughput" (6 results)
Enterprise Miles RL (miles-rl-training)
Orchestra-Research/AI-Research-SKILLs · 303 downloads
Guidance for training large multimodal MoE models with miles, covering FP8/INT4 quantization-aware regimes, train-inference alignment, speculative RL throughput tricks, and enterprise stability practices for production deployments.
Mistral Performance Tuning (mistral-performance-tuning)
jeremylongshore/claude-code-plugins-plus-skills · 134 downloads
Guides teams on reducing latency and improving throughput when integrating Mistral AI by selecting low-latency models, enabling streaming, caching deterministic requests, trimming prompts, and managing request concurrency.
Flash Attention Optimization (optimizing-attention-flash)
Orchestra-Research/AI-Research-SKILLs · 399 downloads
Speeds up transformer attention with Flash Attention techniques, offering 2-4× throughput gains and 10-20× memory reduction. Ideal for long-context models on PyTorch (native SDPA or the flash-attn library), H100 FP8, and sliding-window scenarios, to resolve GPU memory bottlenecks and improve inference.
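The memory reduction this entry cites comes from the online-softmax trick at the core of Flash Attention: scores are consumed one block at a time with a running max and normalizer, so the full score vector is never materialized. A dependency-free sketch for a single query (real implementations tile this over blocks on the GPU):

```python
import math

def online_softmax_attention(q, keys, values):
    """One-pass attention for a single query using the online-softmax
    recurrence behind Flash Attention: running max m, running
    normalizer l, and a rescaled accumulator."""
    scale = 1.0 / math.sqrt(len(q))
    m = float("-inf")              # running max of scores seen so far
    l = 0.0                        # running sum of exp(score - m)
    acc = [0.0] * len(values[0])   # unnormalized weighted sum of values
    for k, v in zip(keys, values):
        s = sum(qi * ki for qi, ki in zip(q, k)) * scale
        m_new = max(m, s)
        # Rescale previous partial sums when the running max changes.
        corr = math.exp(m - m_new) if m != float("-inf") else 0.0
        w = math.exp(s - m_new)
        l = l * corr + w
        acc = [a * corr + w * vi for a, vi in zip(acc, v)]
        m = m_new
    return [a / l for a in acc]
```

The result is numerically identical to ordinary softmax attention; only the order of operations changes, which is what lets the kernel keep O(1) extra state per query instead of O(sequence length).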
High-Throughput LLMs (serving-llms-vllm)
Orchestra-Research/AI-Research-SKILLs · 332 downloads
Deploy LLMs with vLLM to maximize throughput and minimize latency through PagedAttention, continuous batching, quantization, and tensor parallelism for production APIs or batch inference on constrained GPUs.
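Continuous batching, one of the vLLM techniques this entry lists, means sequences join and leave the batch at every decode step instead of waiting for the whole batch to finish. A toy scheduler illustrating just the scheduling idea (request names and slot counts are invented for the sketch):

```python
from collections import deque

def continuous_batching(requests, max_batch=2):
    """Toy continuous-batching loop: at every decode step, finished
    sequences free their slot immediately and waiting requests are
    admitted, so the batch stays as full as possible.
    Each request is (name, tokens_to_generate)."""
    waiting = deque(requests)
    running = {}    # name -> tokens still to generate
    trace = []      # (step, batch members), for illustration
    step = 0
    while waiting or running:
        # Admit waiting requests into free slots at the token boundary.
        while waiting and len(running) < max_batch:
            name, n = waiting.popleft()
            running[name] = n
        trace.append((step, sorted(running)))
        # One decode step: every running sequence emits one token.
        for name in list(running):
            running[name] -= 1
            if running[name] == 0:
                del running[name]   # slot freed without waiting for others
        step += 1
    return trace
```

With static batching, a short request stuck next to a long one would hold its slot idle until the whole batch drained; here "b" finishes at step 0 and "c" takes its slot at step 1.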
Speculative Decoding Acceleration (speculative-decoding)
Orchestra-Research/AI-Research-SKILLs · 418 downloads
Speeds up LLM inference by combining speculative decoding with Medusa multi-heads and lookahead Jacobi techniques, letting production services hit 1.5-3.6× throughput while keeping quality intact on limited hardware.
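The core draft-then-verify loop behind this entry's speedup can be shown with deterministic toy models: a cheap draft model proposes k tokens, the target model checks them, and the agreeing prefix plus one target token is kept, so output is always exactly what greedy decoding with the target alone would produce. A minimal sketch (both models are hypothetical next-token functions, not real LMs):

```python
def speculative_decode(target, draft, prompt, max_new, k=4):
    """Toy greedy speculative decoding. In a real system the k
    verification checks are a single batched target forward pass,
    which is where the throughput gain comes from; here we just
    count verification rounds."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < max_new:
        rounds += 1
        # Draft proposes k tokens autoregressively (cheap).
        ctx = list(out)
        proposal = []
        for _ in range(k):
            t = draft(ctx)
            proposal.append(t)
            ctx.append(t)
        # Verify: keep the agreeing prefix, then emit one target token
        # (a correction on mismatch, a free bonus token if all k match).
        ctx = list(out)
        accepted = []
        for t in proposal:
            if target(ctx) == t:
                accepted.append(t)
                ctx.append(t)
            else:
                accepted.append(target(ctx))
                break
        else:
            accepted.append(target(ctx))
        out.extend(accepted)
    return out[len(prompt):][:max_new], rounds
```

When the draft agrees with the target, each round yields k+1 tokens for one verification pass; when it always disagrees, the loop degrades gracefully to one (still correct) token per round.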
TensorRT-LLM Optimizer (tensorrt-llm)
Orchestra-Research/AI-Research-SKILLs · 307 downloads
Optimizes LLM inference on NVIDIA GPUs, delivering 10-100× faster throughput, sub-10ms latency, and multi-GPU scaling with FP8/INT4 quantization, in-flight batching, and production-ready serving for real-time deployments.