Skill UI: Browse and discover 5145+ curated skills

A search for "inference" found 33 results.

AWQ Weight Quantization (awq-quantization)
Orchestra-Research/AI-Research-SKILLs · 151 downloads
AWQ provides activation-aware 4-bit quantization for large language models, delivering roughly 3x inference speedup with under 5% accuracy loss, so instruction-tuned and multimodal models can run on memory-constrained GPUs via vLLM integration and Marlin kernels.
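As a quick illustration of the vLLM integration this card mentions, here is a minimal sketch of loading an AWQ-quantized checkpoint; the model ID is only an example, not something the skill prescribes.

```python
# Minimal sketch: serving a 4-bit AWQ checkpoint with vLLM.
# The checkpoint name is an example; any AWQ-quantized model
# (e.g. one produced with AutoAWQ) should work.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",  # example AWQ checkpoint
    quantization="awq",  # select vLLM's AWQ kernels (Marlin where supported)
)
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Summarize activation-aware quantization."], params)
print(outputs[0].outputs[0].text)
```
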
Batch Inference Pipeline (batch-inference-pipeline)
jeremylongshore/claude-code-plugins-plus-skills · 50 downloads
Guides ML teams through automated batch inference pipelines, suggesting best practices, monitoring, and production-readiness checks, and generating code and configs for deployment.
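For context on what such a pipeline automates, a bare-bones batch loop might look like the sketch below; `predict_batch` is a hypothetical stand-in for any real serving call.

```python
# Bare-bones batch inference loop: chunk inputs, call the model once per
# batch, collect outputs. `predict_batch` is a placeholder for a real
# model call (REST endpoint, local model, etc.).
from typing import Callable, Iterator

def batched(items: list[str], size: int) -> Iterator[list[str]]:
    for i in range(0, len(items), size):
        yield items[i:i + size]

def run_pipeline(
    inputs: list[str],
    predict_batch: Callable[[list[str]], list[str]],
    batch_size: int = 32,
) -> list[str]:
    results: list[str] = []
    for batch in batched(inputs, batch_size):
        results.extend(predict_batch(batch))  # one model call per batch
    return results

# Example with a dummy model:
print(run_pipeline(["a", "b", "c"], lambda xs: [x.upper() for x in xs], batch_size=2))
```
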
Groq Cost Tuning (groq-cost-tuning)
jeremylongshore/claude-code-plugins-plus-skills · 375 downloads
Guide to reducing Groq inference spend by routing requests to cost-effective models, trimming token usage, caching repeated calls, batching requests, and setting spend limits for Groq Cloud billing scenarios.
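To make the routing idea concrete, here is a hedged sketch using the groq Python SDK; the model IDs and the prompt-length heuristic are assumptions for illustration, not part of the skill.

```python
# Sketch of cost-aware routing: cheap small model by default, escalate
# long prompts to the large one, and cap output tokens to bound spend.
# Model IDs are illustrative and may change; check Groq's current list.
import os
from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

def complete(prompt: str) -> str:
    # Crude heuristic: route by prompt length; a real router might
    # classify task complexity instead.
    model = "llama-3.3-70b-versatile" if len(prompt) > 2000 else "llama-3.1-8b-instant"
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,  # trim output tokens to control cost
    )
    return resp.choices[0].message.content
```
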
Groq Reference Architecture (groq-reference-architecture)
jeremylongshore/claude-code-plugins-plus-skills · 54 downloads
Defines a best-practice Groq deployment with tiered model routing, middleware, streaming pipelines, and fallback chains for ultra-fast LLM inference and production monitoring when launching new Groq integrations.
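The fallback-chain pattern this card describes might be sketched as below; the model tiers are assumptions, and error handling is deliberately coarse.

```python
# Minimal fallback chain: try tiered models in order; on an error
# (rate limit, timeout, outage) fall through to the next tier.
import os
from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])
TIERS = ["llama-3.3-70b-versatile", "llama-3.1-8b-instant"]  # primary, fallback

def complete_with_fallback(prompt: str) -> str:
    last_err: Exception | None = None
    for model in TIERS:
        try:
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            )
            return resp.choices[0].message.content
        except Exception as err:
            last_err = err  # try the next tier
    raise RuntimeError("all fallback tiers failed") from last_err
```
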
GroqCloud Automation Suite (groqcloud-automation)
ComposioHQ/awesome-claude-skills · 302 downloads
GroqCloud Automation orchestrates high-performance GroqCloud APIs through Composio, covering inference, chat completions, audio translation, and TTS voice selection for production workflows.
Inference Latency Profiler (inference-latency-profiler)
jeremylongshore/claude-code-plugins-plus-skills · 208 downloads
Automates inference latency profiling in ML deployment scenarios, offering step-by-step guidance on model serving, MLOps pipelines, monitoring, and production optimization, and generating production-ready code validated against best practices.
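As a rough picture of what such profiling involves, the sketch below times repeated calls and reports tail latencies; `infer` is a hypothetical stand-in for any serving call.

```python
# Time repeated inference calls and report p50/p95/p99 latency in ms.
import time
import statistics
from typing import Callable

def profile_latency(infer: Callable[[str], str], prompt: str, runs: int = 100) -> dict[str, float]:
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        infer(prompt)  # the call under test
        samples.append((time.perf_counter() - start) * 1000.0)
    cuts = statistics.quantiles(samples, n=100)  # 99 percentile cut points
    return {"p50": statistics.median(samples), "p95": cuts[94], "p99": cuts[98]}

# Example with a dummy 10 ms "model":
print(profile_latency(lambda _: time.sleep(0.01) or "", "hello", runs=20))
```
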
LLM Knowledge Distillation (knowledge-distillation)
Orchestra-Research/AI-Research-SKILLs · 81 downloads
Compress large language models via teacher-student distillation, covering temperature scaling, soft targets, reverse KLD, and response distillation so you can deploy smaller LLMs with GPT-4-level behavior and lower inference cost.
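To ground the temperature-scaling and soft-target terms, here is the classic (Hinton-style) distillation loss as a PyTorch sketch; the skill itself also covers variants such as reverse KLD that this snippet does not show.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      T: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    # Soft targets: KL between temperature-softened teacher and student.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # T^2 keeps soft-target gradients on a comparable scale
    hard = F.cross_entropy(student_logits, labels)  # usual supervised loss
    return alpha * soft + (1.0 - alpha) * hard
```
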
Lambda Labs GPU Cloud (lambda-labs-gpu-cloud)
Orchestra-Research/AI-Research-SKILLs · 160 downloads
Lambda Labs GPU cloud offers reserved and on-demand instances with SSH access, persistent filesystems, and 1-Click multi-node clusters, making it ideal for long-running training and inference workloads that need high-performance GPUs.
llama.cpp CPU Inference (llama-cpp)
Orchestra-Research/AI-Research-SKILLs · 382 downloads
Deploy llama.cpp to run LLM inference across CPUs, Apple Silicon, and non-NVIDIA GPUs, making it ideal for edge devices or CUDA-free setups with GGUF quantization for faster, lower-memory results.
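A minimal sketch of CPU inference over a GGUF file via the llama-cpp-python bindings; the model path is a placeholder for whatever quantized checkpoint you have locally.

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/model-q4_k_m.gguf",  # placeholder GGUF file
    n_ctx=4096,    # context window
    n_threads=8,   # CPU threads
)
out = llm("Q: What is GGUF? A:", max_tokens=64, stop=["Q:"])
print(out["choices"][0]["text"])
```
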
LlamaGuard Content Moderation (llamaguard)
Orchestra-Research/AI-Research-SKILLs · 441 downloads
LlamaGuard is Meta's 7–8B safety-specialized LLM that filters both prompts and responses by classifying six threat categories, enabling fast inference via vLLM/SageMaker and integration into NeMo Guardrails for end-to-end moderation.
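Classifying a prompt follows the usual Hugging Face generate flow; a hedged sketch is below (the checkpoint is access-gated on Hugging Face, and the model's chat template handles the moderation prompt formatting).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/LlamaGuard-7b"  # access-gated checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

chat = [{"role": "user", "content": "How do I pick a lock?"}]
input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)
output = model.generate(input_ids=input_ids, max_new_tokens=24)
# Prints "safe", or "unsafe" plus the violated category code.
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```
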
Long Context Extensions (long-context)
Orchestra-Research/AI-Research-SKILLs · 438 downloads
Extends transformer models' context windows using RoPE, YaRN, ALiBi, and interpolation so LLMs can process documents of 32k–128k+ tokens, extrapolate to longer lengths, and deploy efficient positional encodings and bias strategies for fine-tuning or inference.
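For instance, linear position interpolation (the "interpolation" this card mentions) can be enabled through a rope_scaling config in transformers; the checkpoint and scaling factor below are examples, and quality at the extended lengths generally needs fine-tuning.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # example RoPE-based checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Linear position interpolation: stretch a 4k window toward 16k by
# mapping position m to m/4 before computing RoPE angles.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    rope_scaling={"type": "linear", "factor": 4.0},
)
```
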
Mamba Selective State Models (mamba-architecture)
Orchestra-Research/AI-Research-SKILLs · 491 downloads
Mamba provides selective state-space models with O(n) inference complexity, letting you handle million-token sequences faster than transformers while skipping KV caches and benefiting from a hardware-aware design. Use it for long-context language modeling, streaming applications, and scalable low-memory sequence learners.
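To see why inference is O(n) with constant memory, consider the toy (non-selective, single-channel) state-space recurrence below: each token updates a fixed-size state instead of growing a KV cache. Mamba additionally makes A, B, and C input-dependent ("selective") and fuses the scan into a hardware-aware kernel; this sketch shows only the recurrence.

```python
import numpy as np

def ssm_scan(x: np.ndarray, A: np.ndarray, B: np.ndarray, C: np.ndarray) -> np.ndarray:
    h = np.zeros(A.shape[0])       # fixed-size state, independent of sequence length
    y = np.empty_like(x)
    for t, x_t in enumerate(x):    # one O(d_state) update per token
        h = A @ h + B * x_t
        y[t] = C @ h
    return y

# 16-token toy sequence, 4-dimensional state:
y = ssm_scan(np.random.randn(16), A=0.9 * np.eye(4), B=np.ones(4), C=np.ones(4) / 4)
```
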
Page 1 of 3