Use this skill for local model evaluation, backend selection, and GPU smoke tests outside the Hugging Face Jobs workflow.
This skill is for running evaluations against models on the Hugging Face Hub on local hardware.
It covers:
- inspect-ai with local inference
- lighteval with local inference
- vllm, Hugging Face Transformers, and accelerate
It does not cover:
- model-index edits
- eval_results generation or publishing

If the user wants to run the same eval remotely on Hugging Face Jobs, hand off to the hugging-face-jobs skill and pass it one of the local scripts in this skill.
If the user wants to publish results into the community evals workflow, stop after generating the evaluation run and hand off that publishing step to ~/code/community-evals.
All paths below are relative to the directory containing this SKILL.md.
| Use case | Script |
|---|---|
| Local inspect-ai eval on a Hub model via inference providers | `scripts/inspect_eval_uv.py` |
| Local GPU eval with inspect-ai using vllm or Transformers | `scripts/inspect_vllm_uv.py` |
| Local GPU eval with lighteval using vllm or accelerate | `scripts/lighteval_vllm_uv.py` |
| Extra command patterns | `examples/USAGE_EXAMPLES.md` |
Requirements:
- `uv run` for local execution.
- `HF_TOKEN` for gated/private models.

Quick checks:

```shell
uv --version
printenv HF_TOKEN >/dev/null
nvidia-smi
```
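The checks above can be wrapped in a small preflight sketch; `check_tool` is a hypothetical helper name, not part of this skill's scripts:

```shell
# Hypothetical preflight sketch: report which prerequisites are present
# before picking an eval path.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok: $1"
  else
    echo "missing: $1"
  fi
}

check_tool uv          # required for all local runs
check_tool nvidia-smi  # present only on NVIDIA GPU hosts
if [ -n "${HF_TOKEN:-}" ]; then
  echo "ok: HF_TOKEN"
else
  echo "missing: HF_TOKEN"   # gated/private models will fail to download
fi
```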
If nvidia-smi is unavailable, either:
- use `scripts/inspect_eval_uv.py` for lighter provider-backed evaluation, or
- hand off to the hugging-face-jobs skill if the user wants remote compute.

Choosing a framework and backend:
- inspect-ai when you want explicit task control and inspect-native flows.
- lighteval when the benchmark is naturally expressed as a lighteval task string, especially leaderboard-style tasks.
- vllm for throughput on supported architectures.
- Transformers (`--backend hf`) or accelerate as compatibility fallbacks.

For smoke tests:
- inspect-ai: add `--limit 10` or similar.
- lighteval: add `--max-samples 10`.
- Remote: hugging-face-jobs with the same script + args.

Best when the model is already supported by Hugging Face Inference Providers and you want the lowest local setup overhead.
```shell
uv run scripts/inspect_eval_uv.py \
  --model meta-llama/Llama-3.2-1B \
  --task mmlu \
  --limit 20
```
Use this path when the task comes from inspect-evals.
Best when you need to load the Hub model directly, use vllm, or fall back to Transformers for unsupported architectures.
Local GPU:
```shell
uv run scripts/inspect_vllm_uv.py \
  --model meta-llama/Llama-3.2-1B \
  --task gsm8k \
  --limit 20
```
Transformers fallback:
```shell
uv run scripts/inspect_vllm_uv.py \
  --model microsoft/phi-2 \
  --task mmlu \
  --backend hf \
  --trust-remote-code \
  --limit 20
```
Best when the task is naturally expressed as a lighteval task string, especially Open LLM Leaderboard style benchmarks.
Local GPU:
```shell
uv run scripts/lighteval_vllm_uv.py \
  --model meta-llama/Llama-3.2-3B-Instruct \
  --tasks "leaderboard|mmlu|5,leaderboard|gsm8k|5" \
  --max-samples 20 \
  --use-chat-template
```
accelerate fallback:
```shell
uv run scripts/lighteval_vllm_uv.py \
  --model microsoft/phi-2 \
  --tasks "leaderboard|mmlu|5" \
  --backend accelerate \
  --trust-remote-code \
  --max-samples 20
```
This skill intentionally stops at local execution and backend selection.
If the user wants to run on remote compute, switch to the hugging-face-jobs skill and pass it one of these scripts plus the chosen arguments.
inspect-ai task examples:
- `mmlu`
- `gsm8k`
- `hellaswag`
- `arc_challenge`
- `truthfulqa`
- `winogrande`
- `humaneval`
lighteval task strings use `suite|task|num_fewshot`:
- `leaderboard|mmlu|5`
- `leaderboard|gsm8k|5`
- `leaderboard|arc_challenge|25`
- `lighteval|hellaswag|0`

Multiple lighteval tasks can be comma-separated in `--tasks`.
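As an illustration of the `suite|task|num_fewshot` format, a small helper (hypothetical, not part of this skill's scripts) can assemble a comma-separated `--tasks` value:

```shell
# Hypothetical helper: build a lighteval --tasks string from a suite,
# a few-shot count, and a list of task names.
lighteval_tasks() {
  suite="$1"; fewshot="$2"; shift 2
  out=""
  for t in "$@"; do
    out="${out:+$out,}${suite}|${t}|${fewshot}"
  done
  echo "$out"
}

lighteval_tasks leaderboard 5 mmlu gsm8k
# -> leaderboard|mmlu|5,leaderboard|gsm8k|5
```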
Backend selection summary:
- `inspect_vllm_uv.py --backend vllm` for fast GPU inference on supported architectures.
- `inspect_vllm_uv.py --backend hf` when vllm does not support the model.
- `lighteval_vllm_uv.py --backend vllm` for throughput on supported models.
- `lighteval_vllm_uv.py --backend accelerate` as the compatibility fallback.
- `inspect_eval_uv.py` when Inference Providers already cover the model and you do not need direct GPU control.

| Model size | Suggested local hardware |
|---|---|
| < 3B | consumer GPU / Apple Silicon / small dev GPU |
| 3B - 13B | stronger local GPU |
| 13B+ | high-memory local GPU or hand off to hugging-face-jobs |
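The backend rules above can be sketched as a small decision helper; `choose_backend` is a hypothetical name, and the real choice also depends on hardware:

```shell
# Hypothetical decision sketch: pick a --backend value from the
# framework and whether vllm supports the model's architecture.
choose_backend() {
  framework="$1"; vllm_ok="$2"
  if [ "$vllm_ok" = "yes" ]; then
    echo "vllm"                 # fastest path when supported
  elif [ "$framework" = "lighteval" ]; then
    echo "accelerate"           # lighteval's compatibility fallback
  else
    echo "hf"                   # inspect-ai's Transformers fallback
  fi
}

choose_backend inspect-ai no   # -> hf
```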
For smoke tests, prefer cheaper local runs plus `--limit` or `--max-samples`.
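Because the two frameworks cap samples with different flags, a tiny helper (illustrative only) can pick the right one:

```shell
# Hypothetical helper: return the sample-cap flag for a framework.
smoke_flag() {
  case "$1" in
    inspect-ai) echo "--limit $2" ;;
    lighteval)  echo "--max-samples $2" ;;
    *) echo "unknown framework: $1" >&2; return 1 ;;
  esac
}

smoke_flag lighteval 20   # -> --max-samples 20
```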
Common tuning knobs and fallbacks:
- `--batch-size` to trade throughput against memory.
- `--gpu-memory-utilization` for vllm memory headroom.
- hugging-face-jobs when the model is too large for local hardware.
- If vllm fails on a model:
  - `--backend hf` for inspect-ai
  - `--backend accelerate` for lighteval
- `HF_TOKEN` for gated/private models.
- `--trust-remote-code` for models with custom code.
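The vllm fallback pattern can be automated; this is a sketch under the assumption that the eval script exits nonzero when vllm rejects the model (`run_with_fallback` is a hypothetical name, and with this skill's scripts you would wrap the full `uv run` invocation):

```shell
# Hypothetical sketch: try a vllm run first, then retry the same
# script with the framework's compatibility backend on failure.
run_with_fallback() {
  script="$1"; fallback="$2"; shift 2
  if ! "$script" "$@" --backend vllm; then
    echo "vllm failed; retrying with --backend $fallback" >&2
    "$script" "$@" --backend "$fallback"
  fi
}
```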
See:
- `examples/USAGE_EXAMPLES.md` for local command patterns
- `scripts/inspect_eval_uv.py`
- `scripts/inspect_vllm_uv.py`
- `scripts/lighteval_vllm_uv.py`