evaluating-llms-harness
Orchestra-Research/AI-Research-SKILLs
Runs lm-evaluation-harness across 60+ academic benchmarks such as MMLU, HumanEval, GSM8K, TruthfulQA, and HellaSwag to benchmark Hugging Face, vLLM, and API-served models, compare model variants, track training checkpoints, and save standardized reports for research labs.
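The workflow above boils down to a single lm-evaluation-harness CLI call. A minimal sketch, assuming the harness is installed (`pip install lm-eval`); the model name, task list, and output path below are illustrative placeholders, not fixed choices of this skill:

```shell
# Evaluate a Hugging Face model on a few benchmarks and save a
# standardized JSON report under results/.
lm_eval \
  --model hf \
  --model_args pretrained=EleutherAI/pythia-160m \
  --tasks mmlu,gsm8k,hellaswag \
  --batch_size 8 \
  --output_path results/pythia-160m
```

Swapping `--model hf` for `--model vllm` (with the same `--model_args pretrained=...`) runs the evaluation through a vLLM backend instead; API-served models use the harness's API model types.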