evaluating-llms-harness
Orchestra-Research/AI-Research-SKILLs
Evaluates large language models across 60+ academic benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag) using standardized prompts and metrics, making it easy to benchmark Hugging Face or API-served models, compare model releases, and track progress over the course of training.
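The description matches EleutherAI's lm-evaluation-harness. Assuming that is the underlying tool, a minimal sketch of a programmatic run might look like the following; the model checkpoint, task list, batch size, and example limit are illustrative choices, not values prescribed by this skill.

```python
# Minimal sketch of a benchmark run, assuming the skill wraps EleutherAI's
# lm-evaluation-harness (pip install lm-eval). Model and tasks are illustrative.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                      # Hugging Face transformers backend
    model_args="pretrained=EleutherAI/pythia-160m",  # any HF checkpoint id works here
    tasks=["hellaswag", "gsm8k"],                    # benchmark names from the task registry
    num_fewshot=0,                                   # zero-shot; tasks may define their own defaults
    batch_size=8,
    limit=50,                                        # cap examples per task for a quick smoke test
)

# Per-task metrics (accuracy, exact match, etc.) keyed by task name.
for task, metrics in results["results"].items():
    print(task, metrics)
```

The harness also ships a CLI entry point (`lm_eval --model hf --model_args pretrained=... --tasks hellaswag,gsm8k`) that runs the same evaluation, which is the usual way to compare checkpoints during training.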