llm-evaluation
sickn33/antigravity-awesome-skills
Master comprehensive strategies for evaluating Large Language Model (LLM) applications. This skill covers automated metrics (BLEU, ROUGE, BERTScore), structured human-evaluation dimensions, and advanced LLM-as-Judge techniques. Use it to systematically measure model performance, compare prompts, detect regressions, and build confidence in production systems.
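To make the automated-metric idea concrete, here is a minimal, dependency-free sketch of ROUGE-1 F1 (unigram overlap between a candidate and a reference); the function name and example strings are illustrative, not part of this skill's API, and production use would typically rely on an established metrics library instead:

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each unigram counts at most as often as it
    # appears in the reference.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# 5 of 6 unigrams match in each direction -> F1 = 5/6 ≈ 0.833
print(round(rouge1_f1("the cat sat on the mat",
                      "the cat is on the mat"), 3))  # → 0.833
```

The same clipped-overlap pattern extends to ROUGE-2 (bigrams) and to BLEU's modified n-gram precision; BERTScore replaces exact token matching with embedding similarity.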