langchain-eval-harness
jeremylongshore/claude-code-plugins-plus-skills
This harness provides comprehensive, reproducible evaluation pipelines for complex LLM chains and agents built on LangChain/LangGraph 1.0. It integrates golden-dataset management, LangSmith evaluation runs, RAGAS metrics, deepeval's LLM-as-judge checks, and structured analysis of agent trajectories. Use it to establish quality benchmarks for new chains, diagnose performance regressions after a model switch, or implement CI/CD gates that block quality drops.
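The CI/CD-gate idea above can be sketched with a small, stdlib-only check: compare the current run's metric scores (e.g. produced by RAGAS or deepeval) against a stored golden baseline and fail the build on regressions beyond a tolerance. The function and metric names here are illustrative assumptions, not the harness's actual API.

```python
# Illustrative quality-gate step: `check_quality_gate`, the metric names,
# and the tolerance value are hypothetical, not part of this harness's API.

def check_quality_gate(
    scores: dict[str, float],
    baseline: dict[str, float],
    tolerance: float = 0.05,
) -> list[str]:
    """Return regression messages; an empty list means the gate passes."""
    failures = []
    for metric, base in baseline.items():
        current = scores.get(metric)
        if current is None:
            failures.append(f"{metric}: missing from current run")
        elif current < base - tolerance:
            failures.append(
                f"{metric}: {current:.3f} below baseline {base:.3f} - {tolerance}"
            )
    return failures

baseline = {"faithfulness": 0.90, "answer_relevancy": 0.85}
current = {"faithfulness": 0.91, "answer_relevancy": 0.78}
print(check_quality_gate(current, baseline))
# → ['answer_relevancy: 0.780 below baseline 0.850 - 0.05']
```

In a CI pipeline, a non-empty failure list would exit non-zero so the merge is blocked until the regression is investigated.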