eval-driven-dev
github/awesome-copilot
This skill guides the end-to-end process of building automated evaluation pipelines for Python applications powered by Large Language Models (LLMs). It covers how to define evaluation criteria, instrument the application, build golden datasets, and run real-world evaluations with tools like pixie, ensuring that the application's own logic (not just the LLM) is rigorously tested. Ideal for QA, benchmarking, and improving LLM-based services.
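The workflow above (criteria, golden dataset, evaluation run) can be sketched as a minimal harness. Everything here is illustrative: `answer_question`, `GOLDEN_SET`, and the exact-match scorer are assumed stand-ins, not the API of pixie or any specific library.

```python
"""Minimal sketch of a golden-dataset evaluation loop.

All names (answer_question, GOLDEN_SET, exact_match) are illustrative
assumptions, not part of any particular evaluation tool.
"""

def answer_question(question: str) -> str:
    # Stand-in for the LLM-powered application under test.
    # A real app would call a model; canned answers keep this runnable.
    canned = {
        "capital of France?": "Paris",
        "2 + 2?": "4",
    }
    return canned.get(question, "unknown")

# Golden dataset: curated (input, expected-output) pairs.
GOLDEN_SET = [
    {"input": "capital of France?", "expected": "Paris"},
    {"input": "2 + 2?", "expected": "4"},
]

def exact_match(predicted: str, expected: str) -> bool:
    # One evaluation criterion; real pipelines add fuzzier scorers
    # (semantic similarity, LLM-as-judge, etc.).
    return predicted.strip().lower() == expected.strip().lower()

def run_evals(dataset) -> float:
    # Run every golden case through the app and return the pass rate.
    results = [
        exact_match(answer_question(case["input"]), case["expected"])
        for case in dataset
    ]
    return sum(results) / len(results)

if __name__ == "__main__":
    print(f"pass rate: {run_evals(GOLDEN_SET):.2f}")
```

A real pipeline would typically wire `run_evals` into CI so that regressions in the application's prompt templates or routing logic fail the build, independent of model changes.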