# Phoenix Evals
Build evaluators for AI/LLM applications. Code first, LLM for nuance, validate against humans.
## Quick Reference

| Task | Files |
|---|---|
| Setup | setup-python, setup-typescript |
| Decide what to evaluate | evaluators-overview |
| Choose a judge model | fundamentals-model-selection |
| Use pre-built evaluators | evaluators-pre-built |
| Build code evaluator | evaluators-code-python, evaluators-code-typescript |
| Build LLM evaluator | evaluators-llm-python, evaluators-llm-typescript, evaluators-custom-templates |
| Batch evaluate DataFrame (sketch below) | evaluate-dataframe-python |
| Run experiment | experiments-running-python, experiments-running-typescript |
| Create dataset | experiments-datasets-python, experiments-datasets-typescript |
| Generate synthetic data | experiments-synthetic-python, experiments-synthetic-typescript |
| Validate evaluator accuracy | validation, validation-evaluators-python, validation-evaluators-typescript |
| Sample traces for review | observe-sampling-python, observe-sampling-typescript |
| Analyze errors | error-analysis, error-analysis-multi-turn, axial-coding |
| RAG evals | evaluators-rag |
| Avoid common mistakes | common-mistakes-python, fundamentals-anti-patterns |
| Production | production-overview, production-guardrails, production-continuous |
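For the pre-built evaluator and "Batch evaluate DataFrame" tasks, here is a minimal sketch using the `llm_classify` API from `arize-phoenix-evals`. The `gpt-4o` model name and the DataFrame rows are illustrative, and exact imports or constructor arguments may differ across library versions:

```python
# Minimal sketch: batch-evaluating a DataFrame with a pre-built hallucination evaluator.
# Assumes `arize-phoenix-evals` is installed and OPENAI_API_KEY is set; column names
# must match the template's variables ({input}, {reference}, {output}).
import pandas as pd
from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

df = pd.DataFrame(
    {
        "input": ["What is the capital of France?"],
        "reference": ["France's capital is Paris."],
        "output": ["The capital of France is Paris."],
    }
)

rails = list(HALLUCINATION_PROMPT_RAILS_MAP.values())  # binary rails for the judge
results = llm_classify(
    dataframe=df,
    template=HALLUCINATION_PROMPT_TEMPLATE,
    model=OpenAIModel(model="gpt-4o"),  # model name is illustrative
    rails=rails,                # constrain the judge to the binary labels
    provide_explanation=True,   # keep the judge's reasoning for human review
)
print(results[["label", "explanation"]])
```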
## Workflows

**Starting Fresh:**
observe-tracing-setup → error-analysis → axial-coding → evaluators-overview

**Building Evaluator:**
fundamentals → common-mistakes-python → evaluators-{code|llm}-{python|typescript} → validation-evaluators-{python|typescript}

**RAG Systems:**
evaluators-rag → evaluators-code-* (retrieval; sketched below) → evaluators-llm-* (faithfulness)

**Production:**
production-overview → production-guardrails → production-continuous
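For the RAG workflow, retrieval quality can usually be scored with deterministic code before reaching for an LLM judge on faithfulness. A framework-agnostic sketch; the function names and the recall threshold are illustrative, not part of the Phoenix API:

```python
# Minimal, framework-agnostic sketch of code-first retrieval checks.
# `retrieval_recall`, `contains_citation`, and the 0.5 threshold are illustrative;
# plug the functions into whatever evaluation runner you use.
import re


def retrieval_recall(retrieved_ids: list[str], relevant_ids: list[str]) -> float:
    """Fraction of known-relevant documents that appear in the retrieved set."""
    if not relevant_ids:
        return 1.0
    retrieved = set(retrieved_ids)
    hits = sum(1 for doc_id in relevant_ids if doc_id in retrieved)
    return hits / len(relevant_ids)


def contains_citation(output: str) -> bool:
    """Deterministic pass/fail: the answer must cite at least one source like [1]."""
    return bool(re.search(r"\[\d+\]", output))


# Binary pass/fail per the "Binary > Likert" principle.
passed = retrieval_recall(["doc-3", "doc-7"], ["doc-3", "doc-9"]) >= 0.5
print(passed)  # True: 1 of 2 relevant docs was retrieved
```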
## Reference Categories

| Prefix | Description |
|---|---|
| fundamentals-* | Types, scores, anti-patterns |
| observe-* | Tracing, sampling |
| error-analysis-* | Finding failures |
| axial-coding-* | Categorizing failures |
| evaluators-* | Code, LLM, RAG evaluators |
| experiments-* | Datasets, running experiments |
| validation-* | Validating evaluator accuracy against human labels |
| production-* | CI/CD, monitoring |
## Key Principles

| Principle | Action |
|---|---|
| Error analysis first | Can't automate what you haven't observed |
| Custom > generic | Build from your failures |
| Code first | Deterministic before LLM |
| Validate judges | >80% TPR/TNR against human labels |
| Binary > Likert | Pass/fail, not 1-5 |
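A judge only earns trust once its labels are checked against human ones. A minimal sketch of the >80% TPR/TNR check; the column names and toy labels are illustrative:

```python
# Minimal sketch: validating a binary LLM judge against human labels.
# The "human"/"judge" column names, toy labels, and 0.8 bar are illustrative.
import pandas as pd

labels = pd.DataFrame(
    {
        "human": ["pass", "pass", "fail", "fail", "pass", "fail"],
        "judge": ["pass", "fail", "fail", "fail", "pass", "pass"],
    }
)

tp = ((labels["human"] == "pass") & (labels["judge"] == "pass")).sum()
fn = ((labels["human"] == "pass") & (labels["judge"] == "fail")).sum()
tn = ((labels["human"] == "fail") & (labels["judge"] == "fail")).sum()
fp = ((labels["human"] == "fail") & (labels["judge"] == "pass")).sum()

tpr = tp / (tp + fn)  # true positive rate: judge agrees with human "pass"
tnr = tn / (tn + fp)  # true negative rate: judge agrees with human "fail"
print(f"TPR={tpr:.2%}  TNR={tnr:.2%}")
if tpr <= 0.8 or tnr <= 0.8:
    print("Judge is below the >80% bar; iterate on the prompt before trusting its scores")
```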