
Phoenix Evaluators Toolkit

Phoenix Evals helps teams build evaluators for AI/LLM applications: deterministic code checks for what code can verify, LLM judges for nuanced criteria, Python and TypeScript support, validation against human labels, and workflows for error analysis, RAG evaluation, and production monitoring.
Overview

Phoenix Evals

Build evaluators for AI/LLM applications. Code first, LLM for nuance, validate against humans.
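
The split is easiest to see in code. A minimal Python sketch, assuming the `phoenix.evals` package (`llm_classify`, `OpenAIModel`) and a hypothetical question/response DataFrame; the template text, column names, and model name are illustrative, not prescribed by the toolkit.

```python
import json

import pandas as pd
from phoenix.evals import OpenAIModel, llm_classify

# Code first: deterministic properties need no LLM judge.
def is_valid_json(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

df = pd.DataFrame(
    {
        "question": ["Where is my order?"],
        "response": ['{"status": "shipped", "eta": "2 days"}'],
    }
)
df["valid_json"] = df["response"].map(is_valid_json)  # code evaluator

# LLM for nuance: a binary judge for a criterion code cannot express.
TONE_TEMPLATE = (
    "You are judging whether a support reply is polite and professional.\n"
    "[Question]: {question}\n"
    "[Response]: {response}\n"
    'Answer with a single word: "pass" or "fail".'
)
tone_df = llm_classify(
    dataframe=df,
    model=OpenAIModel(model="gpt-4o-mini"),  # older releases use model_name=
    template=TONE_TEMPLATE,   # placeholders must match DataFrame column names
    rails=["pass", "fail"],   # constrain the judge to binary labels
    provide_explanation=True, # keep rationales for later error analysis
)
```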

Quick Reference

Task: Files
Setup: setup-python, setup-typescript
Decide what to evaluate: evaluators-overview
Choose a judge model: fundamentals-model-selection
Use pre-built evaluators: evaluators-pre-built (sketch after this table)
Build code evaluator: evaluators-code-python, evaluators-code-typescript
Build LLM evaluator: evaluators-llm-python, evaluators-llm-typescript, evaluators-custom-templates
Batch evaluate DataFrame: evaluate-dataframe-python
Run experiment: experiments-running-python, experiments-running-typescript (sketch after this table)
Create dataset: experiments-datasets-python, experiments-datasets-typescript
Generate synthetic data: experiments-synthetic-python, experiments-synthetic-typescript
Validate evaluator accuracy: validation, validation-evaluators-python, validation-evaluators-typescript
Sample traces for review: observe-sampling-python, observe-sampling-typescript
Analyze errors: error-analysis, error-analysis-multi-turn, axial-coding
RAG evals: evaluators-rag
Avoid common mistakes: common-mistakes-python, fundamentals-anti-patterns
Production: production-overview, production-guardrails, production-continuous
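
For the "Use pre-built evaluators" and "Batch evaluate DataFrame" rows, a sketch assuming the classic `phoenix.evals` batch API (`run_evals` with `HallucinationEvaluator` and `QAEvaluator`); the column names follow the schema those evaluators expect, and the sample rows are made up.

```python
import pandas as pd
from phoenix.evals import HallucinationEvaluator, OpenAIModel, QAEvaluator, run_evals

# Pre-built judges expect "input", "output", and "reference" columns.
df = pd.DataFrame(
    {
        "input": ["What is the refund window?"],
        "output": ["Refunds are accepted within 30 days."],
        "reference": ["Our policy allows returns within 30 days of purchase."],
    }
)

judge = OpenAIModel(model="gpt-4o")  # choose deliberately; see fundamentals-model-selection
hallucination_df, qa_df = run_evals(
    dataframe=df,
    evaluators=[HallucinationEvaluator(judge), QAEvaluator(judge)],
    provide_explanation=True,  # judge rationales feed error analysis
)
# Each result DataFrame aligns row-for-row with df and carries a label plus explanation.
```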
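For the "Create dataset" and "Run experiment" rows, a sketch assuming the `phoenix.experiments` API of the `arize-phoenix` client and a running Phoenix server; `answer_question` is a hypothetical application entry point and `exact_match` a toy evaluator.

```python
import pandas as pd
import phoenix as px
from phoenix.experiments import run_experiment

client = px.Client()  # assumes a running Phoenix instance
dataset = client.upload_dataset(
    dataset_name="support-qa",
    dataframe=pd.DataFrame(
        {"question": ["Where is my order?"], "expected": ["It shipped yesterday."]}
    ),
    input_keys=["question"],
    output_keys=["expected"],
)

def task(input):
    # Run the application under test on one example's input.
    return answer_question(input["question"])  # hypothetical app entry point

def exact_match(output, expected) -> bool:
    # Evaluators are plain functions; Phoenix binds arguments by parameter name.
    return output.strip() == expected["expected"].strip()

experiment = run_experiment(dataset, task, evaluators=[exact_match])
```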

Workflows

Starting Fresh: observe-tracing-setup → error-analysis → axial-coding → evaluators-overview

Building Evaluator: fundamentals → common-mistakes-python → evaluators-{code|llm}-{python|typescript} → validation-evaluators-{python|typescript}

RAG Systems: evaluators-rag → evaluators-code-* (retrieval) → evaluators-llm-* (faithfulness); sketch below

Production: production-overview → production-guardrails → production-continuous
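
The RAG Systems workflow above pairs a deterministic retrieval check with an LLM faithfulness judge. A sketch under the same assumptions as the snippets above; `retrieved_ids` and `expected_id` are hypothetical columns standing in for your retriever's output and a labeled gold document.

```python
import pandas as pd
from phoenix.evals import HallucinationEvaluator, OpenAIModel, run_evals

df = pd.DataFrame(
    {
        "input": ["What is the refund window?"],
        "output": ["30 days."],
        "reference": ["Returns are accepted within 30 days of purchase."],  # retrieved context
        "retrieved_ids": [["doc-12", "doc-07"]],  # hypothetical retriever output
        "expected_id": ["doc-12"],                # hypothetical gold document
    }
)

# Retrieval: a code evaluator -- hit rate needs no LLM.
df["retrieval_hit"] = [
    expected in retrieved
    for expected, retrieved in zip(df["expected_id"], df["retrieved_ids"])
]

# Faithfulness: an LLM evaluator -- is the answer grounded in the retrieved context?
(faithfulness_df,) = run_evals(
    dataframe=df,
    evaluators=[HallucinationEvaluator(OpenAIModel(model="gpt-4o"))],
    provide_explanation=True,
)
```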

Reference Categories

Prefix: Description
fundamentals-*: Types, scores, anti-patterns
observe-*: Tracing, sampling
error-analysis-*: Finding failures
axial-coding-*: Categorizing failures
evaluators-*: Code, LLM, RAG evaluators
experiments-*: Datasets, running experiments
validation-*: Validating evaluator accuracy against human labels
production-*: CI/CD, monitoring

Key Principles

Principle: Action
Error analysis first: Can't automate what you haven't observed
Custom > generic: Build from your failures
Code first: Deterministic before LLM
Validate judges: >80% TPR/TNR (sketch below)
Binary > Likert: Pass/fail, not 1-5
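
Concretely, "Validate judges: >80% TPR/TNR" means scoring the judge against human labels on the same examples, using binary pass/fail labels as the last principle suggests. A plain-pandas sketch; the column names and sample labels are made up, and in practice you would label far more examples.

```python
import pandas as pd

# One row per example: a human label and the LLM judge's label, both binary.
labels = pd.DataFrame(
    {
        "human": ["fail", "pass", "fail", "pass", "pass", "fail"],
        "judge": ["fail", "pass", "fail", "pass", "fail", "fail"],
    }
)

# Treat "fail" (the defect the judge must catch) as the positive class.
tp = ((labels.human == "fail") & (labels.judge == "fail")).sum()
fn = ((labels.human == "fail") & (labels.judge == "pass")).sum()
tn = ((labels.human == "pass") & (labels.judge == "pass")).sum()
fp = ((labels.human == "pass") & (labels.judge == "fail")).sum()

tpr = tp / (tp + fn)  # how often the judge catches real failures
tnr = tn / (tn + fp)  # how often it lets genuine passes through
print(f"TPR={tpr:.2f} TNR={tnr:.2f}")  # trust the judge only once both exceed 0.80
```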
Info
Name: phoenix-evals
Version: v20260415
Size: 41.52 KB
Updated At: 2026-04-17