Eval-driven prompt engineering, RAG quality measurement, and agent workflow validation. Everything here is model-agnostic by design: techniques are framed by what they do, not by which model generation they were observed on, and the tools never hardcode model IDs or pricing — you supply your provider's current rates when you want dollar figures.
`--analyze --output baseline.json`), then compare every iteration against it.

`--price-per-mtok` (never trust a cached price table, including any you remember).

scripts/prompt_optimizer.py

Static analysis: token estimate, clarity/structure scores (0–100), ambiguity + redundancy detection, few-shot example extraction.
```bash
# Full analysis (human-readable report)
python3 scripts/prompt_optimizer.py prompt.txt --analyze

# Save machine-readable baseline for later comparison
python3 scripts/prompt_optimizer.py prompt.txt --analyze --json --output baseline.json

# Token estimate; cost only if you supply your provider's current rate
python3 scripts/prompt_optimizer.py prompt.txt --tokens --model claude --price-per-mtok 3.00

# Whitespace/redundancy-trimmed version
python3 scripts/prompt_optimizer.py prompt.txt --optimize --output optimized.txt

# Extract Input/Output few-shot pairs to JSON
python3 scripts/prompt_optimizer.py prompt.txt --extract-examples --output examples.json

# Compare a revision against the saved baseline
python3 scripts/prompt_optimizer.py optimized.txt --analyze --compare baseline.json
```
--model accepts any string; only the tokenizer family is inferred (names containing "claude" → 3.5 chars/token, otherwise 4.0). Exit 0 on success, 1 on missing file.
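The chars-per-token heuristic described above can be sketched in a few lines. This is an illustration, not the script's actual code: the function name `estimate_tokens`, the case-insensitive match, and the round-up behavior are assumptions. Only the two ratios (3.5 for names containing "claude", 4.0 otherwise) come from the description.

```python
import math


def estimate_tokens(text: str, model: str = "") -> int:
    """Rough token estimate from character count (hypothetical helper)."""
    # Ratios taken from the description above; lowercasing the name is an assumption.
    chars_per_token = 3.5 if "claude" in model.lower() else 4.0
    # Rounding up is an assumption; the real script may round differently.
    return math.ceil(len(text) / chars_per_token)


if __name__ == "__main__":
    prompt = "Summarize the following support ticket in two sentences."
    print(estimate_tokens(prompt, "claude"))
    print(estimate_tokens(prompt, "some-other-model"))
```

Because the estimate depends only on character count, it is best read as a relative measure for comparing prompt revisions rather than as a billing-accurate token count.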
scripts/rag_evaluator.py

Measures retrieval and grounding quality from two JSON files (formats printed in --help).

```bash
python3 scripts/rag_evaluator.py --contexts retrieved.json --questions eval_set.json
python3 scripts/rag_evaluator.py --contexts ctx.json --questions q.json --k 10 --json
python3 scripts/rag_evaluator.py --contexts ctx.json --questions q.json --output report.json --verbose
python3 scripts/rag_evaluator.py --contexts ctx.json --questions q.json --compare baseline_report.json
```
Reports context relevance, precision@k, coverage, answer faithfulness, groundedness. Treat relevance < 0.80 as a retrieval problem (chunking/embedding/filtering), not a prompt problem — fix retrieval before rewriting the generation prompt.
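To make two of these ideas concrete, here is a minimal sketch of precision@k in its textbook form, plus the 0.80 triage rule from the paragraph above. The evaluator's own metric definitions may differ. The function names, the chunk-id representation, and the returned labels are assumptions made for illustration.

```python
from typing import Sequence, Set


def precision_at_k(retrieved_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunks that are relevant (textbook definition)."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = list(retrieved_ids)[:k]
    hits = sum(1 for chunk_id in top_k if chunk_id in relevant_ids)
    # Divide by k, not len(top_k), so a short result list is penalized.
    return hits / k


def triage(context_relevance: float, threshold: float = 0.80) -> str:
    """Apply the rule above: low relevance points at retrieval, not the prompt."""
    return "fix retrieval first" if context_relevance < threshold else "retrieval ok"


if __name__ == "__main__":
    print(precision_at_k(["a", "b", "c", "d"], {"a", "c"}, k=4))  # 0.5
    print(triage(0.72))  # fix retrieval first
```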
scripts/agent_orchestrator.py

Validates agent configs (YAML/JSON): tool wiring, missing required config, loop risk, token estimates.

```bash
python3 scripts/agent_orchestrator.py agent.yaml --validate
python3 scripts/agent_orchestrator.py agent.yaml --visualize --format mermaid
python3 scripts/agent_orchestrator.py agent.yaml --estimate-cost --runs 100 \
  --input-price-per-mtok 3.00 --output-price-per-mtok 15.00
```
Without the two price flags, --estimate-cost reports token estimates only. The model: field in the config is informational — any model name is accepted.
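The arithmetic behind per-million-token pricing is simple enough to sketch. The following is an illustration, not the orchestrator's code: the function name and the `None` return for missing prices are assumptions chosen to mirror the tokens-only behavior described above, and the rates in the usage lines are the placeholder values from the example command, not current prices.

```python
from typing import Optional


def estimate_cost(input_tokens: int, output_tokens: int, runs: int,
                  input_price_per_mtok: Optional[float] = None,
                  output_price_per_mtok: Optional[float] = None) -> Optional[float]:
    """Total dollars for `runs` runs, or None when either price is missing."""
    if input_price_per_mtok is None or output_price_per_mtok is None:
        # No prices supplied: the caller should report token counts only.
        return None
    per_run = (input_tokens * input_price_per_mtok
               + output_tokens * output_price_per_mtok) / 1_000_000
    return per_run * runs


if __name__ == "__main__":
    # 2,000 input + 500 output tokens per run, 100 runs: roughly 1.35 dollars.
    print(estimate_cost(2000, 500, 100, 3.00, 15.00))
    print(estimate_cost(2000, 500, 100))  # None
```

Keeping input and output prices separate matters because providers commonly charge different rates for each, so a single blended rate can misstate the cost of output-heavy agents.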
```bash
python3 scripts/prompt_optimizer.py current_prompt.txt --analyze --json --output baseline.json
```
| Symptom | Fix |
|---|---|
| Malformed/unparseable output | Native structured outputs / JSON schema if the API supports it; explicit schema-in-prompt otherwise |
| Inconsistent answers across runs | Tighten instructions + add 2–3 contrastive examples (one near-miss showing what NOT to do) |
| Misses edge cases | Enumerate the edge cases explicitly; add a "when uncertain, do X" rule |
| Token bloat on repeated calls | Move stable prefix (system rules, examples) first so prompt caching applies; trim redundancy |
| Wrong reasoning on hard cases | Ask for stepwise reasoning in a scratch field the consumer ignores, or use the provider's extended-thinking mode |
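The "stable prefix first" fix from the table can be illustrated with a small prompt builder. This is a sketch under assumptions: `build_prompt` is a hypothetical helper, the Input/Output layout simply follows the few-shot pair format mentioned earlier, and whether a given provider actually caches the shared prefix depends on that provider's caching rules.

```python
from typing import List, Tuple


def build_prompt(system_rules: str, examples: List[Tuple[str, str]], user_input: str) -> str:
    """Put stable content first and the per-call input last (hypothetical helper)."""
    shots = "\n\n".join(f"Input: {i}\nOutput: {o}" for i, o in examples)
    # Everything before the final section is identical across calls,
    # which is the precondition for prefix-based prompt caching.
    return f"{system_rules}\n\n{shots}\n\nInput: {user_input}\nOutput:"


if __name__ == "__main__":
    rules = "Classify the ticket as 'billing' or 'technical'."
    shots = [("I was charged twice", "billing"), ("App crashes on login", "technical")]
    print(build_prompt(rules, shots, "Refund has not arrived"))
```

The design point is ordering only: anything that varies per call (the user input, retrieved context, timestamps) goes after the rules and examples, never in between them.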
```bash
python3 scripts/prompt_optimizer.py revised.txt --analyze --compare baseline.json
```
eval_results.json, then assert:
```bash
python3 scripts/prompt_optimizer.py revised.txt --analyze --json --output revised.json \
  && python3 -c "
import json, sys
# Load the revised and baseline analyses
with open('revised.json') as f: r = json.load(f)
with open('baseline.json') as f: b = json.load(f)
# Pass only if clarity did not drop and tokens grew by at most 10 percent
ok = r['clarity_score'] >= b['clarity_score'] and r['token_count'] <= b['token_count'] * 1.10
sys.exit(0 if ok else 1)"
echo "gate exit=$?"  # 0 = ship; 1 = regression, iterate again
```
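A task-level check to run alongside this structural gate is a no-regression comparison: no eval case that passed at baseline may fail after the revision. The sketch below assumes results are available as a mapping from case id to pass/fail. That format and the name `lost_cases` are hypothetical, so adapt them to whatever your eval harness emits.

```python
from typing import Dict, List


def lost_cases(baseline: Dict[str, bool], revised: Dict[str, bool]) -> List[str]:
    """Return ids that passed at baseline but do not pass in the revision."""
    # A case missing from the revised results counts as lost.
    return sorted(cid for cid, passed in baseline.items()
                  if passed and not revised.get(cid, False))


if __name__ == "__main__":
    baseline = {"case-1": True, "case-2": True, "case-3": False}
    revised = {"case-1": True, "case-2": False, "case-3": True}
    regressions = lost_cases(baseline, revised)
    print(regressions)  # ['case-2']
```

Note that a newly passing case (`case-3` here) does not offset a lost one: the rule is about not breaking what already worked, not about the net pass count.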
Pair this structural gate with your task-level eval: the revision must not lose any previously-passing eval case (no-regression rule).

`python3 scripts/prompt_optimizer.py prompt_with_examples.txt --extract-examples --output examples.json` and check that every extracted pair parses against your schema.

`python3 -c "import json,sys; [json.loads(l) for l in sys.stdin]"` at minimum); 10/10 must parse, else return to step 2.

questions.json (id, question, reference answer) and capture current retrievals to contexts.json.

```bash
python3 scripts/rag_evaluator.py --contexts contexts.json --questions questions.json --output rag_baseline.json
```
`python3 scripts/rag_evaluator.py --contexts new_contexts.json --questions questions.json --compare rag_baseline.json`: every metric must be ≥ baseline; any regression blocks the change.

`python3 scripts/agent_orchestrator.py agent.yaml --validate`: must exit with VALIDATION PASSED; fix every error and warning (missing tool config, unbounded iterations, loop risk).

`--estimate-cost --runs N` with your current prices; if cost/run exceeds budget, cut tools or context before downgrading the model.

| File | Contains | Load when user asks about |
|---|---|---|
| references/prompt_engineering_patterns.md | 10 prompt patterns with input/output examples | "which pattern?", few-shot design, decomposition, meta-prompting |
| references/llm_evaluation_frameworks.md | Eval metrics, scoring methods, A/B testing | "how to evaluate?", "measure quality", "compare prompts" |
| references/agentic_system_design.md | Agent architectures (ReAct, Plan-Execute, Tool Use) | "build agent", "tool calling", "multi-agent" |
- engineering-team/skills/senior-ml-engineer: model deployment and serving (this skill stops at the prompt/eval layer)
- engineering/rag-architect: RAG system architecture (this skill measures RAG quality; that one designs the pipeline)
- engineering/agent-designer: full agent system design (this skill validates configs; that one designs the architecture)