Use this skill to move prompts from ad-hoc drafts to production assets with repeatable testing, versioning, and regression safety. It emphasizes measurable quality over intuition.

Apply it when launching a new LLM feature that needs reliable outputs, when prompt quality degrades after model or instruction changes, when multiple team members edit prompts and need history/diffs, when you need evidence-based prompt choice for production rollout, or when you want consistent prompt governance across environments.
Prepare JSON test cases and run:
python3 scripts/prompt_tester.py \
--prompt-a-file prompts/a.txt \
--prompt-b-file prompts/b.txt \
--cases-file testcases.json \
--runner-cmd 'my-llm-cli --prompt {prompt} --input {input}' \
--format text
Input can also be supplied as a JSON payload, either on stdin or via --input.
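The `--runner-cmd` value above is a template with `{prompt}` and `{input}` placeholders. The sketch below shows one way such a template could be expanded into an argument list. It is an illustration only, not necessarily how `prompt_tester.py` does it: the helper name `build_runner_argv` is hypothetical, and it assumes `{prompt}` stands for the prompt text and `{input}` for the test-case input.

```python
import re
import shlex


def build_runner_argv(template: str, prompt: str, case_input: str) -> list[str]:
    """Expand a runner template into argv (hypothetical helper, not the tool's API)."""
    values = {"prompt": prompt, "input": case_input}
    argv = []
    # Split the template first so substituted values stay single arguments,
    # even when they contain spaces, quotes, or shell metacharacters.
    for token in shlex.split(template):
        # Single-pass substitution: placeholders inside a substituted value
        # are not expanded a second time.
        argv.append(re.sub(r"\{(prompt|input)\}", lambda m: values[m.group(1)], token))
    return argv


argv = build_runner_argv(
    "my-llm-cli --prompt {prompt} --input {input}",
    prompt="Classify the ticket; reply with one word.",
    case_input="My card was charged twice.",
)
print(argv)
```

Because the result is a list rather than a shell string, it could be handed to `subprocess.run(argv)` without `shell=True`, which avoids quoting problems when prompts contain special characters.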
The tester scores outputs per case and aggregates the scores for each prompt.
Use the higher-scoring prompt as the candidate baseline, then run the regression suite.
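As a rough illustration of choosing the higher-scoring prompt, the sketch below aggregates per-case scores with a plain mean. The aggregation method, the score range, the tie-breaking rule, and the function name `pick_baseline` are all assumptions made for this example; the tester's actual scoring may differ.

```python
from statistics import mean


def pick_baseline(scores_a: list[float], scores_b: list[float]) -> dict:
    """Aggregate per-case scores by mean and name the higher-scoring prompt."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both prompts must be scored on the same cases")
    avg_a, avg_b = mean(scores_a), mean(scores_b)
    # Assumption: on a tie, keep prompt A as the incumbent.
    winner = "b" if avg_b > avg_a else "a"
    return {"mean_a": avg_a, "mean_b": avg_b, "winner": winner}


# Hypothetical per-case scores in [0, 1] for four test cases.
result = pick_baseline([1.0, 0.5, 1.0, 0.0], [1.0, 1.0, 1.0, 0.5])
print(result)
```

Requiring both prompts to be scored on the same cases keeps the comparison fair; a mean over different case sets would not be comparable.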
# Add version
python3 scripts/prompt_versioner.py add \
--name support_classifier \
--prompt-file prompts/support_v3.txt \
--author alice
# Diff versions
python3 scripts/prompt_versioner.py diff --name support_classifier --from-version 2 --to-version 3
# Changelog
python3 scripts/prompt_versioner.py changelog --name support_classifier
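To give a feel for what a version diff contains, here is a minimal sketch that compares two prompt texts with the standard library's `difflib`. It is not the implementation of `prompt_versioner.py`; the function name `diff_prompts`, the `name@vN` labels, and the sample prompt texts are hypothetical.

```python
import difflib


def diff_prompts(old: str, new: str, name: str, from_version: int, to_version: int) -> str:
    """Return a unified diff between two prompt texts (illustrative only)."""
    return "".join(
        difflib.unified_diff(
            old.splitlines(keepends=True),
            new.splitlines(keepends=True),
            fromfile=f"{name}@v{from_version}",
            tofile=f"{name}@v{to_version}",
        )
    )


# Hypothetical contents of two stored versions.
v2 = "Classify the ticket.\nReply with one word.\n"
v3 = "Classify the support ticket.\nReply with one word.\n"

patch = diff_prompts(v2, v3, "support_classifier", 2, 3)
print(patch)
```

A line-based diff like this makes small wording changes easy to review before a new version is promoted.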
python3 scripts/prompt_tester.py --help
--input
python3 scripts/prompt_versioner.py --help
(subcommands: add, list, diff, changelog)
Avoid these mistakes:
Omitting must_not_contain (forbidden-content) checks in evaluation criteria.
Before promoting any prompt, confirm:
Each test case should define:
input: realistic production-like input
expected_contains: required markers/content
forbidden_contains: disallowed phrases or unsafe content
expected_regex: required structural patterns
This enables deterministic grading across prompt variants.
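The sketch below shows what one test case and a deterministic grader for it might look like. The field names come from the list above; everything else is an assumption for illustration: that `expected_contains` and `forbidden_contains` are lists of strings, that `expected_regex` is a single pattern, that matching is case-sensitive, and the helper name `grade_case`. The exact schema expected by `prompt_tester.py` may differ.

```python
import json
import re


def grade_case(case: dict, output: str) -> dict:
    """Grade one model output against one test case (illustrative only)."""
    # Required substrings that are absent from the output.
    missing = [s for s in case.get("expected_contains", []) if s not in output]
    # Forbidden substrings that appear in the output.
    forbidden_hits = [s for s in case.get("forbidden_contains", []) if s in output]
    pattern = case.get("expected_regex")
    regex_ok = True if pattern is None else re.search(pattern, output) is not None
    return {
        "passed": not missing and not forbidden_hits and regex_ok,
        "missing": missing,
        "forbidden_hits": forbidden_hits,
        "regex_ok": regex_ok,
    }


# Hypothetical test case for a support-ticket classifier.
case = {
    "input": "My card was charged twice for the same order.",
    "expected_contains": ["billing"],
    "forbidden_contains": ["I cannot help"],
    "expected_regex": r"^category: \w+$",
}

print(json.dumps(case, indent=2))
print(grade_case(case, "category: billing"))
```

Because every check is a substring or regex match, the same output always receives the same grade, which is what makes comparisons across prompt variants repeatable.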
support_classifier, ad_copy_shortform).