LLM Answer Watcher

nibzard/llm-answer-watcher

Evaluation Framework

Quality control and accuracy testing for brand extraction.

Purpose

The evaluation framework validates:

Mention detection accuracy
Rank extraction correctness
False positive/negative rates

Running Evaluations

llm-answer-watcher eval --fixtures evals/testcases/fixtures.yaml

Metrics Tracked

Mention Precision: Correct mentions / total mentions found
Mention Recall: Correct mentions / total expected mentions
Rank Accuracy: Share of brands extracted at the correct rank
F1 Score: Harmonic mean of precision and recall, 2 × precision × recall / (precision + recall)
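
The metric definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual implementation: the function names `mention_metrics` and `rank_accuracy`, the case-insensitive matching, and the choice to score rank accuracy only over brands present in both the expected and extracted results are all assumptions made for this example. The brand names are placeholders.

```python
def mention_metrics(expected, found):
    """Compute mention precision, recall, and F1 for one test case.

    `expected` and `found` are iterables of brand names.
    Assumption: brands are compared case-insensitively.
    """
    expected_set = {name.lower() for name in expected}
    found_set = {name.lower() for name in found}
    correct = expected_set & found_set

    # Guard against division by zero when a set is empty.
    precision = len(correct) / len(found_set) if found_set else 0.0
    recall = len(correct) / len(expected_set) if expected_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def rank_accuracy(expected_ranks, found_ranks):
    """Fraction of brands whose extracted rank equals the expected rank.

    Assumption: only brands present in both mappings are scored.
    """
    shared = set(expected_ranks) & set(found_ranks)
    if not shared:
        return 0.0
    hits = sum(1 for brand in shared if expected_ranks[brand] == found_ranks[brand])
    return hits / len(shared)


# Placeholder data for illustration only.
metrics = mention_metrics(
    expected=["BrandA", "BrandB", "BrandC"],
    found=["BrandA", "BrandC", "BrandD"],
)
accuracy = rank_accuracy(
    expected_ranks={"BrandA": 1, "BrandB": 2, "BrandC": 3},
    found_ranks={"BrandA": 1, "BrandC": 2, "BrandD": 3},
)
```

In this example two of the three extracted brands are expected, and two of the three expected brands are extracted, so precision and recall are both 2/3. Of the two brands present in both results, only one has the matching rank, so rank accuracy under the stated assumption is 0.5.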

See Running Evals for detailed usage.