How to Validate and Measure LLM Accuracy in Production
Shipping an LLM feature without measurement is shipping a bug generator. Production validation combines automated metrics, human review, and business KPIs.
Levels of Evaluation
- Unit-level - Schema validation, regex checks, refusal detection
- Golden set - Curated Q&A pairs scored automatically
- Online - User thumbs, task completion, support escalations
- Human - Expert rubrics on sampled traffic
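As a rough illustration of the golden-set level, the sketch below scores a model against curated Q&A pairs using token-overlap F1. The names `token_f1` and `score_golden_set`, the `question`/`answer` keys, and the 0.8 pass threshold are assumptions made for this example, not a standard. Token overlap is a crude proxy, so open-ended tasks may need a semantic or model-graded scorer instead.

```python
import re
from collections import Counter
from typing import Callable


def _tokens(text: str) -> list[str]:
    # Lowercase and split on word characters so punctuation is ignored.
    return re.findall(r"\w+", text.lower())


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a prediction and a reference answer."""
    pred, ref = _tokens(prediction), _tokens(reference)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def score_golden_set(
    golden: list[dict],
    generate: Callable[[str], str],
    threshold: float = 0.8,  # assumed pass mark; tune per task
) -> dict:
    """Run `generate` on every golden question and summarize the scores."""
    if not golden:
        raise ValueError("golden set is empty")
    scores = [token_f1(generate(item["question"]), item["answer"]) for item in golden]
    passed = sum(score >= threshold for score in scores)
    return {
        "n": len(scores),
        "pass_rate": passed / len(scores),
        "mean_f1": sum(scores) / len(scores),
    }
```

Here `generate` stands in for whatever function calls your model. Running this in CI on every prompt or model change gives a regression signal before anything reaches users.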
Metrics That Matter
- Faithfulness - Answer grounded in retrieved context (RAG)
- Relevance - Addresses the user question
- Toxicity / PII - Safety filters
- Latency and cost - p95 latency, plus tokens and dollars per session
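The latency and cost numbers can be computed from per-request logs. The sketch below uses a nearest-rank percentile and a simple per-session cost formula. The function names and the per-1K-token price parameters are hypothetical, so substitute your provider's actual pricing.

```python
import math


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("no values to summarize")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def session_cost(
    prompt_tokens: int,
    completion_tokens: int,
    usd_per_1k_prompt: float,      # assumed pricing input, not a real price
    usd_per_1k_completion: float,  # assumed pricing input, not a real price
) -> float:
    """Dollar cost of one session given token counts and per-1K-token prices."""
    return (
        prompt_tokens / 1000 * usd_per_1k_prompt
        + completion_tokens / 1000 * usd_per_1k_completion
    )
```

Tracking p95 rather than the mean keeps slow outliers visible, and those are usually what users notice.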
Implementation Sketch
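def validate_response(answer: str, context: str) -> dict:
    # entailment_score is assumed to be defined elsewhere (for example, a
    # wrapper around an NLI model) and to return a score in [0, 1].
    return {
        "has_citation": "[source:" in answer,  # expects inline "[source: ...]" markers
        "length_ok": 50 < len(answer) < 4000,  # length in characters
        "grounded": entailment_score(context, answer) > 0.7,
    }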
Log scores to your observability stack (Datadog, LangSmith, Phoenix).
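Each of those tools has its own client API, so the sketch below stays vendor-neutral: it emits one JSON line per response through Python's standard `logging` module, which most log shippers can ingest. The logger name `llm_eval` and the `request_id` field are assumptions for this example.

```python
import json
import logging
import time

# Attach a handler that forwards to your log pipeline; without one,
# INFO-level records are not emitted anywhere.
logger = logging.getLogger("llm_eval")


def log_scores(request_id: str, scores: dict) -> str:
    """Serialize validation scores as one JSON line and log it."""
    record = {"ts": time.time(), "request_id": request_id, **scores}
    line = json.dumps(record, sort_keys=True)
    logger.info(line)
    return line
```

Keeping every score as a top-level field makes it simple to chart, for instance, the share of responses with `grounded` set to false over time.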
Human-in-the-Loop
Sample 1–5% of production traffic for review. Rubrics beat binary thumbs: a rubric records which dimension failed, while a thumbs-down only says that something did. Feed failures back into prompts, RAG chunks, or fine-tuning data.
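One way to pick the review sample is to hash a request ID into a bucket, so the same request always gets the same decision across services and replays. This is a minimal sketch: `should_review` and the 2% default are illustrative choices within the 1–5% range, not a prescribed method.

```python
import hashlib


def should_review(request_id: str, sample_rate: float = 0.02) -> bool:
    """Deterministically select roughly `sample_rate` of requests for human review."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate
```

Hash-based sampling avoids the need for shared random state. If you want reviewers to see more failures, you could raise the rate for responses that already failed an automated check.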
Conclusion
Accuracy is task-specific: define it before you optimize. A support bot optimizes resolution rate; a codegen tool optimizes test pass rate. Align metrics with product outcomes.