How to Validate and Measure LLM Accuracy in Production

Shipping an LLM feature without measurement is shipping a bug generator. Production validation combines automated metrics, human review, and business KPIs.

Levels of Evaluation

  1. Unit-level - Schema validation, regex checks, refusal detection
  2. Golden set - Curated Q&A pairs scored automatically
  3. Online - User thumbs up/down ratings, task completion, support escalations
  4. Human - Expert rubrics on sampled traffic
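As a rough illustration of the golden-set level, the sketch below scores a model against curated Q&A pairs using normalized exact match. The names normalize and score_golden_set, and the idea of passing the model as a plain callable, are assumptions made for this example; real golden sets often need fuzzier scoring (token overlap, an LLM judge) than exact match.

```python
import re
from typing import Callable


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def score_golden_set(
    golden: list[tuple[str, str]],
    model: Callable[[str], str],  # hypothetical: any function mapping question -> answer
) -> float:
    """Return the fraction of golden questions answered with a normalized exact match."""
    if not golden:
        raise ValueError("golden set is empty")
    hits = sum(
        normalize(model(question)) == normalize(expected)
        for question, expected in golden
    )
    return hits / len(golden)
```

Running this in CI against a fixed golden set turns prompt or model changes into a single number that can gate a release.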

Metrics That Matter

  • Faithfulness - Answer grounded in retrieved context (RAG)
  • Relevance - Addresses the user question
  • Toxicity / PII - Safety filters
  • Latency and cost - p95 latency, plus tokens and dollars per session
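The latency and cost metrics above can be computed from per-call logs. The following is a minimal sketch, assuming each call is logged as a dict with session_id, prompt_tokens, and completion_tokens keys; those field names and the per-1k-token prices passed in are hypothetical, not any provider's actual schema or pricing.

```python
from collections import defaultdict


def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    # Integer form of ceil(0.95 * n), avoiding float rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def cost_per_session(
    calls: list[dict],
    price_in_per_1k: float,   # assumed price per 1,000 prompt tokens
    price_out_per_1k: float,  # assumed price per 1,000 completion tokens
) -> dict[str, float]:
    """Sum the dollar cost of all calls belonging to each session."""
    totals: dict[str, float] = defaultdict(float)
    for call in calls:
        totals[call["session_id"]] += (
            call["prompt_tokens"] / 1000 * price_in_per_1k
            + call["completion_tokens"] / 1000 * price_out_per_1k
        )
    return dict(totals)
```

Tracking the p95 rather than the mean keeps slow tail requests visible, which is usually where users notice the problem.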

Implementation Sketch

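from typing import Callable

def validate_response(answer: str, context: str,
                      entailment_score: Callable[[str, str], float]) -> dict:
    """Run cheap unit-level checks on one answer; every value is a bool."""
    # entailment_score(context, answer) -> float in [0, 1] is supplied by the
    # caller (for example an NLI model); it is not defined in this sketch.
    return {
        "has_citation": "[source:" in answer,  # citation marker present
        "length_ok": 50 < len(answer) < 4000,  # length in characters
        "grounded": entailment_score(context, answer) > 0.7,  # threshold to tune
    }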

Log scores to your observability stack (Datadog, LangSmith, Phoenix).
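One vendor-neutral way to do this is to emit each score set as a single JSON log line that any of those tools can ingest. The sketch below uses only the Python standard library; the record fields (ts, request_id, model, scores, passed) and the logger name are assumptions for illustration, not a schema required by Datadog, LangSmith, or Phoenix.

```python
import json
import logging
import time

logger = logging.getLogger("llm_eval")  # hypothetical logger name


def build_score_record(request_id: str, scores: dict, model: str) -> str:
    """Serialize one validation result as a JSON line."""
    record = {
        "ts": time.time(),
        "request_id": request_id,
        "model": model,
        "scores": scores,
        # Convenience flag: true only if every individual check passed.
        "passed": all(bool(value) for value in scores.values()),
    }
    return json.dumps(record, sort_keys=True)


def log_scores(request_id: str, scores: dict, model: str) -> str:
    """Log the record at INFO level and return the emitted line."""
    line = build_score_record(request_id, scores, model)
    logger.info(line)
    return line
```

Keeping the request ID in every record makes it possible to join automated scores with later human review of the same response.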

Human-in-the-Loop

Sample 1–5% of production traffic for review. Rubrics beat binary thumbs. Feed failures back into prompts, RAG chunks, or fine-tuning data.
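A simple way to pick the review sample is to hash a stable request ID, so the same request is always in or out of the sample regardless of which server handles it. This is a sketch under assumptions: the function name in_review_sample and the 2% default rate are illustrative choices, and it presumes every request carries a unique string ID.

```python
import hashlib


def in_review_sample(request_id: str, rate: float = 0.02) -> bool:
    """Deterministically select roughly `rate` of requests for human review."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    # Compare in integer space so rate=1.0 selects everything.
    return bucket < int(rate * 2**64)
```

Hash-based selection avoids storing sampling decisions and makes the sample reproducible when a reviewer needs to revisit a case.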

Conclusion

Accuracy is task-specific: define it before you optimize. A support bot optimizes resolution rate; a codegen tool optimizes test pass rate. Align metrics with product outcomes.