How to Validate and Measure LLM Accuracy in Production

Tue, 18 Feb 2025 00:00:00 +0000

Shipping an LLM feature without measurement is shipping a bug generator. Production validation combines automated metrics, human review, and business KPIs.

Levels of Evaluation

Unit-level - Schema validation, regex checks, refusal detection
Golden set - Curated Q&A pairs scored automatically
Online - User thumbs, task completion, support escalations
Human - Expert rubrics on sampled traffic
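
The first two levels are cheap enough to run on every response or in CI. The sketch below is one possible shape, not a reference implementation: it assumes the model is asked to return JSON, and the names unit_checks, golden_set_accuracy, and REFUSAL_PATTERN, along with the refusal phrases and the substring scorer, are all illustrative choices rather than anything prescribed above.

```python
import json
import re

# Hypothetical refusal phrases; tune these to what your model actually says.
REFUSAL_PATTERN = re.compile(
    r"\b(I can(?:'|no)t help|I am unable to|I'm unable to)\b",
    re.IGNORECASE,
)


def unit_checks(raw: str, required_keys: set[str]) -> dict:
    """Deterministic checks on one raw model output that should be JSON."""
    try:
        parsed = json.loads(raw)
        valid_json = True
    except json.JSONDecodeError:
        parsed = None
        valid_json = False
    # Schema check here is only "is an object with the required keys".
    schema_ok = isinstance(parsed, dict) and required_keys.issubset(parsed)
    return {
        "valid_json": valid_json,
        "schema_ok": schema_ok,
        "refused": bool(REFUSAL_PATTERN.search(raw)),
    }


def golden_set_accuracy(golden: list[tuple[str, str]], answer_fn) -> float:
    """Fraction of golden questions whose answer contains the expected string.

    Substring matching is an assumed, deliberately simple scorer.
    """
    if not golden:
        raise ValueError("golden set must be non-empty")
    hits = sum(
        expected.lower() in answer_fn(question).lower()
        for question, expected in golden
    )
    return hits / len(golden)
```

In practice answer_fn would wrap the real model call, and the golden pairs would live in version control so that a score drop can be traced to a specific prompt or model change.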

Metrics That Matter

Faithfulness - Answer grounded in retrieved context (RAG)
Relevance - Addresses the user question
Toxicity / PII - Safety filters
Latency and cost - p95 latency, plus tokens and dollars per session
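
The latency and cost line can be computed directly from request logs. The following is a minimal sketch under stated assumptions: each logged call is a dict with hypothetical session_id and tokens fields, the percentile uses the nearest-rank method, and usd_per_1k_tokens is a placeholder price you would replace with your provider's actual rate.

```python
import math
from collections import defaultdict


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def cost_per_session(calls: list[dict], usd_per_1k_tokens: float) -> dict[str, float]:
    """Sum tokens per session and convert to dollars at a flat (assumed) rate."""
    totals: dict[str, int] = defaultdict(int)
    for call in calls:
        totals[call["session_id"]] += call["tokens"]
    return {sid: tokens / 1000 * usd_per_1k_tokens for sid, tokens in totals.items()}
```

The same percentile helper works for latency in milliseconds or for tokens per request. A flat per-token rate is a simplification, since many providers price input and output tokens differently.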

Implementation Sketch

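# entailment_score is assumed to be supplied elsewhere (for example an NLI
# model or an LLM judge) and to return a float in [0, 1].
def validate_response(answer: str, context: str) -> dict:
    """Run cheap checks on one answer and return a pass/fail flag per check."""
    return {
        "has_citation": "[source:" in answer,  # citation marker present
        "length_ok": 50 < len(answer) < 4000,  # length in characters
        "grounded": entailment_score(context, answer) > 0.7,  # tunable threshold
    }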
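
The sketch calls entailment_score without defining it. To exercise the function end to end before a real scorer is wired in, one option is a stand-in like the one below. It is only a lexical-overlap placeholder (the fraction of answer tokens that also appear in the context), not real entailment, and it is an assumption of this example rather than the scorer the sketch intends.

```python
import re


def _tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def entailment_score(context: str, answer: str) -> float:
    """Placeholder scorer: share of answer tokens found in the context.

    NOT real entailment. Swap in an NLI model or LLM judge for production.
    """
    answer_tokens = _tokens(answer)
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & _tokens(context)) / len(answer_tokens)


def validate_response(answer: str, context: str) -> dict:
    return {
        "has_citation": "[source:" in answer,
        "length_ok": 50 < len(answer) < 4000,
        "grounded": entailment_score(context, answer) > 0.7,
    }
```

Token overlap will pass answers that reuse the context's words while contradicting it, so treat it as scaffolding for tests and dashboards, not as a faithfulness metric.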

Log scores to your observability stack (Datadog, LangSmith, Phoenix).
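
Vendor SDKs differ, so the example below avoids them and emits one JSON line per response through Python's standard logging module, a format most log pipelines can parse. The logger name llm_eval, the field names, and the passed flag (true only when every check passes) are assumptions made for illustration.

```python
import json
import logging
import time

logger = logging.getLogger("llm_eval")  # hypothetical logger name


def log_scores(request_id: str, scores: dict, latency_ms: float) -> str:
    """Emit one response's validation scores as a single JSON log line."""
    line = json.dumps(
        {
            "ts": time.time(),
            "request_id": request_id,
            "latency_ms": latency_ms,
            "scores": scores,
            "passed": all(scores.values()),  # every check must be truthy
        },
        sort_keys=True,
    )
    logger.info(line)
    return line
```

Call logging.basicConfig(level=logging.INFO) once at startup to see the lines on stderr. Returning the serialized line keeps the function easy to unit test without capturing log output.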
