Quick Answer (TL;DR)
Evaluating LLM quality requires five complementary approaches: Reference-based metrics (BLEU, ROUGE -- comparing outputs to known-good answers), Reference-free scoring (perplexity, coherence, fluency -- measuring output quality without ground truth), LLM-as-judge (using a stronger model to evaluate a weaker one), Human evaluation (expert or crowd assessment against rubrics), and Golden datasets (curated test sets representing your product's real-world inputs). No single approach is sufficient. Product managers should combine automated metrics for speed with human evaluation for depth, all anchored to a golden dataset that represents their specific use case.
What Is the LLM Evaluation Framework?
Language model evaluation is the process of systematically measuring how well an LLM performs for your specific product use case. It is a critical skill for product managers building with LLMs, because without reliable evaluation, you can't make informed decisions about model selection, prompt engineering, fine-tuning, or feature quality.
The challenge is that LLM outputs are open-ended text. Unlike traditional software where you can write `assert output == expected`, evaluating whether a generated paragraph is "good" requires nuanced judgment. Is the summary accurate? Is the response helpful? Is the generated code correct? Is the chatbot's tone appropriate? Each of these questions requires a different evaluation approach.
This framework emerged because early LLM product teams repeatedly made the same mistake: they eyeballed a few model outputs, decided it "looked good," and shipped to production. Then user complaints revealed systematic quality issues that anecdotal testing never caught: hallucinations in specific domains, inappropriate tone with certain user types, degraded performance on long inputs, or inconsistent formatting.
The framework provides PMs with a structured evaluation strategy that catches these issues before users do. It's designed to be practical -- you don't need a PhD in NLP to implement it, but you do need to invest deliberate effort in building evaluation infrastructure.
The Framework in Detail
Approach 1: Reference-Based Metrics
Reference-based metrics compare the model's output to a known-correct reference answer. They work well when there's a "right answer" you can define in advance.
Key Metrics:
BLEU (Bilingual Evaluation Understudy)
Originally designed for machine translation, BLEU measures the overlap of n-grams (word sequences) between the generated text and the reference text. Scores range from 0 to 1, with higher being better.
ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
Measures the overlap between generated and reference text, with variants focusing on different granularities: ROUGE-1 (unigram overlap), ROUGE-2 (bigram overlap), and ROUGE-L (longest common subsequence).
Exact Match (EM)
Binary: does the output exactly match the reference?
F1 Score (token-level)
Measures the overlap of individual tokens between generated and reference text, balancing precision (are generated tokens correct?) and recall (are reference tokens present?).
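To make these metrics concrete, here is a minimal sketch of exact match and token-level F1. Whitespace tokenization and a single reference per example are simplifying assumptions; production implementations typically add text normalization:

```python
# Minimal sketch: exact match and token-level F1 against a single reference.
# Whitespace tokenization and lowercasing are simplifying assumptions.
from collections import Counter

def exact_match(prediction: str, reference: str) -> bool:
    return prediction.strip().lower() == reference.strip().lower()

def token_f1(prediction: str, reference: str) -> float:
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    overlap = Counter(pred_tokens) & Counter(ref_tokens)  # multiset intersection
    num_same = sum(overlap.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("Paris is the capital of France", "The capital of France is Paris"))
```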
PM Guidance for Reference-Based Metrics:
Reference-based metrics are your first line of defense -- fast, automated, and reproducible. But they have a fundamental limitation: they can only evaluate outputs where you know the correct answer. For creative generation, conversation, and open-ended tasks, you need other approaches.
Set up a continuous evaluation pipeline that runs reference-based metrics on every model change (new prompt, new model version, new fine-tuning run). Track trends over time. A sudden drop in BLEU or ROUGE is a strong signal that something has changed.
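A sketch of such a pipeline check, assuming the open-source rouge-score package, a JSONL golden dataset with input/reference fields, and a 5% regression tolerance (all of which are illustrative choices):

```python
# Sketch of a regression check over a golden dataset, using the rouge-score
# package (pip install rouge-score). The JSONL schema, the generate() callable,
# and the 5% tolerance are illustrative assumptions.
import json
from rouge_score import rouge_scorer

def average_rouge_l(golden_path: str, generate) -> float:
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    scores = []
    with open(golden_path) as f:
        for line in f:
            example = json.loads(line)
            output = generate(example["input"])  # your model / prompt under test
            result = scorer.score(example["reference"], output)
            scores.append(result["rougeL"].fmeasure)
    return sum(scores) / len(scores)

def check_regression(current: float, baseline: float, tolerance: float = 0.05) -> None:
    # Fail the run if average ROUGE-L dropped more than the tolerance vs. the last baseline.
    if current < baseline * (1 - tolerance):
        raise RuntimeError(f"ROUGE-L dropped from {baseline:.3f} to {current:.3f}")
```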
Approach 2: Reference-Free Scoring
Reference-free metrics assess output quality without comparing to a ground truth. They're useful for tasks where there's no single correct answer.
Key Metrics:
Perplexity
Measures how "surprised" the model is by its own output. Lower perplexity indicates more fluent, natural-sounding text.
Coherence Scoring
Automated assessment of whether the output is logically structured and internally consistent. Typically measured by embedding-based similarity between sentences or paragraphs.
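A minimal sketch of that idea, assuming the sentence-transformers library and naive period-based sentence splitting (both are illustrative choices, not a prescribed implementation):

```python
# Sketch of embedding-based coherence scoring: average cosine similarity
# between adjacent sentences. Model name and sentence splitting are assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def coherence_score(text: str) -> float:
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    if len(sentences) < 2:
        return 1.0  # a single sentence is trivially coherent
    embeddings = model.encode(sentences)
    sims = []
    for a, b in zip(embeddings, embeddings[1:]):
        sims.append(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return float(np.mean(sims))
```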
Toxicity and Safety Scoring
Automated classifiers (Perspective API, OpenAI moderation endpoint, custom classifiers) that score outputs for harmful content.
Factual Consistency (NLI-based)
Uses natural language inference models to check whether generated claims are supported by a source document. If the model is supposed to summarize or answer based on a document, NLI checks whether the output contradicts the source.
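A sketch of such a check using a Hugging Face MNLI model; the specific model name and the 0.5 contradiction threshold are assumptions you would tune for your own pipeline:

```python
# Sketch of an NLI-based consistency check: does the generated claim
# contradict the source document? Model choice and threshold are assumptions.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

def contradicts_source(source: str, claim: str) -> bool:
    # Premise = source document, hypothesis = generated claim.
    inputs = tokenizer(source, claim, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1)[0]
    label_to_index = {v.lower(): k for k, v in model.config.id2label.items()}
    return probs[label_to_index["contradiction"]].item() > 0.5
```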
PM Guidance for Reference-Free Metrics:
Reference-free metrics are useful as automated guardrails. Set up toxicity scoring as a hard gate -- no output above a threshold reaches users. Use coherence and perplexity as regression detection signals. But never rely on reference-free metrics alone to assess quality; they measure surface properties, not whether the output actually helps the user.
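A sketch of a hard gate built on the OpenAI moderation endpoint mentioned above; the fallback message and the decision to block rather than rewrite are illustrative product choices:

```python
# Sketch of a toxicity hard gate in front of users, using the OpenAI
# moderation endpoint. Requires OPENAI_API_KEY in the environment; the
# fallback message is a hypothetical product decision.
from openai import OpenAI

client = OpenAI()

def gate_output(candidate: str) -> str:
    result = client.moderations.create(input=candidate).results[0]
    if result.flagged:
        # Block the generation entirely rather than trying to sanitize it.
        return "Sorry, I can't help with that request."
    return candidate
```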
Approach 3: LLM-as-Judge
LLM-as-judge uses a separate language model (typically a more capable one) to evaluate the outputs of your production model. The judging model scores outputs against criteria you define.
How It Works:
Define the evaluation criteria that matter for your product, encode them in a judging prompt, run your production model's outputs through the judge at scale, and aggregate the scores to surface weak criteria and regressions.
Example Judging Prompt:
You are evaluating the quality of a customer support response.
User question: {user_question}
AI response: {model_output}
Rate the response on the following criteria (1-5 scale):
1. Accuracy: Is the information in the response factually correct?
2. Completeness: Does the response fully address the user's question?
3. Tone: Is the tone professional and empathetic?
4. Conciseness: Is the response appropriately concise without omitting important details?
5. Actionability: Does the response give the user clear next steps?
For each criterion, provide the score and a brief justification.
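A sketch of running a condensed version of this judging prompt through the OpenAI SDK; the model name, JSON output format, and parsing are assumptions to adapt to your own judge:

```python
# Sketch of invoking a judge model with a condensed version of the rubric
# above and parsing its scores. Model name and output schema are assumptions.
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are evaluating the quality of a customer support response.
User question: {user_question}
AI response: {model_output}
Rate the response on accuracy, completeness, tone, conciseness, and actionability
(1-5 each) and reply with a JSON object mapping each criterion to its score."""

def judge(user_question: str, model_output: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            user_question=user_question, model_output=model_output)}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```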
When LLM-as-Judge Works Well:
It scales cheaply to thousands of outputs, applies the rubric consistently, and handles open-ended tasks where reference-based metrics break down.
When LLM-as-Judge Falls Short:
The judge can miss domain-specific factual errors it doesn't know about, and judge models have known biases (such as favoring longer or more confident-sounding answers), which is why calibration against human raters is essential.
PM Guidance for LLM-as-Judge:
LLM-as-judge is the most practical evaluation approach for most LLM products. Invest time in crafting a clear rubric and judging prompt -- the quality of the rubric determines the quality of the evaluation. Calibrate the judge against human evaluations: run 100-200 examples through both human raters and the judge model, then measure agreement. If correlation is above 0.8, the judge is usable for automated evaluation. Recalibrate regularly.
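A minimal sketch of that calibration step, computing the Spearman correlation between judge scores and human scores on the same examples (scipy is the only dependency):

```python
# Sketch of judge calibration: rank correlation between judge and human
# scores on the same 100-200 examples. The 0.8 threshold follows the
# guidance above.
from scipy.stats import spearmanr

def judge_is_usable(judge_scores: list[float], human_scores: list[float]) -> bool:
    correlation, _ = spearmanr(judge_scores, human_scores)
    print(f"Judge/human Spearman correlation: {correlation:.2f}")
    return correlation >= 0.8
```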
Approach 4: Human Evaluation
Human evaluation is the gold standard for LLM quality assessment. It captures nuances that no automated metric can fully measure.
Types of Human Evaluation:
Expert Evaluation
Domain experts (lawyers for legal AI, doctors for medical AI, engineers for code AI) assess outputs for correctness, completeness, and appropriateness.
Crowdsourced Evaluation
General-purpose evaluators (via platforms like Scale AI, Surge, or Amazon Mechanical Turk) rate outputs against defined rubrics.
Internal Team Evaluation
Product team members evaluate outputs during development.
Evaluation Protocol Design:
To get reliable human evaluations, you need a rigorous protocol: a written rubric with concrete definitions for each score, blind evaluation (raters don't know which model or prompt produced each output), randomized ordering of examples, multiple raters per example, and a check of inter-rater agreement so you know the ratings are consistent.
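As a sketch of the agreement check, here is Cohen's kappa between two raters using scikit-learn; the labels mirror the "send / edit / rewrite" scale used in the example later in this article:

```python
# Sketch of an inter-rater agreement check with Cohen's kappa (two raters,
# categorical labels). For more than two raters, use Fleiss' kappa instead.
from sklearn.metrics import cohen_kappa_score

rater_a = ["send", "edit", "edit", "rewrite", "send"]
rater_b = ["send", "edit", "rewrite", "rewrite", "send"]

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.2f}")  # above ~0.6 is usually treated as substantial agreement
```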
Approach 5: Golden Datasets
A golden dataset is a curated collection of input-output pairs that represents your product's real-world usage patterns. It's the foundation that makes all other evaluation approaches meaningful.
Building a Golden Dataset:
Step 1: Define coverage requirements.
Your golden dataset should cover: your core use cases (the inputs that drive most of the product's value), major edge cases (long inputs, ambiguous requests, unusual formats), adversarial inputs, and the key user segments your product serves.
Step 2: Source examples from real usage.
The best golden dataset examples come from actual user interactions. During beta testing or initial deployment, log user inputs (with consent) and curate the most representative ones. Supplement with synthetic examples for edge cases you haven't observed yet.
Step 3: Create reference outputs.
For each input, create one or more reference outputs that represent "good" quality. For tasks with objective answers (QA, classification), this is straightforward. For open-ended tasks, create examples that represent the minimum acceptable quality level.
Step 4: Version and maintain the dataset.
Your golden dataset is a living artifact. As the product evolves, user patterns change, and new failure modes are discovered, update the dataset. Version it like code -- every change should be tracked and documented.
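A sketch of what a versioned golden dataset record might look like as JSONL; the field names and example values are illustrative assumptions, not a required schema:

```python
# Sketch of a golden dataset stored as versioned JSONL next to the code that
# uses it. Field names and example values are hypothetical.
import json

EXAMPLE = {
    "id": "ticket-0042",
    "segment": "enterprise",            # slice used for per-segment reporting
    "input": "How do I add a new seat to our plan?",
    "reference": "You can add seats from the billing settings page ...",
    "tags": ["billing", "core-use-case"],
}

def load_golden_dataset(path: str) -> list[dict]:
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

# Version the file like code: track every addition or removal so evaluation
# results stay comparable across model and prompt changes over time.
```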
Golden Dataset Size Guidelines:
| Product Stage | Recommended Size | Coverage Priority |
|---|---|---|
| Prototype | 50-100 examples | Core use cases only |
| Beta | 100-200 examples | Core use cases + major edge cases |
| Production | 200-500 examples | Full coverage including adversarial inputs |
| Mature product | 500-1,000+ examples | Comprehensive coverage with segment-specific subsets |
When to Use This Framework
| Scenario | Primary Evaluation Approaches |
|---|---|
| Selecting between LLM providers for a new feature | Golden dataset + reference-based metrics + LLM-as-judge |
| Evaluating prompt engineering changes | Golden dataset + LLM-as-judge + A/B testing |
| Assessing a fine-tuned model vs. base model | Golden dataset + all five approaches |
| Monitoring production quality over time | Automated metrics (reference-based + reference-free) + periodic human evaluation |
| Setting launch acceptance criteria | Human evaluation (expert) to establish the bar; LLM-as-judge for ongoing enforcement |
When NOT to Use It
Real-World Example
Scenario: A customer success platform is building an AI feature that drafts email responses to support tickets. The PM needs to evaluate whether the generated drafts are good enough for agents to use.
Golden Dataset: The team curates 300 support tickets from their historical data, spanning 8 product areas, 4 urgency levels, and tickets from both individual users and enterprise accounts. For each ticket, a senior support agent writes a reference response.
Reference-Based Metrics: ROUGE-L scores against reference responses give a baseline. Average ROUGE-L is 0.42 -- moderate overlap, which is expected since there are many valid ways to respond to a support ticket.
LLM-as-Judge: A GPT-4 class model evaluates each draft on five criteria: accuracy (does the response contain correct product information?), tone (professional and empathetic?), completeness (addresses all parts of the customer's question?), actionability (provides clear next steps?), and conciseness (appropriate length?). Average scores: accuracy 4.2/5, tone 4.5/5, completeness 3.8/5, actionability 4.0/5, conciseness 4.3/5. The completeness score reveals that the model often misses secondary questions in multi-part tickets.
Human Evaluation: 10 senior support agents each evaluate 30 drafted responses (blind, randomized). They rate each response as "send as-is," "edit lightly and send," or "rewrite from scratch." Results: 35% send as-is, 48% edit lightly, 17% rewrite. The PM sets a launch target of < 15% rewrite rate and > 40% send as-is rate.
Action: The team improves the prompt to explicitly address multi-part questions, retests, and achieves completeness score 4.3/5 and a human evaluation of 42% send as-is, 45% edit lightly, 13% rewrite -- meeting the launch criteria.
Common Pitfalls
LLM Evaluation vs. Other Quality Approaches
| Approach | Focus | Best Used For |
|---|---|---|
| This framework (5 approaches) | Comprehensive LLM output quality | Any LLM-powered product feature |
| Traditional QA testing | Software correctness (inputs -> expected outputs) | Deterministic product features |
| A/B testing | User behavior comparison between variants | Measuring the impact of LLM features on user metrics |
| Red teaming | Adversarial testing for safety and misuse | Safety-critical LLM applications |
| Benchmarking (MMLU, HumanEval, etc.) | General model capability assessment | Model selection before product integration |
The LLM Evaluation Framework is complementary to all of these approaches. Use benchmarking for initial model selection, this framework for product-specific quality assessment, A/B testing for user impact measurement, and red teaming for safety validation. Together, they give you a complete picture of whether your LLM feature is ready for production.