Data Labeling Specification Template

A data labeling specification template with labeling guidelines, quality criteria, inter-annotator agreement targets, edge case handling, and QA review.

Updated 2026-03-04

Frequently Asked Questions

How many annotators should label each sample?
For classification tasks, two annotators per sample is the standard balance between quality and cost. Use three annotators for high-stakes labeling where errors directly impact user safety. For tasks where inter-annotator agreement consistently exceeds 90%, you can reduce to one annotator with QA sampling. The [AI PM Handbook](/ai-guide) discusses labeling strategies across different model types.
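To make that rule of thumb concrete, here is a minimal Python sketch of the decision; the function name, inputs, and the 0.9 agreement threshold are illustrative assumptions, not part of the template.

```python
from typing import Optional

def annotators_per_sample(high_stakes: bool,
                          historical_agreement: Optional[float] = None) -> int:
    """Hypothetical helper: pick an annotator count for a labeling task.

    high_stakes: labeling errors directly impact user safety.
    historical_agreement: the task's recent inter-annotator agreement
    (e.g. Cohen's kappa), or None if the task is new.
    """
    if high_stakes:
        return 3  # high-stakes: triple annotation
    if historical_agreement is not None and historical_agreement > 0.9:
        return 1  # consistently high agreement: single pass plus QA sampling
    return 2      # default balance between quality and cost
```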
What inter-annotator agreement score is good enough?
For text classification, target Cohen's kappa > 0.8 (strong agreement). For more subjective tasks like sentiment scoring or relevance ranking, kappa > 0.6 (moderate agreement) may be acceptable. If kappa is below 0.6, the guidelines are ambiguous or the task is too subjective for reliable labeling. Revise the taxonomy and guidelines before continuing.
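If you want to check this on your own labels, scikit-learn ships a `cohen_kappa_score` function. The two annotator lists below are made-up data; they agree on 8 of 10 samples and yield kappa = 0.60.

```python
from sklearn.metrics import cohen_kappa_score

# Made-up labels from two annotators on the same ten samples.
annotator_a = ["spam", "ham", "spam", "spam", "ham",
               "ham", "spam", "ham", "spam", "ham"]
annotator_b = ["spam", "ham", "spam", "ham", "ham",
               "ham", "spam", "ham", "spam", "spam"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"kappa = {kappa:.2f}")  # 0.60: acceptable only for subjective tasks
```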
How do I handle annotator disagreements?
Use a tiered approach: majority vote for 3+ annotators, senior annotator adjudication for 2-annotator setups, and ML engineer review for systematic disagreement patterns. Document every adjudication decision as an edge case in the specification. These decisions become training examples for future annotators.
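Here is a sketch of the first two tiers, assuming labels arrive as plain strings (the function name is hypothetical): a strict majority resolves the sample, and anything without a majority, including every 2-annotator split, is signaled for senior-annotator adjudication.

```python
from collections import Counter
from typing import Optional

def resolve_label(labels: list[str]) -> Optional[str]:
    """Return the strict-majority label, or None to signal that the
    sample needs senior-annotator adjudication."""
    top_label, top_count = Counter(labels).most_common(1)[0]
    return top_label if top_count > len(labels) / 2 else None

print(resolve_label(["toxic", "toxic", "safe"]))  # 'toxic' (majority vote)
print(resolve_label(["toxic", "safe"]))           # None -> adjudication queue
```

Each `None` result is an adjudication case; once resolved, log the outcome as a new edge-case entry in the specification.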
Should I use in-house or external annotators?
In-house annotators produce higher quality for domain-specific tasks (medical, legal, financial). External annotators (Scale AI, Labelbox, Surge AI) are better for high-volume, lower-complexity tasks. Hybrid approaches work well: use in-house experts to build the gold standard and guidelines, then scale with external annotators. The [AI Build vs. Buy](/tools) assessment helps evaluate this tradeoff.
How often should labeling guidelines be updated?
Update guidelines whenever inter-annotator agreement drops below your target, when you discover new edge cases, or when the model fails on a pattern that traces back to inconsistent labels. Version-control the guidelines and track which samples were labeled under which version. Re-label affected samples if a guideline change invalidates previous annotations.
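One lightweight way to implement that tracking is to stamp every label with the guideline version in effect when it was produced; the record fields, version strings, and invalidation map below are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class LabelRecord:
    sample_id: str
    label: str
    guideline_version: str  # guidelines version in effect at labeling time

# Hypothetical: guidelines v1.2 redefined a class, invalidating
# labels produced under v1.0 and v1.1.
INVALIDATED_BY = {"v1.2": {"v1.0", "v1.1"}}

def needs_relabel(record: LabelRecord, current_version: str) -> bool:
    """Flag records labeled under a version the current guidelines invalidate."""
    return record.guideline_version in INVALIDATED_BY.get(current_version, set())

records = [LabelRecord("s1", "spam", "v1.0"), LabelRecord("s2", "ham", "v1.2")]
print([r.sample_id for r in records if needs_relabel(r, "v1.2")])  # ['s1']
```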
