## What This Template Is For
Data is the foundation of every AI product. Yet most AI projects fail not because the model is wrong, but because the data is incomplete, mislabeled, biased, or stale. A 2024 Google Research study found that data quality issues account for more AI project failures than model architecture decisions.
This template helps product managers document data requirements before model development begins. It covers data sourcing, quality standards, labeling workflows, governance, and pipeline specifications. Completing this document forces the critical conversations about data availability, quality, and compliance that otherwise surface too late in development.
The AI PM Handbook covers data strategy for AI products in depth. To understand how data quality affects model outputs, see hallucination rate, a key metric to track. Use the AI Readiness Assessment to evaluate whether your organization's data infrastructure is ready for AI.
## How to Use This Template
- Start with the Data Inventory to catalog what data you have, where it lives, and what condition it is in. Most teams overestimate their data readiness.
- Define quality standards before any data collection or labeling begins. Without explicit standards, labeling quality varies by annotator and the resulting model learns noise.
- Design the labeling workflow with clear guidelines, examples, and inter-annotator agreement targets. Poor labeling guidelines are the single most common source of data quality problems.
- Document governance requirements with your legal and compliance team. Data privacy regulations like GDPR and CCPA apply to training data, not just production data.
- Specify the pipeline architecture so engineering knows exactly how data flows from source to model. Include refresh cadence, validation checks, and failure handling.
## The Template
### Data Inventory Checklist
- ☐ List all potential data sources (internal databases, APIs, public datasets, licensed data)
- ☐ Document the format, size, and update frequency of each source
- ☐ Assess data quality for each source (completeness, accuracy, freshness)
- ☐ Identify gaps between available data and required data
- ☐ Estimate effort to close each gap (collection, licensing, generation)
## Data Inventory
### Available Data Sources
| Source | Type | Format | Volume | Quality | Sensitivity | Status |
|--------|------|--------|--------|---------|-------------|--------|
| [Source 1] | [Internal DB / API / File] | [JSON/CSV/Text] | [N records] | [High/Med/Low] | [PII/PHI/Public] | [Available / Needs access] |
| [Source 2] | [Internal DB / API / File] | [JSON/CSV/Text] | [N records] | [High/Med/Low] | [PII/PHI/Public] | [Available / Needs access] |
| [Source 3] | [Public dataset / Licensed] | [JSON/CSV/Text] | [N records] | [High/Med/Low] | [PII/PHI/Public] | [Available / Needs licensing] |
### Data Gaps
| Required Data | Current Gap | Closure Strategy | Effort Estimate |
|---------------|------------|-----------------|-----------------|
| [Data type 1] | [What is missing] | [Collect / License / Generate / Augment] | [Days/Weeks] |
| [Data type 2] | [What is missing] | [Collect / License / Generate / Augment] | [Days/Weeks] |
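Filling the Quality and Volume columns of the inventory table is easiest with a quick profiling pass over each source. The sketch below is one minimal way to do that in Python, assuming each source can be loaded as a list of record dicts; the sample records and field names are illustrative, not part of the template.

```python
def profile_source(records):
    """Per-field completeness and volume for a list of record dicts."""
    n = len(records)
    fields = sorted({f for r in records for f in r})
    return {
        "volume": n,
        "completeness": {
            # Treat None and empty strings as missing values.
            f: round(sum(1 for r in records if r.get(f) not in (None, "")) / n, 3)
            for f in fields
        },
    }

# Hypothetical sample records for illustration only.
sample = [
    {"title": "ML Engineer", "location": "Berlin", "posted": "2025-01-10"},
    {"title": "Data Analyst", "location": "", "posted": "2025-02-02"},
    {"title": "PM", "location": "Remote", "posted": None},
]
profile = profile_source(sample)
```

Running this across every candidate source gives comparable numbers to back the High/Med/Low quality ratings, rather than relying on each team's self-assessment.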
### Quality Standards Checklist
- ☐ Define minimum acceptable quality for each data field
- ☐ Set completeness thresholds (what % of records must have each field)
- ☐ Define freshness requirements (maximum age of data at inference time)
- ☐ Establish deduplication rules
- ☐ Define outlier and anomaly handling policies
## Data Quality Standards
### Field-Level Requirements
| Field | Required | Completeness Target | Validation Rule | Handling if Invalid |
|-------|----------|--------------------|-----------------|--------------------|
| [Field 1] | Yes/No | [e.g., > 95%] | [e.g., Non-empty string, < 500 chars] | [Reject / Impute / Flag] |
| [Field 2] | Yes/No | [e.g., > 90%] | [e.g., Valid date, within last 2 years] | [Reject / Impute / Flag] |
### Dataset-Level Requirements
- **Minimum dataset size**: [N records for training, M for evaluation]
- **Class balance**: [Target distribution across categories]
- **Freshness**: [Data must be less than X months old]
- **Deduplication**: [Near-duplicate threshold and dedup method]
- **Diversity**: [Requirements for demographic, geographic, or topical diversity]
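The field-level table above translates directly into executable checks. Here is a minimal sketch, assuming two hypothetical fields (`summary` and `posted_date`) with rules like those in the example rows; the "Handling if Invalid" policy would decide what the pipeline does with the flagged fields.

```python
import re

# Illustrative rules mirroring the field-level table: each field maps to a
# predicate that returns True for valid values. Field names are placeholders.
RULES = {
    "summary": lambda v: isinstance(v, str) and 0 < len(v) <= 500,
    "posted_date": lambda v: isinstance(v, str)
        and re.fullmatch(r"\d{4}-\d{2}-\d{2}", v) is not None,
}

def invalid_fields(record):
    """Return the fields in `record` that fail their validation rule."""
    return [f for f, rule in RULES.items() if not rule(record.get(f))]

def completeness(records, field):
    """Share of records with a non-empty value for `field` (dataset-level check)."""
    return sum(1 for r in records if r.get(field) not in (None, "")) / len(records)
```

Comparing `completeness(...)` against the targets in the table gives a pass/fail signal per field before any training run starts.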
### Labeling Workflow Checklist
- ☐ Write labeling guidelines with definitions, examples, and edge case decisions
- ☐ Define the label taxonomy (categories, tags, scores, or spans)
- ☐ Choose labeling approach (in-house, outsourced, or automated pre-labeling)
- ☐ Set inter-annotator agreement target (e.g., Cohen's kappa > 0.8)
- ☐ Design quality control process (double-labeling, spot checks, consensus resolution)
## Labeling Specification
### Label Taxonomy
| Label | Definition | Example | Edge Case Guidance |
|-------|-----------|---------|-------------------|
| [Label A] | [Clear definition] | [Concrete example] | [How to handle ambiguous cases] |
| [Label B] | [Clear definition] | [Concrete example] | [How to handle ambiguous cases] |
### Labeling Process
- **Method**: [In-house team / Outsourced / Automated pre-labeling + human review]
- **Annotators**: [Number and qualifications required]
- **Inter-annotator agreement target**: [Kappa > X or % agreement > Y]
- **Quality control**: [Double-label X% of data, review disagreements weekly]
- **Estimated throughput**: [N labels per annotator per hour]
- **Total labeling effort**: [N annotations ÷ throughput = annotator-hours; annotator-hours × hourly rate = cost estimate]
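The inter-annotator agreement target above is typically checked with Cohen's kappa, which corrects raw percent agreement for agreement expected by chance. A self-contained sketch (libraries such as scikit-learn provide an equivalent `cohen_kappa_score`):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of each label's marginal frequencies.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1:  # both annotators used a single label throughout
        return 1.0
    return (observed - expected) / (1 - expected)
```

Computing kappa on the double-labeled slice each week catches guideline drift early; a falling kappa usually means the guidelines, not the annotators, need fixing.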
### Data Governance Checklist
- ☐ Identify PII and sensitive data fields, and plan anonymization or removal
- ☐ Verify licensing terms for all third-party data sources
- ☐ Define data retention and deletion policies
- ☐ Establish data access controls (who can access what, and how)
- ☐ Document compliance requirements (GDPR, CCPA, HIPAA, industry-specific)
- ☐ Create a data lineage record (source to model input, traceable)
## Data Governance
### Privacy and Compliance
| Regulation | Applies? | Compliance Action | Owner |
|-----------|----------|-------------------|-------|
| GDPR | Yes/No | [PII removal / Consent collection / DPA signed] | [Name] |
| CCPA | Yes/No | [Data inventory / Opt-out mechanism] | [Name] |
| HIPAA | Yes/No | [PHI de-identification / BAA signed] | [Name] |
| [Industry-specific] | Yes/No | [Specific action required] | [Name] |
### Data Access Controls
| Role | Access Level | Justification |
|------|-------------|---------------|
| ML Engineer | Full training data access | Model development |
| PM | Aggregated metrics only | Product decisions |
| Analyst | Anonymized sample | Quality analysis |
### Retention Policy
- **Training data**: [Retain for X months after model retirement]
- **Evaluation data**: [Retain for X months after evaluation]
- **User data used for training**: [Right to deletion within X days of request]
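The "PII removal" compliance action above often takes the form of a pseudonymization step before records enter the training pipeline. This is a minimal sketch under assumed field names (`candidate_id`, `name`, `email`, `phone`); a real implementation would also scrub PII from free-text fields and keep the salt out of the training environment.

```python
import hashlib

SALT = b"rotate-me-per-dataset"  # assumption: stored outside the training environment
DROP_FIELDS = {"name", "email", "phone"}  # illustrative direct identifiers

def anonymize(record):
    """Drop direct identifiers and replace the ID with a salted, truncated hash.

    The hash is deterministic, so records from the same candidate still join
    across sources without exposing the raw identifier.
    """
    out = {k: v for k, v in record.items() if k not in DROP_FIELDS}
    digest = hashlib.sha256(SALT + str(record["candidate_id"]).encode()).hexdigest()
    out["candidate_id"] = digest[:16]
    return out
```

Note that pseudonymized data can still be personal data under GDPR; the retention and deletion policies above apply to it as well.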
### Pipeline Specification Checklist
- ☐ Document the data flow from source to model input
- ☐ Define transformation and preprocessing steps
- ☐ Specify storage requirements (vector DB, feature store, cache)
- ☐ Set refresh cadence and triggering mechanism
- ☐ Define monitoring and alerting for pipeline health
- ☐ Document failure handling and recovery procedures
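The checklist's validation and failure-handling items can be sketched as one refresh cycle that quarantines bad records instead of failing the whole batch. The four callables are stand-ins for whatever the real pipeline supplies (a DB extract, the validation rules from the quality section, a feature-store load, a quarantine bucket).

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("data-pipeline")

def refresh(extract, validate, load, quarantine):
    """One refresh cycle: extract -> validate -> load.

    Records that fail validation are quarantined and logged for alerting,
    so one bad record never blocks the rest of the batch.
    Returns (loaded_count, quarantined_count) for pipeline health metrics.
    """
    batch = extract()
    good, bad = [], []
    for record in batch:
        (good if validate(record) else bad).append(record)
    if bad:
        quarantine(bad)
        log.warning("quarantined %d of %d records", len(bad), len(batch))
    load(good)
    return len(good), len(bad)
```

The returned counts feed the monitoring item above: alert when the quarantine ratio for a cycle crosses a threshold, and trigger the documented recovery procedure.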
## Filled Example
**Product**: AI-powered job matching platform that recommends candidates to hiring managers.

**Data Inventory Summary**:
- Source 1: Internal applicant database (2.3M profiles, structured, high quality, contains PII)
- Source 2: Job posting corpus (180K active postings, semi-structured, medium quality)
- Source 3: Hiring outcome data (420K decisions, structured, high quality, 18 months of history)
- Gap: Industry skill taxonomy. Strategy: License O*NET data and map to internal categories. Effort: 2 weeks.
**Labeling Spec**: 10,000 candidate-job pairs labeled as Strong Match / Moderate Match / Weak Match / No Match. Two in-house recruiters label each pair. Inter-annotator agreement target: kappa > 0.75. Estimated throughput: 40 pairs per hour per annotator. Total effort: 10,000 pairs × 2 annotators ÷ 40 pairs per hour = 500 annotator-hours.
**Governance**: GDPR applies (EU candidates). All PII anonymized before entering the training pipeline. Candidate consent collected during application. Data retention: 24 months after last interaction. Right-to-deletion honored within 30 days.
