
AI Data Requirements Template

A template for documenting AI training and evaluation data requirements including sources, quality standards, labeling guidelines, governance policies, and pipeline specifications.

By Tim Adair • Last updated 2026-03-04


What This Template Is For

Data is the foundation of every AI product. Yet most AI projects fail not because the model is wrong, but because the data is incomplete, mislabeled, biased, or stale. A 2024 Google Research study found that data quality issues account for more AI project failures than model architecture decisions.

This template helps product managers document data requirements before model development begins. It covers data sourcing, quality standards, labeling workflows, governance, and pipeline specifications. Completing this document forces the critical conversations about data availability, quality, and compliance that otherwise surface too late in development.

The AI PM Handbook covers data strategy for AI products in depth. For understanding how data quality affects model outputs, see hallucination rate as a key metric to track. Use the AI Readiness Assessment to evaluate whether your organization's data infrastructure is ready for AI.

How to Use This Template

  1. Start with the Data Inventory to catalog what data you have, where it lives, and what condition it is in. Most teams overestimate their data readiness.
  2. Define quality standards before any data collection or labeling begins. Without explicit standards, labeling quality varies by annotator and the resulting model learns noise.
  3. Design the labeling workflow with clear guidelines, examples, and inter-annotator agreement targets. Poor labeling guidelines are the single most common source of data quality problems.
  4. Document governance requirements with your legal and compliance team. Data privacy regulations like GDPR and CCPA apply to training data, not just production data.
  5. Specify the pipeline architecture so engineering knows exactly how data flows from source to model. Include refresh cadence, validation checks, and failure handling.

The Template

Data Inventory

  • List all potential data sources (internal databases, APIs, public datasets, licensed data)
  • Document the format, size, and update frequency of each source
  • Assess data quality for each source (completeness, accuracy, freshness)
  • Identify gaps between available data and required data
  • Estimate effort to close each gap (collection, licensing, generation)
## Data Inventory

### Available Data Sources
| Source | Type | Format | Volume | Quality | Sensitivity | Status |
|--------|------|--------|--------|---------|-------------|--------|
| [Source 1] | [Internal DB / API / File] | [JSON/CSV/Text] | [N records] | [High/Med/Low] | [PII/PHI/Public] | [Available / Needs access] |
| [Source 2] | [Internal DB / API / File] | [JSON/CSV/Text] | [N records] | [High/Med/Low] | [PII/PHI/Public] | [Available / Needs access] |
| [Source 3] | [Public dataset / Licensed] | [JSON/CSV/Text] | [N records] | [High/Med/Low] | [PII/PHI/Public] | [Available / Needs licensing] |

### Data Gaps
| Required Data | Current Gap | Closure Strategy | Effort Estimate |
|---------------|------------|-----------------|-----------------|
| [Data type 1] | [What is missing] | [Collect / License / Generate / Augment] | [Days/Weeks] |
| [Data type 2] | [What is missing] | [Collect / License / Generate / Augment] | [Days/Weeks] |
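The inventory and gap tables above can also be kept as structured records so that blockers surface programmatically. A minimal sketch, with hypothetical source names, fields, and thresholds:

```python
from dataclasses import dataclass

@dataclass
class DataSource:
    name: str
    source_type: str   # e.g. "internal_db", "api", "public", "licensed"
    volume: int        # record count
    quality: str       # "high", "medium", "low"
    sensitivity: str   # "pii", "phi", "public"
    available: bool

# Illustrative inventory; mirrors the filled example later in this template.
sources = [
    DataSource("applicant_db", "internal_db", 2_300_000, "high", "pii", True),
    DataSource("job_postings", "api", 180_000, "medium", "public", True),
    DataSource("skill_taxonomy", "licensed", 0, "low", "public", False),
]

# Sources that block model development: unavailable or low quality.
blockers = [s.name for s in sources if not s.available or s.quality == "low"]
print(blockers)  # ['skill_taxonomy']
```

Keeping the inventory in code (or a spreadsheet exported to code) makes it easy to re-check readiness as sources change.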

Quality Standards

  • Define minimum acceptable quality for each data field
  • Set completeness thresholds (what % of records must have each field)
  • Define freshness requirements (maximum age of data at inference time)
  • Establish deduplication rules
  • Define outlier and anomaly handling policies
## Data Quality Standards

### Field-Level Requirements
| Field | Required | Completeness Target | Validation Rule | Handling if Invalid |
|-------|----------|--------------------|-----------------|--------------------|
| [Field 1] | Yes/No | [e.g., > 95%] | [e.g., Non-empty string, < 500 chars] | [Reject / Impute / Flag] |
| [Field 2] | Yes/No | [e.g., > 90%] | [e.g., Valid date, within last 2 years] | [Reject / Impute / Flag] |

### Dataset-Level Requirements
- **Minimum dataset size**: [N records for training, M for evaluation]
- **Class balance**: [Target distribution across categories]
- **Freshness**: [Data must be less than X months old]
- **Deduplication**: [Near-duplicate threshold and dedup method]
- **Diversity**: [Requirements for demographic, geographic, or topical diversity]
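Field-level completeness targets like those in the table can be enforced with a simple validation pass before data reaches training. A sketch, assuming hypothetical records, field names, and thresholds:

```python
# Example records; an empty string or None counts as a missing value.
records = [
    {"title": "Data engineer", "posted": "2026-01-10"},
    {"title": "", "posted": "2026-02-02"},
    {"title": "ML engineer", "posted": None},
    {"title": "Analyst", "posted": "2025-12-20"},
]

# Completeness targets per field, as fractions (e.g. 0.95 = > 95% filled).
completeness_targets = {"title": 0.95, "posted": 0.90}

def completeness(field):
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)

# Fields that fall below their target, with the observed rate.
failures = {
    field: completeness(field)
    for field, target in completeness_targets.items()
    if completeness(field) < target
}
print(failures)  # both fields are 75% complete here, below target
```

In practice this check runs inside the data pipeline so that batches failing the thresholds never reach the model.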

Labeling Workflow

  • Write labeling guidelines with definitions, examples, and edge case decisions
  • Define the label taxonomy (categories, tags, scores, or spans)
  • Choose labeling approach (in-house, outsourced, or automated pre-labeling)
  • Set inter-annotator agreement target (e.g., Cohen's kappa > 0.8)
  • Design quality control process (double-labeling, spot checks, consensus resolution)
## Labeling Specification

### Label Taxonomy
| Label | Definition | Example | Edge Case Guidance |
|-------|-----------|---------|-------------------|
| [Label A] | [Clear definition] | [Concrete example] | [How to handle ambiguous cases] |
| [Label B] | [Clear definition] | [Concrete example] | [How to handle ambiguous cases] |

### Labeling Process
- **Method**: [In-house team / Outsourced / Automated pre-labeling + human review]
- **Annotators**: [Number and qualifications required]
- **Inter-annotator agreement target**: [Kappa > X or % agreement > Y]
- **Quality control**: [Double-label X% of data, review disagreements weekly]
- **Estimated throughput**: [N labels per annotator per hour]
- **Total labeling effort**: [N records x M hours = total cost estimate]
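The inter-annotator agreement target can be checked with Cohen's kappa, which corrects raw agreement for the agreement expected by chance. A self-contained sketch for two annotators, with hypothetical labels:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label rates.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        (counts_a[k] / n) * (counts_b[k] / n)
        for k in set(counts_a) | set(counts_b)
    )
    return (observed - expected) / (1 - expected)

a = ["strong", "strong", "weak", "none", "moderate", "strong"]
b = ["strong", "moderate", "weak", "none", "moderate", "strong"]

kappa = cohens_kappa(a, b)
print(round(kappa, 3))  # 0.769 — below a 0.8 target, so guidelines need work
```

When kappa falls short of the target, revisit the label definitions and edge case guidance before relabeling; low agreement usually means the taxonomy is ambiguous, not that the annotators are careless.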

Data Governance

  • Identify PII and sensitive data fields. Plan anonymization or removal
  • Verify licensing terms for all third-party data sources
  • Define data retention and deletion policies
  • Establish data access controls (who can access what, and how)
  • Document compliance requirements (GDPR, CCPA, HIPAA, industry-specific)
  • Create a data lineage record (source to model input, traceable)
## Data Governance

### Privacy and Compliance
| Regulation | Applies? | Compliance Action | Owner |
|-----------|----------|-------------------|-------|
| GDPR | Yes/No | [PII removal / Consent collection / DPA signed] | [Name] |
| CCPA | Yes/No | [Data inventory / Opt-out mechanism] | [Name] |
| HIPAA | Yes/No | [PHI de-identification / BAA signed] | [Name] |
| [Industry-specific] | Yes/No | [Specific action required] | [Name] |

### Data Access Controls
| Role | Access Level | Justification |
|------|-------------|---------------|
| ML Engineer | Full training data access | Model development |
| PM | Aggregated metrics only | Product decisions |
| Analyst | Anonymized sample | Quality analysis |

### Retention Policy
- **Training data**: [Retain for X months after model retirement]
- **Evaluation data**: [Retain for X months after evaluation]
- **User data used for training**: [Right to deletion within X days of request]
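One common compliance action from the table above is removing or pseudonymizing direct identifiers before records enter the training pipeline. A minimal sketch using a salted hash; the salt value and field names are illustrative, and note that hashing direct identifiers alone is pseudonymization, not full anonymization (quasi-identifiers may still need handling):

```python
import hashlib

# Assumption: in a real system the salt lives in a secrets manager,
# never in source code.
SALT = b"rotate-me-and-store-in-a-secrets-manager"

def pseudonymize(value: str) -> str:
    """Replace an identifier with a stable, non-reversible token.

    The same input always yields the same token, so records remain
    joinable across tables without exposing the raw value.
    """
    return hashlib.sha256(SALT + value.encode()).hexdigest()[:16]

record = {"email": "jane@example.com", "skills": ["python", "sql"]}
record["email"] = pseudonymize(record["email"])
print(record["email"])  # stable 16-char token, not the raw address
```

Because the token is deterministic, right-to-deletion requests must still be honored by deleting the underlying records, not just the mapping.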

Pipeline Specification

  • Document the data flow from source to model input
  • Define transformation and preprocessing steps
  • Specify storage requirements (vector DB, feature store, cache)
  • Set refresh cadence and triggering mechanism
  • Define monitoring and alerting for pipeline health
  • Document failure handling and recovery procedures
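The validation and failure-handling steps above can be sketched as a gate in the refresh pipeline: each batch passes checks before loading, and failures route to a quarantine path instead of the model. Check names and thresholds here are assumptions for illustration:

```python
def validate_batch(batch):
    """Return a list of validation errors; empty list means the batch passes."""
    errors = []
    if len(batch) == 0:
        errors.append("empty batch")
    # Example check: at most 1% of records may be missing an id.
    missing_ids = sum(1 for r in batch if not r.get("id"))
    if missing_ids / max(len(batch), 1) > 0.01:
        errors.append("id completeness below 99%")
    return errors

def run_refresh(batch):
    """Load a batch if it validates; otherwise quarantine it for review."""
    errors = validate_batch(batch)
    if errors:
        return {"status": "quarantined", "errors": errors}
    return {"status": "loaded", "records": len(batch)}

print(run_refresh([{"id": 1}, {"id": 2}]))  # loaded
print(run_refresh([{"id": 1}, {}]))         # quarantined
```

A quarantined batch should trigger an alert rather than fail silently; the monitoring section of the spec defines who gets paged and how recovery works.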

Filled Example

Product: AI-powered job matching platform that recommends candidates to hiring managers.

Data Inventory Summary:

  • Source 1: Internal applicant database (2.3M profiles, structured, high quality, contains PII)
  • Source 2: Job posting corpus (180K active postings, semi-structured, medium quality)
  • Source 3: Hiring outcome data (420K decisions, structured, high quality, 18 months of history)
  • Gap: Industry skill taxonomy. Strategy: License O*NET data and map to internal categories. Effort: 2 weeks.

Labeling Spec: 10,000 candidate-job pairs labeled as Strong Match / Moderate Match / Weak Match / No Match. Two in-house recruiters label each pair. Inter-annotator agreement target: kappa > 0.75. Estimated throughput: 40 pairs per hour per annotator. Total effort: 250 annotator-hours.

Governance: GDPR applies (EU candidates). All PII anonymized before entering the training pipeline. Candidate consent collected during application. Data retention: 24 months after last interaction. Right-to-deletion honored within 30 days.

Frequently Asked Questions

How much training data do I need?
It depends on the task complexity and model type. Fine-tuning an LLM for classification might need 500-2,000 labeled examples. Training a custom ML model from scratch might need 50,000+. Start with the minimum viable dataset, evaluate model performance, and add more data where the model struggles. Quality matters more than quantity.
Should I use synthetic data to fill gaps in my dataset?
Synthetic data can supplement real data but should not replace it entirely. Use synthetic data for edge cases and minority classes where real examples are scarce. Always validate that synthetic data does not introduce [biases or hallucination patterns](/glossary/hallucination) that do not exist in real data. Label synthetic examples separately so you can measure their impact.
Who owns the data requirements document?
The PM owns the document. Data engineers own the pipeline specification sections. ML engineers own the quality standards and labeling sections. Legal owns the governance sections. The PM's job is to ensure all sections are complete and consistent, not to write every section alone.
How do I handle data that changes over time?
Define a refresh cadence for each data source based on how quickly the underlying data changes. Customer behavior data might need daily refreshes. Industry benchmarks might be quarterly. Build monitoring that alerts when data distribution shifts significantly from the training distribution. This is called data drift, and it degrades model performance silently.
What if my data sources have conflicting information?
Define a priority order for data sources and document conflict resolution rules. For example: verified user-submitted data overrides inferred data, which overrides default values. Log conflicts for analysis. If conflict rates exceed 5%, investigate the root cause rather than relying on resolution rules.
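That priority-order rule can be stated as a few lines of code, which also makes a useful artifact to attach to the governance section. Source tiers and field names below are illustrative:

```python
# Highest-priority tier first: verified beats inferred beats default.
PRIORITY = ["verified", "inferred", "default"]

def resolve(candidates):
    """Pick one value from a list of (source_tier, value) pairs.

    Returns the value from the highest-priority tier that has a
    non-null value, or None if nothing is usable.
    """
    for tier in PRIORITY:
        for source, value in candidates:
            if source == tier and value is not None:
                return value
    return None

location = resolve([
    ("default", "Unknown"),
    ("inferred", "Berlin"),
    ("verified", "Munich"),
])
print(location)  # 'Munich' — the verified value wins
```

Logging which tier won for each field gives you the conflict-rate data needed to know when the 5% investigation threshold is crossed.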
