What This Template Is For
Most AI products exist on a spectrum between fully autonomous and fully human-controlled. Finding the right balance is a critical product design decision. Too much automation and you risk errors at scale. Too much human review and you lose the efficiency gains that justified building AI in the first place. Human-in-the-loop (HITL) design is the discipline of deciding when AI acts alone, when it asks for help, and how it learns from human corrections.
This template helps product teams design the human oversight layer for AI features. It covers escalation triggers, review queue design, feedback loop architecture, quality thresholds for automation levels, and the path toward increasing autonomy over time. The AI PM Handbook covers human-AI interaction design in depth. For glossary-level definitions, see the human-AI interaction glossary entry and the AI alignment glossary entry. If you need to assess the overall safety posture of your AI feature, the AI Ethics Scanner evaluates ethical risks that inform HITL design decisions. The explainability glossary entry covers how to make AI decisions transparent to human reviewers.
When to Use This Template
- You are building an AI feature where errors have real consequences (financial, legal, safety)
- Regulations require human oversight for AI-driven decisions
- Your AI model is not yet reliable enough for full autonomy
- You are designing the review workflow for content moderation, customer support, or medical AI
- You want a structured plan for gradually increasing AI autonomy as quality improves
How to Use This Template
- Define the automation levels for your feature from fully manual to fully autonomous
- Set the confidence thresholds that determine when AI acts alone vs escalates to humans
- Design the review queue with clear SLAs and reviewer roles
- Build feedback loops so human corrections improve the model over time
- Document the graduation criteria for moving from one automation level to the next
- Monitor human reviewer capacity and quality alongside AI performance
The Template
# Human-in-the-Loop Design Specification
**Feature**: [Name]
**Owner**: [Name and role]
**Date**: [Date]
**Current Automation Level**: [1-5, see below]
**Target Automation Level**: [1-5, with timeline]
---
## 1. Automation Levels
| Level | Name | Description | AI Role | Human Role |
|-------|------|-------------|---------|-----------|
| 1 | Manual | No AI involvement | None | Does everything |
| 2 | AI-Assisted | AI suggests, human decides | Generates suggestions | Reviews and selects |
| 3 | AI-First | AI drafts, human approves before it takes effect | Drafts output | Approves or edits before publishing |
| 4 | AI-Autonomous with Audit | AI acts immediately, human samples after | Executes fully | Reviews samples, handles escalations |
| 5 | Fully Autonomous | AI handles everything, humans handle exceptions | Executes and monitors | Investigates anomalies only |
**Current level**: [1-5]
**Target level**: [1-5]
**Timeline to target**: [Months/Quarters]
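The five levels above can be encoded directly so that routing and graduation logic share one definition. A minimal sketch (the enum names are illustrative, not a standard):

```python
from enum import IntEnum

class AutomationLevel(IntEnum):
    """The five automation levels from the table above."""
    MANUAL = 1            # no AI involvement
    AI_ASSISTED = 2       # AI suggests, human decides
    AI_FIRST = 3          # AI drafts, human approves before effect
    AUTONOMOUS_AUDIT = 4  # AI acts immediately, humans sample after
    FULLY_AUTONOMOUS = 5  # humans investigate anomalies only

# IntEnum ordering lets graduation/regression logic compare levels directly.
current = AutomationLevel.AI_ASSISTED
target = AutomationLevel.AUTONOMOUS_AUDIT
assert current < target
```

Using `IntEnum` rather than plain constants means "increase automation level" and "regress one level" become simple integer comparisons.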
---
## 2. Escalation Triggers
### Confidence-Based Escalation
| Confidence Range | Action | Rationale |
|-----------------|--------|-----------|
| > [X]% | AI acts autonomously | High confidence, low error rate |
| [Y]% - [X]% | AI suggests, human confirms | Moderate confidence, review recommended |
| < [Y]% | Escalate to human fully | Low confidence, human judgment needed |
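The three confidence bands reduce to a small routing function. A sketch with example thresholds borrowed from the filled example later in this template (0.92 and 0.75); calibrate the `[X]`/`[Y]` values on your own error-rate data:

```python
from enum import Enum

class Route(Enum):
    AUTONOMOUS = "ai_acts_alone"
    HUMAN_CONFIRM = "ai_suggests_human_confirms"
    FULL_ESCALATION = "human_handles_fully"

# Example thresholds only -- calibrate against observed error rates.
HIGH_CONFIDENCE = 0.92  # the [X]% threshold
LOW_CONFIDENCE = 0.75   # the [Y]% threshold

def route_by_confidence(confidence: float) -> Route:
    """Map a model confidence score to one of the three escalation bands."""
    if confidence > HIGH_CONFIDENCE:
        return Route.AUTONOMOUS
    if confidence >= LOW_CONFIDENCE:
        return Route.HUMAN_CONFIRM
    return Route.FULL_ESCALATION
```

Note the boundary behavior: a score exactly at the high threshold still goes to human confirmation, matching the table's `> [X]%` condition for autonomous action.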
### Rule-Based Escalation (Always Escalate When)
- [ ] Output involves financial amounts above $[X]
- [ ] Output affects more than [N] users simultaneously
- [ ] Request involves sensitive topics: [list categories]
- [ ] User explicitly requests human review
- [ ] Output contradicts a previous output for the same user
- [ ] AI detects potential bias or fairness concern
- [ ] Task type the model has not handled before
### Volume-Based Escalation
- [ ] If AI error rate exceeds [X]% over [time window], escalate all tasks
- [ ] If human reviewer queue exceeds [N] items, alert capacity manager
- [ ] If AI throughput drops below [N] tasks/hour, investigate system health
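The first volume rule (escalate everything when the error rate spikes) can be implemented as a sliding-window monitor. A sketch under assumed defaults (window of 500 outcomes, 5% threshold -- both placeholders for your `[X]%` and `[time window]`):

```python
from collections import deque

class VolumeEscalationMonitor:
    """Tracks recent AI outcomes and flags when the error rate over a
    sliding window exceeds the escalate-all-tasks threshold."""

    def __init__(self, window_size: int = 500, max_error_rate: float = 0.05):
        self.outcomes: deque[bool] = deque(maxlen=window_size)  # True = error
        self.max_error_rate = max_error_rate

    def record(self, was_error: bool) -> None:
        self.outcomes.append(was_error)

    def should_escalate_all(self) -> bool:
        # Require a half-full window before acting, so a handful of early
        # errors doesn't trip the alarm.
        if len(self.outcomes) < self.outcomes.maxlen // 2:
            return False
        rate = sum(self.outcomes) / len(self.outcomes)
        return rate > self.max_error_rate
```

The window is measured in task count here; a time-based window works the same way with timestamped outcomes pruned on each check.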
---
## 3. Review Queue Design
### Queue Architecture
- **Queue platform**: [Internal tool / Retool / Custom / Third-party]
- **Queue capacity**: [N tasks per reviewer per hour]
- **SLA**: [Time from escalation to human review]
- **Priority levels**: [How tasks are ordered in the queue]
### Reviewer Roles
| Role | Responsibilities | Required Skills | Capacity |
|------|-----------------|----------------|----------|
| [L1 Reviewer] | [Handle routine escalations] | [Domain knowledge] | [N tasks/hour] |
| [L2 Reviewer] | [Handle complex/ambiguous cases] | [Deep expertise] | [N tasks/hour] |
| [L3 Expert] | [Final authority on disputed cases] | [Senior domain expert] | [On-demand] |
### Review Interface Requirements
- [ ] Show AI output alongside source data for context
- [ ] Show AI confidence score and reasoning (if available)
- [ ] Allow reviewer to approve, edit, or reject
- [ ] Capture reviewer's correction with structured reason codes
- [ ] Show previous AI outputs for the same user/context
- [ ] Timer for SLA tracking
### Queue Metrics
| Metric | Target | Alert Threshold |
|--------|--------|----------------|
| Average review time | < [X] minutes | > [X] minutes |
| Queue depth | < [N] items | > [N] items |
| SLA compliance | > [X]% within SLA | < [X]% |
| Reviewer agreement rate | > [X]% | < [X]% |
| Reviewer throughput | [N] tasks/hour | < [N] tasks/hour |
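The queue metrics above map to straightforward alert checks. A minimal sketch (function names and the dict-of-alerts shape are illustrative):

```python
def sla_compliance(review_seconds: list[float], sla_seconds: float) -> float:
    """Fraction of reviews completed within the SLA window."""
    if not review_seconds:
        return 1.0  # empty queue trivially meets SLA
    within = sum(1 for t in review_seconds if t <= sla_seconds)
    return within / len(review_seconds)

def queue_alerts(queue_depth: int, max_depth: int,
                 compliance: float, min_compliance: float) -> list[str]:
    """Return the alert conditions currently firing, mirroring the
    Alert Threshold column of the metrics table."""
    alerts = []
    if queue_depth > max_depth:
        alerts.append("queue_depth")
    if compliance < min_compliance:
        alerts.append("sla_compliance")
    return alerts
```

In practice these checks would run on a schedule and page the capacity manager named in the volume-based escalation rules.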
---
## 4. Feedback Loop Architecture
### Correction Capture
When a reviewer corrects an AI output, capture:
- [ ] Original AI output
- [ ] Reviewer's corrected output
- [ ] Reason code (from standardized list)
- [ ] Free-text explanation (optional)
- [ ] Confidence score the AI assigned
- [ ] Time spent on review
### Reason Code Taxonomy
| Code | Description | Example |
|------|-------------|---------|
| WRONG_FACT | AI stated incorrect information | Wrong product name or feature |
| WRONG_TONE | AI used inappropriate tone | Too casual for formal context |
| WRONG_FORMAT | AI output format was incorrect | Missing required fields |
| INCOMPLETE | AI missed relevant information | Left out key details |
| OFF_TOPIC | AI addressed wrong topic | Answered a different question |
| UNSAFE | AI output contained harmful content | Inappropriate recommendation |
| CORRECT | AI output was correct (confirm) | Approved with no changes |
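The correction-capture checklist and the reason-code taxonomy together define a record schema. A sketch as a dataclass (field names are illustrative; the reason codes match the table above):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional

class ReasonCode(Enum):
    WRONG_FACT = "wrong_fact"      # AI stated incorrect information
    WRONG_TONE = "wrong_tone"      # inappropriate tone for context
    WRONG_FORMAT = "wrong_format"  # missing required fields, bad layout
    INCOMPLETE = "incomplete"      # missed relevant information
    OFF_TOPIC = "off_topic"        # addressed the wrong topic
    UNSAFE = "unsafe"              # harmful content
    CORRECT = "correct"            # approved with no changes

@dataclass
class CorrectionRecord:
    """One reviewer decision, matching the correction-capture checklist."""
    original_output: str
    corrected_output: str
    reason_code: ReasonCode
    ai_confidence: float     # the confidence score the AI assigned
    review_seconds: float    # time spent on review
    explanation: Optional[str] = None  # free-text, optional
    reviewed_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))
```

Keeping `CORRECT` as an explicit code (rather than the absence of a record) lets you compute the correction rate as corrections divided by total reviews, not just count failures.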
### Feedback-to-Training Pipeline
1. Corrections are stored in [correction database]
2. Every [cadence], corrections are reviewed for training data quality
3. High-quality corrections are added to the fine-tuning dataset
4. Model is retrained with updated data every [cadence]
5. New model is evaluated against the same test set
6. If improved, deploy new model and monitor
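Step 3 of the pipeline (selecting high-quality corrections) can be sketched as a filter. The heuristics and dict keys here are illustrative assumptions, not a prescribed schema:

```python
def select_training_corrections(corrections: list[dict],
                                min_review_seconds: float = 10.0) -> list[dict]:
    """Filter reviewer corrections into fine-tuning candidates.

    Each correction dict is assumed to carry 'reason_code', 'original',
    'corrected', and 'review_seconds'. Heuristics: skip confirmations
    with no change, route unsafe outputs to a separate safety review,
    and drop rushed reviews that may be rubber-stamps.
    """
    keep = []
    for c in corrections:
        if c["reason_code"] == "CORRECT":
            continue  # nothing to learn from an unchanged output
        if c["reason_code"] == "UNSAFE":
            continue  # handled by a dedicated safety pipeline
        if c["review_seconds"] < min_review_seconds:
            continue  # likely too fast to be a careful review
        keep.append({"input": c["original"], "target": c["corrected"]})
    return keep
```

Whatever filters you choose, log how many corrections each one drops; a filter that silently discards most of the queue hides exactly the weakness the feedback loop exists to find.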
### Feedback Metrics
| Metric | What It Tells You |
|--------|------------------|
| Correction rate | How often humans change AI outputs (should decrease over time) |
| Correction type distribution | Where the model is weakest (guides training data focus) |
| Time between correction and model improvement | How fast the feedback loop closes |
| Post-retraining correction rate | Whether retraining actually helped |
---
## 5. Capacity Planning
### Current State
| Metric | Value |
|--------|-------|
| Total AI tasks per day | [N] |
| Escalation rate | [X]% |
| Tasks requiring human review | [N per day] |
| Available reviewers | [N people, Y hours/day each] |
| Current capacity utilization | [X]% |
### Scaling Scenarios
| Scenario | AI Tasks/Day | Escalation Rate | Human Reviews/Day | Reviewers Needed |
|----------|-------------|-----------------|-------------------|-----------------|
| Current | [N] | [X]% | [N] | [N] |
| 2x volume | [N] | [X]% | [N] | [N] |
| 5x volume | [N] | [X]% | [N] | [N] |
| Target automation level | [N] | [X]% | [N] | [N] |
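The "Reviewers Needed" column of each scenario row is a simple capacity calculation. A sketch with assumed defaults (6 productive hours per reviewer per day and an 80% utilization target, to leave slack for breaks, training, and spikes):

```python
import math

def reviewers_needed(tasks_per_day: float,
                     escalation_rate: float,
                     reviewer_tasks_per_hour: float,
                     hours_per_reviewer_per_day: float = 6.0,
                     utilization_target: float = 0.8) -> int:
    """Headcount needed for one scaling-scenario row."""
    reviews_per_day = tasks_per_day * escalation_rate
    capacity_per_reviewer = (reviewer_tasks_per_hour
                             * hours_per_reviewer_per_day
                             * utilization_target)
    return math.ceil(reviews_per_day / capacity_per_reviewer)

# Example: 10,000 tasks/day at a 12% escalation rate with reviewers
# handling 15 tasks/hour -> 1,200 reviews/day over 72 reviews/reviewer/day,
# i.e. 17 reviewers.
```

Running this for the 2x and 5x rows before launch turns the capacity gap plan from a guess into a hiring number.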
### Capacity Gap Plan
- If volume exceeds reviewer capacity: [Raise confidence threshold / Hire / Queue management]
- If reviewer quality drops: [Reduce queue load / Add L2 support / Retrain reviewers]
---
## 6. Graduation Criteria
### Requirements to Increase Automation Level
| Transition | Required Metrics | Required Duration |
|-----------|------------------|-------------------|
| 2 → 3 | [AI accuracy > X%, false negative rate < Y%] | [N weeks at target] |
| 3 → 4 | [AI accuracy > X%, zero critical errors in N weeks] | [N weeks at target] |
| 4 → 5 | [AI accuracy > X%, human audit finds < Y% errors] | [N months at target] |
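Each row of the graduation table can be evaluated mechanically before the governance review. A sketch; the metric names and threshold shapes are illustrative, substitute whichever metrics your rows actually specify:

```python
def may_graduate(metrics: dict, criteria: dict, weeks_at_target: int) -> bool:
    """Check one graduation-table row against measured metrics.

    metrics: measured values, e.g. {"accuracy": 0.96, "critical_errors": 0}
    criteria: thresholds plus the required sustained duration.
    """
    if weeks_at_target < criteria["required_weeks"]:
        return False  # targets must be *sustained*, not hit once
    if metrics["accuracy"] < criteria["min_accuracy"]:
        return False
    if metrics["critical_errors"] > criteria["max_critical_errors"]:
        return False
    return True

# Example thresholds for a level 3 -> 4 transition (illustrative values).
criteria_3_to_4 = {"min_accuracy": 0.95,
                   "max_critical_errors": 0,
                   "required_weeks": 8}
```

A passing check is a prerequisite for the governance review below, not a substitute for it: the board still weighs user impact and regulatory context.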
### Regression Triggers (Decrease Automation Level)
- [ ] Error rate exceeds [X]% for [duration]
- [ ] Critical error detected (any single instance)
- [ ] Model update introduces regression
- [ ] Regulation change requires additional oversight
- [ ] User trust score drops below [threshold]
### Graduation Review Process
1. ML team presents evaluation data
2. Product team reviews user impact metrics
3. Legal/compliance confirms regulatory requirements are met
4. Governance board approves level change
5. Rollout plan with rollback trigger documented
Filled Example
## 2. Escalation Triggers (Partial)
### Confidence-Based Escalation
| Confidence Range | Action | Rationale |
|-----------------|--------|-----------|
| > 92% | AI sends reply directly | Historical error rate < 2% at this confidence |
| 75% - 92% | AI drafts reply, agent reviews | Error rate 5-12% in this range, review is fast |
| < 75% | Route to human agent fully | High error rate, AI draft would slow agent down |
### Rule-Based Escalation (Always Escalate When)
- [x] Customer mentions "cancel", "lawyer", or "lawsuit"
- [x] Refund amount exceeds $200
- [x] Customer has been escalated in the past 30 days
- [x] AI output references a product feature not in the knowledge base
- [x] Customer is flagged as enterprise tier
## 6. Graduation Criteria (Partial)
| Transition | Required Metrics | Required Duration |
|-----------|------------------|-------------------|
| 2 → 3 | Accuracy > 90%, false negative < 3%, zero safety violations | 4 weeks at target |
| 3 → 4 | Accuracy > 95%, human review finds < 2% issues | 8 weeks at target |
| 4 → 5 | Accuracy > 98%, audit of 500 samples finds < 0.5% issues | 12 weeks at target |
Key Takeaways
- Default to more human oversight at launch, then earn autonomy with data
- Confidence thresholds should be calibrated on your actual data, not assumed
- Capture structured feedback from every human correction to close the feedback loop
- Plan reviewer capacity for 2x and 5x volume before you need it
- Define clear graduation criteria so the transition to higher autonomy is data-driven, not political
- Build the review queue interface before launch, not after the first incident
