What This Template Is For
Batch processing handles the work that does not need to happen in real time: nightly report generation, data warehouse loads, bulk email sends, monthly billing runs, and large-scale data transformations. Unlike request/response APIs that process one item at a time, batch jobs process thousands or millions of records in a single execution window.
When batch processing is poorly specified, teams discover problems at 3 AM: a job ran out of memory halfway through, duplicate records appeared because a retry re-processed already-completed items, or the billing run took six hours instead of two and collided with the morning traffic spike.
This template defines the full batch processing specification: what each job does, how data is partitioned, what happens on failure, how progress is tracked, and what SLAs the job must meet. Use it when building a new data pipeline, migrating from ad-hoc scripts to a proper scheduling system, or documenting existing batch jobs that live in someone's head. For context on how data processing fits into product analytics strategy, see the Product Analytics Handbook. The CI/CD Pipeline Template covers the deployment process for batch job code.
How to Use This Template
- Inventory all recurring data processing tasks, including informal scripts run manually. Each one is a batch job candidate.
- For each job, define inputs (data source, volume, format), processing logic, and outputs (destination, format, downstream consumers).
- Set the execution window: when the job runs, how long it is allowed to take, and what happens if it exceeds that window.
- Design the partitioning strategy. Large datasets should be split into chunks that can be processed independently and potentially in parallel.
- Define the error handling and recovery strategy: checkpoint frequency, idempotency guarantees, and alerting thresholds.
- Review with the data and infrastructure teams, then implement job by job, starting with the most critical.
The Template
Batch System Overview
| Field | Details |
|---|---|
| System Name | [e.g., Analytics Pipeline, Billing Processor] |
| Owner | [Team or individual] |
| Scheduler | [Airflow / Prefect / Temporal / Step Functions / cron] |
| Infrastructure | [EC2 / ECS / Lambda / Kubernetes Jobs / Spark] |
| Data Store | [S3 / Redshift / BigQuery / PostgreSQL] |
| Date | [Date] |
Job Inventory
| Job Name | Schedule | Input Source | Output Destination | Priority | SLA |
|---|---|---|---|---|---|
| [Job A] | [Daily 02:00 UTC] | [S3 bucket / DB table] | [Data warehouse] | [P1 Critical] | [Complete by 06:00 UTC] |
| [Job B] | [Hourly] | [Kafka topic / API] | [Search index] | [P2 High] | [< 30 min latency] |
| [Job C] | [Monthly, 1st at 00:00] | [Billing DB] | [Payment processor + ledger] | [P1 Critical] | [Complete by 04:00 UTC] |
Job Specification: [Job Name]
Job metadata.
| Setting | Value |
|---|---|
| Job ID | [Unique identifier] |
| Schedule | [Cron expression or trigger event] |
| Input | [Source, format, expected volume] |
| Output | [Destination, format, consumers] |
| Owner | [Team or individual] |
| Max Duration | [e.g., 4 hours] |
| Concurrency | [Single instance / Max N parallel] |
Processing steps.
| Step | Description | Input | Output | Idempotent? |
|---|---|---|---|---|
| 1. Extract | [Read from source] | [Source table/file] | [Raw staging area] | [Yes/No] |
| 2. Validate | [Schema validation, dedup, null checks] | [Raw data] | [Validated data + error log] | [Yes] |
| 3. Transform | [Business logic, aggregations, joins] | [Validated data] | [Transformed data] | [Yes] |
| 4. Load | [Write to destination] | [Transformed data] | [Target table/index] | [Yes, upsert] |
| 5. Notify | [Send completion event] | [Job metadata] | [Downstream trigger] | [Yes] |
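The extract/validate/transform/load steps above can be sketched as a minimal pipeline. This is an illustration only: all function names and the in-memory "staging" structures stand in for real sources and destinations.

```python
# Minimal ETL step sketch. All names are illustrative; a real job would
# read from and write to external systems.

def extract(source):
    """Read raw records from the source into a staging list."""
    return list(source)

def validate(records):
    """Drop records that fail basic checks; collect them in an error log."""
    valid, errors = [], []
    for r in records:
        if r.get("event_type"):          # null check from step 2
            valid.append(r)
        else:
            errors.append(r)
    return valid, errors

def transform(records):
    """Apply business logic: count events per event_type."""
    counts = {}
    for r in records:
        counts[r["event_type"]] = counts.get(r["event_type"], 0) + 1
    return counts

def load(counts, target):
    """Upsert into the target -- idempotent: same input, same final state."""
    target.update(counts)
    return target

source = [{"event_type": "click"}, {"event_type": None}, {"event_type": "click"}]
valid, errors = validate(extract(source))
target = load(transform(valid), {})
```

Note that `load` is an upsert, not an append: running the pipeline twice over the same input leaves the target identical, which is the idempotency property the table calls for.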
Data Partitioning
| Setting | Value |
|---|---|
| Partition Key | [e.g., date, customer_id range, region] |
| Partition Size | [e.g., 100K records per partition] |
| Parallelism | [e.g., 10 concurrent workers] |
| Ordering | [Required within partition / No ordering requirement] |
Partitioning rules.
- ☐ Each partition is independently processable (no cross-partition dependencies)
- ☐ Partition boundaries are deterministic (same input always produces same partitions)
- ☐ Failed partitions can be retried without re-processing successful partitions
- ☐ Partition state is tracked (pending, in-progress, completed, failed)
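The first two rules above can be satisfied with hash-based partition assignment plus an explicit state table. A minimal sketch (partition count and state names are illustrative):

```python
# Deterministic hash partitioning with tracked per-partition state.
import hashlib

NUM_PARTITIONS = 8

def partition_of(key: str) -> int:
    """Same key always maps to the same partition (deterministic)."""
    digest = hashlib.sha256(key.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

# Track state so failed partitions can be retried independently.
state = {p: "pending" for p in range(NUM_PARTITIONS)}

def mark(partition: int, status: str):
    assert status in {"pending", "in-progress", "completed", "failed"}
    state[partition] = status
```

Using a cryptographic hash rather than Python's built-in `hash()` matters here: `hash()` is salted per process, so partition assignments would differ between runs and break the determinism rule.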
Error Handling and Recovery
Failure modes.
| Failure Type | Detection | Response | Recovery |
|---|---|---|---|
| Source unavailable | Connection timeout | Retry 3x with backoff | Alert after 3 failures, manual retry |
| Invalid record | Schema validation | Log to error table, skip record | Reprocess error table after fix |
| Transform error | Exception in business logic | Log error, skip record | Fix logic, reprocess partition |
| Destination write failure | Write timeout / conflict | Retry partition 3x | Alert, manual investigation |
| Job timeout | Duration exceeds max | Kill job | Resume from last checkpoint |
| Infrastructure failure | Worker crash / OOM | Scheduler detects, reschedules | Restart from last checkpoint |
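The "retry 3x with backoff" response from the table can be sketched as a small wrapper. The attempt count matches the table; the delay values are illustrative.

```python
# Retry-with-exponential-backoff sketch for transient source failures.
import time

def retry(fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn, retrying on ConnectionError with exponential backoff.
    Re-raises after the final attempt so the scheduler can alert."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except ConnectionError:
            if attempt == attempts:
                raise                  # surfaces to alerting / manual retry
            sleep(base_delay * 2 ** (attempt - 1))

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("source unavailable")
    return "ok"

result = retry(flaky, sleep=lambda s: None)  # no real sleeping in the sketch
```

Injecting `sleep` as a parameter keeps the retry logic testable without waiting out real backoff delays.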
Checkpointing.
| Setting | Value |
|---|---|
| Checkpoint Frequency | [Every N records / Every partition / Every step] |
| Checkpoint Storage | [S3 / Database table / State store] |
| Resume Strategy | [From last checkpoint / From failed partition] |
- ☐ Checkpoints stored durably (survives worker restart)
- ☐ Checkpoint includes partition ID, offset, and step
- ☐ Job can resume from any checkpoint without data loss or duplication
- ☐ Stale checkpoints cleaned up after [N] days
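A checkpoint record with the three fields from the checklist (partition ID, offset, step) might look like the following sketch, with a dict standing in for durable storage such as S3 or a state table:

```python
# Checkpoint save/resume sketch. checkpoint_store stands in for a
# durable store; field names mirror the checklist above.
checkpoint_store = {}

def save_checkpoint(job_id, partition, offset, step):
    checkpoint_store[job_id] = {"partition": partition,
                                "offset": offset,
                                "step": step}

def resume_point(job_id):
    """Return the last checkpoint, or a fresh start if none exists."""
    return checkpoint_store.get(
        job_id, {"partition": 0, "offset": 0, "step": "extract"})

save_checkpoint("daily-agg", partition=7, offset=120_000, step="transform")
cp = resume_point("daily-agg")
```

The key property is that `resume_point` always returns something usable: a missing checkpoint means "start from the beginning," never an error.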
Idempotency
| Level | Strategy |
|---|---|
| Record level | [Upsert by primary key / Dedup by hash] |
| Partition level | [Skip completed partitions on retry] |
| Job level | [Full re-run produces identical output] |
- ☐ Re-running the same job with the same input produces the same output
- ☐ No duplicate records created on retry
- ☐ Downstream consumers handle duplicate delivery events
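Record-level idempotency via "upsert by primary key" can be sketched as follows, with a dict standing in for the destination table and a content hash attached for dedup checks (all names are illustrative):

```python
# Record-level idempotency sketch: upsert keyed by primary key, so
# replaying a batch on retry never creates duplicates.
import hashlib
import json

table = {}   # stands in for the destination, keyed by primary key

def record_hash(record: dict) -> str:
    """Stable content hash, usable for dedup or change detection."""
    return hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()).hexdigest()

def upsert(record: dict):
    """Insert or overwrite by primary key; identical replays are no-ops."""
    table[record["id"]] = {**record, "_hash": record_hash(record)}

batch = [{"id": 1, "amount": 10}, {"id": 2, "amount": 5}]
for r in batch:
    upsert(r)
for r in batch:          # replay the same batch -- simulates a retry
    upsert(r)
```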
Resource Management
| Resource | Limit | Scaling |
|---|---|---|
| Memory per worker | [e.g., 4 GB] | [Vertical: increase per worker] |
| CPU per worker | [e.g., 2 vCPU] | [Horizontal: add workers] |
| Disk | [e.g., 100 GB temp] | [Clean up after each job] |
| Database connections | [e.g., Max 10 from job] | [Connection pool] |
| API rate limits | [e.g., 100 req/s to downstream] | [Throttle worker concurrency] |
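The "throttle worker concurrency" approach from the API rate-limit row can be sketched with a semaphore capping in-flight calls. The limit of 4 is illustrative; the `peak` counter exists only to demonstrate that the cap holds.

```python
# Concurrency-throttling sketch: a semaphore caps in-flight downstream
# calls. Limits here are illustrative.
import threading

MAX_IN_FLIGHT = 4
slots = threading.BoundedSemaphore(MAX_IN_FLIGHT)
peak = {"now": 0, "max": 0}
lock = threading.Lock()

def call_downstream(item):
    with slots:                  # blocks while 4 calls are in flight
        with lock:
            peak["now"] += 1
            peak["max"] = max(peak["max"], peak["now"])
        # ... the real downstream request would happen here ...
        with lock:
            peak["now"] -= 1

threads = [threading.Thread(target=call_downstream, args=(i,))
           for i in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

A semaphore bounds concurrency, not requests per second; for a strict rate limit (e.g., 100 req/s) you would pair it with a token-bucket or sleep-based pacing layer.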
Monitoring and Alerting
| Metric | Normal Range | Alert Threshold | Channel |
|---|---|---|---|
| Job duration | [30-60 min] | > [90 min] | [PagerDuty] |
| Records processed | [1M-1.5M/run] | < [500K] (possible missing source data) | [Slack] |
| Error rate | [< 0.1%] | > [1%] | [PagerDuty] |
| Checkpoint lag | [< 5 min] | > [15 min] | [Slack] |
| Queue depth | [< 100 partitions] | > [500] | [Slack] |
- ☐ Job start/completion events logged
- ☐ Per-partition progress visible in dashboard
- ☐ Failed record details queryable (not just counts)
- ☐ Historical job duration trends tracked for capacity planning
- ☐ Downstream data freshness monitored (detect silent failures)
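A threshold check mirroring the metrics table can be sketched as a pure function that maps a metrics snapshot to a list of breaches. Threshold values and channel names below are the placeholders from the table, not real configuration.

```python
# Alert-threshold sketch mirroring the monitoring table.
THRESHOLDS = {
    "job_duration_min":  {"max": 90,      "channel": "pagerduty"},
    "records_processed": {"min": 500_000, "channel": "slack"},
    "error_rate_pct":    {"max": 1.0,     "channel": "pagerduty"},
}

def check_metrics(metrics: dict) -> list:
    """Return (metric, channel) pairs that breached a threshold."""
    alerts = []
    for name, value in metrics.items():
        rule = THRESHOLDS.get(name)
        if rule is None:
            continue
        if "max" in rule and value > rule["max"]:
            alerts.append((name, rule["channel"]))
        if "min" in rule and value < rule["min"]:
            alerts.append((name, rule["channel"]))
    return alerts

alerts = check_metrics({"job_duration_min": 120,
                        "records_processed": 1_200_000,
                        "error_rate_pct": 0.05})
```

Keeping the check a pure function of a metrics snapshot makes the alerting rules trivially testable, independent of whatever metrics backend feeds it.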
SLA Definition
| SLA | Target | Measurement |
|---|---|---|
| Freshness | Data available within [N] hours of source event | Timestamp comparison |
| Completeness | > [99.9%] of source records processed | Source count vs. destination count |
| Accuracy | [0%] duplicate records in destination | Dedup check post-load |
| Availability | Job succeeds [99%] of scheduled runs | Success rate over 30 days |
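The completeness SLA from the table reduces to a count comparison. A minimal sketch, assuming the 99.9% target from the table (the counts below are illustrative):

```python
# Completeness-SLA sketch: compare source vs. destination record counts.
def completeness_ok(source_count: int, dest_count: int,
                    target: float = 0.999) -> bool:
    """True when the destination holds at least `target` of source records."""
    if source_count == 0:
        return True                       # nothing to process counts as met
    return dest_count / source_count >= target

passed = completeness_ok(1_000_000, 999_500)   # 99.95% -> meets target
failed = completeness_ok(1_000_000, 990_000)   # 99.0%  -> breaches SLA
```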
Filled Example: Nightly Analytics Pipeline
System Overview
| Field | Details |
|---|---|
| System Name | Nightly Analytics ETL |
| Owner | Data Platform Team |
| Scheduler | Apache Airflow (managed, MWAA) |
| Infrastructure | ECS Fargate tasks (spot) |
| Data Store | S3 (staging) + Snowflake (warehouse) |
Job: Daily Event Aggregation
| Setting | Value |
|---|---|
| Schedule | Daily at 02:00 UTC |
| Input | S3: s3://events/raw/dt={yesterday}/ (~50M events, ~12 GB Parquet) |
| Output | Snowflake: analytics.daily_aggregates |
| Max Duration | 3 hours (must complete by 05:00 UTC) |
| Concurrency | 20 parallel ECS tasks |
Processing steps.
| Step | Duration | Details |
|---|---|---|
| 1. Partition | 5 min | Split 50M events into 200 partitions by user_id % 200 |
| 2. Validate | 15 min | Schema check, drop events with null event_type (~0.01%) |
| 3. Aggregate | 45 min | Group by user, event_type, hour. Compute counts, uniques, durations |
| 4. Load | 20 min | COPY INTO Snowflake via S3 staging. Merge on (date, user_id, event_type) |
| 5. Verify | 5 min | Compare event counts: source vs. warehouse (must match within 0.1%) |
Total duration: ~90 minutes typical, 3-hour max.
Error handling. Failed partitions retry individually up to 3 times. After 3 failures, the partition is logged to analytics.etl_errors with the error message and raw data location. The aggregation step is idempotent: reprocessing a partition overwrites (not appends) its rows in the staging table. The final MERGE into Snowflake is keyed on the date + user + event_type composite, so re-runs do not create duplicates.
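The per-partition retry behavior described above can be sketched as follows. The 3-attempt limit matches the description; `error_log` stands in for the analytics.etl_errors table, and all other names are illustrative.

```python
# Per-partition retry sketch: retry up to 3 times, then record the
# failure for later reprocessing instead of failing the whole job.
def run_partition(partition_id, process, max_attempts=3, error_log=None):
    """Retry a single partition; after max_attempts, log and move on."""
    error_log = error_log if error_log is not None else []
    for attempt in range(1, max_attempts + 1):
        try:
            return process(partition_id)
        except Exception as exc:
            if attempt == max_attempts:
                error_log.append({"partition": partition_id,
                                  "error": str(exc)})
                return None

errors = []
attempts = {"n": 0}

def process(pid):
    attempts["n"] += 1
    raise RuntimeError("bad parquet file")

result = run_partition(42, process, error_log=errors)
```

Isolating retries at the partition level is what lets 199 healthy partitions complete on schedule while one bad one is quarantined for investigation.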
Key Takeaways
- Design every job step to be idempotent so retries and re-runs are safe
- Partition large datasets into independently processable chunks with tracked state
- Checkpoint progress frequently enough that a failure loses minutes of work, not hours
- Set per-partition timeouts and retries, not just job-level limits
- Monitor data completeness (source count vs. destination count), not just job success/failure
About This Template
Created by: Tim Adair
Last Updated: 3/5/2026
Version: 1.0.0
License: Free for personal and commercial use
