What This Template Is For
Every SaaS product will eventually face a data loss event. A database corruption, a bad migration, a ransomware attack, or even a well-intentioned engineer running a DELETE without a WHERE clause. The question is not whether it will happen, but whether you can recover. And how fast.
This template provides a structured approach to planning your backup and disaster recovery (DR) procedures. It covers four areas: defining recovery objectives (RTO and RPO), designing backup schedules and storage, writing recovery runbooks, and testing your ability to actually restore from backups.
Most teams have backups. Few teams have tested their backups. An untested backup is not a backup. It is a hope. This template forces you to define what "recovered" actually means, document step-by-step restoration procedures, and schedule regular recovery drills.
For PMs who need to understand infrastructure reliability decisions, the Technical PM Handbook covers how these trade-offs affect product roadmaps. If your product has SLA commitments, the SLA glossary entry explains how RTO and RPO map to customer expectations. For incident handling when recovery is triggered, see the incident response template.
How to Use This Template
- Classify your data. Not all data has the same recovery priority. Separate critical data (user accounts, billing, core product data) from non-critical data (analytics, logs, caches).
- Set RTO and RPO targets. Define how fast you need to recover (RTO) and how much data loss you can tolerate (RPO) for each data class.
- Design backup schedules. Map each data class to a backup frequency and storage strategy that meets your RPO.
- Write recovery runbooks. Document step-by-step procedures for every recovery scenario.
- Schedule and run recovery drills. Test your runbooks quarterly. Fix anything that fails.
The Template
Part 1: Data Classification
Classify every data store in your product by criticality and recovery priority.
| Data Store | Data Type | Criticality | Example Contents | Acceptable Downtime | Acceptable Data Loss |
|---|---|---|---|---|---|
| Primary database | User accounts, product data | Critical | User profiles, projects, documents | < 1 hour | < 5 minutes |
| Billing database | Subscriptions, invoices | Critical | Stripe sync, payment history | < 1 hour | < 1 minute |
| File storage | User uploads | High | Documents, images, attachments | < 4 hours | < 1 hour |
| Search index | Derived data | Medium | Elasticsearch index, search cache | < 8 hours | Rebuildable from primary |
| Analytics database | Event data | Medium | Product analytics, usage metrics | < 24 hours | < 24 hours |
| Cache layer | Derived/temporary | Low | Redis sessions, computed values | < 1 hour | Rebuildable (no backup needed) |
| Audit logs | Compliance records | High | User actions, admin changes | < 4 hours | < 5 minutes |
Data classification checklist:
- ☐ Every data store in production is listed (no "shadow" databases or local file stores)
- ☐ Criticality is based on business impact, not technical convenience
- ☐ "Rebuildable" data stores are verified: confirm the rebuild procedure exists and is documented
- ☐ Compliance requirements (GDPR, SOC 2, HIPAA) are reflected in retention and recovery targets
- ☐ Third-party managed data (Stripe, Auth0, Salesforce) has its own backup/export strategy
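The classification does not have to live only in a doc. If you keep a machine-readable copy, the checklist items above (especially the "rebuildable" verification) can be enforced automatically. A minimal sketch in Python; the store names, fields, and values are illustrative assumptions, not part of the template:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class DataStore:
    name: str                   # e.g. "primary-db"
    criticality: str            # "critical" | "high" | "medium" | "low"
    rto_minutes: int            # acceptable downtime
    rpo_minutes: Optional[int]  # acceptable data loss; None = rebuildable, no backup taken
    rebuild_documented: bool    # must be True whenever rpo_minutes is None

# Illustrative inventory mirroring the table above
INVENTORY: List[DataStore] = [
    DataStore("primary-db",   "critical", rto_minutes=60,  rpo_minutes=5,    rebuild_documented=False),
    DataStore("billing-db",   "critical", rto_minutes=60,  rpo_minutes=1,    rebuild_documented=False),
    DataStore("search-index", "medium",   rto_minutes=480, rpo_minutes=None, rebuild_documented=True),
    DataStore("redis-cache",  "low",      rto_minutes=60,  rpo_minutes=None, rebuild_documented=False),
]

def missing_rebuild_docs(inventory: List[DataStore]) -> List[str]:
    """Stores marked rebuildable (no backup) that lack a documented rebuild procedure."""
    return [s.name for s in inventory if s.rpo_minutes is None and not s.rebuild_documented]

if __name__ == "__main__":
    print("Missing rebuild docs:", missing_rebuild_docs(INVENTORY) or "none")  # -> ['redis-cache']
```

A check like this can run in CI, so a new data store cannot ship without a recovery classification.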
Part 2: Recovery Objectives
Define RTO (Recovery Time Objective) and RPO (Recovery Point Objective) for each criticality tier.
| Tier | RTO Target | RPO Target | Max Data Loss | Recovery Method | Cost Tier |
|---|---|---|---|---|---|
| Critical | < 1 hour | < 5 minutes | 5 min of transactions | Automated failover + point-in-time recovery | $$$ |
| High | < 4 hours | < 1 hour | 1 hour of uploads/logs | Restore from hourly snapshots | $$ |
| Medium | < 24 hours | < 24 hours | 1 day of derived data | Restore from daily backups or rebuild | $ |
| Low | < 48 hours | N/A | Rebuildable, no backup needed | Reprovision and rebuild | $ |
RTO/RPO decision factors:
- ☐ RTO targets account for: detection time + decision time + restoration time + validation time
- ☐ RPO targets are achievable with the selected backup frequency (5-min RPO requires continuous replication, not hourly snapshots; see the sketch after this checklist)
- ☐ Cost of achieving each target is documented and approved by engineering leadership
- ☐ Targets are aligned with customer SLA commitments (your RTO must fit within the downtime budget your SLA allows; a 99.9% monthly uptime commitment leaves roughly 43 minutes)
- ☐ Targets are realistic: they have been validated through at least one recovery drill
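To make the first two items above concrete, write the arithmetic down rather than eyeballing it. A minimal sketch; the component times are placeholders you would replace with numbers measured in your own drills:

```python
# Worst-case RTO is the sum of its components, not just the restore step.
detection_min   = 10   # monitoring alert fires and pages on-call
decision_min    = 10   # on-call confirms the failure and picks the runbook
restoration_min = 25   # measured during the last recovery drill
validation_min  = 10   # integrity checks and smoke tests before re-enabling writes

rto_estimate_min = detection_min + decision_min + restoration_min + validation_min
print(f"Estimated RTO: {rto_estimate_min} min")  # 55 min: fits a 1-hour Critical target, barely

# Worst-case RPO is bounded by how often a recovery point is created.
hourly_snapshot_rpo_min = 60    # everything written since the last snapshot is lost
wal_streaming_rpo_min   = 0.5   # bounded by replication lag, not by a schedule

print(f"Worst-case RPO, hourly snapshots: {hourly_snapshot_rpo_min} min")       # misses a 5-min target
print(f"Worst-case RPO, continuous replication: ~{wal_streaming_rpo_min} min")  # meets it
```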
Part 3: Backup Schedule and Storage
| Data Store | Backup Type | Frequency | Retention | Storage Location | Encryption | Tested |
|---|---|---|---|---|---|---|
| Primary DB | Continuous WAL replication | Continuous | 30 days PITR | Cross-region replica | AES-256 at rest, TLS in transit | Yes |
| Primary DB | Full snapshot | Daily at 02:00 UTC | 90 days | S3 cross-region, Glacier after 30 days | AES-256 SSE-KMS | Yes |
| Billing DB | Continuous WAL replication | Continuous | 30 days PITR | Cross-region replica | AES-256 at rest, TLS in transit | Yes |
| File storage | Versioned storage | On every write | 90 days version history | S3 cross-region replication | AES-256 SSE-S3 | Yes |
| Audit logs | Append-only replication | Continuous | 1 year | Separate AWS account | AES-256 SSE-KMS | Yes |
| Analytics DB | Full snapshot | Daily at 04:00 UTC | 30 days | Same region S3 | AES-256 SSE-S3 | No |
| Search index | Rebuild from primary | N/A (derived) | N/A | N/A | N/A | Yes |
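The daily snapshot rows above imply a scheduled job that actually creates (and later copies and expires) the snapshots. A minimal sketch using boto3 against an RDS-backed primary; the instance identifier, region, and scheduling mechanism (cron, EventBridge, etc.) are assumptions:

```python
import datetime
import boto3

rds = boto3.client("rds", region_name="us-east-1")  # production region; illustrative

def create_daily_snapshot(instance_id: str = "primary-db") -> str:
    """Create a manual RDS snapshot named with today's date, e.g. primary-db-2026-01-15."""
    snapshot_id = f"{instance_id}-{datetime.date.today().isoformat()}"
    rds.create_db_snapshot(
        DBSnapshotIdentifier=snapshot_id,
        DBInstanceIdentifier=instance_id,
        Tags=[{"Key": "purpose", "Value": "daily-backup"}],
    )
    return snapshot_id

if __name__ == "__main__":
    print("Started snapshot:", create_daily_snapshot())
```

Copying the snapshot to the backup region (copy_db_snapshot) and pruning copies past retention would be separate steps in the same job, so a failure in one step does not silently skip the others.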
Backup storage rules:
- ☐ Critical backups are stored in a different region than production (survive regional outage)
- ☐ At least one backup copy is in a different cloud account (survive account compromise)
- ☐ All backups are encrypted at rest with keys managed separately from the data
- ☐ Backup access requires separate credentials from production access
- ☐ Backup retention meets compliance requirements (SOC 2: per your documented policy, commonly 90 days; GDPR: aligned with your data retention policy)
- ☐ Old backups are automatically purged per retention schedule (no indefinite retention of PII)
- ☐ Backup monitoring alerts if a scheduled backup fails or is delayed
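For the last item, one workable pattern is a scheduled check that fails loudly when the newest backup is older than the expected interval plus a grace period. A minimal sketch with boto3; the instance name, threshold, and alerting hook are assumptions:

```python
import datetime
import boto3

rds = boto3.client("rds", region_name="us-east-1")

def newest_snapshot_age_hours(instance_id: str = "primary-db") -> float:
    """Age in hours of the most recent completed snapshot for the given instance."""
    snaps = rds.describe_db_snapshots(DBInstanceIdentifier=instance_id)["DBSnapshots"]
    completed = [s for s in snaps if s.get("Status") == "available"]
    if not completed:
        return float("inf")  # no usable backup at all: worst case
    newest = max(s["SnapshotCreateTime"] for s in completed)
    return (datetime.datetime.now(datetime.timezone.utc) - newest).total_seconds() / 3600

if __name__ == "__main__":
    age = newest_snapshot_age_hours()
    if age > 26:  # daily schedule plus a 2-hour grace period
        # Swap this for your real alerting path (PagerDuty, Opsgenie, Slack webhook, ...)
        raise SystemExit(f"ALERT: newest primary-db snapshot is {age:.1f} hours old")
    print(f"OK: newest snapshot is {age:.1f} hours old")
```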
Part 4: Recovery Runbooks
Document a step-by-step procedure for each recovery scenario. Each runbook should be executable by any on-call engineer, not just the person who wrote it.
Scenario A: Primary Database Corruption
Trigger: Data integrity check fails, or application reports unexpected data states.
Recovery procedure:
- Assess scope. Determine which tables/rows are affected and the time range of corruption.
- Stop writes. Enable read-only mode on the application to prevent further corruption.
- Identify the safe restore point. Find the latest WAL position before the corruption event using database logs.
- Perform point-in-time recovery. Restore to the identified position using PITR on the cross-region replica (see the sketch after this scenario).
- Validate restored data. Run integrity checks on the restored database. Compare row counts and checksums against the last known-good backup.
- Switch traffic. Update the application to point to the restored database. Disable read-only mode.
- Communicate. Post status update per the incident response template.
- Postmortem. Document root cause, timeline, and prevention measures within 48 hours.
Estimated recovery time: 30-60 minutes.
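The point-in-time recovery step is worth scripting ahead of time, because navigating console options under pressure eats directly into the RTO. A minimal sketch for an RDS-backed primary using boto3; the instance identifiers, instance class, and restore timestamp are placeholders:

```python
import datetime
import boto3

rds = boto3.client("rds", region_name="us-east-1")

def restore_to_point_in_time(source_id: str, target_id: str,
                             restore_time: datetime.datetime) -> None:
    """Spin up a new instance restored to the last known-good moment before the corruption."""
    rds.restore_db_instance_to_point_in_time(
        SourceDBInstanceIdentifier=source_id,
        TargetDBInstanceIdentifier=target_id,
        RestoreTime=restore_time,          # must precede the corruption event
        DBInstanceClass="db.r6g.xlarge",   # match production sizing
        PubliclyAccessible=False,
    )
    # Block until the restored instance is reachable before running integrity checks.
    rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier=target_id)

if __name__ == "__main__":
    restore_to_point_in_time(
        source_id="primary-db",
        target_id="primary-db-restored",
        restore_time=datetime.datetime(2026, 1, 15, 3, 12, tzinfo=datetime.timezone.utc),
    )
```

Validation and the traffic switch stay human decisions; the script only removes the slowest mechanical step.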
Scenario B: Complete Regional Outage
Trigger: Cloud provider reports a regional service disruption affecting compute, database, or networking.
Recovery procedure:
- Confirm the outage. Verify via cloud provider status page and independent monitoring.
- Activate DR region. Switch DNS to the DR region using pre-configured failover records (TTL: 60 seconds).
- Promote read replica. Promote the cross-region database replica to primary (see the sketch after this scenario).
- Scale DR compute. Auto-scaling should handle this, but verify instance counts match production baseline.
- Validate functionality. Run smoke tests against the DR deployment.
- Communicate. Update status page and notify customers.
- Plan failback. Once the primary region recovers, plan a controlled migration back during a maintenance window.
Estimated recovery time: 15-30 minutes (automated failover) to 2-4 hours (manual).
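The replica promotion and DNS switch are the two actions worth having pre-scripted and rehearsed. A minimal sketch with boto3 and Route 53; the hosted zone ID, record name, replica identifier, and DR endpoint are assumptions:

```python
import boto3

rds = boto3.client("rds", region_name="us-west-2")  # DR region; illustrative
route53 = boto3.client("route53")

def promote_dr_replica(replica_id: str = "primary-db-replica-west") -> None:
    """Promote the cross-region read replica to a standalone, writable primary."""
    rds.promote_read_replica(DBInstanceIdentifier=replica_id)
    rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier=replica_id)

def point_dns_at_dr(zone_id: str, record_name: str, dr_endpoint: str) -> None:
    """Repoint the application record at the DR load balancer; the 60s TTL keeps propagation fast."""
    route53.change_resource_record_sets(
        HostedZoneId=zone_id,
        ChangeBatch={"Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "CNAME",
                "TTL": 60,
                "ResourceRecords": [{"Value": dr_endpoint}],
            },
        }]},
    )

if __name__ == "__main__":
    promote_dr_replica()
    point_dns_at_dr("Z123EXAMPLE", "app.example.com", "dr-alb.us-west-2.elb.amazonaws.com")
```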
Scenario C: Accidental Data Deletion
Trigger: Engineer or customer accidentally deletes critical data (project, account, etc.).
Recovery procedure:
- Identify the deletion. Use audit logs to find the exact time, user, and scope of deletion.
- Check soft-delete. If soft-delete is implemented, recover via undelete API.
- If hard-deleted: Restore the affected table/rows from PITR to a temporary database.
- Extract and merge. Export the deleted rows from the temporary database and insert them into production (see the sketch after this scenario).
- Validate. Confirm the restored data is complete and consistent with related tables.
- Notify the user. Confirm recovery and provide the restored data.
Estimated recovery time: 15 minutes (soft-delete) to 2 hours (PITR extraction).
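The extract-and-merge step usually needs nothing more exotic than a connection to each database. A minimal sketch using psycopg2; the connection strings and the hypothetical `projects` table are placeholders, and ON CONFLICT assumes rows may have been partially recreated since the deletion:

```python
from typing import List

import psycopg2
import psycopg2.extras

# Placeholders: the "temp" DSN points at the PITR restore, "prod" at the live primary.
TEMP_DSN = "dbname=restore host=primary-db-restored user=recovery"
PROD_DSN = "dbname=app host=primary-db user=recovery"

def copy_deleted_rows(project_ids: List[int]) -> int:
    """Copy accidentally deleted projects from the PITR restore back into production."""
    with psycopg2.connect(TEMP_DSN) as src, psycopg2.connect(PROD_DSN) as dst:
        with src.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as read_cur:
            read_cur.execute("SELECT * FROM projects WHERE id = ANY(%s)", (project_ids,))
            rows = read_cur.fetchall()
        with dst.cursor() as write_cur:
            for row in rows:
                cols = list(row.keys())  # column names come from the restored schema
                placeholders = ", ".join(["%s"] * len(cols))
                write_cur.execute(
                    f"INSERT INTO projects ({', '.join(cols)}) VALUES ({placeholders}) "
                    "ON CONFLICT (id) DO NOTHING",
                    [row[c] for c in cols],
                )
    return len(rows)
```

Child tables (comments, attachments, and so on) need the same treatment in foreign-key order, which is why the validation step checks consistency with related tables.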
Scenario D: Ransomware or Malicious Access
Trigger: Security team detects unauthorized encryption of data or ransom demand.
Recovery procedure:
- Isolate affected systems. Revoke all access keys, rotate credentials, disable network access (key revocation is sketched after this scenario).
- Assess scope. Determine which systems and data stores are compromised.
- Verify backup integrity. Confirm that backups in the separate account are uncompromised.
- Reprovision infrastructure. Build new infrastructure from infrastructure-as-code. Do not attempt to clean compromised systems.
- Restore data. Restore from the latest verified-clean backup.
- Harden access. Implement fixes for the attack vector before restoring service.
- Engage legal and compliance. Determine notification obligations (see GDPR compliance template for breach notification requirements).
Estimated recovery time: 4-24 hours depending on scope.
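For the isolation step, the slow part is usually enumerating what to revoke, so a pre-written script pays for itself. A minimal sketch that deactivates every access key for a list of IAM users via boto3; which principals to target is an incident-time decision, and console sessions, roles, and federated tokens need separate handling:

```python
from typing import List

import boto3

iam = boto3.client("iam")

def deactivate_all_access_keys(usernames: List[str]) -> None:
    """Mark every access key for the given IAM users as Inactive (reversible, unlike deletion)."""
    for user in usernames:
        for key in iam.list_access_keys(UserName=user)["AccessKeyMetadata"]:
            iam.update_access_key(
                UserName=user,
                AccessKeyId=key["AccessKeyId"],
                Status="Inactive",
            )
            print(f"Deactivated {key['AccessKeyId']} for {user}")

if __name__ == "__main__":
    # Usernames are placeholders; in a real incident the list comes from your access review.
    deactivate_all_access_keys(["deploy-bot", "ci-runner", "data-pipeline"])
```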
Part 5: Recovery Testing Schedule
| Test Type | Frequency | Scope | Owner | Last Tested | Result |
|---|---|---|---|---|---|
| Backup integrity check | Weekly (automated) | Verify backup completeness and ability to decrypt | Infrastructure team | ___ | ___ |
| Single-table restore | Monthly | Restore one table from PITR to staging | On-call engineer | ___ | ___ |
| Full database restore | Quarterly | Restore complete database to DR environment | Infrastructure team | ___ | ___ |
| Regional failover drill | Semi-annually | Full traffic switch to DR region | Engineering leadership | ___ | ___ |
| Ransomware simulation | Annually | Full reprovision + restore from isolated backups | Security + Infrastructure | ___ | ___ |
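The weekly check in the first row can be as simple as restoring the newest snapshot to a throwaway instance, which also proves the encryption keys still work. A minimal sketch for RDS with boto3; the instance names and sizing are assumptions, and the sanity queries and teardown are left as comments:

```python
import boto3

rds = boto3.client("rds", region_name="us-east-1")

def restore_latest_snapshot_for_verification(instance_id: str = "primary-db") -> str:
    """Restore the newest snapshot to a short-lived instance; failure means the backup is unusable."""
    snaps = rds.describe_db_snapshots(DBInstanceIdentifier=instance_id)["DBSnapshots"]
    newest = max(
        (s for s in snaps if s.get("Status") == "available"),
        key=lambda s: s["SnapshotCreateTime"],
    )
    verify_id = f"{instance_id}-verify"
    rds.restore_db_instance_from_db_snapshot(
        DBInstanceIdentifier=verify_id,
        DBSnapshotIdentifier=newest["DBSnapshotIdentifier"],
        DBInstanceClass="db.t3.medium",   # small is fine; this instance only answers sanity queries
        PubliclyAccessible=False,
    )
    rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier=verify_id)
    # Next: connect, compare row counts against production, then tear down:
    # rds.delete_db_instance(DBInstanceIdentifier=verify_id, SkipFinalSnapshot=True)
    return verify_id
```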
Testing checklist:
- ☐ Every recovery runbook has been executed at least once in a non-production environment
- ☐ Recovery times from drills are within RTO targets (if not, update targets or improve procedures)
- ☐ Drill results are documented with actual times, issues found, and fixes applied
- ☐ Drills are announced in advance to prevent confusion with real incidents
- ☐ Failed drills generate action items with owners and deadlines
Filled Example: DataVault Analytics Platform
Product: DataVault, a SaaS analytics platform processing 2B events/day for 500 enterprise customers.
Recovery objectives (implemented):
| Data Store | RTO | RPO | Method |
|---|---|---|---|
| PostgreSQL (accounts, configs) | 30 min | 5 min | Continuous WAL streaming to us-west-2 replica |
| ClickHouse (analytics data) | 4 hours | 1 hour | Hourly S3 snapshots + incremental backups |
| S3 (raw event data) | 2 hours | 0 (versioned) | Cross-region replication to eu-west-1 |
| Redis (cache, sessions) | 15 min | N/A | No backup. Rebuilt from PostgreSQL on failover |
Last recovery drill (January 2026):
- Simulated PostgreSQL corruption in staging environment
- Restored from PITR in 22 minutes (within 30-min RTO)
- Found issue: restore script assumed a specific PostgreSQL version. Updated to auto-detect version.
- Simulated full regional failover to us-west-2 in 18 minutes (DNS propagation was the bottleneck)
Key lesson learned: Their ClickHouse backup restore took 6 hours on the first drill, exceeding the 4-hour RTO. Root cause: the restore process was downloading 2TB from S3 sequentially. They parallelized the restore across 8 nodes and reduced it to 90 minutes.
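The parallelization fix is worth spelling out, because sequential S3 downloads are a common hidden bottleneck in restore scripts. DataVault's actual fix spread the restore across 8 nodes; within a single node, a thread pool gives the same effect for the download stage. A minimal sketch with boto3; the bucket, prefix, and worker count are placeholders:

```python
import concurrent.futures
import pathlib

import boto3

s3 = boto3.client("s3")  # boto3 clients are safe to share across threads

def download_prefix(bucket: str, prefix: str, dest: str, workers: int = 16) -> int:
    """Download every object under a backup prefix in parallel instead of one at a time."""
    keys = []
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))

    def fetch(key: str) -> None:
        target = pathlib.Path(dest) / pathlib.Path(key).name
        target.parent.mkdir(parents=True, exist_ok=True)
        s3.download_file(bucket, key, str(target))

    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(fetch, keys))  # list() forces completion and surfaces the first exception
    return len(keys)

if __name__ == "__main__":
    n = download_prefix("clickhouse-backups", "daily/2026-01-14/", "/restore/clickhouse")
    print(f"Downloaded {n} objects")
```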
Common Mistakes
- Backing up but never testing restores. A backup you have never restored from is just a file. It might be corrupted, encrypted with a lost key, or missing critical tables. Test quarterly.
- Same-region backups only. If your backups are in the same region as your production systems, a regional outage takes out both. Cross-region replication is not optional for Critical-tier data.
- Backing up the database but not the schema. You restored the data, but the migrations are out of sync and the application cannot connect. Back up schema definitions, migration history, and application configuration alongside data.
- No backup monitoring. A nightly backup job failed silently for 3 weeks. When you need it, your newest backup is 3 weeks old. Monitor backup jobs and alert on failure.
- Treating all data as equally critical. Not everything needs 5-minute RPO. Over-engineering backup for analytics data wastes money. Under-engineering it for billing data risks compliance violations. Classify first, then design.
The vulnerability management template covers how to track and remediate the security gaps that could lead to the ransomware and breach scenarios described above.
