What This Template Is For
The database is the performance bottleneck in most web applications. A single missing index can turn a 5ms query into a 3-second full table scan. An N+1 query pattern hidden inside an ORM can generate 500 database round trips where 1 would suffice. A connection pool sized too small starves the application of database access during traffic spikes.
These problems compound. As data grows, unoptimized queries degrade linearly or worse. A query that ran in 20ms with 100K rows runs in 800ms with 10M rows if it is doing a sequential scan instead of using an index. By the time users complain, the fix requires understanding query plans, index selectivity, and connection pool behavior under load.
This template provides a structured approach to database optimization. It covers slow query identification, indexing strategy, connection pool tuning, and capacity projections. The goal is to move database performance from reactive firefighting to proactive management.
Use this alongside the API Performance Template to connect database improvements to user-facing latency. The Monitoring and Alerting Template covers alerting on database-specific metrics. For schema-level changes, the Database Migration Template provides a safe execution plan. The Technical PM Handbook explains how PMs should evaluate the product impact of database performance investments. See also the glossary entry on technical debt for context on why unoptimized databases accumulate cost over time.
When to Use This Template
- User-facing latency has increased and the root cause traces to database queries
- A specific query or set of queries consistently appears in slow query logs
- You are approaching a data milestone (10M rows, 100M rows) and need to verify query performance scales
- Cloud database costs are increasing and you suspect query inefficiency is a factor
- A new feature will add significant read or write load to existing tables
- Connection pool errors (timeouts, exhaustion) are occurring during peak traffic
How to Use This Template
- Start with the Slow Query Audit. Pull the top offenders from your slow query log and analyze each one.
- Build the Indexing Strategy. For each slow query, determine whether an index would help and what columns to index.
- Tune the Connection Pool. Review current settings against actual usage patterns.
- Estimate Capacity. Project when your current database configuration will no longer support traffic growth.
- Prioritize and schedule. Not every optimization is equally impactful. Rank by user-facing latency improvement.
- Measure after each change. Compare before/after for every optimization. If the numbers do not improve, revert.
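If you are on Postgres with the `pg_stat_statements` extension enabled, one way to get clean before/after numbers is to snapshot the query's averages before the change and reset the statistics after deploying it, so post-change numbers are not diluted by history (a sketch; the `queryid` value is a hypothetical placeholder for the query being optimized):

```sql
-- Before the change: note the current averages for the affected query
-- (timing columns shown are those of Postgres 13+)
SELECT calls, mean_exec_time
FROM pg_stat_statements
WHERE queryid = 1234567890;  -- hypothetical queryid

-- After deploying: reset so fresh numbers accumulate from a clean slate
SELECT pg_stat_statements_reset();
```

Note that resetting clears statistics for all queries, so snapshot anything else you care about first.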
The Template
Database Overview
| Field | Details |
|---|---|
| Engine | [Postgres 16 / MySQL 8 / etc.] |
| Instance Type | [e.g., db.r6g.xlarge / 4 vCPU, 32GB RAM] |
| Storage | [X GB used / Y GB provisioned, GP3/IO1] |
| Read Replicas | [X replicas, regions] |
| Largest Tables | [List top 5 by row count] |
| Total Connections | [Max configured / Avg in use / Peak in use] |
| Slow Query Threshold | [X ms] |
| Backup Schedule | [Daily automated, X day retention] |
Slow Query Audit
| Query | Avg Duration | Calls/Day | Total Time/Day | Table(s) | Scan Type | Index Used? |
|---|---|---|---|---|---|---|
| [Query description or fingerprint] | [X ms] | [X] | [X min] | [tables] | [Seq Scan / Index Scan / Bitmap] | [Yes: which / No] |
How to find slow queries.
- Postgres: the `pg_stat_statements` extension (tracks all queries with timing), or `log_min_duration_statement` in config.
- MySQL: the `slow_query_log` with a `long_query_time` threshold.
- Application APM: Datadog, New Relic, or similar will attribute latency to specific queries.
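On Postgres, the Slow Query Audit table above can be populated directly from `pg_stat_statements`. A sketch (column names are those of Postgres 13+, where the timing columns are `total_exec_time` and `mean_exec_time`):

```sql
-- Top offenders ranked by total time consumed, not just average latency:
-- a 2ms query called 95,000 times can cost more than one 800ms query
SELECT query,
       calls,
       round(mean_exec_time::numeric, 1)          AS avg_ms,
       round(total_exec_time::numeric / 60000, 1) AS total_minutes,
       rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
```

Sorting by `total_exec_time` matches the "Total Time/Day" column in the audit table and surfaces high-frequency cheap queries as well as rare expensive ones.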
Query Analysis
For each slow query, document the following:
Query #1: [Description]
```sql
-- Original query
[Paste the full SQL]
```

EXPLAIN ANALYZE output.

```
[Paste the EXPLAIN ANALYZE output]
```
Analysis.
- Scan type. [Sequential scan / Index scan / Index-only scan / Bitmap heap scan]
- Rows estimated vs actual. [Estimated X, Actual Y. If >10x off, statistics may be stale.]
- Bottleneck. [Where is time being spent? Sort? Filter? Join? Disk I/O?]
- Proposed fix. [Add index / Rewrite query / Add caching / Partition table]
- Estimated improvement. [X ms to Y ms]
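Two Postgres commands cover most of the checklist above. `EXPLAIN (ANALYZE, BUFFERS)` shows where time goes, and `ANALYZE` addresses the stale-statistics case where estimated and actual row counts diverge (a sketch against a hypothetical `events` table):

```sql
-- BUFFERS shows shared-buffer hits vs reads, i.e. whether time is spent
-- on disk I/O or on CPU-bound work like sorting and filtering
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM events WHERE user_id = 42;

-- If estimated vs actual rows are >10x apart, refresh planner statistics
ANALYZE events;
```

Note that `EXPLAIN ANALYZE` actually executes the query, so run write statements inside a transaction you roll back.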
Indexing Strategy
| Table | Proposed Index | Columns | Type | Queries Benefiting | Est. Size | Write Impact |
|---|---|---|---|---|---|---|
| [table] | [idx_name] | [col1, col2] | [B-tree / GIN / GiST / Hash] | [List queries] | [X MB] | [X% write slowdown] |
Index design guidelines.
- Composite indexes. Order columns by selectivity (most selective first) and by query filter order. An index on `(user_id, created_at)` serves queries filtering on `user_id` alone, but not queries filtering on `created_at` alone.
- Partial indexes. If a query always filters on a condition (e.g., `WHERE deleted_at IS NULL`), a partial index is smaller and faster.
- Covering indexes (INCLUDE). Add non-indexed columns with `INCLUDE` to enable index-only scans (Postgres 11+). Avoids heap lookups.
- Index size. Indexes consume disk and memory. A 500MB index on a table with 100M rows is normal. A 5GB index suggests over-indexing.
- Write impact. Every index slows down INSERT, UPDATE, and DELETE operations. Add indexes selectively on write-heavy tables.
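The three index types above can be sketched in Postgres DDL against a hypothetical `orders` table (table and column names are illustrative; `CONCURRENTLY` avoids locking writes during the build):

```sql
-- Composite: serves WHERE user_id = ? AND created_at > ?, and user_id alone
CREATE INDEX CONCURRENTLY idx_orders_user_created
    ON orders (user_id, created_at);

-- Partial: indexes only live rows, so it stays small under soft deletes
CREATE INDEX CONCURRENTLY idx_orders_active
    ON orders (user_id) WHERE deleted_at IS NULL;

-- Covering: INCLUDE lets queries read total without a heap lookup
CREATE INDEX CONCURRENTLY idx_orders_status_created
    ON orders (status, created_at) INCLUDE (total);
```

The partial index is only used when the query repeats the `WHERE deleted_at IS NULL` predicate, so it works best when the condition lives in a shared query scope in the application.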
Connection Pool Configuration
| Setting | Current | Recommended | Rationale |
|---|---|---|---|
| Pool size (per app instance) | [X] | [X] | [Rule: pool_size = (num_cores * 2) + effective_spindle_count] |
| Max overflow | [X] | [X] | [Burst connections beyond pool size] |
| Connection timeout | [X seconds] | [X seconds] | [How long a request waits for a connection] |
| Idle timeout | [X seconds] | [X seconds] | [How long an idle connection stays open] |
| Max lifetime | [X minutes] | [X minutes] | [Force recycle to handle failover] |
| Validation query | [None / SELECT 1] | [SELECT 1] | [Detect stale connections after failover] |
Total connections check. [App instances] x [Pool size + Max overflow] = [Total max connections]. This must be less than the database's max_connections setting, with headroom for admin connections and monitoring.
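On Postgres, the left side of that check can be verified against live usage rather than configuration alone (a sketch; `pg_stat_activity` counts background workers as well as application connections):

```sql
-- Connections currently open vs the server's configured ceiling
SELECT count(*)                                AS in_use,
       current_setting('max_connections')::int AS max_connections
FROM pg_stat_activity;
```

Sample this during peak traffic and during a rolling deploy, when old and new app instances briefly hold connections at the same time.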
Capacity Projections
| Metric | Current | 3 Months | 6 Months | 12 Months | Action Threshold |
|---|---|---|---|---|---|
| Largest table row count | [X M] | [X M] | [X M] | [X M] | [Table partitioning at X M rows] |
| Total database size | [X GB] | [X GB] | [X GB] | [X GB] | [Storage upgrade at X GB] |
| Peak connections | [X] | [X] | [X] | [X] | [Connection pooler at X connections] |
| Avg query latency | [X ms] | [X ms] | [X ms] | [X ms] | [Optimization sprint at X ms] |
| IOPS utilization | [X%] | [X%] | [X%] | [X%] | [Instance upgrade at 80%] |
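The "Current" column for table size and row count can be filled from the Postgres catalogs (a sketch; `reltuples` is the planner's estimate, refreshed by `ANALYZE`, not an exact count, and the `public` schema filter is an assumption about where application tables live):

```sql
-- Five largest tables by total size (table + indexes + TOAST)
SELECT relname                                     AS table_name,
       pg_size_pretty(pg_total_relation_size(oid)) AS total_size,
       reltuples::bigint                           AS approx_rows
FROM pg_class
WHERE relkind = 'r'
  AND relnamespace = 'public'::regnamespace
ORDER BY pg_total_relation_size(oid) DESC
LIMIT 5;
```

Recording this query's output monthly gives you the growth rate needed to fill the 3/6/12-month projection columns.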
Optimization Backlog
| Query/Issue | Impact (latency saved) | Effort | Priority | Status |
|---|---|---|---|---|
| [Description] | [p95 -Xms, Y queries/day] | [X days] | [P0/P1/P2] | [Planned / In Progress / Done] |
Filled Example: Postgres-Backed SaaS Application
Database Overview
| Field | Details |
|---|---|
| Engine | Postgres 16.1 |
| Instance | AWS RDS db.r6g.xlarge (4 vCPU, 32GB RAM) |
| Storage | 180 GB used / 500 GB GP3, 3000 IOPS |
| Read Replicas | 1 replica (us-east-1b), used for analytics queries |
| Largest Tables | events (85M), orders (12M), users (800K), line_items (45M), audit_log (120M) |
| Connections | Max 200 / Avg 45 / Peak 120 |
| Slow Query Threshold | 100ms |
Slow Query Audit
| Query | Avg | Calls/Day | Total/Day | Table | Scan | Index? |
|---|---|---|---|---|---|---|
| Dashboard event count by user | 840ms | 12,000 | 168 min | events | Seq Scan | No |
| Order search by date range + status | 320ms | 8,500 | 45 min | orders, line_items | Bitmap Heap | Partial (orders only) |
| User lookup by email | 2ms | 95,000 | 3.2 min | users | Index Scan | Yes (no action) |
| Audit log recent by entity | 1,200ms | 3,000 | 60 min | audit_log | Seq Scan | No |
Query Analysis: Dashboard Event Count
```sql
SELECT date_trunc('day', created_at) as day, COUNT(*)
FROM events
WHERE user_id = $1 AND created_at > NOW() - INTERVAL '30 days'
GROUP BY day ORDER BY day;
```
EXPLAIN ANALYZE. Sequential scan on events (85M rows). Estimated 85M rows, actual 85M. Filter removes 99.97% of rows. Sort on day.
Bottleneck. No index on (user_id, created_at). The planner scans the entire 85M row table to find ~25K matching rows.
Fix. Add composite index: `CREATE INDEX CONCURRENTLY idx_events_user_created ON events (user_id, created_at);`
Estimated improvement. 840ms to 8ms (index scan on ~25K rows instead of seq scan on 85M).
Indexing Strategy
| Table | Index | Columns | Type | Queries | Est. Size | Write Impact |
|---|---|---|---|---|---|---|
| events | idx_events_user_created | (user_id, created_at) | B-tree | Dashboard event count | 1.8 GB | +3% on event inserts |
| orders | idx_orders_status_created | (status, created_at) INCLUDE (total, user_id) | B-tree | Order search | 280 MB | +1% on order writes |
| audit_log | idx_audit_entity | (entity_type, entity_id, created_at DESC) | B-tree | Audit log recent | 2.4 GB | +2% on audit inserts |
Connection Pool (Current Issues)
The application runs 8 instances with pool_size=25 each. 8 x 25 = 200, which equals max_connections. No headroom. During deploys (rolling restart), new instances open connections before old ones close, causing connection exhaustion errors.
Fix. Reduce pool_size to 20 per instance (160 total). Add PgBouncer as a connection pooler in front of RDS with max 180 server connections and 1,000 client connections.
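The PgBouncer side of this fix might look like the following config fragment (a sketch; the hostname and database name are placeholders, and transaction pooling assumes the application does not rely on session-level state such as prepared statements or advisory locks):

```ini
; pgbouncer.ini sketch matching the numbers above
[databases]
app_db = host=mydb.example.rds.amazonaws.com port=5432 dbname=app_db

[pgbouncer]
listen_port = 6432
pool_mode = transaction     ; multiplex many clients onto few server conns
max_client_conn = 1000      ; client-side ceiling
max_db_connections = 180    ; server-side ceiling, below RDS max_connections
default_pool_size = 20
```

Point the application's connection string at port 6432 instead of at RDS directly.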
Key Takeaways
- Fix the top 3 slow queries first. In most applications, 3-5 queries account for 80% of total database time. Use `pg_stat_statements` (Postgres) or equivalent to find them.
- Add indexes based on EXPLAIN ANALYZE, not guesswork. An index that the query planner does not use wastes disk and slows writes. Always verify with EXPLAIN ANALYZE after adding an index.
- Size your connection pool deliberately. Too small and requests queue for connections. Too large and the database spends time context-switching between connections. The formula `(cores * 2) + spindles` is a reasonable starting point.
- Project capacity before you hit limits. If your largest table doubles every 6 months, you need to plan for partitioning or archiving before query performance degrades.
- Measure before and after every optimization. Without before/after data, you cannot verify that the change helped and you cannot prioritize future optimizations.
