2026-03-12 · CalcBee Team · 8 min read
Database Sizing Guide: How to Estimate Storage Before You Build
Underestimating database storage leads to emergency migrations, unplanned downtime, and panicked 2 AM pages. Overestimating it wastes money on provisioned IOPS and reserved storage you never use. Accurate database sizing before you write the first migration lets you choose the right instance class, provision appropriate storage, and set realistic budgets from day one.
This guide covers the formulas and methodology for estimating database storage across relational and NoSQL databases, from individual row sizes to multi-year growth projections.
Estimating Row Size in Relational Databases
Every storage estimate starts at the row level. The size of a single row multiplied by the expected row count gives you the raw data footprint.
Fixed-width columns have predictable sizes:
| Data Type | Storage Size | Notes |
|---|---|---|
| BOOLEAN | 1 byte | |
| SMALLINT | 2 bytes | |
| INTEGER | 4 bytes | |
| BIGINT | 8 bytes | |
| FLOAT/REAL | 4 bytes | |
| DOUBLE PRECISION | 8 bytes | |
| DATE | 4 bytes | (PostgreSQL) |
| TIMESTAMP | 8 bytes | |
| UUID | 16 bytes | |
Variable-width columns depend on actual content:
| Data Type | Storage Size | Notes |
|---|---|---|
| VARCHAR(n) | 1–4 byte header + actual length | PostgreSQL uses a 1-byte header for short values, 4 bytes otherwise; MySQL uses a 1–2 byte length prefix |
| TEXT | 1–4 byte header + actual length | Identical to VARCHAR internally in PostgreSQL |
| JSONB | 1 byte + data size | Plus overhead per key |
| BYTEA/BLOB | 1–4 bytes + data size | Consider external storage for large blobs |
Row overhead: Each row incurs a per-row overhead. In PostgreSQL, this is 23 bytes (the heap tuple header). In MySQL InnoDB, it is approximately 5 bytes for the record header plus 6 bytes for the transaction ID and 7 bytes for the roll pointer — about 18 bytes in total.
Example calculation: An orders table with columns:
- id (BIGINT): 8 bytes
- user_id (BIGINT): 8 bytes
- status (SMALLINT): 2 bytes
- total_cents (INTEGER): 4 bytes
- created_at (TIMESTAMP): 8 bytes
- shipping_address (VARCHAR, avg 80 chars): 82 bytes
- Row overhead: 23 bytes
Estimated row size: 135 bytes
For 10 million orders: 135 × 10,000,000 = 1.35 GB of raw data. But this is just the beginning — indexes, MVCC overhead, and fragmentation multiply this figure significantly.
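The row-size arithmetic above is easy to script. A minimal sketch in Python, using the PostgreSQL column sizes and 23-byte tuple-header overhead from this guide (alignment padding is ignored, so treat the result as a lower bound):

```python
# Estimate raw table size from per-column byte sizes (PostgreSQL-style).
# Ignores alignment padding, so this is a lower-bound estimate.

ROW_OVERHEAD = 23  # PostgreSQL heap tuple header, in bytes

def estimate_row_size(columns: dict) -> int:
    """Sum fixed/average column sizes plus per-row overhead."""
    return ROW_OVERHEAD + sum(columns.values())

orders = {
    "id": 8,                 # BIGINT
    "user_id": 8,            # BIGINT
    "status": 2,             # SMALLINT
    "total_cents": 4,        # INTEGER
    "created_at": 8,         # TIMESTAMP
    "shipping_address": 82,  # VARCHAR, ~80 chars + length header
}

row_size = estimate_row_size(orders)   # 135 bytes
raw_bytes = row_size * 10_000_000      # 1.35 GB for 10M rows
print(f"{row_size} bytes/row, {raw_bytes / 1e9:.2f} GB raw")
```

Swap in your own schema's column sizes from the tables above to get a first-pass footprint before applying index and MVCC multipliers.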
The database size estimator automates this calculation. Input your schema, expected row counts, and growth rate, and it projects storage needs over one, three, and five years.
Index Overhead
Indexes are essential for query performance but consume significant storage. A B-tree index on a single column typically occupies 2 to 3 times the size of the indexed column data, because each index entry includes the column value plus a pointer to the heap tuple (6 bytes in PostgreSQL) plus B-tree page overhead.
| Index Type | Overhead Multiplier | Use Case |
|---|---|---|
| B-tree (single column) | 2–3× column data | Primary keys, unique constraints, range queries |
| B-tree (composite, 2 cols) | 2–3× combined column data | Multi-column lookups |
| GIN (full-text / JSONB) | 3–5× indexed data | Text search, JSONB containment |
| GiST (geometric / range) | 3–4× indexed data | Spatial queries, range types |
| Hash | 1.5–2× column data | Equality-only lookups |
For our orders table, suppose you have these indexes:
- Primary key on id (BIGINT): 10M × 8 bytes × 2.5 = 200 MB
- Index on user_id: 10M × 8 bytes × 2.5 = 200 MB
- Index on created_at: 10M × 8 bytes × 2.5 = 200 MB
- Composite index on (user_id, status): 10M × 10 bytes × 2.5 = 250 MB
Total index overhead: ~850 MB — nearly 63 percent of the raw data size.
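The per-index estimates can be expressed as a small helper. A sketch assuming the 2–3× B-tree rule of thumb from the table above, with 2.5× as the midpoint:

```python
# Rough index size: indexed key bytes × row count × B-tree multiplier.
# The 2.5x default is the midpoint of the 2-3x rule of thumb.

def index_size(rows: int, key_bytes: int, multiplier: float = 2.5) -> int:
    return int(rows * key_bytes * multiplier)

rows = 10_000_000
indexes = {
    "pk_id": index_size(rows, 8),                # 200 MB
    "idx_user_id": index_size(rows, 8),          # 200 MB
    "idx_created_at": index_size(rows, 8),       # 200 MB
    "idx_user_id_status": index_size(rows, 10),  # 250 MB (8 + 2 bytes)
}
total = sum(indexes.values())
print(f"total index overhead: {total / 1e6:.0f} MB")  # 850 MB
```

For GIN or GiST indexes, raise the multiplier to the 3–5× range shown in the table.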
For write-heavy workloads, indexes also impact IOPS. Every INSERT updates every index, and every UPDATE that touches an indexed column triggers an index modification. The index size calculator helps you estimate both the storage and IOPS impact of your indexing strategy.
MVCC and Dead Tuple Overhead
PostgreSQL and other MVCC databases (including MySQL InnoDB) do not overwrite rows in place. Instead, they create new row versions and leave dead tuples behind until vacuuming cleans them up. This means your actual storage consumption can be 20 to 50 percent higher than the live data size, depending on update frequency and vacuum aggressiveness.
Estimating dead tuple overhead:
> MVCC overhead ≈ update_rate × avg_time_between_vacuums × avg_row_size
If your orders table receives 1,000 updates per hour and autovacuum runs every 10 minutes, at any given time there are about 167 dead tuples (1,000/6). At 135 bytes per tuple, that is only 22 KB — negligible.
But for a high-churn table receiving 100,000 updates per second with vacuum running every 5 minutes, roughly 30 million dead tuples accumulate between runs — about 4 GB at 135 bytes each. Tables with frequent updates to wide rows (like user profiles with JSONB columns) are especially prone to bloat.
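The dead-tuple formula is straightforward to evaluate. A sketch that plugs in both scenarios — a low-churn table and an extreme-churn one:

```python
# Steady-state dead-tuple estimate: updates that accumulate between
# vacuum runs, times the average row size.

def dead_tuple_bytes(updates_per_hour: float,
                     vacuum_interval_min: float,
                     row_bytes: int) -> float:
    dead_tuples = updates_per_hour * (vacuum_interval_min / 60)
    return dead_tuples * row_bytes

# Low churn: 1,000 updates/hour, vacuum every 10 min -> ~22.5 KB
low = dead_tuple_bytes(1_000, 10, 135)

# Extreme churn: 100,000 updates/second, vacuum every 5 min -> ~4 GB
high = dead_tuple_bytes(100_000 * 3600, 5, 135)

print(f"low churn: {low / 1e3:.1f} KB, extreme churn: {high / 1e9:.2f} GB")
```

This steady-state figure is a floor: if autovacuum falls behind or long transactions hold back the vacuum horizon, bloat grows well past it, which is what the planning multipliers below are for.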
A conservative planning multiplier for MVCC overhead:
| Update Frequency | MVCC Multiplier |
|---|---|
| Rarely updated (logs, events) | 1.05× |
| Moderate updates (orders, profiles) | 1.15–1.25× |
| Heavy updates (sessions, counters) | 1.30–1.50× |
| Extreme churn (real-time analytics) | 1.50–2.00× |
Growth Projections
Storage is not static. Your database grows as the business grows, and the growth rate often accelerates.
Linear growth: A SaaS application onboarding 500 new customers per month, each generating 100 rows at signup, adds 50,000 rows per month. Over three years: 50,000 × 36 = 1.8 million rows. At 135 bytes per row: 243 MB of new data.
Exponential growth: A marketplace growing at 15 percent month-over-month adds increasing amounts of data. Month 1: 50,000 rows. Month 12: 50,000 × 1.15^11 ≈ 233,000 rows. Month 36: 50,000 × 1.15^35 ≈ 6.6 million rows. The cumulative total over three years is far higher than linear projections suggest.
| Growth Model | Year 1 Total | Year 3 Total | Year 5 Total |
|---|---|---|---|
| Linear (50K rows/mo) | 600K rows (81 MB) | 1.8M rows (243 MB) | 3M rows (405 MB) |
| 10% MoM growth | 1.07M rows (144 MB) | 15M rows (2 GB) | 152M rows (20.5 GB) |
| 15% MoM growth | 1.45M rows (196 MB) | 50.7M rows (6.8 GB) | 1.46B rows (197 GB) |
The compounding effect of percentage-based growth is dramatic. At 15 percent monthly growth, the same table that holds under 200 MB after one year holds nearly 200 GB after five — a factor of roughly 1,000. This is why right-sizing storage for the first year is not enough; you need a provisioning strategy that accommodates multi-year growth.
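Compounding projections come from the geometric series: month m adds first_month_rows × (1 + g)^(m−1) rows, so the cumulative total after n months is first_month_rows × ((1 + g)^n − 1) / g. A sketch:

```python
# Cumulative rows under linear vs compounding monthly growth.

def linear_total(rows_per_month: int, months: int) -> int:
    return rows_per_month * months

def compound_total(first_month_rows: int, monthly_growth: float,
                   months: int) -> int:
    """Sum of a geometric series: rows added in month m grow by
    (1 + monthly_growth) over the previous month."""
    r = 1 + monthly_growth
    return round(first_month_rows * (r ** months - 1) / (r - 1))

ROW_BYTES = 135
for months in (12, 36, 60):
    rows = compound_total(50_000, 0.15, months)
    gb = rows * ROW_BYTES / 1e9
    print(f"{months:>2} mo at 15% MoM: {rows:>13,} rows, {gb:8.1f} GB")
```

Running the same loop with `monthly_growth=0.10` or `linear_total` fills in the other rows of the projection table.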
NoSQL Document Size Estimation
NoSQL databases like MongoDB and DynamoDB store documents rather than rows. Document sizes vary more widely than relational rows because of nested structures, arrays, and sparse schemas.
MongoDB document overhead: Each document carries approximately 17 bytes of overhead for the _id field (a 12-byte ObjectId plus the 1-byte BSON type tag and the null-terminated field name). Field names are stored in every document, so a field named shipping_address adds 17 bytes to every document just for the name (16 characters plus a null terminator). This is why MongoDB best practice recommends shorter field names for high-volume collections.
DynamoDB item size: Maximum item size is 400 KB. Each attribute contributes its name length (UTF-8 bytes), type overhead (1–3 bytes), and value size. DynamoDB charges for read and write capacity in 4 KB and 1 KB units respectively, so item sizes directly impact throughput cost.
Estimation approach: Create a representative JSON document and measure its size using JSON.stringify(). Multiply by 1.1 for BSON overhead (MongoDB) or attribute encoding (DynamoDB). Then apply row count and growth projections as with relational databases.
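The same measurement can be done in Python with `json.dumps` in place of `JSON.stringify()`. A sketch — the sample document and its field values are illustrative, not from the source:

```python
import json

# Measure a representative document's serialized size, then pad ~10%
# for BSON (MongoDB) or attribute-encoding (DynamoDB) overhead.

sample_order = {  # hypothetical example document
    "_id": "64f1c2a9e4b0d2a1f8c3b7e5",  # stand-in for a 12-byte ObjectId
    "user_id": 982451,
    "status": "shipped",
    "total_cents": 4999,
    "shipping_address": "1200 Example Street, Springfield, OR 97477",
}

raw = len(json.dumps(sample_order).encode("utf-8"))
estimated = int(raw * 1.1)  # ~10% serialization overhead
print(f"{raw} bytes raw JSON, ~{estimated} bytes estimated on disk")
```

Measure a handful of representative documents (small, typical, and worst-case) rather than one, since document-size variance is the main thing that distinguishes NoSQL sizing from fixed-schema rows.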
Backup and Replication Storage
Production databases require backups and often replicas. These multiply your storage requirements.
Full backups: One full backup equals approximately 1× the database size before compression. If you retain 7 daily backups, 4 weekly backups, and 12 monthly backups, you hold 23 copies; with 50 percent compression each copy is 0.5× the live data, so backup storage totals 23 × 0.5 = 11.5× the live data size.
WAL / binlog archiving: Point-in-time recovery requires archiving transaction logs. A database generating 1 GB of WAL per hour and retaining 7 days of archives needs 168 GB of WAL storage.
Read replicas: Each read replica consumes roughly 1× the primary's storage. Two read replicas triple your total storage footprint.
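The backup, WAL, and replica components combine into a single footprint calculation. A sketch using the 7-day-retention scenario from the table below (the parameter names are illustrative):

```python
# Total storage footprint: primary + replicas + compressed backups
# + retained WAL/binlog archives. All figures in GB.

def total_storage_gb(live_gb: float,
                     backup_copies: int = 7,      # daily retention
                     compression: float = 0.5,    # 50% backup compression
                     wal_gb_per_hour: float = 1.0,
                     wal_retention_days: int = 7,
                     replicas: int = 2) -> float:
    backups = backup_copies * live_gb * compression
    wal = wal_gb_per_hour * 24 * wal_retention_days
    return live_gb * (1 + replicas) + backups + wal

# 100 GB primary, 2 replicas, 7 daily backups, 7 days of WAL
print(f"{total_storage_gb(100):.0f} GB total provisioned")  # 818 GB
```

With these defaults a 100 GB database lands around 818 GB — squarely in the 8–10× range; extending backup retention to the full 23-copy schedule pushes it well past 1 TB.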
| Component | Storage Multiplier |
|---|---|
| Primary database | 1.0× |
| MVCC overhead | 1.1–1.5× |
| Indexes | 0.5–1.0× of data |
| Full backups (7-day retention) | 3.5× (with 50% compression) |
| WAL/binlog archives (7-day) | Variable |
| Read replica (×2) | 2.0× |
| Total | ~8–10× raw data size |
This multiplier is why a database with 100 GB of raw data can easily require 800 GB to 1 TB of total provisioned storage across primary, replicas, and backups. Plan for this from the beginning rather than discovering it after the first billing cycle.
Putting It Together
A complete sizing exercise follows these steps:
- Define your schema and estimate row sizes
- Project row counts for year one, three, and five
- Add index overhead (50–100% of data)
- Apply MVCC multiplier based on update frequency
- Account for backups, replicas, and archives
- Add 20% headroom for operational flexibility
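The six steps above chain together as a sequence of multipliers. A sketch with illustrative planning factors (the 40M-row count and 5× backup/replica factor are assumptions, not figures from a real deployment):

```python
# End-to-end sizing: raw data -> indexes -> MVCC -> ops -> headroom.
# Multipliers are planning assumptions to tune per workload.

def size_estimate_gb(row_bytes: int, rows: int,
                     index_factor: float = 0.75,          # step 3: 50-100%
                     mvcc_factor: float = 1.2,            # step 4: moderate churn
                     backup_replica_factor: float = 5.0,  # step 5: assumed
                     headroom: float = 1.2) -> float:     # step 6: 20%
    raw_gb = row_bytes * rows / 1e9                # steps 1-2
    with_indexes = raw_gb * (1 + index_factor)     # step 3
    with_mvcc = with_indexes * mvcc_factor         # step 4
    with_ops = with_mvcc * backup_replica_factor   # step 5
    return with_ops * headroom                     # step 6

# orders table at a hypothetical 3-year projection of 40M rows
print(f"{size_estimate_gb(135, 40_000_000):.0f} GB provisioned")
```

Even a modest 5.4 GB of raw data provisions out to roughly 68 GB once every layer is accounted for, which is the gap this exercise is meant to surface before the first bill arrives.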
Run each step with realistic assumptions and validate against actual data if you have a staging environment. Use the database size estimator and index size calculator to automate the arithmetic and catch errors. The time invested in accurate sizing pays for itself in avoided downtime, smoother scaling, and a cloud bill that matches expectations.
Category: Tech
Tags: Database, Storage estimation, Database sizing, SQL, NoSQL, Capacity planning, Infrastructure