2026-03-12 · CalcBee Team · 8 min read
Database Sizing Guide: How to Estimate Storage Before You Build
Underestimating database storage leads to emergency migrations, unplanned downtime, and panicked 2 AM pages. Overestimating it wastes money on provisioned IOPS and reserved storage you never use. Accurate database sizing before you write the first migration lets you choose the right instance class, provision appropriate storage, and set realistic budgets from day one.
This guide covers the formulas and methodology for estimating database storage across relational and NoSQL databases, from individual row sizes to multi-year growth projections.
Estimating Row Size in Relational Databases
Every storage estimate starts at the row level. The size of a single row multiplied by the expected row count gives you the raw data footprint.
Fixed-width columns have predictable sizes:
| Data Type | Storage Size | Notes |
|---|---|---|
| BOOLEAN | 1 byte | |
| SMALLINT | 2 bytes | |
| INTEGER | 4 bytes | |
| BIGINT | 8 bytes | |
| FLOAT/REAL | 4 bytes | |
| DOUBLE PRECISION | 8 bytes | |
| DATE | 4 bytes | (PostgreSQL) |
| TIMESTAMP | 8 bytes | |
| UUID | 16 bytes | |
Variable-width columns depend on actual content:
| Data Type | Storage Size | Notes |
|---|---|---|
| VARCHAR(n) | 1–4 byte header + actual length | PostgreSQL uses a 1-byte header for short values, 4 bytes otherwise; MySQL uses a 1–2 byte length prefix |
| TEXT | 1–4 byte header + actual length | Identical to VARCHAR internally in PostgreSQL |
| JSONB | 1 byte + data size | Plus overhead per key |
| BYTEA/BLOB | 1–4 bytes + data size | Consider external storage for large blobs |
Row overhead: Each row incurs a per-row overhead. In PostgreSQL, this is 23 bytes (the heap tuple header). In MySQL InnoDB, it is approximately 5 bytes for the record header plus 6 bytes for the transaction ID and 7 bytes for the roll pointer — about 18 bytes in total.
Example calculation: An orders table with columns:
- id (BIGINT): 8 bytes
- user_id (BIGINT): 8 bytes
- status (SMALLINT): 2 bytes
- total_cents (INTEGER): 4 bytes
- created_at (TIMESTAMP): 8 bytes
- shipping_address (VARCHAR, avg 80 chars): 82 bytes
- Row overhead: 23 bytes
Estimated row size: 135 bytes
For 10 million orders: 135 × 10,000,000 = 1.35 GB of raw data. But this is just the beginning — indexes, MVCC overhead, and fragmentation multiply this figure significantly.
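The row-size arithmetic above is easy to script. A minimal sketch in Python, using the PostgreSQL column sizes and 23-byte tuple-header overhead from this guide (alignment padding is ignored, so treat the result as a lower bound):

```python
# Estimate raw table size from per-column byte sizes (PostgreSQL-style).
# Ignores alignment padding, so this is a lower-bound estimate.

ROW_OVERHEAD = 23  # PostgreSQL heap tuple header, in bytes

def estimate_row_size(columns: dict) -> int:
    """Sum fixed/average column sizes plus per-row overhead."""
    return ROW_OVERHEAD + sum(columns.values())

orders = {
    "id": 8,                 # BIGINT
    "user_id": 8,            # BIGINT
    "status": 2,             # SMALLINT
    "total_cents": 4,        # INTEGER
    "created_at": 8,         # TIMESTAMP
    "shipping_address": 82,  # VARCHAR, ~80 chars + length header
}

row_size = estimate_row_size(orders)   # 135 bytes
raw_bytes = row_size * 10_000_000      # 1.35 GB for 10M rows
print(f"{row_size} bytes/row, {raw_bytes / 1e9:.2f} GB raw")
```

Swap in your own schema's column sizes from the tables above to get a first-pass footprint before applying index and MVCC multipliers.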
The database size estimator automates this calculation. Input your schema, expected row counts, and growth rate, and it projects storage needs over one, three, and five years.
Index Overhead
Indexes are essential for query performance but consume significant storage. A B-tree index on a single column typically occupies 2 to 3 times the size of the indexed column data, because each index entry includes the column value plus a pointer to the heap tuple (6 bytes in PostgreSQL) plus B-tree page overhead.
| Index Type | Overhead Multiplier | Use Case |
|---|---|---|
| B-tree (single column) | 2–3× column data | Primary keys, unique constraints, range queries |
| B-tree (composite, 2 cols) | 2–3× combined column data | Multi-column lookups |
| GIN (full-text / JSONB) | 3–5× indexed data | Text search, JSONB containment |
| GiST (geometric / range) | 3–4× indexed data | Spatial queries, range types |
| Hash | 1.5–2× column data | Equality-only lookups |
For our orders table, suppose you have these indexes:
- Primary key on id (BIGINT): 10M × 8 bytes × 2.5 = 200 MB
- Index on user_id: 10M × 8 bytes × 2.5 = 200 MB
- Index on created_at: 10M × 8 bytes × 2.5 = 200 MB
- Composite index on (user_id, status): 10M × 10 bytes × 2.5 = 250 MB
Total index overhead: ~850 MB — nearly 63 percent of the raw data size.
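The per-index estimates can be expressed as a small helper. A sketch assuming the 2–3× B-tree rule of thumb from the table above, with 2.5× as the midpoint:

```python
# Rough index size: indexed key bytes × row count × B-tree multiplier.
# The 2.5x default is the midpoint of the 2-3x rule of thumb.

def index_size(rows: int, key_bytes: int, multiplier: float = 2.5) -> int:
    return int(rows * key_bytes * multiplier)

rows = 10_000_000
indexes = {
    "pk_id": index_size(rows, 8),                # 200 MB
    "idx_user_id": index_size(rows, 8),          # 200 MB
    "idx_created_at": index_size(rows, 8),       # 200 MB
    "idx_user_id_status": index_size(rows, 10),  # 250 MB (8 + 2 bytes)
}
total = sum(indexes.values())
print(f"total index overhead: {total / 1e6:.0f} MB")  # 850 MB
```

For GIN or GiST indexes, raise the multiplier to the 3–5× range shown in the table.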
For write-heavy workloads, indexes also impact IOPS. Every INSERT updates every index, and every UPDATE that touches an indexed column triggers an index modification. The index size calculator helps you estimate both the storage and IOPS impact of your indexing strategy.
MVCC and Dead Tuple Overhead
PostgreSQL and other MVCC databases (including MySQL InnoDB) do not overwrite rows in place. Instead, they create new row versions and leave dead tuples behind until vacuuming cleans them up. This means your actual storage consumption can be 20 to 50 percent higher than the live data size, depending on update frequency and vacuum aggressiveness.
Estimating dead tuple overhead:
> MVCC overhead ≈ update_rate × avg_time_between_vacuums × avg_row_size
If your orders table receives 1,000 updates per hour and autovacuum runs every 10 minutes, at any given time there are about 167 dead tuples (1,000/6). At 135 bytes per tuple, that is only 22 KB — negligible.
But for a high-churn table receiving 100,000 updates per second with vacuum running every 5 minutes, roughly 30 million dead tuples accumulate between runs — about 4 GB at 135 bytes each. Tables with frequent updates to wide rows (like user profiles with JSONB columns) are especially prone to bloat.
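The dead-tuple formula is straightforward to evaluate. A sketch that plugs in both scenarios — a low-churn table and an extreme-churn one:

```python
# Steady-state dead-tuple estimate: updates that accumulate between
# vacuum runs, times the average row size.

def dead_tuple_bytes(updates_per_hour: float,
                     vacuum_interval_min: float,
                     row_bytes: int) -> float:
    dead_tuples = updates_per_hour * (vacuum_interval_min / 60)
    return dead_tuples * row_bytes

# Low churn: 1,000 updates/hour, vacuum every 10 min -> ~22.5 KB
low = dead_tuple_bytes(1_000, 10, 135)

# Extreme churn: 100,000 updates/second, vacuum every 5 min -> ~4 GB
high = dead_tuple_bytes(100_000 * 3600, 5, 135)

print(f"low churn: {low / 1e3:.1f} KB, extreme churn: {high / 1e9:.2f} GB")
```

This steady-state figure is a floor: if autovacuum falls behind or long transactions hold back the vacuum horizon, bloat grows well past it, which is what the planning multipliers below are for.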
A conservative planning multiplier for MVCC overhead:
| Update Frequency | MVCC Multiplier |
|---|---|
| Rarely updated (logs, events) | 1.05× |
| Moderate updates (orders, profiles) | 1.15–1.25× |
| Heavy updates (sessions, counters) | 1.30–1.50× |
| Extreme churn (real-time analytics) | 1.50–2.00× |
Growth Projections
Storage is not static. Your database grows as the business grows, and the growth rate often accelerates.
Linear growth: A SaaS application onboarding 500 new customers per month, each generating 100 rows at signup, adds 50,000 rows per month. Over three years: 50,000 × 36 = 1.8 million rows. At 135 bytes per row: 243 MB of new data.
Exponential growth: A marketplace growing at 15 percent month-over-month adds increasing amounts of data. Month 1: 50,000 rows. Month 12: 50,000 × 1.15^11 ≈ 233,000 rows. Month 36: 50,000 × 1.15^35 ≈ 6.6 million rows. The cumulative total over three years is far higher than linear projections suggest.
| Growth Model | Year 1 Total | Year 3 Total | Year 5 Total |
|---|---|---|---|
| Linear (50K rows/mo) | 600K rows (81 MB) | 1.8M rows (243 MB) | 3M rows (405 MB) |
| 10% MoM growth | 1.07M rows (144 MB) | 15M rows (2 GB) | 152M rows (20.5 GB) |
| 15% MoM growth | 1.45M rows (196 MB) | 50.7M rows (6.8 GB) | 1.46B rows (197 GB) |
The compounding effect of percentage-based growth is dramatic. At 15 percent monthly growth, the same table that holds under 200 MB after one year holds nearly 200 GB after five — a factor of roughly 1,000. This is why right-sizing storage for the first year is not enough; you need a provisioning strategy that accommodates multi-year growth.
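Compounding projections come from the geometric series: month m adds first_month_rows × (1 + g)^(m−1) rows, so the cumulative total after n months is first_month_rows × ((1 + g)^n − 1) / g. A sketch:

```python
# Cumulative rows under linear vs compounding monthly growth.

def linear_total(rows_per_month: int, months: int) -> int:
    return rows_per_month * months

def compound_total(first_month_rows: int, monthly_growth: float,
                   months: int) -> int:
    """Sum of a geometric series: rows added in month m grow by
    (1 + monthly_growth) over the previous month."""
    r = 1 + monthly_growth
    return round(first_month_rows * (r ** months - 1) / (r - 1))

ROW_BYTES = 135
for months in (12, 36, 60):
    rows = compound_total(50_000, 0.15, months)
    gb = rows * ROW_BYTES / 1e9
    print(f"{months:>2} mo at 15% MoM: {rows:>13,} rows, {gb:8.1f} GB")
```

Running the same loop with `monthly_growth=0.10` or `linear_total` fills in the other rows of the projection table.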
NoSQL Document Size Estimation
NoSQL databases like MongoDB and DynamoDB store documents rather than rows. Document sizes vary more widely than relational rows because of nested structures, arrays, and sparse schemas.
MongoDB document overhead: Each document carries approximately 17 bytes of overhead for the _id field (a 12-byte ObjectId plus the 1-byte BSON type tag and the null-terminated field name). Field names are stored in every document, so a field named shipping_address adds 17 bytes to every document just for the name (16 characters plus a null terminator). This is why MongoDB best practice recommends shorter field names for high-volume collections.
DynamoDB item size: Maximum item size is 400 KB. Each attribute contributes its name length (UTF-8 bytes), type overhead (1–3 bytes), and value size. DynamoDB charges for read and write capacity in 4 KB and 1 KB units respectively, so item sizes directly impact throughput cost.
Estimation approach: Create a representative JSON document and measure its size using JSON.stringify(). Multiply by 1.1 for BSON overhead (MongoDB) or attribute encoding (DynamoDB). Then apply row count and growth projections as with relational databases.
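The same measurement can be done in Python with `json.dumps` in place of `JSON.stringify()`. A sketch — the sample document and its field values are illustrative, not from the source:

```python
import json

# Measure a representative document's serialized size, then pad ~10%
# for BSON (MongoDB) or attribute-encoding (DynamoDB) overhead.

sample_order = {  # hypothetical example document
    "_id": "64f1c2a9e4b0d2a1f8c3b7e5",  # stand-in for a 12-byte ObjectId
    "user_id": 982451,
    "status": "shipped",
    "total_cents": 4999,
    "shipping_address": "1200 Example Street, Springfield, OR 97477",
}

raw = len(json.dumps(sample_order).encode("utf-8"))
estimated = int(raw * 1.1)  # ~10% serialization overhead
print(f"{raw} bytes raw JSON, ~{estimated} bytes estimated on disk")
```

Measure a handful of representative documents (small, typical, and worst-case) rather than one, since document-size variance is the main thing that distinguishes NoSQL sizing from fixed-schema rows.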
Backup and Replication Storage
Production databases require backups and often replicas. These multiply your storage requirements.
Full backups: One full backup equals approximately 1× the database size before compression. If you retain 7 daily backups, 4 weekly backups, and 12 monthly backups, you hold 23 copies; with 50 percent compression each copy is 0.5× the live data, so backup storage totals 23 × 0.5 = 11.5× the live data size.
WAL / binlog archiving: Point-in-time recovery requires archiving transaction logs. A database generating 1 GB of WAL per hour and retaining 7 days of archives needs 168 GB of WAL storage.
Read replicas: Each read replica consumes roughly 1× the primary's storage. Two read replicas triple your total storage footprint.
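The backup, WAL, and replica components combine into a single footprint calculation. A sketch using the 7-day-retention scenario from the table below (the parameter names are illustrative):

```python
# Total storage footprint: primary + replicas + compressed backups
# + retained WAL/binlog archives. All figures in GB.

def total_storage_gb(live_gb: float,
                     backup_copies: int = 7,      # daily retention
                     compression: float = 0.5,    # 50% backup compression
                     wal_gb_per_hour: float = 1.0,
                     wal_retention_days: int = 7,
                     replicas: int = 2) -> float:
    backups = backup_copies * live_gb * compression
    wal = wal_gb_per_hour * 24 * wal_retention_days
    return live_gb * (1 + replicas) + backups + wal

# 100 GB primary, 2 replicas, 7 daily backups, 7 days of WAL
print(f"{total_storage_gb(100):.0f} GB total provisioned")  # 818 GB
```

With these defaults a 100 GB database lands around 818 GB — squarely in the 8–10× range; extending backup retention to the full 23-copy schedule pushes it well past 1 TB.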
| Component | Storage Multiplier |
|---|---|
| Primary database | 1.0× |
| MVCC overhead | 1.1–1.5× |
| Indexes | 0.5–1.0× of data |
| Full backups (7-day retention) | 3.5× (with 50% compression) |
| WAL/binlog archives (7-day) | Variable |
| Read replica (×2) | 2.0× |
| Total | ~8–10× raw data size |
This multiplier is why a database with 100 GB of raw data can easily require 800 GB to 1 TB of total provisioned storage across primary, replicas, and backups. Plan for this from the beginning rather than discovering it after the first billing cycle.
Putting It Together
A complete sizing exercise follows these steps:
- Define your schema and estimate row sizes
- Project row counts for year one, three, and five
- Add index overhead (50–100% of data)
- Apply MVCC multiplier based on update frequency
- Account for backups, replicas, and archives
- Add 20% headroom for operational flexibility
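The six steps above chain together as a sequence of multipliers. A sketch with illustrative planning factors (the 40M-row count and 5× backup/replica factor are assumptions, not figures from a real deployment):

```python
# End-to-end sizing: raw data -> indexes -> MVCC -> ops -> headroom.
# Multipliers are planning assumptions to tune per workload.

def size_estimate_gb(row_bytes: int, rows: int,
                     index_factor: float = 0.75,          # step 3: 50-100%
                     mvcc_factor: float = 1.2,            # step 4: moderate churn
                     backup_replica_factor: float = 5.0,  # step 5: assumed
                     headroom: float = 1.2) -> float:     # step 6: 20%
    raw_gb = row_bytes * rows / 1e9                # steps 1-2
    with_indexes = raw_gb * (1 + index_factor)     # step 3
    with_mvcc = with_indexes * mvcc_factor         # step 4
    with_ops = with_mvcc * backup_replica_factor   # step 5
    return with_ops * headroom                     # step 6

# orders table at a hypothetical 3-year projection of 40M rows
print(f"{size_estimate_gb(135, 40_000_000):.0f} GB provisioned")
```

Even a modest 5.4 GB of raw data provisions out to roughly 68 GB once every layer is accounted for, which is the gap this exercise is meant to surface before the first bill arrives.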
Run each step with realistic assumptions and validate against actual data if you have a staging environment. Use the database size estimator and index size calculator to automate the arithmetic and catch errors. The time invested in accurate sizing pays for itself in avoided downtime, smoother scaling, and a cloud bill that matches expectations.
Category: Tech
Tags: Database, Storage estimation, Database sizing, SQL, NoSQL, Capacity planning, Infrastructure