2026-03-08 · CalcBee Team · 8 min read
API Rate Limiting Math: How to Design Limits That Scale
Rate limiting is the guardrail that keeps your API from collapsing under load. Without it, a single misbehaving client can consume all available capacity, degrading the experience for everyone else. But setting rate limits is not as simple as picking a number — it requires understanding your server capacity, client usage patterns, and the mathematical behavior of different rate-limiting algorithms.
This guide dives into the math behind rate limit design, walks through the most common algorithms, and shows you how to set limits that protect your infrastructure while keeping legitimate users happy.
Why Rate Limits Exist
Every API endpoint has a finite capacity determined by the underlying compute, memory, and database resources. A REST endpoint backed by a PostgreSQL database might handle 500 queries per second before response times degrade. A compute-heavy endpoint processing image transformations might max out at 50 requests per second.
Rate limits translate these infrastructure constraints into per-client policies:
- Protect shared resources — prevent one client from monopolizing database connections, CPU, or memory
- Ensure fairness — distribute capacity evenly among clients
- Control costs — limit expensive operations like external API calls, storage writes, or ML inference
- Prevent abuse — block brute-force attacks, credential stuffing, and scraping
The API rate limit calculator helps you translate server capacity into per-client limits based on expected client count, request distribution, and headroom requirements.
Token Bucket Algorithm
The token bucket is the most widely used rate-limiting algorithm. It models each client as having a "bucket" that fills with tokens at a fixed rate. Each request costs one token. When the bucket is empty, requests are rejected until more tokens accumulate.
Parameters:
- Bucket size (burst capacity): The maximum number of tokens the bucket can hold — this is the burst limit
- Refill rate: How many tokens are added per second — this is the sustained rate limit
Math:
> Available tokens at time t = min(bucket_size, tokens_at_t0 + refill_rate × elapsed_time)
| Bucket Size | Refill Rate | Behavior |
|---|---|---|
| 10 | 2/sec | Burst of 10 requests, then sustained 2/sec |
| 100 | 10/sec | Burst of 100, then sustained 10/sec |
| 50 | 50/sec | No practical burst (bucket refills as fast as it drains) |
| 1000 | 5/sec | Large burst for batch operations, slow sustained rate |
The token bucket naturally handles bursty traffic. A client that has been idle accumulates tokens and can send a quick burst of requests — useful for mobile apps that batch API calls after coming online. But the sustained rate prevents any client from exceeding the long-term capacity allocation.
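The refill formula above translates into a few lines of Python. This is an illustrative single-process sketch (class and parameter names are our own, not a library API); a production limiter would key one bucket per client:

```python
import time

class TokenBucket:
    """Token-bucket limiter: capacity is the burst size, refill_rate the sustained rate."""

    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity            # start full: an idle client can burst immediately
        self.last_refill = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        elapsed = now - self.last_refill
        # available tokens = min(bucket_size, tokens + refill_rate * elapsed)
        self.tokens = min(self.capacity, self.tokens + self.refill_rate * elapsed)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# First row of the table: burst of 10, then sustained 2/sec
bucket = TokenBucket(capacity=10, refill_rate=2)
```

With those parameters, ten back-to-back calls to `allow()` succeed and the eleventh is rejected until the bucket refills.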
Implementation note: In distributed systems, maintaining a per-client token count requires shared state. Redis is the most common backing store for distributed rate limiters; token-bucket state is typically updated atomically with a Lua script, while simpler window counters use INCR with EXPIRE. A Redis-based check typically adds 1 to 2 milliseconds of latency per request.
Sliding Window Algorithm
The sliding window algorithm counts requests within a rolling time window. Unlike the token bucket, it does not allow bursts beyond the window limit — if the limit is 100 requests per minute, you cannot send 100 requests in the first second.
Fixed window (simpler variant): Divide time into fixed intervals (e.g., one-minute windows). Count requests in the current window. Reset the count at the window boundary. The problem: a client can send 100 requests at minute 0:59 and another 100 at minute 1:01, effectively achieving 200 requests in a two-second span.
Sliding window log: Track the timestamp of every request. When a new request arrives, remove all timestamps older than the window size and count the remaining ones. This produces accurate results but requires storing every timestamp, which is memory-intensive for high-throughput APIs.
Sliding window counter (hybrid): Estimate the current window count using a weighted combination of the previous and current fixed windows:
> Estimated count = (previous_window_count × overlap_percentage) + current_window_count
For example, at 15 seconds into a 60-second window: overlap = (60 - 15) / 60 = 75%. If the previous window had 80 requests and the current window has 20: estimated count = 80 × 0.75 + 20 = 80. This is the approach used by Cloudflare, Stripe, and other major API providers.
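The weighted estimate can be sketched as follows (names are illustrative; a production version would key counters per client and evict expired windows):

```python
import time
from collections import defaultdict
from typing import Optional

class SlidingWindowCounter:
    """Approximate sliding-window limiter using two fixed-window counters."""

    def __init__(self, limit: int, window_seconds: float = 60.0):
        self.limit = limit
        self.window = window_seconds
        self.counts = defaultdict(int)    # fixed-window index -> request count

    def allow(self, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        idx = int(now // self.window)
        elapsed = now - idx * self.window
        # Weight of the previous fixed window = fraction still inside the sliding window
        overlap = (self.window - elapsed) / self.window
        estimated = self.counts[idx - 1] * overlap + self.counts[idx]
        if estimated < self.limit:
            self.counts[idx] += 1
            return True
        return False
```

Plugging in the article's example (previous window 80 requests, current window 20, 15 seconds in) yields an estimated count of 80 × 0.75 + 20 = 80, which a limit of 100 would allow.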
| Algorithm | Burst Handling | Memory Usage | Accuracy |
|---|---|---|---|
| Token bucket | Allows controlled bursts | Low (token count per client) | Exact |
| Fixed window | Allows double-burst at boundary | Low (counter per window) | Approximate |
| Sliding window log | No bursts beyond limit | High (all timestamps) | Exact |
| Sliding window counter | Minimal burst near boundary | Low (two counters) | ~99.7% accurate |
Calculating Your Rate Limits
Setting the actual numbers requires working backward from infrastructure capacity.
Step 1: Measure server capacity. Load test your API to determine the maximum requests per second (RPS) before response times exceed your SLA. Suppose your SLA requires P99 latency under 200 milliseconds, and testing shows this holds up to 1,000 RPS.
Step 2: Reserve headroom. Never run at 100 percent capacity. Reserve 20 to 30 percent for traffic spikes and operational overhead. Usable capacity: 1,000 × 0.7 = 700 RPS.
Step 3: Estimate client count. How many active clients will share this capacity? If you have 100 active API consumers, each gets 7 RPS on average. If you have 1,000, each gets 0.7 RPS.
Step 4: Account for uneven distribution. Client usage is never uniform. The top 10 percent of clients typically generate 60 to 80 percent of traffic (Pareto distribution). Design tiered limits:
| Tier | Rate Limit | Burst Limit | Use Case |
|---|---|---|---|
| Free | 10 req/min | 5 | Hobbyist developers |
| Basic | 60 req/min | 20 | Small applications |
| Pro | 600 req/min | 100 | Production apps |
| Enterprise | 6,000 req/min | 1,000 | High-volume integrations |
Step 5: Validate capacity. Calculate worst-case load: if all Pro-tier clients hit their limit simultaneously, total demand = number_of_pro_clients × pro_limit. This should not exceed usable capacity. If it does, either reduce the per-client limit or increase infrastructure.
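The five steps reduce to simple arithmetic. This sketch uses the article's numbers (1,000 RPS capacity, 30 percent headroom, 100 clients) plus a hypothetical count of 50 Pro-tier clients for the validation step:

```python
def per_client_limit(max_rps: float, headroom: float, active_clients: int) -> float:
    """Steps 2-3: reserve headroom, then split usable capacity evenly."""
    usable_rps = max_rps * (1 - headroom)
    return usable_rps / active_clients

def tier_fits(clients: int, limit_per_min: float, usable_rps: float) -> bool:
    """Step 5: worst case is every client in the tier at its full limit."""
    worst_case_rps = clients * limit_per_min / 60.0
    return worst_case_rps <= usable_rps

# Article's numbers: 1,000 RPS capacity, 30% headroom, 100 active clients
print(per_client_limit(1000, 0.3, 100))   # about 7 RPS per client
# Hypothetical: 50 Pro clients at 600 req/min -> 500 RPS worst case vs 700 usable
print(tier_fits(50, 600, 1000 * 0.7))     # True
```

If the worst-case check fails, the same function tells you how far over capacity you are, which guides whether to cut the per-client limit or scale the backend.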
The concurrent users calculator helps model how many simultaneous clients your infrastructure can support at different rate limit configurations.
Handling Rate Limit Responses
How you communicate rate limits is as important as how you enforce them. Poor rate limit responses frustrate developers and increase support tickets.
Standard headers: Include these HTTP headers in every API response:
```
X-RateLimit-Limit: 600
X-RateLimit-Remaining: 427
X-RateLimit-Reset: 1709856000
Retry-After: 32
```
- `X-RateLimit-Limit` — the maximum requests allowed in the current window
- `X-RateLimit-Remaining` — how many requests remain before the limit is reached
- `X-RateLimit-Reset` — Unix timestamp when the window resets
- `Retry-After` — seconds to wait before retrying (included only in 429 responses)
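On the client side, these headers are enough to decide when to send the next request. A hedged sketch (the function name is our own; it assumes the header names shown above):

```python
import time

def seconds_until_retry(status: int, headers: dict) -> float:
    """How long a well-behaved client should wait before its next request."""
    if status == 429 and "Retry-After" in headers:
        return float(headers["Retry-After"])
    if int(headers.get("X-RateLimit-Remaining", "1")) == 0:
        # X-RateLimit-Reset is a Unix timestamp; wait until the window turns over
        return max(0.0, int(headers["X-RateLimit-Reset"]) - time.time())
    return 0.0

# A 429 with Retry-After: 32 means wait 32 seconds
print(seconds_until_retry(429, {"Retry-After": "32"}))   # 32.0
```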
HTTP 429 response body: Return a structured error message:
```json
{
"error": "rate_limit_exceeded",
"message": "You have exceeded 600 requests per minute. Please retry after 32 seconds.",
"retry_after": 32
}
```
Implement exponential backoff guidance. Document a recommended retry strategy: wait 1 second after the first 429, 2 seconds after the second, 4 after the third, with a maximum backoff of 60 seconds. Clients that implement backoff automatically reduce pressure during overload events.
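The documented strategy (1 s, 2 s, 4 s, capped at 60 s), combined with the jitter recommended in the pitfalls section below, might be sketched as:

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff with jitter: base * 2^(attempt-1), capped, plus 0-50% jitter."""
    delay = min(cap, base * 2 ** (attempt - 1))
    return delay * (1 + random.uniform(0, 0.5))
```

Attempt 1 waits 1 to 1.5 seconds, attempt 2 waits 2 to 3, and from attempt 7 onward the base delay stays pinned at the 60-second cap. The jitter spreads retries out so that clients rate-limited at the same moment do not retry at the same moment.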
Scaling Rate Limits Across Distributed Systems
In a single-server setup, rate limiting is straightforward — a local in-memory counter suffices. In distributed systems with multiple API servers behind a load balancer, each server sees only a fraction of the client's requests, making local counters inaccurate.
Centralized counter (Redis): All API servers check and increment a counter in Redis. This provides accurate global counts but adds a network round trip to every request. At scale, Redis itself can become a bottleneck, requiring Redis Cluster or dedicated rate-limit Redis instances.
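The centralized approach can be sketched with a fixed-window counter in which an in-memory dict stands in for Redis (comments note the equivalent Redis commands; this is illustrative, not a production limiter):

```python
class CentralizedCounter:
    """Fixed-window global counter. The dict stands in for Redis; in production,
    every API server would hit the same Redis key atomically."""

    def __init__(self, limit: int, window_seconds: int = 60):
        self.limit = limit
        self.window = window_seconds
        self.store = {}   # (client_id, window_index) -> count

    def allow(self, client_id: str, now: float) -> bool:
        key = (client_id, int(now // self.window))   # key expires with the window
        count = self.store.get(key, 0) + 1           # Redis: INCR key; EXPIRE key window
        self.store[key] = count
        return count <= self.limit
```

Because the counter lives in one place, every server sees the same count, at the cost of one network round trip per check.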
Local approximation with sync: Each server maintains a local counter and periodically syncs with a central store. This reduces Redis load but allows short-term over-limit bursts between sync intervals. A sync interval of 1 second is typical; in the worst case, the global limit can be exceeded by roughly one interval's worth of every server's local allowance (number_of_servers × local_limit_slice).
Cell-based architecture: Partition clients across server groups (cells). Each cell manages its own rate limits independently. This eliminates cross-cell coordination but requires consistent client-to-cell routing.
For most applications, centralized Redis counters with connection pooling handle up to 100,000 rate limit checks per second with sub-millisecond latency. Beyond that scale, local approximation or purpose-built rate limiting services (like Envoy's ratelimit service) become necessary.
The API call cost calculator helps you understand the infrastructure cost of rate limiting itself — the Redis instances, additional latency, and compute overhead that rate limit checks add to each request.
Common Pitfalls
Setting limits too low. Conservative limits that regularly block legitimate traffic erode developer trust. Monitor 429 response rates. If more than 1 percent of requests from well-behaved clients hit the limit, it is too low.
Not differentiating by endpoint. A limit of 100 requests per minute makes sense for a search endpoint but is unnecessarily restrictive for a health check. Apply per-endpoint or per-resource limits in addition to global client limits.
Ignoring retry storms. When a rate-limited client receives a 429, naive retry logic can amplify load. If 100 clients all retry simultaneously after a one-second backoff, you get a thundering herd. Require jitter in retry delays: wait = base_delay × (1 + random(0, 0.5)).
Forgetting internal services. Service-to-service calls within your infrastructure also need limits. A background job that queries an internal API at full speed can starve user-facing traffic.
Rate limiting is the intersection of system design and arithmetic. Get the math right, choose the algorithm that matches your traffic pattern, and communicate limits clearly through standard headers. The result is an API that remains fast, fair, and reliable — even when the load spikes.
Category: Tech
Tags: API, Rate limiting, Token bucket, Scaling, Backend engineering, System design, API design