Most APIs are built to work. Very few are built to survive.

There is a difference between an API that passes your tests and one that holds steady when a hundred thousand requests hit it in the same minute. The gap between those two things is not about writing better code. It is about making the right decisions before you write any code at all.
This article covers the design decisions that separate an API that crumbles under load from one that scales without drama.
Design for Statelessness First
Every decision you make builds on this one, so it needs to come first.
A stateless API means that every request contains all the information needed to process it. The server does not hold any memory of what a client did in a previous request. No sessions stored in memory. No user context living on a specific server.
// Every request is self-contained
GET /orders/99
Authorization: Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...
Why does this matter at scale? Because any server can handle any request. When you need to handle more traffic, you add more servers behind a load balancer and every one of them is immediately capable of serving any client. No sticky sessions to manage. No shared memory to synchronise.
The moment you break statelessness, scaling becomes a coordination problem. You end up with users pinned to specific servers and infrastructure that cannot grow cleanly. Start stateless and stay that way.
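To make the principle concrete, here is a minimal sketch of a stateless authentication check: everything needed to validate the request travels inside the request itself, so any replica behind the load balancer can serve it. This is an illustration only, not a real token format; a production API would use a standard JWT library, and the names here (`SECRET`, `issue_token`, `verify_token`) are invented for the example.

```python
import base64
import hashlib
import hmac
import json

SECRET = b"shared-signing-key"   # hypothetical key; every replica holds the same one

def issue_token(claims):
    """Sign a JSON payload so it can be verified without any server-side session."""
    payload_b64 = base64.urlsafe_b64encode(json.dumps(claims).encode()).decode()
    signature = hmac.new(SECRET, payload_b64.encode(), hashlib.sha256).hexdigest()
    return f"{payload_b64}.{signature}"

def verify_token(token):
    """Stateless check: no session store is consulted, only the token itself."""
    payload_b64, signature = token.rsplit(".", 1)
    expected = hmac.new(SECRET, payload_b64.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature, expected):
        return None                      # tampered or unknown token
    return json.loads(base64.urlsafe_b64decode(payload_b64))

token = issue_token({"sub": "user-42"})
claims = verify_token(token)             # any server can do this, no shared memory
```

Because validation needs nothing but the shared signing key, adding a tenth server is no different from adding a second.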
Rate Limiting Is Not Optional
Without rate limiting, one bad client or one viral moment can take your entire API down. It is not a feature you add later. It is a foundation you build early.
Rate limiting controls how many requests a client can make within a given time period. When the limit is exceeded, you return a 429 Too Many Requests response and tell the client when to try again.
HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1719820800
Retry-After: 60
Always expose those headers. A client that can see how many requests it has left and when the window resets will behave far better than one flying blind.
Choosing a rate limiting algorithm
There are four main approaches and each one behaves differently under load.
The fixed window counter is the simplest. You count requests within a fixed time window and reset the counter when the window ends. The problem is the boundary. If your limit is 1000 requests per minute and a client sends 1000 in the last second of one window and 1000 in the first second of the next, it sends 2000 requests in two seconds without technically breaking the rule.
The sliding window fixes this by tracking requests across a rolling period rather than a fixed one. It is more accurate but requires more memory because you need to store timestamps for each request.
The leaky bucket processes requests at a constant output rate regardless of how quickly they arrive. Think of a bucket with a small hole at the bottom. Requests fill the bucket and drain at a steady rate. If the bucket overflows, requests are dropped. NGINX uses this algorithm. It is ideal when you want smooth, predictable throughput with no bursts.
The token bucket is the most widely used algorithm in public APIs. A bucket is filled with tokens at a fixed rate up to a maximum capacity. Each request consumes one token. If there are no tokens left, the request is rejected. The key difference from leaky bucket is that unused tokens accumulate, so a client that has been quiet can briefly burst above the steady-state rate. AWS API Gateway, Kong and most major API gateways use variations of this. It is the best default for public APIs because it handles legitimate traffic spikes without punishing well-behaved clients.
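As an illustration, a single-process token bucket fits in a few lines. Real rate limiters live in a gateway or a shared store, and the class and parameter names here are invented for the example.

```python
import time

class TokenBucket:
    """Tokens refill at a fixed rate up to capacity; each request spends one."""

    def __init__(self, capacity, refill_rate):
        self.capacity = capacity          # maximum burst size
        self.refill_rate = refill_rate    # tokens added per second
        self.tokens = capacity            # start full: a quiet client may burst
        self.last_refill = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Credit tokens accrued since the last check, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(capacity=5, refill_rate=1)   # 5-request burst, 1 req/s sustained
results = [bucket.allow() for _ in range(7)]
```

The first five requests pass on stored tokens; the next two are rejected until the refill rate earns the client another token.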
For distributed systems where multiple servers handle requests from the same client, store rate limit counters in a shared cache like Redis. A counter living inside a single server’s memory will give inconsistent results when a client’s requests fan out across multiple instances.
Caching at the Right Layer
Caching and rate limiting share the same goal: reduce the amount of work your servers do for each request. But they approach it differently.
Rate limiting controls who can ask. Caching controls what you recompute.
Response caching with HTTP headers
The cheapest cache hit is the one that never reaches your server at all. For public data that does not change per user, use HTTP cache headers to let CDNs and browsers serve the response themselves.
HTTP/1.1 200 OK
Cache-Control: public, max-age=3600
ETag: "a8f3c2d1"
Last-Modified: Mon, 01 Jul 2024 10:00:00 GMT
When a client sends a follow-up request with If-None-Match: "a8f3c2d1", your server checks if the ETag matches. If it does, you return 304 Not Modified with no body at all. Less bandwidth consumed and less work done on every layer of your stack.
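The server side of that exchange can be sketched roughly as follows. This is a hypothetical handler for illustration; most web frameworks compute and compare ETags for you.

```python
import hashlib

def handle_get(body, request_headers):
    """Return (status, headers, body), honouring If-None-Match revalidation."""
    # Derive a validator from the response payload
    etag = '"' + hashlib.sha256(body.encode()).hexdigest()[:8] + '"'
    if request_headers.get("If-None-Match") == etag:
        # Client's cached copy is still current: no body, minimal work
        return 304, {"ETag": etag}, ""
    return 200, {"ETag": etag, "Cache-Control": "public, max-age=3600"}, body

status, headers, _ = handle_get('{"price": 99}', {})
status2, _, body2 = handle_get('{"price": 99}', {"If-None-Match": headers["ETag"]})
```

The first fetch returns 200 with the full payload; the revalidation returns 304 with an empty body.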
Application-level caching
For data that requires computation or database access, cache the result in Redis or Memcached and serve it directly for subsequent requests.
function getProductCatalog(categoryId):
    cacheKey = "catalog:" + categoryId
    cached = redis.get(cacheKey)
    if cached is not null:
        return cached
    data = database.query("SELECT ... WHERE category_id = ?", categoryId)
    redis.set(cacheKey, data, ttl=600)
    return data
Be deliberate about what you cache and for how long. Product catalogs, public pricing and reference data are good candidates. User-specific data, live inventory counts and anything where staleness causes real problems are not.
Connection Pooling Is a Requirement, Not an Optimisation
Every database connection has a cost. Opening a new one takes anywhere from 20 to 100 milliseconds depending on your database and network. At a few requests per second that overhead is invisible. At thousands of requests per second it becomes the primary bottleneck.
Connection pooling keeps a set of open connections ready and reuses them across requests instead of opening and closing one per request.
// Without pooling
// Each request opens a connection (20-100ms overhead), runs the query
// then closes the connection. Wasteful at scale.
// With pooling
// Pool starts with minimum connections open.
// Each request borrows a connection, uses it and returns it.
// No connection setup cost on the hot path.
pool = createPool(min: 5, max: 20, idleTimeout: 30000)

function handleRequest(userId):
    connection = pool.acquire()
    try:
        return connection.query("SELECT * FROM users WHERE id = ?", userId)
    finally:
        pool.release(connection) // Always return the connection, even on error
The pool size matters. Setting it too high on a database with limited CPU cores causes the database to spend more time switching between connection handlers than actually processing queries. A common starting point is to set max connections to around 10 to 20 per application server and adjust based on observed database load. For PostgreSQL specifically, tools like PgBouncer act as a connection proxy that sits between your application servers and the database, providing a centralised pool shared across all instances.
Push Long Work Off the Request Path
A request that takes 5 seconds to complete does not just affect the user who sent it. It ties up a worker thread for 5 seconds. Under load, those workers fill up and new requests start queuing. The queue grows. Response times climb across the board.
The solution is to push anything slow or non-critical off the request path entirely.
When a user registers, you do not need to send their welcome email before you respond. You need to create their account. Do that, respond immediately with a 202 Accepted or 201 Created and put the email job in a queue for a background worker to process.
POST /users HTTP/1.1
// Handler logic
function registerUser(userData):
    user = database.create(userData)       // Fast: must happen now
    queue.publish("send_welcome_email", {  // Slow: can happen later
        userId: user.id,
        email: user.email
    })
    return respond(201, user)              // Respond immediately
This pattern works for any operation where the client does not need the result immediately: generating reports, processing uploaded files, sending notifications, calling third-party APIs and synchronising data between systems.
For long-running operations where the client does need the result eventually, return a job ID and provide a status endpoint they can poll, or use a webhook to notify them when processing is complete.
// Client submits a job
POST /reports/generate
→ 202 Accepted
→ { "jobId": "abc-123", "statusUrl": "/reports/abc-123/status" }
// Client polls for completion
GET /reports/abc-123/status
→ { "status": "processing", "progress": 60 }
GET /reports/abc-123/status
→ { "status": "complete", "downloadUrl": "/reports/abc-123/download" }
Popular queue systems include RabbitMQ for general-purpose job processing, Apache Kafka for high-throughput event streaming and Amazon SQS or Google Cloud Pub/Sub if you are already on those platforms.
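The submit-then-poll flow can be sketched with an in-memory queue standing in for one of those brokers. The function names and job record shape are invented for the illustration.

```python
import queue
import threading
import time
import uuid

jobs = {}             # jobId -> status record (a shared store in production)
work = queue.Queue()  # in-memory stand-in for RabbitMQ or SQS

def submit_report():
    """POST /reports/generate: enqueue the job and respond 202 immediately."""
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "queued"}
    work.put(job_id)
    return 202, {"jobId": job_id, "statusUrl": f"/reports/{job_id}/status"}

def worker():
    """Background worker: the slow part runs off the request path."""
    while True:
        job_id = work.get()
        jobs[job_id]["status"] = "processing"
        time.sleep(0.01)                   # stand-in for report generation
        jobs[job_id] = {"status": "complete",
                        "downloadUrl": f"/reports/{job_id}/download"}
        work.task_done()

threading.Thread(target=worker, daemon=True).start()
status, body = submit_report()  # returns immediately, before the work is done
work.join()                     # demo only: wait so we can inspect the result
```

The request handler's only job is to record intent and hand off; the worker's pace never backs up the request path.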
Pagination on Every Endpoint That Returns a List
An endpoint that returns all records in a table is a ticking time bomb. At a hundred records it is fine. At a hundred thousand it times out. At a million it takes your server down.
Every endpoint that returns a collection needs pagination. No exceptions.
There are two approaches. Offset-based pagination is simpler to implement but gets slower as the offset increases because the database reads and discards every row before the offset.
GET /products?page=1&limit=20
GET /products?page=500&limit=20 // Reads 10,000 rows, returns 20
Cursor-based pagination (also called keyset pagination) stays fast regardless of how deep into the dataset you go because it uses the last seen record as a starting point instead of a row count.
GET /products?limit=20
→ { data: [...], nextCursor: "cHJvZHVjdDo5OTk" }
GET /products?cursor=cHJvZHVjdDo5OTk&limit=20
→ { data: [...], nextCursor: "cHJvZHVjdDoxMDEx" }
For feeds, logs and anything that involves scrolling through a large ordered dataset, use cursor-based pagination. For simple admin interfaces where users need to jump to a specific page number, offset pagination is acceptable as long as the dataset stays bounded.
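The keyset query behind cursor pagination can be sketched like this, using an in-memory SQLite table and a base64-encoded last-seen id as the opaque cursor. The encoding scheme is an assumption for the example; any opaque token works.

```python
import base64
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO products (id, name) VALUES (?, ?)",
                 [(i, f"product-{i}") for i in range(1, 51)])

def list_products(cursor=None, limit=20):
    """Keyset pagination: seek past the last seen id instead of using OFFSET."""
    last_id = int(base64.b64decode(cursor)) if cursor else 0
    rows = conn.execute(
        "SELECT id, name FROM products WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, limit)).fetchall()
    next_cursor = (base64.b64encode(str(rows[-1][0]).encode()).decode()
                   if rows else None)
    return rows, next_cursor

page1, cur = list_products()            # ids 1-20
page2, _ = list_products(cursor=cur)    # continues at id 21, no rows discarded
```

Because `WHERE id > ?` uses the primary key index to seek directly to the starting row, page five hundred costs the same as page one.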
Idempotency for Non-Idempotent Operations
Networks are unreliable. Clients retry failed requests. Without careful design, a payment endpoint can charge a user twice or an order endpoint can create duplicate orders when a network timeout triggers a retry.
Idempotency keys solve this. The client generates a unique key for each logical operation and sends it with the request. The server stores the key and its result. If the same key arrives again, the server skips reprocessing and returns the stored result.
POST /payments HTTP/1.1
Idempotency-Key: 550e8400-e29b-41d4-a716-446655440000
Content-Type: application/json
{
"amount": 9900,
"currency": "MYR",
"customerId": "cust_123"
}
The server checks whether it has seen this key before. If yes, it returns the previous response. If no, it processes the request and stores the key with the result.
Idempotency keys matter most for POST endpoints that create resources or trigger financial operations. GET, PUT and DELETE are naturally idempotent by the HTTP standard. POST is not.
Stripe uses this pattern for their payment API and it is the right model for any API handling money or irreversible operations.
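The server-side check can be sketched with an in-memory store. A real implementation would persist keys in Redis or the database with a TTL and guard against two concurrent requests carrying the same key; the names here are invented for the example.

```python
import uuid

processed = {}  # idempotency key -> stored response (Redis/DB in production)
charges = []    # the side effect we must not duplicate

def create_payment(idempotency_key, amount):
    """Replay the stored response when a key has been seen before."""
    if idempotency_key in processed:
        return processed[idempotency_key]   # retry: no second charge
    charges.append(amount)                  # the irreversible side effect
    response = (201, {"paymentId": str(uuid.uuid4()), "amount": amount})
    processed[idempotency_key] = response
    return response

key = str(uuid.uuid4())
first = create_payment(key, 9900)
retry = create_payment(key, 9900)  # a network timeout triggered a client retry
```

Both calls return byte-identical responses, and the charge list still holds exactly one entry: the user was billed once.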
Version Your API Before You Need To
The worst time to think about versioning is after you have clients depending on your API in production and you need to make a breaking change.
The simplest and clearest approach is to include the version in the URL path:
GET /v1/users/123
GET /v2/users/123
The version is visible in every request, easy to route at the infrastructure level and obvious to anyone reading logs or documentation.
A breaking change is one that removes a field, renames a field, changes a field type or changes the behaviour of an existing endpoint in a way that will break existing clients. Adding new optional fields to a response is not a breaking change.
When you release a new major version, keep the previous version running during a documented deprecation window. Send a Sunset header on responses from deprecated endpoints to tell clients when support ends:
Sunset: Tue, 31 Dec 2025 23:59:59 GMT
Forcing clients to migrate with no warning is how you lose trust in an API.
Graceful Degradation Under Load
A well-designed API does not fail completely under extreme load. It degrades in a controlled way.
Circuit breaker pattern
When a downstream service like a payment gateway or third-party API is slow or failing, a circuit breaker stops sending requests to it after a threshold of failures. Instead of every request waiting for a timeout, the circuit trips and requests fail immediately with a clear error. After a recovery period, a small number of requests are allowed through to test if the service has recovered.
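A minimal sketch of the breaker's state machine follows. The thresholds and names are invented for the illustration; libraries such as resilience4j or Polly provide production-grade versions with half-open probe handling and metrics.

```python
import time

class CircuitBreaker:
    """Opens after N consecutive failures; allows a trial request
    once the recovery timeout has elapsed."""

    def __init__(self, failure_threshold=5, recovery_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.opened_at = None

    def is_open(self):
        if self.opened_at is None:
            return False
        if time.monotonic() - self.opened_at >= self.recovery_timeout:
            return False   # half-open: let a trial request through
        return True

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()

breaker = CircuitBreaker(failure_threshold=3, recovery_timeout=30.0)
for _ in range(3):
    breaker.record_failure()   # the downstream service keeps timing out
```

After the third failure the breaker is open: requests fail fast instead of tying up workers waiting for timeouts.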
if circuitBreaker.isOpen("payment-service"):
    return respond(503, { error: "Payment service temporarily unavailable" })

try:
    result = paymentService.charge(amount)
    circuitBreaker.recordSuccess("payment-service")
    return result
catch error:
    circuitBreaker.recordFailure("payment-service")
    throw error
Load shedding
Under extreme traffic, it is better to reject low-priority requests immediately and preserve capacity for critical ones than to accept everything and let everything slow down together. If your API serves both a critical checkout flow and a low-priority product recommendation endpoint, the recommendation endpoint should be the first to start returning errors when the system is under strain.
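One way to sketch priority-based shedding is to treat a 0-to-1 saturation signal (CPU, queue depth) as the input and drop a growing fraction of low-priority traffic as it climbs. The paths, threshold and signal shape here are assumptions for the example.

```python
import random

CRITICAL_PATHS = {"/checkout", "/payments"}

def should_shed(path, load):
    """Decide whether to reject this request. 'load' is a 0-1 saturation signal."""
    if path in CRITICAL_PATHS:
        return False                  # always serve the checkout flow
    if load < 0.8:
        return False                  # plenty of headroom, serve everything
    # Shed a growing fraction of low-priority traffic as load approaches 1.0
    shed_probability = (load - 0.8) / 0.2
    return random.random() < shed_probability
```

At 80% saturation nothing is shed; at full saturation every recommendation request is rejected immediately while checkout continues to be served.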
Monitor What Actually Matters
You cannot fix what you cannot see.
The four metrics that tell you the most about API health are latency (how long requests take), throughput (how many requests you handle per second), error rate (what percentage of requests result in 5xx responses) and saturation (how close your servers are to their limits on CPU, memory and connection counts).
Set up alerts on all four. Do not wait for users to report problems. A p99 latency spike is almost always visible in your metrics before it is visible in your support queue.
Beyond those four, track your cache hit rate, your queue depth for async jobs and your connection pool utilisation. These tell you where headroom is disappearing before it becomes a crisis.
The Order Matters
None of these are optional for an API serving real traffic at scale. But if you are building from scratch, do them in roughly this order.
Start with statelessness and versioning. They are architectural decisions that are painful to bolt on later. Add pagination before you have data to worry about. Implement rate limiting and caching before you go live. Add connection pooling as soon as you have a database. Push long work to queues once you start seeing latency spikes from slow operations. Add idempotency to any endpoint handling money or irreversible side effects.
The circuit breaker and load shedding patterns are refinements you add as your understanding of failure modes grows.
An API that handles millions of requests is not a different kind of API from one that handles thousands. It is the same API, built with the same patterns, applied consistently from the start.
The decisions are not complicated. The discipline to make them early is.