Queueing, stateless design, indexing, timeouts, retries, rate limiting, circuit breakers, backpressure, GeoDNS and multi-region, for engineers who have been paged at 3am

Read the full series
- Part 1: The Foundations — ways 1 to 10
- Part 2: Stop the System From Eating Itself — ways 11 to 20 (you are here)
- Part 3: The Infrastructure Layer — ways 21 to 30
- Part 4: The Techniques Most Lists Skip — ways 31 to 40
Part 1 covered techniques that add capacity. More nodes, bigger machines, smarter data distribution. Those answer the question of how to handle more traffic.
Part 2 answers a different question: how do you stop the system from destroying its own capacity when things get hard?
A system that collapses under pressure is not a scaled system. It is a system that works until the moment it matters most. The ten techniques here are about resilience, predictability and the operational discipline that separates production-grade architecture from infrastructure that just happens to be running on cloud servers.
11. Queueing
A message queue buffers work between a producer and a consumer. The producer puts a task into the queue and returns immediately. The consumer pulls tasks from the queue at its own pace.
This decouples the rate of task creation from the rate of task processing. If your API receives 10,000 image processing requests in one minute but your workers can only handle 500 per minute, the queue absorbs the difference. No requests are dropped. No upstream callers time out. Workers drain the backlog as capacity allows.
RabbitMQ, Apache Kafka, AWS SQS and Redis Streams are common implementations. Each has different guarantees around message ordering, delivery semantics and message retention duration.
Dead letter queues are where messages go after repeated processing failures. Without one, a single malformed message can block a worker indefinitely or trigger infinite retry loops. A dead letter queue isolates the problem so you can alert on it and investigate without blocking everything else.
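The requeue-then-dead-letter flow can be sketched with Python's in-process `queue` module standing in for a real broker. The `MAX_ATTEMPTS` budget and the message format are illustrative, not a recommendation:

```python
import queue

MAX_ATTEMPTS = 3                 # illustrative retry budget per message
tasks = queue.Queue()
dead_letters = queue.Queue()

def drain(process):
    """Pull messages until the queue is empty, dead-lettering repeat failures."""
    while not tasks.empty():
        msg, attempts = tasks.get()
        try:
            process(msg)
        except Exception:
            if attempts + 1 >= MAX_ATTEMPTS:
                dead_letters.put(msg)           # isolate the poison message
            else:
                tasks.put((msg, attempts + 1))  # requeue for another attempt

def process(msg):
    if msg == "malformed":
        raise ValueError("cannot parse message")

tasks.put(("resize photo.jpg", 0))
tasks.put(("malformed", 0))
drain(process)
print(dead_letters.qsize())  # → 1
```

The good message is processed once; the malformed one is retried twice and then parked where you can alert on it.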
What you give up: real-time feedback. A queued task is not processed immediately. Design your product experience around async workflows with status polling or webhook callbacks, not instant results.
12. Stateless Design
A stateless service stores nothing on the server between requests. Every request carries all the context the service needs to process it. The server handles the request, returns a response and has no memory of the exchange.
This matters for horizontal scaling because if a server holds session data for a specific user, that user must always be routed back to the same server. This is called sticky sessions, and it creates both uneven load distribution and fragility. If that server goes down, all its sessions are gone.
A stateless service removes this constraint entirely. Any server in the pool can handle any request at any time. You add nodes, replace nodes and remove nodes without worrying about which user’s session lives where.
Session data does not disappear. It moves. Common approaches are client-side tokens, where JWTs carry user context inside the token itself; shared session stores, where Redis holds session data centrally, accessible to all servers; and database-backed sessions.
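A minimal sketch of the client-side token approach: an HMAC-signed payload in the spirit of a JWT. The shared secret and claim names are illustrative, and a real deployment would use a proper JWT library and managed keys. The point is that any server holding the secret can validate the request with no local state:

```python
import base64
import hashlib
import hmac
import json

SECRET = b"shared-secret"  # illustrative; real systems use managed key material

def sign(claims: dict) -> str:
    """Encode claims and append an HMAC so any server can verify them."""
    body = base64.urlsafe_b64encode(json.dumps(claims).encode())
    sig = hmac.new(SECRET, body, hashlib.sha256).hexdigest()
    return body.decode() + "." + sig

def verify(token: str) -> dict:
    """Recover the claims, rejecting tokens whose signature does not match."""
    body, sig = token.rsplit(".", 1)
    expected = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        raise ValueError("invalid token signature")
    return json.loads(base64.urlsafe_b64decode(body))
```

The request carries everything the server needs, so any node in the pool can handle it.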
Stateless design is one of the twelve factors in the Twelve-Factor App methodology, which has shaped how cloud-native applications are built since its publication.
What you give up: the simplicity of local state. Token validation, shared cache lookups and token expiry management are additional concerns your application must handle explicitly.
13. Indexing
A database index is a data structure that speeds up query lookups at the cost of additional storage and slower writes.
Without an index, the database performs a full sequential scan. It reads every row in the table to find matches. With the right index on the right column, it jumps directly to the matching rows. A query that takes tens of seconds on a large unindexed table can return in under a millisecond with a proper index.
B-tree indexes are the default in most relational databases. They handle equality lookups and range queries well. Hash indexes are faster for exact equality but cannot support range queries. PostgreSQL’s GIN and GiST indexes handle more complex types including full-text search, JSONB fields and geometric data.
Composite indexes cover multiple columns. The column order matters. An index on (user_id, created_at) helps queries that filter by user_id alone or by both user_id and created_at but does nothing for queries that filter only by created_at.
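The column-order rule can be observed directly with SQLite's query planner. The table and index names here are invented for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, created_at TEXT, payload TEXT)")
conn.execute("CREATE INDEX idx_user_created ON events (user_id, created_at)")

def plan(sql):
    """Return the query planner's chosen access path as a string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(row[3] for row in rows)

# Filtering on the leading column can use the composite index...
print(plan("SELECT * FROM events WHERE user_id = 42"))
# ...but filtering only on the trailing column forces a full table scan.
print(plan("SELECT * FROM events WHERE created_at > '2024-01-01'"))
```

The same experiment in PostgreSQL uses EXPLAIN ANALYZE and shows the identical pattern.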
The common mistake is adding too many indexes. Every index slows inserts and updates because the index must be maintained alongside the table. A table with fifteen indexes on a high-write workload often performs worse overall than the same table with three well-chosen ones.
What you give up: write performance. Profile your actual query patterns before adding indexes. EXPLAIN ANALYZE in PostgreSQL is the right starting point.
14. Timeouts
A timeout sets a hard upper limit on how long an operation can wait before giving up.
Without timeouts, a slow downstream service holds your thread or connection open indefinitely. If enough requests pile up waiting for that slow response, your thread pool exhausts and your entire service becomes unresponsive even though the root cause is in a completely different component. This is how cascading failures start.
There are several distinct kinds. Connection timeouts control how long to wait when establishing a connection. Read timeouts control how long to wait for data after a connection is open. Request timeouts control the total allowed time for the full request-response cycle.
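The connection and read timeouts can be demonstrated with a bare socket. The server below accepts a connection but never sends data, simulating a slow dependency; the addresses and timeout values are illustrative:

```python
import socket

# A server that accepts connections but never replies simulates a slow dependency.
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)
host, port = server.getsockname()

# Connection timeout: how long to wait for the TCP handshake to complete.
client = socket.create_connection((host, port), timeout=1.0)
# Read timeout: how long to wait for data once the connection is open.
client.settimeout(0.2)

try:
    client.recv(1024)          # no data ever arrives
    result = "responded"
except socket.timeout:
    result = "timed out"       # give up instead of holding the connection open

client.close()
server.close()
print(result)  # → timed out
```

Without the read timeout, `recv` would block forever and the calling thread would be lost to the slow dependency.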
Setting the right values is harder than it looks. Too tight and you get false positives where legitimate slow operations are cancelled unnecessarily. Too loose and timeouts stop providing protection. Observing the p99 latency of a dependency in actual production traffic is the right starting point, not a guess.
Timeouts must be configured at every layer of a distributed system. If service A calls service B which calls service C, each hop needs its own timeout. Without them, a delay deep in the chain propagates upstream and users waiting on service A wait far longer than they should.
What you give up: synchronous support for operations that are genuinely slow. Bulk exports, report generation and batch jobs should run as async tasks in a queue rather than synchronous requests with extended timeouts.
15. Retries
Retries automatically repeat a failed operation. Transient failures, brief network interruptions, momentary service unavailability and throttling responses often resolve on the next attempt without any real underlying problem.
The naive implementation is an immediate retry on failure. Under load this makes the situation worse. If a service is already struggling, a flood of immediate retries adds more pressure at exactly the wrong moment.
Exponential backoff is the standard approach. After the first failure wait one second. After the second wait two seconds. After the third wait four. The delay doubles with each attempt up to a configured maximum. Adding jitter, a small random offset to the delay, prevents multiple clients from retrying in perfect synchrony and creating a thundering herd.
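A minimal sketch of exponential backoff with jitter. The attempt count and delay bounds are illustrative defaults, not recommendations:

```python
import random
import time

def retry(fn, max_attempts=5, base_delay=1.0, max_delay=30.0):
    """Call fn, retrying failures with exponentially growing, jittered delays."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # attempts exhausted: surface the failure
            delay = min(base_delay * 2 ** attempt, max_delay)  # 1s, 2s, 4s, ...
            time.sleep(delay + random.uniform(0, delay))       # jitter breaks sync
```

The jitter term spreads retries from many clients across time instead of letting them arrive in lockstep.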
Retry budgets cap how many retries a service will attempt across all requests in a time window. This prevents retries from amplifying load beyond what the system can absorb during a degraded period.
Not every operation should be retried. Non-idempotent operations like creating a payment or placing an order need idempotency keys or deduplication logic so that retrying does not create duplicate records.
What you give up: retry storms are real. A burst of failures combined with aggressive retry logic can turn a brief hiccup into a sustained outage. Circuit breakers, covered next, are the complementary mechanism.
16. Rate Limiting
Rate limiting controls how many requests a client or service can make within a given time window. Exceeding the limit produces a rejection, typically HTTP 429 Too Many Requests, rather than slow degraded processing.
This protects services from overload, prevents individual clients from consuming disproportionate resources and defends against patterns like credential stuffing and scraping.
The token bucket algorithm gives each client a bucket of tokens that replenish at a fixed rate. Each request consumes a token. When the bucket is empty, requests are rejected. The sliding window algorithm tracks request counts over a rolling time window rather than fixed intervals. The leaky bucket processes requests at a fixed output rate regardless of how fast they arrive.
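The token bucket fits in a few lines. The rate and capacity here are illustrative, and a production limiter spanning multiple application instances would keep the counters in shared state such as Redis:

```python
import time

class TokenBucket:
    """Per-client token bucket: capacity bounds bursts, rate bounds throughput."""

    def __init__(self, rate, capacity):
        self.rate = rate              # tokens replenished per second
        self.capacity = capacity      # maximum burst size
        self.tokens = capacity
        self.updated = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill based on elapsed time, never exceeding the bucket's capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller responds with HTTP 429
```

A client bursting faster than the refill rate drains the bucket and starts seeing rejections until tokens accumulate again.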
Rate limits apply at multiple layers. API gateways enforce per-client or per-IP limits. Application code enforces per-user or per-resource limits. Infrastructure-level rate limiting protects databases and downstream services from being overwhelmed by application code that misbehaves.
Kong, AWS API Gateway, NGINX and Cloudflare all include rate limiting as native features. Redis is commonly used to store rate limit counters in distributed systems where multiple application instances need to share state.
What you give up: legitimate traffic gets blocked when it exceeds limits. Thresholds require calibration based on real usage data. Guessed limits that are too low will cause real user impact.
17. Circuit Breaker
The circuit breaker pattern prevents a failing downstream service from taking down everything that depends on it.
Named after the electrical component, it operates in three states. Closed is normal operation where requests flow through to the downstream service. When the failure rate exceeds a configured threshold, the circuit opens. In the open state, requests are rejected immediately without attempting to call the failing service, protecting it from additional load while it recovers. After a configurable timeout, the circuit enters a half-open state and allows a small number of test requests through. If those succeed, the circuit closes again.
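The three-state machine can be sketched as below. Thresholds and timeouts are illustrative, and libraries like Resilience4j add far more, such as sliding-window failure rates and per-call metrics:

```python
import time

class CircuitBreaker:
    """Minimal three-state circuit breaker: closed -> open -> half-open."""

    def __init__(self, failure_threshold=3, recovery_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, fn):
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.recovery_timeout:
                self.state = "half-open"   # allow a test request through
            else:
                raise RuntimeError("circuit open: request rejected")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.state = "open"        # trip: stop calling the failing service
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        if self.state == "half-open":
            self.state = "closed"          # test request succeeded: resume
        return result
```

While open, every call fails immediately without touching the downstream service, which is exactly what gives it room to recover.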
The distinction from retries is important. Retries handle transient failures. A circuit breaker handles sustained failures where continuing to send requests makes the situation worse. If your payment processor is down, retrying every request floods an already failing system. An open circuit stops that entirely.
Resilience4j is the current standard library for Java and Kotlin. Netflix Hystrix was the original widely adopted library but has been in maintenance mode for years, with Resilience4j as the recommended replacement. Service meshes like Istio implement circuit breaking at the network layer without requiring application-level code changes.
What you give up: complexity in configuration. Failure thresholds, timeout windows and recovery criteria all need careful tuning per service. A circuit breaker that trips too easily causes more disruption than it prevents.
18. Backpressure
Backpressure is a mechanism that lets an overwhelmed component signal upstream that it cannot accept more work, slowing down the flow of incoming requests before something breaks.
Without backpressure, a fast producer and a slow consumer create an unbounded queue. Memory fills. The process crashes. Or the queue grows until messages are so old they have no value by the time they are processed.
With backpressure, the slow consumer signals that it cannot keep up and the producer responds by reducing its production rate. This creates a self-regulating system instead of a runaway one.
In practice, backpressure shows up as bounded queues, where new items are rejected once the queue reaches its size limit; as flow control in network protocols, where TCP implements backpressure at the transport layer via its receive window; and as explicit demand signalling in reactive programming frameworks like Project Reactor and RxJava.
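A bounded queue is the simplest backpressure primitive. The sketch below rejects work once the buffer is full rather than growing without bound; the buffer size is deliberately tiny for illustration:

```python
import queue

inbox = queue.Queue(maxsize=2)   # the bound itself is the backpressure signal

accepted, rejected = 0, 0
for task in range(5):
    try:
        inbox.put_nowait(task)   # fail fast instead of buffering indefinitely
        accepted += 1
    except queue.Full:
        rejected += 1            # upstream must slow down, retry later or shed

print(accepted, rejected)  # → 2 3
```

The `queue.Full` exception is the signal travelling upstream: the producer learns immediately that the consumer cannot keep up.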
Backpressure and queueing work in opposite directions. Queueing absorbs bursts by buffering work. Backpressure limits bursts by slowing producers. In a well-designed system both mechanisms cooperate.
What you give up: throughput during the periods backpressure is active. Sizing your consumer capacity correctly reduces how often it needs to kick in.
19. GeoDNS
GeoDNS routes incoming DNS queries to different servers based on the geographic location of the requester. A user in Southeast Asia resolves your domain to a server in Singapore. A user in Europe resolves the same domain to a server in Frankfurt.
The primary benefit is reduced latency. A round trip that takes 200 milliseconds to a distant data centre takes 10 milliseconds to a nearby one. For latency-sensitive applications this difference is significant.
GeoDNS is also used for traffic distribution, disaster recovery routing where traffic from a failed region is redirected to a healthy one and regulatory compliance where data must stay within specific geographic boundaries.
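The routing decision itself reduces to a lookup keyed on the requester's inferred location. A toy sketch, with invented region names and addresses drawn from the documentation IP ranges:

```python
# Illustrative GeoDNS routing table: region names and addresses are made up.
GEO_RECORDS = {
    "ap-southeast": "203.0.113.10",  # Singapore
    "eu-central": "198.51.100.20",   # Frankfurt
}
DEFAULT_RECORD = "192.0.2.30"        # fallback when the location is unknown

def resolve(client_region):
    """Return the A record for the client's inferred region (a best estimate)."""
    return GEO_RECORDS.get(client_region, DEFAULT_RECORD)
```

The fallback entry matters in practice: geolocation fails often enough that every GeoDNS setup needs a sensible default.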
AWS Route 53 with geolocation routing policies, Cloudflare and NS1 all provide GeoDNS as a managed service.
TTL values on DNS records determine how quickly routing changes propagate. A short TTL means faster failover but more DNS query load. A long TTL means slower failover but less overhead on DNS infrastructure.
What you give up: accuracy. IP geolocation is not perfect. VPN users, proxy traffic and some ISPs produce incorrect location signals. The routing decision is always a best estimate.
20. Multi-Region Deployment
A multi-region deployment runs your application in multiple geographic data centres simultaneously. Users in each region hit the nearest set of servers rather than crossing an ocean to reach a distant one.
This reduces latency for globally distributed users, provides fault isolation where a regional outage does not take the whole system down and satisfies data residency requirements in regulated industries.
The hard problem is always data. Keeping databases consistent across regions adds significant complexity. Active-passive has one primary region handling all writes while other regions handle reads from replicas. Active-active allows any region to serve writes with synchronisation between them. Geographic sharding keeps each region’s user data in that region and routes requests accordingly.
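The active-passive write routing above can be sketched as a priority-ordered failover. Region names and the health-check input are illustrative:

```python
# Priority-ordered regions: the first healthy one serves all writes.
REGION_PRIORITY = ["eu-central-1", "us-east-1"]  # primary first, then standby

def pick_write_region(health):
    """health maps region name to a bool from an external health check."""
    for region in REGION_PRIORITY:
        if health.get(region, False):
            return region
    raise RuntimeError("no healthy region available")
```

When the primary fails its health check, writes fail over to the standby; when it recovers, they move back. The hard part this sketch omits is replicating the data so the standby is actually ready to take writes.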
Active-active multi-region is among the most complex production architectures that exist. Global databases like Google Spanner and CockroachDB are built specifically to handle cross-region consistency. For most teams, active-passive with regional read replicas provides most of the latency benefit at much lower operational cost.
What you give up: simplicity, cost and debugging velocity. Every distributed system problem becomes harder when the nodes are on different continents with variable network conditions between them.
The shift from capacity to resilience
Part 1 was about adding capacity. Part 2 is about using that capacity without it collapsing. Timeouts prevent thread exhaustion. Rate limits prevent overload. Circuit breakers stop cascading failures. Backpressure prevents runaway queues. GeoDNS and multi-region ensure the system survives regional failures.
These are not optional features for large companies. They are the difference between a system that scales and one that falls apart at the worst possible moment.
Part 3 covers ways 21 through 30: containers, orchestration, service mesh, CDN, monitoring, distributed tracing, failover, graceful degradation, consistent hashing and bulkhead.
