
40 Ways to Scale a System, Part 4: The Techniques Most Lists Skip

April 4, 2026

Prefetching, lazy loading, capacity planning, read replicas, write batching, connection pooling, CQRS, serverless, edge computing and async I/O, including five that should have been in the original list

Read the full series
Part 1: The Foundations — ways 1 to 10
Part 2: Stop the System From Eating Itself — ways 11 to 20
Part 3: The Infrastructure Layer — ways 21 to 30
Part 4: The Techniques Most Lists Skip — ways 31 to 40 (you are here)

Part 4 closes the series. But before the final ten, a word on what most scaling technique lists get wrong.

A lot of lists pad their count with entries that are not actually techniques. High Availability is a goal, not a method. CAP Tradeoff is a theoretical framework. Hot Standby is just Replication and Failover used together. Modularity is a software design principle.

This series leaves those out and includes five things that genuinely belong but rarely appear: connection pooling, CQRS, serverless, edge computing and async I/O. Each is something you can actually implement. Each solves a distinct problem at scale.

Here are the final ten.

31. Prefetching

Prefetching loads data before it is explicitly requested, placing it in cache or memory so that when the request arrives the response is immediate.

Many user actions are predictable. After a user opens their inbox, they are likely to read one of the emails. After loading a product page, they may click a related product. After viewing page one of search results, they may navigate to page two. Prefetching exploits this predictability by loading likely-next data during idle moments.

In the browser, rel="prefetch" hints in HTML link elements tell the browser to fetch resources likely needed for the next navigation, while rel="preload" prioritises resources the current page will need. Server-side prefetching is more deliberate: model the access patterns of your users, compute the most likely next actions and preload that data into caches before the request arrives.
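
Server-side prefetching can be sketched in a few lines. The cache, the predictor and the loader below are hypothetical stand-ins for a real cache, an access-pattern model and a data store:

```python
cache = {}

def load_from_db(key):
    # Stand-in for an expensive database read.
    return f"data-for-{key}"

def predict_next(key):
    # Stand-in for an access-pattern model; this static mapping says
    # "a user who opens the inbox usually reads the top email next".
    return {"inbox": "email-1", "results-page-1": "results-page-2"}.get(key)

def handle_request(key):
    # Serve from cache if an earlier prefetch already loaded this key.
    value = cache.get(key) or load_from_db(key)
    # During the idle moment after responding, prefetch the likely next key.
    nxt = predict_next(key)
    if nxt is not None and nxt not in cache:
        cache[nxt] = load_from_db(nxt)
    return value
```

After handling a request for the inbox, the top email is already sitting in the cache, so the predicted follow-up read is a hit.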

DNS prefetching is a simpler application of the same idea. Hinting to the browser to resolve DNS for domains it is likely to need reduces lookup time when those resources are eventually requested.

The risk is wasted computation. If your prediction model is inaccurate, you prefetch data that is never used, consuming bandwidth, CPU and cache space for no benefit. Prefetching needs to be calibrated against your actual access patterns, not assumptions.

What you give up: resources consumed on mispredictions. Prefetching is always a bet on user behaviour.

32. Lazy Loading

Lazy loading defers the loading of a resource until it is actually needed.

The counterpart to prefetching, lazy loading improves initial load performance by not fetching data that may never be required. A long page does not load images below the fold until the user scrolls down to them. A dashboard does not load tab data until the user clicks that tab. An admin panel does not query all user records until the admin opens that section.

In databases, lazy loading is an ORM pattern where related records are not fetched until accessed in application code. In Laravel, Eloquent relationships are loaded on demand unless explicitly eager-loaded. In Django, querysets are not evaluated until iteration.

The N+1 query problem is the classic failure mode of lazy loading in ORM contexts. Load a list of 100 orders and then access the customer relationship on each order and you fire one query for the orders followed by 100 individual queries to fetch each customer. The solution is eager loading: fetching related records in the initial query using JOINs or batch lookups.
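
The difference is easy to demonstrate with Python's built-in sqlite3 module; the tables and row counts below are illustrative:

```python
import sqlite3

# N+1 versus eager loading, demonstrated with a query counter.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
""")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(i, f"customer-{i}") for i in range(100)])
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(i, i) for i in range(100)])

# Lazy (N+1): one query for the orders, then one per customer.
queries = 0
orders = conn.execute("SELECT id, customer_id FROM orders").fetchall()
queries += 1
for _, customer_id in orders:
    conn.execute("SELECT name FROM customers WHERE id = ?",
                 (customer_id,)).fetchone()
    queries += 1                      # 1 + 100 = 101 queries in total

# Eager: a single JOIN fetches orders and their customers together.
rows = conn.execute("""
    SELECT o.id, c.name FROM orders o
    JOIN customers c ON c.id = o.customer_id
""").fetchall()                       # one query
```

The lazy path issues 101 queries where the eager path issues one; at scale the difference is a page that feels instant versus one that saturates the database connection pool.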

For web assets, the browser’s IntersectionObserver API enables efficient lazy loading of images and iframes, triggering loads only when the element enters the viewport. This is now a standard performance technique for content-heavy pages.

What you give up: predictability. Lazy loading can produce perceptible latency spikes when a user action triggers a load that could have been done earlier. Profile real user sessions to identify where eager loading would produce better perceived performance.

33. Capacity Planning

Capacity planning is the process of estimating future resource requirements and provisioning infrastructure ahead of demand.

This requires understanding current resource consumption across CPU, memory, disk, network bandwidth and database connections, modelling how those resources scale with user growth and anticipating demand spikes from product launches, marketing campaigns and seasonal peaks.

Back-of-envelope calculation is a core skill here. If your application serves 10,000 daily active users each making an average of 50 requests per day and each request consumes 200 milliseconds of CPU, you can estimate current CPU requirements and project what the numbers look like at 100,000 users.
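
Those numbers can be worked through directly. The peak-to-average ratio at the end is an assumption for illustration, not a figure from the text:

```python
# Back-of-envelope CPU estimate with the numbers from the text.
daily_users = 10_000
requests_per_user_per_day = 50
cpu_seconds_per_request = 0.2                 # 200 ms

cpu_seconds_per_day = (daily_users * requests_per_user_per_day
                       * cpu_seconds_per_request)   # 100,000 CPU-seconds
avg_cores = cpu_seconds_per_day / 86_400      # ~1.16 cores, averaged over a day

# Growth is linear in users, so 100,000 users needs ~11.6 average cores.
avg_cores_at_100k = avg_cores * 10

# Averages hide peaks: with an assumed 5x peak-to-average ratio,
# peak demand is closer to 58 cores.
peak_cores_at_100k = avg_cores_at_100k * 5
```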

Load testing is the empirical counterpart to estimation. Tools like k6, Locust and Apache JMeter simulate realistic traffic against your actual system, revealing bottlenecks before they become production incidents. A load test that identifies your database connection pool as the bottleneck at 5,000 concurrent users is worth running before your traffic reaches that level, not after.

Runway is the metric that matters in practice: at current growth rates, how many months before you hit a specific resource limit? Knowing you have three months before a database disk fills up is actionable. Finding out after the disk is full is an incident.
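
A minimal runway calculation, with illustrative disk numbers:

```python
# Runway: months until a specific resource limit is hit at the
# current growth rate. All numbers here are illustrative.
disk_total_gb = 500
disk_used_gb = 320
growth_gb_per_month = 45

runway_months = (disk_total_gb - disk_used_gb) / growth_gb_per_month
```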

What you give up: capacity planning takes time that most teams deprioritise in favour of feature work. Teams without it tend to scale reactively, in response to outages rather than ahead of them.

34. Read Replica

A read replica is a database copy that is kept synchronised with the primary and configured to serve read-only queries.

Read replicas address a specific bottleneck: read-heavy workloads where the primary database spends most of its time serving SELECT queries rather than writes. By routing reads to replicas, the primary focuses on writes and read throughput scales by adding more replicas.

This is different from sharding. All replicas contain the same full dataset as the primary. You are scaling read capacity, not data volume. If your dataset is too large for a single server, you need sharding, not replicas.

Replication lag is the main tradeoff. Asynchronous replication means a replica may be slightly behind the primary. A write that just happened may not yet be visible on the replica. Applications that read their own writes need to be routed to the primary for that read, or the replication lag must be negligible.
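
One common way to handle read-your-own-writes is to pin a session's reads to the primary for a window longer than typical replication lag. A sketch, with hypothetical connection handles and a pin window chosen by assumption:

```python
import time

PIN_SECONDS = 2.0                      # assumed upper bound on replication lag

class Router:
    """Route writes to the primary; route reads to a replica unless the
    session wrote recently, in which case pin its reads to the primary."""

    def __init__(self, primary, replica):
        self.primary = primary         # hypothetical connection handles
        self.replica = replica
        self._last_write = {}          # session id -> monotonic timestamp

    def target_for_write(self, session):
        self._last_write[session] = time.monotonic()
        return self.primary            # writes always hit the primary

    def target_for_read(self, session):
        last = self._last_write.get(session, float("-inf"))
        if time.monotonic() - last < PIN_SECONDS:
            return self.primary        # read-your-own-writes window
        return self.replica

router = Router(primary="primary-conn", replica="replica-conn")
```

A session that has not written recently reads from the replica; immediately after a write, its reads go to the primary, while other sessions continue to use the replica.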

AWS RDS read replicas, Google Cloud SQL read replicas and PostgreSQL streaming replication all implement this pattern. Amazon Aurora PostgreSQL allows up to 15 Aurora Replicas in a cluster sharing the same underlying storage volume, which significantly reduces replication lag compared to traditional streaming replication.

What you give up: eventual consistency between reads directed at replicas versus the primary. Applications must be designed with this in mind.

35. Write Batching

Write batching groups multiple write operations together and executes them as a single database operation, reducing the per-write overhead of network round trips, transaction initiation and log flushing.

The cost of a database write is not just the I/O of writing data. It includes establishing or borrowing a connection, beginning and committing a transaction, flushing the write-ahead log and a network round trip. If you write 1,000 rows individually, you pay this overhead 1,000 times. Batching reduces that to a single transaction inserting all 1,000 rows together.

PostgreSQL and MySQL both support multi-row INSERT syntax that inserts multiple rows in a single statement and significantly outperforms individual inserts for bulk operations.
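
A sketch of the pattern using Python's sqlite3 module, where executemany inside a single transaction plays the role of a batched multi-row insert:

```python
import sqlite3

# Batched insert: 1,000 rows in one transaction pay the per-statement
# overhead once instead of 1,000 times.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, payload TEXT)")

rows = [(i, f"event-{i}") for i in range(1000)]

with conn:                                       # a single transaction
    conn.executemany("INSERT INTO events VALUES (?, ?)", rows)

count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```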

Apache Kafka producers use write batching by default. Records are buffered in memory and sent to the broker in batches. The batch.size and linger.ms configuration parameters control how large batches grow and how long the producer waits before flushing.

The tradeoff is latency. A write held in a batch is not immediately durable. It waits until the batch fills or a timeout expires before flushing to the database. For event logging, analytics pipelines and audit trails this delay is acceptable. For financial transactions where immediate confirmation is required, individual writes are the right choice.

What you give up: immediate durability per write. Data in the buffer is not yet persisted. A process crash before flushing loses that batch. The acceptable loss window determines how you configure batch size and flush interval.

36. Connection Pooling

Connection pooling maintains a pool of reusable database connections so that your application reuses existing connections rather than opening a new one for every request.

Opening a database connection is not free. It involves a TCP handshake, authentication and session initialisation. In PostgreSQL this process adds roughly 20 to 50 milliseconds of overhead per connection. At thousands of requests per second, creating a new connection for every request exhausts the database’s connection limit, spikes CPU usage for connection management and adds latency to every operation.

Connection pooling solves this by keeping a fixed pool of open connections. When a request needs the database, it borrows a connection from the pool, uses it and returns it. The connection stays open and ready for the next borrower.
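
A minimal pool can be sketched with a thread-safe queue; the sqlite3 connections here stand in for real database connections:

```python
import queue
import sqlite3

class Pool:
    """A fixed set of open connections held in a thread-safe queue.
    Borrowers block when the pool is exhausted instead of opening
    new connections."""

    def __init__(self, size, factory):
        self._conns = queue.Queue(maxsize=size)
        for _ in range(size):
            self._conns.put(factory())       # connections opened once, up front

    def acquire(self, timeout=5.0):
        # Raises queue.Empty if no connection frees up within the timeout.
        return self._conns.get(timeout=timeout)

    def release(self, conn):
        self._conns.put(conn)                # return the connection, still open

pool = Pool(size=4, factory=lambda: sqlite3.connect(":memory:"))

conn = pool.acquire()
result = conn.execute("SELECT 1").fetchone()[0]
pool.release(conn)
```

Production poolers add health checks, idle-connection recycling and leak detection on top of this core borrow-and-return loop.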

PgBouncer is the standard external connection pooler for PostgreSQL in production. It sits between your application and the database and handles the pool outside of the application process, which is critical in serverless and horizontally scaled environments where application-level pooling cannot share state across instances. HikariCP is the standard for Java applications. Laravel uses PDO persistent connections or a dedicated external pooler for the same purpose.

Pool size is a real configuration problem. Too small and requests queue up waiting for a connection. Too large and you overwhelm the database with simultaneous connections. HikariCP’s documentation recommends a starting formula of (number of CPU cores x 2) + number of effective spindle disks as a baseline for calibration.

What you give up: slight connection acquisition overhead even when borrowing from the pool rather than creating a fresh one. This overhead is orders of magnitude smaller than connection creation cost.

37. CQRS

CQRS stands for Command Query Responsibility Segregation. It separates the write model from the read model so that each can be scaled, optimised and maintained independently.

In a standard CRUD architecture, reads and writes share the same data model and the same database. When the read patterns and write patterns have different characteristics, this creates tension. A highly normalised schema optimised for write consistency may be inefficient for read queries. A schema denormalised for fast reads may complicate writes.

CQRS splits them. The command side handles writes using a model optimised for enforcing business rules and data integrity. The query side handles reads using a model optimised for the specific read patterns your application needs, often a denormalised read store that can be served from a separate read-optimised database or cache.
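
A toy sketch of the split, with in-memory dicts standing in for the separate write and read stores:

```python
# Commands mutate a normalised write store and enforce business rules;
# a projection keeps a denormalised read model in sync. In a real
# system these would be separate databases.

write_store = {}        # order_id -> {"customer": ..., "items": [...]}
read_store = {}         # customer -> order count, a read-optimised view

def handle_place_order(order_id, customer, items):
    # Command side: validate, then write normalised data.
    if not items:
        raise ValueError("an order needs at least one item")
    write_store[order_id] = {"customer": customer, "items": items}
    project_order_placed(customer)

def project_order_placed(customer):
    # Query side: update the denormalised view the UI reads from.
    read_store[customer] = read_store.get(customer, 0) + 1

def query_order_count(customer):
    # Reads never touch the write store.
    return read_store.get(customer, 0)
```

In practice the projection is usually fed asynchronously, via events or change data capture, which is exactly where the synchronisation complexity mentioned below comes from.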

This allows the read model and write model to be scaled independently. If reads are the bottleneck, scale the read side without touching the write infrastructure. If writes are the bottleneck, optimise the write side without affecting read performance.

CQRS is described in AWS prescriptive guidance for microservices architectures and appears in production architectures at companies processing high-volume transaction workloads. It pairs naturally with event sourcing, where the state of the system is derived from a log of events rather than direct database mutations.

What you give up: complexity. CQRS adds architectural overhead and the need to synchronise the command and query data models. It is the right choice when read and write patterns are genuinely divergent, not a default approach for every service.

38. Serverless

Serverless is a deployment and execution model where you write functions, the cloud provider runs them on demand and you are billed only for actual execution time. There are no servers to provision, patch or manage.

AWS Lambda, Google Cloud Functions and Azure Functions are the major Function-as-a-Service platforms. You deploy a function, define what events trigger it such as HTTP requests, queue messages, scheduled jobs or storage events and the platform handles everything else: spinning up instances when requests arrive, scaling to zero when there is no traffic and scaling to thousands of concurrent executions during a spike.
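
The handler shape for a Python Lambda behind an HTTP trigger looks roughly like this; the event fields follow the API Gateway proxy format, and the response body is illustrative:

```python
import json

def handler(event, context):
    # The platform invokes this function per request; there is no server
    # process of yours running in between invocations.
    name = (event.get("queryStringParameters") or {}).get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"hello, {name}"}),
    }

# Locally, the handler is just a function call with a fabricated event.
response = handler({"queryStringParameters": {"name": "scaling"}}, None)
```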

The scaling model is fundamentally different from everything else on this list. Instead of capacity planning, autoscaling groups and idle servers waiting for traffic, you write code that runs only when triggered. A service that receives 10 requests per day and 100,000 requests during a campaign costs almost nothing during the slow period and scales instantly during the spike without any configuration change.

Cold starts are the known tradeoff. When a function has not run recently, the runtime environment needs to be initialised before the function can execute. AWS Lambda cold starts typically add 100 to 1,000 milliseconds depending on the runtime and function size. Provisioned concurrency keeps a set of pre-initialised instances warm at additional cost.

Serverless is not appropriate for all workloads. Long-running processes, functions requiring persistent in-memory state and workloads with consistent high traffic may be cheaper and simpler on traditional compute. But for bursty, event-driven and variable workloads, the model changes the cost and operational equation significantly.

What you give up: control over the execution environment, cold start latency for infrequently called functions and some degree of vendor dependency on platform-specific runtime behaviour.

39. Edge Computing

Edge computing runs application logic at the network edge, physically close to users, rather than in a centralised data centre.

This is distinct from CDN, covered in part 3. A CDN caches static content and delivers it from nearby nodes. Edge computing runs code at those nodes. The distinction matters. A CDN serves the same cached file to every user. Edge computing runs logic that can produce different responses based on the request, the user’s location, authentication state or real-time conditions, without the request ever reaching your origin server.

Cloudflare Workers runs JavaScript and WebAssembly on V8 isolates across Cloudflare’s network of over 300 cities, with cold starts under 5 milliseconds. AWS Lambda@Edge runs functions at CloudFront edge locations with cold starts in the 50 to 200 millisecond range and deeper integration with the AWS stack. Fastly Compute@Edge uses WebAssembly for consistent execution across edge locations.

Practical uses include authentication and authorisation at the edge before a request reaches your origin, geographic redirects where users in different regions are routed to different content without a round trip to your server, bot detection and filtering, A/B testing at the edge and real-time personalisation based on request headers or cookies.
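
The routing logic for a geographic redirect is small enough to sketch platform-neutrally. The header name matches the CF-IPCountry header Cloudflare injects at the edge; the URLs are illustrative:

```python
# Edge routing logic: choose a response from a geolocation header the
# edge platform injects, without the request reaching the origin.
REGION_SITES = {
    "DE": "https://example.com/de",
    "FR": "https://example.com/fr",
}
DEFAULT_SITE = "https://example.com/en"

def route(headers):
    country = headers.get("CF-IPCountry", "")
    target = REGION_SITES.get(country, DEFAULT_SITE)
    # A real edge function would return an HTTP 302 to `target` directly
    # from the edge node nearest the user.
    return {"status": 302, "location": target}
```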

What you give up: edge runtimes have resource constraints. Cloudflare Workers limits each isolate to 128 MB of memory and 30 seconds of CPU time on paid plans. Workloads requiring large in-memory state or extended computation still belong on origin servers.

40. Async I/O

Async I/O is a concurrency model where a single thread handles many simultaneous connections by not blocking while waiting for I/O operations to complete.

In a traditional synchronous threading model, each incoming request gets its own thread. While that thread waits for a database query to return, a file to be read or an external API to respond, it is blocked and unable to do anything else. At high concurrency, you need as many threads as simultaneous in-progress requests. Threads consume memory and context switching between many of them has overhead.

Async I/O inverts this. Instead of blocking on I/O, the thread registers a callback and returns to handling other requests. When the I/O completes, the callback fires and processing resumes. A single thread can manage thousands of simultaneous in-progress operations this way.

Node.js is built on this model. Its single-threaded event loop handles I/O-bound workloads with very high concurrency using far fewer resources than a thread-per-request model. The Node.js official documentation notes that for a request spending 45 of its 50 milliseconds waiting on a database call, choosing non-blocking I/O frees that 45 milliseconds to serve other requests entirely.

Go goroutines take a different approach. Each goroutine starts at approximately 2 KB of memory and the Go runtime schedules thousands of goroutines across a smaller pool of OS threads. This gives Go both high concurrency and true parallelism across CPU cores.

Python’s asyncio, Java NIO and Kotlin coroutines implement the same principle in their respective language stacks.
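
A small asyncio example makes the gain concrete: ten simulated 100-millisecond waits complete in roughly the time of one, because the event loop interleaves them on a single thread. asyncio.sleep stands in for a database or API call:

```python
import asyncio
import time

async def fetch(i):
    await asyncio.sleep(0.1)        # non-blocking stand-in for real I/O
    return i

async def main():
    # Launch all ten waits concurrently and collect their results.
    return await asyncio.gather(*(fetch(i) for i in range(10)))

start = time.monotonic()
results = asyncio.run(main())
elapsed = time.monotonic() - start  # close to 0.1 s, not 1.0 s
```

Running the same ten waits sequentially would take ten times as long; the thread spends that saved time serving other work.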

The limitation is CPU-bound work. Async I/O excels at I/O-bound concurrency: waiting on databases, files and network calls. For CPU-intensive computation, worker threads, process pools or a language with true multi-threading like Go or Java is the right tool.

What you give up: programming model complexity. Async code with callbacks, promises and async/await patterns can be harder to reason about than sequential synchronous code.

The 40, done

Looking back across all four parts, the pattern is consistent.

The most expensive scaling mistakes are not technical failures. They are applying the right technique at the wrong time. Sharding a database that a read replica would have fixed. Adopting microservices before a monolith is strained. Implementing a service mesh before there are enough services to justify the overhead. Choosing async I/O for a CPU-bound workload.

The engineers who build systems that scale are not the ones with the most complex architecture. They are the ones who understand every technique on this list well enough to know which one their current problem actually needs and who have the discipline to implement the simplest one that works.

The list of 40 is a vocabulary, not a checklist. You do not need all of them. You need the right ones at the right time.



Originally published on Medium.
