
40 Ways to Scale a System, Part 1: The Foundations

April 1, 2026

10 core techniques that every engineer needs before anything else in system design

Read the full series
Part 1: The Foundations — ways 1 to 10 (you are here)
Part 2: Stop the System From Eating Itself — ways 11 to 20
Part 3: The Infrastructure Layer — ways 21 to 30
Part 4: The Techniques Most Lists Skip — ways 31 to 40

There are exactly 40 distinct ways to scale a system. Not 35. Not 50. Some lists out there pad the count with goals and frameworks dressed up as techniques. High Availability is not a technique, it is an outcome. CAP Tradeoff is a theoretical framework, not something you deploy. Diagonal Scale is just doing horizontal and vertical scaling at the same time, which is not a technique of its own.

This series cuts those out and replaces them with five things most lists miss entirely: connection pooling, CQRS, serverless, edge computing and async I/O. Every entry here is something you can actually implement.

Part 1 covers ways 1 through 10. The concepts that everything else builds on.

1. Horizontal Scaling

Horizontal scaling means adding more servers to handle more load. Instead of making one machine bigger, you add more machines and distribute the work across all of them. This is also called scaling out.

If traffic doubles, you add more nodes. If one server fails, the others continue running. The system has no single point of failure at the compute layer.

The catch is that your application needs to be designed for it. If your app stores session data locally on the server, the next request from the same user might land on a different server that knows nothing about that session. Stateless design, shared session stores or sticky sessions all address this, but they require deliberate decisions up front.
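The shared-session-store fix can be sketched in a few lines. This is a toy illustration with invented names, not production code: the in-process dict stands in for Redis or another external store that every app server talks to.

```python
import uuid

# A shared session store. In production this would be Redis or a database,
# not an in-process dict: the point is that every app server reads and
# writes the SAME store, so no server holds state the others lack.
class SharedSessionStore:
    def __init__(self):
        self._sessions = {}

    def create(self, user_data):
        session_id = str(uuid.uuid4())
        self._sessions[session_id] = user_data
        return session_id

    def get(self, session_id):
        return self._sessions.get(session_id)

class AppServer:
    def __init__(self, session_store):
        self.sessions = session_store

    def login(self, username):
        return self.sessions.create({"user": username})

    def whoami(self, session_id):
        data = self.sessions.get(session_id)
        return data["user"] if data else None

# Two "servers" sharing one store: a login handled by server_a produces a
# session that server_b can read, so the load balancer is free to route
# each request anywhere.
store = SharedSessionStore()
server_a = AppServer(store)
server_b = AppServer(store)

sid = server_a.login("alice")   # request 1 lands on server A
print(server_b.whoami(sid))     # request 2 lands on server B → alice
```

With locally stored sessions, the second call would return nothing; the shared store is what makes the servers interchangeable.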

Horizontal scaling is the default assumption in cloud-native design. It is why containerised workloads and microservices became so dominant, because they are built to run as many identical copies as needed.

What you give up: operational complexity. Load balancers, service discovery, consistent deployments across nodes and data synchronisation all become your responsibility.

2. Vertical Scaling

Vertical scaling means making the existing server more powerful. More CPU, more RAM, faster storage. This is also called scaling up.

It is underrated in engineering discussions because it sounds boring. But it is often the fastest path to solving a real bottleneck. You skip distributed state management, load balancer configuration and inter-node consistency entirely because everything still runs on one machine.

The hard ceiling is real. There is a physical limit to how powerful a single machine can be. Cloud providers offer increasingly large instance types, but at some point the cost of one massive instance exceeds the cost of several smaller ones running horizontally.

Vertical scaling is most useful in the early stages of a product and for stateful components like a primary database where distribution is expensive. A PostgreSQL primary often benefits more from a larger instance than from an attempt to shard it prematurely.

What you give up: resilience. A single server is a single point of failure.

3. Caching

Caching stores the result of an expensive operation so future requests can return that result without repeating the work.

The most common targets are database query results, API responses and rendered page fragments. Instead of hitting the database every time a product page loads, you compute the result once, store it in memory and serve it from there for subsequent requests.

Redis and Memcached are the standard in-memory cache stores. RAM access is orders of magnitude faster than a disk read or a network round trip to a database, which is why in-memory caching produces such large latency reductions.

Cache hit rate is the metric that matters. A high hit rate means most requests are served from cache without touching the database. A low hit rate means you are paying the overhead of a cache layer without getting much benefit from it.

Cache invalidation is where most engineers make mistakes. Stale data sitting in cache causes bugs that are hard to reproduce. The three main approaches are TTL-based expiration; event-driven invalidation, where the cache is cleared when the underlying data changes; and write-through caching, where the cache is updated at the same time as the database.
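The most common combination, cache-aside reads with TTL-based expiration, fits in a short sketch. All names here are illustrative, and a plain dict stands in for Redis or Memcached.

```python
import time

# A minimal in-memory cache with TTL-based expiration. Each entry stores
# the value alongside its expiry time; stale entries are dropped lazily
# on read.
class TTLCache:
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() > expires_at:
            del self._store[key]  # expired: treat as a miss
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, time.monotonic() + self.ttl)

db_reads = 0

def load_product_from_db(product_id):
    # Stand-in for an expensive database query.
    global db_reads
    db_reads += 1
    return {"id": product_id, "name": f"product-{product_id}"}

cache = TTLCache(ttl_seconds=60)

def get_product(product_id):
    # Cache-aside: try the cache first, fall back to the database on a
    # miss, then populate the cache for subsequent requests.
    cached = cache.get(product_id)
    if cached is not None:
        return cached
    product = load_product_from_db(product_id)
    cache.set(product_id, product)
    return product

get_product(42)   # miss: hits the database
get_product(42)   # hit: served from cache
print(db_reads)   # 1
```

Two requests, one database read: that ratio is exactly what the cache hit rate metric measures at scale.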

What you give up: consistency. Cached data is a snapshot from a past moment. How much that matters depends entirely on your application’s tolerance for serving slightly stale data.

4. Load Balancing

A load balancer sits in front of your servers and distributes incoming requests across them. Without one, horizontal scaling does not work because all traffic would still land on a single server.

Common distribution strategies include round-robin, which cycles through servers in order; least-connections, which routes each request to whichever server currently has the fewest active connections; and IP hash, which keeps the same client IP routed to the same server for loose session affinity.
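The first two strategies are simple enough to sketch directly. The classes below are illustrative toys, not any real load balancer's API.

```python
import itertools

# Round-robin: cycle through the servers in a fixed order.
class RoundRobin:
    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self):
        return next(self._cycle)

# Least-connections: route each request to the server with the fewest
# active connections, and track when connections close.
class LeastConnections:
    def __init__(self, servers):
        self.active = {s: 0 for s in servers}

    def pick(self):
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        self.active[server] -= 1

rr = RoundRobin(["s1", "s2", "s3"])
print([rr.pick() for _ in range(4)])  # ['s1', 's2', 's3', 's1']

lc = LeastConnections(["s1", "s2"])
a = lc.pick()        # s1
b = lc.pick()        # s2
lc.release(a)        # s1's request finishes first
print(lc.pick())     # s1, since it now has the fewest active connections
```

Round-robin assumes requests cost roughly the same; least-connections adapts when some requests are much slower than others.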

Modern load balancers do more than distribute traffic. They run health checks and remove unhealthy servers from the pool automatically. They handle TLS termination, offloading that CPU cost from application servers. They can rate limit requests and route based on URL path, HTTP headers or cookies.

Layer 4 load balancers operate at the TCP level, making decisions based on IP addresses and ports. Layer 7 load balancers inspect request content, which allows smarter routing at the cost of slightly more processing overhead.

Software options like NGINX and HAProxy, along with cloud-managed offerings like AWS Application Load Balancer and Google Cloud Load Balancing, handle the majority of web traffic today.

What you give up: the load balancer itself becomes a component that requires redundancy. Active-passive or active-active load balancer pairs are standard for production setups.

5. Sharding

Sharding splits your database horizontally across multiple servers. Each shard holds a different subset of the total data. The shard key determines which server a given record lives on.

The problem sharding solves is the write bottleneck that replication alone cannot address. With replication, all writes still go to one primary server. With sharding, writes are distributed across multiple independent database servers, each acting as the primary for its slice of the data.

Common shard keys include user ID, geographic region or a hash of some attribute. Choosing the wrong shard key causes hotspots where some shards receive far more traffic than others and the load distribution you wanted is not actually achieved.
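Hash-based routing on a user ID is the easiest variant to show. A minimal sketch, using `hashlib` because Python's built-in `hash()` is randomised per process and would route differently on every restart:

```python
import hashlib
from collections import Counter

NUM_SHARDS = 4

def shard_for(user_id):
    # A stable hash of the shard key, modulo the shard count, names the
    # server that owns this record.
    digest = hashlib.sha256(str(user_id).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# The same key always routes to the same shard...
assert shard_for(12345) == shard_for(12345)

# ...and a spread of keys lands roughly evenly across all shards.
counts = Counter(shard_for(uid) for uid in range(10_000))
print(sorted(counts.keys()))  # [0, 1, 2, 3]
```

Plain modulo routing has a known weakness: adding a fifth shard remaps most existing keys. Consistent hashing is the usual answer when the shard count needs to change over time.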

Cross-shard queries are the painful part. Joining data across shards requires application-level logic or a middleware layer and it is slow. Schema changes across all shards simultaneously are operationally complex and error-prone.

MongoDB, Cassandra and CockroachDB have sharding built in. Sharding a PostgreSQL deployment requires additional tooling. Citus is the standard choice for that.

What you give up: query flexibility. Operations that were trivial on a single database, such as global aggregations and cross-entity joins, become significantly harder once data is split across machines.

6. Replication

Replication creates copies of your database on multiple servers. The primary node handles writes. One or more replicas receive a copy of every write and can serve read queries.

This improves read throughput because read traffic is distributed across replicas rather than hitting a single server. It also improves fault tolerance because if the primary fails, a replica can be promoted to primary.

Synchronous replication waits for a replica to confirm it received the write before acknowledging success to the client. This guarantees no data loss but increases write latency. Asynchronous replication acknowledges the write immediately and syncs to replicas in the background. This reduces latency but accepts a small window where a failure could result in data loss.

Replication lag is the delay between a write landing on the primary and becoming visible on replicas. In write-heavy systems this lag can be meaningful. Reads from replicas may return data that is slightly behind what was just written.
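Replication lag is easiest to see in a toy model. The explicit `replicate()` call below stands in for the background replication stream a real database runs continuously; everything here is an invented illustration.

```python
# A toy model of asynchronous replication: writes land on the primary
# immediately and reach the replica only when replicate() runs.
class Primary:
    def __init__(self):
        self.data = {}
        self.log = []  # pending writes not yet shipped to replicas

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))

class Replica:
    def __init__(self):
        self.data = {}

def replicate(primary, replica):
    # Ship pending writes to the replica. In a real database this is a
    # continuous background stream, not an explicit call.
    for key, value in primary.log:
        replica.data[key] = value
    primary.log.clear()

primary, replica = Primary(), Replica()
primary.write("balance:alice", 100)

# Before the stream catches up, a read from the replica is stale.
print(replica.data.get("balance:alice"))  # None — replication lag

replicate(primary, replica)
print(replica.data.get("balance:alice"))  # 100
```

The stale read in the middle is exactly the window the paragraph above describes: a client that writes to the primary and immediately reads from a replica may not see its own write.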

PostgreSQL streaming replication and MySQL replication are the standard implementations; Redis pairs its built-in replication with Sentinel for monitoring and automatic failover.

What you give up: replication copies the same data to every replica. It does not help when your dataset is too large to fit on a single server. That problem is what sharding solves.

7. Partitioning

Partitioning divides a large table into smaller, more manageable pieces within the same database instance. Unlike sharding, all partitions remain inside one database. The goal is query performance and easier data management, not cross-server distribution.

There are two main types. Horizontal partitioning splits rows across partitions based on a value, for example all orders from January go to one partition and February orders to another. Vertical partitioning splits columns, moving rarely accessed columns to a separate table so queries on frequently used columns load faster.

Range partitioning works well for time-series data. A logs table partitioned by month means a query for last week only scans the current month’s partition instead of years of history. Dropping a partition to delete old data is also much faster than running a DELETE against millions of rows.
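The mechanism behind that speed-up is partition pruning: the planner skips partitions that cannot contain matching rows. A toy monthly-bucket version in Python (illustrative only, not how a real database stores partitions):

```python
from datetime import date

# A toy range-partitioned "logs table": one bucket per month.
partitions = {}  # (year, month) -> list of (timestamp, message) rows

def insert(ts, message):
    partitions.setdefault((ts.year, ts.month), []).append((ts, message))

def query(start, end):
    # Partition pruning: only scan the monthly buckets that could
    # intersect [start, end]; skip the rest without reading them.
    rows, scanned = [], 0
    for (year, month), bucket in partitions.items():
        if (year, month) < (start.year, start.month):
            continue  # entirely before the range: pruned
        if date(year, month, 1) > end:
            continue  # entirely after the range: pruned
        scanned += 1
        rows.extend(r for r in bucket if start <= r[0] <= end)
    return rows, scanned

insert(date(2025, 1, 5), "old log")
insert(date(2025, 12, 20), "recent log")

rows, scanned = query(date(2025, 12, 1), date(2025, 12, 31))
print(len(rows), scanned)  # 1 row found, only 1 partition scanned
```

The January bucket is never touched, which is the whole benefit: query cost tracks the size of the relevant partitions, not the size of the table.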

Hash partitioning distributes rows by applying a hash function to the partition key, producing more even distribution than range partitioning when there is no natural time-based structure.

PostgreSQL, MySQL and Oracle all support native declarative table partitioning.

What you give up: the partition key must align with your most common query patterns. Queries that span multiple partitions still scan each relevant partition and a poorly chosen key removes most of the performance benefit.

8. Autoscaling

Autoscaling adjusts the number of running instances dynamically based on current demand. When traffic rises, new instances are launched automatically. When traffic drops, extra instances are terminated to reduce cost.

This is horizontal scaling with the manual step removed. AWS Auto Scaling Groups, Google Cloud Managed Instance Groups, Azure Virtual Machine Scale Sets and Kubernetes Horizontal Pod Autoscaler all implement this. Scaling decisions are triggered by metrics: CPU utilisation above a threshold, requests per second crossing a limit, memory pressure or custom application metrics.
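The core of a metric-driven scaling decision is a proportional formula; the one below matches the algorithm Kubernetes documents for its Horizontal Pod Autoscaler, shown here as a standalone function for illustration.

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric):
    # Scale the replica count by how far the observed metric is from its
    # target: the HPA's documented proportional formula,
    # ceil(current * observed / target).
    return math.ceil(current_replicas * current_metric / target_metric)

# 4 pods averaging 90% CPU against a 60% target → scale out to 6.
print(desired_replicas(4, 90, 60))  # 6

# 4 pods averaging 30% CPU against a 60% target → scale in to 2.
print(desired_replicas(4, 30, 60))  # 2
```

Real autoscalers wrap this in stabilisation windows and min/max bounds so a noisy metric does not cause constant flapping between scale-out and scale-in.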

Stateless applications autoscale cleanly because any new instance can immediately handle requests without needing to synchronise any local state. Stateful services like databases and message brokers autoscale with significantly more difficulty and require specific tooling.

The cold start problem is real. New instances take time to boot, pull dependencies and warm caches. A traffic spike that lasts 30 seconds may resolve itself before new instances finish starting. Predictive autoscaling, which pre-emptively scales ahead of expected traffic peaks, addresses some of this.

What you give up: autoscaling is not instantaneous. Systems that receive sudden sharp spikes need either pre-warmed capacity or graceful degradation while new instances come up.

9. Microservices

A microservices architecture breaks an application into a collection of small, independently deployable services. Each service owns its own data, runs its own process and communicates with other services over a network using HTTP, gRPC or a message broker.

The scaling benefit is targeted resource allocation. If your search service receives ten times the traffic of your billing service, you scale search independently rather than scaling the entire application. Teams can deploy individual services without coordinating a full application release.

The tradeoff is significant operational complexity. Every service needs its own deployment pipeline, monitoring, logging and alerting. Network calls between services introduce latency and failure modes that do not exist in a monolith. Distributed tracing becomes necessary to follow a request across service boundaries.

Conway’s Law is relevant here. Organisations tend to produce systems that mirror their communication structures. Microservices work best when team ownership boundaries align with service boundaries. A small team adopting microservices before reaching that scale often creates more coordination overhead than the architecture removes.

What you give up: simplicity. A well-structured monolith with clear internal boundaries can scale further than most teams expect before microservices become the right call.

10. Event-Driven Architecture

In an event-driven system, components communicate by producing and consuming events rather than making direct synchronous calls to each other. A service publishes an event, for example “order placed”, to a message broker. Other services that care about that event subscribe and react independently.

Apache Kafka, RabbitMQ and AWS SQS are the standard message brokers for event-driven systems.

The scaling benefit is decoupling. The producer does not wait for consumers to process the event. If a downstream service is slow or temporarily unavailable, events queue up and get processed when that service recovers, without the producer experiencing any degradation.
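The publish/subscribe shape is easy to sketch in process. This toy bus is synchronous and has invented names; a real broker like Kafka or RabbitMQ additionally decouples producer and consumer in time by queueing events durably.

```python
from collections import defaultdict

# A minimal in-process event bus: publishers emit named events,
# subscribers react independently.
class EventBus:
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_name, handler):
        self._subscribers[event_name].append(handler)

    def publish(self, event_name, payload):
        # The producer does not know or care who is listening.
        for handler in self._subscribers[event_name]:
            handler(payload)

bus = EventBus()
handled = []

# Two independent consumers of the same event.
bus.subscribe("order_placed", lambda e: handled.append(("email", e["order_id"])))
bus.subscribe("order_placed", lambda e: handled.append(("inventory", e["order_id"])))

bus.publish("order_placed", {"order_id": 101})
print(handled)  # [('email', 101), ('inventory', 101)]
```

Adding a third consumer, say analytics, requires no change to the publisher: that independence is the decoupling the paragraph above describes.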

Event-driven design pairs naturally with async processing. Order confirmations, email notifications, inventory updates and analytics events are all good candidates. For anything requiring an immediate synchronous response, like a user waiting on a payment confirmation, direct calls may still be more appropriate.

What you give up: debugging. Tracing the flow of a request through asynchronous events is harder than following a synchronous call stack. Good observability tooling and consistent event schema discipline are not optional when running this pattern in production.

What comes next

These 10 concepts form the shared vocabulary of almost every system design conversation. Part 2 covers ways 11 through 20: queueing, stateless design, indexing, timeouts, retries, rate limiting, circuit breakers, backpressure, GeoDNS and multi-region deployments.

Read the full series

Found this helpful?

If this article saved you time or solved a problem, consider supporting — it helps keep the writing going.

Originally published on Medium.
