Containers, orchestration, service mesh, CDN, monitoring, distributed tracing, failover, graceful degradation, consistent hashing and bulkhead for engineers who have seen a system die in production

Read the full series
Part 1: The Foundations — ways 1 to 10
Part 2: Stop the System From Eating Itself — ways 11 to 20
Part 3: The Infrastructure Layer — ways 21 to 30 (you are here)
Part 4: The Techniques Most Lists Skip — ways 31 to 40
There is a pattern in how systems fail. It is rarely one dramatic event. It is usually a chain: a downstream service slows down, threads pile up waiting for it, the thread pool exhausts, requests to unrelated parts of the system start timing out, alerts fire for things that have nothing to do with the original problem, and by the time the root cause is identified, the incident has spread far past where it started.
Part 3 is about the infrastructure layer that sits between your application code and that kind of cascading mess. It will not prevent every failure. Nothing does. But the techniques here reduce blast radius, speed up diagnosis and ensure that when something breaks, users feel as little of it as possible.
21. Containers
A container packages an application together with its dependencies, configuration and runtime into a single portable unit. That container runs identically in a developer’s local environment, a CI pipeline and a production server.
Docker made containers accessible enough to become the default unit of deployment for most backend services. The underlying technology, Linux namespaces and cgroups, had existed for years. Docker made the developer experience practical.
Containers start in seconds compared to minutes for virtual machines, consume far less memory per instance and can be packed densely onto physical servers. A host running 200 containers is significantly more resource-efficient than 200 VMs on the same hardware.
Containers are immutable by design. You do not SSH into a running container to change a configuration file. You rebuild the image, push it to a registry and redeploy. This forces reproducibility and makes rollbacks straightforward.
Container images also isolate dependencies. Two services requiring conflicting library versions run in separate containers without conflict, which was a real source of production pain in the shared-library era.
What you give up: containers are not a security boundary. A misconfigured container with elevated privileges can access the host system. Image scanning, non-root process execution and seccomp profiles are real requirements, not optional hardening.
22. Orchestration
Container orchestration automates the deployment, scaling and management of containerised applications across a cluster of machines.
Kubernetes is the dominant orchestration platform. You describe your desired state: five replicas of this container, each with 512 MB of memory, exposed on port 8080. Kubernetes works continuously to make the actual state match that description. If a container crashes, Kubernetes restarts it. If a node fails, it reschedules the containers on healthy nodes.
Beyond keeping containers running, Kubernetes handles service discovery so containers find each other by name without hardcoded IP addresses, rolling deployments that replace old containers gradually while watching health checks, autoscaling based on CPU or memory metrics and secret management.
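The desired-state model is easier to see in code than in prose. This is a toy sketch of one pass of a reconciliation loop, with hypothetical names; the real Kubernetes controller loop is vastly more involved, but the shape is the same: compare actual state to desired state and correct the difference.

```python
def reconcile(desired_replicas: int, running: list[str]) -> list[str]:
    """One pass of a desired-state control loop (greatly simplified sketch).

    'crashed' stands in for any container that has failed its health check.
    """
    # Drop failed containers from the actual state.
    running = [c for c in running if c != "crashed"]
    # Schedule replacements until actual matches desired.
    while len(running) < desired_replicas:
        running.append("web")
    # Scale down if there are more replicas than desired.
    return running[:desired_replicas]

# One container has crashed; the loop restores three healthy replicas.
print(reconcile(3, ["web", "crashed", "web"]))
```

The point of the model is that you never issue imperative commands like "start two more containers"; you change the desired count and the loop converges on it, whether the gap came from a crash, a node failure or a scale-up.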
Amazon ECS and Google Cloud Run are managed container platforms that provide orchestration without the overhead of running Kubernetes yourself. For teams without dedicated platform engineers, managed services often make more sense than self-hosting a Kubernetes cluster.
The learning curve for Kubernetes is steep. YAML configuration, cluster networking, RBAC, persistent volumes, ingress controllers and namespaces all require operational understanding. Many teams adopt Kubernetes before they need it and spend more time managing the cluster than shipping product.
What you give up: simplicity. A well-deployed monolith on a VM with good monitoring can outperform a poorly managed Kubernetes cluster. Orchestration is a tool, not a status symbol.
23. Service Mesh
A service mesh handles the network communication between services in a microservices architecture, taking concerns like mutual TLS encryption, retries, circuit breaking, load balancing and observability out of application code and into the infrastructure layer.
Without a service mesh, every service implements its own retry logic, timeout configuration and tracing instrumentation. Those implementations diverge over time. Some services handle failures gracefully. Others do not. Visibility into traffic between services is inconsistent across languages and teams.
A service mesh injects a sidecar proxy alongside each service container. All inbound and outbound traffic passes through this proxy. The control plane configures all proxies centrally. Istio, Linkerd and Consul Connect are the main options.
The benefits are substantial: uniform traffic policies across all services regardless of what language they are written in, automatic mutual TLS for all inter-service communication, granular traffic metrics without any application code changes and fine-grained routing control for canary deployments and A/B testing.
The cost is complexity and added latency. Every request passes through two additional proxy hops. For most applications the added latency is under a millisecond, but it is real overhead at high request volumes.
What you give up: simplicity. If your architecture has five services, a service mesh is overhead you do not need. If you are running 50 services and struggling with inconsistent network policy, it starts to pay for itself.
24. CDN
A Content Delivery Network is a globally distributed network of edge servers that cache and serve static content close to users.
Instead of every user’s browser fetching your JavaScript bundle, CSS files and images from your origin server in a single data centre, they fetch those assets from a CDN edge node that may be physically nearby. Cloudflare operates edge nodes in over 300 cities. Akamai’s network spans over 4,300 points of presence globally as of February 2025.
Modern CDNs do more than serve files. They absorb DDoS traffic at the edge before it reaches your origin. They terminate TLS connections at edge nodes, reducing processing overhead on your servers. They handle HTTP/2 and HTTP/3 termination, compression and image optimisation at scale.
Cache invalidation in CDNs is a frequent source of bugs during deployments. Shipping a new version of your JavaScript bundle while old cached versions still serve some users creates inconsistency. Content-addressed URLs, where the filename includes a hash of the file content, ensure new deployments always result in new cache keys.
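Content addressing is a few lines of code at build time. A minimal sketch, assuming a SHA-256 digest truncated to eight characters (build tools differ in hash algorithm and length):

```python
import hashlib
from pathlib import Path

def content_addressed_name(path: Path) -> str:
    """Return a filename that embeds a short hash of the file's content.

    Any change to the file produces a new name, so CDN edge caches see
    a brand-new object and can never serve a stale copy of it.
    """
    digest = hashlib.sha256(path.read_bytes()).hexdigest()[:8]
    return f"{path.stem}.{digest}{path.suffix}"

# Example: write a bundle, derive its cache-busting name.
bundle = Path("app.js")
bundle.write_bytes(b"console.log('v2');")
print(content_addressed_name(bundle))  # e.g. app.1a2b3c4d.js
```

Because the old and new bundles have different URLs, both can be cached with long TTLs simultaneously; deployment becomes a matter of updating the HTML that references them, not purging edge caches.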
What you give up: money, billed per gigabyte served. For high-traffic applications this cost is typically much lower than the savings in origin server bandwidth and capacity.
25. Monitoring
Monitoring is the continuous measurement of a system’s health and behaviour through metrics, logs and alerts.
You cannot debug what you cannot observe. You cannot improve what you do not measure. The number of production systems running without meaningful monitoring is larger than anyone in the industry wants to admit.
Prometheus is the most widely adopted metrics collection system for cloud-native environments according to the CNCF Annual Survey. It scrapes numeric measurements from services at regular intervals and stores them in a time-series database. Grafana provides dashboards on top of that data. For logs, the OpenSearch stack is commonly used. Cloud-native options include AWS CloudWatch, Google Cloud Monitoring and Datadog.
The four golden signals from Google’s Site Reliability Engineering book provide a minimal but effective starting point for what to monitor: latency (how long requests take), traffic (how many requests arrive), errors (how many requests fail) and saturation (how close the system is to its resource limits).
What you give up: monitoring takes time to build and money to run. Collecting every possible metric generates noise that makes it harder to find real signals. Good monitoring requires deliberate choices about what matters and which alerts are actually actionable.
26. Distributed Tracing
Distributed tracing records the path of a request as it flows through multiple services, producing a timeline showing which service handled which part of the request and how long each step took.
In a system where one user request triggers calls to five different microservices, debugging a latency spike requires understanding which service in the chain was slow. Logs from individual services tell you what each service did in isolation. A trace tells you the full story across all of them in sequence.
OpenTelemetry is the current standard for instrumenting applications. It provides vendor-neutral SDKs for most languages that emit trace data compatible with backends including Jaeger, Zipkin, AWS X-Ray, Datadog and Honeycomb.
A trace is made up of spans. Each span represents one unit of work: a service call, a database query or an external API request. Spans are linked by a trace ID propagated through all downstream calls, usually via HTTP headers.
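The propagation mechanism can be sketched in a few lines. The header name here is a placeholder; the W3C Trace Context standard that OpenTelemetry follows uses a `traceparent` header with a richer format:

```python
import time
import uuid
from dataclasses import dataclass

TRACE_HEADER = "X-Trace-Id"  # hypothetical; W3C Trace Context uses `traceparent`

@dataclass
class Span:
    """One unit of work: a service call, a DB query, an external request."""
    trace_id: str
    name: str
    start: float

def start_span(name: str, headers: dict) -> Span:
    """Reuse the incoming trace ID if present, otherwise start a new trace,
    and write the ID back into the headers so downstream calls inherit it."""
    trace_id = headers.get(TRACE_HEADER) or uuid.uuid4().hex
    headers[TRACE_HEADER] = trace_id
    return Span(trace_id=trace_id, name=name, start=time.monotonic())

headers = {}                                 # first service in the chain
parent = start_span("checkout", headers)
child = start_span("db.query", headers)      # inherits the same trace ID
assert parent.trace_id == child.trace_id
```

Because every span carries the same trace ID, a tracing backend can stitch spans emitted by five different services into one timeline, which is the whole point.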
Sampling is necessary in high-volume systems. Tracing every request at scale produces enormous data volumes and measurable overhead. Head-based sampling decides whether to trace at the start of a request. Tail-based sampling decides after the request completes, allowing you to bias toward tracing slow or failed requests.
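The difference between the two strategies is just where the decision sits relative to the request. A sketch, with an arbitrary slow-request threshold:

```python
import random

def head_sample(rate: float = 0.01) -> bool:
    """Decide at request start: trace a fixed fraction of all requests.
    Cheap, but blind to whether the request turns out interesting."""
    return random.random() < rate

def tail_sample(duration_ms: float, error: bool,
                slow_threshold_ms: float = 500) -> bool:
    """Decide after the request completes: keep slow or failed traces.
    Requires buffering every trace until the outcome is known."""
    return error or duration_ms > slow_threshold_ms

print(tail_sample(800, error=False))  # True: slow request, trace kept
print(tail_sample(40, error=False))   # False: fast and healthy, dropped
```

Tail-based sampling gives you the traces you actually want to read, at the cost of buffering all spans until the request finishes; head-based sampling is far cheaper but mostly captures unremarkable requests.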
What you give up: instrumentation overhead and storage costs. Retrofitting tracing into an existing system with many services is a significant project.
27. Failover
Failover is the process of automatically switching to a standby system when the primary fails.
Database failover is the most common scenario. In a primary-replica setup, the primary handles all write traffic. When the primary becomes unavailable due to hardware failure, network partition or software crash, a replica is promoted to primary and traffic is redirected. A well-configured setup can complete this transition in under 30 seconds.
Automatic failover requires health checking where the primary is continuously tested for availability, leader election where the system decides which replica becomes the new primary and DNS or connection string updates so clients know where to connect.
Tools like Patroni and repmgr manage PostgreSQL failover. Group Replication handles the same for MySQL. Cloud-managed databases like AWS RDS, Google Cloud SQL and Azure Database include automatic failover as a standard feature.
Failover is not instantaneous. There is always a brief window of unavailability during the transition. Applications must handle this with connection retries and appropriate messaging to users.
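Riding out that window is mostly retry logic on the client side. A sketch with exponential backoff; `call_with_retries` is a hypothetical helper, and many database drivers and connection pools provide the equivalent built in:

```python
import time

def call_with_retries(operation, attempts=5, base_delay_s=0.2):
    """Retry a flaky operation with exponential backoff, to ride out the
    brief unavailability window while a replica is being promoted."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # failover took longer than our retry budget
            time.sleep(base_delay_s * 2 ** attempt)

# Simulated primary that is unreachable for the first two attempts.
state = {"calls": 0}
def query():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("primary unavailable, failover in progress")
    return "row"

print(call_with_retries(query, base_delay_s=0.01))  # "row" on the third try
```

The retry budget should roughly match your expected failover window: retrying for 2 seconds against a failover that takes 30 turns a brief blip into user-facing errors anyway.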
Testing failover is as important as building it. A failover configuration that has never been exercised in practice is likely to have unexpected failure modes. Regular exercises where failover is intentionally triggered in a production-like environment validate that the process actually works under real conditions.
What you give up: a brief period of unavailability during each failover event. The objective is to minimise that window, not eliminate it entirely.
28. Graceful Degradation
Graceful degradation is the ability of a system to continue operating in a reduced but functional state when one or more components fail.
The alternative is catastrophic failure: a dependency goes down and the entire application returns errors. Graceful degradation means the application identifies what it cannot do and continues doing everything it can.
An e-commerce site whose recommendation service is down should still show the product you searched for, just without personalised recommendations. A social platform whose notification service is unavailable should still let you post, just without delivering real-time notifications. Showing cached or default results is better than a full-page error.
Feature flags and circuit breakers are the main implementation tools. Feature flags let you disable specific features at runtime without a code deployment. Circuit breakers detect when a dependency is failing and return fallback responses automatically.
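The shape of the degraded path is simple once the product decision is made. A sketch of the e-commerce example above, with the flag store reduced to a dict (real systems use a flag service or config store) and the outage simulated:

```python
# Feature flags toggled at runtime; a dict stands in for a real flag service.
FLAGS = {"recommendations": True}

def fetch_recommendations(user_id: str) -> list[str]:
    # Simulated outage of the recommendation service.
    raise TimeoutError("recommendation service unavailable")

def product_page(user_id: str) -> dict:
    """Serve the core page; degrade to an empty recommendation strip
    rather than failing the whole request."""
    recs: list[str] = []
    if FLAGS["recommendations"]:
        try:
            recs = fetch_recommendations(user_id)
        except TimeoutError:
            recs = []  # degraded: the page still renders
    return {"product": "widget", "recommendations": recs}

print(product_page("u1"))  # {'product': 'widget', 'recommendations': []}
```

Note that the fallback is decided per feature, not per service: the product lookup has no fallback because it is core functionality, while recommendations fall back to nothing because they are an enhancement.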
The engineering work is defining what the degraded state looks like for each feature, what data is needed to serve it and what fallback is acceptable. This requires explicit product decisions: what is core functionality, what is enhancement and what can disappear without users noticing.
What you give up: engineering time and product complexity. Every feature with a graceful degradation path requires additional logic and testing. Prioritise the paths that protect core user journeys.
29. Consistent Hashing
Consistent hashing distributes keys across a set of nodes in a way that minimises disruption when nodes are added or removed.
The problem it solves starts with naive modulo hashing. With 10 nodes, you assign each key to a node using key mod 10. This gives even distribution. But add an 11th node and the formula becomes key mod 11, which remaps almost every key to a different node. In a distributed cache, this means a near-total cache miss.
Consistent hashing places both keys and nodes on a circular ring representing the full hash space. Each key is assigned to the first node clockwise from its position on the ring. When a node is added, only the keys between the new node and its predecessor in the ring need to move. When a node is removed, only the keys assigned to that node move to the next node clockwise. On average, only 1/n of the keys are remapped when a node is added to or removed from a ring of n nodes.
Virtual nodes address uneven distribution that occurs when physical nodes cluster on the ring. Each physical node is represented by multiple virtual nodes at different positions, producing more balanced distribution across the actual machines.
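The ring, including virtual nodes, fits in a short sketch. MD5 is used here only as a stable, well-distributed hash; real implementations vary in hash function and virtual node count:

```python
import bisect
import hashlib

class HashRing:
    """Consistent hash ring with virtual nodes (illustrative sketch)."""

    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes
        self.ring = []  # sorted list of (position, node) pairs
        for node in nodes:
            self.add(node)

    def _pos(self, key: str) -> int:
        # MD5 as a stable hash, not for security.
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, node: str) -> None:
        # Place each physical node at many positions on the ring.
        for i in range(self.vnodes):
            bisect.insort(self.ring, (self._pos(f"{node}#{i}"), node))

    def node_for(self, key: str) -> str:
        # First node clockwise from the key's position (wrapping around).
        i = bisect.bisect(self.ring, (self._pos(key), ""))
        return self.ring[i % len(self.ring)][1]

ring = HashRing(["cache-a", "cache-b", "cache-c"])
before = {k: ring.node_for(k) for k in (f"user:{i}" for i in range(1000))}
ring.add("cache-d")
moved = sum(1 for k, v in before.items() if ring.node_for(k) != v)
print(f"{moved} of 1000 keys moved")  # roughly a quarter, not all 1000
```

Running this shows the property that matters: adding a fourth node moves roughly a quarter of the keys, where modulo hashing would have moved nearly all of them.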
Consistent hashing is used in distributed caches, Amazon DynamoDB and Apache Cassandra. It is one of the foundational algorithms in modern distributed systems.
What you give up: slightly more implementation complexity than modulo hashing. Distribution is not perfectly uniform even with virtual nodes.
30. Bulkhead
The bulkhead pattern is taken from shipbuilding, where a hull is divided into watertight compartments. If one compartment floods, the others stay intact and the ship stays afloat.
In software, a bulkhead isolates failures in one component from spreading to others through resource isolation: separate thread pools, connection pools or process groups for different parts of the system.
Without bulkheads, a slow downstream API can exhaust your entire thread pool. Every thread blocks waiting for the slow response. New requests for any part of the application start queuing. Eventually the whole service appears down because of one slow dependency that most of its traffic never touches.
With bulkheads, you allocate a fixed pool of threads for calls to the slow downstream API. When that pool exhausts, requests to that API fail fast. The rest of the application’s thread pool remains available. Other features keep working.
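The mechanism is just a bounded pool with a non-blocking acquire. A sketch using a semaphore to cap concurrency for one dependency (Resilience4j's semaphore bulkhead works on the same principle):

```python
import threading

class Bulkhead:
    """Cap concurrent calls to one dependency so it cannot drain the whole
    thread pool; callers over the limit fail fast instead of queueing."""

    def __init__(self, max_concurrent: int):
        self._slots = threading.Semaphore(max_concurrent)

    def call(self, fn, *args):
        # Non-blocking acquire: if all slots are taken, reject immediately.
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("bulkhead full: failing fast")
        try:
            return fn(*args)
        finally:
            self._slots.release()

slow_api = Bulkhead(max_concurrent=2)
print(slow_api.call(lambda x: x * 2, 21))  # 42
```

The fail-fast rejection is the feature, not a bug: a request that fails in a millisecond can be retried or degraded gracefully, while a request queued behind a dead dependency holds a thread hostage.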
Resilience4j implements bulkheads in Java and Kotlin. The pattern also applies at the infrastructure level through separate Kubernetes namespaces with resource quotas, separate database connection pools for different workload types and separate queues for different priority levels.
What you give up: resource efficiency. Reserved thread pools sized for peak load will often run well below capacity during normal traffic. The cost of isolation is accepting that some capacity sits idle as a buffer against failure.
Infrastructure as a scaling strategy
The techniques in part 3 are less about adding capacity and more about making sure the capacity you have is used reliably. Containers and orchestration make deployments consistent and recoverable. Service meshes enforce uniform network policies. CDN reduces load on origin infrastructure. Monitoring and tracing make problems visible. Failover, graceful degradation and bulkhead ensure that when something breaks, the damage stays contained.
These are not one-time implementations. They require ongoing operational discipline. A monitoring setup no one reviews produces alerts no one believes. A failover configuration never tested in production is a liability.
Part 4 covers ways 31 through 40: prefetching, lazy loading, capacity planning, read replicas, write batching and five techniques most scaling lists skip entirely.
Read the full series
- Part 1: The Foundations — ways 1 to 10
- Part 2: Stop the System From Eating Itself — ways 11 to 20
- Part 3: The Infrastructure Layer — ways 21 to 30 (you are here)
- Part 4: The Techniques Most Lists Skip — ways 31 to 40