Your Client Asked for DR. They Actually Need HA. There Is a Difference.

Disaster Recovery and High Availability are not the same thing. One keeps you running. The other brings you back. Confusing the two costs real money.

The tender document was 47 pages long. Buried on page 31 was a single line under technical requirements: “System must support Disaster Recovery.”

No RPO. No RTO. No definition of what “disaster” meant in this context. Just two words that someone had copy-pasted from a previous tender, which had probably copy-pasted it from the one before that.

I asked the client what they meant by DR. They said the system should not go down. I asked what their acceptable downtime was if something went wrong. They said none. I asked what their acceptable data loss window was. They looked at me like I had asked them to explain the internal combustion engine.

They did not want Disaster Recovery. They wanted High Availability. Those are different problems. They have different architectures. They have very different price tags.

Nobody in that room knew the difference, and the developer who eventually built the system would be held to a requirement that nobody could actually define.

This happens in almost every SME and government project tender in Malaysia. “DR” gets written into requirements the same way “cloud-based” and “AI-powered” do: as a signal of seriousness rather than an actual technical specification. The developer nods, the client feels covered and the system gets built with neither proper HA nor proper DR.

Let me explain what these terms actually mean, what they cost and when you genuinely need each one.

What High Availability Actually Is

High Availability is about keeping your system running through failure. Not recovering from failure. Running through it.

The core idea is redundancy. You eliminate single points of failure by duplicating the components that matter. If one server dies, another takes over. If one database node fails, a replica promotes itself. If one availability zone has a network issue, traffic routes to another. The user experiences a slow request at worst. Ideally they notice nothing.

HA is measured by uptime percentage. You have probably seen these numbers:

99% uptime means roughly 3.65 days of downtime per year
99.9% means about 8.8 hours per year
99.95% means about 4.4 hours per year
99.99% means about 52 minutes per year
99.999% means about 5 minutes per year
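
If you want to check these numbers yourself or work out a target that is not on the list, the arithmetic is short. This is a minimal sketch that assumes a 365-day year; the function name is mine, not a standard one.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years are ignored


def downtime_hours_per_year(uptime_percent: float) -> float:
    """Return the yearly downtime allowance, in hours, for an uptime percentage."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return (1 - uptime_percent / 100) * HOURS_PER_YEAR


# Print the allowance for the common "nines" in hours and in minutes.
for target in (99.0, 99.9, 99.95, 99.99, 99.999):
    hours = downtime_hours_per_year(target)
    print(f"{target}% uptime -> {hours:.2f} hours ({hours * 60:.1f} minutes) per year")
```

Running it against a number from a tender is a quick way to show a client what an uptime figure actually permits.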

Each additional nine is significantly more expensive and complex to achieve than the last. Most Malaysian SME applications have no contractual uptime requirement at all. Most government systems sit somewhere around 99.9% in practice regardless of what the tender says.

HA is achieved through things like:

Load balancers that distribute traffic across multiple application servers so a single server failure does not take down the site.

Database replication where a primary database continuously replicates to one or more replicas. If the primary fails, a replica can be promoted. In MySQL this is called source-replica replication. In PostgreSQL it is streaming replication.

Multi-AZ deployments on cloud providers like AWS where your resources run across multiple physical data centers in the same region. If one data center has an issue, the other continues serving traffic.

Auto-scaling groups that spin up new instances automatically when demand spikes or when a running instance becomes unhealthy.
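
To make the redundancy idea concrete, here is a toy sketch of what a load balancer does when one server fails. It is an illustration only: the class, the backend names and the health-marking methods are all hypothetical, and a real load balancer such as an AWS Application Load Balancer runs its own health checks rather than being told which servers are down.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Toy load balancer: rotates across backends and skips unhealthy ones."""

    def __init__(self, backends: list[str]) -> None:
        if not backends:
            raise ValueError("at least one backend is required")
        self.backends = list(backends)
        self.healthy = set(backends)
        self._rotation = cycle(self.backends)

    def mark_down(self, backend: str) -> None:
        """Record a failed health check for a backend."""
        self.healthy.discard(backend)

    def mark_up(self, backend: str) -> None:
        """Record that a known backend is passing health checks again."""
        if backend in self.backends:
            self.healthy.add(backend)

    def pick(self) -> str:
        """Return the next healthy backend, trying each one at most once."""
        for _ in range(len(self.backends)):
            candidate = next(self._rotation)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy backends left: this is an outage")


# Hypothetical two-server setup: one server dies, traffic keeps flowing.
balancer = RoundRobinBalancer(["app-1", "app-2"])
balancer.mark_down("app-1")
print([balancer.pick() for _ in range(3)])
```

With one backend the same failure raises the error on the first request. That is the single point of failure HA exists to remove.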

HA keeps your system alive. It is about uptime, not recovery.

What Disaster Recovery Actually Is

Disaster Recovery is about what happens when HA fails. Or when the failure is so severe that no amount of redundancy within your normal environment can save you.

Think: the entire data center floods. A ransomware attack encrypts every database in your primary region. A misconfigured deployment script drops your production database and the replica replicated the deletion before anyone noticed. A cloud provider has a region-wide outage that takes down every availability zone simultaneously.

HA cannot help you in these scenarios because the redundancy you built lives in the same environment that just failed.

DR is the plan for getting back. It is measured by two numbers that every developer should know and every client should be able to answer before writing “DR” into a tender:

RPO (Recovery Point Objective). How much data loss is acceptable? If your RPO is one hour, you can afford to lose up to one hour of transactions in a disaster. If your RPO is zero, you need synchronous replication to a completely separate environment at all times. Zero RPO is extremely expensive. Most businesses that say they need zero RPO actually mean they would prefer to lose as little as possible, which is a very different and more achievable goal.

RTO (Recovery Time Objective). How long can the system be down during a disaster? If your RTO is four hours, your DR plan needs to get you back online within four hours of declaring a disaster. If your RTO is fifteen minutes, you need a warm standby environment running and ready to take over almost immediately. Each step down in RTO multiplies cost significantly.
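
Once a client gives you the two numbers, you can check them against what the system actually does. The sketch below is a simplification under one stated assumption: with periodic backups, the worst-case data loss is roughly the interval between backups. The class and function names are mine.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    rpo: timedelta  # maximum acceptable data loss window
    rto: timedelta  # maximum acceptable time offline


def meets_objectives(
    objectives: RecoveryObjectives,
    backup_interval: timedelta,
    measured_restore_time: timedelta,
) -> dict[str, bool]:
    """Compare a backup schedule and a measured restore time with the objectives.

    Assumption: worst-case data loss is approximated by the backup interval.
    """
    return {
        "rpo_met": backup_interval <= objectives.rpo,
        "rto_met": measured_restore_time <= objectives.rto,
    }


# Hypothetical client: one hour RPO, four hour RTO, nightly backups,
# and a restore that took three hours in the last drill.
client = RecoveryObjectives(rpo=timedelta(hours=1), rto=timedelta(hours=4))
print(meets_objectives(client, timedelta(hours=24), timedelta(hours=3)))
```

In this example the restore is fast enough but the nightly backup is not frequent enough. The two numbers fail independently, which is why a client needs to state both.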

AWS classifies DR strategies into four tiers by cost and recovery speed:

Tier one: Backup and restore. Cheapest. Highest RTO (hours to days). You back up regularly and restore from backup when disaster strikes. Fine for non-critical systems.
Tier two: Pilot light. Core infrastructure runs at minimal scale in the DR environment. You scale it up when needed. RTO measured in tens of minutes to hours.
Tier three: Warm standby. A scaled-down but fully functional version runs continuously in the DR environment. Faster RTO, higher cost.
Tier four: Multi-site active-active. Full capacity runs in multiple regions simultaneously. Near-zero RTO and RPO. Most expensive by a significant margin.

Most Malaysian SME applications need tier one or two at most. Government systems with genuine recovery requirements might need tier two or three. Tier four is for banks, national infrastructure and systems where every minute of downtime has direct financial or safety consequences.
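
One way to use the tiers in a scoping conversation is to start from the RTO and see which strategy it implies. The sketch below does that, but the cut-off values are my own placeholders chosen to match the rough ranges above. They are not AWS thresholds, and a real decision also depends on RPO and budget.

```python
from datetime import timedelta

# Hypothetical cut-offs for illustration only. The tiers are described with
# rough ranges, not hard limits, so adjust these to your own measurements.
BACKUP_RESTORE_MIN_RTO = timedelta(hours=4)
PILOT_LIGHT_MIN_RTO = timedelta(minutes=30)
WARM_STANDBY_MIN_RTO = timedelta(minutes=5)


def suggest_dr_tier(rto: timedelta) -> str:
    """Suggest the cheapest strategy that could plausibly meet the given RTO."""
    if rto >= BACKUP_RESTORE_MIN_RTO:
        return "backup and restore"
    if rto >= PILOT_LIGHT_MIN_RTO:
        return "pilot light"
    if rto >= WARM_STANDBY_MIN_RTO:
        return "warm standby"
    return "multi-site active-active"


# A client who can tolerate a working day offline versus one who says "minutes".
print(suggest_dr_tier(timedelta(hours=8)))
print(suggest_dr_tier(timedelta(minutes=10)))
```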

Why the Confusion Is So Common

The conflation of DR and HA comes from a reasonable place. Both are about resilience. Both involve redundancy to some degree. Both show up in the same section of a technical requirements document. But they operate at completely different failure scales and serve completely different purposes.

HA says: we expect failures and we are designed to survive them without the user noticing.

DR says: something catastrophic happened and here is the documented plan to bring us back.

A system can have excellent HA and no DR plan. If the primary region fails completely, that well-designed HA setup goes down with it. A system can also have a solid DR plan and poor HA. It might go down several times a year from ordinary failures, but when something truly catastrophic happens, it recovers within the defined RTO.

The ideal is both. But they are separate budgets, separate architectures and separate conversations.

When a client writes “DR” in a tender and means “it should not go down,” they are asking for HA and calling it DR. This matters because the developer who quotes for a proper DR implementation will price it correctly and lose the tender to someone who quotes for HA dressed up as DR. The client gets neither properly. The system gets built, launched and eventually fails in a way nobody planned for because the DR plan that was promised exists as a single paragraph in a technical document that nobody has ever tested.

Nobody Tells You What This Actually Costs

This is the part that ends tender meetings fast.

Most clients who write “DR” into requirements have never priced it. They assume it is a configuration, not a budget line. It is not. Every step up in recovery capability has a direct and compounding cost that most Malaysian SME and government project budgets have not accounted for.

Here is a realistic cost breakdown using AWS ap-southeast-1 (Singapore), which is the region most Malaysian deployments use. All figures are in USD per month as AWS bills in USD. Treat these as directional estimates, not quotes. Your exact numbers depend on instance size, storage and traffic.

Basic single-server setup (no HA, no DR)

One EC2 instance, one RDS Single-AZ database, basic S3 backups.

Approximate monthly cost: USD 150 to USD 300.

This is what most small projects actually run on. One server. One database. Backups to S3. If the server dies, you are down until you restore. RTO measured in hours. RPO equal to your last backup.

HA setup (Multi-AZ, load balanced)

Two EC2 instances behind an Application Load Balancer, RDS Multi-AZ with automatic standby failover.

Approximate monthly cost: USD 400 to USD 700.

Multi-AZ RDS deployments cost roughly twice as much as Single-AZ because you are running two database instances continuously. Add the second EC2 instance and the load balancer, and your infrastructure bill roughly doubles compared to the basic setup. This is real HA. Routine failures are handled automatically. Users notice nothing.

DR added on top of HA (Pilot Light to another region)

The HA setup above, plus a minimal pilot light environment in a second region (ap-southeast-3 Jakarta or ap-east-1 Hong Kong), a cross-region RDS read replica and S3 cross-region replication.

Approximate monthly cost: USD 800 to USD 1,400.

You are now paying for infrastructure in two regions. The second region sits mostly idle, but it is not free. The cross-region RDS read replica runs continuously. Cross-region data transfer incurs its own charges on top of instance costs. Your monthly bill is roughly three to five times the basic single-server setup. RTO with pilot light is measured in tens of minutes to a few hours. It still requires manual intervention to promote the replica and scale up compute.

Warm standby DR

Full HA in primary region, plus a scaled-down but fully operational environment in a second region running at all times, ready to accept traffic on short notice.

Approximate monthly cost: USD 1,500 to USD 3,000 or more.

You are running two environments simultaneously. The secondary is smaller but it is live. Failover is faster than pilot light because nothing needs to spin up cold. This is where the cost becomes serious. Most Malaysian SME projects have a total infrastructure budget that does not reach this number, let alone sustain it annually.

Multi-site active-active

Two or more full production environments running simultaneously across regions, both serving live traffic.

Approximate monthly cost: USD 4,000 and above, scaling with traffic and data volume.

This is not for SME projects. This is for banks, payment processors and national infrastructure. The cost is not just compute. It is the operational complexity of managing multi-region failover, data consistency, routing and testing. Most teams that quote for this have not operated it.

The compounding costs nobody mentions

The instance and database costs above are the starting point. The real budget includes:

Operational overhead. Someone has to monitor this, respond to alerts and maintain the runbooks. That is time your team is spending on infrastructure, not product. For a solo technical lead, that cost is invisible on paper but very real in hours.

DR testing. An untested DR plan is not a DR plan. Testing means scheduled failover drills, data restore validation and post-drill remediation. At minimum once per quarter. That is four planned disruptions to your ops calendar per year.

Data egress. Moving data between regions is not free. Replicating from ap-southeast-1 to a second region is billed per GB transferred, on top of the instance costs. On a system with active writes, this adds up faster than most estimates account for.
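
A rough way to put a number on the egress line is to multiply the data you write per day by the per-GB transfer rate. The sketch below does only that. The rate in the example is a placeholder I made up for illustration, not an AWS price, so look up the current inter-region transfer rate before quoting anything.

```python
def monthly_replication_cost(
    gb_written_per_day: float,
    usd_per_gb: float,
    days: int = 30,
) -> float:
    """Estimate monthly cross-region transfer cost for replicated writes.

    Assumption: every GB written in the primary region is transferred once.
    """
    if gb_written_per_day < 0 or usd_per_gb < 0:
        raise ValueError("inputs must not be negative")
    return gb_written_per_day * days * usd_per_gb


# Placeholder rate for illustration only. This is NOT an AWS price.
PLACEHOLDER_RATE_USD_PER_GB = 0.10

estimate = monthly_replication_cost(20, PLACEHOLDER_RATE_USD_PER_GB)
print(f"Estimated transfer cost: USD {estimate:.2f} per month")
```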

The conversation a developer should have with a client is not just “what DR tier do you need.” It is “here is what tier one costs, here is what tier two costs and here is what your budget can actually support.” Then you work backward to the architecture.

If the client has a USD 300 per month infrastructure budget and a DR requirement in the tender, one of those things needs to change. Your job is to make that visible before the contract is signed, not after you have built something that does not match either expectation.
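
Working backward from the budget can be as simple as a lookup. This sketch copies the directional monthly ranges from the breakdown above and filters them by the low end of each range. The figures are the same estimates, not quotes, and the function name is mine.

```python
# (name, low USD/month, high USD/month). None means the range is open-ended.
# The warm standby figure is "USD 3,000 or more", so treat its high end as soft.
SETUPS = [
    ("basic single-server", 150, 300),
    ("HA (Multi-AZ, load balanced)", 400, 700),
    ("HA + pilot light DR", 800, 1400),
    ("HA + warm standby DR", 1500, 3000),
    ("multi-site active-active", 4000, None),
]


def affordable_setups(monthly_budget_usd: float) -> list[str]:
    """Return the setups whose low-end estimate fits within the monthly budget."""
    return [name for name, low, _high in SETUPS if low <= monthly_budget_usd]


# The USD 300 per month client from the paragraph above.
print(affordable_setups(300))
```

At USD 300 per month the only entry that fits is the basic single-server setup, which has neither HA nor DR. That is the mismatch to put in front of the client before the contract is signed.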

The Questions to Ask Before Accepting Any DR Requirement

When a client or tender specifies DR, ask these before scoping a single line of architecture.

What is your RPO? How much data can you afford to lose? An hour? A day? Zero? If they cannot answer this, they have not thought through the business impact of data loss. Help them think through it. A payroll system has a very different RPO from a content management site.

What is your RTO? How long can the system be offline before it causes serious business damage? If they say “none,” ask what happens in practice if the system is down for an hour, a day, a week. The real number usually surfaces quickly when you frame it in business terms rather than technical ones.

What counts as a disaster? A single server failure is not a disaster. That is a routine failure that HA handles. A disaster is an event that takes out your entire environment. Has that ever happened to them? What triggered the DR requirement in the first place?

What is the budget? This is the most clarifying question. A genuine multi-region DR setup with aggressive RPO and RTO is expensive to build and expensive to maintain. If the budget does not match the stated requirement, one of them needs to change.

Has anyone ever tested the DR plan? For existing systems claiming to have DR, ask when they last did a failover drill. An untested DR plan is not a DR plan. It is a document.
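
I find it useful to treat these questions as a form that must be filled in before scoping starts. The sketch below is one hypothetical way to track that; the field names are mine and map onto the five questions above.

```python
from dataclasses import dataclass, fields
from datetime import date, timedelta
from typing import Optional


@dataclass
class DrRequirement:
    """Answers to the scoping questions. None means the client has not answered."""

    rpo: Optional[timedelta] = None
    rto: Optional[timedelta] = None
    disaster_definition: Optional[str] = None
    monthly_budget_usd: Optional[float] = None
    last_failover_drill: Optional[date] = None


def unanswered(requirement: DrRequirement) -> list[str]:
    """List the questions that still have no answer, in the order asked."""
    return [f.name for f in fields(requirement) if getattr(requirement, f.name) is None]


# The client from the tender meeting: acceptable downtime "none", nothing else defined.
tender = DrRequirement(rto=timedelta(0))
print(unanswered(tender))
```

An RTO of zero counts as an answer here, even though it is rarely a realistic one. The point of the list is to show what is still missing.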

When DR Is Genuinely Warranted

Not every system needs full DR. But some do and the decision should be based on actual business impact, not checkbox requirements.

DR is genuinely warranted when:

The system handles financial transactions and data loss directly translates to unrecoverable monetary loss. A payment system that processes thousands of ringgit per hour has a very different RPO than a company blog.

The system is regulated and compliance requires documented recovery capabilities with tested procedures. Bank Negara Malaysia’s regulatory framework covers this directly. The Risk Management in Technology (RMiT) policy covers technology resilience for financial institutions, while the Business Continuity Management (BCM) Policy issued in December 2022 explicitly requires formal DR plans with defined RPO and RTO that are regularly tested.

The system supports public services where downtime has downstream human impact. A hospital appointment system, an emergency services portal or a utility billing platform needs both HA to stay running and a DR plan for when an event takes out the entire environment. The consequences of data loss or extended outage are not just technical. They affect real people.

The client has experienced a real disaster before and the requirement comes from institutional memory rather than a checkbox. These clients usually have specific RTO and RPO numbers already. They have felt what happens when those numbers are not met.

For everything else, HA is almost certainly what the situation actually demands. Build it properly, document it clearly and stop calling it DR.

What Good Looks Like in Practice

For a typical Malaysian B2B or government-adjacent application, a practical resilience architecture has two distinct layers. They solve different problems and should be budgeted separately.

The HA layer: keeping you running through ordinary failure

Application layer: Two or more instances behind a load balancer across multiple availability zones. Auto-healing enabled so failed instances are replaced automatically.

Database layer: Primary with at least one replica in a separate availability zone. Automated failover configured so the replica promotes itself without manual intervention.

Auto-scaling: Instances scale out under load and unhealthy instances are replaced. The system handles routine failures without anyone waking up at 3am.

This is your HA investment. It handles the vast majority of failure scenarios: a crashed instance, a bad deployment, a database node going unresponsive. The user might see a slow request. They do not see downtime.

The DR layer: getting you back after catastrophic failure

Backup strategy: Daily snapshots retained for thirty days minimum. Backups stored in a separate region or storage account so a primary region failure does not take your backups with it.

Restore procedure: Documented, versioned and tested at least once per quarter. A backup you have never restored is not a backup. It is a hope.

Monitoring and alerting: Uptime monitoring with alert thresholds that catch degradation before it becomes failure. Clear escalation path documented so the right person gets the alert, not just whoever happens to be online.

This DR layer gives you a backup-and-restore capability. Your RPO is roughly the time since your last snapshot. Your RTO is however long it takes to spin up a new environment and restore from backup, typically hours. For most SME applications, that is acceptable.
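
To show a client what "RPO is roughly the time since your last snapshot" means in practice, you can compute it from a snapshot schedule. This is a minimal sketch with made-up dates; the function names are mine, and it assumes every listed snapshot is restorable, which only a restore test can confirm.

```python
from datetime import datetime, timedelta


def data_loss_window(snapshots: list[datetime], disaster_at: datetime) -> timedelta:
    """Return the time between the latest snapshot before the disaster and the disaster."""
    usable = [taken for taken in snapshots if taken <= disaster_at]
    if not usable:
        raise ValueError("no snapshot predates the disaster; nothing to restore")
    return disaster_at - max(usable)


def expired(
    snapshots: list[datetime],
    now: datetime,
    retention: timedelta = timedelta(days=30),
) -> list[datetime]:
    """Return snapshots older than the retention period (thirty days by default)."""
    return [taken for taken in snapshots if now - taken > retention]


# Hypothetical schedule: daily snapshots at 02:00 from 1 to 10 March 2024.
daily = [datetime(2024, 3, day, 2, 0) for day in range(1, 11)]

# The disaster hits late afternoon on 10 March.
print(data_loss_window(daily, datetime(2024, 3, 10, 17, 30)))
```

With daily snapshots the worst case approaches a full day of lost data. If the client cannot accept that, the fix is more frequent snapshots or replication, and that changes the price.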

If the client pushes back and asks where the “real DR” is, show them the RPO and RTO this architecture delivers and ask them to confirm whether that meets their business requirements. Usually it does. Usually the conversation ends there.

The Actual Problem

If you accept a vague DR requirement without challenging it, you are not doing your client any favors. You are accepting responsibility for a requirement that nobody has defined, which means you can never show that you met it and the client can always claim you did not.

Make them define it. RPO. RTO. Budget. Test frequency. Failure scenarios.

If they cannot answer those questions, you are not scoping a DR implementation. You are scoping an HA system. Name it correctly, architect it correctly and price it correctly.

Two words in a 47-page tender should not determine how a production system gets built. But they will if you let them.