HA vs DR: The Two Concepts Architects Confuse the Most

In this edition

Azure Architect Tip of the Day
Azure Good-Reads
HA vs DR: The Two Concepts Architects Confuse the Most

Azure Architect Tip of the Day

Use Azure Resource Graph for Lightning-Fast Multi-Subscription Queries

If you’re still using scripts or portal clicks to inventory resources across subscriptions, it’s time to switch to Azure Resource Graph (ARG). It lets you run SQL-like KQL queries across your entire Azure estate with sub-second performance, even in large multi-subscription environments.

Why This Matters

Traditional methods of querying resources (Azure CLI loops, PowerShell scripts, or portal navigation) are slow and cumbersome when you need to answer questions like:

"Show me all VMs without backup enabled across all subscriptions"
"Which resources are in deprecated regions?"
"Find all public IP addresses not associated with a firewall"

Azure Resource Graph indexes your entire Azure estate and returns results in milliseconds.

Quick Start Examples

Find all VMs by size across all subscriptions:

Resources
| where type == "microsoft.compute/virtualmachines"
| summarize count() by tostring(properties.hardwareProfile.vmSize)
| order by count_ desc

Identify unattached managed disks (cost optimization):

Resources
| where type == "microsoft.compute/disks"
| where properties.diskState == "Unattached"
| project name, resourceGroup, subscriptionId, sku.name, properties.diskSizeGB

Find resources without required tags:

Resources
| where tags !has "Environment"
| project name, type, resourceGroup, subscriptionId
| limit 100

Pro Implementation Tips

Use Azure Resource Graph Explorer in the portal to test queries before automating them
Export results to CSV directly from the portal for reporting
Integrate with Azure Workbooks for interactive dashboards
Automate with PowerShell/CLI for scheduled compliance reports
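
For the automation tip, here is a minimal Python sketch that wraps the Azure CLI. It assumes the az CLI is installed, logged in, and has the resource-graph extension available; the helper names build_graph_command and run_graph_query are hypothetical, chosen only for this illustration.

```python
import json
import subprocess


def build_graph_command(query: str, first: int = 100) -> list:
    """Build the argument list for `az graph query` (no shell quoting needed)."""
    return ["az", "graph", "query", "-q", query, "--first", str(first), "--output", "json"]


def run_graph_query(query: str, first: int = 100):
    """Run the query through the Azure CLI and return the parsed JSON output.

    Assumes `az` is on PATH and already authenticated; raises
    subprocess.CalledProcessError if the CLI exits with an error.
    """
    completed = subprocess.run(
        build_graph_command(query, first),
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(completed.stdout)
```

Calling run_graph_query with one of the Quick Start queries from a scheduled job would give you the raw result to feed into a report. Passing the query as a list element rather than a shell string avoids quoting problems with the pipes and double quotes in KQL.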

Cost benefit: Using Azure Resource Graph for governance and reporting is effectively free. The service enforces throttling quotas to manage load, but the limits are generous enough that most manual governance and typical reporting workloads will neither incur a cost nor hit the throttle.

Azure Good-Reads

HA vs DR: The Two Concepts Architects Confuse the Most

Over the years, I’ve worked with and interviewed many aspiring Azure Architects. Surprisingly, quite a few of them, even some with solid hands-on experience, confused two of the most fundamental terms in cloud architecture: High Availability (HA) and Disaster Recovery (DR).

I don’t blame them. The two are often mentioned in the same breath and appear side by side in every architecture deck. Yet, they solve very different problems.

Think of it like air travel: High Availability is having multiple engines on your aircraft — if one engine fails, the others keep the plane flying smoothly. Disaster Recovery is having a standby aircraft and crew at another airport — ready to take over if the entire plane becomes inoperable.

One keeps you in the air. The other ensures you can still complete your journey.
That distinction — between continuity and recovery — is the foundation of resilience in Azure architecture.

HA and DR are like twins — they look similar from a distance but behave very differently when things go wrong.

First Things First: HA ≠ DR

Concept	What It Is	Objective	When It Helps	Typical Azure Design
High Availability (HA)	Resilience within the same region or system	Minimize downtime	Component or system-level failure	Availability Zones, Load Balancers, Azure SQL Zone Redundancy
Disaster Recovery (DR)	Resilience across regions	Minimize data loss & restore service continuity	Region-wide outage or catastrophic failure	Azure Site Recovery (ASR), Geo-replication, Paired Regions

Designing HA in Azure

Let’s talk design patterns.

Layer	HA Approach	Azure Services / Features
Compute	Deploy across Availability Zones	Multiple VMs, Virtual Machine Scale Sets, Availability Zones
App Layer	Distribute traffic intelligently	Azure Front Door, Application Gateway, Azure Load Balancer
Database	Zone-redundant or clustered setup	SQL MI (Zone redundant), Cosmos DB multi-region writes
Storage	Zone-redundant replication for local resilience	ZRS Storage Accounts
Networking	Redundant gateways & routes	Active/active VPN, ExpressRoute with dual circuits, ExpressRoute with VPN

In short, HA is really about smart local redundancy. If a VM, a component, or even a zone fails, another one steps in immediately and the application keeps running. And when you break down the SLAs, the difference is striking:

A single VM with Premium SSD or Ultra Disk storage offers 99.9% uptime, roughly 43 minutes of downtime per month.
Using an Availability Set increases that to 99.95%.
Spreading workloads across Availability Zones takes you to 99.99%, which is just about 4.4 minutes of downtime a month.

A small architectural decision, but a massive leap in real-world resilience — and almost always worth the extra cost.
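
To see where those downtime figures come from, here is a small sketch of the arithmetic. It assumes an average month of 730 hours (8,760 hours divided by 12); the function name is only illustrative.

```python
HOURS_PER_MONTH = 730  # assumption: average month, 8,760 hours / 12


def monthly_downtime_minutes(sla_percent: float, hours_per_month: float = HOURS_PER_MONTH) -> float:
    """Maximum downtime per month, in minutes, that still meets the given SLA."""
    unavailable_fraction = 1 - sla_percent / 100
    return hours_per_month * 60 * unavailable_fraction


for label, sla in [
    ("Single VM", 99.9),
    ("Availability Set", 99.95),
    ("Availability Zones", 99.99),
]:
    print(f"{label}: {sla}% -> {monthly_downtime_minutes(sla):.1f} min/month")
```

Under that assumption the three tiers work out to roughly 43.8, 21.9, and 4.4 minutes per month.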

Designing DR in Azure

Now, DR is about surviving the unthinkable: a regional disaster, a large-scale outage, or even a ransomware attack.

Your focus shifts from uptime to RTO (Recovery Time Objective) and RPO (Recovery Point Objective).

Here’s how I usually frame it in workshops:

DR Metric	Meaning	Target
RTO	How long can we afford to be down?	e.g., 2 hours
RPO	How much data can we afford to lose?	e.g., 15 minutes
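
One way to make these two metrics concrete is to measure them after an incident or drill. The sketch below assumes you can record three timestamps (last successful replication, outage start, service restored); the function names and the sample times are hypothetical, and the default targets are the example values from the table.

```python
from datetime import datetime, timedelta


def measure_recovery(last_replicated: datetime, outage_start: datetime, service_restored: datetime):
    """Return (achieved_rpo, achieved_rto) for one incident or drill."""
    achieved_rpo = outage_start - last_replicated   # data written in this window is lost
    achieved_rto = service_restored - outage_start  # how long the service was down
    return achieved_rpo, achieved_rto


def meets_targets(achieved_rpo: timedelta, achieved_rto: timedelta,
                  rpo_target: timedelta = timedelta(minutes=15),
                  rto_target: timedelta = timedelta(hours=2)) -> bool:
    """True when both the data-loss window and the downtime are within target."""
    return achieved_rpo <= rpo_target and achieved_rto <= rto_target


# Hypothetical drill: replication last succeeded at 09:50, outage at 10:00, restored at 11:30.
rpo, rto = measure_recovery(
    datetime(2024, 1, 1, 9, 50),
    datetime(2024, 1, 1, 10, 0),
    datetime(2024, 1, 1, 11, 30),
)
print(rpo, rto, meets_targets(rpo, rto))
```

In this example the data-loss window is 10 minutes and the downtime is 90 minutes, so both example targets are met.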

And then choose Azure solutions accordingly:

Layer	DR Approach	Azure Services / Features
Compute	Replicate VMs and workloads to secondary region	Azure Site Recovery (ASR) for VM replication, Azure Backup, cross-region VM images, Infrastructure as Code (ARM/Bicep/Terraform) for rapid redeploy
App Layer	Deploy standby or cold secondary instances in paired region	Traffic Manager (for failover routing), Front Door (multi-region failover)
Database	Use geo-replication or failover groups to replicate data	Azure SQL Database Auto-failover Groups, SQL MI failover groups, Cosmos DB Multi-region replication
Storage	Replicate or restore data to secondary region	Geo-Redundant Storage (GRS) or Geo-Zone-Redundant Storage (GZRS), with the read-access variants (RA-GRS / RA-GZRS) for DR testing, Azure Backup vaults
Networking	Design multi-region connectivity and DNS-based failover	Azure DNS with Traffic Manager/Front Door for global failover, ExpressRoute Global Reach, dual VPN gateways across paired regions

💰 Cost insight: DR setups can add 15–30% extra monthly cost, but compared to the potential loss during a region-wide outage (easily USD 50,000–100,000/hour for enterprise workloads), it’s an investment, not an expense.
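
A quick back-of-the-envelope check makes the investment argument tangible. This sketch assumes a hypothetical production bill of USD 40,000/month and uses the upper end of the DR overhead range and the lower end of the outage-cost range quoted above; the function name is illustrative.

```python
def breakeven_outage_hours(monthly_production_cost: float,
                           dr_overhead_fraction: float,
                           outage_cost_per_hour: float) -> float:
    """Hours of avoided outage per year at which the DR spend pays for itself."""
    annual_dr_cost = monthly_production_cost * dr_overhead_fraction * 12
    return annual_dr_cost / outage_cost_per_hour


# Hypothetical inputs: $40,000/month production, 30% DR overhead, $50,000/hour outage cost.
hours = breakeven_outage_hours(40_000, 0.30, 50_000)
print(f"DR pays for itself after {hours:.2f} hours of avoided outage per year")
```

With these inputs the annual DR spend is USD 144,000, which is recovered after just under three hours of avoided regional outage.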

Four common DR strategies

1. Cold DR (Backup & Restore)

Cold DR is the most basic and cost-effective strategy. In this model, no infrastructure runs in the DR region. You only store backups of your application and databases — typically in Azure Recovery Services Vaults, Blob Storage snapshots, or Azure Backup.

If the primary region fails, you must:
✔️ Recreate the infrastructure
✔️ Restore data from backups
✔️ Reconfigure networking & dependencies

This can take hours to days, but the cost is minimal because you’re paying only for storage.

Best For: Non-critical workloads, dev/test environments
Typical Cost: $50–$200/month (storage only)
RTO: Hours–Days | RPO: Hours

2. Warm DR (Pilot Light Strategy)

Warm DR keeps the minimum critical components running in the DR region so that you can “ignite” the rest quickly during a disaster.

In this model:

Databases or stateful components are geo-replicated and always on
Some lightweight services (identity, config) are live
Most compute (VMs / App Services) is either not deployed or stopped and deallocated
Infrastructure is recreated or scaled out during failover

When disaster hits, automation (ARM/Bicep templates, DevOps pipelines, or ASR recovery plans) spins up the remaining compute and app tiers. This takes tens of minutes, but keeps costs moderate.

Best For: Business-critical systems needing recovery within ~1 hour
Typical Cost: $500–$2,000/month
RTO: 30–60 minutes | RPO: Minutes

3. Hot DR (Active–Passive)

Hot DR represents a fully deployed standby environment in the DR region that mirrors production — but does not serve traffic during normal operations.
This is different from Warm DR in a key way:

✔️ Warm DR = Only partial infra deployed

✔️ Hot DR = Full infra deployed (but idle or scaled down)

In Hot DR:

All tiers (web/app/APIs/DB) already exist in the DR region
Data is continuously replicated (SQL geo-replication, ASR replication)
Networking, firewalls, Key Vault, private endpoints — everything is pre-configured
Compute often runs at minimal capacity (e.g., 1 small instance instead of 5 large ones)

During failover, Azure Front Door or Traffic Manager simply reroutes traffic. There is usually no provisioning step — only a traffic flip and optional scale-out.

You pay significantly less than Active–Active because the DR site is not running full load, even though it is fully deployed.

Best For: Mission-critical workloads requiring quick recovery
Typical Cost: $5,000–$10,000/month (~0.5×–0.8× production cost)
RTO: 5–10 minutes | RPO: Near-zero

4. Active–Active

Active–Active is the highest level of resiliency. Both regions are fully deployed and actively serving traffic simultaneously.

This is made possible through:

Azure Front Door load balancing across regions
Multi-region write databases (Cosmos DB, or custom multi-master patterns)
Active synchronization of state, sessions, and storage

If one region goes down, the other continues seamlessly — users don’t even notice.
Because both regions run full production load 24×7, the cost is effectively double.

Best For: Global, real-time, customer-facing apps (banking, e-commerce, SaaS)
Typical Cost: 2× production
RTO: Near-zero | RPO: Near-zero (depends on replication lag between regions)
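
The four strategies can be read as a simple lookup: pick the cheapest tier whose recovery figures fit your targets. The sketch below is one way to express that. The worst-case numbers are assumptions loosely derived from the ranges above (for example, "Hours–Days" is taken as 2 days, and Active-Active is treated as zero for simplicity), so adjust them to your own measurements.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class DrStrategy:
    name: str
    worst_case_rto: timedelta
    worst_case_rpo: timedelta


# Ordered cheapest first. The figures are illustrative assumptions, not Azure guarantees.
STRATEGIES = [
    DrStrategy("Cold DR", timedelta(days=2), timedelta(hours=24)),
    DrStrategy("Warm DR", timedelta(minutes=60), timedelta(minutes=15)),
    DrStrategy("Hot DR", timedelta(minutes=10), timedelta(minutes=1)),
    DrStrategy("Active-Active", timedelta(0), timedelta(0)),
]


def cheapest_strategy(rto_target: timedelta, rpo_target: timedelta) -> Optional[DrStrategy]:
    """Return the first (cheapest) strategy that satisfies both targets, or None."""
    for strategy in STRATEGIES:
        if strategy.worst_case_rto <= rto_target and strategy.worst_case_rpo <= rpo_target:
            return strategy
    return None


choice = cheapest_strategy(timedelta(hours=2), timedelta(minutes=15))
print(choice.name if choice else "No strategy fits")
```

For the workshop example of a 2-hour RTO and 15-minute RPO, this picks Warm DR, which is the point of the exercise: start from the targets and buy only as much DR as they require.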

Testing: The Forgotten Pillar

Most DR plans fail not because of design — but because they were never tested.

At one customer, their DR runbook looked perfect.
But during the first real failover attempt, the automation script hit a “permissions error” — it hadn’t been updated after a role change six months earlier.

Lesson learned:
Schedule quarterly DR drills. Automate validations. Review your Azure runbooks and network failover flows regularly.

Azure Site Recovery makes this surprisingly easy with non-disruptive test failovers — use them.
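
Part of "automate validations" can be as simple as flagging workloads whose last drill is too old. Here is a minimal sketch, assuming you keep a record of the last drill date per workload; the workload names are hypothetical and "quarterly" is approximated as 90 days.

```python
from datetime import date, timedelta

DRILL_INTERVAL = timedelta(days=90)  # assumption: "quarterly" approximated as 90 days


def overdue_drills(last_drill_by_workload: dict, today: date) -> list:
    """Return the sorted names of workloads never drilled or drilled too long ago."""
    return sorted(
        name
        for name, last_drill in last_drill_by_workload.items()
        if last_drill is None or today - last_drill > DRILL_INTERVAL
    )


# Hypothetical drill log.
drill_log = {
    "payments-api": date(2024, 1, 10),
    "reporting": date(2024, 5, 2),
    "legacy-batch": None,  # never tested
}
print(overdue_drills(drill_log, today=date(2024, 6, 1)))
```

Running a check like this on a schedule turns the quarterly drill from a good intention into something that shows up on a dashboard when it slips.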

Final thoughts

Just like twins with distinct personalities, HA and DR serve completely different purposes once you get close enough to understand them. One keeps your workload flying through everyday turbulence — a failed VM, a faulty NIC, a zone outage — while the other steps in when the entire aircraft is grounded and you need an alternate runway to complete the journey.

When architects blur the two, they either overspend on the wrong solution or leave critical workloads exposed. But when you treat HA and DR as complementary layers of the same resilience strategy, everything falls into place: HA absorbs the small hits, DR shields you from the catastrophic ones, and together they give your applications the continuity and recovery they truly need.

In the end, resilience in Azure isn’t achieved by choosing between HA and DR — it’s achieved by knowing exactly what each twin is meant to protect you from, and designing with both in mind from day one.
