In this edition
Azure Architect Tip of the Day
Leverage Azure Policy for Consistent Resource Governance at Scale
One of the most overlooked yet powerful features for Azure architects is Azure Policy. Instead of relying on manual reviews or post-deployment audits, use Azure Policy to enforce organizational standards and compliance requirements before resources are even created.
Why This Matters
As your Azure environment grows, maintaining consistency across subscriptions, resource groups, and teams becomes increasingly difficult. Azure Policy acts as guardrails that prevent configuration drift and non-compliant deployments automatically.
Quick Implementation Tips
Start with built-in policies - Azure provides hundreds of pre-configured policies for common scenarios (enforce tags, allowed VM SKUs, require encryption, etc.)
Use Policy Initiatives - Group related policies together into initiatives (policy sets) for easier management. For example, create a "Production Workload" initiative that bundles networking, security, and compliance policies.
Leverage Deny vs Audit effects strategically:
Use Audit during initial rollout to identify non-compliant resources without blocking deployments
Switch to Deny for critical policies once teams are familiar with requirements (see the parameterized-effect sketch after this list)
Apply at the right scope - Assign policies at Management Group level for organization-wide standards, but allow exceptions at subscription or resource group levels when needed
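One way to implement the Audit-to-Deny rollout is to parameterize the effect in your custom policy definitions, so tightening a policy is just a change to the assignment's parameter rather than a new definition. A minimal sketch (the tagEffect parameter name is only a placeholder, and the tag matches the CostCenter example below):

{
  "parameters": {
    "tagEffect": {
      "type": "String",
      "allowedValues": [ "Audit", "Deny" ],
      "defaultValue": "Audit"
    }
  },
  "policyRule": {
    "if": {
      "field": "tags['CostCenter']",
      "exists": "false"
    },
    "then": {
      "effect": "[parameters('tagEffect')]"
    }
  }
}

Assign it with the default Audit value first, review the compliance results, then flip the assignment's parameter to Deny. The same definition can also be bundled with related built-in policies into an initiative so the whole set is assigned, and later tightened, in one step.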
Real-World Example
// Require specific tags on all resources
{
  "policyRule": {
    "if": {
      "field": "tags['CostCenter']",
      "exists": "false"
    },
    "then": {
      "effect": "deny"
    }
  }
}
This simple policy ensures every resource has a CostCenter tag, enabling accurate cost tracking and chargeback - saving hours of manual tagging work later.
Pro tip: Combine Azure Policy with Azure Blueprints for repeatable, compliant environment deployments that include RBAC, policies, and ARM templates in one package.
Azure Good-Reads
HA/DR: A Field Guide to the Most Painful Mistakes
If there’s one thing 25 years in the IT industry has taught me, it’s this: high availability and disaster recovery are simple on paper, but brutal in real life. Not because the concepts are hard—anyone can repeat “zones = HA, regions = DR”—but because the real learning happens only after you’ve been burned by an outage, a bad assumption, or a design that looked perfect in Visio but collapsed in production.
In this post, I want to share the lessons I’ve collected the hard way—the HA/DR strategies that work, the ones that don’t, and the painful mistakes teams have made. If you’ve ever mixed up backup with DR, trusted a single region too much, or discovered during a failover drill that nothing actually failed over… you’ll feel right at home.
Let’s dive in.
Mistake 1: Thinking Backup = DR
This is one of the most common misunderstandings in HA/DR planning.
A team enables Azure Backup, sees green checkmarks everywhere, updates their compliance sheet, and assumes they’re “covered for DR.”
Backups are running, alerts are quiet, auditors are satisfied — so it feels like everything is safe.
But backup is not disaster recovery.
Reality:
Backup helps you recover your data.
DR helps you recover your systems.
Those are very different goals.
It’s important to say this clearly:
Backup can be a DR approach — but only for dev/test or non-critical workloads where long downtime is acceptable.
If an app can be offline for 24–48 hours without hurting the business, restoring from backup is perfectly fine.
Where things go wrong is when teams use the same approach for production systems with strict RTO/RPO.
A common real-world scenario looks like this:
A business experiences a regional outage. Their backups are all intact. But restoring a multi-terabyte database or rebuilding an entire environment takes hours — sometimes far more than the organisation’s RTO allows.
Nothing “failed.” Backups worked exactly as designed.
They just weren’t meant to be used as a rapid recovery solution.
The fix:
Use backup as the last line of defense — protection against corruption, accidental deletion, and ransomware.
But for actual DR, you need systems that can take over quickly:
Active-active or active-passive designs
Synchronous replication within a region (for HA)
Asynchronous replication across regions (for DR)
Automated environment rebuilds where applicable
Backups get your data back.
DR gets your business back.
And that difference is where many teams unintentionally set themselves up for long outages.
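To make "systems that can take over quickly" concrete: an active-passive Azure SQL setup is commonly expressed as a failover group between two logical servers in different regions, which geo-replicates the databases asynchronously and gives applications a single listener endpoint. A rough ARM-style sketch, assuming the two servers and the database already exist (parameter names and the apiVersion are placeholders to check against current documentation):

// Failover group spanning a primary and a secondary logical SQL server
{
  "type": "Microsoft.Sql/servers/failoverGroups",
  "apiVersion": "2021-11-01",
  "name": "[concat(parameters('primaryServerName'), '/', parameters('failoverGroupName'))]",
  "properties": {
    "readWriteEndpoint": {
      "failoverPolicy": "Automatic",
      "failoverWithDataLossGracePeriodMinutes": 60
    },
    "partnerServers": [
      { "id": "[resourceId('Microsoft.Sql/servers', parameters('secondaryServerName'))]" }
    ],
    "databases": [
      "[resourceId('Microsoft.Sql/servers/databases', parameters('primaryServerName'), parameters('databaseName'))]"
    ]
  }
}

Because the replication behind the group is asynchronous, the grace period defines how long the platform waits before an automatic failover that may lose the most recent writes.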
Talking of replication, here’s mistake number 2.
Mistake 2: Trying to Use Synchronous Replication Everywhere
This mistake usually starts with a well-intentioned request from the business:
“We cannot afford to lose any data. Make our RPO zero.”
It sounds reasonable.
But it often leads some architects into one of the most damaging design choices — enabling synchronous replication across Azure regions.
What actually happens:
Synchronous replication requires every write operation to be acknowledged by both regions before it’s committed.
Across regions, this introduces 50–200 ms of latency per write, depending on the region pair.
And that slows everything down.
The fix:
Be realistic about where RPO = 0 is possible.
Use synchronous replication within a region (for HA).
Use asynchronous replication across regions (for DR).
For most organisations, the trade-off is simple:
Losing 20–30 seconds of data during a rare regional outage is far better than slowing down every user interaction by several seconds.
When framed clearly —
“It’s a choice between a few seconds of data loss once in many years, or a slower experience for every customer right now.”
— the decision becomes much easier and far more rational.
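Azure's storage redundancy tiers map neatly onto this split: zone-redundant copies inside the primary region are synchronous, while replication to the paired region is asynchronous. A hypothetical sketch of a storage account that combines both (the parameters and apiVersion are placeholders):

// Standard_GZRS: synchronous copies across three availability zones in the
// primary region, plus asynchronous geo-replication to the paired region.
{
  "type": "Microsoft.Storage/storageAccounts",
  "apiVersion": "2023-01-01",
  "name": "[parameters('storageAccountName')]",
  "location": "[parameters('location')]",
  "kind": "StorageV2",
  "sku": {
    "name": "Standard_GZRS"
  }
}

The same pattern applies to databases and messaging: keep the synchronous hop inside the region for HA, and accept a small, measured RPO across regions for DR.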
Mistake 3: Not Updating the HA/DR Plan
What happens:
Teams spend weeks building a solid HA/DR setup. Everything is documented nicely, tested during go-live, and celebrated as “done.”
Then the project moves on… and the HA/DR setup is never touched again.
Meanwhile, the application continues to evolve. New services are added. Old servers are retired. Dependencies change.
But the HA/DR plan stays frozen in time.
Reality:
It’s common to see DR documentation that references servers that no longer exist, architecture diagrams that haven’t been updated in years, or critical new microservices that were never added to the DR plan.
In many organisations, the person who originally designed the DR strategy eventually leaves — and the remaining team isn’t fully sure how the failover is supposed to work anymore.
The fix: Treat HA/DR as living architecture. It needs continuous care. Here’s what actually works:
Quarterly reviews of all HA/DR documentation
HA/DR updates in the deployment checklist — every major application change should include an HA/DR impact assessment
Twice-yearly DR tests (real tests, not “let’s check if backups exist”)
Architecture Decision Records (ADRs) that explain not just what was built, but why
And most importantly, assign clear ownership.
Someone on the team should always be able to answer:
“Is our DR plan current, and when did we last test it?”
HA/DR only works when it stays up to date.
Mistake 4: The “Active-Passive” Illusion
What happens:
Teams build an active site in one region and a passive site in another. On paper, it feels like a complete DR setup.
Reality:
Passive environments often drift out of sync because they aren’t used day-to-day. When a real failover happens, teams commonly find:
The passive site is running older app versions
Config changes in production were never applied
Some DR resources were deallocated to reduce cost
Network routes and integrations weren’t updated
A failover planned for 30 minutes quickly becomes hours of recovery.
Not because the design was wrong — but because the passive site wasn’t kept ready.
The fix:
If you choose active-passive, make sure the passive site is genuinely operational:
Keep it in version parity with production
Run regular synthetic checks to confirm it works
Sync configs, secrets, certificates, and networking rules
Monitor it the same way you monitor production
And the final test is simple:
Could you switch to your DR site right now and be confident it will work?
If the answer isn’t a clear “yes,” the passive site isn’t truly ready.
Mistake 5: Ignoring Regional Dependencies and Cascading Failures
What happens:
Teams design with two regions in mind — for example, primary in East US and DR in West US. Because the compute and data layers are separated across regions, it feels like full regional redundancy.
Reality:
Many setups still depend on services that aren’t regionally distributed. During a failover, these hidden dependencies become the real points of failure.
Common issues seen in DR drills and real outages include:
Authentication failing because Azure AD B2C, Entra endpoints, or custom identity services weren’t deployed or planned for multi-region use
DNS not resolving to a working endpoint because records in the DNS zone pointed only at primary-region resources
Applications unable to start because both primary and DR sites pulled secrets from a single Key Vault in the affected region
In all these cases, workloads fail over successfully — but the supporting services don’t — causing the entire system to break.
The fix:
Map every dependency your application needs, not just the infrastructure layer. Review each one through an HA/DR lens:
Identity & Authentication: Are you using global or region-specific endpoints?
Secrets: Is Key Vault geo-replicated or duplicated in your DR region?
DNS: Will DNS resolution work if the primary region is unavailable?
Third-party services: Where are they hosted, and what is their DR strategy?
Control plane vs. data plane: Some Azure services rely on regional control planes even if data itself is global.
Create a clear dependency chain diagram for your critical paths — covering authentication, authorization, routing, integrations, and transaction flow.
Those small “red boxes” in the dependency diagram — the ones people often ignore — are usually the exact single points of failure that can take down both your primary and DR sites.
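For the DNS question in particular, a common pattern is a global routing layer such as Traffic Manager (or Front Door) with priority routing, so name resolution shifts to the DR region when the primary endpoint is unhealthy. A rough sketch with hypothetical names, targets, and health-probe path:

// Priority routing: send traffic to the primary endpoint while it is healthy,
// fail over to the DR endpoint when the health probe starts failing.
{
  "type": "Microsoft.Network/trafficManagerProfiles",
  "apiVersion": "2018-08-01",
  "name": "contoso-app-tm",
  "location": "global",
  "properties": {
    "trafficRoutingMethod": "Priority",
    "dnsConfig": { "relativeName": "contoso-app", "ttl": 30 },
    "monitorConfig": { "protocol": "HTTPS", "port": 443, "path": "/health" },
    "endpoints": [
      {
        "type": "Microsoft.Network/trafficManagerProfiles/externalEndpoints",
        "name": "primary-eastus",
        "properties": { "target": "app-eastus.example.com", "priority": 1, "endpointStatus": "Enabled" }
      },
      {
        "type": "Microsoft.Network/trafficManagerProfiles/externalEndpoints",
        "name": "dr-westus",
        "properties": { "target": "app-westus.example.com", "priority": 2, "endpointStatus": "Enabled" }
      }
    ]
  }
}

The routing layer only helps if the probe path genuinely reflects application health and the secondary target is kept warm, which loops straight back to Mistake 4.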
Mistake 6: DR Tests That Don’t Reflect Real-World Conditions
What happens:
Most organisations run DR tests in a controlled, predictable manner. The team follows a predefined script, systems are switched over in a planned window, and everything appears to work. A report is filed, compliance is satisfied, and the process is marked as a success.
There’s nothing wrong with this — it’s practical, reduces risk, and keeps production safe.
But these tests often validate the plan, not the reality.
Reality:
Controlled tests rarely capture the conditions of an actual outage. In the real world, failovers happen during peak load, with active users, queued transactions, session caches, and pressure from multiple teams.
That’s where gaps surface — not in a carefully orchestrated dry run.
Common challenges that usually remain untested include:
Teams knowing the exact test timing and preparing in advance
Failovers happening during quiet traffic windows
Business processes not being exercised end-to-end
Reduced or simplified test scenarios
Issues logged but not always re-tested under load
As a result, a DR plan may look perfect on paper, yet behave very differently when real customers and real traffic are involved.
The fix:
Introducing just a bit of variation or real user activity into your testing can reveal issues long before an actual incident does. You don’t need disruptive or dramatic tests; you simply need tests that resemble how the system actually runs on an ordinary day.
And here’s a controversial opinion:
A failed DR test is often more valuable than a successful one.
If your DR tests always pass, you’re probably not testing hard enough.
Closing Thoughts: HA/DR Is About Discipline, Not Technology
Azure gives you incredible tools—Availability Zones, paired regions, geo-replication, automated failover. The tech is world-class. But here’s the truth most people don’t like to hear: HA/DR failures are rarely technology failures. They’re failures of process, assumptions, and discipline.
The goal of HA/DR isn’t zero downtime.
The goal is understood, acceptable, and tested downtime that matches business expectations.
If you found this useful, tap Subscribe at the bottom of the page to get future updates straight to your inbox.
