Azure Architect Tip of the Day
Use Azure Resource Graph for Lightning-Fast Multi-Subscription Queries
If you’re still using scripts or portal clicks to inventory resources across subscriptions, it’s time to switch to Azure Resource Graph (ARG). It lets you run SQL-like KQL queries across your entire Azure estate with sub-second performance, even in large multi-subscription environments.
Why This Matters
Traditional methods of querying resources (Azure CLI loops, PowerShell scripts, or portal navigation) are slow and cumbersome when you need to answer questions like:
"Show me all VMs without backup enabled across all subscriptions"
"Which resources are in deprecated regions?"
"Find all public IP addresses not associated with a firewall"
Azure Resource Graph indexes resource metadata across your subscriptions and typically returns results in under a second.
Quick Start Examples
Find all VMs by size across all subscriptions:
Resources
| where type == "microsoft.compute/virtualmachines"
| summarize count() by tostring(properties.hardwareProfile.vmSize)
| order by count_ desc
Identify unattached managed disks (cost optimization):
Resources
| where type == "microsoft.compute/disks"
| where properties.diskState == "Unattached"
| project name, resourceGroup, subscriptionId, sku.name, properties.diskSizeGB
Find resources without required tags:
Resources
| where tags !has "Environment"
| project name, type, resourceGroup, subscriptionId
| limit 100
Pro Implementation Tips
Use Azure Resource Graph Explorer in the portal to test queries before automating them
Export results to CSV directly from the portal for reporting
Integrate with Azure Workbooks for interactive dashboards
Automate with PowerShell/CLI for scheduled compliance reports
Cost benefit: Using Azure Resource Graph for governance and reporting is effectively free. The service applies throttling quotas to manage load, but the limits are generous enough that most manual governance and typical reporting workloads will not hit them.
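The automation tip above could be sketched roughly as follows, assuming the Azure CLI with the `resource-graph` extension is installed and signed in. The helper names (`build_graph_command`, `rows_to_csv`), the sample payload, and the assumption that result rows sit under a `data` key are illustrative; check the output of `az graph query` in your own environment before relying on them.

```python
import csv
import io
import json

# Same idea as the unattached-disk query above, trimmed to three columns.
UNATTACHED_DISKS_QUERY = (
    'Resources '
    '| where type == "microsoft.compute/disks" '
    '| where properties.diskState == "Unattached" '
    '| project name, resourceGroup, subscriptionId'
)


def build_graph_command(query, first=100, subscriptions=None):
    """Build the argument list for `az graph query` (hypothetical helper)."""
    cmd = ["az", "graph", "query", "-q", query, "--first", str(first), "--output", "json"]
    if subscriptions:
        # Restrict the query to specific subscriptions; omit to use the CLI default scope.
        cmd += ["--subscriptions", *subscriptions]
    return cmd


def rows_to_csv(cli_output):
    """Convert the CLI's JSON output to CSV text, assuming rows live under "data"."""
    rows = json.loads(cli_output).get("data", [])
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Sample payload standing in for real CLI output (made-up resource names).
sample_output = json.dumps({
    "count": 1,
    "data": [{"name": "disk-01", "resourceGroup": "rg-demo", "subscriptionId": "sub-123"}],
})

print(build_graph_command(UNATTACHED_DISKS_QUERY, first=50))
print(rows_to_csv(sample_output))
```

In a scheduled job, the list returned by `build_graph_command` would be passed to `subprocess.run(..., capture_output=True, text=True)` and the resulting `stdout` fed to `rows_to_csv`.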
Azure Good-Reads
HA vs DR: The Two Concepts Architects Confuse the Most
Over the years, I’ve worked with and interviewed many aspiring Azure Architects. Surprisingly, quite a few of them — even some with solid hands-on experience — were confused between two of the most fundamental terms in cloud architecture: High Availability (HA) and Disaster Recovery (DR).
I don’t blame them. The two are often mentioned in the same breath and appear side by side in every architecture deck. Yet, they solve very different problems.
Think of it like air travel: High Availability is having multiple engines on your aircraft — if one engine fails, the others keep the plane flying smoothly. Disaster Recovery is having a standby aircraft and crew at another airport — ready to take over if the entire plane becomes inoperable.
One keeps you in the air. The other ensures you can still complete your journey.
That distinction — between continuity and recovery — is the foundation of resilience in Azure architecture.
HA and DR are like twins — they look similar from a distance but behave very differently when things go wrong.
First Things First: HA ≠ DR
| Concept | What It Is | Objective | When It Helps | Typical Azure Design |
|---|---|---|---|---|
| High Availability (HA) | Resilience within the same region or system | Minimize downtime | Component or system-level failure | Availability Zones, Load Balancers, Azure SQL Zone Redundancy |
| Disaster Recovery (DR) | Resilience across regions | Minimize data loss & restore service continuity | Region-wide outage or catastrophic failure | Azure Site Recovery (ASR), Geo-replication, Paired Regions |
Designing HA in Azure
Let’s talk design patterns.
| Layer | HA Approach | Azure Services / Features |
|---|---|---|
| Compute | Deploy across Availability Zones | Multiple VMs, Virtual Machine Scale Sets, Availability Zones |
| App Layer | Distribute traffic intelligently | Azure Front Door, Application Gateway, Azure Load Balancer |
| Database | Zone-redundant or clustered setup | SQL MI (zone-redundant), Cosmos DB multi-region writes |
| Storage | Zone-redundant replication for local resilience | ZRS Storage Accounts |
| Networking | Redundant gateways & routes | Active/active VPN, ExpressRoute with dual circuits, ExpressRoute with VPN failover |
In short, HA is really about smart local redundancy. If a VM, a component, or even a zone fails, another one steps in immediately and the application keeps running. And when you break down the SLAs, the difference is striking:
A single VM (with Premium SSD or Ultra disks) offers 99.9% uptime: roughly 43 minutes of downtime in a 30-day month.
Using an Availability Set increases that to 99.95%, or about 22 minutes a month.
Spreading workloads across Availability Zones takes you to 99.99%, which is only about 4.3 minutes of downtime a month.
A small architectural decision, but a massive leap in real-world resilience — and almost always worth the extra cost.
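The downtime figures follow directly from the SLA percentages. As a quick sanity check, here is a minimal sketch that converts an SLA into allowed downtime, assuming a 30-day month (the function name is illustrative):

```python
def downtime_minutes_per_month(sla_percent, days_in_month=30):
    """Maximum downtime (in minutes) that still meets the given SLA percentage."""
    minutes_in_month = days_in_month * 24 * 60
    return minutes_in_month * (1 - sla_percent / 100)


for label, sla in [
    ("Single VM", 99.9),
    ("Availability Set", 99.95),
    ("Availability Zones", 99.99),
]:
    print(f"{label}: {sla}% -> {downtime_minutes_per_month(sla):.1f} min/month")
```

With a 30-day month this works out to about 43.2, 21.6, and 4.3 minutes respectively; a longer month shifts the numbers slightly.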
Designing DR in Azure
Now, DR is about surviving the unthinkable: a regional disaster, a large-scale outage, or even a ransomware attack.
Your focus shifts from uptime to RTO (Recovery Time Objective) and RPO (Recovery Point Objective).
Here’s how I usually frame it in workshops:
| DR Metric | Meaning | Target |
|---|---|---|
| RTO | How long can we afford to be down? | e.g., 2 hours |
| RPO | How much data can we afford to lose? | e.g., 15 minutes |

And then choose Azure solutions accordingly:
| Layer | DR Approach | Azure Services / Features |
|---|---|---|
| Compute | Replicate VMs and workloads to secondary region | Azure Site Recovery (ASR) for VM replication, Azure Backup, cross-region VM images, Infrastructure as Code (ARM/Bicep/Terraform) for rapid redeploy |
| App Layer | Deploy standby or cold secondary instances in paired region | Traffic Manager (for failover routing), Front Door (multi-region failover) |
| Database | Use geo-replication or failover groups to replicate data | Azure SQL Database auto-failover groups, SQL MI failover groups, Cosmos DB multi-region replication |
| Storage | Replicate or restore data to secondary region | Geo-Redundant Storage (GRS) or Geo-Zone-Redundant Storage (GZRS), with the read-access variants (RA-GRS/RA-GZRS) for DR testing, Azure Backup vaults |
| Networking | Design multi-region connectivity and DNS-based failover | Azure DNS with Traffic Manager/Front Door for global failover, ExpressRoute Global Reach, dual VPN gateways across paired regions |
💰 Cost insight: DR setups can add 15–30% extra monthly cost, but compared to the potential loss during a region-wide outage (easily USD 50,000–100,000/hour for enterprise workloads), it’s an investment, not an expense.
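To make that trade-off concrete, here is a rough break-even sketch: how many hours of avoided outage per year would pay for the DR overhead. The USD 20,000/month production bill is a made-up illustration; the 30% overhead and USD 50,000/hour loss come from the ranges quoted above.

```python
def breakeven_outage_hours_per_year(monthly_production_cost, dr_overhead_fraction, loss_per_outage_hour):
    """Hours of avoided outage per year at which the DR spend pays for itself."""
    annual_dr_cost = monthly_production_cost * dr_overhead_fraction * 12
    return annual_dr_cost / loss_per_outage_hour


# Hypothetical workload: USD 20,000/month production spend, 30% DR overhead,
# and USD 50,000 lost per hour of outage.
hours = breakeven_outage_hours_per_year(20_000, 0.30, 50_000)
print(f"DR pays for itself after about {hours:.2f} outage hours per year")
```

Under these assumptions the break-even point is roughly an hour and a half of avoided outage per year; plug in your own figures before drawing conclusions.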
Four common DR strategies
1. Cold DR (Backup & Restore)
Cold DR is the most basic and cost-effective strategy. In this model, no infrastructure runs in the DR region. You only store backups of your application and databases — typically in Azure Recovery Services Vaults, Blob Storage snapshots, or Azure Backup.
If the primary region fails, you must:
✔️ Recreate the infrastructure
✔️ Restore data from backups
✔️ Reconfigure networking & dependencies
This can take hours to days, but the cost is minimal because you’re paying only for storage.
Best For: Non-critical workloads, dev/test environments
Typical Cost: $50–$200/month (storage only)
RTO: Hours–Days | RPO: Hours
2. Warm DR (Pilot Light Strategy)
Warm DR keeps the minimum critical components running in the DR region so that you can “ignite” the rest quickly during a disaster.
In this model:
Databases or stateful components are geo-replicated and always on
Some lightweight services (identity, config) are live
Most compute (VMs / App Services) is either not deployed or is stopped and deallocated
Infrastructure is recreated or scaled out during failover
When disaster hits, automation (ARM/Bicep templates, DevOps pipelines, or ASR recovery plans) spins up the remaining compute and app tiers. This takes tens of minutes, but keeps costs moderate.
Best For: Business-critical systems needing recovery within ~1 hour
Typical Cost: $500–$2,000/month
RTO: 30–60 minutes | RPO: Minutes
3. Hot DR (Active–Passive)
Hot DR represents a fully deployed standby environment in the DR region that mirrors production — but does not serve traffic during normal operations.
This is different from Warm DR in a key way:
✔️ Warm DR = Only partial infra deployed
✔️ Hot DR = Full infra deployed (but idle or scaled down)
In Hot DR:
All tiers (web/app/APIs/DB) already exist in the DR region
Data is continuously replicated (SQL geo-replication, ASR replication)
Networking, firewalls, Key Vault, private endpoints — everything is pre-configured
Compute often runs at minimal capacity (e.g., 1 small instance instead of 5 large ones)
During failover, Azure Front Door or Traffic Manager simply reroutes traffic. There is usually no provisioning step — only a traffic flip and optional scale-out.
You pay significantly less than Active–Active because the DR site is not running full load, even though it is fully deployed.
Best For: Mission-critical workloads requiring quick recovery
Typical Cost: $5,000–$10,000/month (~0.5×–0.8× production cost)
RTO: 5–10 minutes | RPO: Near-zero
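The "traffic flip" in Hot DR can be pictured as priority-based routing: traffic goes to the healthy endpoint with the lowest priority number, so the standby region only receives traffic once the primary is marked unhealthy. The sketch below is a conceptual model of that behaviour, not the Traffic Manager or Front Door API; the `Endpoint` class and region names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    name: str
    priority: int  # lower number = preferred target
    healthy: bool


def route(endpoints):
    """Return the healthy endpoint with the lowest priority number, or None."""
    healthy = [e for e in endpoints if e.healthy]
    if not healthy:
        return None
    return min(healthy, key=lambda e: e.priority)


primary = Endpoint("primary-region", priority=1, healthy=True)
standby = Endpoint("dr-region", priority=2, healthy=True)

print(route([primary, standby]).name)  # normal operations: primary serves traffic

primary.healthy = False  # health probe fails during a regional outage
print(route([primary, standby]).name)  # traffic flips to the standby region
```

In a real deployment the health flag is driven by probes, and DNS TTLs or probe intervals add to the effective failover time.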
4. Active–Active
Active–Active is the highest level of resiliency. Both regions are fully deployed and actively serving traffic simultaneously.
This is made possible through:
Azure Front Door load balancing across regions
Multi-region write databases (Cosmos DB, or custom multi-master patterns)
Active synchronization of state, sessions, and storage
If one region goes down, the other continues seamlessly — users don’t even notice.
Because both regions run full production load 24×7, the cost is effectively double.
Best For: Global, real-time, customer-facing apps (banking, e-commerce, SaaS)
Typical Cost: 2× production
RTO: Near-zero | RPO: Near-zero (depends on how state is replicated)
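One way to use the four tiers is to walk them from cheapest to most expensive and stop at the first one whose worst-case RTO and RPO fit your targets. The sketch below does that. The numeric bounds are assumptions chosen to stand in for the ranges above (for example, "Hours–Days" is taken as 3 days and "Minutes" as 15 minutes), and Active-Active is treated as zero for simplicity.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class DrStrategy:
    name: str
    worst_case_rto: timedelta
    worst_case_rpo: timedelta


# Ordered from cheapest to most expensive; the bounds are illustrative assumptions.
STRATEGIES = [
    DrStrategy("Cold DR (Backup & Restore)", timedelta(days=3), timedelta(hours=24)),
    DrStrategy("Warm DR (Pilot Light)", timedelta(minutes=60), timedelta(minutes=15)),
    DrStrategy("Hot DR (Active-Passive)", timedelta(minutes=10), timedelta(minutes=1)),
    DrStrategy("Active-Active", timedelta(0), timedelta(0)),
]


def cheapest_strategy(target_rto, target_rpo):
    """Return the first (cheapest) strategy whose worst case meets both targets."""
    for strategy in STRATEGIES:
        if strategy.worst_case_rto <= target_rto and strategy.worst_case_rpo <= target_rpo:
            return strategy
    return None  # only reached if a target is negative


# The workshop example from earlier: RTO of 2 hours, RPO of 15 minutes.
choice = cheapest_strategy(timedelta(hours=2), timedelta(minutes=15))
print(choice.name)
```

With the example targets of a 2-hour RTO and a 15-minute RPO, this picks Warm DR; tightening the RPO to a minute or two pushes the choice to Hot DR.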
Testing: The Forgotten Pillar
Most DR plans fail not because of design — but because they were never tested.
At one customer, their DR runbook looked perfect.
But during the first real failover attempt, the automation script hit a “permissions error” — it hadn’t been updated after a role change six months earlier.
Lesson learned:
Schedule quarterly DR drills. Automate validations. Review your Azure runbooks and network failover flows regularly.
Azure Site Recovery makes this surprisingly easy with non-disruptive test failovers — use them.
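"Automate validations" can start small: record three timestamps during each drill and compare the achieved RTO and RPO with the targets. The sketch below assumes you capture those timestamps yourself; the function name and the sample times are made up for illustration.

```python
from datetime import datetime, timedelta


def evaluate_drill(outage_start, service_restored, last_recovery_point, target_rto, target_rpo):
    """Compare a drill's achieved RTO/RPO against the agreed targets."""
    achieved_rto = service_restored - outage_start      # how long the service was down
    achieved_rpo = outage_start - last_recovery_point   # how much data would be lost
    return {
        "achieved_rto": achieved_rto,
        "achieved_rpo": achieved_rpo,
        "rto_met": achieved_rto <= target_rto,
        "rpo_met": achieved_rpo <= target_rpo,
    }


# Made-up drill: failover declared at 09:00, service back at 10:30,
# last replicated recovery point at 08:50.
result = evaluate_drill(
    outage_start=datetime(2025, 1, 15, 9, 0),
    service_restored=datetime(2025, 1, 15, 10, 30),
    last_recovery_point=datetime(2025, 1, 15, 8, 50),
    target_rto=timedelta(hours=2),
    target_rpo=timedelta(minutes=15),
)
print(result)
```

Logging this result after every quarterly drill gives you a trend line, so a slipping RTO shows up before a real failover does.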
Final thoughts
Just like twins with distinct personalities, HA and DR serve completely different purposes once you get close enough to understand them. One keeps your workload flying through everyday turbulence — a failed VM, a faulty NIC, a zone outage — while the other steps in when the entire aircraft is grounded and you need an alternate runway to complete the journey.
When architects blur the two, they either overspend on the wrong solution or leave critical workloads exposed. But when you treat HA and DR as complementary layers of the same resilience strategy, everything falls into place: HA absorbs the small hits, DR shields you from the catastrophic ones, and together they give your applications the continuity and recovery they truly need.
In the end, resilience in Azure isn’t achieved by choosing between HA and DR — it’s achieved by knowing exactly what each twin is meant to protect you from, and designing with both in mind from day one.
