Cloudflare outage - Special edition

Azure Architect Tip of the Day

Your application's biggest vulnerability isn't in your code—it's in the services you've stopped worrying about.

What Happened Yesterday

Cloudflare experienced a nearly six-hour outage that took down millions of websites globally. Here's what makes this interesting for architects:

The root cause wasn't a cyberattack or infrastructure failure. It was a database permission change that doubled the size of a configuration file, which then exceeded a hard-coded limit in their proxy software.

Result: Global cascading failure across 330+ edge locations.

The lesson? Even robust platforms with 99.99% uptime guarantees can fail in unexpected ways.

Why This Affects You

If you're using Cloudflare (or any CDN/security layer) in front of your Azure applications, yesterday's outage meant:

  • Your perfectly healthy Azure App Services were unreachable

  • Your auto-scaling and redundancy didn't matter

  • Your multi-region deployment couldn't help

  • Users saw error pages for hours

The invisible layer became a single point of failure.

Three Questions to Ask This Week

1. "Can users reach our application if [Service X] completely fails?"

Replace [Service X] with:

  • Your CDN provider

  • Your DNS provider

  • Your authentication service

  • Your API gateway

If the answer is "no" for any of these, you've found a single point of failure.

2. "How long would it take us to route around a failed service?"

Options typically include:

  • Automatic (seconds to minutes) - requires pre-configured failover

  • Manual (30 minutes to 2 hours) - requires runbooks and testing

  • Rebuild (hours to days) - means you haven't planned for this

3. "Would we even know if this service failed?"

Critical: Your monitoring must be independent of the services being monitored.

Cloudflare couldn't access their own dashboard during the outage because the login system depended on the failing infrastructure. Don't let this happen to you.
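One way to keep monitoring independent is sketched below in Python using only the standard library: probe the application both through its public (CDN-fronted) hostname and directly at the origin, then classify the failure by comparing the two. The hostnames and messages are illustrative, not a real API.

```python
import urllib.request
from urllib.error import URLError

def probe(url, timeout=5.0):
    """True if the endpoint answers with a 2xx/3xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (URLError, OSError):
        return False

def diagnose(via_cdn_ok, origin_ok):
    """Classify a failure by comparing the edge path with the direct path."""
    if via_cdn_ok and origin_ok:
        return "healthy"
    if not via_cdn_ok and origin_ok:
        return "edge layer down: origin reachable, consider DNS failover"
    if via_cdn_ok and not origin_ok:
        return "origin degraded: edge still serving cached content"
    return "full outage: check both layers"

# In production, run this from outside your own stack, e.g.:
#   diagnose(probe("https://www.example.com"), probe("https://origin.example.com"))
print(diagnose(False, True))
```

Because the probe runs from outside the stack, an edge-layer outage is distinguishable from an origin failure even when your own dashboard is unreachable.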

Remember

The Cloudflare outage happened despite:

  • World-class engineering team

  • Sophisticated monitoring

  • Gradual rollout processes

  • 99.99% uptime track record

Your systems will fail too. Plan accordingly.

The goal isn't to prevent all failures—that's impossible. The goal is to ensure failures don't become catastrophes.

Cloudflare Outage Explained: The What, Why, and Key Lessons for Modern Architects

Every once in a while, the internet gives us a reminder: beneath all the glossy interfaces and polished apps lies a delicate mesh of interconnected systems. On most days, these systems work smoothly, quietly enabling the world’s digital ambitions. But occasionally, one key layer fails - and suddenly half the world is staring at error pages.

That’s what happened during yesterday’s Cloudflare outage.

This incident sparked global disruption and renewed the same question many architects quietly ask themselves: How much of our architecture depends on Cloudflare, and what would happen if it fails again?

Let’s unpack that question properly - starting with the basics.

Understanding Cloudflare's Architecture

What Is Cloudflare?

The easiest way to explain Cloudflare is this:
Cloudflare is an edge platform, a global layer that sits between your users and your applications. It improves performance, protects against attacks, and keeps traffic flowing even when your origin systems are under pressure.

The Cloudflare Service Layer

Cloudflare has evolved from a simple CDN into a comprehensive edge platform that provides multiple critical services:

Key Cloudflare Components:

  • CDN: Caches static assets (images, CSS, JavaScript) globally

  • Security: WAF rules, DDoS protection (multi-terabit capacity), bot detection

  • Core Proxy (FL/FL2): Routes and processes all traffic through the network

  • Workers/KV: Serverless functions and key-value storage at the edge

  • Access: Identity-aware proxy for Zero Trust security

Why Organizations Depend on Cloudflare

Performance Optimization: By caching content at 330+ edge locations, Cloudflare reduces latency and offloads traffic from Azure origins. For a globally distributed application, this can reduce response times from 500ms to under 50ms.

Cost Efficiency: Cloudflare Pro at USD 20/month per zone includes unlimited bandwidth.

Security at Scale: Cloudflare's DDoS protection handles attacks up to 15+ Tbps.

Multi-Cloud Flexibility: Cloudflare sits vendor-neutral, allowing routing across Azure, AWS, on-premises, or hybrid configurations through a single control plane.

The November 18 Outage

The Simple Explanation (want the technical deep dive? Read the next section)

Cloudflare published a detailed post-mortem that’s worth reading, but here’s a simplified explanation.

1. A configuration change led to a file becoming unexpectedly large

A permissions change to an internal database caused Cloudflare’s bot-management system to generate a much larger “feature file” than intended.

2. This file propagated globally across Cloudflare’s network

Cloudflare edge nodes began receiving the oversized file.

3. Cloudflare’s traffic-handling software wasn’t built for that file size

The software expected a fixed upper limit. The file broke that assumption. Processes began to crash.

4. When edge nodes misbehaved, traffic broke

Since Cloudflare sits in front of thousands of services, everything behind it suffered:
ChatGPT, X, e-commerce sites, gaming services, SaaS platforms, even outage trackers themselves.

Chaos at the edge quickly becomes chaos everywhere.

5. Cloudflare engineers rolled back the change and replaced the file

Once the source of the issue was corrected, edge nodes recovered, and global traffic slowly normalised.

Technical Deep Dive

The outage began at 11:20 UTC and lasted until 17:06 UTC, with core services restored by 14:30 UTC. Let's examine the technical failure chain.

The Root Cause: A Permission Change Gone Wrong

Cloudflare uses ClickHouse (a columnar database) to generate configuration files for its Bot Management system. These files are distributed to edge servers every 5 minutes to keep bot detection models current.

The Failure Cascade Explained

Step 1: Database Schema Change (11:05 UTC)

Cloudflare modified ClickHouse permissions to improve security. The database has two schemas:

  • default: Distributed tables (what users normally see)

  • r0: Underlying shard tables (where data actually lives)

The change made r0 tables visible to improve query authorization.

Step 2: Query Behavior Changed (11:05-11:28 UTC)

The Bot Management config generator runs this query:

SELECT name, type 
FROM system.columns 
WHERE table = 'http_requests_features'
ORDER BY name;

Before the change: the query returned column metadata from the default database only, roughly 60 features.

After the change: it also matched the underlying r0 shard tables.

Because the query doesn't filter by database name, it now returned duplicate column metadata, more than doubling the size of the generated feature file.
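To see why the missing database filter inflates the output, here is a small simulation of the metadata query in Python. The schema rows are fabricated for illustration; only the shape of the bug matches the incident.

```python
# Fabricated stand-in for ClickHouse's system.columns: the same physical
# columns are now visible once under 'default' and once under 'r0'.
system_columns = (
    [{"database": "default", "table": "http_requests_features", "name": f"feature_{i}"}
     for i in range(60)]
    + [{"database": "r0", "table": "http_requests_features", "name": f"feature_{i}"}
       for i in range(60)]
)

def query_without_db_filter(rows):
    # Mirrors the original query: filters on table name only.
    return [r["name"] for r in rows if r["table"] == "http_requests_features"]

def query_with_db_filter(rows):
    # The missing constraint: also pin the schema.
    return [r["name"] for r in rows
            if r["table"] == "http_requests_features" and r["database"] == "default"]

print(len(query_without_db_filter(system_columns)))  # 120: duplicates included
print(len(query_with_db_filter(system_columns)))     # 60
```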

Step 3: Feature File Distribution (11:28 UTC)

The oversized feature file (more than double the expected ~60 features) was automatically distributed to all 330+ edge locations globally. Distribution happens every 5 minutes and propagates rapidly.

Step 4: Hard Limit Exceeded

The FL2 proxy (written in Rust) has a pre-allocated memory limit:

// Memory for features is pre-allocated up front, so the count is hard-capped.
const MAX_FEATURES: usize = 200;

if features.len() > MAX_FEATURES {
    // An unexpected input size is treated as fatal rather than recoverable.
    panic!("Feature count exceeds limit");
}

The limit exists for performance optimization (pre-allocating memory). When the duplicated metadata pushed the feature count past the 200 cap, the check failed and the proxy panicked, crashing the entire traffic processing pipeline.
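A defensive alternative is sketched below. This is not Cloudflare's actual fix, just an illustration of the "treat internal config as untrusted input" lesson: validate the incoming file and fall back to the last-known-good configuration instead of panicking.

```python
MAX_FEATURES = 200  # mirrors the proxy's pre-allocation cap

def load_features(new_features, current_features):
    """Validate a freshly distributed feature file; on failure, keep the
    last-known-good set instead of crashing the traffic pipeline."""
    if len(new_features) > MAX_FEATURES:
        # Oversized file: treat it as untrusted input, log, and reject it.
        return current_features, False
    return new_features, True

good = [f"feature_{i}" for i in range(60)]
bad = [f"feature_{i}" for i in range(240)]  # duplicated metadata pushed past the cap
active, accepted = load_features(bad, good)
print(accepted, len(active))  # False 60
```

The trade-off: serving traffic with a stale bot-detection model is degraded behavior, but it keeps the data plane up while operators investigate.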

Step 5: Cascading Service Failures

The core proxy failure cascaded to:

  • Workers KV: Depends on the proxy for request routing

  • Cloudflare Access: Authentication flows through the proxy

  • Dashboard: Required Turnstile (which failed) for login

  • All customer traffic: Returned 500 errors globally

Why Diagnosis Was Challenging

Intermittent Failures: The ClickHouse cluster was being updated gradually. Some nodes generated good files, others bad files. Every 5 minutes, the system alternated between working and failing, mimicking a DDoS attack pattern.

Misleading Symptoms: Cloudflare's status page (hosted externally) coincidentally went down, suggesting a coordinated attack rather than internal misconfiguration.

Behavioral Differences: FL2 (new proxy) showed 500 errors while FL (legacy proxy) silently failed by setting all bot scores to zero, creating confusion about the scope and nature of the issue.

Architectural Lessons for Azure Deployments

This incident provides critical insights for building resilient systems on Azure.

1. Map Your True Dependency Graph

Most architects underestimate the depth of their dependencies on edge providers.

What you perceive your architecture is:

Users → Azure App Service

What your actual architecture is:

Users → DNS provider → Cloudflare edge (CDN, WAF, bot management) → Azure Application Gateway → App Service

Action Items:

  • Document every hop between users and Azure resources

  • Identify which services share infrastructure (Workers KV and Access both failed because they depend on the core proxy)

  • Map authentication flows separately—they're often the most fragile

2. Implement Multi-Layered Failover

Don't rely on a single vendor for critical path services. Design active-passive or active-active failover patterns.

Implementation Strategy:

Primary Path: Traffic → Azure Traffic Manager → Cloudflare → Azure App Gateway → App Service

Failover Path: Traffic → Azure Traffic Manager → Azure Front Door → Azure App Gateway → App Service
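The failover decision itself can be modeled as priority routing, the mode Azure Traffic Manager uses: serve from the highest-priority path that currently passes health checks. A minimal sketch, with illustrative path names:

```python
# Priority-ordered serving paths, mirroring Traffic Manager's priority
# routing mode. Path names are illustrative, not real endpoints.
PATHS = [
    ("cloudflare", 1),   # primary: full edge/CDN feature set
    ("front_door", 2),   # pre-configured Azure-native failover
    ("app_gateway", 3),  # last resort: direct to origin, no global edge
]

def select_path(health):
    """Return the highest-priority path that currently passes health checks."""
    for name, _priority in sorted(PATHS, key=lambda p: p[1]):
        if health.get(name, False):
            return name
    raise RuntimeError("no healthy path - full outage")

# Primary down, failover healthy: traffic shifts to the second path.
print(select_path({"cloudflare": False, "front_door": True, "app_gateway": True}))
```

The crucial detail is that the failover path must exist and be tested before the outage; priority routing can only choose among paths you have already provisioned.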

3. Monitor from Outside Your Stack

Cloudflare couldn't access its own dashboard during the outage because authentication depended on the failing infrastructure.

Essential monitoring:

  • Azure Monitor availability tests from multiple global locations

  • Third-party monitoring services (USD 30-50/month)

  • Multiple alerting channels (email, SMS, Teams)

4. Design for Graceful Degradation

Applications should continue functioning, even with reduced features, when dependencies fail.

Example: If your CDN fails, can users still access content directly from Azure (slower, but functional)? If authentication is unavailable, can the application serve cached or public content?
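One pattern for this, sketched below, is a stale-while-error cache: serve expired content when the upstream dependency throws, so users see a slightly old page instead of a 500. The class and names are illustrative, not a specific library's API.

```python
import time

class StaleWhileError:
    """Serve cached responses past their TTL when the upstream fails:
    degraded but functional, rather than an error page."""

    def __init__(self, ttl=60.0):
        self.ttl = ttl
        self.cache = {}  # key -> (stored_at, value)

    def get(self, key, fetch):
        now = time.time()
        hit = self.cache.get(key)
        if hit and now - hit[0] < self.ttl:
            return hit[1]              # fresh cache hit
        try:
            value = fetch(key)         # upstream call (CDN, origin, auth...)
            self.cache[key] = (now, value)
            return value
        except Exception:
            if hit:
                return hit[1]          # stale, but better than a 500
            raise                      # nothing cached: surface the error

cache = StaleWhileError(ttl=0)         # ttl=0 forces every hit to be "stale"
cache.cache["/home"] = (time.time(), "cached page")

def broken_fetch(key):                 # simulate a failed dependency
    raise ConnectionError("upstream down")

print(cache.get("/home", broken_fetch))  # cached page
```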

5. Have Emergency Access Plans

Ensure your team can manage Azure resources even if your primary access methods fail:

  • Configure Azure CLI and PowerShell for emergency access

  • Maintain offline runbooks (actual documents, not just wiki pages)

  • Set up break-glass accounts with Privileged Identity Management

  • Don't depend on your own infrastructure for admin access

6. Validate Configuration Changes

Cloudflare's issue started with a configuration change that wasn't properly validated.

Best practices:

  • Use Infrastructure as Code (Terraform, Bicep) with validation rules

  • Require approval for infrastructure changes (Implement Doer-checker strategy)

  • Test changes in staging environments first

  • Implement gradual rollouts with automatic rollback on errors
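As an illustration of such validation (not Cloudflare's actual pipeline), a pre-distribution sanity check could reject a generated feature file that contains duplicates or deviates sharply from its expected size:

```python
def validate_feature_file(features, expected_count=60, tolerance=0.5):
    """Reject a generated config that contains duplicates or deviates
    sharply from its expected size - a pre-distribution sanity check."""
    errors = []
    if len(features) != len(set(features)):
        errors.append("duplicate feature names")
    lo, hi = expected_count * (1 - tolerance), expected_count * (1 + tolerance)
    if not lo <= len(features) <= hi:
        errors.append(f"feature count {len(features)} outside [{lo:.0f}, {hi:.0f}]")
    return errors

good = [f"feature_{i}" for i in range(60)]
doubled = good * 2                     # what the unfiltered query produced
print(validate_feature_file(good))     # []
print(validate_feature_file(doubled))  # both checks fail
```

A check this simple, run before global distribution rather than after, turns a network-wide outage into a rejected build.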

7. Multi-Cloud Isn't Always the Answer

Some organisations maintain multi-cloud architectures thinking they avoid vendor lock-in. But if Cloudflare sits in front of everything, both Azure and AWS become equally unavailable during a Cloudflare outage.

True resilience requires diversity at every layer, which significantly increases complexity and cost.

8. Calculate Your Acceptable Downtime

The key question: What does six hours of downtime cost your organisation?

If a 6-hour outage costs USD 100,000 in lost revenue, spending USD 500/month (USD 6,000/year) for additional resilience is clearly justified. If it costs USD 5,000, the calculation changes.
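That back-of-the-envelope calculation is simple enough to encode. The figures below are the hypothetical ones from the example above, with an assumed frequency of one such outage per year:

```python
def resilience_roi(outage_cost, outages_per_year, annual_spend):
    """Expected annual loss avoided, minus what the extra resilience costs."""
    return outage_cost * outages_per_year - annual_spend

# Hypothetical figures from the example above, one outage per year assumed.
print(resilience_roi(100_000, 1, 6_000))  # 94000: clearly justified
print(resilience_roi(5_000, 1, 6_000))    # -1000: harder to justify
```

The hard part is not the arithmetic but honestly estimating `outage_cost` and `outages_per_year` for your own organisation.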

Final Thoughts

The Cloudflare outage is a timely reminder that resilience isn’t only about what we build—it’s also about the services we depend on quietly in the background. Some of the most important parts of an architecture are the ones we barely notice until they stop working.

Third-party platforms like Cloudflare offer tremendous value, but they also introduce dependencies that need to be understood, not assumed. Even well-engineered systems can fail, sometimes in unexpected ways, and yesterday made that very clear.

Eliminating risk entirely isn’t realistic, and trying to do so would be astronomically expensive. The real objective is to understand where your vulnerabilities lie, make conscious design decisions, and ensure the system bends under stress instead of breaking.

If there’s one takeaway, it’s this:
Design for failure. Test the ugly scenarios. Monitor the entire chain—not just your own stack. And always keep a fallback path ready.

If you found this useful, tap Subscribe at the bottom of the page to get future updates straight to your inbox.
