Azure Databricks for Infrastructure Architects: A Platform Deep Dive
A ground-up walkthrough of Azure Databricks - told from the infrastructure side of the table, not the data science side.
If you are an Azure Infrastructure Architect, chances are someone has asked you to “just create an Azure Databricks workspace.”
At first glance, it looks simple. A managed Spark platform. A few clicks in the portal. Workspace created.
But the moment Databricks moves into production, the real questions begin:
How should the networking be designed?
Should clusters have public IPs?
Where should the data live?
How do you control costs and cluster sprawl?
How do identity and access actually work?
Azure Databricks is not just a data platform. It is an infrastructure platform that happens to run data workloads.
This article breaks down Azure Databricks the way an infrastructure architect needs to understand it.
1. What Is Azure Databricks
1.1 Introduction
Before diving into architecture, it helps to understand Azure Databricks from a platform perspective - not the marketing description, but the infrastructure reality.
Azure Databricks is a managed Apache Spark platform, jointly engineered by Microsoft and Databricks and delivered as a first-party Azure service.
To understand where it fits, it helps to separate the layers involved.
| Layer | What it is | Role |
|---|---|---|
| Apache Spark | Open-source distributed data processing engine | Executes large-scale data processing across multiple machines |
| Databricks Platform | Managed platform built around Spark | Provides notebooks, cluster management, job orchestration, security, and governance |
| Azure Databricks | Databricks delivered as an Azure service | Integrates the platform with Azure identity, networking, storage, and monitoring |
In simple terms:
Spark is the engine. Databricks is the platform. Azure Databricks is the Azure-native implementation of that platform.
A useful comparison is Kubernetes.
Kubernetes is an open-source container orchestration engine, but most enterprises do not run raw Kubernetes. Instead, they use managed platforms like Azure Kubernetes Service (AKS). Spark follows a similar pattern: Spark provides the processing engine, and Databricks provides the managed environment that operates it at enterprise scale.
Here is a simple way to visualize what Spark actually does.
Imagine running calculations across thousands of spreadsheets simultaneously. On a single machine, that process could take hours. Spark distributes the work across dozens or hundreds of machines, processes the data in parallel, and then combines the results.
Databricks is the platform that makes that distributed system practical to operate. It handles cluster provisioning, scaling, job scheduling, access control, and integration with cloud storage.
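The split-apply-combine pattern described above can be sketched in a few lines of plain Python. This is a toy, single-machine illustration only - Spark performs the same split, parallel apply, and combine steps, but across executor processes on many machines, with scheduling, shuffles, and fault tolerance handled for you.

```python
# Toy illustration of the pattern Spark automates at cluster scale:
# split the data, process chunks in parallel, combine the results.
# Threads are used here for simplicity; Spark runs separate executor
# processes on separate machines.
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    # "Apply" step: each worker computes a partial result independently.
    return sum(x * x for x in chunk)

def distributed_sum_of_squares(data, workers=4):
    # "Split" step: divide the dataset into roughly equal chunks.
    size = max(1, len(data) // workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    # Run the chunks in parallel, then "combine" the partial results.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(process_chunk, chunks))

print(distributed_sum_of_squares(list(range(1000))))
```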
What makes Azure Databricks particularly interesting for infrastructure architects is that it is not purely a SaaS service. The platform operates using a split architecture, where some components are managed by Databricks while the compute resources that process your data run inside your Azure environment. This architectural model becomes central to how networking, security, and governance are designed.
1.2 Not just a Spark Cluster
Databricks is a unified analytics platform that covers:
Data Engineering - ETL pipelines, streaming ingestion, Delta Live Tables (now called Lakeflow Spark Declarative Pipelines)
Data Science - Collaborative notebooks, experiment tracking, feature engineering
Machine Learning - Model training, MLflow integration, feature stores, model registry
SQL Analytics - Databricks SQL with SQL Warehouses for BI workloads
AI Applications - Vector search, model serving, GenAI application development
From an infrastructure standpoint, this means a single platform deployment needs to satisfy very different compute and network requirements depending on the workload type running on it.
1.3 The Lakehouse Concept
The term "Lakehouse" is central to the Databricks story and worth understanding clearly. A traditional data architecture forced a choice:
Data Lake - Cheap, scalable storage (like ADLS Gen2), but without ACID transactions, schema enforcement, or query performance optimisation.
Data Warehouse - Structured, performant, governed - but expensive, inflexible, and siloed.
The Lakehouse pattern combines the economics of a data lake with the reliability and query performance of a data warehouse. The technology that makes this work is Delta Lake - an open-source storage layer built on top of Parquet files that adds:
ACID transaction support
Schema enforcement and evolution
Time travel (query historical versions of data)
Optimised file layout for query performance

1.4 Core Components of the Databricks Platform
Before going deeper into architecture, here is a quick orientation of the key building blocks:
Workspace - The primary organisational unit. Think of it as an environment (Dev, Test, Prod). Each workspace has its own users, clusters, notebooks, and jobs.
Clusters - The compute nodes where Spark runs. They spin up on-demand, autoscale, and terminate automatically.
Notebooks - Interactive, collaborative documents that mix code (Python, Scala, SQL, R) with output. Think of them as the primary development surface.
Jobs - Scheduled or triggered execution of notebooks or code. This is how pipelines run in production.
SQL Warehouses - Dedicated compute for SQL analytics workloads. Separate from Spark clusters, optimised for BI and dashboarding.
Unity Catalog - The centralised governance layer for data, models, and AI assets across all workspaces.
2. The Single Most Important Concept: Control Plane vs Compute Plane
If there is one thing to truly internalise before designing anything else around Databricks, it is the split between the Control Plane and the Compute Plane. Everything else - networking, security, identity, data flow - derives from understanding this boundary.
2.1 Why This Matters for Architects
This is not just product documentation vocabulary. This split has direct architectural implications:
Customer data processing does not occur in the Control Plane; data is processed in the Compute Plane, while the Control Plane handles orchestration, metadata, and workspace services.
Network design must accommodate both planes - with different trust levels and connectivity requirements.
Security hardening targets differ - you cannot harden the Control Plane (Databricks manages it), but you can lock down the Compute Plane extensively.
2.2 The Control Plane
The Control Plane consists of the backend services that Azure Databricks manages in your Azure Databricks account - an environment hosted by Databricks, not your Azure subscription. The web application runs in the Control Plane.
Practically speaking, the Control Plane includes:
The Databricks web UI (what users browse to)
The job scheduler and orchestration engine
The cluster manager (decides when to spin up/down clusters)
Workspace services and APIs
Notebook metadata (the code itself, not the data)
The critical point: the Control Plane does not process your data. It tells your subscription what to do, but actual data processing happens elsewhere.
2.3 The Compute Plane
For classic Azure Databricks compute, the compute resources run inside your Azure subscription, in what is called the classic Compute Plane - that is, the network in your subscription and the resources attached to it.
The Compute Plane is where the actual work happens - Spark workloads run here, clusters live here, and storage is accessed from here. Because this runs in your subscription, you have full visibility and control over:
The VNet the clusters use
NSG rules governing traffic
UDRs for egress control
VM SKUs and scaling behaviour
2.4 Serverless vs Classic Compute
There are now two variants of the Compute Plane, and choosing between them is an architectural decision with real consequences:
Classic Compute Plane (Customer VNet)
Cluster VMs run inside your Azure subscription
You manage the VNet, NSGs, and UDRs
Full control, but operational overhead
Supports VNet injection and private endpoints
Serverless Compute Plane
With serverless compute, the compute resources run in a serverless Compute Plane inside the Azure Databricks account.
Databricks manages the infrastructure - no VMs appear in your subscription
Near-instant startup (no cluster warm-up)
No direct VNet control; connectivity is managed through Network Connectivity Configurations (NCC), private endpoints, and egress control policies.

3. Workspace Architecture
3.1 What a Databricks Workspace Really Creates
When a workspace is deployed, it is not simply a logical container. Behind the scenes, several Azure resources are provisioned automatically:
Managed Resource Group - Databricks creates a second resource group in your subscription (prefixed with databricks-rg-). This contains the infrastructure Databricks manages on your behalf: VMs when clusters are running, network interfaces, and the workspace storage account. You do not have full control over resources within this group - Databricks owns the lifecycle.
Workspace Storage Account - Classic workspaces have an associated storage account known as the workspace storage account. The workspace storage account is in your Azure subscription.
Networking Components - Depending on your deployment model, this includes a VNet (Databricks-managed or customer-provided), subnets, and NSGs.

3.2 Workspace Storage
The workspace storage account is not where your business data lives - it is the operational storage for the workspace itself. It contains:
Workspace system data - Notebook revisions, job run histories, command outputs, Spark logs
Unity Catalog workspace catalog - The default catalog if Unity Catalog was auto-enabled
DBFS root (legacy) - The Databricks File System root, now deprecated. Avoid using it for data storage in new deployments.
Architect's Note: A common governance issue in early Databricks deployments is that teams start storing data on DBFS root, treating it like a shared file system. This creates uncontrolled data sprawl and bypasses governance entirely. Enforce the use of external ADLS Gen2 storage from day one.
3.3 Multi-Workspace Strategy
For any enterprise deployment, a single workspace is not sufficient. The typical pattern follows the same logic as Azure landing zones:

Key considerations for multi-workspace strategy:
One Unity Catalog metastore per region per Databricks account - All workspaces in the same region can share a single metastore, giving a unified view of data governance.
Separate subscriptions or resource groups per environment - Aligns with Azure landing zone principles and enables clean RBAC boundaries.
Cluster policies differ per workspace - Dev workspaces can allow more flexibility; Prod workspaces should enforce specific VM SKUs, autotermination, and autoscaling limits.
4. Compute Architecture
The Databricks portal provides several compute types, each targeting a different workload and user persona. It is worth grouping these logically rather than treating them as a flat list - they fall naturally into categories: core Spark compute, SQL analytics, AI and application services, and compute infrastructure (Pools).

4.1 Core Spark Compute: All-Purpose and Job
All-Purpose Compute clusters are long-lived, shared across users, and built for interactive notebook work - exploration, prototyping, ad-hoc analysis. They carry the highest idle cost risk. Always configure autotermination; a cluster left running overnight can accumulate significant VM and DBU charges unnoticed.
Job Compute clusters are created for a single job run and terminated the moment it completes. No idle cost, no resource sharing between jobs. This is the correct pattern for all production workloads - ETL, ML training, scheduled processing. Pairing Job Compute with Azure Spot VMs for worker nodes cuts VM costs by 60–80% with minimal impact on non-time-sensitive jobs.
Both types support the Photon runtime - Databricks' vectorised C++ engine delivering 2x–10x performance gains over standard Spark for SQL and Delta Lake workloads. Enable it by default on production clusters.
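The Job Compute plus Spot pattern above translates into a handful of settings in the job's cluster definition. The sketch below shows the shape of a Jobs API `new_cluster` block under those assumptions; the runtime version and VM SKU are placeholders to substitute with values available in your workspace.

```python
import json

# Sketch of a Jobs API "new_cluster" block pairing Job Compute with
# Azure Spot worker VMs. <photon-runtime-version> and <worker-vm-sku>
# are placeholders, not real values.
job_cluster_spec = {
    "spark_version": "<photon-runtime-version>",
    "node_type_id": "<worker-vm-sku>",
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "azure_attributes": {
        "first_on_demand": 1,  # keep the driver on an on-demand VM
        "availability": "SPOT_WITH_FALLBACK_AZURE",  # spot workers, on-demand fallback
        "spot_bid_max_price": -1,  # -1 = pay up to the on-demand price
    },
    "runtime_engine": "PHOTON",
}

print(json.dumps(job_cluster_spec, indent=2))
```

Keeping `first_on_demand` at 1 protects the driver from spot eviction, so only worker loss (which Spark can recover from) is exposed to spot reclamation.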
4.2 SQL Warehouses
SQL Warehouses are dedicated compute for BI and SQL analytics - separate from Spark clusters entirely. Power BI, Tableau, and Databricks SQL all connect through them. Do not use All-Purpose clusters as a substitute; the execution engine, concurrency model, and cost structure are fundamentally different.
Provisioned warehouses offer defined sizing (2X-Small to 4X-Large) with ~30–60 second start times, suitable for sustained query loads. Serverless warehouses start in ~3 seconds, bill only while running, and typically deliver better economics for intermittent BI workloads. Serverless is production-ready and should be the default choice where security posture permits.
4.3 AI and Application Services: Vector Search, Apps, and Lakebase
Vector Search is Databricks' managed vector database for GenAI workloads - primarily RAG pipelines, semantic search, and recommendations. Endpoints run in the Databricks-managed plane, not your VNet. Organisations with fully private deployments should validate network topology compatibility before adopting it.
Apps is a serverless runtime for lightweight internal data applications - Streamlit, Dash, or Gradio apps with native Lakehouse and Unity Catalog access, no separate App Service deployment required. Scope it to internal tooling; public-facing workloads needing WAF or custom domains are better served by Azure App Service or Container Apps.
Lakebase is Databricks' managed PostgreSQL offering - OLTP to complement the platform's OLAP strengths. Its key advantage over a standalone Azure Database for PostgreSQL is native Unity Catalog integration, keeping transactional and analytical data assets under a single governance framework. Infrastructure considerations follow standard Azure PaaS database patterns: private endpoints, backup policies, and HA configuration.
4.4 Pools
Pools hold pre-warmed VM instances so clusters can start in 30–60 seconds rather than the 3–5 minutes required to provision VMs from scratch. Idle VMs incur VM cost only - no DBU charges. Key configuration decisions are the minimum idle instance count (speed vs cost trade-off), maximum pool capacity (scaling guardrail), and idle auto-termination timeout. Note that a Pool is tied to a single VM SKU - multiple VM types means multiple pools.
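The three configuration decisions just listed map directly onto fields in an Instance Pools API create request. A minimal sketch, with the VM SKU left as a placeholder:

```python
import json

# Sketch of an Instance Pools create payload. <vm-sku> is a placeholder;
# remember a pool is tied to exactly one VM SKU.
pool_config = {
    "instance_pool_name": "etl-worker-pool",
    "node_type_id": "<vm-sku>",
    "min_idle_instances": 2,   # pre-warmed VMs kept ready (VM cost only, no DBUs)
    "max_capacity": 20,        # scaling guardrail across all attached clusters
    "idle_instance_autotermination_minutes": 30,  # release warm VMs after idling
}

print(json.dumps(pool_config, indent=2))
```

The `min_idle_instances` value is the speed-versus-cost dial: higher means faster cluster starts but more idle VM spend.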
5. Data Storage Architecture
5.1 External Storage - The Real Data Location
This is a point of confusion for many architects encountering Databricks for the first time: Databricks does not store your business data. It processes it.
The actual data lives in Azure Data Lake Storage Gen2 (ADLS Gen2) - typically in a storage account that your organisation owns and manages. Databricks clusters read from and write to this storage account during processing.
The separation is deliberate and beneficial:
Data persists independently of the Databricks workspace. Delete the workspace and the data is unaffected.
Multiple workspaces (Dev/Test/Prod) can be pointed at different storage containers in the same storage account, or entirely separate storage accounts.
Storage-level security (private endpoints, firewall rules, RBAC) is under your control.
5.2 Delta Lake
Delta Lake is the storage format that makes the Lakehouse work. From an infrastructure perspective, Delta tables look like directories of Parquet files with an accompanying _delta_log folder containing transaction logs. But the behaviour they enable is fundamentally different from raw Parquet:
ACID transactions - Multiple writers do not corrupt data; partial writes are rolled back automatically.
Time travel - Query any previous version of a table:
SELECT * FROM my_table VERSION AS OF 10
Schema enforcement - New data that does not match the table schema is rejected at write time.
Z-ordering and OPTIMIZE - File layout optimisation for faster query performance.
Architect's Note on Storage Costs: Delta tables accumulate transaction log entries and old file versions over time. The VACUUM command removes old data files that are no longer referenced, but it must be run regularly. Factor this into operational runbook design, and be aware that VACUUM by default retains 7 days of history.
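For the operational runbook, the retention rule above can be encoded so that maintenance jobs never silently shorten history. A small sketch of the SQL a scheduled job might submit via `spark.sql(...)` - the table name is illustrative:

```python
# Delta retains 7 days (168 hours) of history by default; shorter
# retention should be a deliberate, reviewed decision, not a default.
DEFAULT_RETENTION_HOURS = 168  # 7 days

def vacuum_statement(table, retain_hours=DEFAULT_RETENTION_HOURS):
    # Build the VACUUM statement a maintenance job would run.
    return f"VACUUM {table} RETAIN {retain_hours} HOURS"

print(vacuum_statement("prod_catalog.sales.customer_orders"))
```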
5.3 Medallion Architecture
The Medallion Architecture is the universally adopted pattern for organising data in a Lakehouse. It gives infrastructure architects a clear mental model for understanding data flow and access patterns - which directly informs storage account structure, access control design, and network topology.

From a storage design perspective, Bronze, Silver, and Gold are typically separate containers or folders within ADLS Gen2, with different access controls. Bronze may be write-only for ingestion pipelines but read-only for most users; Gold is broadly readable by BI consumers.
6. Identity and Access Architecture
6.1 Authentication
Databricks integrates natively with Microsoft Entra ID (formerly Azure Active Directory). This is not optional in enterprise deployments - it is the foundation of all authentication. Users log in to Databricks using their existing corporate identities, and SSO works out of the box via SAML 2.0 or OIDC.
From an infrastructure standpoint, the critical configurations are:
SCIM provisioning - Automate user/group synchronisation from Entra ID to Databricks. Without this, user management becomes manual and drifts over time.
Conditional Access - Apply Entra ID Conditional Access policies to Databricks access (MFA enforcement, device compliance, location restrictions).
6.2 Authorisation Layers
Understanding the layers of authorisation in Databricks prevents a common mistake - assuming that Entra ID RBAC is sufficient.

Azure RBAC controls who can see the Databricks workspace resource in the Azure portal and perform ARM-level operations (like deleting the workspace). The built-in role is "Contributor" or "Owner" on the resource.
Workspace RBAC controls what users can do inside Databricks - who can create clusters, run jobs, manage users, and so on.
Unity Catalog is where fine-grained data access control lives - who can read which table, which schema, which catalog. This is the layer that actually protects data.
Cluster Policies define what kind of compute users are permitted to create - preventing, for example, a data scientist from accidentally spinning up a 64-node cluster in production.
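To make the cluster-policy layer concrete: a policy is a JSON document of attribute rules (fixed, allowlist, range). The sketch below shows the shape of a production policy under those assumptions; the SKU names are placeholders, and the exact rule set should come from your own standards.

```python
import json

# Sketch of a production cluster policy definition. The SKU names are
# placeholders; rule types shown are "allowlist", "range", and "fixed".
prod_policy = {
    "node_type_id": {
        "type": "allowlist",
        "values": ["<approved-sku-1>", "<approved-sku-2>"],
    },
    "autotermination_minutes": {
        "type": "range",
        "minValue": 10,
        "maxValue": 60,   # no cluster may idle longer than an hour
    },
    "autoscale.max_workers": {
        "type": "range",
        "maxValue": 16,   # hard ceiling on cluster size
    },
    "azure_attributes.availability": {
        "type": "fixed",
        "value": "SPOT_WITH_FALLBACK_AZURE",
    },
}

print(json.dumps(prod_policy, indent=2))
```

A policy like this stops the 64-node-cluster accident at creation time, before any cost is incurred.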
6.3 Service Principals and Automation
Any automation - CI/CD pipelines, infrastructure-as-code, scheduled jobs accessing external systems - should authenticate using Service Principals, not user accounts. Best practices:
Create dedicated service principals per workspace (Dev, Test, Prod) rather than sharing one across environments.
Store service principal credentials in Azure Key Vault, not in notebooks or configuration files.
Use Managed Identity for cluster access to ADLS Gen2 wherever possible, avoiding the need to manage storage access keys entirely.
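Where Managed Identity is not an option and a service principal must be used, the Key Vault guidance above typically shows up as Spark configuration with a secret-scope reference, so the secret value never lands in a notebook. A sketch, with the storage account, tenant ID, scope, and key names all placeholders for your environment:

```python
# Spark configuration for service-principal access to ADLS Gen2, with
# the client secret pulled from a Key Vault-backed secret scope rather
# than hardcoded. All <...> values and the scope/key names are
# placeholders.
storage = "<storage-account>"
suffix = f"{storage}.dfs.core.windows.net"

spark_conf = {
    f"fs.azure.account.auth.type.{suffix}": "OAuth",
    f"fs.azure.account.oauth.provider.type.{suffix}":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    f"fs.azure.account.oauth2.client.id.{suffix}": "<sp-application-id>",
    # The {{secrets/...}} reference is resolved by Databricks at cluster
    # start, so the secret value never appears in cluster config or logs.
    f"fs.azure.account.oauth2.client.secret.{suffix}":
        "{{secrets/prod-scope/sp-client-secret}}",
    f"fs.azure.account.oauth2.client.endpoint.{suffix}":
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

for key, value in spark_conf.items():
    print(key, "=", value)
```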
7. Networking Architecture
Azure Databricks operates across two planes: a control plane (Databricks-managed backend and web app) and a compute plane (where data is actually processed). The compute plane has two modes - classic, running inside your Azure subscription, and serverless, running in a Databricks-managed environment in the same region.
There are four network boundaries to secure:
Users → Control Plane: Securing user and application access to the Databricks workspace web UI and APIs.
Control Plane ↔ Classic Compute Plane: Securing the connection between Databricks-managed services and your clusters running in your Azure subscription.
Classic Compute Plane → Azure Storage/Services: Securing data access from your clusters via service endpoints or private endpoints in your VNet.
Serverless Compute Plane → Azure Storage/Services: Securing outbound connections from serverless compute to storage and other Azure services via NCCs and private endpoints.

7.1 Classic Compute Plane Networking
VNet Injection
By default, Databricks creates a locked managed VNet in your subscription.
VNet injection lets you deploy clusters into your own VNet, enabling custom routing, NSG rules, service/private endpoints, and on-premises connectivity via VPN or ExpressRoute.
Two dedicated subnets are required - a host subnet and a container subnet - each with an NSG enforcing Databricks-required rules.

Secure Cluster Connectivity (No Public IP)
Enabled by default on new workspaces. Clusters connect outbound-only to the control plane via a relay - no open inbound ports, no public IPs on cluster nodes. When using the managed VNet, Databricks automatically provisions a NAT gateway for outbound traffic.
Back-End Private Link
Deploys a private endpoint directly in your workspace VNet, routing all cluster-to-control-plane traffic over the Microsoft backbone - never the public internet. Requires Premium tier + VNet injection. Recommended topology is hub-spoke, with a dedicated isolated browser authentication workspace in the transit VNet.
VNet Peering
Peers the Databricks VNet (managed or injected) with another Azure VNet. The peering must be configured from both sides.
7.2 Front-End Networking (Users → Databricks)
IP Access Lists
Allow/block specific public IPs or subnets at the workspace level (admin-configured via REST API) or account level (across all workspaces). Block lists are evaluated first. Private Link traffic bypasses IP access lists.
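The evaluation order just described (block lists first) is worth baking into review checklists. For orientation, this is roughly the shape of the payload an admin submits to the IP access list API - the label and CIDR below are illustrative, using a documentation address range:

```python
import json

# Sketch of an IP access list payload. Block lists are evaluated before
# allow lists, and Private Link traffic bypasses these lists entirely.
allow_list = {
    "label": "corp-egress",
    "list_type": "ALLOW",                # or "BLOCK"
    "ip_addresses": ["203.0.113.0/24"],  # RFC 5737 documentation range
}

print(json.dumps(allow_list, indent=2))
```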
Front-End Private Link
Routes all user access - web UI and API - through a private endpoint in your VNet. Two modes are supported: no public access (fully private, all traffic via private endpoint) or a hybrid model (private endpoint + approved public IPs, implemented via hub-spoke topology).

7.3 Serverless Compute Plane Networking
Network Connectivity Configurations (NCCs)
NCCs are account-level objects that manage private endpoint rules for serverless compute. Key limits: up to 10 NCCs per region, 100 private endpoints per region, and each NCC can attach to up to 50 workspaces. You create a private endpoint rule per target Azure resource (storage account, SQL database, etc.); running serverless resources must be restarted for rule changes to take effect.

Serverless Egress Control
Network policies (Premium tier only) restrict outbound connections from serverless compute to prevent data exfiltration. Managed via the account console.
Storage Firewall
Both classic and serverless compute can connect to storage accounts protected by Azure Storage firewall rules - classic compute uses service endpoints or private endpoints; serverless compute uses stable Databricks service tags.
8. Governance with Unity Catalog
8.1 Why Governance Became a Big Deal
Prior to Unity Catalog, Databricks governance was workspace-scoped. Every workspace maintained its own metastore, its own access controls, its own data catalogue. In a multi-workspace enterprise, this created a governance nightmare - the same table might be defined differently in three workspaces, access controls were duplicated and inconsistent, and there was no cross-workspace data discovery.
Unity Catalog solves this by providing a single, centralised governance layer that spans all workspaces in a region. It also supports data mesh architectures - where domain teams own their data products but governance rules are centrally enforced.
8.2 Key Concepts
The Unity Catalog namespace is a three-level hierarchy:
<catalog-name>.<schema-name>.<object-name>
For example: prod_catalog.sales.customer_orders

Metastore - The top-level governance construct. One per region per Databricks account. Attached to one or more workspaces.
Catalog - Equivalent to a database in traditional terms. Typically used to separate environments (prod, dev) or business domains (finance, sales).
Schema - A logical grouping of tables within a catalog.
Table/View/Function - The actual data objects. Tables can be managed (Databricks controls the data lifecycle) or external (data lives in your ADLS Gen2, Databricks only holds metadata).
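The three-level hierarchy means privileges can be granted at the catalog, schema, or table level and inherited downward. A minimal sketch of the statements a governance pipeline might submit via `spark.sql(...)` - the group name `analysts` is illustrative:

```python
# Build Unity Catalog GRANT statements for each level of the
# catalog.schema.object hierarchy. Names are illustrative.
def grant(privilege, securable, principal):
    return f"GRANT {privilege} ON {securable} TO `{principal}`"

statements = [
    grant("USE CATALOG", "CATALOG prod_catalog", "analysts"),
    grant("USE SCHEMA", "SCHEMA prod_catalog.sales", "analysts"),
    grant("SELECT", "TABLE prod_catalog.sales.customer_orders", "analysts"),
]

for stmt in statements:
    print(stmt)
```

Note that SELECT on the table is useless without USE CATALOG and USE SCHEMA on its parents - a frequent source of "permission denied" tickets in new Unity Catalog deployments.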
Final Thoughts
Azure Databricks is one of the most architecturally rich services you will encounter in the Azure ecosystem. It is not a point solution - it is a platform that spans networking, compute, storage, identity, and governance in ways that require genuine infrastructure expertise to get right.
A few things are worth keeping front of mind as you approach your first or next Databricks engagement:
Start with networking. The VNet injection model, NSG rules, and private endpoint design will determine whether everything else works. Get the networking wrong and nothing else matters. Take the time to understand the Control Plane/Compute Plane boundary before touching subnet CIDRs.
Unity Catalog is not optional. Deploying Databricks without Unity Catalog in 2025 is the equivalent of deploying without Azure RBAC. It will create technical debt that is painful to unwind later. Enable it from day one, even on Dev workspaces.
Cluster policies save more than money. Yes, they prevent runaway compute costs - but they also enforce security posture, autotermination behaviour, and approved VM types. Treat them as infrastructure policy, not just a cost-control mechanism.
Audit everything. Databricks generates rich audit logs. Ship them to Azure Monitor or a SIEM from day one. The compliance team will ask for them eventually, and retroactive log collection is not possible.
Serverless is the strategic direction of the platform, though many enterprises will continue to operate hybrid (classic + serverless) models depending on workload and regulatory constraints. Serverless SQL Warehouses are already production-ready and in wide enterprise use. Serverless compute for notebooks and jobs is maturing rapidly. Design your architecture to accommodate a gradual shift toward serverless - it simplifies operations significantly.
The infrastructure architect's role in a Databricks deployment is not to understand every Spark optimisation or Delta Lake feature. It is to ensure the platform is secure, governed, well-networked, and operationally sound - so that the data teams can focus on building things that matter.
If you found this useful, tap Subscribe at the bottom of the page to get future updates straight to your inbox.