In this edition
Azure Architect Tip of the Day
Use AKS System and User Node Pools for Better Reliability and Cost Control
A simple way to make AKS clusters more stable—and cheaper—is to separate system and user workloads into different node pools. Running everything on the default node pool often leads to system pods being evicted when app workloads spike, causing unnecessary instability and over-provisioning.
Why This Matters
When system and app pods share the same nodes, you risk:
System components (CoreDNS, metrics-server, etc.) getting evicted under load
Oversized nodes just to support system overhead
Difficulty scaling app workloads independently
A Better Approach
System Node Pool
Dedicated to AKS system components
Use small, cost-efficient VMs (e.g., Standard_D2s_v3)
Add the taint: CriticalAddonsOnly=true:NoSchedule
Keep at least 1–2 nodes for stability (system pools cannot scale to zero)
User Node Pools
Run all application workloads
Enable cluster autoscaler
Choose VM sizes that match workload patterns (CPU/memory heavy)
Can scale to zero for dev/test if min-count=0 (see the CLI sketch below)
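For teams that manage clusters from the command line, a minimal Azure CLI sketch of this split might look like the following. The resource group, cluster name, pool names, and the user-pool VM size are placeholders, so adjust them to your environment.

    # Dedicated system pool: small nodes, tainted so only critical system pods land here
    az aks nodepool add \
      --resource-group my-rg \
      --cluster-name my-aks \
      --name systempool \
      --mode System \
      --node-count 2 \
      --node-vm-size Standard_D2s_v3 \
      --node-taints CriticalAddonsOnly=true:NoSchedule

    # User pool: application workloads, autoscaled, allowed to scale to zero for dev/test
    az aks nodepool add \
      --resource-group my-rg \
      --cluster-name my-aks \
      --name userpool \
      --mode User \
      --node-vm-size Standard_D4s_v5 \
      --enable-cluster-autoscaler \
      --min-count 0 \
      --max-count 5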
Cost Benefit
Teams typically see 20–30% savings by right-sizing system nodes and letting user nodes autoscale based on demand.
Azure Good-Reads
Azure VM Rightsizing: Optimize Performance & Cloud Costs Smartly
There’s a peculiar phenomenon that happens in cloud environments: VMs tend to grow, but they rarely shrink.
A project usually starts with the mindset of “let’s provision something safe,” and soon you’re running a Standard_D16s_v5 for a workload that barely touches 15% CPU. Sound familiar?
The cost impact is where things really start to hurt.
An oversized VM can quietly add far more to your monthly bill than necessary. When this pattern repeats across multiple workloads, the overall cost impact becomes significant, and it quickly raises questions about whether your cloud resources are being used efficiently.
But here’s the real challenge: rightsizing isn’t about randomly downsizing until your bill stops hurting. Done poorly, it leads to performance issues, SLA breaches, urgent messages from application owners, and those unwanted 2 AM escalation calls.
The real art is finding the sweet spot where performance and cost efficiency meet.
The Real Cost of Oversizing
Let’s talk numbers:
Take a typical environment with around 50 VMs. If even half of them are oversized by a single tier, at roughly $300 USD per VM per month for common general-purpose sizes, you’re wasting close to 25 × $300 = $7,500 USD every month, or $90,000 USD a year. That’s not a small amount; that’s the cost of a full-time senior engineer.
Unlike on-prem hardware that just depreciates quietly in the corner, cloud resources bill you by the hour—730 times each month. Every oversized VM is a recurring decision.
Why do teams overprovision?
Because they are cautious.
Cautious about performance complaints.
Cautious about outages.
Cautious about being the architect who slowed down production.
Is that mindset understandable? Yes.
Is it costly? Very.
So the cost is high, the waste is real, and the reasons behind oversizing are understandable. But cautious provisioning isn’t a strategy—it’s just a reaction. The only way to rightsize confidently is to replace assumptions with data. And that starts with understanding how your workloads actually behave.
Understanding Your Workload Patterns
Before changing any VM size, it’s important to understand how your workloads actually behave. Azure Monitor and Log Analytics give you the visibility you need to make informed decisions.
Here are the key things to focus on:
CPU utilisation:
This is the obvious starting point, but don’t rely on averages alone. A VM sitting at 20% CPU most of the day might still jump to 80% during batch jobs or peak activity. Those peaks matter.
Memory pressure:
For many applications—especially databases—memory is often more important than CPU. If the VM doesn’t have enough memory, it will fall back to disk more frequently, and performance will drop sharply even if CPU looks fine.
Disk IOPS and throughput:
This is one of the most commonly overlooked areas. In Azure, storage performance is closely tied to VM size. Downsizing a VM may cut your IOPS limit significantly, which can cause issues for disk-heavy workloads.
Network throughput:
Larger VM sizes typically come with higher network bandwidth. If your application depends on frequent or heavy data movement, downsizing could introduce unexpected bottlenecks.
Local/Temporary SSDs:
Local SSDs offer high performance, but their capacity is linked directly to the VM size. When you downsize, the temporary disk may shrink—or disappear entirely. This can be problematic if a workload, such as a database or caching layer, relies on that disk for fast temporary storage. It’s one of the most common oversights during rightsizing.
And finally, don’t ignore the monitoring window.
A week of data may miss monthly processing jobs. A month may miss quarterly spikes. For production workloads, aim to collect at least 30 days of metrics—and ideally cover any known business cycles—to get an accurate picture.
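Because IOPS, network bandwidth, and temporary disk capacity are all tied to the SKU, it’s worth checking the hard limits of a candidate target size before committing to it. One quick way to do that is shown below; the region and size are placeholders, and the exact capability names in the output vary by SKU.

    # Inspect per-SKU limits (vCPUs, memory, uncached disk IOPS, temp disk, max NICs, etc.)
    az vm list-skus \
      --location eastus \
      --size Standard_D4s_v5 \
      --query "[0].capabilities" \
      --output table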
The Rightsizing Process
Here’s a practical, step-by-step approach that works well across most environments.
Phase 1: Discovery and Assessment
Start with Azure Advisor. It’s the most straightforward place to begin, and while many teams know about it, surprisingly few review its recommendations regularly. Advisor looks at your utilisation patterns and highlights where you may be overprovisioned.
However, take its guidance as a starting point—not the final answer. Advisor uses conservative thresholds (for example, workloads running below 5% CPU or 10% network utilisation over long periods). This helps identify obvious waste but won’t always catch more subtle optimisation opportunities.
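If you prefer to pull these recommendations programmatically rather than browsing the portal, the Azure CLI exposes them directly; a quick sketch against the current subscription:

    # List Advisor cost recommendations for the current subscription
    az advisor recommendation list \
      --category Cost \
      --output table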
To get a clearer picture, build custom Azure Monitor workbooks that track:
P95 and P99 CPU utilisation
Memory usage trends
Disk performance compared to VM limits
Network throughput relative to VM capacity
This deeper analysis helps distinguish between workloads that are genuinely oversized and those that actually need the resources they currently have.
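As one example of what such a workbook query can look like, here is a Log Analytics (KQL) query, run here through the Azure CLI, that computes P95/P99 CPU per VM over the last 30 days. It assumes your VMs send guest metrics to the workspace’s Perf table; agents writing to the newer InsightsMetrics table need the equivalent query against that table, and the workspace GUID is a placeholder.

    # P95/P99 CPU per computer over 30 days, from a Log Analytics workspace
    az monitor log-analytics query \
      --workspace <workspace-guid> \
      --analytics-query "
        Perf
        | where TimeGenerated > ago(30d)
        | where ObjectName == 'Processor' and CounterName == '% Processor Time' and InstanceName == '_Total'
        | summarize p95_cpu = percentile(CounterValue, 95), p99_cpu = percentile(CounterValue, 99) by Computer
        | order by p95_cpu desc" \
      --output table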
Phase 2: Creating Your Rightsizing Candidates
Not every VM is an ideal target for rightsizing, so it helps to prioritise thoughtfully.
Low-risk environments: Development and test systems are great starting points. They’re safer to experiment with, and what you learn there often applies directly to production.
High-cost instances: Focus on large, expensive VMs first. Downsizing a Standard_D32s_v3 saves far more than adjusting a smaller VM.
Risk profile: Avoid beginning with critical production databases or anything where a rollback could cause major disruption.
Business timing: Don’t resize workloads tied to important business cycles—like financial reporting—just before a major deadline.
Phase 3: Testing and Validation
This is where many teams underestimate the effort required. Proper testing is essential and takes time, but it's what ensures you can rightsize with confidence.
For each target VM:
Capture baseline performance: Record CPU, memory, storage metrics, application response times, and any user-reported behaviours.
Understand how the application behaves: Is it CPU-heavy? Memory-dependent? IO-bound? Does it scale vertically or horizontally? These answers influence what kind of VM makes sense.
Evaluate different VM families: Rightsizing isn’t always about going smaller within the same family. An E-series VM may suit a memory-heavy workload better than a D-series of similar size.
Test in a non-production environment: Clone the workload, resize the clone, and run realistic tests. Tools like Apache JMeter or Azure Load Testing can help simulate load (a minimal JMeter invocation follows this list).
Plan the change window: Resizing requires a VM restart (typically a few minutes). Schedule appropriately, get approvals, and prepare a rollback plan.
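For the load-simulation step, a minimal JMeter run in non-GUI mode looks like this; the test plan file and output paths are placeholders, and the plan itself has to reflect your application’s real traffic mix to be meaningful.

    # Run an existing JMeter test plan headlessly and generate an HTML report
    jmeter -n -t rightsizing-test.jmx -l results.jtl -e -o ./report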
Phase 4: Implementation
Once testing is complete and you’re ready to proceed:
Take a snapshot or backup first: It’s relatively inexpensive and gives you a safety net in case anything unexpected happens (a minimal CLI sketch follows this list).
Use IaC (ARM/Bicep/Terraform) rather than manual changes: This ensures consistency, reduces errors, and creates a repeatable process.
Monitor closely after the resize: Track performance for 48–72 hours. Some issues don’t surface immediately.
Keep a rollback option available for a week: This helps if problems appear several days later, which is more common than many expect.
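A minimal sketch of the snapshot-then-resize steps with the Azure CLI; the resource group, VM name, and target size are placeholders, and in practice you would drive the change through your IaC pipeline rather than ad-hoc commands.

    # Snapshot the OS disk first, as a safety net
    OS_DISK_ID=$(az vm show -g my-rg -n app-vm-01 \
      --query "storageProfile.osDisk.managedDisk.id" -o tsv)
    az snapshot create -g my-rg -n app-vm-01-pre-resize --source "$OS_DISK_ID"

    # Resize the VM (this restarts it), then verify the new size
    az vm resize -g my-rg -n app-vm-01 --size Standard_D4s_v5
    az vm show -g my-rg -n app-vm-01 --query "hardwareProfile.vmSize" -o tsv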
Special Considerations
Databases
They’re the most oversized—and the most dangerous to undersize.
Reasons:
Databases love memory
IOPS limits shrink with VM size
Query performance craters when memory gets tight
Sometimes the right answer is shifting from D-series to E-series, not going smaller.
Web & App Servers
These scale horizontally better than vertically.
Running three smaller VMs instead of one large one improves resilience—and sometimes lowers cost.
Batch Jobs & Background Workers
These workloads are perfect for scheduled up/down scaling.
Scale up during nightly jobs → scale down after.
Or move to Azure Batch or Azure Functions for real elasticity.
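If the workload stays on plain VMs, the schedule can be as simple as two resize calls driven by whatever scheduler you already use (cron, Azure Automation, a pipeline). A rough sketch with placeholder names and sizes:

    # Before the nightly batch window: scale the worker up
    az vm resize -g my-rg -n batch-worker-01 --size Standard_D8s_v5

    # After the batch window: scale back down (or deallocate the VM if nothing else runs on it)
    az vm resize -g my-rg -n batch-worker-01 --size Standard_D2s_v5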
Final Thoughts
Rightsizing isn’t a one-time task—it’s a habit. With the right data and a careful approach, you can reduce costs without affecting performance. Start small, validate each change, and make optimisation a regular part of how you run your environment. Over time, these adjustments lead to a leaner, more efficient cloud footprint.
If you found this useful, tap Subscribe at the bottom of the page to get future updates straight to your inbox.
