Reducing Karpenter Costs by Up to 56% Through Kubernetes Resource Right-Sizing with Goldilocks + VPA
If you've ever watched your infrastructure costs creep up and had a nagging feeling that "something is being wasted," you're not alone. I once ran an EKS cluster where I set generous CPU requests and forgot about them — then one day I looked at the Karpenter node list and saw a row of m5.2xlarge instances running at 20% actual utilization. That's when I first started digging seriously into VPA and Goldilocks. If you know how to use kubectl and understand Deployments and HPA, you'll be able to follow along without trouble.
The core idea is simple. Because Karpenter determines node size based on requests — not actual usage — precisely tuning your requests alone can reduce infrastructure costs by 30–56% for backend services with relatively uniform traffic. Goldilocks safely leverages VPA's analysis engine to display recommended requests values for each Deployment in a dashboard. Another major advantage is that it doesn't restart pods or touch them automatically, so you can attach it to a running cluster without worry.
By the end of this post, you'll have step-by-step configuration you can apply to a production cluster right away — covering how Goldilocks + VPA works under the hood, how it interacts with Karpenter Consolidation, and patterns for using it alongside HPA without conflicts.
Why Requests Determine Cost
Why Karpenter Looks at Requests
In one sentence, Karpenter's node provisioning logic works like this: it aggregates the resource requests of unscheduled pods and finds the most cost-efficient instance type that can accommodate them, then brings up a node.
Here's where the problem arises. If a pod that actually uses only 100m CPU is declared with requests: cpu: "500m", Karpenter sees it as if many 500m pods are piling up and selects a larger node than necessary. Conversely, setting it too low causes OOM.
Bin-packing: A strategy for packing containers onto nodes as densely as possible to reduce wasted space. The closer requests values are to actual usage, the higher the bin-packing density and the fewer nodes needed.
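To make the cost mechanics concrete, here is a minimal Python sketch of this selection logic — pick the cheapest instance whose allocatable CPU fits the summed requests. The instance names, capacities, and prices are illustrative placeholders, not real AWS pricing, and real Karpenter also considers memory, architecture, and many other constraints.

```python
# Hypothetical sketch of Karpenter-style instance selection: find the
# cheapest instance type whose allocatable CPU fits the summed pod requests.
# Specs/prices are illustrative only, not real AWS data.
INSTANCE_TYPES = [
    ("m5.large",   2000, 0.096),   # (name, allocatable mCPU, $/hour)
    ("m5.xlarge",  4000, 0.192),
    ("m5.2xlarge", 8000, 0.384),
]

def cheapest_fit(pod_requests_mcpu):
    """Return the cheapest instance that fits the total requested mCPU."""
    total = sum(pod_requests_mcpu)
    for name, mcpu, price in INSTANCE_TYPES:  # list is sorted cheapest-first
        if mcpu >= total:
            return name
    return None  # no single instance fits

pods = 6
print(cheapest_fit([500] * pods))  # declared 500m each -> 3000m -> m5.xlarge
print(cheapest_fit([100] * pods))  # actual usage 100m each -> 600m -> m5.large
```

The same six pods land on a node twice as expensive purely because of the declared requests — exactly the gap Goldilocks surfaces.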
How VPA Generates Recommendations
VPA consists of three components: Recommender, Admission Controller, and Updater. Of these, the Recommender is responsible for actually collecting historical resource usage and calculating recommendations.
What matters is the algorithm. VPA Recommender doesn't use a simple average — it uses histogram-based percentile estimation (default p90–p95). Why does this matter? Because it means the top 5–10% peak usage is not reflected in the recommendation by default. For services where traffic fluctuates significantly by time of day, you might feel that the VPA target doesn't capture peaks — but that's by design. Understanding this also makes the "memory seasonality" drawback discussed later much easier to grasp.
Recommendations come in three flavors:
| Recommendation | Meaning | Usage Tips |
|---|---|---|
| `lowerBound` | Minimum value for operation without throttling | The hard floor — never go below this |
| `target` | VPA's recommended requests value (p90–p95 basis) | Use this as your baseline for stable operation |
| `upperBound` | Maximum estimated value the pod may need | Reference for limits settings (do NOT use this directly as limits) |
I was confused about this at first — I thought you could just use upperBound directly as limits. But upperBound is purely a reference, and it's better to set limits with additional headroom based on your own judgment. Also note that lowering limits to match requests can worsen CPU throttling.
How Goldilocks Layers on Top of VPA
Goldilocks doesn't implement any analysis logic of its own. Instead, when you label a namespace, it automatically creates VPA objects in updateMode: "Off" for all Deployments in that namespace.
You might wonder why updateMode: "Off" means pods aren't touched — VPA's Updater component is what actually restarts pods, and in Off mode the Updater simply doesn't run. The Recommender continues collecting data and calculating recommendations, but applying those values to pods is left to humans.
Goldilocks's role is to take those recommendations and display them in a clean web dashboard. Honestly, the first time I opened the dashboard and saw "this pod has 500m CPU set but target is 80m," it was quite a shock.
The full optimization loop looks like this:
Observe actual usage (VPA Recommender — histogram percentile-based)
↓
Review recommendations in Goldilocks dashboard
↓
Apply requests to Deployment YAML / GitOps
↓
Karpenter selects smaller nodes based on reduced requests
↓
Consolidation automatically removes idle nodes

What You Must Know When Using HPA and VPA Together
Applying HPA and VPA simultaneously on the same metric (CPU) causes conflicts. HPA adds pods when CPU usage is high, while VPA tries to change CPU requests — two controllers interfering with each other toward different goals.
From my experience, ignoring this and attaching both with CPU metrics led to pods restarting continuously in the middle of the night. So now I use a pattern that clearly separates their roles.
Recommended pattern: Assign CPU scaling to HPA and memory requests tuning to VPA (Goldilocks) — this role separation lets you use both tools together without conflicts.
Attaching to Your Cluster: From Installation to Applying Recommendations
Example 1: Installing Goldilocks + VPA and Basic Configuration
First, install VPA. It's a prerequisite since Goldilocks uses VPA's Recommender internally.
One thing worth noting — the fairwinds-stable/vpa Helm chart used here is a Fairwinds-wrapped version and may differ somewhat in configuration structure from the official kubernetes/autoscaler VPA. If you're already using the official VPA, check the Goldilocks official docs for integration options first.
```bash
# Install VPA (Fairwinds wrapper version)
helm repo add fairwinds-stable https://charts.fairwinds.com/stable
helm install vpa fairwinds-stable/vpa \
  --namespace vpa \
  --create-namespace

# Install Goldilocks
helm install goldilocks fairwinds-stable/goldilocks \
  --namespace goldilocks \
  --create-namespace

# Add label to the namespace you want to monitor
kubectl label namespace production goldilocks.fairwinds.com/enabled=true

# Temporary access to dashboard (for local/dev inspection)
kubectl port-forward svc/goldilocks-dashboard 8080:80 -n goldilocks
```

Security note: `kubectl port-forward` is for temporary local access only. To expose the dashboard persistently in production, use an Ingress or LoadBalancer Service and always configure authentication (OAuth, SSO, etc.) alongside it. Exposing it externally without authentication reveals your cluster's resource information in plain sight.
Once you apply the label, Goldilocks automatically creates VPA objects like the following:
```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-goldilocks
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Off"  # Updater disabled — recommendations only, no pod restarts
```

Here's an example of reviewing recommendations in the dashboard and applying them to a Deployment:
```yaml
# Before: over-provisioned state
# After: adjusted based on Goldilocks target
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  namespace: production
spec:
  template:
    spec:
      containers:
        - name: my-app
          image: my-app:latest
          resources:
            requests:
              cpu: "250m"      # was 500m -> 50% reduction
              memory: "512Mi"  # was 1Gi -> 50% reduction
            limits:
              cpu: "800m"      # limits kept with headroom above requests (prevents throttling)
              memory: "1Gi"
```

There's a reason limits weren't lowered to match requests. Setting CPU limits too low causes severe throttling and latency spikes during brief usage bursts. Adjusting limits should be done separately from requests, only after sufficient monitoring.
| Change | Before | After | Savings |
|---|---|---|---|
| CPU request | 500m | 250m | 50% |
| Memory request | 1Gi | 512Mi | 50% |
| Expected node size | m5.xlarge | m5.large | ~40% cost reduction |
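The node-count arithmetic behind a table like this can be sketched in a few lines. The numbers here are illustrative only — real scheduling also reserves DaemonSet and system overhead per node, and Karpenter may mix instance sizes.

```python
import math

def nodes_needed(pods, request_mcpu, node_allocatable_mcpu):
    """Minimum nodes required to schedule `pods` identical pods by CPU request."""
    return math.ceil(pods * request_mcpu / node_allocatable_mcpu)

pods = 20
before = nodes_needed(pods, 500, 4000)  # 500m requests on 4-vCPU nodes
after = nodes_needed(pods, 250, 4000)   # same pods after halving requests
print(before, after)  # -> 3 2 : one node fewer for the same workload
```

Halving requests here removes a third of the nodes, which is roughly how request tuning alone translates into the savings percentages quoted above.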
Example 2: Integrating with Karpenter NodePool Consolidation
When requests are reduced, Karpenter's Consolidation works more aggressively. Karpenter's spec.disruption.budgets (stable in the v1 API) lets you control node disruption during business hours — without this setting, enabling Consolidation can cause pods to be rescheduled without warning during the day.
consolidateAfter: 30s can be confusing at first — it means that 30 seconds after a node is judged underutilized, Karpenter begins draining it: the node's pods are rescheduled onto other nodes and the node is removed. 30 seconds is quite aggressive, so observe your traffic patterns thoroughly before tuning this value.
```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized  # renamed from WhenUnderutilized in the v1 API
    consolidateAfter: 30s  # Consolidation starts 30s after underutilized detection
    budgets:
      - nodes: "10%"       # Max percentage of nodes that can be disrupted simultaneously
      - nodes: "0"         # Completely block node disruption during business hours (Mon–Fri 09:00–18:00)
        schedule: "0 9 * * 1-5"
        duration: 9h
```

Consolidation: A Karpenter feature that reschedules pods spread across multiple nodes onto fewer nodes and deletes idle nodes. The closer requests are to actual usage, the more aggressively this operates.
Example 3: HPA + VPA Role Separation Pattern
A separation pattern where HPA handles CPU and Goldilocks (VPA) handles memory requests tuning.
```yaml
# HPA: horizontal scaling based on CPU utilization
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70  # Scale out when average CPU utilization across all pods exceeds 70%
```

averageUtilization: 70 may not be intuitive at first — it means "sum of actual CPU usage across all pods ÷ (requests × pod count)" exceeds 70%, triggering scale-out; below that, it scales in. The more accurate your requests, the more accurate HPA's scaling decisions become.
```yaml
# Resource settings within the Deployment
# CPU is the HPA basis, so keep it independent of Goldilocks recommendations
# Update only memory based on Goldilocks target
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  namespace: production
spec:
  template:
    spec:
      containers:
        - name: my-app
          image: my-app:latest
          resources:
            requests:
              cpu: "250m"      # HPA operation basis — use Goldilocks recommendation as reference only
              memory: "512Mi"  # Updated based on Goldilocks target
            limits:
              cpu: "800m"
              memory: "1Gi"
```

Pros and Cons Analysis
Advantages
| Item | Details |
|---|---|
| Maximized Karpenter efficiency | Precise requests → smaller instance type selection → higher bin-packing density |
| Cost savings | 30–56% infrastructure cost reduction for backend services with uniform traffic (real case: $52K → $23K/month) |
| Zero-downtime analysis | Operates exclusively in updateMode: Off — no pod restarts |
| Visualization | View recommendations per namespace and Deployment at a glance in a UI dashboard |
| Incremental adoption | When integrated with GitOps (Argo CD, Flux), supports safe PR-based gradual rollout |
Disadvantages and Caveats
| Item | Details | Mitigation |
|---|---|---|
| Manual intervention required | Goldilocks only provides recommendations; automatic application requires separate implementation | Build a pipeline with Argo CD/Flux to auto-generate PRs from recommendations |
| HPA conflicts | Simultaneous use of VPA Auto mode and CPU-based HPA causes mutual interference | Separate roles: CPU → HPA, Memory → VPA |
| Cold start | Minimum several days of observation data needed to generate p90 recommendations | Wait at least 3–7 days after installation before trusting recommendations |
| Over-consolidation | Too short a consolidateAfter value causes excessive pod rescheduling | Set between 30s–5m; disruption budgets are essential |
| Spot + no PDB | Karpenter + Spot combination without PDB leads to service instability | Set PodDisruptionBudget on all Deployments |
| Memory seasonality | If traffic patterns vary significantly by time of day, VPA p90 recommendations may miss peaks | Apply recommendations only after collecting data that includes sufficient peak-hour coverage |
PodDisruptionBudget (PDB): A setting that limits how many pods can go down simultaneously when Karpenter removes a node. Even just `minAvailable: 1` is enough to prevent service interruption.
The Most Common Mistakes in Practice
- Applying recommendations immediately after installing VPA. With only a few hours of data, recommendations won't reflect actual peaks at all. It's recommended to observe for at least a week before applying, to let the p90 histogram stabilize.
- Connecting both HPA and VPA to the CPU metric simultaneously. I made this mistake early on and ended up with pods restarting continuously overnight. When two controllers act on the same CPU signal toward different goals (HPA adding replicas while VPA resizes requests), pods become unstable. Follow the pattern of separating roles: CPU → HPA, Memory → VPA.
- Enabling Karpenter Consolidation without configuring disruption budgets. This can lead to situations where nodes go down during business hours and pods receiving traffic are all rescheduled at once. Setting `nodes: "0"` during business hours on the NodePool is essentially mandatory.
Closing Thoughts
Precisely tuning requests with Goldilocks + VPA is the work of giving Karpenter the correct information so it can do its job more intelligently. Real-world cases have reported savings in the range of $52K → $23K/month using this approach. Cost optimization starts not from adding new tools, but from giving accurate data to the scheduler you already have.
Three steps you can start right now:
- Start with a diagnosis. If your current cluster has two or more nodes with CPU utilization below 50%, this approach is likely to show immediate results. Label just one of your most expensive namespaces with Goldilocks and collect recommendations for a week — you'll immediately see how over-provisioned your current requests are.
- Apply recommendations to a single Deployment in a low-traffic environment. Lower requests to a level slightly above `lowerBound`, monitor for a few days to confirm no OOM or throttling occurs, then expand gradually. Keep limits with sufficient headroom, separate from requests.
- Set disruption budgets on your Karpenter NodePool and enable Consolidation. With reduced requests in place, enabling `consolidationPolicy: WhenEmptyOrUnderutilized` lets Karpenter automatically clean up idle nodes — and you'll see real cost savings materialize.
Next post: How to integrate with Argo CD to automatically generate PRs from Goldilocks recommendations and incorporate them into a GitOps pipeline
References
- Right-Sizing Kubernetes Resources with VPA and Karpenter | DEV Community
- Kubernetes Resource Optimization & Best Practices with Goldilocks | Fairwinds
- Right-size your Kubernetes Applications Using Open Source Goldilocks | AWS Open Source Blog
- How to Use Goldilocks VPA Recommendations to Right-Size Kubernetes Pod Resources | OneUptime
- GitHub - FairwindsOps/goldilocks
- Goldilocks Installation Official Docs
- Karpenter Official Docs - Disruption
- Kubernetes Cost Optimization: From $50K to $22K/Month with Karpenter, Spot, and VPA | ZeonEdge
- From VPA & Goldilocks to Automation with ScaleOps
- Answering Your Goldilocks Questions About How HPA and VPA Work Together | Fairwinds
- Understanding Karpenter Consolidation | StormForge
- Goldilocks vs Karpenter vs KRR for Kubernetes | Overcast Blog