Reducing Karpenter Costs by Up to 56% Through Kubernetes Resource Right-Sizing with Goldilocks + VPA
If you've ever watched your infrastructure costs creep up and had a nagging feeling that "something is being wasted," you're not alone. I once ran an EKS cluster where I set generous CPU requests and forgot about them — then one day I looked at the Karpenter node list and saw a row of m5.2xlarge instances running at 20% actual utilization. That's when I first started digging seriously into VPA and Goldilocks. If you know how to use kubectl and understand Deployments and HPA, you'll be able to follow along without trouble.
The core idea is simple. Because Karpenter determines node size based on requests — not actual usage — precisely tuning your requests alone can reduce infrastructure costs by 30–56% for backend services with relatively uniform traffic. Goldilocks safely leverages VPA's analysis engine to display recommended requests values for each Deployment in a dashboard. Another major advantage is that it doesn't restart pods or touch them automatically, so you can attach it to a running cluster without worry.
By the end of this post, you'll have step-by-step configuration you can apply to a production cluster right away — covering how Goldilocks + VPA works under the hood, how it interacts with Karpenter Consolidation, and patterns for using it alongside HPA without conflicts.
Why Requests Determine Cost
Why Karpenter Looks at Requests
In one sentence, Karpenter's node provisioning logic works like this: it aggregates the resource requests of unscheduled pods and finds the most cost-efficient instance type that can accommodate them, then brings up a node.
Here's where the problem arises. If a pod that actually uses only 100m CPU is declared with requests: cpu: "500m", Karpenter sees it as if many 500m pods are piling up and selects a larger node than necessary. Conversely, setting it too low causes OOM.
Bin-packing: A strategy for packing containers onto nodes as densely as possible to reduce wasted space. The closer requests values are to actual usage, the higher the bin-packing density and the fewer nodes needed.
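To make the cost mechanics concrete, here is a minimal Python sketch of this selection logic — pick the cheapest instance whose allocatable CPU fits the summed requests. The instance names, capacities, and prices are illustrative placeholders, not real AWS pricing, and real Karpenter also considers memory, architecture, and many other constraints.

```python
# Hypothetical sketch of Karpenter-style instance selection: find the
# cheapest instance type whose allocatable CPU fits the summed pod requests.
# Specs/prices are illustrative only, not real AWS data.
INSTANCE_TYPES = [
    ("m5.large",   2000, 0.096),   # (name, allocatable mCPU, $/hour)
    ("m5.xlarge",  4000, 0.192),
    ("m5.2xlarge", 8000, 0.384),
]

def cheapest_fit(pod_requests_mcpu):
    """Return the cheapest instance that fits the total requested mCPU."""
    total = sum(pod_requests_mcpu)
    for name, mcpu, price in INSTANCE_TYPES:  # list is sorted cheapest-first
        if mcpu >= total:
            return name
    return None  # no single instance fits

pods = 6
print(cheapest_fit([500] * pods))  # declared 500m each -> 3000m -> m5.xlarge
print(cheapest_fit([100] * pods))  # actual usage 100m each -> 600m -> m5.large
```

The same six pods land on a node twice as expensive purely because of the declared requests — exactly the gap Goldilocks surfaces.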
How VPA Generates Recommendations
VPA consists of three components: Recommender, Admission Controller, and Updater. Of these, the Recommender is responsible for actually collecting historical resource usage and calculating recommendations.
What matters is the algorithm. VPA Recommender doesn't use a simple average — it uses histogram-based percentile estimation (default p90–p95). Why does this matter? Because it means the top 5–10% peak usage is not reflected in the recommendation by default. For services where traffic fluctuates significantly by time of day, you might feel that the VPA target doesn't capture peaks — but that's by design. Understanding this also makes the "memory seasonality" drawback discussed later much easier to grasp.
Recommendations come in three flavors:
| Recommendation | Meaning | Usage Tips |
|---|---|---|
| `lowerBound` | Minimum value for operation without throttling | The hard floor — never go below this |
| `target` | VPA's recommended requests value (p90–p95 basis) | Use this as your baseline for stable operation |
| `upperBound` | Maximum estimated value the pod may need | Reference for limits settings (do NOT use this directly as limits) |
I was confused about this at first — I thought you could just use upperBound directly as limits. But upperBound is purely a reference, and it's better to set limits with additional headroom based on your own judgment. Also note that lowering limits to match requests can worsen CPU throttling.
How Goldilocks Layers on Top of VPA
Goldilocks doesn't implement any analysis logic of its own. Instead, when you label a namespace, it automatically creates VPA objects in updateMode: "Off" for all Deployments in that namespace.
You might wonder why updateMode: "Off" means pods aren't touched — VPA's Updater component is what actually restarts pods, and in Off mode the Updater simply doesn't run. The Recommender continues collecting data and calculating recommendations, but applying those values to pods is left to humans.
Goldilocks's role is to take those recommendations and display them in a clean web dashboard. Honestly, the first time I opened the dashboard and saw "this pod has 500m CPU set but target is 80m," it was quite a shock.
The full optimization loop looks like this:
Observe actual usage (VPA Recommender — histogram percentile-based)
↓
Review recommendations in Goldilocks dashboard
↓
Apply requests to Deployment YAML / GitOps
↓
Karpenter selects smaller nodes based on reduced requests
↓
Consolidation automatically removes idle nodes

What You Must Know When Using HPA and VPA Together
Applying HPA and VPA simultaneously on the same metric (CPU) causes conflicts. HPA adds pods when CPU usage is high, while VPA tries to change CPU requests — two controllers interfering with each other toward different goals.
From my experience, ignoring this and attaching both with CPU metrics led to pods restarting continuously in the middle of the night. So now I use a pattern that clearly separates their roles.
Recommended pattern: Assign CPU scaling to HPA and memory requests tuning to VPA (Goldilocks) — this role separation lets you use both tools together without conflicts.
Attaching to Your Cluster: From Installation to Applying Recommendations
Example 1: Installing Goldilocks + VPA and Basic Configuration
First, install VPA. It's a prerequisite since Goldilocks uses VPA's Recommender internally.
One thing worth noting — the fairwinds-stable/vpa Helm chart used here is a Fairwinds-wrapped version and may differ somewhat in configuration structure from the official kubernetes/autoscaler VPA. If you're already using the official VPA, check the Goldilocks official docs for integration options first.
```bash
# Install VPA (Fairwinds wrapper version)
helm repo add fairwinds-stable https://charts.fairwinds.com/stable
helm install vpa fairwinds-stable/vpa \
  --namespace vpa \
  --create-namespace

# Install Goldilocks
helm install goldilocks fairwinds-stable/goldilocks \
  --namespace goldilocks \
  --create-namespace

# Add label to the namespace you want to monitor
kubectl label namespace production goldilocks.fairwinds.com/enabled=true

# Temporary access to dashboard (for local/dev inspection)
kubectl port-forward svc/goldilocks-dashboard 8080:80 -n goldilocks
```

Security note: `kubectl port-forward` is for temporary local access only. To expose the dashboard persistently in production, use an Ingress or LoadBalancer Service and always configure authentication (OAuth, SSO, etc.) alongside it. Exposing it externally without authentication reveals your cluster's resource information in plain sight.
Once you apply the label, Goldilocks automatically creates VPA objects like the following:
```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-goldilocks
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Off"  # Updater disabled — recommendations only, no pod restarts
```

Here's an example of reviewing recommendations in the dashboard and applying them to a Deployment:
```yaml
# Before: over-provisioned state
# After: adjusted based on Goldilocks target
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  namespace: production
spec:
  template:
    spec:
      containers:
        - name: my-app
          image: my-app:latest
          resources:
            requests:
              cpu: "250m"      # was 500m -> 50% reduction
              memory: "512Mi"  # was 1Gi -> 50% reduction
            limits:
              cpu: "800m"      # limits kept with headroom above requests (prevents throttling)
              memory: "1Gi"
```

There's a reason limits weren't lowered to match requests. Setting CPU limits too low causes severe throttling and latency spikes during brief usage bursts. Adjusting limits should be done separately from requests, only after sufficient monitoring.
| Change | Before | After | Savings |
|---|---|---|---|
| CPU request | 500m | 250m | 50% |
| Memory request | 1Gi | 512Mi | 50% |
| Expected node size | m5.xlarge | m5.large | ~40% cost reduction |
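The node-count arithmetic behind a table like this can be sketched in a few lines. The numbers here are illustrative only — real scheduling also reserves DaemonSet and system overhead per node, and Karpenter may mix instance sizes.

```python
import math

def nodes_needed(pods, request_mcpu, node_allocatable_mcpu):
    """Minimum nodes required to schedule `pods` identical pods by CPU request."""
    return math.ceil(pods * request_mcpu / node_allocatable_mcpu)

pods = 20
before = nodes_needed(pods, 500, 4000)  # 500m requests on 4-vCPU nodes
after = nodes_needed(pods, 250, 4000)   # same pods after halving requests
print(before, after)  # -> 3 2 : one node fewer for the same workload
```

Halving requests here removes a third of the nodes, which is roughly how request tuning alone translates into the savings percentages quoted above.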
Example 2: Integrating with Karpenter NodePool Consolidation
When requests are reduced, Karpenter's Consolidation works more aggressively. Karpenter's spec.disruption.budgets (stable in the v1 API) lets you control node disruption during business hours — without this setting, enabling Consolidation can cause pods to be rescheduled without warning during the day.
consolidateAfter: 30s can be confusing at first — it means that 30 seconds after a node is judged underutilized, Karpenter begins draining it: the node's pods are rescheduled onto other nodes and the node is removed. 30 seconds is quite aggressive, so observe your traffic patterns thoroughly before tuning this value.
```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized  # renamed from WhenUnderutilized in the v1 API
    consolidateAfter: 30s  # Consolidation starts 30s after underutilized detection
    budgets:
      - nodes: "10%"       # Max percentage of nodes that can be disrupted simultaneously
      - nodes: "0"         # Completely block node disruption during business hours (Mon–Fri 09:00–18:00)
        schedule: "0 9 * * 1-5"
        duration: 9h
```

Consolidation: A Karpenter feature that reschedules pods spread across multiple nodes onto fewer nodes and deletes idle nodes. The closer requests are to actual usage, the more aggressively this operates.
Example 3: HPA + VPA Role Separation Pattern
A separation pattern where HPA handles CPU and Goldilocks (VPA) handles memory requests tuning.
```yaml
# HPA: horizontal scaling based on CPU utilization
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70  # Scale out when average CPU utilization across all pods exceeds 70%
```

averageUtilization: 70 may not be intuitive at first — it means "sum of actual CPU usage across all pods ÷ (requests × pod count)" exceeds 70%, triggering scale-out; below that, it scales in. The more accurate your requests, the more accurate HPA's scaling decisions become.
```yaml
# Resource settings within the Deployment
# CPU is the HPA basis, so keep it independent of Goldilocks recommendations
# Update only memory based on Goldilocks target
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  namespace: production
spec:
  template:
    spec:
      containers:
        - name: my-app
          image: my-app:latest
          resources:
            requests:
              cpu: "250m"      # HPA operation basis — use Goldilocks recommendation as reference only
              memory: "512Mi"  # Updated based on Goldilocks target
            limits:
              cpu: "800m"
              memory: "1Gi"
```

Pros and Cons Analysis
Advantages
| Item | Details |
|---|---|
| Maximized Karpenter efficiency | Precise requests → smaller instance type selection → higher bin-packing density |
| Cost savings | 30–56% infrastructure cost reduction for backend services with uniform traffic (real case: $52K → $23K/month) |
| Zero-downtime analysis | Operates exclusively in updateMode: Off — no pod restarts |
| Visualization | View recommendations per namespace and Deployment at a glance in a UI dashboard |
| Incremental adoption | When integrated with GitOps (Argo CD, Flux), supports safe PR-based gradual rollout |
Disadvantages and Caveats
| Item | Details | Mitigation |
|---|---|---|
| Manual intervention required | Goldilocks only provides recommendations; automatic application requires separate implementation | Build a pipeline with Argo CD/Flux to auto-generate PRs from recommendations |
| HPA conflicts | Simultaneous use of VPA Auto mode and CPU-based HPA causes mutual interference | Separate roles: CPU → HPA, Memory → VPA |
| Cold start | Minimum several days of observation data needed to generate p90 recommendations | Wait at least 3–7 days after installation before trusting recommendations |
| Over-consolidation | Too short a consolidateAfter value causes excessive pod rescheduling | Set between 30s–5m; disruption budgets are essential |
| Spot + no PDB | Karpenter + Spot combination without PDB leads to service instability | Set PodDisruptionBudget on all Deployments |
| Memory seasonality | If traffic patterns vary significantly by time of day, VPA p90 recommendations may miss peaks | Apply recommendations only after collecting data that includes sufficient peak-hour coverage |
PodDisruptionBudget (PDB): A setting that limits how many pods can go down simultaneously when Karpenter removes a node. Even just `minAvailable: 1` is enough to prevent service interruption.
The Most Common Mistakes in Practice
- Applying recommendations immediately after installing VPA. With only a few hours of data, recommendations won't reflect actual peaks at all. It's recommended to observe for at least a week before applying, to let the p90 histogram stabilize.
- Connecting both HPA and VPA to the CPU metric simultaneously. I made this mistake early on and ended up with pods restarting continuously overnight. When two controllers act on the same CPU signal toward different goals (HPA adding replicas while VPA resizes requests), pods become unstable. Follow the pattern of separating roles: CPU → HPA, Memory → VPA.
- Enabling Karpenter Consolidation without configuring disruption budgets. This can lead to situations where nodes go down during business hours and pods receiving traffic are all rescheduled at once. Setting `nodes: "0"` during business hours on the NodePool is essentially mandatory.
Closing Thoughts
Precisely tuning requests with Goldilocks + VPA is the work of giving Karpenter the correct information so it can do its job more intelligently. Real-world cases have reported savings in the range of $52K → $23K/month using this approach. Cost optimization starts not from adding new tools, but from giving accurate data to the scheduler you already have.
Three steps you can start right now:
- Start with a diagnosis. If your current cluster has two or more nodes with CPU utilization below 50%, this approach is likely to show immediate results. Label just one of your most expensive namespaces with Goldilocks and collect recommendations for a week — you'll immediately see how over-provisioned your current requests are.
- Apply recommendations to a single Deployment in a low-traffic environment. Lower requests to a level slightly above `lowerBound`, monitor for a few days to confirm no OOM or throttling occurs, then expand gradually. Keep limits with sufficient headroom, separate from requests.
- Set disruption budgets on your Karpenter NodePool and enable Consolidation. With reduced requests in place, enabling `consolidationPolicy: WhenEmptyOrUnderutilized` lets Karpenter automatically clean up idle nodes — and you'll see real cost savings materialize.
Next post: How to integrate with Argo CD to automatically generate PRs from Goldilocks recommendations and incorporate them into a GitOps pipeline
References
- Right-Sizing Kubernetes Resources with VPA and Karpenter | DEV Community
- Kubernetes Resource Optimization & Best Practices with Goldilocks | Fairwinds
- Right-size your Kubernetes Applications Using Open Source Goldilocks | AWS Open Source Blog
- How to Use Goldilocks VPA Recommendations to Right-Size Kubernetes Pod Resources | OneUptime
- GitHub - FairwindsOps/goldilocks
- Goldilocks Installation Official Docs
- Karpenter Official Docs - Disruption
- Kubernetes Cost Optimization: From $50K to $22K/Month with Karpenter, Spot, and VPA | ZeonEdge
- From VPA & Goldilocks to Automation with ScaleOps
- Answering Your Goldilocks Questions About How HPA and VPA Work Together | Fairwinds
- Understanding Karpenter Consolidation | StormForge
- Goldilocks vs Karpenter vs KRR for Kubernetes | Overcast Blog