Building a GitOps Pipeline to Automate Goldilocks VPA Recommendations with Argo CD Pull Request Generator
Anyone who has operated a Kubernetes cluster has likely encountered this situation: everything seems to be running fine, but the monthly cloud bill feels suspiciously high. CPU requests are set at 5x actual usage, memory at 3x, and that compounds across dozens of services. At first I thought, "Isn't it safer to over-provision?" — but it turns out this directly impacts the cost of Karpenter, the node autoprovisioner. Because Karpenter determines node size based on the sum of pod requests, inflated requests cause it to provision nodes far larger than necessary. After auditing dozens of services, over half showed more than 20% over-allocation.
So I introduced Goldilocks to start collecting VPA recommendations — and ran into another problem. The recommendations looked good, but having a human manually translate them into Helm values.yaml updates every time simply wasn't practical at scale. With dozens of services, that's not even semi-automated. What was really needed was an end-to-end pipeline: read recommendations, automatically open a Git PR, and have Argo CD apply the changes to the cluster after merge.
Following this guide will completely eliminate the manual values.yaml updates you've been doing every week. We'll walk through the full pipeline step by step — extracting Goldilocks VPA recommendations via a CronJob, auto-generating GitHub PRs, and setting up preview environments with the Argo CD ApplicationSet Pull Request Generator. This guide is aimed at teams already using Kubernetes, Helm, and Argo CD.
Core Concepts
What Goldilocks Does: Using VPA "Safely"
Enabling VPA naively causes pods to restart the moment a recommendation is generated — a fairly dangerous behavior in production. Goldilocks works around this with updateMode: "Off". It creates VPA objects but never actually applies them to pods, instead accumulating recommendations in .status.recommendation.
There's one prerequisite I missed initially and spent a while debugging ("why aren't recommendations showing up?"): the VPA Recommender requires Metrics Server (or Prometheus Adapter) to be installed. It's worth checking that kubectl top pods works correctly in your cluster first.
```yaml
# Example VPA object auto-created by Goldilocks
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-api-server
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-api-server
  updatePolicy:
    updateMode: "Off"  # Key: recommendations only, no restarts
status:
  recommendation:
    containerRecommendations:
      - containerName: app
        target:
          cpu: "250m"
          memory: "512Mi"
        lowerBound:
          cpu: "100m"
          memory: "256Mi"
        upperBound:
          cpu: "500m"
          memory: "1Gi"
```

Recommendations come in three flavors, `target`, `lowerBound`, and `upperBound`, and which one you use is a surprisingly important decision. `target` is based on average usage, while `upperBound` reflects peak traffic. For services with heavy batch workloads or frequent traffic spikes, it's safer to apply a `target * 1.2` safety margin or refer to `upperBound`.
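To make the margin arithmetic concrete, here's a minimal Python sketch (the helper names are hypothetical, not part of Goldilocks) of one reasonable policy: apply `target * 1.2`, but never exceed `upperBound`.

```python
def parse_cpu(v: str) -> float:
    """Convert a Kubernetes CPU quantity to millicores: '250m' -> 250, '1' -> 1000."""
    return float(v[:-1]) if v.endswith("m") else float(v) * 1000


def recommended_request(target: str, upper_bound: str, margin: float = 1.2) -> str:
    """Apply a safety margin to the VPA target, capped at the VPA upperBound."""
    capped = min(parse_cpu(target) * margin, parse_cpu(upper_bound))
    return f"{int(round(capped))}m"


print(recommended_request("250m", "500m"))  # 300m (250 * 1.2, under the cap)
print(recommended_request("450m", "500m"))  # 500m (540m would exceed upperBound)
```

Capping at `upperBound` is a deliberate choice: values above it exceed what the VPA has observed as a plausible peak, so anything larger is pure waste.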
Usage is straightforward: add a single label to the namespace you want to monitor, and the Goldilocks Controller will detect the Deployments in that namespace and automatically create VPA objects.
```bash
kubectl label namespace production goldilocks.fairwinds.com/enabled=true
```

> VPA recommendation convergence time: VPA needs at least 7–14 days of real traffic data to produce reliable recommendations. Pulling recommendations immediately after a deployment will yield meaningless numbers.
End-to-End Flow of the GitOps PR Pipeline
The core idea is simple. Instead of a human copying recommendations, a script polls VPA objects, updates Helm values.yaml, and submits a PR. Argo CD then syncs to the cluster once the PR is merged.
```
Goldilocks Controller
    ↓ (watches namespaces → creates VPA objects)
VPA Recommender
    ↓ (accumulates recommendations based on actual usage patterns)
Kubernetes CronJob
    ↓ (extracts recommendations via kubectl/API)
Automated GitHub PR
    ↓ (modifies values.yaml + submits PR)
Argo CD ApplicationSet
    ↓ (PR Generator → auto-deploys preview environment)
Code Review & Merge
    ↓
Argo CD Sync → applied to cluster
    ↓
Prometheus + Grafana (cost/performance monitoring)
```

Here's a summary of each component's role:
| Component | Role |
|---|---|
| Goldilocks Controller | Watches namespaces → auto-creates VPA objects |
| VPA Recommender | Analyzes actual usage patterns → accumulates .status.recommendation |
| CronJob + Script | Extracts VPA recommendations → updates values.yaml → creates PR |
| Argo CD ApplicationSet | Auto-deploys preview environments via PR Generator |
| Argo CD | Syncs merged PRs to the cluster |
Practical Implementation
Step 1: Automated PR Creation via CronJob
This is the lowest-barrier pattern to get started. Run a CronJob inside the cluster that reads VPA recommendations weekly and creates PRs. All you need is kubectl, jq, yq, and the gh CLI — no third-party commercial tooling required.
First, register your GitHub Token as a Secret. The token needs contents: write and pull-requests: write permissions; a fine-grained PAT is recommended.
```bash
kubectl create secret generic github-token \
  --from-literal=token=<YOUR_FINE_GRAINED_PAT> \
  -n goldilocks
```

Next, configure a ServiceAccount and RBAC so the CronJob has explicit permission to read VPA objects from the API server.
```yaml
# vpa-reader-rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: vpa-reader
  namespace: goldilocks
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: vpa-reader
rules:
  - apiGroups: ["autoscaling.k8s.io"]
    resources: ["verticalpodautoscalers"]
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: vpa-reader
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: vpa-reader
subjects:
  - kind: ServiceAccount
    name: vpa-reader
    namespace: goldilocks
```

The CronJob itself is the critical piece. It runs every Monday at 9 AM, extracts recommendations, and invokes the PR creation script. Since bitnami/kubectl:latest does not include jq, yq, or gh by default, you should build a custom image containing all three tools. The example below uses my-registry/pr-bot:1.0.0 as a placeholder image name.
```yaml
# goldilocks-pr-bot-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: goldilocks-pr-bot
  namespace: goldilocks
spec:
  schedule: "0 9 * * 1"  # Every Monday at 9 AM
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: vpa-reader
          containers:
            - name: pr-bot
              image: my-registry/pr-bot:1.0.0  # Custom image with kubectl + jq + yq + gh
              env:
                - name: GITHUB_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: github-token
                      key: token
                - name: DELTA_THRESHOLD
                  value: "0.2"  # Only create PR if difference exceeds 20%
              command:
                - /bin/sh
                - -c
                - |
                  # 1. Extract VPA recommendations (excluding sidecar containers)
                  kubectl get vpa -A -o json | jq -r '
                    .items[] |
                    .metadata.namespace + "/" + .metadata.name + " " +
                    (.status.recommendation.containerRecommendations[]? |
                      select(.containerName != "istio-proxy") |
                      select(.containerName != "linkerd-proxy") |
                      .containerName + " cpu=" + .target.cpu +
                      " mem=" + .target.memory)
                  ' > /tmp/recommendations.txt
                  # 2. Invoke PR creation script
                  /scripts/create-pr.sh /tmp/recommendations.txt
          restartPolicy: OnFailure
```

The most important part of the PR creation script is delta filtering. Honestly, I left this out initially and ended up with dozens of PRs flooding in every week. You need to skip updates where the change is less than 20% from the current value to keep PR noise under control.
Another easy mistake is handling the git clone directory. If a previous run left the directory behind, the clone will fail on the next execution. The script below handles this explicitly with rm -rf, and uses set -euo pipefail throughout to fail immediately on any error.
```bash
#!/bin/bash
# create-pr.sh
set -euo pipefail

BRANCH="chore/resource-update-$(date +%Y%m%d)"
REPO_DIR="/workspace/k8s-manifests"
THRESHOLD=${DELTA_THRESHOLD:-0.2}

# Git configuration
git config --global user.email "bot@example.com"
git config --global user.name "Goldilocks Bot"

# Clean up any directory left from a previous run
rm -rf "$REPO_DIR"
git clone "https://x-access-token:${GITHUB_TOKEN}@github.com/my-org/k8s-manifests" "$REPO_DIR" \
  || { echo "ERROR: git clone failed"; exit 1; }
cd "$REPO_DIR"
git checkout -b "$BRANCH"

CHANGED=false
while IFS=' ' read -r ns_name container cpu_rec mem_rec; do
  NAMESPACE=$(echo "$ns_name" | cut -d/ -f1)
  APP=$(echo "$ns_name" | cut -d/ -f2)
  VALUES_FILE="helm/${APP}/values.yaml"
  [ ! -f "$VALUES_FILE" ] && continue

  # Compare current value against recommendation (delta filtering)
  CURRENT_CPU=$(yq e ".resources.requests.cpu" "$VALUES_FILE") \
    || { echo "WARN: yq parse failed for $VALUES_FILE, skipping"; continue; }
  # yq prints "null" when the key is missing -- nothing to compare against, skip
  [ "$CURRENT_CPU" = "null" ] && continue
  CPU_VAL=${cpu_rec#cpu=}

  # Only update if the change exceeds the threshold (applying target * 1.2 safety margin)
  if python3 -c "
import sys

def parse_cpu(v):
    if v.endswith('m'):
        return float(v[:-1])
    return float(v) * 1000

cur = parse_cpu('$CURRENT_CPU')
rec = parse_cpu('$CPU_VAL') * 1.2  # 20% safety margin
diff = abs(cur - rec) / cur if cur > 0 else 1
sys.exit(0 if diff >= $THRESHOLD else 1)
"; then
    MEM_VAL=${mem_rec#mem=}
    # Update with safety margin applied
    SAFE_CPU=$(python3 -c "
def parse_cpu(v):
    if v.endswith('m'):
        return float(v[:-1])
    return float(v) * 1000

print(str(int(parse_cpu('$CPU_VAL') * 1.2)) + 'm')
")
    yq e ".resources.requests.cpu = \"${SAFE_CPU}\"" -i "$VALUES_FILE"
    yq e ".resources.requests.memory = \"${MEM_VAL}\"" -i "$VALUES_FILE"
    CHANGED=true
    echo "Updated: $APP (cpu: $CURRENT_CPU → $SAFE_CPU)"
  fi
done < "$1"

# Only create PR if there are changes
if [ "$CHANGED" = true ]; then
  git add -A
  git commit -m "chore: update resource requests from Goldilocks recommendations $(date +%Y-%m-%d)"
  git push origin "$BRANCH" \
    || { echo "ERROR: git push failed"; exit 1; }
  gh pr create \
    --title "chore: Goldilocks resource right-sizing $(date +%Y-%m-%d)" \
    --body-file /scripts/pr-template.md \
    --label "resource-optimization" \
    --label "automated"
fi
```

Preparing a PR body template in advance makes life much easier for reviewers. Here's a suggested format for `/scripts/pr-template.md`:
```markdown
## Goldilocks Resource Right-sizing

This PR was auto-generated based on VPA recommendations.

### Changes

| Service | Before CPU | After CPU | Before Memory | After Memory |
|---------|-----------|-----------|--------------|-------------|
| (populated by script) | | | | |

### Estimated Cost Savings

- Estimated monthly savings: $XXX (including Karpenter node size reduction)

### Verification Checklist

- [ ] Preview environment deployed successfully
- [ ] No CrashLoopBackOff after pod restarts
- [ ] Key metrics (response time, error rate) within normal range
```

| Code Point | Description |
|---|---|
| `select(.containerName != "istio-proxy")` | Filters out sidecar container recommendations |
| `DELTA_THRESHOLD=0.2` | No PR for changes under 20% (noise prevention) |
| `target * 1.2` safety margin | Adds spike headroom on top of average-based recommendations |
| `set -euo pipefail` | Fail immediately on error; prevent execution in a broken state |
| `rm -rf "$REPO_DIR"` | Prevents clone directory conflicts on CronJob re-runs |
| `$CHANGED = true` check | Prevents empty PR creation when nothing changed |
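The `DELTA_THRESHOLD` decision can also be isolated as a small, testable function. This is a sketch mirroring the inline `python3 -c` logic from the script (the function names are mine, not the script's):

```python
def parse_cpu(v: str) -> float:
    """Kubernetes CPU quantity to millicores: '250m' -> 250, '0.5' -> 500."""
    return float(v[:-1]) if v.endswith("m") else float(v) * 1000


def should_update(current: str, recommended: str,
                  threshold: float = 0.2, margin: float = 1.2) -> bool:
    """True when the margin-adjusted recommendation differs from the
    current request by at least the threshold (default 20%)."""
    cur = parse_cpu(current)
    rec = parse_cpu(recommended) * margin
    diff = abs(cur - rec) / cur if cur > 0 else 1.0
    return diff >= threshold


print(should_update("1000m", "250m"))  # True: 300m is 70% below the current request
print(should_update("300m", "250m"))   # False: margin-adjusted rec equals current
```

Pulling the logic out like this also makes it trivial to unit-test your threshold choice before wiring it into the CronJob.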
Step 2: Adding Preview Environments with Argo CD ApplicationSet
Auto-generating PRs is great, but without a way to verify "is this change actually safe?", there's always some anxiety. The most alarming case I personally encountered was a PR with under-provisioned recommendations getting merged. Using the ApplicationSet Pull Request Generator, you can automatically deploy changes to a preview namespace the moment a PR is opened, allowing validation before merge.
One important caveat: using {{branch}} directly as a namespace name causes two problems. First, branch names containing slashes (/) are not valid Kubernetes namespace names. Second, they may exceed the 63-character length limit. The pattern below sanitizes the branch name to avoid both issues. Note that the sanitizing filters (replace, trunc) are sprig template functions, which only work when goTemplate: true is set on the ApplicationSet spec; alternatively, the Pull Request Generator exposes a pre-sanitized branch_slug parameter you can use instead.
```yaml
# resource-update-preview-appset.yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: resource-update-previews
  namespace: argocd
spec:
  goTemplate: true  # Required for the sprig functions (replace, trunc) used below
  generators:
    - pullRequest:
        github:
          owner: my-org
          repo: k8s-manifests
          labels:
            - resource-optimization  # Filter only Goldilocks PRs
          tokenRef:
            secretName: github-token
            key: token
        requeueAfterSeconds: 300  # Re-check PR status every 5 minutes
  template:
    metadata:
      # Remove slashes + truncate to 50 chars for a valid namespace name
      name: 'preview-{{ .branch | replace "/" "-" | trunc 50 }}'
    spec:
      project: default
      source:
        repoURL: https://github.com/my-org/k8s-manifests
        targetRevision: '{{ .head_sha }}'
        path: helm/my-app
        helm:
          valueFiles:
            - values.yaml
      destination:
        server: https://kubernetes.default.svc
        namespace: 'preview-{{ .branch | replace "/" "-" | trunc 50 }}'
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
        syncOptions:
          - CreateNamespace=true
          - ServerSideApply=true
```

> Pull Request Generator: Argo CD detects PRs with the specified label and automatically creates Application objects. `requeueAfterSeconds: 300` sets the interval for re-checking PR status and detecting newly opened PRs. When a PR is closed, the corresponding Application is automatically deleted.
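The length budget behind that sanitization is easy to verify: an 8-character `preview-` prefix plus a branch slug truncated to 50 characters always stays within the 63-character namespace limit. Here's a plain-Python mirror of the template logic (illustrative only, not part of Argo CD):

```python
def preview_namespace(branch: str, prefix: str = "preview-", max_branch: int = 50) -> str:
    """Mirror the ApplicationSet template: replace '/' with '-', truncate the
    branch to 50 chars, prepend the prefix; 8 + 50 = 58, under the 63-char limit."""
    name = prefix + branch.replace("/", "-")[:max_branch]
    assert len(name) <= 63
    return name


print(preview_namespace("chore/resource-update-20240101"))
# preview-chore-resource-update-20240101
```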
This ApplicationSet is used alongside your existing Argo CD project configuration. Rather than project: default, it's better to specify a project that matches your team's RBAC policies. Isolating preview namespace access in a dedicated AppProject keeps it from interfering with your production environment.
With this setup, PR reviewers can confirm that "the app starts up correctly with these new recommendations" directly in the preview environment before merging. After merge, Argo CD automatically syncs to the production namespace, completing the full loop.
Pros and Cons
Advantages
| Item | Details |
|---|---|
| Audit trail | Every resource change is recorded as a Git commit, making it traceable who changed what and why |
| Safe review process | VPA proposes recommendations as PRs rather than restarting pods directly, allowing team review before applying |
| Incremental rollout | Use namespace labels to scope targets; apply progressively to services you're confident about |
| Cost visibility | Include estimated cost savings in PR descriptions to quantify business value |
| Easy rollback | Instantly revert to previous resource settings with a single git revert |
Drawbacks and Caveats
The most jarring issue I personally ran into was HPA conflicts. Running CPU-based HPA and VPA simultaneously can create situations where they attempt to scale in opposite directions. The table below summarizes the key pitfalls and how to address them.
| Item | Details | Mitigation |
|---|---|---|
| Recommendation convergence time | Recommendations are inaccurate immediately after a new deployment | Filter PR targets to workloads at least 14 days past their last deployment |
| Traffic spikes | Risk of under-provisioning for batch jobs or event-driven traffic | Apply target * 1.2 safety margin; consider upperBound-based settings |
| PR noise | Can generate dozens to hundreds of PRs per week | Delta threshold filtering (±20%) to generate PRs only for meaningful changes |
| HPA conflicts | CPU scaling conflicts when HPA and VPA are used simultaneously | Apply VPA to memory only, or exclude CPU via resourcePolicy |
| Pod restarts | Pod restarts occur when Argo CD auto-syncs | Protect availability with ServerSideApply=true + PodDisruptionBudget |
| Multi-container pods | Risk of sidecar recommendations contaminating main container settings | Filter by container name (select(.containerName != "istio-proxy")) |
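As a concrete form of the `resourcePolicy` mitigation from the table, a VPA can be restricted to memory so that a CPU-based HPA retains sole control of CPU. This sketch shows a manually managed VPA (a Goldilocks-created object would need the equivalent configuration applied to it):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-api-server-memory-only
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-api-server
  updatePolicy:
    updateMode: "Off"
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        controlledResources: ["memory"]  # leave CPU entirely to the HPA
```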
> QualityOfService (QoS): Kubernetes assigns `Guaranteed` when `requests == limits` for both CPU and memory, `Burstable` when requests are set but differ from limits, and `BestEffort` when neither requests nor limits are set. Changing resource settings can therefore shift the QoS class. For stability-critical services, it's recommended to update limits alongside requests to maintain the `Guaranteed` class.
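Because the QoS class is determined mechanically from requests and limits, the rule is easy to encode. This is a simplified single-container sketch (the real kubelet rule evaluates every container in the pod; the function name is mine):

```python
def qos_class(requests: dict, limits: dict) -> str:
    """Simplified QoS classification for a single-container pod."""
    if not requests and not limits:
        return "BestEffort"
    # When requests are omitted but limits are set, Kubernetes defaults requests to limits.
    effective_requests = requests or limits
    if limits and effective_requests == limits and set(limits) == {"cpu", "memory"}:
        return "Guaranteed"
    return "Burstable"


print(qos_class({"cpu": "250m", "memory": "512Mi"},
                {"cpu": "250m", "memory": "512Mi"}))  # Guaranteed
print(qos_class({"cpu": "250m"}, {"cpu": "500m"}))    # Burstable
print(qos_class({}, {}))                              # BestEffort
```

A resource PR that lowers requests without touching limits silently demotes a `Guaranteed` service to `Burstable`, which is exactly the shift the note above warns about.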
Most Common Mistakes in Practice
- Attaching PR automation before VPA data has converged: If you spin up the CronJob alongside the Goldilocks installation, PRs will be generated from recommendations based on only a few days of data, which are essentially meaningless. Let data accumulate for at least two weeks before enabling automation.
- Running without delta filtering: It seems fine at first, but as the number of services grows, you end up with dozens of "reduced CPU by 1m" PRs every week. Once PR fatigue sets in, the team starts ignoring them entirely, and the automation loses its value.
- Enabling Argo CD AutoSync without PodDisruptionBudgets: The moment a resource PR is merged, Argo CD may restart all pods simultaneously. If this happens during a high-traffic period, it can result in a brief service outage.
Closing Thoughts
One thing that changes after introducing this pipeline is how the team thinks about resource settings: they shift from "something you set once and are afraid to touch" to "something you review weekly via PR." A culture naturally emerges where the cluster proposes sensible values, humans apply final judgment, and changes land safely with a full Git history.
Here are three steps you can start with right now:
- Install Goldilocks and start collecting data: After installing with the commands below, add a label to just one of your most expensive namespaces and let it collect data for two weeks.

  ```bash
  helm repo add fairwinds-stable https://charts.fairwinds.com/stable
  helm install goldilocks fairwinds-stable/goldilocks \
    -n goldilocks --create-namespace
  kubectl label namespace <ns> goldilocks.fairwinds.com/enabled=true
  ```

- Review recommendations and establish filtering criteria: Use `kubectl get vpa -A -o json | jq '.items[].status.recommendation.containerRecommendations'` to inspect the recommendations directly, compare them with your current `values.yaml` settings, and decide on a delta threshold (20–30%) that fits your team.
- Deploy the CronJob and run a pilot: Apply the CronJob YAML above, but initially configure it to create Draft PRs only by adding the `--draft` flag to `gh pr create`. After observing for 2–3 weeks, you'll have a clear sense of the noise level and recommendation quality.
Coming Up Next
We'll cover how to use Argo CD ApplicationSet's Matrix Generator and Cluster Generator to progressively roll out resource optimization PRs across environments (dev/staging/prod) in a multi-cluster setup. (Link will be updated after publication)
References
- GitHub - FairwindsOps/goldilocks
- Goldilocks Official Documentation | Fairwinds
- Kubernetes right-sizing with metrics-driven GitOps automation | AWS Blog
- Goldilocks: Fairwinds Insights Integration Documentation
- Fairwinds Insights Automated Fix PR Release Notes
- Argo CD Pull Request Generator Official Documentation
- Automate CI/CD on pull requests with Argo CD ApplicationSets | Red Hat Developer
- Mastering Argo CD Image Updater with Helm | CNCF Blog
- GitHub - wI2L/kubectl-vpa-recommendation
- Grafana VPA Recommendations Dashboard (ID: 14588)
- What is Goldilocks? | CNCF Blog
- Kubernetes Resource Optimization & Best Practices with Goldilocks | Security Boulevard