ArgoCD + Kubernetes GitOps Zero-Downtime Deployment in Practice — From Rolling Updates to Canary, How to Safely Change Production with a Single Git Commit
Have you ever been afraid of deployments? When I first started operating Kubernetes, I'd post "Deploying 🙏" in Slack every deployment day, stare holes through the monitoring dashboard, and break into a cold sweat. But after adopting ArgoCD, deployments became just another "ordinary Git PR merge." Seeing production changes reflected within 5 minutes of merging a PR became routine, and after experiencing automatic rollbacks in the middle of the night without a single alert, late-night overtime naturally faded away.
This post covers everything from the concept of GitOps to Rolling Update, Blue-Green, and Canary deployment strategies using ArgoCD, complete with real-world code. Rather than a simple "here's what exists" overview, I'll share configurations you can apply immediately in production, along with honest accounts of the traps we commonly fell into. By the end of this post, you'll have a concrete picture of how to design a zero-downtime deployment pipeline with ArgoCD on your own.
Here's the flow: we'll first clarify how GitOps differs from traditional CI/CD, then understand ArgoCD's core components. Next, we'll walk through three deployment strategies — Rolling Update, Blue-Green, and Canary — with actual YAML code, sharing the pitfalls our team genuinely encountered along the way. We'll wrap up with a step-by-step guide you can start using right now.
What Is GitOps — "Git Is the Truth"
GitOps is an operational paradigm where the entire state of infrastructure and application deployments is declaratively recorded in a Git repository and continuously synchronized with the actual system. If traditional CI/CD pipelines work by "executing commands to change state," GitOps works by "declaring the desired state and letting the system figure out how to match it."
The core principle of GitOps: the Git repository is the Single Source of Truth. The moment you run `kubectl apply` directly against the cluster, this principle is broken.
ArgoCD is the tool that implements this principle on top of Kubernetes. As a CNCF graduated project (the official maturity certification in the cloud-native ecosystem), it is one of the most widely adopted CD tools in the Kubernetes ecosystem. In April 2025, ArgoCD v3.0 was released — the first major version since 2021 — bringing a host of enterprise-grade features including per-resource RBAC and API server load reduction.
ArgoCD Architecture — Understanding the Core Components
When you first encounter ArgoCD, the terminology can feel unfamiliar, but in practice, grasping a few key concepts is enough to see the whole picture.
| Component | Role |
|---|---|
| Application CRD | The core resource that defines "which path of which Git repo" to deploy to "which namespace of which cluster" |
| App of Apps Pattern | A parent Application that manages other Applications — suitable for managing a small number of services hierarchically |
| ApplicationSet | Automates multi-cluster and multi-environment deployments — handles dozens of environments from a single template |
| Sync Waves & Hooks | Controls deployment order — ensures things like DB migrations run first, app deployment second |
App of Apps vs ApplicationSet: The two patterns look similar, but there are criteria for choosing. App of Apps is suitable when the number of services is small and the structure is fixed. You manually write Application YAML files to build a hierarchical structure, which is intuitive. ApplicationSet, on the other hand, shines as environments (dev/staging/prod) or clusters grow. Adding a new cluster only requires adding one line of YAML. Our team started with App of Apps, then migrated to ApplicationSet once we exceeded three clusters.
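To make the contrast concrete, here is a minimal ApplicationSet sketch using the list generator. This is an illustrative example, not our actual configuration — the cluster names, URLs, and repo path are placeholders:

```yaml
# Hypothetical example — adding a new cluster is just one more list element
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: my-app
  namespace: argocd
spec:
  generators:
  - list:
      elements:
      - cluster: dev
        url: https://dev-cluster.example.com
      - cluster: prod
        url: https://prod-cluster.example.com
  template:
    metadata:
      name: 'my-app-{{cluster}}'
    spec:
      project: default
      source:
        repoURL: https://github.com/my-org/my-app-config
        targetRevision: HEAD
        path: 'k8s/overlays/{{cluster}}'
      destination:
        server: '{{url}}'
        namespace: my-app
```

ArgoCD stamps out one Application per list element, so dev and prod stay structurally identical and only the parameterized fields differ.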
The most fundamental Application resource looks like this:
```yaml
# k8s/argocd/apps/my-app.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/my-org/my-app-config
    targetRevision: HEAD
    path: k8s/overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true     # Resources deleted from Git are also removed from the cluster
      selfHeal: true  # If cluster state diverges from Git, automatically restore
```

What `selfHeal: true` means: even if someone accidentally edits directly with `kubectl edit`, ArgoCD detects it and automatically restores the Git state. This is the Self-Healing characteristic of GitOps. You'll occasionally find people confused by "why does my change keep reverting?" — this setting is why.
Deployment Strategy Overview — Rolling, Blue-Green, Canary
Honestly, at first the differences between the three strategies were confusing. Summarized in a single line each:
| Strategy | Core Idea | When to Use |
|---|---|---|
| Rolling Update | Replace pods one by one sequentially | Most situations, when resources are limited |
| Blue-Green | Prepare a new environment and switch traffic all at once | When immediate rollback is needed or DB schema changes are involved |
| Canary | Send only a portion of traffic to the new version first | When the risk is high, when you want to validate with real traffic |
Rolling Update is built into Kubernetes, but Blue-Green and Canary require a separate tool called Argo Rollouts. Many people have installed only ArgoCD and wondered "why isn't my canary deployment working?" without knowing this.
Practical Application
Rolling Update in Practice — With a Production Checklist
Rolling Update looks simple, but missing a single setting can cause traffic interruptions mid-deployment. I remember deploying without a readinessProbe the first time and getting flooded with 500 errors the moment new pods came up. Here's the Deployment configuration for a safe Rolling Update:
```yaml
# k8s/base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  annotations:
    argocd.argoproj.io/sync-wave: "2"  # Runs after DB migration (wave 1)
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0  # Always maintain 3 pods during deployment
      maxSurge: 1        # Allow up to 4 pods running simultaneously
  template:
    metadata:
      labels:
        app: my-app      # Must match the selector above and the PDB below
    spec:
      terminationGracePeriodSeconds: 60  # Wait for in-flight requests to complete
      containers:
      - name: my-app
        image: my-app:v2.0
        resources:
          requests:
            cpu: "250m"
            memory: "256Mi"
          limits:
            cpu: "500m"
            memory: "512Mi"
        readinessProbe:  # Without this, traffic flows into unhealthy pods
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
---
# k8s/base/pdb.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 2  # With replicas: 3, only 1 pod can be terminated at a time
  selector:
    matchLabels:
      app: my-app
```

To clarify the relationship between `minAvailable: 2` and `replicas: 3`: it means "at least 2 of the 3 pods must always be alive." Ultimately only 1 pod can be terminated at a time, ensuring service availability during node drains or rolling updates.
| Setting | Problem If Missing |
|---|---|
| `readinessProbe` | Traffic flows in before the app is ready → errors |
| `terminationGracePeriodSeconds` | Pod forcibly terminated while handling requests → connection drops |
| `resources.requests` | Scheduler fails to place pods due to insufficient node resources |
| `PodDisruptionBudget` | All pods can be terminated simultaneously during node drain |
For cases where order matters, such as DB migrations, you can use Sync Waves:
```yaml
# k8s/jobs/db-migration.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: db-migration
  annotations:
    argocd.argoproj.io/sync-wave: "1"  # Runs before Deployment (wave 2)
    argocd.argoproj.io/hook: PreSync
spec:
  template:
    spec:
      restartPolicy: Never  # Must be specified explicitly for Jobs
      containers:
      - name: migrate
        image: my-app:v2.0
        command: ["python", "manage.py", "migrate"]
```

Blue-Green with Argo Rollouts — When You Need Immediate Rollback
Blue-Green is an approach where you run the "current blue environment" and the "new green environment" simultaneously, then switch traffic all at once after verification. The biggest advantage is the ability to roll back instantly — our team chose this strategy for a deployment involving DB schema changes. Resources are temporarily doubled, but since the old version automatically scales down after scaleDownDelaySeconds, it's not a permanent doubling.
First, you need two Services to receive traffic:
```yaml
# k8s/services/my-app-services.yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app-active   # Real user traffic goes here
spec:
  selector:
    app: my-app
  ports:
  - port: 80
    targetPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: my-app-preview  # For validating the new version
spec:
  selector:
    app: my-app
  ports:
  - port: 80
    targetPort: 8080
```

Then define the Rollout resource:
```yaml
# k8s/rollouts/my-app-rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  strategy:
    blueGreen:
      activeService: my-app-active    # Service currently receiving traffic
      previewService: my-app-preview  # Preview service for the new version
      autoPromotionEnabled: false     # Switch only after manual approval
      scaleDownDelaySeconds: 300      # Keep old version for 5 minutes after switch (for rollback)
  template:
    metadata:
      labels:
        app: my-app  # Must match the selector above
    spec:
      containers:
      - name: my-app
        image: my-app:v2.0
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
```
periodSeconds: 5After deploying the new version, you can validate it against the my-app-preview service, then switch with the following command when ready. The Argo Rollouts CLI can be installed using the kubectl plugin method described in the official documentation:
```bash
# Manual promotion with Argo Rollouts CLI
kubectl argo rollouts promote my-app -n production

# When rollback is needed
kubectl argo rollouts abort my-app -n production
```

Our team ran with `autoPromotionEnabled: false` for the first six months. We had the QA team validate directly against the preview service before switching traffic, and that habit helped us preemptively prevent several potential incidents.
Canary + Prometheus Automated Analysis — Validating with Real Traffic
Canary is an approach of "sending a little ahead first for real-user validation." By integrating Argo Rollouts' AnalysisTemplate with Prometheus, you can configure automatic rollback when the error rate exceeds a threshold.
Canary also requires two Services:
```yaml
# k8s/services/my-app-canary-services.yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app-stable   # Traffic for the existing stable version
spec:
  selector:
    app: my-app
  ports:
  - port: 80
    targetPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: my-app-canary   # Canary traffic for the new version
spec:
  selector:
    app: my-app
  ports:
  - port: 80
    targetPort: 8080
```

Next, an AnalysisTemplate defines the success criteria that will be evaluated against Prometheus:

```yaml
# k8s/analysis/success-rate.yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  metrics:
  - name: success-rate
    interval: 2m
    # result[0]: the first scalar value from the Prometheus query result (success rate 0.0~1.0)
    successCondition: result[0] >= 0.95  # Must pass with a 95%+ success rate
    # failureLimit: 3 → rollback after 3 cumulative failures (total, not consecutive)
    failureLimit: 3
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090
        query: |
          sum(rate(http_requests_total{app="my-app",status!~"5.."}[5m]))
          /
          sum(rate(http_requests_total{app="my-app"}[5m]))
```

Finally, the Rollout references the template in its canary steps:

```yaml
# k8s/rollouts/my-app-canary-rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  strategy:
    canary:
      canaryService: my-app-canary
      stableService: my-app-stable
      steps:
      - setWeight: 10           # 10% traffic → new version
      - pause: {duration: 5m}   # Observe for 5 minutes
      - setWeight: 30           # Expand to 30%
      - analysis:
          templates:
          - templateName: success-rate  # Automatic error rate check
      - setWeight: 60
      - pause: {duration: 10m}
      - setWeight: 100          # Full cutover if no issues
```

The role of AnalysisTemplate: during a canary deployment, it runs Prometheus queries to automatically evaluate metrics like error rate and latency. If thresholds are exceeded, it rolls back automatically without human intervention. I actually experienced an automatic rollback during a nighttime deployment with no alerts at all — that's when I truly felt the value of this feature. I only found out what had happened overnight when I came in the next morning and reviewed the rollback history.
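One refinement worth knowing: Argo Rollouts AnalysisTemplates accept arguments, so a single template can serve every service instead of hard-coding `app="my-app"` into the query. A sketch under that assumption (the `service-name` argument name is illustrative):

```yaml
# Parameterized variant of the success-rate template above
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
  - name: service-name  # Supplied by each Rollout's analysis step
  metrics:
  - name: success-rate
    interval: 2m
    successCondition: result[0] >= 0.95
    failureLimit: 3
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090
        query: |
          sum(rate(http_requests_total{app="{{args.service-name}}",status!~"5.."}[5m]))
          /
          sum(rate(http_requests_total{app="{{args.service-name}}"}[5m]))
```

Each Rollout then passes its own value under `analysis.args` (e.g. `name: service-name, value: my-app`), which keeps the analysis logic in one place as the number of services grows.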
Pros and Cons
Advantages
| Item | Details |
|---|---|
| Self-Healing | Automatically detects and restores when cluster state diverges from Git — even manual edits are reverted |
| Audit Trail | Every deployment is recorded as a Git commit — perfect tracking of "who deployed what and when" |
| Instant Rollback | A single revert to a previous commit automatically restores the cluster to its previous state |
| Visual UI | Resource topology, sync status, and health status all visible at a glance |
| RBAC + SSO | Fine-grained access control and integration with in-house SSO (SAML, OIDC) |
Disadvantages and Caveats
| Item | Details | Mitigation |
|---|---|---|
| Initial Adoption Cost | Installation, Git repo structure design, and team training typically takes over a week | Start with a single service and adopt incrementally |
| Kubernetes Only | Does not support VM or serverless environments | Run separate CD tools in parallel for non-K8s environments |
| Advanced Strategies Not Built-in | Blue-Green and Canary require separate Argo Rollouts installation | Set up Argo Rollouts as part of the bundle |
| Secret Management | Cannot store secrets directly in Git | Adopt External Secrets Operator or Sealed Secrets |
| No Multi-Environment Promotion | No built-in automatic promotion from dev → staging → prod | Supplement with the Kargo tool |
| UI Lag at Scale | Rendering slows down with thousands of apps or more | Use ApplicationSet for distributed management |
External Secrets Operator (ESO): An operator that synchronizes external secret stores — such as AWS Secrets Manager, GCP Secret Manager, and HashiCorp Vault — with Kubernetes Secrets. Only the secret reference path is stored in Git; the actual values are injected at runtime from the external store.
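As a rough sketch of how that looks in a manifest — the store name, secret path, and key names here are placeholders, and this assumes ESO and a `ClusterSecretStore` for AWS Secrets Manager are already set up:

```yaml
# Hypothetical example — only the reference lives in Git, never the secret value
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: my-app-secrets
  namespace: production
spec:
  refreshInterval: 1h           # Re-sync from the external store hourly
  secretStoreRef:
    name: aws-secrets-manager   # A pre-configured ClusterSecretStore
    kind: ClusterSecretStore
  target:
    name: my-app-secrets        # The Kubernetes Secret ESO will create
  data:
  - secretKey: DATABASE_URL     # Key in the resulting Kubernetes Secret
    remoteRef:
      key: prod/my-app          # Path in AWS Secrets Manager
      property: database_url
```

Because this manifest contains no sensitive values, it can be committed and synced by ArgoCD like any other resource.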
The Most Common Mistakes in Practice
- Not setting Readiness Probes — traffic flows in the moment a container comes up, causing initial request failures. It's best to always configure this alongside `initialDelaySeconds`. In my very first deployment, this mistake caused errors to pour out for tens of seconds.
- Overusing `autoPromotionEnabled: true` — automatically switching without validation in Blue-Green defeats the purpose of Blue-Green. Our team ran manual approval only for the first six months, and only applied automation to specific services after the team process had stabilized.
- Not accounting for sync order in App of Apps — when managing multiple Applications hierarchically without configuring Sync Waves, resources with dependencies fall into a `Sync Failed` state due to ordering conflicts. The classic case is a resource that uses a CRD being applied before the CRD itself is deployed. It's safest to explicitly manage order by applying `argocd.argoproj.io/sync-wave` annotations to Application CRDs as well.
Closing Thoughts
I still remember those days of typing "Deploying 🙏" and breaking into a cold sweat. After adopting ArgoCD, deployments became just another routine process, no different from a code review. ArgoCD is the tool that best embodies the GitOps principle of "treating deployments like code," and when configured alongside Argo Rollouts, you can complete a production-grade zero-downtime deployment pipeline covering everything from Rolling Updates to metric-based Canary deployments.
Three steps you can start right now:
1. Install ArgoCD and connect your first Application: install ArgoCD with the commands below, then upload the Kubernetes manifests for one of your currently running services to a Git repository and create an Application CRD. Running `kubectl port-forward svc/argocd-server -n argocd 8080:443` to see the UI and visually observe the sync status is a huge help for building understanding.

   ```bash
   kubectl create namespace argocd
   kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
   ```

2. Install Argo Rollouts and apply a Canary strategy: migrate your existing Deployment to a Rollout resource and use the Canary example code above to configure the staged traffic shift of 10% → 30% → 60% → 100%. Starting without an AnalysisTemplate and using only manual pauses is perfectly fine at first.

   ```bash
   kubectl create namespace argo-rollouts
   kubectl apply -n argo-rollouts -f https://github.com/argoproj/argo-rollouts/releases/latest/download/install.yaml
   ```

3. Establish a secret management system: install External Secrets Operator and integrate it with AWS Secrets Manager or whichever secret store your team uses. Once this is complete, you'll have a truly "Git is all you need to look at" GitOps environment.
Next post: Multi-stage GitOps with Kargo — Designing an automatic promotion pipeline from dev to prod
References
- Argo CD Official Documentation | argo-cd.readthedocs.io
- Argo Rollouts Official Documentation | argo-rollouts.readthedocs.io
- CNCF — Argo CD v3 Announcement | cncf.io
- Implementing Zero-Downtime Deployments with Argo CD | OpsMx
- Zero-Downtime Rollbacks in Kubernetes with ArgoCD | DEV Community
- Automating Blue-Green and Canary Deployments with Argo Rollouts | Akuity
- Multi-cluster GitOps with ArgoCD | Com2uS ON Tech Blog
- Maximizing DevOps Efficiency with GitOps | Kakao Cloud
- Argo CD Anti-Patterns | Codefresh
- Production-Grade GitOps with Argo CD | Medium
- Implementing GitOps and Canary Deployment with Argo Project and Istio | Tetrate
- Kargo Official Site