Argo Rollouts BlueGreen Deployment Strategy — How It Differs from Canary, and When to Choose It
Whenever I think through deployment strategies, I always pause for a moment at "should I go with canary or BlueGreen?" At first, I vaguely assumed canary was safer — but then I tried pushing a DB schema change through a canary release and ended up in a pretty rough situation. Old Pods started throwing errors as they tried to read the new schema, and rolling back took 40 minutes. During that time, the service error rate spiked to 12%. After that day, I deeply understood: "This should have been BlueGreen."
This post walks through configuring the BlueGreen strategy hands-on with Argo Rollouts while examining how it differs from canary in practice. Rather than just copying YAML, we'll examine why BlueGreen's instant cutover makes a decisive difference in certain situations, and what criteria to use when choosing between the two strategies.
Before reading this post: This is most useful for those already familiar with Kubernetes Services, Deployments, and ReplicaSets. It will be especially helpful if you're facing a release that includes a DB migration or an API breaking change.
Core Concepts
What "Instant Cutover" in BlueGreen Actually Means
The concept behind BlueGreen deployment is simple. You run the current production environment (Blue) and the new version environment (Green) simultaneously, then switch all traffic to Green at once when it's ready. The key is that this switch does not happen partially.
The way Argo Rollouts implements this in Kubernetes is quite elegant. The cutover is a single update to the activeService Kubernetes Service that changes its selector to the new ReplicaSet's hash. The API server persists that update as one atomic write, so there is no state where the switch is "half applied and half not." However, there can be a propagation delay of hundreds of milliseconds, or even a few seconds on large clusters, before the Service's endpoints are recomputed and each node's kube-proxy updates its iptables/ipvs rules. In practice, this rarely causes issues, but it's more accurate to think of the switch as "effectively instantaneous" rather than "perfectly simultaneous."
Atomic-like Switch: A cutover method where traffic is not simultaneously distributed between the old and new versions — meaning there is no window where both versions are concurrently serving production traffic. This is BlueGreen's defining characteristic.
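To make the selector swap concrete, here is a rough sketch of how the active Service from Example 1 (my-app-active) might look once the controller has pointed it at Green. The rollouts-pod-template-hash key is the label Argo Rollouts manages on its ReplicaSets; the hash value is a made-up placeholder, not real output.

```yaml
# Illustrative only: the active Service after promotion.
# The hash below is a hypothetical placeholder; real values are generated per ReplicaSet.
apiVersion: v1
kind: Service
metadata:
  name: my-app-active
spec:
  selector:
    app: my-app
    rollouts-pod-template-hash: 7f9d4c6b8   # placeholder hash of the Green ReplicaSet
  ports:
    - port: 80
      targetPort: 8080
```

Promotion amounts to changing that one hash value, which is why the switch has no partially applied state at the API level.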
Comparing with canary makes this difference even clearer.
| | BlueGreen | Canary |
|---|---|---|
| Traffic cutover method | Instantaneous (effectively atomic) switch | Gradual percentage shift |
| Period of concurrent production traffic | None | Coexists throughout the entire rollout |
| Rollback method | Re-point the service pointer | Scale weight back to 0% |
| Infrastructure cost | Requires 2x resources | Minimizes additional resources |
| Suitable for | Breaking changes, large-scale releases | Gradual feature validation, high-frequency deployments |
The Lifecycle of an Argo Rollouts BlueGreen
Once a Rollout begins, it proceeds internally in the following order:
- Green ReplicaSet is created → The previewService is switched to point to Green. At this point, production traffic is still handled by Blue.
- prePromotionAnalysis runs (optional) → Automatically validates the state of Green based on Prometheus or Datadog metrics.
- Promotion → The activeService switches to Green. This is the moment of the instantaneous traffic cutover.
- postPromotionAnalysis runs (optional) → Performs smoke tests or additional validation after the cutover.
- Blue ReplicaSet is removed → The old version is cleaned up after scaleDownDelaySeconds.
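One way to tell which stage a Rollout is in is to look at its status. The fragment below is an abridged, illustrative sketch of what kubectl get rollout my-app -o yaml might show while the Rollout from Example 1 is paused before promotion. The field names reflect my understanding of the Rollout status and are worth verifying against your installed version; both hash values are made-up placeholders.

```yaml
# Abridged, illustrative status of a Rollout paused before promotion.
# Hash values are hypothetical placeholders.
status:
  phase: Paused
  stableRS: 5c8f7d9b6            # Blue: still serving production traffic
  currentPodHash: 7f9d4c6b8      # Green: the newly created ReplicaSet
  blueGreen:
    activeSelector: 5c8f7d9b6    # hash currently injected into activeService
    previewSelector: 7f9d4c6b8   # hash currently injected into previewService
```

After promotion, activeSelector would be expected to match previewSelector, which is a quick check that the cutover actually happened.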
Practical Application
Example 1: Basic BlueGreen Rollout Configuration
To use BlueGreen with Argo Rollouts, you first need two Kubernetes Services: an activeService and a previewService. At first glance, the two Services look identical, selectors included, which might make you wonder "why bother?" Both do start out with the same app: my-app selector. Argo Rollouts then injects an additional ReplicaSet hash label into each Service's selector as the deployment progresses, and that label is what distinguishes Blue from Green.
# active-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app-active
spec:
  selector:
    app: my-app
  ports:
    - port: 80
      targetPort: 8080
---
# preview-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app-preview
spec:
  selector:
    app: my-app
  ports:
    - port: 80
      targetPort: 8080
# rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: app
          image: my-app:v2
  strategy:
    blueGreen:
      activeService: my-app-active
      previewService: my-app-preview
      autoPromotionEnabled: false   # manual promotion
      previewReplicaCount: 1        # cost savings: Green runs at 1 replica for validation, then scales up on promotion
      scaleDownDelaySeconds: 300    # in production, be more generous than 30 seconds
      prePromotionAnalysis:
        templates:
          - templateName: success-rate
        args:
          - name: service-name
            value: my-app-preview
      postPromotionAnalysis:
        templates:
          - templateName: smoke-test
| Field | Role |
|---|---|
| activeService | Service that receives production traffic (required) |
| previewService | Service for accessing Green (new version); optional, used for QA validation |
| autoPromotionEnabled: false | Manual promotion; approve explicitly after validation |
| previewReplicaCount: 1 | Keeps Green at minimum replicas to cut costs to less than half |
| scaleDownDelaySeconds: 300 | Keeps Blue running for 5 minutes after promotion (provides rollback window) |
| prePromotionAnalysis | Automatic metrics gate before promotion |
To manually promote after deployment, use the following command:
kubectl argo rollouts promote my-app
If something goes wrong, you can immediately revert to the previous state:
kubectl argo rollouts undo my-app
Example 2: Deploying a DB Breaking Change (A Situation That Requires BlueGreen, Not Canary)
Now that you understand the basic configuration, it's time to look at the scenario where BlueGreen truly shines. The most typical real-world case is when you need to change a DB schema without backward compatibility. With canary, you get a window where both the old version (expecting the old schema) and the new version (using the new schema) are looking at the same DB simultaneously. I experienced data collision errors in this exact situation, and since then I always use BlueGreen for releases like this.
With BlueGreen, only the old version handles production traffic until the cutover, so you can safely design a flow where you manually promote after confirming the DB migration is complete.
# rollout-db-migration.yaml (strategy section)
strategy:
  blueGreen:
    activeService: api-active
    previewService: api-preview
    autoPromotionEnabled: false   # manual promotion after confirming migration
    scaleDownDelaySeconds: 600    # keep Blue for 10 minutes to allow rollback window
The actual deployment flow looks like this. The kubectl argo rollouts set image command, provided by the official kubectl plugin, is a convenient way to update the Rollout image:
# 1. Deploy new image (Green environment is created, traffic still goes to Blue)
kubectl argo rollouts set image api app=api:v2
# 2. Verify Green status via preview service
curl http://api-preview/health
# 3. After confirming DB migration is complete, manually promote
kubectl argo rollouts promote api
# 4. If issues arise, immediately roll back
kubectl argo rollouts undo api
Expand-Contract Pattern: To safely handle breaking schema changes, two stages are required. First, deploy an intermediate version that supports both the old and new columns (Expand), then deploy a second time to remove the old columns (Contract). Specifically: ① Deploy a version that adds the new column while keeping the old column → ② After confirming the code only uses the new column, deploy a version that removes the old column. Because BlueGreen clearly separates each stage, it is the deployment strategy that best fits the Contract phase of this pattern.
Example 3: Prometheus-Based Automatic Promotion Gate
Once you're comfortable with manual promotion, you can attach an AnalysisTemplate that automatically determines whether to promote based on metrics. In this example, the success rate is measured 5 times at 30-second intervals, so the gate takes a little over 2 minutes (the first measurement runs right away, followed by four 30-second waits). The numbers aim for a window short enough to give fast feedback yet long enough to average out transient spikes. When using this for the first time, it's safer to start with a lenient successCondition and then observe actual metric patterns before tightening the threshold.
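# analysis-template.yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 30s
      count: 5
      successCondition: result[0] >= 0.95
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{
              service="{{ args.service-name }}",
              status!~"5.."
            }[2m]))
            /
            sum(rate(http_requests_total{
              service="{{ args.service-name }}"
            }[2m]))
When this template is connected to the Rollout's prePromotionAnalysis, the Rollout can move on to promotion only when Green's success rate is at least 95% (an error rate of 5% or less). With autoPromotionEnabled: false, as in Example 1, a passing analysis still leaves the Rollout paused until you run promote; with auto-promotion enabled, promotion follows on its own. If the condition is not met, the Rollout is automatically aborted and Blue continues to handle production traffic. Keep in mind that the preview Service receives no production traffic, so this query only returns meaningful data if something, such as a load test or synthetic checks, is sending requests to Green.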
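The Rollout in Example 1 also references a smoke-test template under postPromotionAnalysis, which is not defined anywhere above. As a minimal sketch of what it could look like, the template below uses the Job provider to run a single curl against the active Service. The image, the /health path on my-app-active, and the file name are all assumptions for illustration, not part of the original setup.

```yaml
# smoke-test-template.yaml (hypothetical file; a sketch, not the author's template)
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: smoke-test                 # matches templateName under postPromotionAnalysis in Example 1
spec:
  metrics:
    - name: smoke-test
      provider:
        job:
          spec:
            backoffLimit: 0        # single attempt; a failed Job fails the measurement
            template:
              spec:
                restartPolicy: Never
                containers:
                  - name: smoke
                    image: curlimages/curl:latest   # assumption: any image that ships curl works
                    # Assumes the app exposes /health behind the active Service.
                    command: ["curl", "--fail", "--silent", "--show-error", "http://my-app-active/health"]
```

As I understand the documented behavior, a failed postPromotionAnalysis aborts the Rollout and points activeService back at the previous ReplicaSet, which only helps while Blue is still running. That is one more reason to keep scaleDownDelaySeconds generous.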
Pros and Cons Analysis
Advantages
| Item | Details |
|---|---|
| Instant cutover | No intermediate window where both versions serve production traffic simultaneously — eliminates edge cases from version coexistence at the source |
| Instant rollback | Rollback completes within seconds by simply re-pointing the service pointer |
| Isolated validation environment | previewService allows fully testing the new version without any production traffic |
| Clear operational state | Always either Blue or Green — minimizes operational complexity |
Disadvantages and Caveats
| Item | Details | Mitigation |
|---|---|---|
| 2x infrastructure cost | Two sets of ReplicaSets run simultaneously until promotion | Use previewReplicaCount to keep Green at minimum replicas, then scale up at promotion time |
| No real-traffic validation | The new version doesn't receive real user load before cutover, limiting predictions of production behavior | Run load tests separately against the preview environment, or validate some features with canary first |
| Session continuity issues | Sticky sessions may be broken after cutover | Establish a session reissuance strategy before cutover, or recommend stateless design |
| scaleDownDelay misconfiguration | If too short, Blue may already be deleted by the time you attempt a rollback | In production, recommend 300 seconds or more instead of the default 30 seconds |
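previewReplicaCount: You can specify the replica count for the Green environment separately via spec.strategy.blueGreen.previewReplicaCount. If cost is a concern, run Green at 1 replica for validation, then let it scale up to the full replica count at promotion time. With replicas: 3 as in Example 1, that means 4 Pods during validation instead of 6, so Green's extra capacity is a third of a full duplicate. Both sets still run at full size from promotion until scaleDownDelaySeconds elapses.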
The Most Common Mistakes in Practice
- Leaving scaleDownDelaySeconds at the default value (30 seconds). I actually made this mistake in production once. I discovered a problem right after promotion and tried to roll back, but Blue had already been deleted after just 30 seconds. That was a very long night. In production, I recommend keeping it at a minimum of 5 minutes (300 seconds).
- Setting autoPromotionEnabled: false but forgetting to include the promotion step in the CI/CD pipeline. This results in Green being created but traffic never switching over, leaving two sets of ReplicaSets running indefinitely. It's recommended to explicitly include the kubectl argo rollouts promote call in your pipeline.
- Skipping the Expand-Contract pattern when using BlueGreen with DB migrations. Even though BlueGreen guarantees an instantaneous cutover, you still need to separately design for whether a rollback to the previous version will conflict with the new schema. Without schema design that includes the rollback path, the fast rollback guarantee that BlueGreen provides becomes meaningless.
Closing Thoughts
BlueGreen is the strategy you choose when there's a constraint that two versions must never serve production traffic simultaneously, while canary is the strategy you choose when you want to validate gradually with real users.
The two strategies are not competitors — they are tools you select based on the situation. In practice, more and more teams are combining BlueGreen for stability-critical services and canary for services that require high-frequency deployments and feature validation.
Three steps you can take right now:
- Install Argo Rollouts and configure the BlueGreen environment. Install Argo Rollouts on a local kind cluster (refer to the official documentation's installation guide), then apply active-service.yaml, preview-service.yaml, and rollout.yaml from above to see the basic BlueGreen flow in action. Since rollout.yaml references the success-rate and smoke-test templates, either create those AnalysisTemplates first or remove the analysis sections for a first run.
- Monitor Rollout status with the kubectl plugin. Use kubectl argo rollouts get rollout my-app --watch to watch in real time how the stages of Green creation → promotion → Blue removal progress. Running the promote, undo, and abort commands yourself is a natural way to get comfortable with the rollback flow.
- Connect an AnalysisTemplate. If you already have Prometheus, you can apply the success-rate template from Example 3 as-is to attach an automatic promotion gate. Start with a lenient successCondition and observe actual metric values before adjusting the threshold, which is the safer approach.
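As a rough illustration of what a "lenient" starting point might look like, the fragment below loosens the metric from Example 3. The 0.90 threshold and failureLimit: 1 are arbitrary example values rather than recommendations; the right numbers depend on your service's normal success rate.

```yaml
# Fragment of the success-rate AnalysisTemplate (goes under spec).
# Values are illustrative starting points only.
metrics:
  - name: success-rate
    interval: 30s
    count: 5
    successCondition: result[0] >= 0.90   # looser than the 0.95 used in Example 3
    failureLimit: 1                        # tolerate one failed measurement (the default is 0)
    provider:
      prometheus:
        address: http://prometheus:9090
        query: |
          sum(rate(http_requests_total{
            service="{{ args.service-name }}",
            status!~"5.."
          }[2m]))
          /
          sum(rate(http_requests_total{
            service="{{ args.service-name }}"
          }[2m]))
```

After a few releases, comparing the measured values in the resulting AnalysisRuns against these limits should show how far the threshold can be tightened.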
References
- BlueGreen Deployment Strategy — Argo Rollouts official documentation — Reference for all BlueGreen configuration values
- Blue/green deployment strategy with Argo Rollouts — Red Hat Developer — Practical explanation focused on real-world application examples
- How to Automate Blue-Green & Canary Deployments with Argo Rollouts — Akuity — CI/CD pipeline automation patterns
- Blue/green Versus Canary Deployments: 6 Differences And How To Choose — Octopus Deploy — A comprehensive guide comparing the differences between the two strategies
- Chapter 1. Using Argo Rollouts for progressive deployment delivery — Red Hat OpenShift GitOps 1.11 — Reference for enterprise environment application
- GitOps in 2025: From Old-School Updates to the Modern Way — CNCF — Latest GitOps trends and where Argo Rollouts fits
- Blue-green vs canary deployments: safer API and DB changes — AppMaster — Strategy selection criteria for API and DB change scenarios
- Progressive Delivery on Kubernetes: From Blue-Green to GitOps-Powered Rollouts — Medium — A walkthrough from BlueGreen to GitOps integration