DEV BAK - TECH BLOG

Argo Rollouts BlueGreen Deployment Strategy — How It Differs from Canary, and When to Choose It

Whenever I think through deployment strategies, I always pause for a moment at "should I go with canary or BlueGreen?" At first, I vaguely assumed canary was safer — but then I tried pushing a DB schema change through a canary release and ended up in a pretty rough situation. Old Pods started throwing errors as they tried to read the new schema, and rolling back took 40 minutes. During that time, the service error rate spiked to 12%. After that day, I deeply understood: "This should have been BlueGreen."

In this post, we'll configure the BlueGreen strategy hands-on with Argo Rollouts and look at how it differs from canary in practice. Rather than just copying YAML, we'll examine why BlueGreen's instant cutover makes a decisive difference in certain situations, and what criteria to use when choosing between the two strategies.

Before reading this post: This is most useful for those already familiar with Kubernetes Services, Deployments, and ReplicaSets. It will be especially helpful if you're facing a release that includes a DB migration or an API breaking change.


Core Concepts

What "Instant Cutover" in BlueGreen Actually Means

The concept behind BlueGreen deployment is simple. You run the current production environment (Blue) and the new version environment (Green) simultaneously, then switch all traffic to Green at once when it's ready. The key is that this switch does not happen partially.

The way Argo Rollouts implements this in Kubernetes is quite elegant. The cutover is a single update to the activeService Kubernetes Service: the controller changes the rollouts-pod-template-hash value in the Service's selector to the hash of the new ReplicaSet. That update is one write through the Kubernetes API server (persisted in etcd), so the Service object itself is never "half applied and half not." However, each node's kube-proxy still has to pick up the change and rewrite its iptables/ipvs rules, which can take hundreds of milliseconds, or even a few seconds on large clusters. In practice, this rarely causes issues, but it's more accurate to think of the switch as "effectively instantaneous" rather than "perfectly simultaneous."
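As a rough illustration of that selector update, here is what the active Service might look like after a promotion. This is a sketch, not output captured from a real cluster: the Service name is borrowed from Example 1 later in this post, and the hash value is made up.

```yaml
# Sketch: the active Service after Argo Rollouts has promoted Green.
apiVersion: v1
kind: Service
metadata:
  name: my-app-active
spec:
  selector:
    app: my-app
    # Added and managed by the Argo Rollouts controller.
    # The value is a made-up example; the real one is the new ReplicaSet's pod-template hash.
    rollouts-pod-template-hash: 6f8d9c7b5d
  ports:
    - port: 80
      targetPort: 8080
```

Because only this one selector value changes, the set of Pods behind the Service flips from Blue to Green in a single object update.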

Atomic-like Switch: A cutover method where traffic is not simultaneously distributed between the old and new versions — meaning there is no window where both versions are concurrently serving production traffic. This is BlueGreen's defining characteristic.

Comparing with canary makes this difference even clearer.

| | BlueGreen | Canary |
|---|---|---|
| Traffic cutover method | Instantaneous (effectively atomic) switch | Gradual percentage shift |
| Period of concurrent production traffic | None | Coexists throughout the entire rollout |
| Rollback method | Re-point the service pointer | Scale weight back to 0% |
| Infrastructure cost | Requires 2x resources | Minimizes additional resources |
| Suitable for | Breaking changes, large-scale releases | Gradual feature validation, high-frequency deployments |

The Lifecycle of an Argo Rollouts BlueGreen

Once a Rollout begins, it proceeds internally in the following order:

  1. Green ReplicaSet is created → The previewService is switched to point to Green. At this point, production traffic is still handled by Blue.
  2. prePromotionAnalysis runs (optional) → Automatically validates the state of Green based on Prometheus or Datadog metrics.
  3. Promotion → If autoPromotionEnabled is false, the Rollout pauses here until it is promoted manually. The activeService then switches to Green. This is the moment of the instantaneous traffic cutover.
  4. postPromotionAnalysis runs (optional) → Performs smoke tests or additional validation after the cutover.
  5. Blue ReplicaSet is scaled down → The old version's Pods are removed once scaleDownDelaySeconds has elapsed.

Practical Application

Example 1: Basic BlueGreen Rollout Configuration

To use BlueGreen with Argo Rollouts, you first need two Kubernetes Services: an activeService and a previewService. At first glance, the two manifests look identical apart from their names, which might make you wonder "why bother?" Both services do start out with the same app: my-app selector. As the deployment progresses, Argo Rollouts adds a rollouts-pod-template-hash entry carrying a ReplicaSet's hash to each service's selector, which is what distinguishes Blue from Green.

yaml
# active-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app-active
spec:
  selector:
    app: my-app
  ports:
    - port: 80
      targetPort: 8080
---
# preview-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app-preview
spec:
  selector:
    app: my-app
  ports:
    - port: 80
      targetPort: 8080

yaml
# rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: app
        image: my-app:v2
  strategy:
    blueGreen:
      activeService: my-app-active
      previewService: my-app-preview
      autoPromotionEnabled: false       # manual promotion
      previewReplicaCount: 1            # cost savings: Green runs 1 replica for validation, then scales to spec.replicas before the switch
      scaleDownDelaySeconds: 300        # in production, be more generous than 30 seconds
      prePromotionAnalysis:
        templates:
        - templateName: success-rate
        args:
        - name: service-name
          value: my-app-preview
      postPromotionAnalysis:
        templates:
        - templateName: smoke-test

| Field | Role |
|---|---|
| activeService | Service that receives production traffic (required) |
| previewService | Service for accessing Green (new version); optional, used for QA validation |
| autoPromotionEnabled: false | Manual promotion: approve explicitly after validation |
| previewReplicaCount: 1 | Keeps Green at 1 replica during validation (here 1 Pod instead of 3) to reduce the extra cost of running Green |
| scaleDownDelaySeconds: 300 | Keeps Blue running for 5 minutes after promotion (provides rollback window) |
| prePromotionAnalysis | Automatic metrics gate before promotion |
| postPromotionAnalysis | Automatic validation after the cutover |

To manually promote after deployment, use the following command:

bash
kubectl argo rollouts promote my-app

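If something goes wrong after promotion, you can immediately revert to the previous revision (before promotion, kubectl argo rollouts abort my-app cancels the update instead):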

bash
kubectl argo rollouts undo my-app

Example 2: Deploying a DB Breaking Change — A Situation That Requires BlueGreen, Not Canary

Now that you understand the basic configuration, it's time to look at the scenario where BlueGreen truly shines. The most typical real-world case is when you need to change a DB schema without backward compatibility. With canary, you get a window where both the old version (expecting the old schema) and the new version (using the new schema) are looking at the same DB simultaneously. I experienced data collision errors in this exact situation, and since then I always use BlueGreen for releases like this.

With BlueGreen, only the old version handles production traffic until the cutover, so you can safely design a flow where you manually promote after confirming the DB migration is complete.

yaml
# rollout-db-migration.yaml (strategy section)
strategy:
  blueGreen:
    activeService: api-active
    previewService: api-preview
    autoPromotionEnabled: false   # manual promotion after confirming migration
    scaleDownDelaySeconds: 600    # keep Blue for 10 minutes to allow rollback window

The actual deployment flow looks like this. Here, kubectl argo rollouts set image from the Argo Rollouts kubectl plugin is used to update the Rollout image:

bash
# 1. Deploy new image (Green environment is created, traffic still goes to Blue)
kubectl argo rollouts set image api app=api:v2
 
# 2. Verify Green status via preview service (run from inside the cluster, where the Service name resolves)
curl --fail http://api-preview/health
 
# 3. After confirming DB migration is complete, manually promote
kubectl argo rollouts promote api
 
# 4. If issues arise, immediately roll back
kubectl argo rollouts undo api

Expand-Contract Pattern: Handling a breaking schema change safely takes two deployments. ① Expand: deploy an intermediate version that adds the new column while keeping the old one, so the code works against both. ② Contract: after confirming the code only uses the new column, deploy a version that removes the old column. Because BlueGreen clearly separates each stage, it is the deployment strategy that best fits the Contract phase of this pattern.
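To make the Expand step concrete, here is a minimal sketch of a one-off Kubernetes Job that adds a new column without touching the old one. Everything specific in it is an assumption for illustration: the post does not say which database is in use (PostgreSQL is assumed here), and the Job name, Secret name, table, and column are hypothetical.

```yaml
# Sketch of an Expand-phase migration, run before the new version is rolled out.
apiVersion: batch/v1
kind: Job
metadata:
  name: expand-add-display-name        # hypothetical name
spec:
  backoffLimit: 0                      # do not retry a failed migration automatically
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: migrate
        image: postgres:16             # assumption: PostgreSQL; used here only for its psql client
        envFrom:
        - secretRef:
            name: api-db-credentials   # hypothetical Secret providing PGHOST, PGUSER, PGPASSWORD, PGDATABASE
        command:
        - psql
        - -v
        - ON_ERROR_STOP=1              # make psql exit non-zero if the statement fails
        - -c
        - "ALTER TABLE users ADD COLUMN IF NOT EXISTS display_name text;"   # hypothetical table and column
```

Since the change is purely additive, the old version keeps working against the same table, which is what keeps the rollback path open until the Contract deployment.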

Example 3: Prometheus-Based Automatic Promotion Gate

Once you're comfortable with manual promotion, you can attach an AnalysisTemplate that automatically determines whether to promote based on metrics. In this example, success rate is measured 5 times at 30-second intervals. Assuming the first measurement is taken as soon as the analysis starts, the five measurements span about 2 minutes (4 intervals × 30 seconds), and each one looks back over a 2-minute rate window. These numbers are meant as a minimum measurement window: short enough to give fast feedback, long enough to average out transient spikes. When using this for the first time, it's safer to start with a lenient successCondition and then observe actual metric patterns before tightening the threshold.

yaml
# analysis-template.yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
  - name: service-name
  metrics:
  - name: success-rate
    interval: 30s
    count: 5
    successCondition: result[0] >= 0.95
    provider:
      prometheus:
        address: http://prometheus:9090
        query: |
          sum(rate(http_requests_total{
            service="{{ args.service-name }}",
            status!~"5.."
          }[2m]))
          /
          sum(rate(http_requests_total{
            service="{{ args.service-name }}"
          }[2m]))

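When this template is connected to the Rollout's prePromotionAnalysis, Green can be promoted only if its success rate stays at or above 95% (an error rate of at most 5%). With autoPromotionEnabled: true, promotion then proceeds automatically; with autoPromotionEnabled: false, as in Example 1, the Rollout still waits for a manual promote after the analysis passes. If the condition is not met, the Rollout is automatically aborted and Blue continues to handle production traffic. Keep in mind that the query only has data when requests are actually being recorded for the preview Service, for example from a load test or synthetic checks, because Green receives no production traffic before the cutover.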
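The rollout.yaml in Example 1 also references a smoke-test template in postPromotionAnalysis, which this post does not define. A minimal sketch using the Job provider could look like the following. The container image and its tag, and the assumption that the application answers on /health behind my-app-active, are mine rather than part of the original setup.

```yaml
# Sketch: a Job-based smoke test that passes when the Job completes successfully.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: smoke-test
spec:
  metrics:
  - name: smoke-test
    provider:
      job:
        spec:
          backoffLimit: 0                   # a single failed attempt fails the analysis
          template:
            spec:
              restartPolicy: Never
              containers:
              - name: smoke
                image: curlimages/curl:8.8.0   # assumption: any image that ships curl works
                command:
                - curl
                - --fail                    # exit non-zero on HTTP error responses
                - --silent
                - --show-error
                - http://my-app-active/health   # assumption: health endpoint path
```

Because this runs after the cutover, it hits the active Service and therefore checks Green while it is already serving production traffic.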


Pros and Cons Analysis

Advantages

| Item | Details |
|---|---|
| Instant cutover | No intermediate window where both versions serve production traffic simultaneously, which eliminates edge cases from version coexistence at the source |
| Instant rollback | While Blue is still running (within scaleDownDelaySeconds), rollback completes within seconds by simply re-pointing the service pointer |
| Isolated validation environment | previewService allows fully testing the new version without any production traffic |
| Clear operational state | Production traffic is always served by either Blue or Green, which minimizes operational complexity |

Disadvantages and Caveats

| Item | Details | Mitigation |
|---|---|---|
| 2x infrastructure cost | Two sets of ReplicaSets run simultaneously until promotion | Use previewReplicaCount to keep Green at minimum replicas, then scale up at promotion time |
| No real-traffic validation | The new version doesn't receive real user load before cutover, limiting predictions of production behavior | Run load tests separately against the preview environment, or validate some features with canary first |
| Session continuity issues | Sticky sessions may be broken after cutover | Establish a session reissuance strategy before cutover, or prefer a stateless design |
| scaleDownDelay misconfiguration | If too short, Blue's Pods may already be scaled down by the time you attempt a rollback, so the rollback has to wait for them to start again | In production, use 300 seconds or more instead of the default 30 seconds |

previewReplicaCount: You can specify the replica count for the Green environment separately via spec.strategy.blueGreen.previewReplicaCount. If cost is a concern, run Green at 1 replica for validation; Argo Rollouts scales it up to the full spec.replicas count at promotion time, before the active Service switches. With replicas: 3, the validation phase then runs 3 + 1 = 4 Pods instead of 6.


The Most Common Mistakes in Practice

  1. Leaving scaleDownDelaySeconds at the default value (30 seconds): I actually made this mistake in production once. I discovered a problem right after promotion and tried to roll back, but Blue had already been scaled down after just 30 seconds. That was a very long night. In production, I recommend keeping it at a minimum of 5 minutes (300 seconds).

  2. Setting autoPromotionEnabled: false but forgetting to include the promotion step in the CI/CD pipeline: Green is created but traffic never switches over, leaving two sets of ReplicaSets running indefinitely. Explicitly include the kubectl argo rollouts promote call in your pipeline.

  3. Skipping the Expand-Contract pattern when using BlueGreen with DB migrations: Even though BlueGreen guarantees an instantaneous cutover, you still need to separately design for whether a rollback to the previous version will conflict with the new schema. Without schema design that includes the rollback path, the fast rollback guarantee that BlueGreen provides becomes meaningless.
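For the second mistake, one way to make the promotion step explicit is a dedicated pipeline job. The sketch below assumes GitHub Actions purely for illustration (the post does not name a CI system), and it assumes kubectl, the Argo Rollouts kubectl plugin, and cluster credentials are already set up on the runner. The workflow file name and the environment name are hypothetical.

```yaml
# .github/workflows/promote.yaml (hypothetical file name)
name: promote-my-app
on:
  workflow_dispatch: {}          # started manually once Green has been validated
jobs:
  promote:
    runs-on: ubuntu-latest
    environment: production      # assumption: an environment configured with required reviewers
    steps:
      # Assumption: kubectl, the Argo Rollouts plugin, and a kubeconfig
      # for the target cluster are already available at this point.
      - name: Promote Green to active
        run: kubectl argo rollouts promote my-app
      - name: Wait for the Rollout to become healthy
        run: kubectl argo rollouts status my-app --timeout 10m
```

Keeping promotion as its own job means a forgotten approval shows up as a pending pipeline run rather than as two ReplicaSets quietly running side by side.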


Closing Thoughts

BlueGreen is the strategy you choose when there's a constraint that two versions must never serve production traffic simultaneously, while canary is the strategy you choose when you want to validate gradually with real users.

The two strategies are not competitors — they are tools you select based on the situation. In practice, more and more teams are combining BlueGreen for stability-critical services and canary for services that require high-frequency deployments and feature validation.

Three steps you can take right now:

  1. Install Argo Rollouts and configure the BlueGreen environment: Install Argo Rollouts on a local kind cluster (refer to the official documentation's installation guide), then apply active-service.yaml, preview-service.yaml, and rollout.yaml from above to see the basic BlueGreen flow in action. The Rollout references the success-rate and smoke-test AnalysisTemplates, so either create them first or remove the two analysis sections for a first run.

  2. Monitor Rollout status with the kubectl plugin: Use kubectl argo rollouts get rollout my-app --watch to watch in real time how the stages of Green creation → promotion → Blue removal progress. Running the promote, undo, and abort commands yourself is a natural way to get comfortable with the rollback flow.

  3. Connect an AnalysisTemplate: If you already have Prometheus, you can apply the success-rate template from Example 3 to attach an automatic promotion gate, after adjusting the Prometheus address and metric labels to your environment. Start with a lenient successCondition and observe actual metric values before tightening the threshold.


References

  • BlueGreen Deployment Strategy — Argo Rollouts official documentation — Reference for all BlueGreen configuration values
  • Blue/green deployment strategy with Argo Rollouts — Red Hat Developer — Practical explanation focused on real-world application examples
  • How to Automate Blue-Green & Canary Deployments with Argo Rollouts — Akuity — CI/CD pipeline automation patterns
  • Blue/green Versus Canary Deployments: 6 Differences And How To Choose — Octopus Deploy — A comprehensive guide comparing the differences between the two strategies
  • Chapter 1. Using Argo Rollouts for progressive deployment delivery — Red Hat OpenShift GitOps 1.11 — Reference for enterprise environment application
  • GitOps in 2025: From Old-School Updates to the Modern Way — CNCF — Latest GitOps trends and where Argo Rollouts fits
  • Blue-green vs canary deployments: safer API and DB changes — AppMaster — Strategy selection criteria for API and DB change scenarios
  • Progressive Delivery on Kubernetes: From Blue-Green to GitOps-Powered Rollouts — Medium — A walkthrough from BlueGreen to GitOps integration