DEV BAK - 기술블로그

Canary Deployments Across 500 Kubernetes Clusters Using Rancher Fleet and Argo Rollouts — Progressive Delivery That Limits Blast Radius by Partition

Honestly, even I felt pretty overwhelmed the first time I had to manage dozens of clusters simultaneously. Running a single canary deployment on one cluster isn't that hard, but the moment you scale to hundreds of clusters, everything changes. The fear of "what if this change blows up across all of production?" makes you hesitate to deploy, and before you know it, even two releases a month feels like a stretch — and every hotfix has everyone holding their breath.

In this post, we'll walk through a dual defense-line pattern that uses Rancher Fleet's ClusterGroup to control Blast Radius across clusters, and Argo Rollouts to progressively shift traffic within a cluster, with concrete YAML examples. Combining the two lets you automate a release wave flow — "validate on 3 clusters → 15 staging → 50 prod-wave-1 → remaining 300" — from a single Git commit. By the end of this post, you'll understand how to declaratively gate deployments by partition, how to configure metric-based automatic rollback within a cluster, and the real-world pitfalls you'll encounter when combining the two technologies.

Prerequisites: This post targets DevOps or platform engineers who are already operating Kubernetes. Familiarity with CRDs, Ingress controllers, basic kubectl usage, and foundational Prometheus concepts is assumed. If you're new to Kubernetes, we recommend reviewing the official documentation for core concepts first.


Core Concepts

What Is Fleet, and Why Do You Need ClusterGroup?

Rancher Fleet is a GitOps engine built by SUSE that lets you declaratively manage hundreds to thousands of Kubernetes clusters from a single Git repository. If you've ever wondered "I get GitOps, but does that mean I need 500 ArgoCD Applications for 500 clusters?" — Fleet's answer to that is the combination of ClusterGroup and GitRepo.

ClusterGroup is a CRD that groups clusters into logical groups based on labels. If you group 3 clusters labeled env: canary into a canary-clusters group, you can then declare deployment policies using the group name rather than individual cluster names.

yaml
# ClusterGroup example
apiVersion: fleet.cattle.io/v1alpha1
kind: ClusterGroup
metadata:
  name: canary-clusters
  namespace: fleet-default
spec:
  selector:
    matchLabels:
      env: canary

There are several ways to label clusters. You can add labels in the Rancher UI's cluster settings screen, or label Fleet's Cluster resource directly with kubectl.

bash
# Add labels to the Fleet Cluster resource.
# The resource name is fully qualified because Rancher defines several "cluster" resource types.
kubectl label clusters.fleet.cattle.io my-cluster-01 env=canary wave-number=1 -n fleet-default

Fleet's fleet.yaml has a field called rolloutStrategy, which Fleet carries into the Bundle it generates from the GitRepo. Here you can declare deployment order and acceptable failure thresholds per partition, and deployment to the next partition is held back until the previous partition is Ready. It's essentially declaring cluster-level deployment gates in Git.

Blast Radius: the number of clusters actually affected, or the percentage of users impacted, when a deployment fails or introduces a bug. Keeping this number deliberately small is the core of Progressive Delivery.
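To make the definition concrete, here is a minimal Python sketch that computes how much of the fleet has been exposed after each wave. The wave sizes are the ones used in this post (3, 15, 50, 300); the function name cumulative_blast_radius is hypothetical and only serves the illustration.

```python
from itertools import accumulate

# Wave sizes taken from the release flow described in this post.
WAVES = [("canary", 3), ("staging", 15), ("prod-wave-1", 50), ("prod-wave-2", 300)]


def cumulative_blast_radius(waves):
    """Return (wave name, clusters exposed so far, share of the fleet) per wave."""
    total = sum(size for _, size in waves)
    exposed = accumulate(size for _, size in waves)
    return [(name, seen, seen / total) for (name, _), seen in zip(waves, exposed)]


if __name__ == "__main__":
    for name, seen, share in cumulative_blast_radius(WAVES):
        print(f"{name:<12} {seen:>4} clusters exposed ({share:.1%})")
```

With these numbers, a bad release caught in the canary wave touches 3 of 368 clusters, which is under 1% of the fleet.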


What Does Argo Rollouts Do Within a Single Cluster?

Argo Rollouts is a controller installed within a cluster that uses a Rollout CRD instead of the standard Kubernetes Deployment. When you declare a canary strategy, it shifts traffic in stages — 5% → 20% → 50% → 100% — and can insert AnalysisRun steps between each stage to query Prometheus or Datadog metrics.

yaml
# Rollout canary steps overview
steps:
  - setWeight: 5        # Route 5% of traffic to the new version
  - analysis: ...       # Analyze success rate and latency
  - pause: {duration: 5m}
  - setWeight: 20
  - analysis: ...
  - setWeight: 100

If an AnalysisRun fails, Argo Rollouts automatically reverts traffic to the previous version — no manual intervention required. This automation serves as the second line of defense for limiting Blast Radius within a cluster.

Separation of concerns between the two technologies: Fleet controls "which clusters to deploy to," while Argo Rollouts controls "how much traffic to shift within those clusters." Because they operate at different layers, they combine without conflict.


Where the Two Technologies Meet: From a Single Git Commit to Hundreds of Clusters

Seeing the full flow in one view makes it much easier to understand how everything fits together.

text
Git commit pushed
  └─ Fleet GitRepo detects change
       ├─ Partition 1: canary-clusters (3 clusters)
       │    └─ Argo Rollouts: 5% traffic → AnalysisRun → auto-promotion
       │         [Confirm Partition 1 Ready, then proceed to next partition]
       ├─ Partition 2: staging (15 clusters)
       │    └─ Argo Rollouts: 20% → 50% → 100%
       │         [Confirm Partition 2 Ready, then proceed to next partition]
       ├─ Partition 3: prod-wave-1 (50 clusters, us-east)
       └─ Partition 4: prod-wave-2 (remaining 300 clusters)

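maxUnavailablePartitions caps how many partitions may be unavailable (NotReady) at the same time; once more partitions than that are unavailable, Fleet pauses the rollout. Watch the off-by-one: 0, which is Fleet's default, pauses on the first NotReady partition, while 1, the value used in the fleet.yaml in Example 1, tolerates one NotReady partition before pausing. Under the strict setting, if something goes wrong in the 3 canary clusters, the change never reaches staging. It's worth verifying the value you pick against the Fleet rollout documentation for your version.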
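The gating idea can be sketched as a toy model in Python. It assumes the strictest behavior, where the rollout stops at the first partition that does not become Ready. The names run_waves and is_ready are hypothetical, and this is a simplification for intuition, not Fleet's controller logic.

```python
def run_waves(partitions, is_ready):
    """Deploy partitions in order and stop at the first one that is not Ready.

    partitions: ordered list of partition names.
    is_ready:   callable(name) -> bool, a stand-in for Fleet's readiness check.
    Returns (deployed, blocked): partitions that received the change, and
    partitions that never started.
    """
    deployed = []
    for index, name in enumerate(partitions):
        deployed.append(name)  # the change reaches this partition
        if not is_ready(name):
            # Everything after the failing partition is held back.
            return deployed, partitions[index + 1:]
    return deployed, []


PARTITIONS = ["canary", "staging", "prod-wave-1", "prod-wave-2"]

if __name__ == "__main__":
    # Simulate a release that breaks the canary partition.
    print(run_waves(PARTITIONS, lambda name: name != "canary"))
```

In this model a failure in the canary partition leaves staging and both production waves untouched, which is exactly the property the partition order is meant to buy.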

There's one important nuance to address here. The criterion Fleet uses to determine partition completion is the Ready state of resources within the bundle. For a standard Deployment, this is clearly determined via the Available condition, but Argo Rollouts' Rollout CRD behaves differently. A Rollout can report Available: True even while canary steps are in progress, as long as stable-version pods are running. This means Fleet could recognize a partition as Ready and begin deploying to the next partition before the canary analysis has finished.

This is something I struggled with quite a bit when first applying this pattern. A practical approach to dealing with this problem is covered in Example 2 below.
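One way to reason about the gap is to write down what "really finished" means for a Rollout. The Python sketch below assumes the Rollout status exposes phase, stableRS, and currentPodHash, and treats a rollout as done only when the new pod hash has become the stable one. The function name rollout_fully_promoted is hypothetical; verify the field names against the Argo Rollouts version you run.

```python
def rollout_fully_promoted(status):
    """Return True only when the new version has become the stable version.

    status: the .status mapping of a Rollout object, for example parsed from
    `kubectl get rollout my-app -o json`.
    """
    current = status.get("currentPodHash")
    return (
        status.get("phase") == "Healthy"
        and current is not None
        and status.get("stableRS") == current
    )


if __name__ == "__main__":
    # Mid-canary: stable pods still serve the old hash, so this is not done.
    print(rollout_fully_promoted(
        {"phase": "Paused", "stableRS": "abc123", "currentPodHash": "def456"}
    ))
```

A check like this is stricter than "stable pods are available", which is the condition that can make a partition look Ready too early.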


Practical Application

Example 1: Fleet fleet.yaml — Declaring a Partition Rollout Strategy

Partitions are defined in the fleet.yaml at the root of the Fleet repository. You can use either a ClusterGroup name or a cluster label selector.

yaml
# fleet.yaml
rolloutStrategy:
  maxUnavailablePartitions: 1
  partitions:
    - name: canary
      clusterGroup: canary-clusters       # Reference to ClusterGroup CRD
      maxUnavailable: 1                   # Allow at most 1 cluster to update simultaneously
 
    - name: staging
      clusterGroupSelector:
        matchLabels:
          env: staging                    # Dynamically select group by label
      maxUnavailable: "30%"
 
    - name: prod-wave-1
      clusterSelector:
        matchLabels:
          region: us-east
          wave-number: "1"               # Dedicated wave-number label to avoid overlap
          env: production
      maxUnavailable: "10%"
 
    - name: prod-wave-2
      clusterSelector:
        matchLabels:
          wave-number: "2"               # Clearly distinct from prod-wave-1
          env: production
      maxUnavailable: "5%"

| Field | Meaning |
| --- | --- |
| maxUnavailablePartitions | Number of partitions allowed to be NotReady simultaneously; once that number is exceeded, Fleet pauses the rollout |
| clusterGroup | References a pre-defined ClusterGroup CRD by name |
| clusterGroupSelector | Dynamically selects a ClusterGroup by label |
| clusterSelector | Selects clusters directly by label |
| maxUnavailable | Number or percentage of clusters within the partition allowed to update simultaneously |

Did you notice that prod-wave-1 and prod-wave-2 use separate wave-number labels? If you split partitions by env: production alone, the same cluster ends up included in both partitions. Our team explicitly manages this separation using a dedicated wave-number label.
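Overlaps like this are easy to catch before they reach Fleet. The sketch below is a small Python check, assuming plain matchLabels semantics (every key/value in the selector must be present on the cluster). The cluster names, labels, and the helper find_overlaps are hypothetical and only illustrate the idea.

```python
def matches(selector, labels):
    """matchLabels semantics: every selector key/value must be present on the cluster."""
    return all(labels.get(key) == value for key, value in selector.items())


def find_overlaps(partitions, clusters):
    """Return {cluster name: [partition names]} for clusters selected by 2+ partitions."""
    overlaps = {}
    for cluster, labels in clusters.items():
        hits = [name for name, selector in partitions.items() if matches(selector, labels)]
        if len(hits) > 1:
            overlaps[cluster] = hits
    return overlaps


# Hypothetical inventory; cluster names and labels are illustrative only.
CLUSTERS = {
    "prod-use1-01": {"env": "production", "region": "us-east", "wave-number": "1"},
    "prod-euw1-01": {"env": "production", "region": "eu-west", "wave-number": "2"},
}

# Splitting by env alone: both partitions select every production cluster.
NAIVE = {"prod-wave-1": {"env": "production"}, "prod-wave-2": {"env": "production"}}

# Dedicated wave-number labels, as in the fleet.yaml above.
SAFE = {
    "prod-wave-1": {"env": "production", "region": "us-east", "wave-number": "1"},
    "prod-wave-2": {"env": "production", "wave-number": "2"},
}

if __name__ == "__main__":
    print("naive:", find_overlaps(NAIVE, CLUSTERS))
    print("safe: ", find_overlaps(SAFE, CLUSTERS))
```

Running a check like this in CI against an export of your cluster labels is one option for keeping partition selectors disjoint as the fleet grows.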


Example 2: Argo Rollouts Canary Within a Cluster — Including Metric Analysis

Applications deployed to each cluster are declared as Rollout instead of Deployment. Rather than hardcoding the image tag as shown below, we recommend injecting it via Helm values or Kustomize overlays in a real GitOps workflow.

yaml
# rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  replicas: 10
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: my-app:v2.0.0  # In a real GitOps environment, inject this via Helm values
  strategy:
    canary:
      canaryService: my-app-canary      # Service for canary traffic
      stableService: my-app-stable      # Service for stable version traffic
      trafficRouting:
        nginx:
          stableIngress: my-app-ingress
          annotationPrefix: nginx.ingress.kubernetes.io
      steps:
        - setWeight: 5
        - analysis:
            templates:
              - templateName: success-rate-check
            args:
              - name: service-name
                value: my-app-canary
        - pause:
            duration: 5m
        - setWeight: 20
        - analysis:
            templates:
              - templateName: latency-check
        - setWeight: 100
        - pause: {}  # Wait for manual promote after canary completes — ties into Fleet partition gating

Pay close attention to the final pause: {} step. This is one approach to mitigating the Fleet Rollout Ready detection issue mentioned earlier. By leaving the rollout in a state waiting for manual promotion after all canary steps complete, you can explicitly control when Fleet determines the partition is done. This is especially useful for teams that want a final confirmation via kubectl argo rollouts promote my-app before proceeding to the next partition.

Many teams integrate this with GitHub Actions so that a specific person must approve before the promote is executed. However, adding an approval process makes deployments dependent on people, so personally I prefer to build a sufficiently rigorous AnalysisTemplate and rely on automatic promotion. Human bottlenecks inevitably lead to late-night phone calls.

yaml
# analysistemplate.yaml — Prometheus-based success rate analysis
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate-check
spec:
  args:
    - name: service-name
  metrics:
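    - name: success-rate
      interval: 1m
      count: 5                              # Take 5 measurements, then finish; with interval but no count the metric runs indefinitely
      successCondition: result[0] >= 0.99   # Maintain success rate of 99% or higher
      failureLimit: 3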
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{service="{{args.service-name}}",status!~"5.."}[2m])) /
            sum(rate(http_requests_total{service="{{args.service-name}}"}[2m]))

| Component | Role |
| --- | --- |
| canaryService / stableService | Two Services that NGINX Ingress uses to split traffic |
| setWeight | Percentage of traffic (%) to send to the canary Service |
| analysis | References an AnalysisTemplate to perform metric checks |
| pause | Wait time before the next step, or wait for manual approval |
| failureLimit | Number of analysis failures to tolerate |

AnalysisRun — The metric collection and evaluation job that Argo Rollouts executes between canary stages. It integrates with various providers including Prometheus, Datadog, and CloudWatch, and the evaluation result determines whether to automatically promote or roll back.
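The interplay between successCondition and failureLimit is easier to see in code. The Python sketch below is a simplified model under stated assumptions: one success-rate sample per interval, and a metric that fails once the number of failed measurements exceeds failureLimit. The function evaluate is hypothetical, and the real controller tracks more outcomes (such as errors and inconclusive results) than this model does.

```python
def evaluate(measurements, threshold=0.99, failure_limit=3):
    """Return "Failed" once failed measurements exceed failure_limit, else "Successful".

    measurements: success-rate samples, one per interval.
    A NaN sample counts as failed in this sketch because `nan >= threshold` is False.
    """
    failures = 0
    for value in measurements:
        if not value >= threshold:
            failures += 1
            if failures > failure_limit:
                return "Failed"
    return "Successful"


if __name__ == "__main__":
    # Three dips are tolerated with failure_limit=3; the fourth triggers the rollback path.
    print(evaluate([0.90, 0.90, 0.90, 0.995, 0.995]))
    print(evaluate([0.90, 0.90, 0.90, 0.90]))
```

Seen this way, failureLimit: 3 means the fourth failed measurement is the one that fails the analysis, so a tighter limit makes the canary more sensitive to short error spikes.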


Example 3: Argo CD ApplicationSet Progressive Sync — Using Only ArgoCD Without Fleet

Read this example as an answer to the question "when is Argo CD alone sufficient, without Fleet?" ApplicationSet alone is often enough in the following cases:

  • You're already using Argo CD as your organization-wide GitOps engine and the operational overhead of introducing Fleet is prohibitive
  • Your cluster count is under 100 and Fleet's large-scale extensibility isn't a pressing need
  • The RollingSync feature of ArgoCD ApplicationSet already satisfies your wave deployment requirements

Conversely, if your cluster count exceeds 500, or if you already have Fleet's bundle-based management in place, the Example 1+2 combination is more appropriate than the example below.

yaml
# applicationset.yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: my-app-fleet
spec:
  generators:
    - clusters:
        selector:
          matchExpressions:
            - key: env
              operator: In
              values: [canary, staging, production]
  strategy:
    type: RollingSync
    rollingSync:
      steps:
        - matchExpressions:
            - key: env
              operator: In
              values: [canary]
          maxUpdate: 1              # Sync canary clusters one at a time sequentially
 
        - matchExpressions:
            - key: env
              operator: In
              values: [staging]
          maxUpdate: "30%"         # Sync staging clusters in batches of 30%
 
        - matchExpressions:
            - key: env
              operator: In
              values: [production]
          maxUpdate: "10%"         # Wave deploy production clusters in batches of 10%
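  template:
    metadata:
      name: "my-app-{{name}}"
      labels:
        env: "{{metadata.labels.env}}"   # RollingSync steps match Application labels, so copy the cluster's env label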
    spec:
      project: default
      source:
        repoURL: https://github.com/my-org/my-app
        path: k8s/
        targetRevision: HEAD
      destination:
        server: "{{server}}"
        namespace: my-app

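Progressive Syncs, the ApplicationSet feature behind RollingSync, was introduced as an alpha feature and has to be enabled explicitly on the ApplicationSet controller. Its maturity level depends on the Argo CD release you run, so we recommend checking the current status directly in the official Argo CD documentation.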
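RollingSync steps select the generated Applications by their labels, so it helps to picture the steps as ordered label filters. The Python sketch below models only the In operator on a single key; the Application names, the APPS inventory, and the helper plan are hypothetical.

```python
def apps_for_step(step_values, apps, key="env"):
    """Select Applications whose label `key` is In `step_values`, like one RollingSync step."""
    return [name for name, labels in apps.items() if labels.get(key) in step_values]


def plan(steps, apps, key="env"):
    """Return the ordered list of Application batches, plus apps that no step selects."""
    ordered = [apps_for_step(values, apps, key) for values in steps]
    selected = {name for batch in ordered for name in batch}
    unmatched = [name for name in apps if name not in selected]
    return ordered, unmatched


# Hypothetical generated Applications and their labels.
APPS = {
    "my-app-c1": {"env": "canary"},
    "my-app-s1": {"env": "staging"},
    "my-app-p1": {"env": "production"},
    "my-app-x": {},  # no env label on the Application
}
STEPS = [["canary"], ["staging"], ["production"]]

if __name__ == "__main__":
    ordered, unmatched = plan(STEPS, APPS)
    print("sync order:", ordered)
    print("selected by no step:", unmatched)
```

The unlabeled Application lands in no step, which is why the labels on the generated Applications deserve a second look whenever a wave appears to be skipped.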


Pros and Cons Analysis

Advantages

| Item | Details |
| --- | --- |
| Dual-layer Blast Radius control | Fleet prevents spread across clusters and Argo Rollouts controls traffic within a cluster, two layers operating independently |
| Fully declarative | All deployment policies live in Git, which keeps drift in check and makes history tracking and rollback a natural part of the workflow |
| Automatic rollback | Traffic reverts automatically without manual intervention when an AnalysisRun fails |
| Progressive confidence building | Validating on a small set of canary clusters before expanding lets you build confidence in a release step by step |
| Visibility | The Fleet UI gives a bird's-eye view of bundle state across hundreds of clusters, while the Argo Rollouts Dashboard shows step-by-step canary progress inside each cluster |

Disadvantages and Caveats

| Item | Details | Mitigation |
| --- | --- | --- |
| Increased operational complexity | Burden of managing version compatibility across Fleet, Argo CD, and Argo Rollouts | Our team maintains a compatibility matrix directly on a single Confluence page. Checking this table before any upgrade has become a habit |
| Rollout CRD Ready detection issue | Fleet may interpret the completion state of a Rollout differently from a standard Deployment, causing partition gating to not behave as expected | Add pause: {} at the end of canary steps, or configure a Fleet custom health check |
| Argo Rollouts is cluster-local | Cannot observe global multi-cluster state with Argo Rollouts alone | Build a separate higher-level view in Fleet or Argo CD dashboards |
| Partition completion threshold design | If maxUnavailable thresholds are too strict, a single cluster failure can halt the entire rollout for extended periods | Set thresholds differently based on cluster characteristics (strict for canary, lenient for prod-wave-2) |
| Monitoring stack must be pre-deployed | Prometheus and other dependencies of AnalysisRun must be deployed on all clusters | Managing the monitoring stack itself via GitOps with Fleet ensures consistency |
| Fleet bundle controller performance | Bundle reconciliation storms can occur at thousands-of-clusters scale | Refer to SUSE's published experiment results and tune batch sizes and partition design to fit your environment |

Most Common Mistakes in Practice

  1. Designing overlapping labels for canary and production clusters so the same cluster is included in two partitions: Fleet partitions with overlapping cluster selections cause unexpected behavior. Using dedicated labels like wave-number: "1" and wave-number: "2" is far safer. I wasted a fair amount of time on this early on when partitions were getting mixed together.

  2. Declaring only setWeight without an AnalysisTemplate: Traffic gets shifted, but without any validation logic the canary loses its purpose. We recommend setting at least one analysis condition, such as success rate. Even under pressure to "just ship it," this is the one thing you shouldn't skip.

  3. Combining maxUnavailablePartitions: 0 with very strict per-partition maxUnavailable values, so that one failed cluster halts the entire deployment indefinitely: A partition counts as unavailable once more of its clusters are NotReady than its maxUnavailable allows, and with maxUnavailablePartitions at 0 a single unavailable partition pauses the rollout. Given the inherent instability of cluster infrastructure, it's more realistic to let the larger production partitions tolerate at least 1–2 cluster failures.


Closing Thoughts

When I talk to teams that have adopted this pattern, the biggest change they report is that deployment cycles got shorter and the tension during hotfixes noticeably decreased. Once you have the confidence that changes are validated first on 3 canary clusters, deployment stops being a dreaded event and becomes routine work. One team that had been deploying to 300 clusters all at once adopted this structure and went from 2 releases per month to 3 per week.

You don't need to migrate all 300 clusters to this pattern from the start. Here are 3 steps you can take right now:

  1. Label 3–5 existing clusters with env: canary and add a partitions block to fleet.yaml: Partition configuration can be declared without affecting existing deployments, so the barrier is low. Start by creating the ClusterGroup CRD and verifying that the label selector picks up the intended clusters.

  2. Install the Argo Rollouts controller via Helm on a single cluster, then convert your lowest-traffic service to a Rollout CRD: Add the chart repository with helm repo add argo https://argoproj.github.io/argo-helm, install with helm install argo-rollouts argo/argo-rollouts -n argo-rollouts --create-namespace, then verify a basic setWeight: 5 → pause → setWeight: 100 step sequence first.

  3. Write a single AnalysisTemplate based on Prometheus success rate and connect it to the Rollout: We recommend starting with a relaxed threshold like successCondition: result[0] >= 0.95. Writing the metric query yourself gives you an intuitive feel for how AnalysisRun decides to roll back, and as you gradually raise the threshold, you'll reach a point where you realize your monitoring stack isn't sufficient. That's exactly when you're ready to move to the next stage.


References

  • Fleet Rollout Strategy Official Documentation | Rancher Fleet
  • Fleet fleet.yaml Reference | Rancher Fleet
  • Fleet Custom Resources Spec (ClusterGroup CRD) | Rancher Fleet
  • Canary Releases with Rancher Continuous Delivery | SUSE Blog
  • Scaling Kubernetes GitOps with Fleet — Experiment Results | SUSE Blog
  • Argo Rollouts Official Documentation — Canary Strategy
  • Argo Rollouts Best Practices
  • Argo CD Progressive Syncs Official Documentation
  • Argo CD Sync Waves Official Documentation
  • Karmada + Argo Rollouts: From Canary To Global | ArgoCon Europe 2026
  • Building a Fleet with ArgoCD and GKE | Google Cloud Blog
  • ACK One + Argo Rollouts Canary Release | Alibaba Cloud
  • Canary delivery with Argo Rollouts and Amazon VPC Lattice | AWS Blog
  • Progressive Delivery Pipeline with GitHub Actions and Argo Rollouts
  • Multi-Cluster GitOps: Fleet Provisioning and Bootstrapping | AWS Blog