Canary Deployments Across 500 Kubernetes Clusters Using Rancher Fleet and Argo Rollouts — Progressive Delivery That Limits Blast Radius by Partition
Honestly, even I felt pretty overwhelmed the first time I had to manage dozens of clusters simultaneously. Running a single canary deployment on one cluster isn't that hard, but the moment you scale to hundreds of clusters, everything changes. The fear of "what if this change blows up across all of production?" makes you hesitate to deploy, and before you know it, even two releases a month feels like a stretch — and every hotfix has everyone holding their breath.
In this post, we'll walk through a dual defense-line pattern that uses Rancher Fleet's ClusterGroup to control Blast Radius across clusters, and Argo Rollouts to progressively shift traffic within a cluster, with concrete YAML examples. Combining the two lets you automate a release wave flow — "validate on 3 clusters → 15 staging → 50 prod-wave-1 → remaining 300" — from a single Git commit. By the end of this post, you'll understand how to declaratively gate deployments by partition, how to configure metric-based automatic rollback within a cluster, and the real-world pitfalls you'll encounter when combining the two technologies.
Prerequisites: This post targets DevOps or platform engineers who are already operating Kubernetes. Familiarity with CRDs, Ingress controllers, basic kubectl usage, and foundational Prometheus concepts is assumed. If you're new to Kubernetes, we recommend reviewing the official documentation for core concepts first.
Core Concepts
What Is Fleet, and Why Do You Need ClusterGroup?
Rancher Fleet is a GitOps engine built by SUSE that lets you declaratively manage hundreds to thousands of Kubernetes clusters from a single Git repository. If you've ever wondered "I get GitOps, but does that mean I need 500 ArgoCD Applications for 500 clusters?" — Fleet's answer to that is the combination of ClusterGroup and GitRepo.
ClusterGroup is a CRD that groups clusters into logical groups based on labels. If you group 3 clusters labeled env: canary into a canary-clusters group, you can then declare deployment policies using the group name rather than individual cluster names.
```yaml
# ClusterGroup example
apiVersion: fleet.cattle.io/v1alpha1
kind: ClusterGroup
metadata:
  name: canary-clusters
  namespace: fleet-default
spec:
  selector:
    matchLabels:
      env: canary
```

There are several ways to label clusters. You can add labels directly in the Rancher UI's cluster settings screen, or patch the Cluster CRD directly with kubectl.
```shell
# Add labels to the Fleet Cluster resource (clusters.fleet.cattle.io).
# The fully qualified resource name avoids ambiguity with Rancher's other Cluster CRDs.
kubectl label clusters.fleet.cattle.io my-cluster-01 env=canary wave-number=1 -n fleet-default
```

The fleet.yaml that a Fleet GitRepo deploys has a section called rolloutStrategy. Here you can declare deployment order and acceptable failure thresholds per partition, and deployment to the next partition is blocked until the previous partition completes successfully. It's essentially declaring cluster-level deployment gates in Git.
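For orientation, here is a minimal sketch of a GitRepo that points Fleet at a repository and targets the cluster groups discussed so far. The repository URL and path are borrowed from Example 3 later in this post, and the branch name and target names are assumptions for illustration. The sketch deliberately contains no rollout settings; those are declared in fleet.yaml, as shown in Example 1.

```yaml
# gitrepo.yaml: a sketch, not a drop-in manifest
apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata:
  name: my-app
  namespace: fleet-default
spec:
  repo: https://github.com/my-org/my-app
  branch: main                        # Assumed branch name
  paths:
    - k8s                             # Directory that contains fleet.yaml and the manifests
  targets:
    - name: canary                    # Target names are illustrative
      clusterGroup: canary-clusters   # The ClusterGroup defined above
    - name: staging
      clusterSelector:
        matchLabels:
          env: staging
    - name: production
      clusterSelector:
        matchLabels:
          env: production
```

Only clusters matched by a target receive the bundle, so the partitions in fleet.yaml should select from the same set of clusters that the targets cover.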
Blast Radius — refers to the number of clusters actually affected, or the percentage of users impacted, when a deployment fails or introduces a bug. Intentionally minimizing this number is the core of Progressive Delivery.
What Does Argo Rollouts Do Within a Single Cluster?
Argo Rollouts is a controller installed within a cluster that uses a Rollout CRD instead of the standard Kubernetes Deployment. When you declare a canary strategy, it shifts traffic in stages — 5% → 20% → 50% → 100% — and can insert AnalysisRun steps between each stage to query Prometheus or Datadog metrics.
```yaml
# Rollout canary steps overview
steps:
  - setWeight: 5          # Route 5% of traffic to the new version
  - analysis: ...         # Analyze success rate and latency
  - pause: {duration: 5m}
  - setWeight: 20
  - analysis: ...
  - setWeight: 100
```

If an AnalysisRun fails, Argo Rollouts automatically reverts traffic to the previous version, with no manual intervention required. This automation serves as the second line of defense for limiting Blast Radius within a cluster.
Separation of concerns between the two technologies: Fleet controls "which clusters to deploy to," while Argo Rollouts controls "how much traffic to shift within those clusters." Because they operate at different layers, they combine without conflict.
Where the Two Technologies Meet: From a Single Git Commit to Hundreds of Clusters
Seeing the full flow in one view makes it much easier to understand how everything fits together.
```text
Git commit pushed
└─ Fleet GitRepo detects change
   ├─ Partition 1: canary-clusters (3 clusters)
   │   └─ Argo Rollouts: 5% traffic → AnalysisRun → auto-promotion
   │   [Confirm Partition 1 Ready, then proceed to next partition]
   ├─ Partition 2: staging (15 clusters)
   │   └─ Argo Rollouts: 20% → 50% → 100%
   │   [Confirm Partition 2 Ready, then proceed to next partition]
   ├─ Partition 3: prod-wave-1 (50 clusters, us-east)
   └─ Partition 4: prod-wave-2 (remaining 300 clusters)
```

With maxUnavailablePartitions: 1, the entire rollout is paused the moment any partition enters a NotReady state. If something goes wrong in the 3 canary clusters, no deployment proceeds past staging.
There's one important nuance to address here. The criterion Fleet uses to determine partition completion is the Ready state of resources within the bundle. For a standard Deployment, this is clearly determined via the Available condition, but Argo Rollouts' Rollout CRD behaves differently. A Rollout can report Available: True even while canary steps are in progress, as long as stable-version pods are running. This means Fleet could recognize a partition as Ready and begin deploying to the next partition before the canary analysis has finished.
This is something I struggled with quite a bit when first applying this pattern. A practical approach to dealing with this problem is covered in Example 2 below.
Practical Application
Example 1: Fleet fleet.yaml — Declaring a Partition Rollout Strategy
Partitions are defined in the fleet.yaml at the root of the Fleet repository. You can use either a ClusterGroup name or a cluster label selector.
```yaml
# fleet.yaml
rolloutStrategy:
  maxUnavailablePartitions: 1
  partitions:
    - name: canary
      clusterGroup: canary-clusters   # Reference to ClusterGroup CRD
      maxUnavailable: 1               # Allow at most 1 cluster to update simultaneously
    - name: staging
      clusterGroupSelector:
        matchLabels:
          env: staging                # Dynamically select group by label
      maxUnavailable: "30%"
    - name: prod-wave-1
      clusterSelector:
        matchLabels:
          region: us-east
          wave-number: "1"            # Dedicated wave-number label to avoid overlap
          env: production
      maxUnavailable: "10%"
    - name: prod-wave-2
      clusterSelector:
        matchLabels:
          wave-number: "2"            # Clearly distinct from prod-wave-1
          env: production
      maxUnavailable: "5%"
```

| Field | Meaning |
|---|---|
| maxUnavailablePartitions | Number of partitions allowed to be NotReady simultaneously (1 means only 1 partition proceeds at a time) |
| clusterGroup | References a pre-defined ClusterGroup CRD by name |
| clusterGroupSelector | Dynamically selects a ClusterGroup by label |
| maxUnavailable | Number or percentage of clusters within the partition allowed to update simultaneously |
Did you notice that prod-wave-1 and prod-wave-2 use separate wave-number labels? If you split partitions by env: production alone, the same cluster ends up included in both partitions. Our team explicitly manages this separation using a dedicated wave-number label.
Example 2: Argo Rollouts Canary Within a Cluster — Including Metric Analysis
Applications deployed to each cluster are declared as Rollout instead of Deployment. Rather than hardcoding the image tag as shown below, we recommend injecting it via Helm values or Kustomize overlays in a real GitOps workflow.
```yaml
# rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  replicas: 10
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: my-app:v2.0.0   # In a real GitOps environment, inject this via Helm values
  strategy:
    canary:
      canaryService: my-app-canary   # Service for canary traffic
      stableService: my-app-stable   # Service for stable version traffic
      trafficRouting:
        nginx:
          stableIngress: my-app-ingress
          annotationPrefix: nginx.ingress.kubernetes.io
      steps:
        - setWeight: 5
        - analysis:
            templates:
              - templateName: success-rate-check
            args:
              - name: service-name
                value: my-app-canary
        - pause:
            duration: 5m
        - setWeight: 20
        - analysis:
            templates:
              - templateName: latency-check
        - setWeight: 100
        - pause: {}   # Wait for manual promote after canary completes; ties into Fleet partition gating
```

Pay close attention to the final pause: {} step. This is one approach to mitigating the Fleet Rollout Ready detection issue mentioned earlier. By leaving the rollout in a state waiting for manual promotion after all canary steps complete, you can explicitly control when Fleet determines the partition is done. This is especially useful for teams that want a final confirmation via kubectl argo rollouts promote my-app before proceeding to the next partition.
Many teams integrate this with GitHub Actions so that a specific person must approve before the promote is executed. However, adding an approval process makes deployments dependent on people, so personally I prefer to build a sufficiently rigorous AnalysisTemplate and rely on automatic promotion. Human bottlenecks inevitably lead to late-night phone calls.
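The Rollout above references two Services and an Ingress by name but does not define them. A minimal sketch of those supporting objects follows; the host name, port numbers, and ingress class are assumptions for illustration, not values from this post. Both Services start with the same selector, and Argo Rollouts narrows each one to the stable or canary ReplicaSet at runtime.

```yaml
# services-and-ingress.yaml: supporting objects for the Rollout (sketch)
apiVersion: v1
kind: Service
metadata:
  name: my-app-stable
spec:
  selector:
    app: my-app
  ports:
    - port: 80
      targetPort: 8080        # Assumed container port
---
apiVersion: v1
kind: Service
metadata:
  name: my-app-canary
spec:
  selector:
    app: my-app
  ports:
    - port: 80
      targetPort: 8080        # Assumed container port
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app-ingress
spec:
  ingressClassName: nginx     # Assumed ingress class
  rules:
    - host: my-app.example.com   # Hypothetical host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-app-stable   # The stable Ingress points at the stable Service
                port:
                  number: 80
```

With NGINX traffic routing, the controller manages a second, canary Ingress derived from this one, so only the stable Ingress needs to live in Git.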
```yaml
# analysistemplate.yaml: Prometheus-based success rate analysis
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate-check
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 1m
      count: 5   # Bound the run; with interval but no count, the metric keeps measuring and an inline analysis step would not finish
      successCondition: result[0] >= 0.99   # Maintain success rate of 99% or higher
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{service="{{args.service-name}}",status!~"5.."}[2m])) /
            sum(rate(http_requests_total{service="{{args.service-name}}"}[2m]))
```

| Component | Role |
|---|---|
| canaryService / stableService | Two Services that NGINX Ingress uses to split traffic |
| setWeight | Percentage of traffic (%) to send to the canary Service |
| analysis | References an AnalysisTemplate to perform metric checks |
| pause | Wait time before the next step, or wait for manual approval |
| failureLimit | Number of analysis failures to tolerate |
AnalysisRun — The metric collection and evaluation job that Argo Rollouts executes between canary stages. It integrates with various providers including Prometheus, Datadog, and CloudWatch, and the evaluation result determines whether to automatically promote or roll back.
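The Rollout in Example 2 also references a latency-check template that is not shown. One possible shape for it is sketched below. The histogram metric name, the 500 ms threshold, and the measurement count are assumptions; substitute whatever your services actually export. Because the Rollout passes no args to this template, the argument carries a default value.

```yaml
# latency-check.yaml: hypothetical p95 latency analysis (sketch)
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: latency-check
spec:
  args:
    - name: service-name
      value: my-app-canary    # Default used when the Rollout step supplies no args
  metrics:
    - name: p95-latency
      interval: 1m
      count: 5                # Take five measurements, then finish
      successCondition: result[0] <= 0.5   # Assumed threshold: p95 at or below 500 ms
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            histogram_quantile(0.95,
              sum(rate(http_request_duration_seconds_bucket{service="{{args.service-name}}"}[2m])) by (le))
```

At 5% canary weight a low-traffic service may record too few requests for a meaningful percentile, so it can be worth running this check only at the later, higher-weight steps, as Example 2 does.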
Example 3: Argo CD ApplicationSet Progressive Sync — Using Only ArgoCD Without Fleet
This section should be read alongside the decision criteria for "when is ArgoCD alone sufficient instead of Fleet?" ApplicationSet alone is often enough in the following cases:
- You're already using Argo CD as your organization-wide GitOps engine and the operational overhead of introducing Fleet is prohibitive
- Your cluster count is under 100 and Fleet's large-scale extensibility isn't a pressing need
- The RollingSync feature of ArgoCD ApplicationSet already satisfies your wave deployment requirements
Conversely, if your cluster count exceeds 500, or if you already have Fleet's bundle-based management in place, the Example 1+2 combination is more appropriate than the example below.
```yaml
# applicationset.yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: my-app-fleet
spec:
  generators:
    - clusters:
        selector:
          matchExpressions:
            - key: env
              operator: In
              values: [canary, staging, production]
  strategy:
    type: RollingSync
    rollingSync:
      steps:
        - matchExpressions:
            - key: env
              operator: In
              values: [canary]
          maxUpdate: 1        # Sync canary clusters one at a time sequentially
        - matchExpressions:
            - key: env
              operator: In
              values: [staging]
          maxUpdate: "30%"    # Sync staging clusters in batches of 30%
        - matchExpressions:
            - key: env
              operator: In
              values: [production]
          maxUpdate: "10%"    # Wave deploy production clusters in batches of 10%
  template:
    metadata:
      name: "my-app-{{name}}"
      labels:
        env: "{{metadata.labels.env}}"   # RollingSync steps match on Application labels, so copy the cluster's env label
    spec:
      project: default
      source:
        repoURL: https://github.com/my-org/my-app
        path: k8s/
        targetRevision: HEAD
      destination:
        server: "{{server}}"
        namespace: my-app
```

Argo CD ApplicationSet's RollingSync is approaching a stable phase. We recommend checking the current status directly in the official Argo CD documentation.
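In the Argo CD releases where Progressive Syncs is still gated, the ApplicationSet controller ignores the strategy block unless the feature is switched on. One documented way to do that is a key in the argocd-cmd-params-cm ConfigMap, sketched below. The argocd namespace is an assumption based on a default install, and whether the switch is still required depends on your Argo CD version.

```yaml
# argocd-cmd-params-cm.yaml: enable Progressive Syncs (sketch; merge into your existing ConfigMap)
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd           # Assumed install namespace
  labels:
    app.kubernetes.io/part-of: argocd
data:
  applicationsetcontroller.enable.progressive.syncs: "true"
```

The ApplicationSet controller reads this value at startup, so restart it after changing the key.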
Pros and Cons Analysis
Advantages
| Item | Details |
|---|---|
| Dual-layer Blast Radius control | Fleet prevents spread across clusters, Argo Rollouts controls traffic within a cluster — two layers operating independently |
| Fully declarative | All deployment policies live in Git, so there's no drift, and history tracking and rollback are naturally supported |
| Automatic rollback | Traffic reverts automatically without manual intervention when an AnalysisRun fails |
| Progressive confidence building | Validating on a small set of canary clusters before expanding lets you build confidence in a release step by step |
| Visibility | Argo Rollouts Dashboard + Fleet UI gives you a bird's-eye view of rollout status across hundreds of clusters |
Disadvantages and Caveats
| Item | Details | Mitigation |
|---|---|---|
| Increased operational complexity | Burden of managing version compatibility across Fleet, Argo CD, and Argo Rollouts | Our team maintains a compatibility matrix directly on a single Confluence page. Checking this table before any upgrade has become a habit |
| Rollout CRD Ready detection issue | Fleet may interpret the completion state of a Rollout differently from a standard Deployment, causing partition gating to not behave as expected | Add pause: {} at the end of canary steps, or configure a Fleet custom health check |
| Argo Rollouts is cluster-local | Cannot observe global multi-cluster state with Argo Rollouts alone | Build a separate higher-level view in Fleet or Argo CD dashboards |
| Partition completion threshold design | If maxUnavailable thresholds are too strict, a single cluster failure can halt the entire rollout for extended periods | Set thresholds differently based on cluster characteristics (strict for canary, lenient for prod-wave-2) |
| Monitoring stack must be pre-deployed | Prometheus and other dependencies of AnalysisRun must be deployed on all clusters | Managing the monitoring stack itself via GitOps with Fleet ensures consistency |
| Fleet bundle controller performance | Bundle reconciliation storms can occur at thousands-of-clusters scale | Refer to SUSE's published experiment results and tune batch sizes and partition design to fit your environment |
Most Common Mistakes in Practice
- Designing overlapping labels for canary and production clusters so the same cluster is included in two partitions: Fleet partitions with overlapping cluster selections cause unexpected behavior. Using dedicated labels like wave-number: 1 and wave-number: 2 is far safer. I wasted a fair amount of time on this early on when partitions were getting mixed together.
- Declaring only setWeight without an AnalysisTemplate: Traffic gets shifted but without any validation logic, the canary loses its purpose. We recommend setting at least one analysis condition, such as success rate. Even under pressure to "just ship it," this is the one thing you shouldn't skip.
- Setting maxUnavailablePartitions to 0, causing the entire deployment to halt permanently on a single cluster failure: When this value is 0, every partition must be completely healthy simultaneously before progressing. Given the inherent instability of cluster infrastructure, it's more realistic to design for tolerating at least 1–2 cluster failures.
Closing Thoughts
When I talk to teams that have adopted this pattern, the biggest change they report is that deployment cycles got shorter and the tension during hotfixes noticeably decreased. Once you have the confidence that changes are validated first on 3 canary clusters, deployment stops being a dreaded event and becomes routine work. One team that had been deploying to 300 clusters all at once adopted this structure and went from 2 releases per month to 3 per week.
You don't need to migrate all 300 clusters to this pattern from the start. Here are 3 steps you can take right now:
- Label 3–5 existing clusters with env: canary and add a partitions block to fleet.yaml: Partition configuration can be declared without affecting existing deployments, so the barrier is low. Start by creating the ClusterGroup CRD and verifying that the label selector picks up the intended clusters.
- Install the Argo Rollouts controller via Helm on a single cluster, then convert your lowest-traffic service to a Rollout CRD: After adding the chart repository with helm repo add argo https://argoproj.github.io/argo-helm, install with helm install argo-rollouts argo/argo-rollouts -n argo-rollouts --create-namespace, then verify a basic setWeight: 5 → pause → setWeight: 100 step sequence first.
- Write a single AnalysisTemplate based on Prometheus success rate and connect it to the Rollout: We recommend starting with a relaxed threshold like successCondition: result[0] >= 0.95. Writing the metric query yourself gives you an intuitive feel for how AnalysisRun decides to roll back, and as you gradually raise the threshold, you'll reach a point where you realize your monitoring stack isn't sufficient. That's exactly when you're ready to move to the next stage.
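The second step above can be sketched as a Rollout with no traffic routing and no analysis, which is enough to watch the controller walk through its steps. The resource name, labels, image, replica count, and pause duration are placeholders.

```yaml
# first-rollout.yaml: minimal canary for a low-traffic service (sketch)
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-low-traffic-service      # Hypothetical service name
spec:
  replicas: 5
  selector:
    matchLabels:
      app: my-low-traffic-service
  template:
    metadata:
      labels:
        app: my-low-traffic-service
    spec:
      containers:
        - name: app
          image: my-low-traffic-service:v1.0.1   # Placeholder image tag
  strategy:
    canary:
      steps:
        - setWeight: 5
        - pause: {duration: 10m}    # Assumed observation window
        - setWeight: 100
```

Without a trafficRouting block, Argo Rollouts approximates the weight by adjusting replica counts, so with 5 replicas the 5% step amounts to roughly one canary pod rather than an exact 5% of requests. Exact percentages need a traffic router such as the NGINX setup in Example 2.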
References
- Fleet Rollout Strategy Official Documentation | Rancher Fleet
- Fleet fleet.yaml Reference | Rancher Fleet
- Fleet Custom Resources Spec (ClusterGroup CRD) | Rancher Fleet
- Canary Releases with Rancher Continuous Delivery | SUSE Blog
- Scaling Kubernetes GitOps with Fleet — Experiment Results | SUSE Blog
- Argo Rollouts Official Documentation — Canary Strategy
- Argo Rollouts Best Practices
- Argo CD Progressive Syncs Official Documentation
- ArgoCD Sync Waves Official Documentation
- Karmada + Argo Rollouts: From Canary To Global | ArgoCon Europe 2026
- Building a Fleet with ArgoCD and GKE | Google Cloud Blog
- ACK One + Argo Rollouts Canary Release | Alibaba Cloud
- Canary delivery with Argo Rollouts and Amazon VPC Lattice | AWS Blog
- Progressive Delivery Pipeline with GitHub Actions and Argo Rollouts
- Multi-Cluster GitOps: Fleet Provisioning and Bootstrapping | AWS Blog