Implementing Multi-Cluster Canary Deployments with ArgoCD ApplicationSet rollingSync + Argo Rollouts
A dual-gating strategy for safe, progressive rollouts across dozens of clusters
Target audience: Those with some ArgoCD experience. Basic Kubernetes concepts (Deployment, Service, Ingress) are assumed.
Honestly, when I first took on multi-cluster deployments, my biggest fear was "what if everything blows up at once?" Whether it was 5 clusters or 50, there was a real dread that if something went wrong the moment you pushed a new version to all of them simultaneously, everything would be affected. The combination I found at the end of that struggle was ArgoCD ApplicationSet's rollingSync and Argo Rollouts' canary steps.
These two tools implement the principle of "progressively" at different layers. If ApplicationSet controls "which cluster to deploy to first," Argo Rollouts controls "how much traffic to shift within that cluster." This dual-gating structure is the key. By the end of this article, you'll be able to build the full architecture yourself: auto-generating Applications for dozens of clusters with a Matrix Generator, controlling deployment order with rollingSync, and having Argo Rollouts automatically evaluate canaries based on Prometheus metrics within each cluster.
Core Concepts
What Each Tool Is Responsible For
It's important to clarify the separation of responsibilities upfront. I used to have a fuzzy understanding of this boundary myself, which led me to duplicate configuration all over the place.
| Layer | Tool | Role |
|---|---|---|
| Between clusters (Cluster-level) | ApplicationSet rollingSync | Staged rollout in order: cluster A → B → C |
| Within a cluster (Pod-level) | Argo Rollouts Rollout | Stepped canary traffic weight control within each cluster |
ArgoCD ApplicationSet is an extension resource that auto-generates dozens to hundreds of ArgoCD Applications from a single template. Through various generators — cluster lists, Git directories, matrix combinations — it enables declarative automation of multi-cluster and multi-tenant environments.
Argo Rollouts is a Kubernetes CRD-based controller that replaces the standard Deployment resource with a Rollout resource, providing advanced deployment strategies like Canary and Blue-Green. It supports traffic weight steps and metric-based automatic promotion/rollback.
What is Progressive Sync? With ApplicationSet's `rollingSync` strategy, the generated Applications are not synchronized all at once; they are synchronized step by step in label-matched groups. All Applications in the previous step must reach `Healthy` status before the next step begins.
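One prerequisite is worth spelling out. In the Argo CD versions I'm aware of, Progressive Syncs is off by default and has to be enabled on the ApplicationSet controller. A minimal sketch, assuming a standard install in the `argocd` namespace; the feature's status and switches can differ between releases, so check the Progressive Syncs docs for your version:

```yaml
# Assumption: Argo CD is installed in the argocd namespace and reads
# its controller parameters from the argocd-cmd-params-cm ConfigMap.
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd
data:
  # Turns on the RollingSync strategy for the ApplicationSet controller
  applicationsetcontroller.enable.progressive.syncs: "true"
```

The ApplicationSet controller usually needs a restart to pick up the change.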
The Structure of ApplicationSet rollingSync
The core of rollingSync is dividing steps based on labels attached to Applications using matchExpressions.
```yaml
# ApplicationSet rollingSync core structure
spec:
  strategy:
    type: RollingSync
    rollingSync:
      steps:
        - matchExpressions:
            - key: env
              operator: In
              values: ["canary"]      # Step 1: canary clusters
          maxUpdate: 1
        - matchExpressions:
            - key: env
              operator: In
              values: ["staging"]     # Step 2: staging clusters
          maxUpdate: 25%
        - matchExpressions:
            - key: env
              operator: In
              values: ["production"]  # Step 3: all production
```

`operator: In` means that an Application is assigned to a step when its label value appears in the `values` list. It can be confusing at first glance, so read the first step as "if the label `env` has the value `canary`, assign the Application to this step."
maxUpdate limits the maximum number (or percentage) of Applications that can be updated simultaneously in that step. If there are 20 staging clusters, setting 25% means only 5 are updated at a time.
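The forms `maxUpdate` can take are easier to compare side by side. The fragment below is a sketch based on my reading of the Progressive Syncs documentation: percentages round down but never below one Application, `0` means nothing in the step is synced automatically, and omitting the field means no limit. The `qa` value is hypothetical and only there to show the manual gate.

```yaml
# Fragment of spec.strategy.rollingSync.steps (illustrative values)
steps:
  - matchExpressions:
      - key: env
        operator: In
        values: ["staging"]
    maxUpdate: 25%   # 20 matched Applications -> 20 * 0.25 = 5 at a time
  - matchExpressions:
      - key: env
        operator: In
        values: ["qa"]           # hypothetical env value
    maxUpdate: 0     # nothing syncs automatically; the step waits for manual syncs
  - matchExpressions:
      - key: env
        operator: In
        values: ["production"]
    # maxUpdate omitted: all matched Applications may update together (100%)
```

With a small group the rounding matters: 25% of 3 Applications is 0.75, which rounds down to 0 and is then raised to 1, so the step still makes progress.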
Argo Rollouts Canary Steps and Metric Analysis
Inside each cluster, Argo Rollouts moves traffic progressively. By inserting an AnalysisTemplate, it queries metric providers like Prometheus to evaluate success rate, error rate, and latency — and automatically rolls back if thresholds are exceeded.
```yaml
# Rollout canary steps example
# (selector, pod template, and canaryService/stableService omitted for brevity)
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-service
spec:
  strategy:
    canary:
      trafficRouting:
        istio:
          virtualService:
            name: my-service-vsvc
      steps:
        - setWeight: 10             # 10% traffic → canary
        - analysis:
            templates:
              - templateName: success-rate
            args:
              - name: service-name
                value: my-service   # Injected into {{args.service-name}} in AnalysisTemplate
        - pause: {duration: 10m}
        - setWeight: 30
        - pause: {duration: 10m}
        - setWeight: 100
```

Note: The example above assumes Istio is already configured in the cluster. If you're using NGINX Ingress or the Kubernetes Gateway API, the `trafficRouting` configuration will differ. See the Argo Rollouts traffic management official docs for details.
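For comparison, here is roughly what the same Rollout looks like with NGINX Ingress instead of Istio. This is a sketch, not a drop-in manifest: the Service and Ingress names are hypothetical, and the selector and pod template are again left out.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-service
spec:
  strategy:
    canary:
      canaryService: my-service-canary   # hypothetical Service for canary pods
      stableService: my-service-stable   # hypothetical Service for stable pods
      trafficRouting:
        nginx:
          # hypothetical Ingress that routes to the stable Service
          stableIngress: my-service-ingress
      steps:
        - setWeight: 10
        - pause: {duration: 10m}
        - setWeight: 100
```

The canary steps stay the same; only the `trafficRouting` block and the two Service references change.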
Practical Application
The three examples are interconnected layers. Example 1 is the foundation that uses ApplicationSet to auto-generate multi-cluster Applications. Example 2 is the in-cluster quality evaluation layer built on top of it. Example 3 is an extended configuration for centralizing metrics when the number of clusters grows large.
The overall file structure looks like this:
```
my-gitops-repo/
├── applicationsets/
│   └── my-service-appset.yaml       # Example 1: ApplicationSet (includes rollingSync)
└── k8s/
    ├── base/
    │   └── analysis-template.yaml   # Example 2: AnalysisTemplate definition
    └── overlays/
        ├── canary/
        │   └── rollout.yaml         # Example 2: Rollout (references AnalysisTemplate)
        └── production/
            └── rollout.yaml
```

Example 1: Auto-Generating Cluster × Environment Combinations with Matrix Generator
The ApplicationSet Matrix Generator multiplies a list of clusters by a list of environments to automatically create Applications for every combination. Compared to the days of writing Application YAMLs one by one by hand, this is a huge improvement. The key point here is that the template.labels.env value becomes the basis for matching rollingSync steps later on.
```yaml
# applicationsets/my-service-appset.yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: my-service-appset
  namespace: argocd
spec:
  generators:
    - matrix:
        generators:
          - clusters:
              selector:
                matchLabels:
                  region: ap-northeast-2  # Filter to target only clusters in a specific region
          - list:
              elements:
                - env: canary
                - env: production
  strategy:
    type: RollingSync
    rollingSync:
      steps:
        - matchExpressions:
            - key: env
              operator: In
              values: ["canary"]
          maxUpdate: 1                    # One canary cluster at a time
        - matchExpressions:
            - key: env
              operator: In
              values: ["production"]      # Then proceed to all production
  template:
    metadata:
      name: "{{name}}-{{env}}"
      labels:
        env: "{{env}}"                    # The key label that matches rollingSync matchExpressions
        cluster: "{{name}}"
    spec:
      project: default
      source:
        repoURL: https://github.com/my-org/my-service
        targetRevision: HEAD
        path: "k8s/overlays/{{env}}"
      destination:
        server: "{{server}}"
        namespace: my-service
      syncPolicy:
        syncOptions:
          - CreateNamespace=true
```

| Field | Description |
|---|---|
| `matrix.generators` | Multiplies the cluster list and environment list to generate all combinations |
| `clusters.selector` | Filters to target only clusters with specific labels |
| `rollingSync.steps` | Deploys canary environment first, then proceeds to production |
| `template.labels.env` | Value injected from the Generator; matched against rollingSync matchExpressions |
If `template.labels.env` is missing, the generated Applications match no step. Argo CD excludes such Applications from the rolling sync, so they have to be synced manually through the CLI or UI and the staged ordering is lost. This label connection is the glue that holds the entire structure together.
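The Matrix Generator above gives every matched cluster both a canary and a production Application. If each cluster should instead belong to exactly one environment, the label can be read from the cluster itself. The variant below is a sketch under one assumption that is not in the original setup: every Argo CD cluster Secret carries an `env` label. The ApplicationSet name is hypothetical.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: my-service-by-cluster-env        # hypothetical name
  namespace: argocd
spec:
  generators:
    - clusters:
        selector:
          matchExpressions:
            - key: env                   # assumption: cluster Secrets carry an env label
              operator: In
              values: ["canary", "production"]
  strategy:
    type: RollingSync
    rollingSync:
      steps:
        - matchExpressions:
            - key: env
              operator: In
              values: ["canary"]
          maxUpdate: 1
        - matchExpressions:
            - key: env
              operator: In
              values: ["production"]
  template:
    metadata:
      name: "{{name}}-my-service"
      labels:
        env: "{{metadata.labels.env}}"   # copied from the cluster Secret's label
    spec:
      project: default
      source:
        repoURL: https://github.com/my-org/my-service
        targetRevision: HEAD
        path: "k8s/overlays/{{metadata.labels.env}}"
      destination:
        server: "{{server}}"
        namespace: my-service
      syncPolicy:
        syncOptions:
          - CreateNamespace=true
```

Because the selector only admits clusters whose `env` label is one of the step values, every generated Application lands in a step.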
Example 2: Automated Quality Evaluation with AnalysisTemplate
This configuration sends canary traffic, then queries Prometheus for the success rate and automatically rolls back if it falls below the threshold. Without this part, you effectively have a half-baked canary deployment. The AnalysisTemplate and Rollout are connected via args — if this connection is missing, the analysis won't run at all.
First, define the AnalysisTemplate:
```yaml
# k8s/base/analysis-template.yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: service-name                   # Declare arg to receive a value from the Rollout
  metrics:
    - name: success-rate
      interval: 1m
      # Assumed value: bounds the run. An interval without a count keeps
      # measuring indefinitely, so an inline analysis step would never finish.
      count: 5
      successCondition: result[0] >= 0.95  # Success rate must be >= 95%
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc.cluster.local:9090
          query: |
            sum(rate(http_requests_total{
              job="{{args.service-name}}",
              status!~"5.."
            }[2m]))
            /
            sum(rate(http_requests_total{
              job="{{args.service-name}}"
            }[2m]))
```

Then, when referencing this template from a Rollout, explicitly pass the service name via args:
```yaml
# k8s/overlays/canary/rollout.yaml (analysis step section)
steps:
  - setWeight: 10
  - analysis:
      templates:
        - templateName: success-rate
      args:
        - name: service-name
          value: my-service   # Injected into {{args.service-name}} in AnalysisTemplate
  - pause: {duration: 10m}
  - setWeight: 30
  - pause: {duration: 10m}
  - setWeight: 100
```

I spent a long time puzzled about why my analysis wasn't doing its job at first. If you omit `args`, `{{args.service-name}}` in the Prometheus query has no value to resolve to, so the analysis either errors out or evaluates the wrong series, and the symptom is easy to miss unless you inspect the AnalysisRun. Make sure the argument is both declared in the AnalysisTemplate and passed from the Rollout.
Example 3: Centralizing Multi-Cluster Analysis with Federated Prometheus
When the number of clusters grows large, checking metrics separately per cluster becomes tedious. This section is an extended configuration for those familiar with the Prometheus federation architecture.
What is ClusterAnalysisTemplate? While a regular `AnalysisTemplate` is scoped to a specific namespace, a `ClusterAnalysisTemplate` is cluster-scoped: it is defined once per cluster and can be referenced from Rollouts in any namespace of that cluster. It is still a per-cluster resource, so it has to exist in every workload cluster where a Rollout references it. Distributing one shared definition (for example via an ApplicationSet) keeps the evaluation logic centralized.
By placing a federated Prometheus on the management cluster to centrally collect metrics from all workload clusters, that shared ClusterAnalysisTemplate definition can evaluate every cluster against a single Prometheus endpoint.
```yaml
# scrape_configs entry in the management cluster's Prometheus ConfigMap
# If installing via Helm, add to server.extraScrapeConfigs in values.yaml
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true        # Keep labels exposed by the source Prometheus servers
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="my-service"}'
        - '{__name__=~"http_requests_total|http_request_duration_seconds.*"}'
    static_configs:
      - targets:
          - 'prometheus.cluster-a.internal:9090'
          - 'prometheus.cluster-b.internal:9090'
          - 'prometheus.cluster-c.internal:9090'
```

`honor_labels: true` keeps the labels exposed by the workload clusters (job, instance, and any external labels) from being overwritten by the federating server. To tell which cluster a metric originated from in the federated data, each workload Prometheus also needs a distinguishing external label, such as `cluster`.
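Telling clusters apart in the federated data relies on each workload Prometheus attaching its own identity to the series it exposes. A minimal sketch of that side of the setup, assuming a label named `cluster` (both the label name and its value are illustrative):

```yaml
# prometheus.yml on workload cluster A (repeat with a different value per cluster)
global:
  external_labels:
    cluster: cluster-a   # hypothetical label; attached to series served on /federate
```

With `honor_labels: true` on the management side, this label survives the federation scrape and can be used in queries such as `http_requests_total{cluster="cluster-a"}`.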
With this setup, if an error fires in any cluster during a canary deployment, it can be detected via a single Prometheus query on the management cluster, halting that Application's rollout.
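To make that concrete, here is a sketch of a ClusterAnalysisTemplate that queries the federated Prometheus. The template name, the Prometheus address, the `cluster` label, and the `count` value are all assumptions for illustration and should be adapted to your environment.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ClusterAnalysisTemplate
metadata:
  name: federated-success-rate              # hypothetical name
spec:
  args:
    - name: service-name
    - name: cluster                         # which cluster's series to evaluate
  metrics:
    - name: success-rate
      interval: 1m
      count: 5                              # assumed value; bounds the analysis run
      successCondition: result[0] >= 0.95
      failureLimit: 3
      provider:
        prometheus:
          # hypothetical address of the federated Prometheus
          address: http://prometheus-federated.example.internal:9090
          query: |
            sum(rate(http_requests_total{
              job="{{args.service-name}}",
              cluster="{{args.cluster}}",
              status!~"5.."
            }[2m]))
            /
            sum(rate(http_requests_total{
              job="{{args.service-name}}",
              cluster="{{args.cluster}}"
            }[2m]))
```

A Rollout references it with `clusterScope: true` and passes both arguments:

```yaml
# Rollout canary steps fragment
steps:
  - setWeight: 10
  - analysis:
      templates:
        - templateName: federated-success-rate
          clusterScope: true                # look up a ClusterAnalysisTemplate
      args:
        - name: service-name
          value: my-service
        - name: cluster
          value: cluster-a                  # hypothetical; set per overlay or cluster
```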
Pros and Cons
Advantages
| Item | Details |
|---|---|
| Minimized Blast Radius | Dual gating at cluster level + pod level minimizes the scope of failure propagation |
| Automated Rollback | Auto-rollback when metric thresholds are exceeded; sub-2-second recovery as of Argo Rollouts 1.8 |
| GitOps Single Source of Truth | All deployment state managed declaratively in Git, with easy audit trails |
| Independent Failure Isolation | A deployment failure in one cluster does not affect other clusters |
Blast Radius: Refers to the scope of impact from a failure or deployment issue. Minimizing this scope is one of the core goals of canary deployments.
Disadvantages and Caveats
| Item | Details | Mitigation |
|---|---|---|
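| AutoSync forcibly disabled | Using rollingSync forcibly disables auto-sync for all Applications in the ApplicationSet | Let the ApplicationSet controller trigger syncs step by step; sync Applications that match no step (or sit in a step with `maxUpdate: 0`) manually with `argocd app sync` or the UI |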
| Race condition bug | Cases reported where step ordering between remote clusters is not guaranteed (Issue #22852) | Verify patch status in latest ArgoCD version before applying |
| Rollout stall | If an Application fails to reach Healthy, the entire rollout halts | Set up timeout configuration and a manual intervention procedure in advance |
| Large-scale cluster load | Simultaneously processing Watch streams for hundreds of Applications can become a bottleneck | ArgoCD controller sharding and resource limit tuning |
| Per-cluster independent install | Argo Rollouts must be installed separately on each cluster; no central management | Can automate Rollouts installation itself via ApplicationSet |
| Traffic management tool dependency | Istio, NGINX, or Gateway API required for precise weight control | Choose the integration method that fits your existing infrastructure |
The Most Common Mistakes in Practice
- Expecting AutoSync to work as usual after enabling `rollingSync`: Using the `rollingSync` strategy forcibly disables AutoSync for all Applications created by that ApplicationSet. I was confused for a long time about why deployments weren't happening; this is intentional behavior. The ApplicationSet controller triggers the syncs itself as each step becomes eligible, but an Application that matches no step, or sits in a step with `maxUpdate: 0`, stays out of sync until you sync it explicitly (for example with `argocd app sync` from your CI pipeline).
- Not passing `args` to the AnalysisTemplate: If you omit `analysis.args` in the Rollout, `{{args.service-name}}` in the AnalysisTemplate's Prometheus query has no value to resolve to. The analysis then errors out or evaluates the wrong metrics, and canary auto-rollback does not work as intended.
- Deploying `Rollout` resources to clusters without Argo Rollouts installed: Without the CRD, the `Rollout` resources cannot be applied and the sync fails. You can prevent this by making sure an Application that installs Argo Rollouts itself is deployed to every cluster before the service.
- Expecting canary weight to work without a traffic management tool: Even if Argo Rollouts declares `setWeight: 10`, precise traffic distribution won't happen without a traffic management tool like Istio or NGINX Ingress. At the base Kubernetes service level, traffic is only roughly proportional to the number of pods.
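One way to cover the missing-controller case is a separate ApplicationSet that installs Argo Rollouts on every registered cluster from the argo-helm chart. This is a sketch rather than a tested manifest: the ApplicationSet name is hypothetical, and the chart version is a placeholder you need to pin yourself.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: argo-rollouts-install              # hypothetical name
  namespace: argocd
spec:
  generators:
    - clusters: {}                         # every cluster registered in Argo CD
  template:
    metadata:
      name: "{{name}}-argo-rollouts"
    spec:
      project: default
      source:
        repoURL: https://argoproj.github.io/argo-helm
        chart: argo-rollouts
        targetRevision: "<chart-version>"  # placeholder: pin a real chart version
      destination:
        server: "{{server}}"
        namespace: argo-rollouts
      syncPolicy:
        automated: {}                      # no rollingSync here, so auto-sync stays usable
        syncOptions:
          - CreateNamespace=true
```

Keeping the installation in its own ApplicationSet means the controller and its CRDs are in place before the service ApplicationSet starts rolling out.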
Closing Thoughts
By combining ArgoCD ApplicationSet's rollingSync with Argo Rollouts' canary steps, you can meaningfully reduce the risk of multi-cluster deployments through two layers of progressiveness — at the cluster level and the pod level.
When first building this architecture, it can feel overwhelming to know where to start with four tools (ArgoCD + ApplicationSet + Argo Rollouts + a service mesh) all intertwined. The most practical approach is to build up complexity incrementally.
Three steps you can start right now:
- Install Argo Rollouts on a single cluster first and validate the canary steps. After installing with the commands below, you can find how to convert an existing `Deployment` to a `Rollout` in the official migration guide. Watching `kubectl argo rollouts get rollout my-service --watch` as the steps progress will make the structure click.

  ```bash
  kubectl create namespace argo-rollouts
  kubectl apply -n argo-rollouts -f https://github.com/argoproj/argo-rollouts/releases/latest/download/install.yaml
  ```

- Apply `rollingSync` with the ApplicationSet Matrix Generator on a small environment (e.g., 2 clusters: dev + staging). Start with `maxUpdate: 1` and observe the flow of deployments rolling out one cluster at a time. Experiencing the AutoSync-disabled behavior firsthand at this stage will save you from being caught off guard later.
- Add a Prometheus `AnalysisTemplate` to complete the loop with metric-based automatic evaluation. It's practical to start with a generous `successCondition` (e.g., `>= 0.80`), watch the actual metric flow, and gradually raise the threshold. The most stable path is to start with simple metric thresholds, stabilize the system, and then evolve toward SLO-based automatic promotion.
References
- Argo Rollouts Official Docs | argo-rollouts.readthedocs.io
- ArgoCD ApplicationSet Progressive Syncs Official Docs | argo-cd.readthedocs.io
- Canary Deployment Strategy | Argo Rollouts Official
- Analysis & Progressive Delivery | Argo Rollouts Official
- Skyscanner/applicationset-progressive-sync | GitHub
- How to automate multi-cluster deployments using Argo CD | Red Hat Developer
- Canary delivery with Argo Rollouts and Amazon VPC Lattice for EKS | AWS Blog
- Argo Rollouts Canary Monitoring with Last9 | last9.io
- RollingSync race condition issue | argoproj/argo-cd #22852