Canary Deployment with Istio + Argo Rollouts: From Pod-Level Metric Isolation to Header-Based Test Routing
When I first introduced canary deployments, I made a mistake that I suspect is common. I thought splitting replica ratios with a basic Kubernetes Deployment and calling it a "10% canary" was good enough. But with 3 pods, 1 canary pod means 33% of traffic, not 10%. If the problem had stopped there it would have been fine, but the metrics from canary and stable pods were also mixed together, so success rates were measured incorrectly. The AnalysisTemplate rollback condition fired on polluted aggregates, and a perfectly healthy deployment ended in a meaningless rollback.
Combining Istio's VirtualService weights with Argo Rollouts gives you precise traffic ratio control regardless of pod count, while automatically isolating metrics to only the requests that actually reached the canary pods. Add setHeaderRoute on top of that, and you can have regular users continue using the stable version while the QA team validates the canary ahead of time with a single header.
In this post, I'll walk through three things — precise traffic distribution, pod-level metric isolation, and header-based test routing — using real YAML and PromQL. This is aimed at those running Kubernetes with Prometheus for metrics collection, who have adopted or are considering adopting Istio or Argo Rollouts. I'll assume familiarity with basic Kubernetes concepts (Deployment, Service, ReplicaSet).
Core Concepts
Why VirtualService Weights Instead of Replica Ratios
The default Kubernetes approach determines traffic by the ratio of pod counts. If you want exactly 5% going to canary, the math says you need at least 20 pods: 1 canary pod and 19 stable pods. Keeping that many replicas running just to hit a ratio is both costly and difficult to maintain in practice.
Istio's VirtualService solves this problem by changing the layer. An Envoy sidecar is injected into each pod, and the weight value in the VirtualService directly controls Envoy's routing decisions. Regardless of how many pods there are, the traffic ratio is exactly the number written in the YAML.
VirtualService: An Istio CRD that defines routing rules for HTTP/gRPC traffic. It declares which requests go where and in what proportion, and Argo Rollouts dynamically modifies the weight values in this resource as canary steps progress.
Overall Architecture Flow
Looking at the full picture before reading the rest makes each component's role much clearer.

                       ┌────────────────────────────────────┐
    Normal request ───▶│       Istio Ingress Gateway        │
    X-Canary: true ───▶│           (Envoy Proxy)            │
                       └─────────────┬──────────────────────┘
                                     │ VirtualService
                          ┌──────────┴──────────┐
              weight      │ 90%                 │ 10% (or header match)
                          ▼                     ▼
                 ┌─────────────────┐   ┌─────────────────┐
                 │ Stable Service  │   │ Canary Service  │
                 └────────┬────────┘   └────────┬────────┘
                          │                     │
                          ▼                     ▼
                 ┌─────────────────┐   ┌─────────────────┐
                 │   Stable Pods   │   │   Canary Pods   │
                 │ [Envoy Sidecar] │   │ [Envoy Sidecar] │
                 └────────┬────────┘   └────────┬────────┘
                          │ metrics             │ metrics
                          └──────────┬──────────┘
                                     ▼
                               ┌────────────┐
                               │ Prometheus │◀── AnalysisTemplate query
                               └────────────┘
                                     ▲
                       ┌─────────────┴───────────────┐
                       │  Argo Rollouts Controller   │
                       │ (auto promotion / rollback) │
                       └─────────────────────────────┘

Here's a summary of each component's role:
| Component | Role |
|---|---|
| Rollout CRD | Defines canary steps, analysis linkage, and traffic weights |
| VirtualService | Dynamically modifies HTTP routing weights (stable/canary) |
| DestinationRule | Distinguishes stable/canary subsets by pod labels |
| AnalysisTemplate | Automatically decides promotion/rollback via Prometheus queries |
| setHeaderRoute | Routes only requests with a specific header to the canary pods |
AnalysisTemplate: A CRD in Argo Rollouts that automatically evaluates canary pod performance. It supports various metric providers including Prometheus, Datadog, and New Relic, and automatically rolls back if conditions are not met.
How Pod-Level Metric Isolation Works
The Istio sidecar exposes Prometheus metrics for every request received by each pod. The key is that these metrics carry a destination_workload label.
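
    # Filter for only requests that reached the canary workload
    istio_requests_total{destination_workload="my-service-canary"}

Here my-service-canary stands in for the workload name that Istio reports for the canary pods. Argo Rollouts names its ReplicaSets <rollout-name>-<hash>, so check the actual destination_workload value in Prometheus for your cluster and use that in the query. Thanks to this label, you can query only "requests actually delivered to canary pods" rather than metrics for the entire service.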
| Istio Label | Meaning |
|---|---|
| destination_workload | Name of the workload that received the request (key for distinguishing canary/stable) |
| source_workload | Name of the workload that sent the request |
| reporter | "source" or "destination"; fix to "destination" to prevent duplicate aggregation |
| response_code | HTTP response code |
Practical Application
Example 1: Preparing stable/canary Services and DestinationRule
Argo Rollouts' Istio integration requires two separate Kubernetes Service objects. stableService and canaryService are the names referenced in the Rollout manifest, and these Services must be created manually in advance — this is an easy point to miss at first. Without them, the Rollout controller throws errors and the deployment won't proceed.

    # Prerequisite: Create separate Services for stable and canary
    apiVersion: v1
    kind: Service
    metadata:
      name: my-service-stable
    spec:
      selector:
        app: my-service
      ports:
        - port: 80
          targetPort: 8080
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: my-service-canary
    spec:
      selector:
        app: my-service
      ports:
        - port: 80
          targetPort: 8080

Initially, both Services can have identical selectors. When a rollout starts, the Argo Rollouts controller automatically adds the rollouts-pod-template-hash label to each Service's selector, separating stable and canary pods into their respective Services.
Next, define stable and canary subsets with a DestinationRule.

    # DestinationRule: define stable/canary subsets
    # The rollouts-pod-template-hash values are managed automatically by Argo Rollouts,
    # so you don't need to manually fill in or edit the hash values.
    # This only happens when the Rollout references this DestinationRule
    # under trafficRouting.istio.destinationRule.
    apiVersion: networking.istio.io/v1beta1
    kind: DestinationRule
    metadata:
      name: my-service-dr
    spec:
      host: my-service   # Assumes a Service named my-service that selects all pods of the app
      subsets:
        - name: stable
          labels:
            rollouts-pod-template-hash: stable   # Rollouts auto-patches with actual hash
        - name: canary
          labels:
            rollouts-pod-template-hash: canary   # Rollouts auto-patches with actual hash

The VirtualService then routes to those two subsets by weight.

    # VirtualService: Argo Rollouts auto-modifies weight values at each deployment step
    apiVersion: networking.istio.io/v1beta1
    kind: VirtualService
    metadata:
      name: my-service-vs
    spec:
      hosts:
        - my-service   # spec.hosts is a required field
      http:
        - name: primary
          route:
            - destination:
                host: my-service
                subset: stable
              weight: 100
            - destination:
                host: my-service
                subset: canary
              weight: 0

You can start with stable at weight 100 and canary at weight 0. Argo Rollouts will update the weights automatically according to the setWeight steps, so you rarely need to touch the VirtualService directly. Honestly, I worried at first whether I'd need to manage this manually, but the controller handles everything, which turned out to be quite convenient.
Example 2: Automated Promotion/Rollback with Rollout + AnalysisTemplate
Define canary steps and analysis in the Rollout resource. The following is an excerpt of the core strategy section (a full spec must also include selector and template fields).

    # Rollout: canary steps, traffic routing, analysis linkage (strategy excerpt)
    apiVersion: argoproj.io/v1alpha1
    kind: Rollout
    metadata:
      name: my-service
    spec:
      # Basic fields like selector, replicas, template follow the same form as a Deployment
      strategy:
        canary:
          stableService: my-service-stable   # Reference the Service created earlier
          canaryService: my-service-canary   # Reference the Service created earlier
          trafficRouting:
            istio:
              virtualService:
                name: my-service-vs
                routes:
                  - primary
              destinationRule:
                name: my-service-dr          # Lets the controller patch the subset hash labels
                canarySubsetName: canary
                stableSubsetName: stable
          steps:
            - setWeight: 10
            - pause: {duration: 5m}
            - analysis:
                templates:
                  - templateName: success-rate
                args:
                  - name: canary-workload
                    value: my-service-canary
            - setWeight: 50
            - pause: {duration: 10m}
            - setWeight: 100

The AnalysisTemplate that measures only the success rate for canary pods looks like this:

    # AnalysisTemplate: measures success rate of only requests that reached canary pods
    apiVersion: argoproj.io/v1alpha1
    kind: AnalysisTemplate
    metadata:
      name: success-rate
    spec:
      args:
        - name: canary-workload
      metrics:
        - name: success-rate
          interval: 1m
          count: 5   # Illustrative value: with interval set and no count, an inline analysis step keeps measuring and never completes
          successCondition: result[0] >= 0.95
          failureLimit: 3   # Analysis fails once failed measurements exceed 3 (cumulative, not consecutive)
          # consecutiveErrorLimit is a separate setting: it caps consecutive measurement errors, such as Prometheus being unreachable
          provider:
            prometheus:
              address: http://prometheus:9090
              query: |
                (
                  sum(rate(istio_requests_total{
                    destination_workload="{{args.canary-workload}}",
                    reporter="destination",
                    response_code!~"5.*"
                  }[2m]))
                  /
                  sum(rate(istio_requests_total{
                    destination_workload="{{args.canary-workload}}",
                    reporter="destination"
                  }[2m]))
                ) or on() vector(1)

Leaving out the reporter="destination" condition can cause duplicate aggregation with source-side metrics. I once wrote queries without this condition and got strange success rates, so I strongly recommend always including it explicitly.
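Quick PromQL function reference:
- rate(metric[2m]) calculates the average per-second rate of increase over the past 2 minutes.
- sum() aggregates the time series that are split across label combinations into a single value.
- or on() vector(1) returns a default value of 1 (100% success) when the left-hand side returns no series at all, for example before the canary has received any traffic. This prevents unnecessary rollbacks during phases with no initial traffic. Note that it does not replace a NaN produced by a 0/0 division, because in that case the left-hand side still returns a series.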
| Query Element | Description |
|---|---|
| destination_workload="{{args.canary-workload}}" | Filters to only requests that reached canary pods |
| reporter="destination" | Fixed to receiver-side aggregation to prevent duplication |
| response_code!~"5.*" | Excludes 5xx to calculate successful request ratio |
| or on() vector(1) | Returns a default value when the query yields no series (no traffic) |
| failureLimit: 3 | Tolerates up to 3 cumulative failed measurements; the next failure fails the analysis and triggers automatic rollback |
Example 3: Header-Based Canary Testing for QA Team Only with setHeaderRoute
This is personally the pattern I find most practical. It lets the QA team validate canary pods first without affecting the live service.

    # Rollout steps: set header route first, then start weighted traffic after validation
    steps:
      - setHeaderRoute:
          name: canary-header-route
          match:
            - headerName: X-Canary
              headerValue:
                exact: "true"
      - pause: {}                # QA tests canary with header, waits for manual approval
      - setWeight: 10            # With header route still active, also send 10% of regular traffic to canary
      - pause: {duration: 10m}
      - analysis:
          templates:
            - templateName: success-rate
          args:
            - name: canary-workload
              value: my-service-canary
      - setWeight: 50
      - pause: {duration: 10m}
      - setHeaderRoute:
          name: canary-header-route   # Same name without match: removes that header route
      - setWeight: 100

Using the same name without match in the second-to-last step may feel unintuitive at first. Argo Rollouts interprets this as "remove the route with that name from the managedRoutes list." The result is that the header-condition route is removed from the VirtualService.
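One detail the steps above depend on: as far as I know, Argo Rollouts only creates header routes whose names are declared in advance under trafficRouting.managedRoutes in the Rollout. The excerpt below is a minimal sketch of that declaration, reusing the canary-header-route and my-service-vs names from this post. Treat it as something to verify against the Argo Rollouts version you run rather than a complete spec.

```yaml
# Rollout excerpt (sketch): declare the header route name before using it in setHeaderRoute
spec:
  strategy:
    canary:
      trafficRouting:
        managedRoutes:
          - name: canary-header-route   # must match setHeaderRoute.name in the steps
        istio:
          virtualService:
            name: my-service-vs
            routes:
              - primary
```

Managed routes are expected to be placed ahead of the weighted primary route in the VirtualService, which is why a request carrying the header is matched before the weight split applies.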

    # QA team: send request directly to canary pod with X-Canary header
    curl -H "X-Canary: true" https://my-service.example.com/api/health
    # Regular users: no header, routed to stable pod
    curl https://my-service.example.com/api/health

Once the QA team finishes E2E testing while the rollout is paused at the pause: {} step, you can advance to the next step with kubectl argo rollouts promote my-service. If you're using ArgoCD, clicking the promotion button in the UI works just as well.
Example 4: Querying Canary Pod P99 Latency and Upstream Error Rate
Monitoring latency and the impact on services the canary depends on, in addition to success rate, is much safer. As I continued operations, it was only after adding these queries that I started catching problems that success rate alone wasn't picking up.

    # P99 response time for canary pods (based on histogram buckets)
    histogram_quantile(0.99,
      sum(rate(istio_request_duration_milliseconds_bucket{
        destination_workload="my-service-canary",
        reporter="destination"
      }[5m])) by (le)
    )

    # Rate of 5xx errors in requests sent by canary pods (measures canary → dependency impact)
    (
      sum(rate(istio_requests_total{
        source_workload="my-service-canary",
        reporter="source",
        response_code=~"5.*"
      }[2m]))
      /
      sum(rate(istio_requests_total{
        source_workload="my-service-canary",
        reporter="source"
      }[2m]))
    ) or on() vector(0)

The second query uses source_workload as the basis, with reporter fixed to "source" so that each outbound request is counted once. It catches errors that occur when the canary pod calls other services, not errors received by the canary pod itself. If a new version calls a DB or external API differently, you can miss the problem without this query.
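To let the controller act on latency as well, the P99 query can be wrapped in its own AnalysisTemplate. This is a sketch rather than a tuned configuration: the template name p99-latency, the 500 ms threshold, and count: 5 are illustrative assumptions, not values taken from a real service.

```yaml
# AnalysisTemplate sketch: fail the canary when P99 latency exceeds a threshold
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: p99-latency                        # hypothetical name
spec:
  args:
    - name: canary-workload
  metrics:
    - name: p99-latency
      interval: 1m
      count: 5                             # illustrative; bounds the number of measurements
      successCondition: result[0] <= 500   # hypothetical threshold in milliseconds
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            histogram_quantile(0.99,
              sum(rate(istio_request_duration_milliseconds_bucket{
                destination_workload="{{args.canary-workload}}",
                reporter="destination"
              }[5m])) by (le)
            )
```

Listing both success-rate and p99-latency under templates in the analysis step runs them side by side. With no canary traffic this query has nothing meaningful to return, so it fits best after the first setWeight step has sent real requests to the canary.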
Pros and Cons
Advantages
| Item | Details |
|---|---|
| Replica-count-independent traffic control | Accurate 5%, 10% traffic distribution even with just 1 canary pod |
| Pod-level metric precision | Independent measurement of error rate and latency for only requests that reached canary pods, via destination_workload label |
| Zero-downtime QA testing | Validate canary with setHeaderRoute without impacting live service, then gradually transition |
| Automatic rollback | Argo Rollouts automatically reverts to stable when AnalysisTemplate failure conditions are met |
| GitOps-friendly | All configuration is declared as Kubernetes manifests, integrating naturally with ArgoCD |
Disadvantages and Caveats
| Item | Details | Mitigation |
|---|---|---|
| Resource complexity | Managing multiple resources simultaneously: VirtualService, DestinationRule, Rollout, AnalysisTemplate, etc. | Templatize with Helm chart or Kustomize for consistency |
| Metric cardinality | Increased Prometheus storage and query costs from growing Istio label combinations | Limit metric scope with Sidecar resource, configure unnecessary label excludes |
| ArgoCD diff conflicts | ArgoCD detects drift because Argo Rollouts dynamically modifies VirtualService | Must add VirtualService spec.http path to ignoreDifferences |
| Subset hash dependency | rollouts-pod-template-hash in DestinationRule changes with each rollout | Managed automatically by Argo Rollouts; do not modify manually, delegate to the controller |
| DB migration coupling | Canary/stable pods share the same DB simultaneously when schema changes are involved | Separate schema changes and code deployment into distinct phases (expand/contract pattern) |
| mTLS scraping issues | Prometheus scraping may be blocked in PeerAuthentication STRICT mode | Mount Istio certificates to Prometheus, or scrape port 15090 (Envoy stats) directly |
The DB migration issue was quite shocking when I first encountered it. I only realized the need for the expand/contract pattern after experiencing firsthand the moment canary pods started writing a new column and stable pods began throwing errors about unknown fields. The mTLS scraping issue was similar — I changed PeerAuthentication to STRICT and Prometheus collection silently stopped.
ignoreDifferences: An ArgoCD option that prevents specific field changes from being detected as drift. Because Argo Rollouts automatically modifies the weight values in VirtualService, this prevents ArgoCD from continuously attempting to sync after detecting it as an "unwanted change."
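As a rough illustration of that option, an ArgoCD Application could ignore the routes Argo Rollouts rewrites as follows. The Application name is hypothetical and the project, source, and destination fields are left out, so this is an excerpt to adapt, not a manifest to apply as-is.

```yaml
# ArgoCD Application excerpt (sketch): ignore the routes that Argo Rollouts rewrites
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-service              # hypothetical Application name
spec:
  # project, source, and destination omitted for brevity
  ignoreDifferences:
    - group: networking.istio.io
      kind: VirtualService
      jsonPointers:
        - /spec/http            # weights and header routes both live under this path
```

A narrower pointer such as /spec/http/0 also works when only the weights of a single route change. The broader path is used here on the assumption that header routes get inserted into the same list.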
Most Common Mistakes in Practice
- Writing PromQL without the reporter label: Querying without reporter="destination" results in duplicate aggregation of source and destination metrics, which can distort the measured success rate. It's recommended to always include this condition in AnalysisTemplate queries.
- Deploying a Rollout without creating the canaryService and stableService Service objects: If these two fields are declared in the Rollout manifest, Kubernetes Services with those exact names must actually exist. Without them, the Rollout controller throws errors and deployment won't proceed. You need to apply the Services introduced in Example 1 ahead of time.
- Omitting the setHeaderRoute removal step: If you don't remove the header route at the final stage of the rollout, requests with the X-Canary: true header will continue to match that leftover route, whose target has by then become the stable version. You need to explicitly remove it by placing a setHeaderRoute without match at the end of the steps.
Closing Thoughts
If you've followed along this far, you now have the foundation to set up precise canary deployments that aren't tied to pod count. VirtualService determines traffic ratios, AnalysisTemplate makes automated decisions using canary-pod-specific metrics, and setHeaderRoute opens a safe validation window for the QA team in between. While it may look complex at first with so many components, each role is clearly separated — and once you're familiar with it, the entire deployment process becomes much more transparent.
Here are 3 steps you can start with right now:
- Try the full flow locally (for those comfortable with Kubernetes basics and able to spin up a local cluster easily): Install Istio and Argo Rollouts on kind or minikube and apply the example manifests in order. You can see the controller automatically modifying the VirtualService in real time. Use kubectl get vs my-service-vs -o yaml -w to watch the changes live.
- Connect an AnalysisTemplate to an existing service first (for those already running Istio and Prometheus): Before switching to a canary strategy, it helps to first verify the istio_requests_total query for your current service in Prometheus. Confirm how the destination_workload label appears, then connect the AnalysisTemplate in dry-run mode (the dryRun field, which takes a list of metric names) to validate the analysis logic without triggering actual rollbacks.
- Add ArgoCD ignoreDifferences configuration in advance (for those running GitOps with ArgoCD): Before applying to production, add the VirtualService spec.http path to ignoreDifferences in your ArgoCD Application. This prevents the confusion of ArgoCD repeatedly syncing on your first deployment.
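For the dry-run suggestion in the second step, a minimal sketch of the analysis step might look like the excerpt below. It reuses the success-rate template and the canary-workload argument from Example 2; the surrounding steps are illustrative.

```yaml
# Rollout steps excerpt (sketch): run the analysis in dry-run mode
steps:
  - setWeight: 10
  - pause: {duration: 5m}
  - analysis:
      templates:
        - templateName: success-rate
      args:
        - name: canary-workload
          value: my-service-canary
      dryRun:
        - metricName: success-rate   # measurements are recorded, but a failure does not abort the rollout
```

Once the recorded results look sensible over a few deployments, removing the dryRun entry turns the same analysis into an enforcing gate.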