DEV BAK - TECH BLOG (© 2026)

Canary Deployment with Istio + Argo Rollouts: From Pod-Level Metric Isolation to Header-Based Test Routing

When I first introduced canary deployments, I made a mistake that I suspect is common. I thought splitting replica ratios with a plain Kubernetes Deployment and calling it a "10% canary" was good enough. But with 3 pods, 1 canary pod receives roughly 33% of the traffic, not 10%. That alone would have been tolerable, but the metrics from canary and stable pods were also mixed together, so the success rate was measured incorrectly. The AnalysisTemplate rollback condition fired on those polluted aggregates, and a perfectly healthy deployment ended in a pointless rollback.

Combining Istio's VirtualService weights with Argo Rollouts gives you precise traffic ratio control regardless of pod count, while automatically isolating metrics to only the requests that actually reached the canary pods. Add setHeaderRoute on top of that, and you can have regular users continue using the stable version while the QA team validates the canary ahead of time with a single header.

In this post, I'll walk through three things — precise traffic distribution, pod-level metric isolation, and header-based test routing — using real YAML and PromQL. This is aimed at those running Kubernetes with Prometheus for metrics collection, who have adopted or are considering adopting Istio or Argo Rollouts. I'll assume familiarity with basic Kubernetes concepts (Deployment, Service, ReplicaSet).


Core Concepts

Why VirtualService Weights Instead of Replica Ratios

The default Kubernetes approach determines traffic by the ratio of pod counts. If you want exactly 5% going to canary, the math says you need at least 20 pods (1 canary and 19 stable), and holding that ratio as you scale means multiples of it, such as 2 canary pods against 38 stable pods. That is both costly and difficult to maintain in practice.

Istio's VirtualService solves this problem by moving the decision to a different layer. An Envoy sidecar is injected into each pod, and the weight values in the VirtualService directly control Envoy's routing decisions. Regardless of how many pods there are, the traffic split follows the numbers written in the YAML. The weighting is applied per request, so the observed ratio converges on those numbers as request volume grows.

VirtualService: An Istio CRD that defines routing rules for HTTP/gRPC traffic. It declares what requests go where and in what proportion, and Argo Rollouts dynamically modifies the weight values in this resource as canary steps progress.

Overall Architecture Flow

Looking at the full picture before reading the rest makes each component's role much clearer.

                     ┌────────────────────────────────────┐
  Normal request ───▶│       Istio Ingress Gateway         │
  X-Canary: true ───▶│          (Envoy Proxy)              │
                     └─────────────┬──────────────────────┘
                                   │ VirtualService
                           ┌───────┴─────────┐
                    weight │ 90%             │ 10% (or header match)
                           ▼                 ▼
              ┌─────────────────┐   ┌─────────────────┐
              │  Stable Service │   │  Canary Service │
              └────────┬────────┘   └────────┬────────┘
                       │                     │
                       ▼                     ▼
              ┌─────────────────┐   ┌─────────────────┐
              │   Stable Pods   │   │   Canary Pods   │
              │ [Envoy Sidecar] │   │ [Envoy Sidecar] │
              └────────┬────────┘   └────────┬────────┘
                       │  metrics             │  metrics
                       └──────────┬──────────┘
                                  ▼
                           ┌────────────┐
                           │ Prometheus │◀── AnalysisTemplate query
                           └────────────┘
                                  ▲
                    ┌─────────────┴───────────────┐
                    │  Argo Rollouts Controller    │
                    │ (auto promotion / rollback)  │
                    └─────────────────────────────┘

Here's a summary of each component's role:

| Component | Role |
| --- | --- |
| Rollout CRD | Defines canary steps, analysis linkage, and traffic weights |
| VirtualService | Holds the HTTP routing weights (stable/canary) that Argo Rollouts modifies dynamically |
| DestinationRule | Distinguishes stable/canary subsets by pod labels |
| AnalysisTemplate | Automatically decides promotion/rollback via Prometheus queries |
| setHeaderRoute | Routes only requests with a specific header to the canary pods |

AnalysisTemplate: A CRD in Argo Rollouts that automatically evaluates canary pod performance. It supports various metric providers including Prometheus, Datadog, and New Relic, and automatically rolls back if conditions are not met.

How Pod-Level Metric Isolation Works

The Istio sidecar exposes Prometheus metrics for every request received by each pod. The key is that these metrics carry a destination_workload label.

promql
# Filter for only requests that reached the canary workload
istio_requests_total{destination_workload="my-service-canary"}

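In these examples, my-service-canary stands for the workload name that Istio reports for the canary pods. Thanks to this label, you can query only "requests actually delivered to canary pods" rather than metrics for the entire service. ReplicaSets created by Argo Rollouts are named <rollout-name>-<pod-template-hash>, so confirm the actual destination_workload values in Prometheus before relying on this filter.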

| Istio Label | Meaning |
| --- | --- |
| destination_workload | Name of the workload that received the request (key for distinguishing canary/stable) |
| source_workload | Name of the workload that sent the request |
| reporter | "source" or "destination"; fix to "destination" to prevent duplicate aggregation |
| response_code | HTTP response code |

Practical Application

Example 1: Preparing stable/canary Services and DestinationRule

Argo Rollouts' Istio integration requires two separate Kubernetes Service objects. stableService and canaryService are the names referenced in the Rollout manifest, and these Services must be created manually in advance — this is an easy point to miss at first. Without them, the Rollout controller throws errors and the deployment won't proceed.

yaml
# Prerequisite: Create separate Services for stable and canary
apiVersion: v1
kind: Service
metadata:
  name: my-service-stable
spec:
  selector:
    app: my-service
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: my-service-canary
spec:
  selector:
    app: my-service
  ports:
    - port: 80
      targetPort: 8080

Initially, both Services can have identical selectors. When a rollout starts, the Argo Rollouts controller automatically adds the rollouts-pod-template-hash label to each Service's selector, separating stable and canary pods into their respective Services.
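To make the selector change concrete, here is a rough sketch of the canary Service while a rollout is in progress. The hash value 6d4b8c7f9 is a made-up placeholder for illustration; the controller writes the real one, and the stable Service receives a different hash.

```yaml
# Sketch: my-service-canary after the controller has patched its selector.
# The hash below is a hypothetical placeholder, not a value you set yourself.
apiVersion: v1
kind: Service
metadata:
  name: my-service-canary
spec:
  selector:
    app: my-service
    rollouts-pod-template-hash: 6d4b8c7f9   # injected by Argo Rollouts
  ports:
    - port: 80
      targetPort: 8080
```

Running kubectl get svc my-service-canary -o yaml during a rollout should show the injected key, which is a quick way to confirm that the controller has taken over the Service.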

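Next, define stable and canary subsets with a DestinationRule. The host my-service used below refers to a Service that selects all of the Rollout's pods; the examples assume it exists alongside the two Services above.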

yaml
# DestinationRule — define stable/canary subsets
# The rollouts-pod-template-hash values are managed automatically by Argo Rollouts,
# so you don't need to manually fill in or edit the hash values
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: my-service-dr
spec:
  host: my-service
  subsets:
    - name: stable
      labels:
        rollouts-pod-template-hash: stable   # Rollouts auto-patches with actual hash
    - name: canary
      labels:
        rollouts-pod-template-hash: canary   # Rollouts auto-patches with actual hash

The VirtualService that Argo Rollouts manages starts with all traffic on the stable subset.

yaml
# VirtualService — Argo Rollouts auto-modifies weight values at each deployment step
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-service-vs
spec:
  hosts:
    - my-service                  # spec.hosts is a required field
  http:
    - name: primary
      route:
        - destination:
            host: my-service
            subset: stable
          weight: 100
        - destination:
            host: my-service
            subset: canary
          weight: 0

You can start with weight: stable 100, canary 0. Argo Rollouts will update it automatically according to setWeight steps, so you rarely need to touch the VirtualService directly. Honestly, I worried at first whether I'd need to manage this manually, but the controller handles everything, which turned out to be quite convenient.
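For orientation, this is roughly what the same VirtualService should look like after the controller has processed a setWeight: 10 step. It is a sketch of the expected state, not something you apply yourself.

```yaml
# Sketch: my-service-vs as rewritten by the controller at setWeight: 10
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-service-vs
spec:
  hosts:
    - my-service
  http:
    - name: primary
      route:
        - destination:
            host: my-service
            subset: stable
          weight: 90    # was 100
        - destination:
            host: my-service
            subset: canary
          weight: 10    # was 0
```

If the weights stay at 100/0 after the step, the route name in the Rollout (primary here) most likely does not match the route name in the VirtualService.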

Example 2: Automated Promotion/Rollback with Rollout + AnalysisTemplate

Define canary steps and analysis in the Rollout resource. The following is an excerpt of the core strategy section (a full spec must also include selector and template fields).

yaml
# Rollout — canary steps, traffic routing, analysis linkage (strategy excerpt)
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-service
spec:
  # Basic fields like selector, replicas, template follow the same form as a Deployment
  strategy:
    canary:
      stableService: my-service-stable    # Reference the Service created earlier
      canaryService: my-service-canary    # Reference the Service created earlier
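      trafficRouting:
        istio:
          virtualService:
            name: my-service-vs
            routes:
              - primary
          destinationRule:              # Required because the VirtualService routes to subsets of a single host
            name: my-service-dr         # The DestinationRule defined earlier
            canarySubsetName: canary
            stableSubsetName: stable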
      steps:
        - setWeight: 10
        - pause: {duration: 5m}
        - analysis:
            templates:
              - templateName: success-rate
            args:
              - name: canary-workload
                value: my-service-canary
        - setWeight: 50
        - pause: {duration: 10m}
        - setWeight: 100

The AnalysisTemplate that measures only the success rate for canary pods looks like this:

yaml
# AnalysisTemplate — measures success rate of only requests that reached canary pods
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: canary-workload
  metrics:
    - name: success-rate
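      interval: 1m
      count: 5           # Assumed value. Without count, an interval-based metric keeps running and an inline analysis step never completes
      successCondition: result[0] >= 0.95
      # Up to 3 failed measurements are tolerated; one more fails the analysis and aborts the rollout.
      # consecutiveErrorLimit is a separate setting that covers measurement errors such as failed queries.
      failureLimit: 3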
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            (
              sum(rate(istio_requests_total{
                destination_workload="{{args.canary-workload}}",
                reporter="destination",
                response_code!~"5.*"
              }[2m]))
              /
              sum(rate(istio_requests_total{
                destination_workload="{{args.canary-workload}}",
                reporter="destination"
              }[2m]))
            ) or on() vector(1)

Leaving out the reporter="destination" condition can cause duplicate aggregation with source-side metrics. I've had experience writing queries without this condition and getting strange success rates, so I strongly recommend always including it explicitly.

Quick PromQL function reference: rate(metric[2m]) calculates the average per-second rate of increase over the past 2 minutes. sum() aggregates time series separated by label combinations. or on() vector(1) supplies a default value of 1 (100% success) when the ratio returns no series at all, which is what happens before the canary has received any traffic, so the analysis does not fail during that initial phase. It does not replace a NaN produced by 0/0 when the series exist but their rate is zero.

| Query Element | Description |
| --- | --- |
| destination_workload="{{args.canary-workload}}" | Filters to only requests that reached canary pods |
| reporter="destination" | Fixed to receiver-side aggregation to prevent duplication |
| response_code!~"5.*" | Excludes 5xx to calculate the successful request ratio |
| or on() vector(1) | Returns a default value when the query yields no series (no traffic yet) |
| failureLimit: 3 | Fails the analysis and triggers automatic rollback once more than 3 measurements fail |
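The PromQL fallback covers an empty result, but not a NaN from a zero-rate denominator. One way to handle both is to guard the condition itself. The sketch below assumes the len and isNaN helpers that Argo Rollouts documents for condition expressions are available in your version; treat that as something to verify rather than a given.

```yaml
# Sketch: metric excerpt with the success condition guarded against empty and NaN results.
# Assumes len() and isNaN() are supported in condition expressions by your Argo Rollouts version.
metrics:
  - name: success-rate
    interval: 1m
    count: 5          # assumed value
    # Passes when there is no usable data yet, otherwise requires at least 95% success
    successCondition: "len(result) == 0 || isNaN(result[0]) || result[0] >= 0.95"
    failureLimit: 3
```

Treating "no data" as success is a deliberate trade-off: it avoids spurious rollbacks at low traffic, but it also means a canary that receives no requests at all will pass, so pair it with a pause long enough to gather traffic.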

Example 3: Header-Based Canary Testing for QA Team Only with setHeaderRoute

This is personally the pattern I find most practical. It lets the QA team validate canary pods first without affecting the live service.

yaml
# Rollout steps — set header route first, then start weighted traffic after validation
steps:
  - setHeaderRoute:
      name: canary-header-route
      match:
        - headerName: X-Canary
          headerValue:
            exact: "true"
  - pause: {}              # QA tests canary with header, waits for manual approval
  - setWeight: 10          # With header route still active, also send 10% of regular traffic to canary
  - pause: {duration: 10m}
  - analysis:
      templates:
        - templateName: success-rate
      args:
        - name: canary-workload
          value: my-service-canary
  - setWeight: 50
  - pause: {duration: 10m}
  - setHeaderRoute:
      name: canary-header-route    # Same name without match → removes that header route
  - setWeight: 100

Using the same name without match in the second-to-last step may feel unintuitive at first. Argo Rollouts interprets this as "remove the route with that name," so the header-condition route is removed from the VirtualService. Note that setHeaderRoute only works for route names listed under trafficRouting.managedRoutes in the Rollout, so canary-header-route must be declared there.
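The steps above only work if the route name is registered as a managed route. A minimal sketch of the matching trafficRouting section, reusing the resource names from the earlier examples, might look like this:

```yaml
# Sketch: Rollout strategy excerpt declaring the header route as a managed route
strategy:
  canary:
    trafficRouting:
      managedRoutes:
        - name: canary-header-route   # must match the name used in setHeaderRoute
      istio:
        virtualService:
          name: my-service-vs
          routes:
            - primary
        destinationRule:
          name: my-service-dr
          canarySubsetName: canary
          stableSubsetName: stable
```

Managed routes are placed ahead of the regular routes in the VirtualService, which is why a header match takes effect before the weighted primary route is evaluated.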

bash
# QA team — send request directly to canary pod with X-Canary header
curl -H "X-Canary: true" https://my-service.example.com/api/health
 
# Regular users — no header → routed to stable pod
curl https://my-service.example.com/api/health

Once the QA team finishes E2E testing while paused at the pause: {} step, you can advance to the next step with kubectl argo rollouts promote my-service. If you're using ArgoCD, clicking the promotion button in the UI works just as well.
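While the rollout is paused with the header route active, the controller-managed VirtualService should look approximately like the following. This is a sketch of the expected shape, written from the step definition above; the exact output of your Argo Rollouts version may differ in detail.

```yaml
# Approximate shape of my-service-vs during the first pause (managed by the controller)
spec:
  hosts:
    - my-service
  http:
    - name: canary-header-route        # inserted ahead of the primary route
      match:
        - headers:
            X-Canary:
              exact: "true"
      route:
        - destination:
            host: my-service
            subset: canary
          weight: 100
    - name: primary                    # regular users still go entirely to stable
      route:
        - destination:
            host: my-service
            subset: stable
          weight: 100
        - destination:
            host: my-service
            subset: canary
          weight: 0
```

Comparing this against kubectl get vs my-service-vs -o yaml is a quick check before asking QA to start testing.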

Example 4: Querying Canary Pod P99 Latency and Upstream Error Rate

Monitoring latency and the impact on services the canary depends on, in addition to success rate, is much safer. As I continued operations, it was only after adding these queries that I started catching problems that success rate alone wasn't picking up.

promql
# P99 response time for canary pods (based on histogram buckets)
histogram_quantile(0.99,
  sum(rate(istio_request_duration_milliseconds_bucket{
    destination_workload="my-service-canary",
    reporter="destination"
  }[5m])) by (le)
)
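
promql
# Rate of 5xx errors in requests sent by canary pods (measures canary → dependency impact)
(
  sum(rate(istio_requests_total{
    source_workload="my-service-canary",
    reporter="source",
    response_code=~"5.*"
  }[2m]))
  /
  sum(rate(istio_requests_total{
    source_workload="my-service-canary",
    reporter="source"
  }[2m]))
) or on() vector(0)

The second query uses source_workload as the basis and fixes reporter to "source", because the calling side reports every outbound request, including calls to destinations outside the mesh. It catches errors that occur when the canary pod calls other services, not errors received by the canary pod itself. If a new version calls a downstream HTTP or gRPC API differently, you can miss the problem without this query. Plain TCP dependencies, which include most database connections, do not appear in istio_requests_total and need separate monitoring.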
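These queries become rollback conditions once they are wrapped in an AnalysisTemplate. The sketch below does that for the P99 query; the template name p99-latency, the 500 ms threshold, and the count are assumptions chosen for illustration, not values from a real service.

```yaml
# Sketch: AnalysisTemplate that gates promotion on canary P99 latency
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: p99-latency                       # hypothetical name
spec:
  args:
    - name: canary-workload
  metrics:
    - name: p99-latency
      interval: 1m
      count: 5                            # assumed value
      successCondition: result[0] <= 500  # assumed latency budget in milliseconds
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            histogram_quantile(0.99,
              sum(rate(istio_request_duration_milliseconds_bucket{
                destination_workload="{{args.canary-workload}}",
                reporter="destination"
              }[5m])) by (le)
            )
```

It can be referenced from the same analysis step as success-rate by adding a second templateName entry. Keep in mind that the query returns no usable value until the canary has served some traffic, so schedule it after a pause.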


Pros and Cons

Advantages

| Item | Details |
| --- | --- |
| Replica-count-independent traffic control | Accurate 5%, 10% traffic distribution even with just 1 canary pod |
| Pod-level metric precision | Independent measurement of error rate and latency for only requests that reached canary pods, via destination_workload label |
| Zero-downtime QA testing | Validate canary with setHeaderRoute without impacting live service, then gradually transition |
| Automatic rollback | Argo Rollouts automatically reverts to stable when AnalysisTemplate failure conditions are met |
| GitOps-friendly | All configuration is declared as Kubernetes manifests, integrating naturally with ArgoCD |

Disadvantages and Caveats

| Item | Details | Mitigation |
| --- | --- | --- |
| Resource complexity | Managing multiple resources simultaneously: VirtualService, DestinationRule, Rollout, AnalysisTemplate, etc. | Templatize with Helm chart or Kustomize for consistency |
| Metric cardinality | Increased Prometheus storage and query costs from growing Istio label combinations | Limit metric scope with Sidecar resource, configure unnecessary label excludes |
| ArgoCD diff conflicts | ArgoCD detects drift because Argo Rollouts dynamically modifies VirtualService | Must add VirtualService spec.http path to ignoreDifferences |
| Subset hash dependency | rollouts-pod-template-hash in DestinationRule changes with each rollout | Managed automatically by Argo Rollouts; do not modify manually, delegate to the controller |
| DB migration coupling | Canary/stable pods share the same DB simultaneously when schema changes are involved | Separate schema changes and code deployment into distinct phases (expand/contract pattern) |
| mTLS scraping issues | Prometheus scraping may be blocked in PeerAuthentication STRICT mode | Mount Istio certificates to Prometheus, or scrape port 15090 (Envoy stats) directly |

The DB migration issue was quite shocking when I first encountered it. I only realized the need for the expand/contract pattern after experiencing firsthand the moment canary pods started writing a new column and stable pods began throwing errors about unknown fields. The mTLS scraping issue was similar — I changed PeerAuthentication to STRICT and Prometheus collection silently stopped.

ignoreDifferences: An ArgoCD option that prevents specific field changes from being detected as drift. Because Argo Rollouts automatically modifies the weight values in VirtualService, this prevents ArgoCD from continuously attempting to sync after detecting it as an "unwanted change."
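As a rough sketch, the corresponding ArgoCD Application excerpt could look like the following. The Application name is hypothetical, and the source, destination, and project fields are left out because they depend on your repository layout.

```yaml
# Sketch: ArgoCD Application excerpt that ignores controller-managed routes
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-service              # hypothetical Application name
spec:
  # source, destination, and project omitted; they are specific to your setup
  ignoreDifferences:
    - group: networking.istio.io
      kind: VirtualService
      name: my-service-vs
      jsonPointers:
        - /spec/http            # weights and header routes are rewritten by Argo Rollouts
```

Ignoring the whole /spec/http path is the broad option. If you only use setWeight and never setHeaderRoute, narrowing the pointer to the single managed route keeps drift detection active for the rest of the resource.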

Most Common Mistakes in Practice

  1. Writing PromQL without the reporter label: Querying without reporter="destination" aggregates the source-side and destination-side reports of the same requests together, which distorts the measured success rate. It's recommended to always include this condition in AnalysisTemplate queries.

  2. Deploying a Rollout without creating the canaryService and stableService Service objects: If these two fields are declared in the Rollout manifest, Kubernetes Services with those exact names must actually exist. Without them, the Rollout controller throws errors and deployment won't proceed. You need to apply the Services introduced in Example 1 ahead of time.

  3. Omitting the setHeaderRoute removal step: If you don't remove the header route at the final stage of the rollout, requests with the X-Canary: true header keep being matched by that route, even though the subset it points to is by then the stable version. Remove it explicitly by placing a setHeaderRoute step with the same name and no match at the end of the steps.


Closing Thoughts

If you've followed along this far, you now have the foundation to set up precise canary deployments that aren't tied to pod count. VirtualService determines traffic ratios, AnalysisTemplate makes automated decisions using canary-pod-specific metrics, and setHeaderRoute opens a safe validation window for the QA team in between. While it may look complex at first with so many components, each role is clearly separated — and once you're familiar with it, the entire deployment process becomes much more transparent.

Here are 3 steps you can start with right now:

  1. Try the full flow locally (for those comfortable with Kubernetes basics and able to spin up a local cluster easily): Install Istio and Argo Rollouts on kind or minikube and apply the example manifests in order. You can see the controller automatically modifying the VirtualService in real time. Use kubectl get vs my-service-vs -o yaml -w to watch the changes live.

  2. Connect an AnalysisTemplate to an existing service first (for those already running Istio and Prometheus): Before switching to a canary strategy, it helps to first verify the istio_requests_total query for your current service in Prometheus. Confirm how the destination_workload label appears, then attach the AnalysisTemplate with its metric listed under dryRun (the field takes a list of metric names, not a boolean) to validate the analysis logic without triggering actual rollbacks.

  3. Add ArgoCD ignoreDifferences configuration in advance (for those running GitOps with ArgoCD): Before applying to production, add the VirtualService spec.http path to ignoreDifferences in your ArgoCD Application. This prevents the confusion of ArgoCD repeatedly syncing on your first deployment.
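For the second step, a dry-run analysis can be sketched as follows. It reuses the success-rate template and the canary-workload argument from Example 2; the only addition is the dryRun list.

```yaml
# Sketch: Rollout steps excerpt running the success-rate metric in dry-run mode
steps:
  - setWeight: 10
  - pause: {duration: 5m}
  - analysis:
      templates:
        - templateName: success-rate
      args:
        - name: canary-workload
          value: my-service-canary
      dryRun:
        - metricName: success-rate   # measurements are recorded but do not fail the rollout
```

Once the recorded measurements look sensible over a few deployments, removing the dryRun entry turns the same analysis into an enforcing gate.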

