Canary Deployment with Istio + Argo Rollouts: From Pod-Level Metric Isolation to Header-Based Test Routing

When I first introduced canary deployments, I made what I suspect is a common mistake. I thought splitting replica ratios with a basic Kubernetes Deployment and calling it a "10% canary" was good enough. But with 3 pods, 1 canary pod takes roughly 33% of the traffic, not 10%. Worse, the metrics from canary and stable pods got mixed together, so success rates were measured incorrectly. The AnalysisTemplate rollback condition fired on those polluted aggregates, and a perfectly healthy deployment ended in a pointless rollback.

Combining Istio's VirtualService weights with Argo Rollouts gives you precise traffic ratio control regardless of pod count, while automatically isolating metrics to only the requests that actually reached the canary pods. Add setHeaderRoute on top of that, and you can have regular users continue using the stable version while the QA team validates the canary ahead of time with a single header.

In this post, I'll walk through three things — precise traffic distribution, pod-level metric isolation, and header-based test routing — using real YAML and PromQL. This is aimed at those running Kubernetes with Prometheus for metrics collection, who have adopted or are considering adopting Istio or Argo Rollouts. I'll assume familiarity with basic Kubernetes concepts (Deployment, Service, ReplicaSet).

Core Concepts

Why VirtualService Weights Instead of Replica Ratios

The default Kubernetes approach determines traffic by the ratio of pod counts. If you want 5% going to canary, the math says you need at least 20 pods (1 canary and 19 stable), and finer-grained steps need even more. That is both costly and hard to maintain in practice.

Istio's VirtualService solves this problem by changing the layer. An Envoy sidecar is injected into each pod, and the weight value in the VirtualService directly controls Envoy's routing decisions. Regardless of how many pods there are, the traffic split follows the weights written in the YAML. Routing is decided per request, so the observed ratio converges on the configured weight as request volume grows.

VirtualService: An Istio CRD that defines routing rules for HTTP/gRPC traffic. It declares what requests go where and in what proportion, and Argo Rollouts dynamically modifies the weight values in this resource as canary steps progress.

Overall Architecture Flow

Looking at the full picture before reading the rest makes each component's role much clearer.

                     ┌────────────────────────────────────┐
  Normal request ───▶│       Istio Ingress Gateway         │
  X-Canary: true ───▶│          (Envoy Proxy)              │
                     └─────────────┬──────────────────────┘
                                   │ VirtualService
                           ┌───────┴─────────┐
                    weight │ 90%             │ 10% (or header match)
                           ▼                 ▼
              ┌─────────────────┐   ┌─────────────────┐
              │  Stable Service │   │  Canary Service │
              └────────┬────────┘   └────────┬────────┘
                       │                     │
                       ▼                     ▼
              ┌─────────────────┐   ┌─────────────────┐
              │   Stable Pods   │   │   Canary Pods   │
              │ [Envoy Sidecar] │   │ [Envoy Sidecar] │
              └────────┬────────┘   └────────┬────────┘
                       │  metrics             │  metrics
                       └──────────┬──────────┘
                                  ▼
                           ┌────────────┐
                           │ Prometheus │◀── AnalysisTemplate query
                           └────────────┘
                                  ▲
                    ┌─────────────┴───────────────┐
                    │  Argo Rollouts Controller    │
                    │ (auto promotion / rollback)  │
                    └─────────────────────────────┘

Here's a summary of each component's role:

Component	Role
`Rollout` CRD	Defines canary steps, analysis linkage, and traffic weights
`VirtualService`	Holds the HTTP routing weights (stable/canary), which Argo Rollouts updates dynamically
`DestinationRule`	Distinguishes stable/canary subsets by pod labels
`AnalysisTemplate`	Automatically decides promotion/rollback via Prometheus queries
`setHeaderRoute`	Routes only requests with a specific header to the canary pod

AnalysisTemplate: A CRD in Argo Rollouts that automatically evaluates canary pod performance. It supports various metric providers including Prometheus, Datadog, and New Relic, and automatically rolls back if conditions are not met.

How Pod-Level Metric Isolation Works

The Istio sidecar exposes Prometheus metrics for every request received by each pod. The key is that these metrics carry a destination_workload label.

promql

# Filter for only requests that reached the canary workload
istio_requests_total{destination_workload="my-service-canary"}

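Here my-service-canary stands for the value that destination_workload carries for your canary pods. Argo Rollouts names its ReplicaSets <rollout-name>-<pod-template-hash>, and the workload name Istio reports depends on how it resolves the pod's owner, so check the actual label values in Prometheus before hard-coding one. Once you have a label value that is unique to the canary, you can query only "requests actually delivered to canary pods" rather than metrics for the entire service.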

Istio Label	Meaning
`destination_workload`	Name of the workload that received the request (key for distinguishing canary/stable)
`source_workload`	Name of the workload that sent the request
`reporter`	`"source"` or `"destination"` — fix to `"destination"` to prevent duplicate aggregation
`response_code`	HTTP response code

Practical Application

Example 1: Preparing stable/canary Services and DestinationRule

The Rollout in this post references two separate Kubernetes Service objects through its stableService and canaryService fields. Argo Rollouts does not create these Services for you, which is easy to miss at first: they must exist before the Rollout is applied. Without them, the Rollout controller reports errors and the deployment won't proceed.

yaml

# Prerequisite: Create separate Services for stable and canary
apiVersion: v1
kind: Service
metadata:
  name: my-service-stable
spec:
  selector:
    app: my-service
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: my-service-canary
spec:
  selector:
    app: my-service
  ports:
    - port: 80
      targetPort: 8080

Initially, both Services can have identical selectors. When a rollout starts, the Argo Rollouts controller automatically adds the rollouts-pod-template-hash label to each Service's selector, separating stable and canary pods into their respective Services.
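As a rough illustration of that selector patching, the canary Service might look like the sketch below once a rollout is in progress. The hash value 7d9f8c6b5d is made up for this example; the real value is generated by the controller from the pod template.

```yaml
# my-service-canary while a rollout is in progress (sketch)
# The hash below is a hypothetical value, not one you write yourself
apiVersion: v1
kind: Service
metadata:
  name: my-service-canary
spec:
  selector:
    app: my-service
    rollouts-pod-template-hash: 7d9f8c6b5d   # added by the Argo Rollouts controller
  ports:
    - port: 80
      targetPort: 8080
```

The stable Service gets the same treatment with the hash of the stable ReplicaSet, which is why the two Services stop overlapping once the controller has reconciled them.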

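Next, define stable and canary subsets with a DestinationRule. Both the DestinationRule and the VirtualService below use host: my-service, which assumes a Service named my-service (selecting app: my-service) also exists in the namespace.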

yaml

# DestinationRule — define stable/canary subsets
# The rollouts-pod-template-hash values are managed automatically by Argo Rollouts,
# so you don't need to manually fill in or edit the hash values
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: my-service-dr
spec:
  host: my-service
  subsets:
    - name: stable
      labels:
        rollouts-pod-template-hash: stable   # Rollouts auto-patches with actual hash
    - name: canary
      labels:
        rollouts-pod-template-hash: canary   # Rollouts auto-patches with actual hash

With the subsets defined, the VirtualService splits traffic between them.

yaml

# VirtualService — Argo Rollouts auto-modifies weight values at each deployment step
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-service-vs
spec:
  hosts:
    - my-service                  # spec.hosts is a required field
  http:
    - name: primary
      route:
        - destination:
            host: my-service
            subset: stable
          weight: 100
        - destination:
            host: my-service
            subset: canary
          weight: 0

You can start with weight: stable 100, canary 0. Argo Rollouts will update it automatically according to setWeight steps, so you rarely need to touch the VirtualService directly. Honestly, I worried at first whether I'd need to manage this manually, but the controller handles everything, which turned out to be quite convenient.
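For a sense of what the controller writes, this is roughly how the http section of my-service-vs would read right after a setWeight: 10 step, assuming the manifests above. It is a sketch of the expected state, not output captured from a cluster.

```yaml
# spec.http of my-service-vs after setWeight: 10 (expected state, sketch)
http:
  - name: primary
    route:
      - destination:
          host: my-service
          subset: stable
        weight: 90      # rewritten by the controller
      - destination:
          host: my-service
          subset: canary
        weight: 10      # matches the setWeight step
```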

Example 2: Automated Promotion/Rollback with Rollout + AnalysisTemplate

Define canary steps and analysis in the Rollout resource. The following is an excerpt of the core strategy section (a full spec must also include selector and template fields).

yaml

# Rollout — canary steps, traffic routing, analysis linkage (strategy excerpt)
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-service
spec:
  # Basic fields like selector, replicas, template follow the same form as a Deployment
  strategy:
    canary:
      stableService: my-service-stable    # Reference the Service created earlier
      canaryService: my-service-canary    # Reference the Service created earlier
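      trafficRouting:
        istio:
          virtualService:
            name: my-service-vs
            routes:
              - primary
          destinationRule:                  # Needed because the VirtualService routes to subsets
            name: my-service-dr
            canarySubsetName: canary
            stableSubsetName: stable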
      steps:
        - setWeight: 10
        - pause: {duration: 5m}
        - analysis:
            templates:
              - templateName: success-rate
            args:
              - name: canary-workload
                value: my-service-canary
        - setWeight: 50
        - pause: {duration: 10m}
        - setWeight: 100

The AnalysisTemplate that measures only the success rate for canary pods looks like this:

yaml

# AnalysisTemplate — measures success rate of only requests that reached canary pods
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: canary-workload
  metrics:
    - name: success-rate
      interval: 1m
      count: 5           # With an interval but no count, the metric runs indefinitely and an inline analysis step never completes
      successCondition: result[0] >= 0.95
      failureLimit: 3    # Tolerates up to 3 failed measurements; the run fails on the 4th
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            (
              sum(rate(istio_requests_total{
                destination_workload="{{args.canary-workload}}",
                reporter="destination",
                response_code!~"5.*"
              }[2m]))
              /
              sum(rate(istio_requests_total{
                destination_workload="{{args.canary-workload}}",
                reporter="destination"
              }[2m]))
            ) or on() vector(1)

Leaving out the reporter="destination" condition can cause duplicate aggregation with source-side metrics. I've had experience writing queries without this condition and getting strange success rates, so I strongly recommend always including it explicitly.

Quick PromQL function reference: rate(metric[2m]) calculates the average per-second rate of increase over the past 2 minutes. sum() aggregates time series separated by label combinations. or on() vector(1) returns a default value of 1 (100% success) when the query returns no series at all, which is what happens before the canary has received any traffic. This keeps the analysis from failing during the initial no-traffic phase. It does not replace a NaN produced by a 0/0 division when the series exist but their rate is zero.

Query Element	Description
`destination_workload="{{args.canary-workload}}"`	Filters to only requests that reached canary pods
`reporter="destination"`	Fixed to receiver-side aggregation to prevent duplication
`response_code!~"5.*"`	Excludes 5xx to calculate successful request ratio
`or on() vector(1)`	Returns a default value of 1 when the query yields no series (no canary traffic yet)
`failureLimit: 3`	Tolerates up to 3 failed measurements; the 4th fails the analysis and aborts the rollout
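One gap the or on() vector(1) fallback does not cover is a zero denominator: if the canary series exist but their rate is 0, the division yields NaN rather than an empty result, and the fallback does not apply. A possible variant, sketched here as a replacement for the provider section, filters the denominator with > 0 so that case also becomes an empty result. Treat it as an untested suggestion and try it in the Prometheus UI first.

```yaml
# Variant of the success-rate query with a zero-traffic guard (sketch)
provider:
  prometheus:
    address: http://prometheus:9090
    query: |
      (
        sum(rate(istio_requests_total{
          destination_workload="{{args.canary-workload}}",
          reporter="destination",
          response_code!~"5.*"
        }[2m]))
        /
        (sum(rate(istio_requests_total{
          destination_workload="{{args.canary-workload}}",
          reporter="destination"
        }[2m])) > 0)
      ) or on() vector(1)
```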

Example 3: Header-Based Canary Testing for QA Team Only with setHeaderRoute

This is personally the pattern I find most practical. It lets the QA team validate canary pods first without affecting the live service.

yaml

# Rollout steps — set header route first, then start weighted traffic after validation
steps:
  - setHeaderRoute:
      name: canary-header-route
      match:
        - headerName: X-Canary
          headerValue:
            exact: "true"
  - pause: {}              # QA tests canary with header, waits for manual approval
  - setWeight: 10          # With header route still active, also send 10% of regular traffic to canary
  - pause: {duration: 10m}
  - analysis:
      templates:
        - templateName: success-rate
      args:
        - name: canary-workload
          value: my-service-canary
  - setWeight: 50
  - pause: {duration: 10m}
  - setHeaderRoute:
      name: canary-header-route    # Same name without match → removes that header route
  - setWeight: 100

Using the same name without match in the second-to-last step may feel unintuitive at first. Argo Rollouts interprets this as a request to remove the route with that name, so the header-condition route is deleted from the VirtualService. For either step to work, the route name must also be listed under trafficRouting.managedRoutes in the Rollout.
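setHeaderRoute only works for route names the Rollout has declared as managed. A sketch of the corresponding trafficRouting section, reusing the names from this post, might look like this:

```yaml
# trafficRouting with a managed route for setHeaderRoute (sketch)
strategy:
  canary:
    trafficRouting:
      managedRoutes:
        - name: canary-header-route   # must match the name used in setHeaderRoute
      istio:
        virtualService:
          name: my-service-vs
          routes:
            - primary
        destinationRule:
          name: my-service-dr
          canarySubsetName: canary
          stableSubsetName: stable
```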

bash

# QA team — send request directly to canary pod with X-Canary header
curl -H "X-Canary: true" https://my-service.example.com/api/health
 
# Regular users — no header → routed to stable pod
curl https://my-service.example.com/api/health
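Behind those two requests, the controller is expected to add a header-matching route ahead of the primary route in the VirtualService. The shape below is my reading of what that generated route looks like, written by hand rather than copied from a cluster, so compare it against kubectl get vs my-service-vs -o yaml in your environment.

```yaml
# spec.http of my-service-vs while the header route is active (hand-written sketch)
http:
  - name: canary-header-route        # created by setHeaderRoute
    match:
      - headers:
          X-Canary:
            exact: "true"
    route:
      - destination:
          host: my-service
          subset: canary
        weight: 100
  - name: primary                    # regular traffic still follows the weights
    route:
      - destination:
          host: my-service
          subset: stable
        weight: 100
      - destination:
          host: my-service
          subset: canary
        weight: 0
```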

Once the QA team finishes E2E testing while paused at the pause: {} step, you can advance to the next step with kubectl argo rollouts promote my-service. If you're using ArgoCD, clicking the promotion button in the UI works just as well.

Example 4: Querying Canary Pod P99 Latency and Upstream Error Rate

Monitoring latency and the impact on services the canary depends on, in addition to success rate, is much safer. As I continued operations, it was only after adding these queries that I started catching problems that success rate alone wasn't picking up.

promql

# P99 response time for canary pods (based on histogram buckets)
histogram_quantile(0.99,
  sum(rate(istio_request_duration_milliseconds_bucket{
    destination_workload="my-service-canary",
    reporter="destination"
  }[5m])) by (le)
)

promql

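# Rate of 5xx errors in requests sent by canary pods (measures canary → dependency impact)
# reporter="source" because the canary's own sidecar observes its outbound calls
(
  sum(rate(istio_requests_total{
    source_workload="my-service-canary",
    reporter="source",
    response_code=~"5.*"
  }[2m]))
  /
  sum(rate(istio_requests_total{
    source_workload="my-service-canary",
    reporter="source"
  }[2m]))
) or on() vector(0)

The second query uses source_workload as the basis and fixes reporter to "source". It catches errors that occur when the canary pod calls other services, not errors received by the canary pod itself. If a new version calls a downstream service or external API differently, you can miss the problem without this query. Note that istio_requests_total only covers HTTP and gRPC traffic, so dependencies reached over plain TCP, such as most databases, will not show up here.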
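To let the rollout act on latency as well, the P99 query could be wrapped in its own AnalysisTemplate. This is a sketch: the template name p99-latency, the 500 ms threshold, and the count are assumptions for illustration, and the threshold in particular should come from your own SLO.

```yaml
# AnalysisTemplate for canary P99 latency (sketch; name and threshold are assumptions)
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: p99-latency
spec:
  args:
    - name: canary-workload
  metrics:
    - name: p99-latency
      interval: 1m
      count: 5
      successCondition: result[0] <= 500   # milliseconds, hypothetical threshold
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            histogram_quantile(0.99,
              sum(rate(istio_request_duration_milliseconds_bucket{
                destination_workload="{{args.canary-workload}}",
                reporter="destination"
              }[5m])) by (le)
            )
```

It can then be listed next to success-rate under templates in the analysis step.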

Pros and Cons

Advantages

Item	Details
Replica-count-independent traffic control	Accurate 5%, 10% traffic distribution even with just 1 canary pod
Pod-level metric precision	Independent measurement of error rate and latency for only requests that reached canary pods, via `destination_workload` label
Zero-downtime QA testing	Validate canary with `setHeaderRoute` without impacting live service, then gradually transition
Automatic rollback	Argo Rollouts automatically reverts to stable when `AnalysisTemplate` failure conditions are met
GitOps-friendly	All configuration is declared as Kubernetes manifests, integrating naturally with ArgoCD

Disadvantages and Caveats

Item	Details	Mitigation
Resource complexity	Managing multiple resources simultaneously: VirtualService, DestinationRule, Rollout, AnalysisTemplate, etc.	Templatize with Helm chart or Kustomize for consistency
Metric cardinality	Increased Prometheus storage and query costs from growing Istio label combinations	Limit metric scope with `Sidecar` resource, configure unnecessary label excludes
ArgoCD diff conflicts	ArgoCD detects drift because Argo Rollouts dynamically modifies VirtualService	Must add VirtualService `spec.http` path to `ignoreDifferences`
Subset hash dependency	`rollouts-pod-template-hash` in DestinationRule changes with each rollout	Managed automatically by Argo Rollouts — do not modify manually, delegate to the controller
DB migration coupling	Canary/stable pods share the same DB simultaneously when schema changes are involved	Separate schema changes and code deployment into distinct phases (expand/contract pattern)
mTLS scraping issues	Prometheus scraping may be blocked in PeerAuthentication STRICT mode	Mount Istio certificates to Prometheus, or scrape port 15090 (Envoy stats) directly

The DB migration issue was quite shocking when I first encountered it. I only realized the need for the expand/contract pattern after experiencing firsthand the moment canary pods started writing a new column and stable pods began throwing errors about unknown fields. The mTLS scraping issue was similar — I changed PeerAuthentication to STRICT and Prometheus collection silently stopped.

ignoreDifferences: An ArgoCD option that prevents specific field changes from being detected as drift. Because Argo Rollouts automatically modifies the weight values in VirtualService, this prevents ArgoCD from continuously attempting to sync after detecting it as an "unwanted change."
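In an ArgoCD Application this could look roughly like the fragment below. The Application name is a placeholder and the source and destination fields are omitted; the jsonPointers entry follows the spec.http path of the VirtualService.

```yaml
# ArgoCD Application fragment (sketch; name is a placeholder, other fields omitted)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-service
spec:
  ignoreDifferences:
    - group: networking.istio.io
      kind: VirtualService
      name: my-service-vs
      jsonPointers:
        - /spec/http      # weights and header routes are managed by Argo Rollouts
```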

Most Common Mistakes in Practice

Writing PromQL without the reporter label: Querying without reporter="destination" aggregates source-side and destination-side reports together, so requests are double-counted and the computed success rate no longer reflects what the canary actually served. It's recommended to always include this condition in AnalysisTemplate queries.
Deploying a Rollout without creating the canaryService and stableService Service objects: If these two fields are declared in the Rollout manifest, Kubernetes Services with those exact names must actually exist. Without them, the Rollout controller throws errors and deployment won't proceed. You need to apply the Services introduced in Example 1 ahead of time.
Omitting the setHeaderRoute removal step: If you don't remove the header route at the final stage of the rollout, requests with the X-Canary: true header will continue to be routed only to that subset (which is already stable) going forward. You need to explicitly remove it by placing a setHeaderRoute without match at the end of the steps.

Closing Thoughts

If you've followed along this far, you now have the foundation to set up precise canary deployments that aren't tied to pod count. VirtualService determines traffic ratios, AnalysisTemplate makes automated decisions using canary-pod-specific metrics, and setHeaderRoute opens a safe validation window for the QA team in between. While it may look complex at first with so many components, each role is clearly separated — and once you're familiar with it, the entire deployment process becomes much more transparent.

Here are 3 steps you can start with right now:

Try the full flow locally (for those comfortable with Kubernetes basics and able to spin up a local cluster easily): Install Istio and Argo Rollouts on kind or minikube and apply the example manifests in order. You can see the controller automatically modifying the VirtualService in real time. Use kubectl get vs my-service-vs -o yaml -w to watch the changes live.
Connect an AnalysisTemplate to an existing service first (for those already running Istio and Prometheus): Before switching to a canary strategy, it helps to first verify the istio_requests_total query for your current service in Prometheus. Confirm how the destination_workload label appears, then connect the AnalysisTemplate with its metric listed under dryRun to validate the analysis logic without triggering actual rollbacks.
Add ArgoCD ignoreDifferences configuration in advance (for those running GitOps with ArgoCD): Before applying to production, add the VirtualService spec.http path to ignoreDifferences in your ArgoCD Application. This prevents the confusion of ArgoCD repeatedly syncing on your first deployment.
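For the second step, my understanding is that Argo Rollouts expresses dry-run as a list of metric names rather than a boolean flag. A sketch of the success-rate template with its metric in dry-run mode, leaving the metrics section as defined in Example 2:

```yaml
# AnalysisTemplate fragment with the metric in dry-run mode (sketch)
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: canary-workload
  dryRun:
    - metricName: success-rate   # results are recorded but do not fail the run
  # metrics: same success-rate metric as in Example 2
```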



