Argo Rollouts AnalysisTemplate — Implementing Automated Canary Rollbacks with Prometheus, Datadog, and Webhook

When I first introduced canary deployments, staring intently at Grafana dashboards was part of the deployment process for quite some time. The workflow was to route 10% of traffic to the canary, confirm the error rate graph was quiet, then manually run kubectl argo rollouts promote. One night, I missed a P99 latency that was creeping upward. The graph was rising so gradually that I wrote it off as momentary noise and hit promote. Thirty minutes later, the Slack on-call channel exploded. That's when I asked myself, "couldn't we just codify this?", and that was the moment I began seriously digging into Argo Rollouts' AnalysisTemplate.

This post covers the complete pattern for combining three metric providers — Prometheus, Datadog, and Webhook — to configure multi-layer automated validation gates at each canary deployment step, and for triggering automatic rollbacks on analysis failure. Alongside real YAML examples, I'll honestly walk through where it's easy to make mistakes and the pitfalls I've stepped in myself. If you're running Kubernetes and have adopted canary deployments or are considering doing so, this is written so you can apply it immediately. This assumes you're already using Prometheus and Datadog, so PromQL queries and similar details are used without separate explanation. The examples also assume Istio, so note that metric structures will vary depending on your service mesh environment.

Core Concepts

What AnalysisTemplate Does

Argo Rollouts' AnalysisTemplate is a CRD that describes how to evaluate metrics during a canary or blue-green deployment, so the controller can decide whether to continue the deployment or roll it back. When a Rollout reaches a step or strategy block that references an AnalysisTemplate, the controller creates an AnalysisRun from it; that run collects data from the specified providers, then evaluates the success and failure conditions.

A simplified view of the structure looks like this:

Rollout → AnalysisRun ← AnalysisTemplate
                ↓
    ┌───────────┼───────────┐
 Prometheus  Datadog    Webhook
    └───────────┼───────────┘
                ↓
      Success → Promote
      Failure → Abort & Rollback

Promotion Gate: A quality checkpoint that runs automatically at each deployment step. Only when the analysis succeeds does execution proceed to the next step; on failure, automatic rollback is triggered.

Three Analysis Types

There are three types depending on when the analysis runs. I found this a bit confusing at first too, but after using them in practice, each has a clear purpose.

Type	Description	Primary Use
Background Analysis	Runs continuously in parallel with the rollout and does not block steps, but a failed run still aborts the rollout	Continuous monitoring of infrastructure metrics
Inline Analysis	Declared within a rollout step; the next step cannot be entered until analysis completes	Explicit per-step gates
Pre/Post Analysis	Runs immediately before or after the traffic switch (prePromotionAnalysis / postPromotionAnalysis in the BlueGreen strategy)	Pre-validation, post-deployment health checks
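
For the Pre/Post row, a minimal sketch of where those hooks sit in a BlueGreen Rollout may help. The Service names and the smoke-test template are hypothetical; http-success-rate is the template from Example 1 of this post.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-service
spec:
  # replicas, selector, and template omitted for brevity
  strategy:
    blueGreen:
      activeService: my-service-active      # hypothetical Service names
      previewService: my-service-preview
      # Runs against the preview stack before traffic is switched
      prePromotionAnalysis:
        templates:
        - templateName: smoke-test          # hypothetical template
      # Runs after traffic has been switched to the new version
      postPromotionAnalysis:
        templates:
        - templateName: http-success-rate
        args:
        - name: service-name
          value: my-service-active.default.svc.cluster.local
```

If the pre-promotion run fails, the traffic switch does not happen; if the post-promotion run fails, traffic goes back to the previous version.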

One thing worth noting upfront: there's also a separate CRD called ClusterAnalysisTemplate. It's a cluster-wide shared version rather than namespace-scoped. The pattern of separating team-wide common gates (error rates, latency thresholds) into ClusterAnalysisTemplate and service-specific gates into AnalysisTemplate makes operations much easier down the road. I'll revisit this distinction in the pros and cons section.

Practical Application

Example 1: Configuring an HTTP Success Rate Gate with Prometheus

This is the most fundamental yet effective pattern. Based on Istio metrics, if the HTTP 5xx error rate exceeds a threshold, it automatically rolls back. All AnalysisTemplate examples below assume deployment to the same namespace.

yaml

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: http-success-rate
spec:
  args:
  - name: service-name
  metrics:
  - name: success-rate
    interval: 5m        # measure every 5 minutes
    count: 10           # 10 measurements; the first runs immediately, so about 45 minutes in total
    # Prometheus returns a vector, so reference the first value with result[0]
    successCondition: result[0] >= 0.95
    failureLimit: 3     # up to 3 failed measurements are tolerated; the 4th fails the analysis
    provider:
      prometheus:
        address: http://prometheus.monitoring.svc:9090
        query: |
          sum(irate(
            istio_requests_total{
              reporter="source",
              destination_service=~"{{args.service-name}}",
              response_code!~"5.*"
            }[5m]
          )) /
          sum(irate(
            istio_requests_total{
              reporter="source",
              destination_service=~"{{args.service-name}}"
            }[5m]
          ))

Being able to use template parameters like {{args.service-name}} directly inside queries is quite convenient — it lets you reuse the same AnalysisTemplate across multiple services.

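The Prometheus provider returns a vector as the PromQL execution result. In successCondition: result[0] >= 0.95, result[0] refers to the first element of that vector. For queries that aggregate down to a single series, such as the sum(...) / sum(...) above, indexing with result[0] is the usual approach. It is not unconditionally safe, though: if the query returns an empty vector (for example, because the canary has received no traffic yet), result[0] cannot be evaluated and the measurement is recorded as an error rather than as a pass or a fail.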
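
One way to handle the empty-vector case is to guard the index with a length check. This is a sketch that assumes the condition expression language's len() builtin and short-circuiting &&; the template name is hypothetical and the query is the one from Example 1 written on fewer lines.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: http-success-rate-guarded   # hypothetical name
spec:
  args:
  - name: service-name
  metrics:
  - name: success-rate
    interval: 5m
    count: 10
    failureLimit: 3
    # len(result) > 0 is checked first, so result[0] is only indexed when it exists
    successCondition: len(result) > 0 && result[0] >= 0.95
    provider:
      prometheus:
        address: http://prometheus.monitoring.svc:9090
        query: |
          sum(irate(istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}",response_code!~"5.*"}[5m]))
          /
          sum(irate(istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}"}[5m]))
```

Written this way, a window with no data counts as a failed measurement. If "no traffic yet" should pass instead, the condition can be flipped to len(result) == 0 || result[0] >= 0.95; which of the two fits is a judgment call per service.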

At this point, let me summarize the commonly appearing parameters.

Parameter	Description
`interval`	Metric measurement interval (e.g., `5m`)
`count`	Number of measurements; if omitted while `interval` is set, measurements repeat until the analysis is stopped (for background analysis, when the rollout finishes)
`successCondition`	Success condition expression
`failureCondition`	Failure condition expression
`failureLimit`	Number of tolerable failures (exceeding this count marks the analysis as failed)
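`inconclusiveLimit`	Number of tolerated inconclusive measurements (neither `successCondition` nor `failureCondition` was met)
`consecutiveErrorLimit`	Number of tolerated consecutive measurement errors, e.g., the metric could not be fetched or evaluated (default 4)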

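failureLimit and inconclusiveLimit are subtly different. failureLimit counts measurements whose condition evaluated to "failed". inconclusiveLimit counts measurements that landed in neither camp, which happens when both successCondition and failureCondition are defined and the value satisfies neither. When the inconclusive limit is exceeded, the rollout is paused for a human decision rather than aborted. Measurements where the metric could not be fetched or evaluated at all are errors, and those are governed by a third setting, consecutiveErrorLimit. To prevent a perfectly healthy deployment from rolling back due to momentary network issues or metric collection delays, it's good practice to set consecutiveErrorLimit and inconclusiveLimit deliberately rather than rely on the defaults.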
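
To make the three counters concrete, here is a sketch of a metric with a deliberate "grey zone" between pass and fail. The 0.90 lower bound and the limit values are assumptions chosen for illustration, not recommendations; the query is the one from Example 1 written on fewer lines.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: http-success-rate-banded   # hypothetical name
spec:
  args:
  - name: service-name
  metrics:
  - name: success-rate
    interval: 5m
    count: 10
    successCondition: result[0] >= 0.95   # clear pass
    failureCondition: result[0] < 0.90    # clear fail (assumed threshold)
    # 0.90 <= value < 0.95 meets neither condition, so the measurement is inconclusive
    failureLimit: 3
    inconclusiveLimit: 2                  # assumed value
    consecutiveErrorLimit: 5              # assumed value; covers fetch and evaluation errors
    provider:
      prometheus:
        address: http://prometheus.monitoring.svc:9090
        query: |
          sum(irate(istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}",response_code!~"5.*"}[5m]))
          /
          sum(irate(istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}"}[5m]))
```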

Example 2: Configuring a P99 Latency Gate with Datadog

Business metrics often accumulate in Datadog. Here's an example that blocks promotion when P99 response latency exceeds 200ms.

Honestly speaking, I stumbled in two ways with the Datadog provider at first. One was nil handling, and the other was using the wrong aggregation function for latency metrics. .as_count() is a function for converting rate-based metrics into counts — attaching it to a duration-like latency metric makes no sense. P99 latency requires a percentile aggregation method.

yaml

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: datadog-latency-gate
spec:
  args:
  - name: service-name
  metrics:
  - name: p99-latency
    interval: 5m
    count: 5
    # default() for nil handling is essential — see the callout below for why
    successCondition: "default(result, 0) < 200"
    failureLimit: 2
    provider:
      datadog:
        apiVersion: v2
        # Query P99 latency using percentile aggregation
        # .as_count() is for rate conversion, so do not use it on duration metrics
        query: |
          p99:trace.web.request.duration.by.resource_service{
            service:{{args.service-name}}
          }

Datadog credentials should always live in a Secret rather than in the template. By default the controller looks for a Secret named datadog in the namespace it runs in, which is argo-rollouts in a standard installation.

yaml

apiVersion: v1
kind: Secret
metadata:
  name: datadog
  namespace: argo-rollouts
stringData:
  address: https://api.datadoghq.com
  api-key: <DD_API_KEY>
  app-key: <DD_APP_KEY>

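Why the default() function is necessary: Datadog returns nil when there is no data for a given time window. Evaluating nil < 200 in successCondition fails with an evaluation error, so the measurement produces neither a pass nor a fail. Specifying a default value with default(result, 0) avoids the error. Keep in mind what the default implies: with 0, a window with no data counts as 0 ms of latency and passes the gate, so choose the default according to whether missing data should pass or fail.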

Metric units can vary depending on your Datadog APM configuration. It is recommended to first check the actual value range in the Datadog UI before setting thresholds.

Example 3: Connecting an E2E Test Gate with Webhook

If Prometheus and Datadog handle infrastructure/APM-level signals, E2E tests validate real user scenarios. You can connect external test services (Testkube, k6, your own QA API, etc.) via Webhook.

yaml

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: e2e-test-gate
spec:
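  args:
  - name: canary-url
  - name: api-token
    # Secret-backed args are declared here in the template, not in the Rollout;
    # the Secret is expected in the namespace where the AnalysisRun is created
    valueFrom:
      secretKeyRef:
        name: qa-credentials
        key: token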
  metrics:
  - name: e2e-smoke-test
    interval: 10m
    count: 1            # a single run is usually sufficient for E2E
    successCondition: result.status == "passed"
    failureCondition: result.status == "failed"
    provider:
      web:
        method: POST
        url: "https://qa-service.internal/run-tests"
        timeoutSeconds: 300   # wait time for the external service response (the default is 10 seconds)
        headers:
          - key: Authorization
            # {{args.api-token}} is Argo Rollouts parameter substitution syntax (double curly braces)
            value: "Bearer {{args.api-token}}"
          - key: Content-Type
            value: application/json
        body: |
          {
            "targetUrl": "{{args.canary-url}}",
            "testSuite": "smoke"
          }
        # jsonPath uses JSONPath standard syntax (single curly braces) — field path to evaluate in the response JSON
        jsonPath: "{$.result}"

The differing number of curly braces between jsonPath: "{$.result}" and {{args.canary-url}} can be confusing at first glance. {$.result} is JSONPath standard syntax, while {{args.xxx}} is Argo Rollouts' parameter substitution syntax. It's worth keeping in mind that two different syntaxes are mixed within the same YAML.

Example 4: Integrating All Three Providers into a Rollout

Now let's wire together the three AnalysisTemplates we created independently into a single Rollout. This is a configuration that combines multi-layer gates with progressive traffic increases of 10% → 50% → 100%.

yaml

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-service
  namespace: default
spec:
  # replicas, selector, and template are omitted here for brevity
  strategy:
    canary:
      # Background Analysis: declared under spec.strategy.canary
      # runs in parallel across all steps and does not block them
      analysis:
        templates:
        - templateName: http-success-rate
        args:
        - name: service-name
          value: my-service.default.svc.cluster.local
      steps:
      - setWeight: 10
      - pause: {duration: 5m}
      # Inline Analysis: declared inside steps
      # all three gates must pass before proceeding to setWeight: 50
      - analysis:
          templates:
          - templateName: http-success-rate    # Prometheus: error rate check
          - templateName: datadog-latency-gate # Datadog: latency check
          - templateName: e2e-test-gate        # Webhook: E2E test
          args:
          - name: service-name
            value: my-service.default.svc.cluster.local
          - name: canary-url
            value: http://my-service-canary.internal
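          # api-token is not passed from the Rollout: args here accept only a literal
          # value or valueFrom.fieldRef / podTemplateHashValue. Declare it in the
          # e2e-test-gate AnalysisTemplate's spec.args with valueFrom.secretKeyRef
          # (Secret qa-credentials, key token) instead.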
      - setWeight: 50
      - pause: {duration: 10m}
      - setWeight: 100
  progressDeadlineSeconds: 600
  progressDeadlineAbort: true

Notice that analysis blocks appear in two places here. spec.strategy.canary.analysis (a sibling of steps) is Background Analysis that monitors the rollout in parallel, while an analysis entry inside the steps list is Inline Analysis that blocks that particular step. Position and indentation are the only things that distinguish the two, so both deserve a second look. I mis-indented an analysis: block early on, and the manifest was rejected with a parsing error.
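
Background analysis does not have to start at the very first step. A sketch using the startingStep field, which takes a zero-based index into steps, could look like this; the step layout is a shortened version of Example 4.

```yaml
spec:
  strategy:
    canary:
      analysis:
        templates:
        - templateName: http-success-rate
        startingStep: 2            # hold background analysis back until the step at index 2
        args:
        - name: service-name
          value: my-service.default.svc.cluster.local
      steps:
      - setWeight: 10              # index 0
      - pause: {duration: 5m}      # index 1
      - setWeight: 50              # index 2: background analysis starts here
      - pause: {duration: 10m}     # index 3
```

Delaying the start this way is one option for the low-traffic phase, where a 10% canary may not yet produce enough requests for the success rate to be meaningful.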

Example 5: The Mechanism That Triggers Automatic Rollback

Understanding what happens internally when analysis fails makes debugging much easier.

yaml

spec:
  strategy:
    canary:
      # Delay before the canary ReplicaSet is scaled down after an abort.
      # Applies only when traffic routing is configured; the default is 30 seconds.
      # Setting it to 0 does not mean "immediately": the canary pods are then left running
      abortScaleDownDelaySeconds: 30
      steps:
      - setWeight: 20
      - analysis:
          templates:
          - templateName: http-success-rate
          # failureLimit exceeded
          #   → AnalysisRun: Failed
          #   → Rollout: transitions to Abort state
          #   → canary weight: restored to 0
          #   → traffic restored to stable ReplicaSet
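
Since abortScaleDownDelaySeconds only matters when traffic routing is configured, here is a sketch of the same fragment with an Istio trafficRouting block added. The Service, VirtualService, and route names are hypothetical and need to match objects that already exist in the cluster.

```yaml
spec:
  strategy:
    canary:
      canaryService: my-service-canary   # hypothetical Service names
      stableService: my-service-stable
      trafficRouting:
        istio:
          virtualService:
            name: my-service-vsvc        # hypothetical VirtualService
            routes:
            - primary                    # hypothetical route name inside it
      abortScaleDownDelaySeconds: 30
      steps:
      - setWeight: 20
      - analysis:
          templates:
          - templateName: http-success-rate
          args:
          - name: service-name
            value: my-service-canary.default.svc.cluster.local
```

With this in place, setWeight changes the VirtualService route weights directly instead of being approximated through replica counts, and on abort the weight goes back to the stable Service before the canary pods are scaled down.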

It's also worth learning the commands to monitor rollout status in real time:

bash

# Real-time monitoring of overall rollout status and analysis results
kubectl argo rollouts get rollout my-service -n default --watch
 
# List AnalysisRuns
kubectl get analysisrun -n default
 
# Detailed view of a specific AnalysisRun (including failure cause)
kubectl describe analysisrun <analysisrun-name> -n default

Pros and Cons Analysis

Advantages

Item	Details
Automated promotion/rollback	Deployment progression is decided based on metrics without human intervention
High reusability	`ClusterAnalysisTemplate` lets the entire team share common gate templates
Multi-provider	Prometheus, Datadog, New Relic, Webhook, Kubernetes Job, and other sources can be combined in a single Rollout
Progressive risk reduction	Traffic is increased incrementally, enabling early detection of anomalies
GitOps-friendly	AnalysisTemplates can be version-controlled in Git and included in the code review process

Disadvantages and Caveats

Item	Details	Mitigation
Cold start problem	Metrics may be statistically insufficient during early stages with very low canary traffic, leading to misjudgment	Allow sufficient initial `pause` time or start with loose thresholds
Datadog nil returns	Analysis errors occur when nil is returned for time windows with no metric data	Mandatory application of the `default(result, 0)` pattern in `successCondition`
progressDeadlineAbort bug	Known issue where Rollout displays as Degraded even when not in the middle of a deployment (GitHub Issue #1624)	Set monitoring alerts based on AnalysisRun state, not Rollout state
Threshold configuration difficulty	Too strict causes frequent false-positive rollbacks; too loose makes the gate meaningless	Start with loose thresholds and progressively tighten them based on real data
Webhook timeouts	External service response delays can cause the analysis itself to fail	Set `timeoutSeconds` generously, longer than the maximum E2E test execution time
Operational complexity	Identifying failure causes becomes difficult with multi-provider combinations	Use `kubectl describe analysisrun` to check results per provider individually

ClusterAnalysisTemplate: An AnalysisTemplate shared across the entire cluster rather than scoped to a namespace. The pattern of separating team-wide common gates into ClusterAnalysisTemplate and service-specific gates into AnalysisTemplate makes operations easier.
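
A sketch of that split, assuming a hypothetical shared template named org-http-success-rate: the cluster-scoped template has no namespace, and the Rollout opts into it with clusterScope: true.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ClusterAnalysisTemplate
metadata:
  name: org-http-success-rate      # hypothetical name; cluster-scoped, so no namespace
spec:
  args:
  - name: service-name
  metrics:
  - name: success-rate
    interval: 5m
    count: 10
    failureLimit: 3
    successCondition: result[0] >= 0.95
    provider:
      prometheus:
        address: http://prometheus.monitoring.svc:9090
        query: |
          sum(irate(istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}",response_code!~"5.*"}[5m]))
          /
          sum(irate(istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}"}[5m]))
---
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-service
  namespace: default
spec:
  # replicas, selector, and template omitted for brevity
  strategy:
    canary:
      steps:
      - setWeight: 10
      - analysis:
          templates:
          - templateName: org-http-success-rate
            clusterScope: true             # look the name up as a ClusterAnalysisTemplate
          - templateName: e2e-test-gate    # service-specific, namespaced template
          args:
          - name: service-name
            value: my-service.default.svc.cluster.local
          - name: canary-url
            value: http://my-service-canary.internal
```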

⚠️ Top 3 Most Common Mistakes in Practice

1. Applying strict thresholds from the very first deployment

Starting with settings like successCondition: result[0] >= 0.999 will cause continuous rollbacks even from statistical noise. Our team made this mistake initially, and three consecutive deployments that were perfectly fine got rolled back, which left every team member distrusting the AnalysisTemplate itself. It's recommended to keep promotion manual at first and set thresholds only after observing the actual metric distribution. For a canary, that means an indefinite pause: {} step that waits for kubectl argo rollouts promote (autoPromotionEnabled: false is the equivalent switch in the BlueGreen strategy, not a canary field).
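
A minimal sketch of such a manual gate in a canary, reusing the template and values from the earlier examples:

```yaml
spec:
  strategy:
    canary:
      steps:
      - setWeight: 10
      - analysis:
          templates:
          - templateName: http-success-rate
          args:
          - name: service-name
            value: my-service.default.svc.cluster.local
      - pause: {}          # no duration: waits until someone promotes manually
      - setWeight: 50
      - pause: {duration: 10m}
```

The analysis step still aborts the rollout if it fails; the pause only ensures that a passing analysis does not move traffic to 50% on its own while the thresholds are being calibrated.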

2. Not accounting for total analysis time from interval and count

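With interval: 5m and count: 10, a single analysis takes about 45 minutes, because the first measurement runs right away and the remaining nine follow at five-minute intervals. If you attach an E2E test as Inline Analysis with count: 3 and interval: 10m, that gate alone adds at least 20 minutes plus the time the test runs themselves take. It's worth designing with both deployment frequency and SLO in mind.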
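
As a worked example, here is the step list from Example 4 annotated with a rough time budget. The figures assume the first measurement of each metric runs immediately, that metrics within one analysis step run in parallel, and that the E2E run finishes well within 45 minutes; args are left out to keep the focus on timing.

```yaml
spec:
  strategy:
    canary:
      steps:
      - setWeight: 10
      - pause: {duration: 5m}                    # 5 min
      - analysis:                                # the slowest metric sets the pace
          templates:
          - templateName: http-success-rate      # count 10, interval 5m: (10 - 1) x 5 = 45 min
          - templateName: datadog-latency-gate   # count 5, interval 5m: (5 - 1) x 5 = 20 min
          - templateName: e2e-test-gate          # count 1: a single test run
      - setWeight: 50
      - pause: {duration: 10m}                   # 10 min
      - setWeight: 100
      # Rough budget: 5 + 45 + 10 = 60 min, plus the time the measurements themselves take
```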

3. Mismatch between Webhook response structure and jsonPath

If you set successCondition: result.status == "passed" but incorrectly specify jsonPath: "{$.data.result}", the measurement can never evaluate to passed; depending on what the path resolves to, it ends up as an error or as Inconclusive. When setting up a Webhook provider for the first time, it's helpful to first confirm how the actual result value is evaluated using kubectl describe analysisrun. This is also a problem I only discovered after Slack alerts flooded in mid-deployment.
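
The pairing between jsonPath and the condition can be sketched as two alternatives. The response body shown in the comment is an assumption about the hypothetical QA service, and the headers and body from Example 3 are left out for brevity; use one of the two metrics, not both.

```yaml
# Assumed response body from the QA service:
#   {"result": {"status": "passed", "failed": 0}, "runId": "abc"}
metrics:
# Option A: jsonPath selects the object, the condition reads a field from it
- name: e2e-smoke-test
  count: 1
  successCondition: result.status == "passed"
  failureCondition: result.status == "failed"
  provider:
    web:
      method: POST
      url: "https://qa-service.internal/run-tests"
      timeoutSeconds: 300
      jsonPath: "{$.result}"
# Option B: jsonPath selects the string itself, so the condition compares result directly
- name: e2e-smoke-test-scalar
  count: 1
  successCondition: result == "passed"
  failureCondition: result == "failed"
  provider:
    web:
      method: POST
      url: "https://qa-service.internal/run-tests"
      timeoutSeconds: 300
      jsonPath: "{$.result.status}"
```

Mixing the two, for example jsonPath: "{$.result.status}" with result.status == "passed", is the kind of mismatch that produces the symptom described above.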

Closing Thoughts

As I mentioned in the introduction, a "deployment that only feels safe when someone is watching it" ultimately burns people out. AnalysisTemplate lets you express error rates, latency, and feature validation as code so that deployments can verify their own safety. After adopting this setup, on-call pages related to deployments dropped by about 70% on our team, and there were two deployments that quietly auto-rolled back at the canary stage — both of which, we confirmed later, were issues that humans would have missed. The configuration looks complex, but once set up, night deployments and Friday afternoon deployments feel significantly less stressful.

Three steps you can start with right now:

1. Install Argo Rollouts via Helm and try converting an existing Deployment to a Rollout. Install with helm install argo-rollouts argo/argo-rollouts --namespace argo-rollouts --create-namespace (this assumes the argo chart repository has already been added), then change your existing Deployment YAML to kind: Rollout with apiVersion: argoproj.io/v1alpha1, add a canary strategy, and include an indefinite pause: {} step to get comfortable with the manual promote flow first.
2. Attach a single Prometheus-integrated AnalysisTemplate with loose thresholds. If your service's current average HTTP success rate is 99.5%, start with a generous value like successCondition: result[0] >= 0.90 and observe how the actual analysis results come out with kubectl argo rollouts get rollout <name> --watch.
3. Once you've confirmed stable operation, add Datadog or Webhook gates and progressively tighten the thresholds. Connecting multi-provider combinations after each has been individually validated is much more manageable from a debugging perspective.



