Argo Rollouts AnalysisTemplate — Implementing Automated Canary Rollbacks with Prometheus, Datadog, and Webhook
When I first introduced canary deployments, staring intently at Grafana dashboards was part of the deployment process for quite some time. The workflow involved routing 10% of traffic to the canary, confirming the error rate graph was quiet, then manually firing the argo rollouts promote command. One night, I missed a P99 latency that was creeping upward. The graph was rising so gradually that I thought "it's just momentary noise" and hit promote — thirty minutes later, the Slack on-call channel exploded. That's when the thought struck me: "couldn't we just codify this?" — and that was the moment I began seriously digging into Argo Rollouts' AnalysisTemplate.
This post covers the complete pattern for combining three metric providers — Prometheus, Datadog, and Webhook — to configure multi-layer automated validation gates at each canary deployment step, and for triggering automatic rollbacks on analysis failure. Alongside real YAML examples, I'll honestly walk through where it's easy to make mistakes and the pitfalls I've stepped in myself. If you're running Kubernetes and have adopted canary deployments or are considering doing so, this is written so you can apply it immediately. This assumes you're already using Prometheus and Datadog, so PromQL queries and similar details are used without separate explanation. The examples also assume Istio, so note that metric structures will vary depending on your service mesh environment.
Core Concepts
What AnalysisTemplate Does
Argo Rollouts' AnalysisTemplate is a CRD that automatically evaluates metrics during canary or blue-green deployments to decide "whether to continue the deployment or roll it back." When an AnalysisTemplate is referenced, an AnalysisRun instance is created; that Run collects data from the specified providers, then evaluates success/failure conditions.
A simplified view of the structure looks like this:
Rollout → AnalysisRun ← AnalysisTemplate
               ↓
   ┌───────────┼───────────┐
Prometheus  Datadog    Webhook
   └───────────┼───────────┘
               ↓
      Success → Promote
      Failure → Abort & Rollback

Promotion Gate: A quality checkpoint that runs automatically at each deployment step. Only when the analysis succeeds does execution proceed to the next step; on failure, automatic rollback is triggered.
Three Analysis Types
There are three types depending on when the analysis runs. I found this a bit confusing at first too, but after using them in practice, each has a clear purpose.
| Type | Description | Primary Use |
|---|---|---|
| Background Analysis | Runs continuously in parallel with the rollout, does not block steps | Continuous monitoring of infrastructure metrics |
| Inline Analysis | Declared within a rollout step; the next step cannot be entered until analysis completes | Explicit per-step gates |
| Pre/Post Analysis | Runs immediately before or after deployment | Pre-validation, post-deployment health checks |
One thing worth noting upfront: there's also a type called ClusterAnalysisTemplate. It's a cluster-wide shared version rather than namespace-scoped. The pattern of separating team-wide common gates (error rates, latency thresholds) into ClusterAnalysisTemplate and service-specific gates into AnalysisTemplate makes operations much easier down the road. I'll revisit this distinction in the pros and cons section.
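As a rough sketch of that split, a shared gate can be declared as a ClusterAnalysisTemplate and referenced from a Rollout step with clusterScope: true. The template name cluster-http-success-rate and the service value are placeholders chosen for illustration, and the query is a condensed form of the Istio success-rate query used in Example 1.

```yaml
# Cluster-wide gate (no metadata.namespace), shareable by every team
apiVersion: argoproj.io/v1alpha1
kind: ClusterAnalysisTemplate
metadata:
  name: cluster-http-success-rate   # hypothetical name
spec:
  args:
  - name: service-name
  metrics:
  - name: success-rate
    interval: 5m
    count: 5
    successCondition: result[0] >= 0.95
    failureLimit: 3
    provider:
      prometheus:
        address: http://prometheus.monitoring.svc:9090
        query: |
          sum(irate(istio_requests_total{destination_service=~"{{args.service-name}}",response_code!~"5.*"}[5m]))
          /
          sum(irate(istio_requests_total{destination_service=~"{{args.service-name}}"}[5m]))
---
# Fragment of a Rollout's canary steps that references the cluster-scoped template
- analysis:
    templates:
    - templateName: cluster-http-success-rate
      clusterScope: true   # look up a ClusterAnalysisTemplate instead of a namespaced one
    args:
    - name: service-name
      value: my-service.default.svc.cluster.local
```

Service-specific gates can stay as ordinary AnalysisTemplates in the service's namespace and be listed next to the cluster-scoped one in the same templates array.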
Practical Application
Example 1: Configuring an HTTP Success Rate Gate with Prometheus
This is the most fundamental yet effective pattern. Based on Istio metrics, if the HTTP 5xx error rate exceeds a threshold, it automatically rolls back. All AnalysisTemplate examples below assume deployment to the same namespace.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: http-success-rate
spec:
  args:
  - name: service-name
  metrics:
  - name: success-rate
    interval: 5m   # measure every 5 minutes
    count: 10      # measure 10 times total (= up to 50 minutes of analysis)
    # Prometheus returns a vector, so reference the first value with result[0]
    successCondition: result[0] >= 0.95
    failureLimit: 3   # up to 3 failed measurements are tolerated; a 4th fails the analysis
    provider:
      prometheus:
        address: http://prometheus.monitoring.svc:9090
        query: |
          sum(irate(
            istio_requests_total{
              reporter="source",
              destination_service=~"{{args.service-name}}",
              response_code!~"5.*"
            }[5m]
          )) /
          sum(irate(
            istio_requests_total{
              reporter="source",
              destination_service=~"{{args.service-name}}"
            }[5m]
          ))

Being able to use template parameters like {{args.service-name}} directly inside queries is quite convenient: it lets you reuse the same AnalysisTemplate across multiple services.
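The Prometheus provider returns a vector as the PromQL execution result. In successCondition: result[0] >= 0.95, result[0] refers to the first value of the returned vector. For queries where a single value is expected, result[0] is the usual way to read it. The one caveat is an empty result (for example, when the canary has not received any traffic yet): there is nothing to index, so the measurement cannot be evaluated.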
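One case worth guarding against is a query that returns no series at all, for example right after the canary starts and before any requests have been recorded. A minimal sketch, following the length-check pattern that the Argo Rollouts analysis documentation describes for empty results, looks like this. Whether "no data" should count as a pass is a judgment call; this variant treats it as a pass.

```yaml
metrics:
- name: success-rate
  interval: 5m
  count: 10
  # Pass when the vector is empty (no traffic recorded yet);
  # otherwise require a success rate of at least 95%
  successCondition: len(result) == 0 || result[0] >= 0.95
  failureLimit: 3
  provider:
    prometheus:
      address: http://prometheus.monitoring.svc:9090
      query: |
        sum(irate(istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}",response_code!~"5.*"}[5m]))
        /
        sum(irate(istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}"}[5m]))
```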
At this point, let me summarize the commonly appearing parameters.
| Parameter | Description |
|---|---|
| interval | Metric measurement interval (e.g., 5m) |
| count | Number of measurements (0 means repeat indefinitely until rollout ends) |
| successCondition | Success condition expression |
| failureCondition | Failure condition expression |
| failureLimit | Number of tolerable failures (exceeding this count marks the analysis as failed) |
| inconclusiveLimit | Number of tolerable inconclusive results (measurements that satisfy neither successCondition nor failureCondition) |

failureLimit and inconclusiveLimit are subtly different. failureLimit counts measurements whose condition evaluation is "failed," while inconclusiveLimit counts measurements that are neither a success nor a failure, which typically happens when both successCondition and failureCondition are set and the value falls between them. Measurements that could not be taken or evaluated at all (a provider timeout, an unreachable endpoint) are recorded as errors and are governed by a separate field, consecutiveErrorLimit. To prevent a perfectly healthy deployment from rolling back due to momentary network issues or metric collection delays, it's good practice to configure these limits deliberately rather than leaving them at their defaults.
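To make the distinction concrete, here is a sketch of a metric that sets all three tolerance knobs. It assumes an error-rate query; the metric name and the 1% and 5% thresholds are illustrative. Values at or below 1% pass, values at or above 5% fail, and anything in between is inconclusive. consecutiveErrorLimit is the field that covers measurements that could not be taken at all.

```yaml
metrics:
- name: error-rate                      # illustrative metric name
  interval: 1m
  count: 10
  successCondition: result[0] <= 0.01   # clear pass
  failureCondition: result[0] >= 0.05   # clear fail
  # a value between 0.01 and 0.05 matches neither condition -> Inconclusive
  failureLimit: 2            # tolerate up to 2 failed measurements
  inconclusiveLimit: 3       # tolerate up to 3 inconclusive measurements
  consecutiveErrorLimit: 4   # tolerate up to 4 consecutive provider errors
  provider:
    prometheus:
      address: http://prometheus.monitoring.svc:9090
      query: |
        sum(irate(istio_requests_total{destination_service=~"{{args.service-name}}",response_code=~"5.*"}[5m]))
        /
        sum(irate(istio_requests_total{destination_service=~"{{args.service-name}}"}[5m]))
```

When the inconclusive limit is exceeded, the run ends as Inconclusive, which pauses the rollout for a human decision instead of aborting it.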
Example 2: Configuring a P99 Latency Gate with Datadog
Business metrics often accumulate in Datadog. Here's an example that blocks promotion when P99 response latency exceeds 200ms.
Honestly speaking, I stumbled in two ways with the Datadog provider at first. One was nil handling, and the other was using the wrong aggregation function for latency metrics. .as_count() is a function for converting rate-based metrics into counts — attaching it to a duration-like latency metric makes no sense. P99 latency requires a percentile aggregation method.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: datadog-latency-gate
spec:
  args:
  - name: service-name
  metrics:
  - name: p99-latency
    interval: 5m
    count: 5
    # default() for nil handling is essential; see the callout below for why
    successCondition: "default(result, 0) < 200"
    failureLimit: 2
    provider:
      datadog:
        apiVersion: v2
        # Query P99 latency using percentile aggregation
        # .as_count() is for rate conversion, so do not use it on duration metrics
        query: |
          p99:trace.web.request.duration.by.resource_service{
            service:{{args.service-name}}
          }

Datadog credentials must always be separated into a Secret. The namespace must be set to argo-rollouts for the controller to recognize it.
apiVersion: v1
kind: Secret
metadata:
  name: datadog
  namespace: argo-rollouts
stringData:
  address: https://api.datadoghq.com
  api-key: <DD_API_KEY>
  app-key: <DD_APP_KEY>

Why the default() function is necessary: Datadog returns nil when there is no data for a given time window. Evaluating nil < 200 in successCondition fails to evaluate, so the measurement ends up in an Error or Inconclusive state instead of passing. Specifying a default value with default(result, 0) handles this situation safely.
Metric units can vary depending on your Datadog APM configuration. It is recommended to first check the actual value range in the Datadog UI before setting thresholds.
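The value passed to default() also decides how a data gap is scored. The sketch below contrasts two choices for the same latency gate. The metric names are made up for the comparison, and the 1000 fallback is an arbitrary number picked only because it sits above the 200 threshold.

```yaml
metrics:
# Variant A: "no data" becomes 0, and 0 < 200, so a gap counts as a pass
- name: p99-latency-pass-on-gap
  interval: 5m
  count: 5
  successCondition: "default(result, 0) < 200"
  failureLimit: 2
  provider:
    datadog:
      apiVersion: v2
      query: |
        p99:trace.web.request.duration.by.resource_service{service:{{args.service-name}}}
# Variant B: "no data" becomes 1000, and 1000 < 200 is false, so a gap counts as a failure
- name: p99-latency-fail-on-gap
  interval: 5m
  count: 5
  successCondition: "default(result, 1000) < 200"
  failureLimit: 2
  provider:
    datadog:
      apiVersion: v2
      query: |
        p99:trace.web.request.duration.by.resource_service{service:{{args.service-name}}}
```

Variant A suits low-traffic services where empty windows are normal. Variant B may be the better fit when the service has steady traffic and a missing metric is itself a warning sign.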
Example 3: Connecting an E2E Test Gate with Webhook
If Prometheus and Datadog handle infrastructure/APM-level signals, E2E tests validate real user scenarios. You can connect external test services (Testkube, k6, your own QA API, etc.) via Webhook.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: e2e-test-gate
spec:
  args:
  - name: canary-url
  - name: api-token
    # Secret-backed args are resolved here in the template; the qa-credentials
    # Secret must live in the same namespace as the AnalysisRun
    valueFrom:
      secretKeyRef:
        name: qa-credentials
        key: token
  metrics:
  - name: e2e-smoke-test
    interval: 10m
    count: 1   # a single run is usually sufficient for E2E
    successCondition: result.status == "passed"
    failureCondition: result.status == "failed"
    provider:
      web:
        method: POST
        url: "https://qa-service.internal/run-tests"
        timeoutSeconds: 300   # wait time for external service response; must be set
        headers:
        - key: Authorization
          # {{args.api-token}} is Argo Rollouts parameter substitution syntax (double curly braces)
          value: "Bearer {{args.api-token}}"
        - key: Content-Type
          value: application/json
        body: |
          {
            "targetUrl": "{{args.canary-url}}",
            "testSuite": "smoke"
          }
        # jsonPath uses JSONPath standard syntax (single curly braces): field path to evaluate in the response JSON
        jsonPath: "{$.result}"

The differing number of curly braces between jsonPath: "{$.result}" and {{args.canary-url}} can be confusing at first glance. {$.result} is JSONPath standard syntax, while {{args.xxx}} is Argo Rollouts' parameter substitution syntax. It's worth keeping in mind that two different syntaxes are mixed within the same YAML.
Example 4: Integrating All Three Providers into a Rollout
Now let's wire together the three AnalysisTemplates we created independently into a single Rollout. This is a configuration that combines multi-layer gates with progressive traffic increases of 10% → 50% → 100%.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-service
  namespace: default
spec:
  # selector and pod template omitted for brevity
  strategy:
    canary:
      # Background Analysis: declared under spec.strategy.canary
      # runs in parallel across all steps and does not block them
      analysis:
        templates:
        - templateName: http-success-rate
        args:
        - name: service-name
          value: my-service.default.svc.cluster.local
      steps:
      - setWeight: 10
      - pause: {duration: 5m}
      # Inline Analysis: declared inside steps
      # all three gates must pass before proceeding to setWeight: 50
      - analysis:
          templates:
          - templateName: http-success-rate      # Prometheus: error rate check
          - templateName: datadog-latency-gate   # Datadog: latency check
          - templateName: e2e-test-gate          # Webhook: E2E test
          args:
          - name: service-name
            value: my-service.default.svc.cluster.local
          - name: canary-url
            value: http://my-service-canary.internal
          # api-token is not passed from the Rollout: Rollout-level args accept
          # value, podTemplateHashValue, and fieldRef, so a secretKeyRef has to be
          # declared in the e2e-test-gate AnalysisTemplate's own args
      - setWeight: 50
      - pause: {duration: 10m}
      - setWeight: 100
  # progress deadline settings sit directly under spec, not under strategy
  progressDeadlineSeconds: 600
  progressDeadlineAbort: true

Notice that analysis blocks appear in two places here. spec.strategy.canary.analysis (outside of steps) is Background Analysis that monitors the entire rollout in parallel, while analysis inside steps is Inline Analysis that blocks that particular step. It's important to pay attention to the difference in indentation levels. I made the mistake early on of placing the inline analysis: at the same level as steps: instead of as an item in the steps list, which caused a parsing error.
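If the background analysis should not start until the canary is receiving a meaningful share of traffic, the canary analysis block accepts a startingStep index. The sketch below uses a simplified step list (the inline analysis step is left out) and assumes the http-success-rate template from Example 1. Step indices are zero-based, so 2 points at the third step.

```yaml
spec:
  strategy:
    canary:
      analysis:
        templates:
        - templateName: http-success-rate
        startingStep: 2   # zero-based index: hold the background run until the third step
        args:
        - name: service-name
          value: my-service.default.svc.cluster.local
      steps:
      - setWeight: 10            # index 0
      - pause: {duration: 5m}    # index 1
      - setWeight: 50            # index 2: background analysis starts here
      - pause: {duration: 10m}   # index 3
      - setWeight: 100           # index 4
```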
Example 5: The Mechanism That Triggers Automatic Rollback
Understanding what happens internally when analysis fails makes debugging much easier.
spec:
  strategy:
    canary:
      # Delay before the canary ReplicaSet is scaled down after an Abort
      # (applies when traffic routing is in use). The delay gives in-flight
      # requests time to complete. Note that 0 does not mean "immediately":
      # per the Rollout spec reference, 0 leaves the canary pods running
      abortScaleDownDelaySeconds: 30
      steps:
      - setWeight: 20
      - analysis:
          templates:
          - templateName: http-success-rate
          # failureLimit exceeded
          # → AnalysisRun: Failed
          # → Rollout: transitions to Abort state
          # → canary weight: restored to 0
          # → traffic restored to stable ReplicaSet

It's also worth learning the commands to monitor rollout status in real time:
# Real-time monitoring of overall rollout status and analysis results
kubectl argo rollouts get rollout my-service -n default --watch
# List AnalysisRuns
kubectl get analysisrun -n default
# Detailed view of a specific AnalysisRun (including failure cause)
kubectl describe analysisrun <analysisrun-name> -n default

Pros and Cons Analysis
Advantages
| Item | Details |
|---|---|
| Automated promotion/rollback | Deployment progression is decided based on metrics without human intervention |
| High reusability | ClusterAnalysisTemplate lets the entire team share common gate templates |
| Multi-provider | Prometheus, Datadog, New Relic, Webhook, Kubernetes Job, and other sources can be combined in a single Rollout |
| Progressive risk reduction | Traffic is increased incrementally, enabling early detection of anomalies |
| GitOps-friendly | AnalysisTemplates can be version-controlled in Git and included in the code review process |
Disadvantages and Caveats
| Item | Details | Mitigation |
|---|---|---|
| Cold start problem | Metrics may be statistically insufficient during early stages with very low canary traffic, leading to misjudgment | Allow sufficient initial pause time or start with loose thresholds |
| Datadog nil returns | Analysis errors occur when nil is returned for time windows with no metric data | Mandatory application of the default(result, 0) pattern in successCondition |
| progressDeadlineAbort bug | Known issue where Rollout displays as Degraded even when not in the middle of a deployment (GitHub Issue #1624) | Set monitoring alerts based on AnalysisRun state, not Rollout state |
| Threshold configuration difficulty | Too strict causes frequent false-positive rollbacks; too loose makes the gate meaningless | Start with loose thresholds and progressively tighten them based on real data |
| Webhook timeouts | External service response delays can cause the analysis itself to fail | Set timeoutSeconds generously, longer than the maximum E2E test execution time |
| Operational complexity | Identifying failure causes becomes difficult with multi-provider combinations | Use kubectl describe analysisrun to check results per provider individually |
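For the cold start row in the table above, one option besides a longer pause step is the per-metric initialDelay field, which postpones the first measurement of that metric. A minimal sketch, with the 3m delay chosen arbitrarily for illustration:

```yaml
metrics:
- name: success-rate
  initialDelay: 3m   # wait 3 minutes after the AnalysisRun starts before the first measurement
  interval: 5m
  count: 10
  successCondition: result[0] >= 0.95
  failureLimit: 3
  provider:
    prometheus:
      address: http://prometheus.monitoring.svc:9090
      query: |
        sum(irate(istio_requests_total{destination_service=~"{{args.service-name}}",response_code!~"5.*"}[5m]))
        /
        sum(irate(istio_requests_total{destination_service=~"{{args.service-name}}"}[5m]))
```

The delay is best sized against the query window: with a [5m] range, a delay shorter than the window still evaluates mostly pre-canary data.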
ClusterAnalysisTemplate: An AnalysisTemplate shared across the entire cluster rather than scoped to a namespace. The pattern of separating team-wide common gates into ClusterAnalysisTemplate and service-specific gates into AnalysisTemplate makes operations easier.
⚠️ Top 3 Most Common Mistakes in Practice
1. Applying strict thresholds from the very first deployment
Starting with settings like successCondition: result[0] >= 0.999 will cause continuous rollbacks even from statistical noise. Our team made this mistake initially, and three consecutive deployments that were perfectly fine got rolled back, ultimately leaving every team member distrusting the AnalysisTemplate itself. It's recommended to first run manual validation, using an indefinite pause: {} step that waits for a manual promote (autoPromotionEnabled: false is the equivalent switch for the blueGreen strategy, not canary), then set thresholds after observing the actual metric distribution.
2. Not accounting for total analysis time from interval and count
With interval: 5m and count: 10, analysis can take up to 50 minutes. If you attach an E2E test as Inline Analysis with count: 3 and interval: 10m, a single deployment can take 30 minutes. It's worth designing with both deployment frequency and SLO in mind.
3. Mismatch between Webhook response structure and jsonPath
If you set successCondition: result.status == "passed" but incorrectly specify jsonPath: "{$.data.result}", the condition never sees the field it expects, so the measurement can never pass and the run ends up Inconclusive (or in Error) every time. When setting up a Webhook provider for the first time, it's helpful to first confirm how the actual result value is evaluated using kubectl describe analysisrun. This is also a problem I only discovered after Slack alerts flooded in mid-deployment.
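A quick way to reason about this mismatch is to write the expected response body next to the provider config. Both response shapes in the comments below are hypothetical; check what your QA service actually returns before picking the path.

```yaml
# Hypothetical response A:  {"result": {"status": "passed"}}
#   jsonPath "{$.result}"      -> result is {"status": "passed"}, so result.status resolves
# Hypothetical response B:  {"data": {"result": {"status": "passed"}}}
#   jsonPath "{$.data.result}" -> required for this nesting; "{$.result}" would find nothing
metrics:
- name: e2e-smoke-test
  count: 1
  successCondition: result.status == "passed"
  failureCondition: result.status == "failed"
  provider:
    web:
      method: POST
      url: "https://qa-service.internal/run-tests"
      timeoutSeconds: 300
      jsonPath: "{$.result}"   # matches response A only
```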
Closing Thoughts
As I mentioned in the introduction, a "deployment that only feels safe when someone is watching it" ultimately burns people out. AnalysisTemplate lets you express error rates, latency, and feature validation as code so that deployments can verify their own safety. After adopting this setup, on-call pages related to deployments dropped by about 70% on our team, and there were two deployments that quietly auto-rolled back at the canary stage — both of which, we confirmed later, were issues that humans would have missed. The configuration looks complex, but once set up, night deployments and Friday afternoon deployments feel significantly less stressful.
Three steps you can start with right now:
- Install Argo Rollouts via Helm and try converting an existing Deployment to a Rollout. After adding the chart repository with helm repo add argo https://argoproj.github.io/argo-helm, install with helm install argo-rollouts argo/argo-rollouts --namespace argo-rollouts --create-namespace. Then change your existing Deployment YAML to apiVersion: argoproj.io/v1alpha1 and kind: Rollout, add a strategy.canary block, and start with an indefinite pause: {} step to get comfortable with the manual promote flow first (autoPromotionEnabled: false plays the same role for the blueGreen strategy).
- Attach a single Prometheus-integrated AnalysisTemplate with loose thresholds. If your service's current average HTTP success rate is 99.5%, start with a generous value like successCondition: result[0] >= 0.90 and observe how the actual analysis results come out with kubectl argo rollouts get rollout <name> --watch.
- Once you've confirmed stable operation, add Datadog or Webhook gates and progressively tighten the thresholds. Connecting multi-provider combinations after each has been individually validated is much more manageable from a debugging perspective.
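For the first step, a manual-promote canary can be sketched with an indefinite pause step. The image, labels, and replica count are placeholders. The rollout holds at 10% until kubectl argo rollouts promote my-service is run.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-service
  namespace: default
spec:
  replicas: 4                      # placeholder replica count
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
    spec:
      containers:
      - name: my-service
        image: registry.example.com/my-service:1.0.0   # placeholder image
  strategy:
    canary:
      steps:
      - setWeight: 10
      - pause: {}    # no duration: waits here until promoted manually
      - setWeight: 100
```

Once the manual flow feels routine, the pause: {} step is the natural place to swap in an inline analysis step.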