Argo Rollouts Automated Rollback Pipeline | Datadog · CloudWatch Multi-Provider AnalysisTemplate Progressive Threshold Hardening Strategy
There was a time when I'd wait in Slack during every deployment and manually type rollback commands whenever error rates spiked. I thought introducing canary deployments would solve this problem, but operating them in practice brings new questions: "What threshold should I set?", "I waited 5 minutes — is the sample size too small?", "What happens if the deployment completes before the analysis finishes?" — As these concerns pile up, you end up not trusting the automation and keeping a human at the end of the line. Even with canary in place, if you're not confident the analysis is actually trustworthy, you'll still have Slack open during late-night deployments. I was in that same position.
Starting in 2025, Datadog began enforcing strict rate limits on the v1 API, which means leaving existing AnalysisTemplates as-is can silently block API calls during deployments. This was the right moment to properly integrate Datadog APM metrics and AWS CloudWatch infrastructure metrics into Argo Rollouts' AnalysisTemplate — so I've documented how to build a pipeline that automatically promotes or rolls back without human intervention, even in multi-cloud environments. Beyond the basic integration steps, I'll also walk through how to progressively harden thresholds over time and where things commonly get stuck.
After reading this, you'll be able to write a Datadog · CloudWatch multi-provider AnalysisTemplate yourself and apply a progressive threshold hardening strategy in production.
Core Concepts
This post assumes you already have basic Kubernetes operational experience and understand canary deployment concepts. It assumes Argo Rollouts is already installed in your cluster and uses an EKS environment as the baseline. For installation instructions, refer to the official documentation.
What Argo Rollouts and AnalysisTemplate Do
Argo Rollouts is a controller that natively supports canary, blue/green, and other progressive deployment strategies in Kubernetes. You can either replace a standard Deployment with a Rollout CRD (Custom Resource Definition) or keep an existing Deployment and reference it via workloadRef.
So who decides whether "this version is okay" after routing 20% of traffic to the new version? That's exactly what AnalysisTemplate does.
An AnalysisTemplate is a CRD that defines metric queries and thresholds. You declare rules like "run this query every 5 minutes — if the error rate is below 1%, mark it as success; if it fails 3 times in a row, roll back."
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate-gate
spec:
  metrics:
    - name: error-percentage
      interval: 5m                          # measurement interval
      count: 3                              # total number of measurements
      successCondition: result[0] <= 0.01   # success if error rate is at or below 1%
      failureCondition: result[0] > 0.05    # measurement fails if above 5%
      failureLimit: 1                       # tolerate 1 failed measurement; a 2nd failure fails the run and triggers rollback
      provider:
        # ...metric provider configuration

Here's a quick overview of what each key field does:
| Field | Role |
|---|---|
| successCondition | Records the measurement as successful when the condition is met |
| failureCondition | Marks the measurement as failed when the condition is met; failed measurements count toward failureLimit |
| failureLimit | Number of allowed failures. Exceeding this fails the analysis and triggers automatic rollback |
| interval | Measurement interval |
| count | Total number of measurements. The analysis passes if the run finishes without exceeding failureLimit |
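To make the interplay of these fields concrete, here is a minimal Python sketch of how a run could be scored when both conditions are set. It is a simplified model written for this post, not the controller's actual code: `classify` and `run_verdict` are hypothetical names, and I'm assuming a measurement that matches neither condition is inconclusive and that the run fails as soon as failures exceed failureLimit.

```python
from typing import Callable, Sequence

Condition = Callable[[float], bool]


def classify(value: float, success: Condition, failure: Condition) -> str:
    """Score a single measurement (assumed order: failure check first)."""
    if failure(value):
        return "Failed"
    if success(value):
        return "Successful"
    return "Inconclusive"  # matched neither condition


def run_verdict(values: Sequence[float], count: int, failure_limit: int,
                success: Condition, failure: Condition) -> str:
    """Score up to `count` measurements against `failure_limit`."""
    failures = 0
    inconclusive = 0
    for value in values[:count]:
        result = classify(value, success, failure)
        if result == "Failed":
            failures += 1
            if failures > failure_limit:
                return "Failed"  # limit exceeded: this is the rollback case
        elif result == "Inconclusive":
            inconclusive += 1
    return "Inconclusive" if inconclusive else "Successful"


# Thresholds from the template above: success <= 1%, failure > 5%.
ok = lambda v: v <= 0.01
bad = lambda v: v > 0.05
print(run_verdict([0.005, 0.08, 0.004], 3, 1, ok, bad))  # Successful
print(run_verdict([0.08, 0.09, 0.004], 3, 1, ok, bad))   # Failed
```

With failureLimit: 1, the first run survives its single bad measurement, while the second run fails on its second one.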
Why failureLimit × interval Must Be Shorter Than the Deployment Duration
This is the easiest trap to miss when designing an AnalysisTemplate for the first time. I fell into it myself.
If failureLimit is 3 and interval is 10 minutes, the analysis can take up to 30 minutes in the worst case. What if the entire canary deployment completes in 20 minutes? The deployment finishes before the analysis does — and a bad release slips through.
Key constraint: the total time of failureLimit × interval must always be shorter than the full Rollout duration. If the canary stages take 40 minutes total, the analysis must complete within that window to be meaningful.
The recommendation is to measure your deployment duration first, then find a failureLimit × interval combination that fits within it.
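The arithmetic is easy to script before committing to a template. The sketch below is my own back-of-the-envelope check, not something Argo Rollouts provides: the function names are hypothetical, and it assumes the first measurement is taken at t=0 and that the run fails on failure number failureLimit + 1.

```python
def minutes_until_rollback(failure_limit: int, interval_min: float) -> float:
    """Earliest moment the run can fail when every measurement fails.

    Assumption: measurements happen at t=0, interval, 2*interval, ...
    so failure number failure_limit + 1 lands at failure_limit * interval.
    """
    return failure_limit * interval_min


def full_run_minutes(count: int, interval_min: float) -> float:
    """Time needed to take all `count` measurements (first one at t=0)."""
    return max(count - 1, 0) * interval_min


def fits_in_rollout(failure_limit: int, count: int,
                    interval_min: float, rollout_min: float) -> bool:
    """True when both windows end before the rollout itself does."""
    return (minutes_until_rollback(failure_limit, interval_min) < rollout_min
            and full_run_minutes(count, interval_min) < rollout_min)


# The example from the text: failureLimit 3, interval 10m.
print(minutes_until_rollback(3, 10))   # 30
print(fits_in_rollout(3, 4, 10, 20))   # False: a 20-minute canary ends first
print(fits_in_rollout(3, 4, 10, 40))   # True: a 40-minute canary leaves room
```

Checking the full-run time alongside failureLimit × interval also catches the case where count, rather than failureLimit, is what pushes the analysis past the end of the deployment.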
The Value of Parallel Multi-Provider Analysis
Configuring the system so that both Datadog and CloudWatch must pass before promotion lets you independently monitor the APM layer (application error rate, latency) and the infrastructure layer (ALB 5xx, target response time). If either shows anomalies, the promotion is automatically blocked.
After our team switched to this structure, we actually encountered a case where the application layer looked fine but the infrastructure layer caught an anomalous signal. It was a deployment that would have sailed through with a single provider.
# Analysis block within Rollout spec
steps:
  - setWeight: 20
  - analysis:
      templates:
        - templateName: datadog-error-rate     # APM layer
        - templateName: cloudwatch-5xx-check   # infrastructure layer

Practical Application
Example 1: Configuring a Datadog APM Error Rate Gating AnalysisTemplate
There are two key things for Datadog integration: register credentials in a Secret, and explicitly specify apiVersion: v2 in the AnalysisTemplate.
Since Datadog deprecated the v1 API and tightened rate limits starting in 2025, omitting the apiVersion field or leaving it as v1 can silently block API calls during deployments. I missed this at first and experienced the analysis suddenly halting mid-deployment.
# Step 1: Register Datadog credentials Secret
# Note: the official docs name this Secret "datadog"; confirm how your
# Argo Rollouts version resolves a custom name before relying on one.
apiVersion: v1
kind: Secret
metadata:
  name: datadog-secret
  namespace: argo-rollouts
stringData:
  address: https://api.datadoghq.com
  api-key: <DATADOG_API_KEY>
  app-key: <DATADOG_APP_KEY>

# Step 2: Define AnalysisTemplate
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: datadog-error-rate
spec:
  metrics:
    - name: error-percentage
      interval: 5m                          # AnalysisRun level: how often to repeat measurements
      count: 3
      successCondition: result[0] <= 0.01   # success if error rate is at or below 1% (ratio, 0–1 range)
      failureCondition: result[0] > 0.05    # measurement fails if above 5%
      failureLimit: 1                       # a 2nd failed measurement triggers rollback
      provider:
        datadog:
          apiVersion: v2                    # v1 is deprecated; always specify v2
          interval: 5m                      # Datadog query time window: aggregate the last 5 minutes of data
          query: |
            sum:trace.web.request.errors{env:prod,service:payment-api}.as_rate()
            /
            sum:trace.web.request.hits{env:prod,service:payment-api}.as_rate()

You might be confused seeing interval appear in two places. They serve different roles:
| interval location | Role |
|---|---|
| metrics level | Determines how often the AnalysisRun repeats measurements |
| provider.datadog level | The time window telling the Datadog API "aggregate the last N minutes of data" |
It's typical to keep both values the same, but if query data is too sparse, you can set provider.datadog.interval longer to aggregate a wider range.
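It's also worth noting that this query returns a ratio (0–1 range), so the 0.01 in successCondition: result[0] <= 0.01 means 1%. Be careful not to copy condition expressions between this example and the CloudWatch one, which returns an absolute count. The shape of result is provider-specific as well: the official Datadog examples compare result directly (for example default(result, 0) <= 0.01) rather than indexing it, so confirm which form your Argo Rollouts version expects before relying on result[0].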
| Query component | Role |
|---|---|
| trace.web.request.errors | Error request count recorded by APM |
| trace.web.request.hits | Total request count |
| .as_rate() | Converts raw counts into a per-second rate |
| env:prod,service:payment-api | Filters for a specific service and environment |
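Since the ratio-versus-percent mix-up is so easy to make, here is a small Python sketch of the same division the query performs. The helper names are hypothetical and the input numbers are made up for illustration; treating zero traffic as "undefined" rather than 0 is my assumption, not something the Datadog query does for you.

```python
from typing import Optional


def error_ratio(errors_per_sec: float, hits_per_sec: float) -> Optional[float]:
    """Errors divided by hits, as a 0-1 ratio. None when there is no traffic."""
    if hits_per_sec <= 0:
        return None  # no requests: the ratio is undefined, not zero
    return errors_per_sec / hits_per_sec


def within_limit(ratio: float, limit_percent: float) -> bool:
    """Compare a 0-1 ratio against a limit expressed in percent."""
    return ratio <= limit_percent / 100


ratio = error_ratio(0.5, 100.0)   # 0.5 errors/s out of 100 requests/s
print(ratio)                      # 0.005, i.e. 0.5%
print(within_limit(ratio, 1))     # True: 0.5% is under the 1% limit
print(within_limit(0.5, 1))       # False: as a ratio, 0.5 means 50%
```

The last line is the mistake to watch for: writing 0.5 in a condition when you mean 0.5% makes the gate a hundred times looser than intended.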
Example 2: Configuring a CloudWatch ALB 5xx Monitoring AnalysisTemplate
The most common blocker with CloudWatch integration is permissions. I initially thought cloudwatch:GetMetricData alone would suffice, but when API calls silently failed during an actual deployment, I had to go back and revisit the IAM permissions. The recommendation is to use IRSA (IAM Roles for Service Accounts — the mechanism in EKS for attaching an IAM role to a specific Kubernetes ServiceAccount) to grant only the minimum required permissions to the Argo Rollouts controller ServiceAccount.
IAM policy to attach to the role bound to the Argo Rollouts controller ServiceAccount:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["cloudwatch:GetMetricData"],
      "Resource": "*"
    }
  ]
}

# CloudWatch AnalysisTemplate
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: cloudwatch-5xx-check
spec:
  metrics:
    - name: alb-5xx-count
      interval: 5m
      count: 5
      successCondition: result[0] <= 10   # 10 or fewer 5xx errors in 5 minutes (absolute count)
      failureLimit: 2
      provider:
        cloudWatch:
          metricDataQueries:
            - id: m1
              metricStat:
                metric:
                  namespace: AWS/ApplicationELB
                  metricName: HTTPCode_Target_5XX_Count
                  dimensions:
                    - name: LoadBalancer
                      value: app/my-alb/abc123   # last segment of the actual ALB ARN
                period: 300   # 5-minute (300-second) aggregation
                stat: Sum

Unlike the Datadog example, where result[0] <= 0.01 is a ratio (0–1), the CloudWatch example's result[0] <= 10 is an absolute count. Because the value CloudWatch returns depends on the aggregation function (Sum, Average, etc.), verify what your query actually returns before writing condition expressions. Check the shape as well as the unit: the official CloudWatch examples read datapoints from result[0].Values, so confirm whether your Argo Rollouts version lets you compare result[0] directly.
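One way to check what the query returns is to run the same GetMetricData request outside the cluster. The sketch below uses boto3, which is a third-party package you would need to install, and it needs AWS credentials that allow cloudwatch:GetMetricData. The function names are hypothetical, and the load balancer value and region are the placeholder values from this post.

```python
from datetime import datetime, timedelta, timezone


def build_queries(load_balancer: str, period_sec: int = 300) -> list:
    """Mirror the AnalysisTemplate's metricDataQueries in boto3's casing."""
    return [{
        "Id": "m1",
        "MetricStat": {
            "Metric": {
                "Namespace": "AWS/ApplicationELB",
                "MetricName": "HTTPCode_Target_5XX_Count",
                "Dimensions": [
                    {"Name": "LoadBalancer", "Value": load_balancer},
                ],
            },
            "Period": period_sec,
            "Stat": "Sum",
        },
        "ReturnData": True,
    }]


def fetch_values(load_balancer: str, minutes: int = 30,
                 region: str = "ap-northeast-2") -> list:
    """Return the datapoints for the last `minutes` minutes."""
    import boto3  # imported here so build_queries works without boto3

    end = datetime.now(timezone.utc)
    client = boto3.client("cloudwatch", region_name=region)
    response = client.get_metric_data(
        MetricDataQueries=build_queries(load_balancer),
        StartTime=end - timedelta(minutes=minutes),
        EndTime=end,
    )
    return response["MetricDataResults"][0]["Values"]


# Example call (requires AWS credentials):
# print(fetch_values("app/my-alb/abc123"))
```

An empty list here is worth noticing: it means the ALB recorded no datapoints for that window, which a condition expression has to handle differently from a list of zeros.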
The AWS_REGION environment variable must also be set on the Argo Rollouts controller Deployment. Without it, CloudWatch API calls fail quietly: the only sign is a message in the controller logs saying it can't find the region. If you're deploying the controller via Helm chart or kustomize, add the following to the env block of that Deployment:
# Add to the env block of the Argo Rollouts controller Deployment
env:
  - name: AWS_REGION
    value: ap-northeast-2

Example 3: Progressive Threshold Hardening Flow
Honestly, applying a 1% error rate threshold from the start leads to too many false-positive rollbacks, where the analysis judges a perfectly fine deployment as failed and rolls it back unnecessarily. Without baseline data, you can't even tell whether 1% is normal or abnormal. The recommended approach is to start loose and progressively tighten thresholds as service stability data accumulates.
| Deployment Phase | Canary Traffic | Error Rate Threshold | failureLimit | Notes |
|---|---|---|---|---|
| Phase 1 (initial rollout) | 5% | ≤ 5% | 3 | Baseline data collection period |
| Phase 2 (after 1 week) | 20% | ≤ 3% | 2 | Reflecting stability data |
| Phase 3 (stabilization) | 50% | ≤ 1% | 1 | Tighten after confidence is established |
| Full Rollout | 100% | ≤ 0.5% | 1 | Final threshold based on SLO (Service Level Objective) |
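The phase table above can also be kept as data, which makes it easy to check that each phase really is stricter than the one before. This is a minimal Python sketch under my own naming: `PHASES`, `rollout_args`, and `is_hardening` are hypothetical, and the arg name error-threshold is simply the one used in the parameterized example in this post.

```python
# Thresholds are 0-1 ratios, matching the Datadog query's return value.
PHASES = [
    {"name": "Phase 1", "canary_weight": 5, "max_error_ratio": 0.05, "failure_limit": 3},
    {"name": "Phase 2", "canary_weight": 20, "max_error_ratio": 0.03, "failure_limit": 2},
    {"name": "Phase 3", "canary_weight": 50, "max_error_ratio": 0.01, "failure_limit": 1},
    {"name": "Full Rollout", "canary_weight": 100, "max_error_ratio": 0.005, "failure_limit": 1},
]


def rollout_args(phase: dict) -> list:
    """Build the args list a Rollout would pass to a parameterized template."""
    return [{"name": "error-threshold", "value": str(phase["max_error_ratio"])}]


def is_hardening(phases: list) -> bool:
    """True when thresholds and failure limits never loosen between phases."""
    return all(
        later["max_error_ratio"] <= earlier["max_error_ratio"]
        and later["failure_limit"] <= earlier["failure_limit"]
        for earlier, later in zip(phases, phases[1:])
    )


print(rollout_args(PHASES[1]))  # [{'name': 'error-threshold', 'value': '0.03'}]
print(is_hardening(PHASES))     # True
```

A check like `is_hardening` is cheap to run in CI, so a PR that accidentally loosens a later phase gets flagged before it reaches the cluster.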
Each phase can be managed as a separate AnalysisTemplate, or you can use args to branch via parameters within a single template. However, whether args interpolation syntax is supported inside successCondition depends on the Argo Rollouts version, so it's worth verifying version compatibility before applying it.
# Parameterized AnalysisTemplate using args (verify version compatibility before use)
spec:
  args:
    - name: error-threshold
  metrics:
    - name: error-rate
      successCondition: "result[0] <= {{args.error-threshold}}"
      # ...

# Inject threshold value when calling from Rollout
analysis:
  templates:
    - templateName: parameterized-error-rate
  args:
    - name: error-threshold
      value: "0.03"   # Phase 2: 3%

Pros and Cons Analysis
Advantages
| Item | Details |
|---|---|
| Automated safety net | Immediately rolls back on SLI (Service Level Indicator) degradation — even during late-night deployments and long holidays |
| Data-driven decisions | Promotion/rollback decisions based on actual metrics, not intuition or gut feeling |
| Multi-provider support | Monitors APM + infrastructure layers simultaneously, minimizing blind spots |
| GitOps-friendly | AnalysisTemplate itself is version-controlled in Git and can be reviewed via PR |
| Progressive risk exposure | New version is exposed only to a fraction of traffic, not all of it |
Disadvantages and Caveats
| Item | Details | Mitigation |
|---|---|---|
| Threshold configuration difficulty | Without a baseline, thresholds can cause false positives or let bad releases through | Collect at least 2 weeks of operational data before establishing a baseline |
| Analysis timeout blind spot | If failureLimit × interval exceeds deployment duration, bad releases can be promoted | Calculate total deployment duration first, then configure parameters accordingly |
| Datadog API costs | Frequent API calls with short interval can exceed rate limits | Keep interval at 5m or above, use API v2 |
| CloudWatch IAM permissions | Overly broad permissions are a security risk | Use IRSA to grant minimum permissions only to the controller ServiceAccount |
| Cold start problem | Deploying during low-traffic hours yields too few samples, reducing reliability | Adjust count and interval to match traffic peak patterns |
| Stateful app complexity | Rollback with DB schema changes can cause data consistency issues | Apply the Expand/Contract pattern to decouple migrations from deployments |
The Most Common Mistakes in Practice
- Continuing to use the Datadog API v1: Omitting the apiVersion field or leaving it as v1 will cause API calls to be blocked during analysis runs under Datadog's 2025 rate limit policy. It's recommended to make a habit of explicitly specifying apiVersion: v2.
- Skipping the failureLimit × interval calculation: When you're focused only on the query and threshold, it's easy to overlook this time constraint. It's worth simulating whether the analysis can complete before the deployment finishes.
- Validating thresholds during low-traffic periods: Running a test deployment at 3 AM means too few samples, making the analysis results unreliable. Threshold validation should be done during periods when traffic is at normal levels.
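The low-traffic problem in the last item comes down to simple arithmetic: with N requests in the window, one stray error already produces an error ratio of 1/N. The Python sketch below makes that check explicit. The function names and the "tolerate one stray error" rule are my own assumptions for illustration, not a rule from Argo Rollouts.

```python
def requests_in_window(requests_per_sec: float, window_min: float) -> float:
    """Approximate request count seen by the canary during one window."""
    return requests_per_sec * window_min * 60


def enough_samples(requests: float, max_error_ratio: float,
                   tolerated_errors: int = 1) -> bool:
    """True when `tolerated_errors` stray errors stay at or under the limit."""
    if requests <= 0:
        return False  # no traffic means nothing to judge
    return tolerated_errors / requests <= max_error_ratio


# 3 AM: 0.2 req/s over a 5-minute window is only about 60 requests.
night = requests_in_window(0.2, 5)
print(enough_samples(night, 0.01))   # False: one error is already ~1.7%

# Daytime: 50 req/s over the same window is 15000 requests.
day = requests_in_window(50, 5)
print(enough_samples(day, 0.01))     # True
```

If the check fails for your canary's share of traffic, lengthening the query window or raising the canary weight are the two knobs that increase the sample count.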
Closing Thoughts
If you came into this post unsure about what thresholds to set or how to use Datadog and CloudWatch together, you should now have everything you need: the ability to build a multi-provider AnalysisTemplate yourself and a strategy for progressively hardening thresholds starting from baseline data. The only thing left is to measure your current service's error rate baseline — just that one step.
Three steps you can take right now:
- Start by measuring the error rate baseline of your existing service. For Datadog, you can run sum:trace.web.request.errors{service:your-service}.as_rate() / sum:trace.web.request.hits{service:your-service}.as_rate() against the last 2 weeks of data and check the P95 error rate. That value becomes the starting point for your threshold configuration.
- Using the example code above, try deploying an AnalysisTemplate with a loose configuration like failureLimit: 3, interval: 5m, successCondition: result[0] <= 0.05. Rather than starting strict, the priority is confirming that the analysis actually works correctly in a real deployment flow.
- Once 1–2 weeks of operational data has accumulated, progressively tighten your thresholds following the Phase strategy described above. Creating a parameterized AnalysisTemplate using args makes it much easier to adjust thresholds between phases.
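For the baseline step, once you have exported the error-ratio samples, the P95 can be computed with Python's standard library. This is a minimal sketch: `p95` and `share_above` are hypothetical helper names, the sample values are made up, and I'm assuming the samples are already 0-1 ratios taken at a fixed interval.

```python
import statistics
from typing import Sequence


def p95(samples: Sequence[float]) -> float:
    """95th percentile, using the statistics module's 'inclusive' method."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    # n=100 yields 99 cut points; index 94 is the 95th percentile.
    return statistics.quantiles(samples, n=100, method="inclusive")[94]


def share_above(samples: Sequence[float], threshold: float) -> float:
    """Fraction of past samples that would have breached `threshold`."""
    if not samples:
        raise ValueError("need at least one sample")
    return sum(1 for s in samples if s > threshold) / len(samples)


# Made-up baseline: mostly quiet, with a few noisy windows.
baseline = [0.002, 0.003, 0.004, 0.003, 0.002, 0.012, 0.003, 0.004, 0.020, 0.003]
print(p95(baseline))
print(share_above(baseline, 0.01))   # 0.2: two of ten windows exceed 1%
```

`share_above` gives a rough feel for how often a candidate threshold would have tripped on normal traffic, which is the false-positive rate you are trying to keep low in the early phases.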
References
- Argo Rollouts Analysis Official Documentation | argo-rollouts.readthedocs.io
- Argo Rollouts Datadog Integration Official Documentation | argoproj.github.io
- Argo Rollouts CloudWatch Integration Official Documentation | argoproj.github.io
- Datadog Official Argo Rollouts Integration Guide | docs.datadoghq.com
- Automated Deployments With Argo Rollouts + Datadog | DZone
- How to Use Argo Rollouts for Progressive Delivery with Analysis Templates | oneuptime.com
- Canary delivery with Argo Rollout and Amazon VPC Lattice for EKS | AWS Official Blog
- Implementing Production-Grade Progressive Delivery with Automated SLO-Based Rollbacks | Medium
- Argo Rollouts Best Practices | argo-rollouts.readthedocs.io
- Progressive Delivery with Argo Rollouts: Canary with Analysis | InfraCloud
- Designing a Progressive Delivery Pipeline with GitHub Actions and Argo Rollouts | dstw.github.io
- Decoupling Canary Deployments From DBs With Argo Rollouts — ArgoConEurope 2026 | tldrecap.tech