Automating Canary Deployment Notifications to Deliver Argo Rollouts AnalysisRun Failures Instantly via Slack and PagerDuty
That morning, right after arriving at work, a colleague was typing 'rollback' into the Slack search bar. A canary deployment had silently failed overnight — the AnalysisRun had exceeded the error rate threshold and triggered an automatic rollback, but nobody knew. Instead of a PagerDuty alert, the on-call engineer started piecing together what had happened by reviewing the morning's accumulated error logs. I'd had a similar experience myself, and after that I came to truly understand how critical it is to properly connect notifications to the deployment pipeline.
This article covers, based on Argo Rollouts v1.8, how to build a notification pipeline from scratch that delivers both AnalysisRun failures and rollback events to Slack and PagerDuty simultaneously. Rather than stopping at copy-paste YAML, I'll walk through why each component is designed the way it is — and honestly, where things tend to get stuck.
Argo Rollouts has a built-in notification system. Without a separate event bus or sidecar — meaning you don't need to run AlertManager or an external event processing system — just one ConfigMap, one Secret, and a few annotation lines on the Rollout resource are all it takes to connect AnalysisRun failures all the way through to PagerDuty incident creation. You can minimize operational overhead while keeping the entire team informed of deployment status in real time.
Core Concepts
What's Helpful to Know Before Reading This
This article is aimed at readers who have experience operating Kubernetes and have a general idea of what Argo Rollouts is. If you're familiar with the three concepts below, you can follow along right away.
- Canary Deployment: A strategy where a new version is deployed to only a portion of traffic (e.g., 10%) first, and gradually rolled out further if no issues are found. This is difficult to achieve with a standard Kubernetes Deployment, and Argo Rollouts manages it declaratively.
- Argo Rollouts: Provides a `Rollout` resource that replaces the Kubernetes Deployment, handling canary and blue-green deployment strategies along with automated analysis at the controller level.
- ConfigMap / Secret: Kubernetes's way of storing configuration. Non-sensitive values go in ConfigMap; credentials like Slack tokens go in Secret.
Vault and External Secrets are tools for safely managing Secrets via GitOps — that's outside the scope of this article. Refer to the External Secrets Operator or HashiCorp Vault documentation for more.
The Complete Notification Pipeline Flow
Argo Rollouts Notifications is based on argoproj/notifications-engine. Argo CD notifications use the same engine. Teams running both tools together will find the configuration pattern easy to pick up.
Here's the path a notification takes when an event occurs, step by step:
```text
AnalysisRun Failure (error rate threshold exceeded)
  → Rollout automatically Aborts (canary rollback begins)
    → Notifications Controller detects Rollout status change
      → Evaluates trigger conditions defined in ConfigMap
        → Sends message to Slack #deploy-failures channel
        → Creates incident via PagerDuty Events API
```
AnalysisRun: The unit of analysis execution automatically created by Argo Rollouts during a canary or blue-green deployment. It fetches values from external metric sources such as Prometheus, Datadog, and New Relic to determine success or failure, triggering a rollback on failure.
Notifications Controller: In Argo Rollouts v1.x, notification processing is built into the rollouts-controller. No separate controller needs to be deployed. The `notifications-install.yaml` on GitHub is a convenience file that installs a set of default ConfigMap templates and triggers in one shot; it is not required when writing custom configuration from scratch. Logs can be checked with `kubectl logs -n argo-rollouts -l app.kubernetes.io/name=argo-rollouts`.
The Four Core Components
Getting these four components straight in your head before touching any configuration makes the whole process much smoother.
| Component | Role | Location |
|---|---|---|
| Service | Defines how to connect to Slack/PagerDuty (API endpoint, auth method) | ConfigMap |
| Template | Composes the message body using Go html/template syntax | ConfigMap |
| Trigger | Defines the conditions for when to send a notification | ConfigMap |
| Subscription | Specifies which triggers a Rollout resource subscribes to and on which channels | Rollout annotation |
Separating ConfigMap and Secret can feel cumbersome at first, but from a GitOps perspective it's a natural design. It aligns perfectly with the flow of committing configuration (ConfigMap) to Git and managing credentials (Secret) via Vault or External Secrets.
What Changed in v1.7 — Accessing the .analysisRuns Array
Honestly, before v1.7, notification messages could only say something like "a rollback occurred," which gave on-call engineers far too little context. Starting in v1.7, templates can reference the .analysisRuns array, making it possible to include directly in the notification message which metric was exceeded and by how much. This turns out to make a bigger difference to the operational experience than you might expect.
Practical Implementation
The four steps below are best applied in order. Steps 1 and 2 should be deployed to the cluster before moving on to Step 3 for the notification wiring to work correctly.
Step 1: Declaring Services, Templates, and Triggers via ConfigMap
Everything begins with argo-rollouts-notification-configmap. I prefer to deploy the ConfigMap first, verify the structure, and then attach annotations — it makes it much easier to narrow down where an error occurred.
```yaml
# argo-rollouts-notification-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argo-rollouts-notification-configmap
  namespace: argo-rollouts
data:
  # $slack-token must exactly match the Secret key name in the same namespace
  service.slack: |
    token: $slack-token
  # The service type for Events API v2 is "pagerdutyv2" (no hyphen).
  # The key name under serviceKeys ("production") becomes the channel name in the subscription annotation
  service.pagerdutyv2: |
    serviceKeys:
      production: $pagerduty-integration-key
  # Template for AnalysisRun failure messages
  # Slack treats attachments as a legacy feature. For new setups, Block Kit (blocks field) is recommended
  template.analysis-run-failed: |
    message: |
      :x: *AnalysisRun Failed* - Rollback has started
      Rollout: {{.rollout.metadata.name}}
      Namespace: {{.rollout.metadata.namespace}}
    slack:
      blocks: |
        [{
          "type": "header",
          "text": {
            "type": "plain_text",
            "text": "Deployment Failed: {{.rollout.metadata.name}}",
            "emoji": true
          }
        }, {
          "type": "section",
          "fields": [
            {
              "type": "mrkdwn",
              "text": "*Namespace*\n{{.rollout.metadata.namespace}}"
            },
            {
              "type": "mrkdwn",
              "text": "*Strategy*\n{{if .rollout.spec.strategy.canary}}Canary{{else}}BlueGreen{{end}}"
            },
            {
              "type": "mrkdwn",
              "text": "*Image*\n`{{(index .rollout.spec.template.spec.containers 0).image}}`"
            }
          ]
        }]
  # Rollback (Abort) template, used for PagerDuty incident creation
  # The value after stripping the "template." prefix ("rollout-aborted")
  # is the name referenced in the trigger's send array
  template.rollout-aborted: |
    message: "Rollback occurred: {{.rollout.metadata.name}}"
    pagerdutyV2:
      summary: "{{.rollout.metadata.name}} rollback occurred - {{.rollout.metadata.namespace}}"
      severity: critical
      source: "argo-rollouts"
      component: "{{.rollout.metadata.name}}"
  # on-analysis-run-failed: metric collection succeeded but the value exceeded the threshold
  # on-analysis-run-error: metric collection itself failed (e.g., Prometheus connection failure)
  # Both ultimately lead to a rollback, so the same template handles both
  trigger.on-analysis-run-failed: |
    - send: [analysis-run-failed]
  trigger.on-analysis-run-error: |
    - send: [analysis-run-failed]
  trigger.on-rollout-aborted: |
    - send: [rollout-aborted]
```

| Field | Description |
|---|---|
| service.slack | Defines Slack Bot Token-based connection. References Secret keys using the $variableName format |
| service.pagerdutyv2 | Events API v2 serviceKey map. The key name becomes the channel name in the subscription annotation |
| template.* | Composes messages using Go html/template syntax. Declare Block Kit payload under slack: and PD-specific payload under pagerdutyV2: |
| trigger.* | The send: array uses template names with the template. prefix removed |
The difference between on-analysis-run-failed and on-analysis-run-error can be confusing at first: failed means Prometheus successfully collected the error rate but the value exceeded the threshold, while error means the Prometheus server was unreachable or the query execution failed so no metrics could be fetched at all. If error is occurring frequently in production, fixing the metric collection infrastructure should take priority over the notifications themselves.
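If the two cases call for different responses, they can also be split into separate templates. The fragment below is a sketch, not part of the configuration above: the template name analysis-run-error and its message text are assumptions made for illustration, and the two entries would sit under data: in argo-rollouts-notification-configmap, replacing the shared on-analysis-run-error trigger.

```yaml
# Fragment for argo-rollouts-notification-configmap (illustrative names and wording)
data:
  # Hypothetical template dedicated to metric collection failures
  template.analysis-run-error: |
    message: |
      :warning: *AnalysisRun Error*: metrics could not be collected
      Rollout: {{.rollout.metadata.name}}
      Namespace: {{.rollout.metadata.namespace}}
      Check the metric provider (e.g., Prometheus) before retrying the deployment.
  # Point the error trigger at the new template instead of analysis-run-failed
  trigger.on-analysis-run-error: |
    - send: [analysis-run-error]
```

With this split, a message in the channel immediately tells the reader whether to look at the new release or at the monitoring stack.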
Step 2: Separating Credentials via Secret
This is where many people get stuck at first. The $variableName in the ConfigMap must exactly match the key name in the Secret for automatic binding to work, and the Secret must be in the same namespace as the ConfigMap.
```yaml
# argo-rollouts-notification-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: argo-rollouts-notification-secret
  namespace: argo-rollouts
type: Opaque
stringData:
  slack-token: "xoxb-your-slack-bot-token"
  pagerduty-integration-key: "your-pagerduty-v2-integration-key"
```

In a real production environment, it's safer to inject values using External Secrets Operator or Vault Agent Injector rather than putting them directly in stringData. This example is for understanding the structure.
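For orientation only, here is roughly what the External Secrets Operator route could look like. This is a sketch under assumptions: the store name my-secret-store and the remote path deploy/notifications are hypothetical, and the apiVersion depends on the ESO release installed in your cluster. The one fixed requirement is that the generated Secret keeps the name and key names the ConfigMap expects.

```yaml
# external-secret.yaml (sketch; store name and remote paths are hypothetical)
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: argo-rollouts-notification-secret
  namespace: argo-rollouts
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: my-secret-store            # hypothetical store
  target:
    # The generated Secret must keep the name the controller looks for
    name: argo-rollouts-notification-secret
  data:
    - secretKey: slack-token         # must match $slack-token in the ConfigMap
      remoteRef:
        key: deploy/notifications    # hypothetical path in the backing store
        property: slack-token
    - secretKey: pagerduty-integration-key
      remoteRef:
        key: deploy/notifications
        property: pagerduty-integration-key
```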
PagerDuty Events API v2 vs Incidents API: Argo Rollouts supports two PagerDuty integrations. `service.pagerdutyv2` uses Events API v2 (automatic deduplication, event-based), while `service.pagerduty` calls the Incidents API directly. For new setups, `pagerdutyv2` is recommended. Note that the service type is spelled without a hyphen; the template field is the camel-cased `pagerdutyV2`.
Step 3: Adding Subscription Annotations to the Rollout Resource
No matter how well the ConfigMap is configured, no notifications will be sent if the Rollout resource is missing subscription annotations. I've spent a long time debugging because of this missing step, so I want to emphasize how easy it is to forget.
```yaml
# rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-service
  namespace: my-team
  annotations:
    # On AnalysisRun failure (threshold exceeded) → Slack #deploy-failures
    notifications.argoproj.io/subscribe.on-analysis-run-failed.slack: "deploy-failures"
    # On AnalysisRun error (metric collection failure) → same channel
    notifications.argoproj.io/subscribe.on-analysis-run-error.slack: "deploy-failures"
    # On rollback → two Slack channels simultaneously + PagerDuty production service key
    # Multiple recipients are separated with a semicolon
    notifications.argoproj.io/subscribe.on-rollout-aborted.slack: "incidents;deploy-failures"
    notifications.argoproj.io/subscribe.on-rollout-aborted.pagerdutyv2: "production"
spec:
  # ... rest of Rollout spec
```

The annotation key structure follows the format notifications.argoproj.io/subscribe.{trigger-name}.{service-name}: "{channel-name}". Multiple Slack channels can be specified as a semicolon-separated list to send to all of them simultaneously.
Step 4: Prometheus-Based AnalysisTemplate — The Starting Point for Automatic Rollbacks
Setting up notifications is only half the job. For an AnalysisRun to actually fail, an AnalysisTemplate must be defined — and its results are what activate the notification pipeline.
Taking a Spring Boot service as an example, here's how to configure a template that determines deployment success based on the HTTP 5xx error rate:
```yaml
# analysis-template.yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: http-error-rate
  namespace: my-team
spec:
  args:
    # Separating namespace as an argument allows reuse across multiple teams
    - name: namespace
      value: my-team
  metrics:
    - name: error-rate
      interval: 5m
      count: 3
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus-server.monitoring.svc.cluster.local
          query: |
            sum(rate(http_requests_total{
              namespace="{{args.namespace}}",
              status=~"5.."
            }[5m]))
            /
            sum(rate(http_requests_total{
              namespace="{{args.namespace}}"
            }[5m]))
      # A "gray zone" between the two conditions prevents ambiguous behavior at boundary values.
      # The 1–2% range stays Inconclusive, leaving room for human judgment.
      successCondition: result[0] < 0.01   # error rate below 1% → success
      failureCondition: result[0] >= 0.02  # error rate 2% or above → failure
```

Rather than hardcoding namespace, it's extracted as an args parameter. This means no manual edits are needed when the namespace name changes or when multiple teams share the same template.
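The template only runs once a Rollout references it. A minimal sketch of that wiring, assuming an inline analysis step in a canary strategy; the step weights and pause duration are illustrative and not taken from the original setup:

```yaml
# Fragment of rollout.yaml (sketch; weights and durations are illustrative)
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-service
  namespace: my-team
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10
        # Inline analysis step: promotion waits until the AnalysisRun finishes
        - analysis:
            templates:
              - templateName: http-error-rate
            args:
              # Overrides the default value declared in the AnalysisTemplate
              - name: namespace
                value: my-team
        - setWeight: 50
        - pause: {duration: 5m}
  # ... selector, template, and the rest of the Rollout spec
```

Passing the namespace from the Rollout is what lets another team reuse http-error-rate with nothing but a different args value.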
It's also worth noting that the thresholds for successCondition and failureCondition are intentionally set differently. Setting them to the same value leaves no room between pass and fail, so a reading that hovers around the boundary (e.g., right at 1%) flips the verdict on noise alone. With the 1–2% range left open, a measurement that satisfies neither condition is marked Inconclusive. Keep in mind that inconclusiveLimit defaults to 0, so unless it is raised, a single Inconclusive measurement is enough to end the AnalysisRun as Inconclusive. The Rollout is then paused rather than rolled back, and a person can step in to make the call.
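How many Inconclusive measurements it takes before the metric itself is marked Inconclusive is controlled by the per-metric inconclusiveLimit field, which defaults to 0. A minimal sketch, assuming you want the hold to kick in only when all three measurements land in the gray zone:

```yaml
# Fragment of the metrics: list in analysis-template.yaml (sketch)
metrics:
  - name: error-rate
    interval: 5m
    count: 3
    failureLimit: 1
    # Default is 0. With 2, the metric is marked Inconclusive only when
    # more than two measurements (here: all three) fall between 1% and 2%.
    inconclusiveLimit: 2
    successCondition: result[0] < 0.01
    failureCondition: result[0] >= 0.02
    # provider: same Prometheus block as in the full template
```

Leaving the field at its default is the more conservative choice: the first ambiguous reading already hands the decision to a human.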
Pros and Cons
Advantages
| Item | Details |
|---|---|
| Built-in integration | Slack, PagerDuty, GitHub, Teams, Webhook, and more supported natively without extra plugins |
| Notification as Code | ConfigMap-based approach integrates naturally into GitOps pipelines |
| Rich context | Since v1.7, the .analysisRuns array lets you include failed metrics and causes directly in messages |
| Dual-channel simultaneous delivery | A single trigger handles both Slack (team awareness) and PagerDuty (on-call escalation) |
| Per-team independent management | Annotation-based approach allows distributed subscription management per namespace and resource |
Disadvantages and Caveats
The issue I encountered most often through real-world operations was duplicate notifications. For high-frequency deployment services, I experienced three or four notifications firing back-to-back for a single rollback. In staging, where intentional failure tests are common, channels quickly filled with noise, and team members ended up ignoring the alerts. We resolved this by routing staging to a dedicated channel or adding filter conditions to triggers.
| Item | Details | Mitigation |
|---|---|---|
| Duplicate notifications | A bug causing on-rollout-aborted to fire duplicate notifications in certain situations has been reported (GitHub Discussion #1964) | Thoroughly validate in staging before applying to production. Separate staging into its own channel. |
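| Namespace scope | By default the controller reads the ConfigMap and Secret from its own namespace (argo-rollouts in this article), not from the Rollout's namespace | Keep both resources next to the controller; with the namespace-install approach, create them in the namespace the controller is installed into |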
| Notification delay | Notifications Controller processing lag may result in delivery 30 seconds to several minutes after the actual rollback | Set the first-stage wait time in PagerDuty escalation policy to 5 minutes or more |
| Template rendering errors | Go template syntax errors fail silently without sending any notification | Pre-validate with a dry-run using the kubectl argo rollouts notifications template notify command |
| Mixed PagerDuty integration types | V1 (Incidents API) and V2 (Events API) coexist, creating potential for confusion | Always use pagerdutyv2 for new setups |
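One way to keep staging noise out of production channels is to give staging Rollouts their own subscriptions. The fragment below is a sketch: the channel name deploy-failures-staging and the namespace my-team-staging are hypothetical.

```yaml
# Fragment of a staging Rollout (sketch; channel and namespace names are hypothetical)
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-service
  namespace: my-team-staging
  annotations:
    # Staging failures go to a dedicated channel so production channels stay quiet
    notifications.argoproj.io/subscribe.on-analysis-run-failed.slack: "deploy-failures-staging"
    notifications.argoproj.io/subscribe.on-analysis-run-error.slack: "deploy-failures-staging"
    notifications.argoproj.io/subscribe.on-rollout-aborted.slack: "deploy-failures-staging"
    # No PagerDuty subscription: intentional failure tests should not page anyone
spec:
  # ... rest of Rollout spec
```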
The Most Common Mistakes in Practice
- Secret and ConfigMap are in different namespaces: The `$slack-token` reference fails silently, causing API calls to go out without a token and resulting in authentication errors. Check with `kubectl get configmap,secret -n argo-rollouts` to confirm both resources are in the same namespace.
- Typos in the trigger name within the annotation key: A typo like `on-analysis-run-faild` will silently do nothing without any error. It's safer to verify the trigger list with `kubectl argo rollouts notifications trigger get` before writing the annotation. If the `kubectl argo rollouts` plugin is not installed, refer to the official installation documentation first.
- PagerDuty serviceKey name and annotation channel name mismatch: If the ConfigMap defines `serviceKeys.production` but the annotation says `pagerdutyv2: "prod"`, nothing will be sent. The key name in `service.pagerdutyv2` and the channel name in the annotation must match exactly.
Closing Thoughts
With this pipeline in place, when a rollback occurs in the middle of the night, the on-call engineer can immediately see — via PagerDuty mobile notification and Slack message — exactly which service and which image had the problem. No more arriving at work in the morning and typing 'rollback' into the Slack search bar.
Here are three steps you can take right now:
- Deploy the ConfigMap and Secret first, then validate with a dry-run: After running `kubectl apply -f argo-rollouts-notification-configmap.yaml`, use the command below to render the template against an existing Rollout. Without a `--recipient` flag the rendered notification is printed to the console; adding `--recipient slack:deploy-failures` sends a real Slack message. The `kubectl argo rollouts` plugin must be installed beforehand.

  ```bash
  kubectl argo rollouts notifications template notify \
    analysis-run-failed {rollout-name} -n {namespace}
  ```

  If this runs without errors, the critical path is verified.
- Add annotations to a staging Rollout and reproduce a failure scenario: Setting `failureLimit: 0` in a test AnalysisTemplate causes an immediate failure verdict on the first failed measurement, letting you quickly validate the notification pipeline.
- Wire up PagerDuty escalation policies and gradually apply to production Rollouts: Once staging validation is complete, it's safest to start adding the `on-rollout-aborted.pagerdutyv2` annotation to lower-traffic services first. Given the notification delay mentioned earlier (30 seconds to several minutes), setting the first-stage wait time in the PagerDuty escalation policy to 5 minutes or more is also advisable.
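For the second step, a test template that is guaranteed to fail can look like the sketch below. The template name always-fail and the constant query are assumptions made for illustration, and the Prometheus address is reused from the earlier example.

```yaml
# always-fail-analysis-template.yaml (staging only; name and query are illustrative)
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: always-fail
  namespace: my-team
spec:
  metrics:
    - name: forced-failure
      count: 1
      failureLimit: 0   # the first failed measurement fails the whole run
      provider:
        prometheus:
          address: http://prometheus-server.monitoring.svc.cluster.local
          # vector(1) is a constant PromQL expression that always returns 1
          query: vector(1)
      # 1 < 0 is never true, so every measurement counts as a failure
      successCondition: result[0] < 0
```

Referencing this template from a staging Rollout's analysis step should abort the rollout on the first measurement and exercise the on-analysis-run-failed and on-rollout-aborted subscriptions end to end.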
References
- Argo Rollouts Notifications Official Documentation | Argo Rollouts
- Slack Integration Official Documentation | Argo Rollouts
- PagerDuty V2 Integration Official Documentation | Argo Rollouts
- PagerDuty Integration Official Documentation | Argo Rollouts
- Best Practices | Argo Rollouts
- Analysis & Progressive Delivery | Argo Rollouts
- notifications-install.yaml | GitHub argoproj/argo-rollouts
- Argo Rollouts v1.7 Release Candidate | Argo Official Blog
- Notifications for Argo | Argo Official Blog
- Progressive Delivery with Argo Rollouts: Canary with Analysis | InfraCloud
- Automating Rollbacks in Spring Boot with Argo Rollouts and Prometheus | Medium
- Implementing Production-Grade Progressive Delivery with Automated SLO-Based Rollbacks | Medium
- Stop Watching the Argo CD Dashboard: Set Up Slack and PagerDuty Notifications | burrell.tech
- Argo Rollouts GitHub Releases | GitHub