DEV BAK - TECH BLOG
DevOps

Automating Canary Deployment Notifications to Deliver Argo Rollouts AnalysisRun Failures Instantly via Slack and PagerDuty

That morning, right after arriving at work, a colleague was typing 'rollback' into the Slack search bar. A canary deployment had silently failed overnight: the AnalysisRun had exceeded the error rate threshold and triggered an automatic rollback, but nobody knew. With no PagerDuty alert to go on, the on-call engineer had to piece together what had happened from the morning's accumulated error logs. I'd had a similar experience myself, and after that I came to truly understand how critical it is to properly connect notifications to the deployment pipeline.

Based on Argo Rollouts v1.8, this article covers how to build a notification pipeline from scratch that delivers both AnalysisRun failures and rollback events to Slack and PagerDuty simultaneously. Rather than stopping at copy-paste YAML, I'll walk through why each component is designed the way it is and, honestly, where things tend to get stuck.

Argo Rollouts has a built-in notification system. Without a separate event bus or sidecar — meaning you don't need to run AlertManager or an external event processing system — just one ConfigMap, one Secret, and a few annotation lines on the Rollout resource are all it takes to connect AnalysisRun failures all the way through to PagerDuty incident creation. You can minimize operational overhead while keeping the entire team informed of deployment status in real time.


Core Concepts

What's Helpful to Know Before Reading This

This article is aimed at readers who have experience operating Kubernetes and have a general idea of what Argo Rollouts is. If you're familiar with the three concepts below, you can follow along right away.

  • Canary Deployment: A strategy where a new version is deployed to only a portion of traffic (e.g., 10%) first, and gradually rolled out further if no issues are found. This is difficult to achieve with a standard Kubernetes Deployment, and Argo Rollouts manages it declaratively.
  • Argo Rollouts: Provides a Rollout resource that replaces the Kubernetes Deployment, handling canary and blue-green deployment strategies along with automated analysis at the controller level.
  • ConfigMap / Secret: Kubernetes's way of storing configuration. Non-sensitive values go in ConfigMap; credentials like Slack tokens go in Secret.

Vault and External Secrets are tools for safely managing Secrets via GitOps — that's outside the scope of this article. Refer to the External Secrets Operator or HashiCorp Vault documentation for more.

The Complete Notification Pipeline Flow

Argo Rollouts Notifications is based on argoproj/notifications-engine. Argo CD notifications use the same engine. Teams running both tools together will find the configuration pattern easy to pick up.

Here's the path a notification takes when an event occurs, step by step:

  1. AnalysisRun fails (error rate threshold exceeded)
  2. Rollout automatically aborts (canary rollback begins)
  3. Notifications Controller detects the Rollout status change
  4. Trigger conditions defined in the ConfigMap are evaluated
  5. A message is sent to the Slack #deploy-failures channel
  6. An incident is created via the PagerDuty Events API

AnalysisRun — The unit of analysis execution automatically created by Argo Rollouts during a canary or blue-green deployment. It fetches values from external metric sources such as Prometheus, Datadog, and New Relic to determine success or failure, triggering a rollback on failure.

Notifications Controller — In Argo Rollouts v1.x, notification processing is built into the rollouts-controller. No separate controller needs to be deployed. The notifications-install.yaml on GitHub is a convenience file that installs a set of default ConfigMap templates and triggers in one shot; it is not required when writing custom configuration from scratch. Logs can be checked with kubectl logs -n argo-rollouts -l app.kubernetes.io/name=argo-rollouts.

The Four Core Components

Getting these four components straight in your head before touching any configuration makes the whole process much smoother.

| Component | Role | Location |
|---|---|---|
| Service | Defines how to connect to Slack/PagerDuty (API endpoint, auth method) | ConfigMap |
| Template | Composes the message body using Go html/template syntax | ConfigMap |
| Trigger | Defines the conditions for when to send a notification | ConfigMap |
| Subscription | Specifies which triggers a Rollout resource subscribes to and on which channels | Rollout annotation |

Separating ConfigMap and Secret can feel cumbersome at first, but from a GitOps perspective it's a natural design. It aligns perfectly with the flow of committing configuration (ConfigMap) to Git and managing credentials (Secret) via Vault or External Secrets.

What Changed in v1.7 — Accessing the .analysisRuns Array

Honestly, before v1.7, notification messages could only say something like "a rollback occurred," which gave on-call engineers far too little context. Starting in v1.7, templates can reference the .analysisRuns array, making it possible to include directly in the notification message which metric was exceeded and by how much. This turns out to make a bigger difference to the operational experience than you might expect.
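
As a rough illustration of what this enables, a template might loop over .analysisRuns along the lines of the sketch below. It assumes each element exposes the standard AnalysisRun fields (metadata.name, status.phase, status.message), and the template name analysis-run-detail is hypothetical. Render it against a real Rollout on your controller version before relying on it.

```yaml
# Fragment for the data: section of argo-rollouts-notification-configmap.
# Sketch only: assumes each entry in .analysisRuns is an AnalysisRun object
# exposing metadata.name, status.phase, and status.message.
# "analysis-run-detail" is a hypothetical template name.
template.analysis-run-detail: |
  message: |
    AnalysisRun failed for {{.rollout.metadata.name}}
    {{- range .analysisRuns}}
    - {{.metadata.name}}: {{.status.phase}} ({{.status.message}})
    {{- end}}
```

The range loop emits one line per AnalysisRun, so the on-call engineer sees the failing run and its status message without opening the cluster.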


Practical Implementation

The four steps below are best applied in order. Steps 1 and 2 should be deployed to the cluster before moving on to Step 3 for the notification wiring to work correctly.

Step 1: Declaring Services, Templates, and Triggers via ConfigMap

Everything begins with argo-rollouts-notification-configmap. I prefer to deploy the ConfigMap first, verify the structure, and then attach annotations — it makes it much easier to narrow down where an error occurred.

yaml
# argo-rollouts-notification-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argo-rollouts-notification-configmap
  namespace: argo-rollouts
data:
  # $slack-token must exactly match the Secret key name in the same namespace
  service.slack: |
    token: $slack-token
 
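  # The service type is spelled "pagerdutyv2" (no hyphen); the subscription annotation uses the same name.
  # The key name under serviceKeys ("production") becomes the channel name in the subscription annotation
  service.pagerdutyv2: |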
    serviceKeys:
      production: $pagerduty-integration-key
 
  # Template for AnalysisRun failure messages
  # Slack treats attachments as a legacy feature. For new setups, Block Kit (blocks field) is recommended
  template.analysis-run-failed: |
    message: |
      :x: *AnalysisRun Failed* — Rollback has started
      Rollout: {{.rollout.metadata.name}}
      Namespace: {{.rollout.metadata.namespace}}
    slack:
      blocks: |
        [{
          "type": "header",
          "text": {
            "type": "plain_text",
            "text": "Deployment Failed: {{.rollout.metadata.name}}",
            "emoji": true
          }
        }, {
          "type": "section",
          "fields": [
            {
              "type": "mrkdwn",
              "text": "*Namespace*\n{{.rollout.metadata.namespace}}"
            },
            {
              "type": "mrkdwn",
              "text": "*Strategy*\n{{if .rollout.spec.strategy.canary}}Canary{{else}}BlueGreen{{end}}"
            },
            {
              "type": "mrkdwn",
              "text": "*Image*\n`{{(index .rollout.spec.template.spec.containers 0).image}}`"
            }
          ]
        }]
 
  # Rollback (Abort) template — for PagerDuty incident creation
  # The value after stripping the "template." prefix ("rollout-aborted")
  # is the name referenced in the trigger's send array
  template.rollout-aborted: |
    message: "Rollback occurred: {{.rollout.metadata.name}}"
    pagerdutyV2:
      summary: "{{.rollout.metadata.name}} rollback occurred — {{.rollout.metadata.namespace}}"
      severity: critical
      source: "argo-rollouts"
      component: "{{.rollout.metadata.name}}"
 
  # on-analysis-run-failed: metric collection succeeded but the value exceeded the threshold
  # on-analysis-run-error:  metric collection itself failed (e.g., Prometheus connection failure)
  # Both ultimately lead to a rollback, so the same template handles both
  trigger.on-analysis-run-failed: |
    - send: [analysis-run-failed]
 
  trigger.on-analysis-run-error: |
    - send: [analysis-run-failed]
 
  trigger.on-rollout-aborted: |
    - send: [rollout-aborted]

| Field | Description |
|---|---|
| service.slack | Defines Slack Bot Token-based connection. References Secret keys using the $variableName format |
| service.pagerdutyv2 | Events API v2 serviceKey map. The service type is written without a hyphen, and the key name under serviceKeys becomes the channel name in the subscription annotation |
| template.* | Composes messages using Go html/template syntax. Declare Block Kit payload under slack: and PD-specific payload under pagerdutyV2: |
| trigger.* | The send: array uses template names with the template. prefix removed |

The difference between on-analysis-run-failed and on-analysis-run-error can be confusing at first: failed means Prometheus successfully collected the error rate but the value exceeded the threshold, while error means the Prometheus server was unreachable or the query execution failed so no metrics could be fetched at all. If error is occurring frequently in production, fixing the metric collection infrastructure should take priority over the notifications themselves.
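
If the distinction matters to whoever reads the channel, one option is to give the error case its own wording instead of reusing analysis-run-failed. The fragment below is a minimal sketch of that idea; the template name analysis-run-error and the message text are illustrative assumptions, not part of the configuration above.

```yaml
# Fragment for the data: section of argo-rollouts-notification-configmap.
# "analysis-run-error" is a hypothetical template name introduced here.
template.analysis-run-error: |
  message: |
    :warning: *AnalysisRun Error*: metrics could not be collected
    Rollout: {{.rollout.metadata.name}}
    Namespace: {{.rollout.metadata.namespace}}
    Check the metric provider (e.g., Prometheus) before suspecting the release.

# Point the error trigger at the new template instead of analysis-run-failed
trigger.on-analysis-run-error: |
  - send: [analysis-run-error]
```

With this split, a message in the channel already tells the reader whether to look at the release or at the monitoring stack first.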

Step 2: Separating Credentials via Secret

This is where many people get stuck at first. The $variableName in the ConfigMap must exactly match the key name in the Secret for automatic binding to work, and the Secret must be in the same namespace as the ConfigMap.

yaml
# argo-rollouts-notification-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: argo-rollouts-notification-secret
  namespace: argo-rollouts
type: Opaque
stringData:
  slack-token: "xoxb-your-slack-bot-token"
  pagerduty-integration-key: "your-pagerduty-v2-integration-key"

In a real production environment, it's safer to inject values using External Secrets Operator or Vault Agent Injector rather than putting them directly in stringData. This example is for understanding the structure.
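
For orientation only, an External Secrets Operator manifest that produces the same Secret might look like the sketch below. It assumes the operator is installed and that a ClusterSecretStore named vault-backend exists. That store name and the remote key paths are hypothetical, and the apiVersion depends on the operator version you run.

```yaml
# Sketch only: assumes External Secrets Operator is installed and a
# ClusterSecretStore named "vault-backend" exists (hypothetical name).
# The remote key and property paths are hypothetical as well.
apiVersion: external-secrets.io/v1beta1   # may differ by operator version
kind: ExternalSecret
metadata:
  name: argo-rollouts-notification-secret
  namespace: argo-rollouts
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: vault-backend
  target:
    # Must match the Secret name the controller reads
    name: argo-rollouts-notification-secret
  data:
    - secretKey: slack-token                 # referenced as $slack-token
      remoteRef:
        key: notifications/argo-rollouts
        property: slack-token
    - secretKey: pagerduty-integration-key   # referenced as $pagerduty-integration-key
      remoteRef:
        key: notifications/argo-rollouts
        property: pagerduty-integration-key
```

The point of the sketch is that the secretKey values stay identical to the $variableName references in the ConfigMap, so nothing else in the pipeline changes.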

PagerDuty Events API v2 vs Incidents API: Argo Rollouts supports two PagerDuty integrations. service.pagerdutyv2 uses Events API v2 (automatic deduplication, event-based), while service.pagerduty calls the Incidents API directly. For new setups, pagerdutyv2 is recommended. Note that the service type is spelled without a hyphen, in both the ConfigMap key and the subscription annotation.

Step 3: Adding Subscription Annotations to the Rollout Resource

No matter how well the ConfigMap is configured, no notifications will be sent if the Rollout resource is missing subscription annotations. I've spent a long time debugging because of this missing step, so I want to emphasize how easy it is to forget.

yaml
# rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-service
  namespace: my-team
  annotations:
    # On AnalysisRun failure (threshold exceeded) → Slack #deploy-failures
    notifications.argoproj.io/subscribe.on-analysis-run-failed.slack: "deploy-failures"
 
    # On AnalysisRun error (metric collection failure) → same channel
    notifications.argoproj.io/subscribe.on-analysis-run-error.slack: "deploy-failures"
 
    # On rollback → two Slack channels simultaneously (semicolon-separated) + PagerDuty production service key
    notifications.argoproj.io/subscribe.on-rollout-aborted.slack: "incidents;deploy-failures"
    notifications.argoproj.io/subscribe.on-rollout-aborted.pagerdutyv2: "production"
spec:
  # ... rest of Rollout spec

The annotation key structure follows the format notifications.argoproj.io/subscribe.{trigger-name}.{service-name}: "{channel-name}". Multiple Slack channels can be listed in one value, separated by semicolons (for example "incidents;deploy-failures"), to send to all of them simultaneously.

Step 4: Prometheus-Based AnalysisTemplate — The Starting Point for Automatic Rollbacks

Setting up notifications is only half the job. For an AnalysisRun to actually fail, an AnalysisTemplate must be defined — and its results are what activate the notification pipeline.

Taking a Spring Boot service as an example, here's how to configure a template that determines deployment success based on the HTTP 5xx error rate:

yaml
# analysis-template.yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: http-error-rate
  namespace: my-team
spec:
  args:
    # Separating namespace as an argument allows reuse across multiple teams
    - name: namespace
      value: my-team
  metrics:
    - name: error-rate
      interval: 5m
      count: 3
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus-server.monitoring.svc.cluster.local
          query: |
            sum(rate(http_requests_total{
              namespace="{{args.namespace}}",
              status=~"5.."
            }[5m]))
            /
            sum(rate(http_requests_total{
              namespace="{{args.namespace}}"
            }[5m]))
      # A "gray zone" between the two conditions prevents ambiguous behavior at boundary values.
      # The 1–2% range stays Inconclusive, leaving room for human judgment.
      successCondition: result[0] < 0.01    # error rate below 1% → success
      failureCondition: result[0] >= 0.02   # error rate 2% or above → failure

Rather than hardcoding namespace, it's extracted as an args parameter. This means no manual edits are needed when the namespace name changes or when multiple teams share the same template.

It's also worth noting that the thresholds for successCondition and failureCondition are intentionally set differently. Setting them to the same value leads to ambiguous behavior at the boundary (e.g., exactly 1%). During an incident response, encountering a state that's neither failure nor success can be genuinely confusing. A measurement that lands in the 1–2% range is recorded as Inconclusive. Because inconclusiveLimit defaults to 0, a single such measurement marks the AnalysisRun Inconclusive, the rollout pauses, and a person can step in to make the call.
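
The template only takes effect once a Rollout references it. A minimal sketch of that wiring, filling in part of the "rest of Rollout spec" left out in Step 3, could look like this; the weights and pause duration are illustrative assumptions rather than recommendations.

```yaml
# Sketch of the strategy section for rollout.yaml; weights and pause
# durations are illustrative assumptions, not recommendations.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-service
  namespace: my-team
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10
        # Inline analysis step: promotion waits until the AnalysisRun finishes
        - analysis:
            templates:
              - templateName: http-error-rate
            args:
              - name: namespace
                value: my-team
        - setWeight: 50
        - pause: {duration: 5m}
  # ... selector, template, and the rest of the Rollout spec
```

With interval: 5m and count: 3, the analysis step holds the canary at 10% for roughly ten minutes before the next setWeight runs, and a failed run aborts the Rollout, which is what fires on-rollout-aborted.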


Pros and Cons

Advantages

| Item | Details |
|---|---|
| Built-in integration | Slack, PagerDuty, GitHub, Teams, Webhook, and more supported natively without extra plugins |
| Notification as Code | ConfigMap-based approach integrates naturally into GitOps pipelines |
| Rich context | Since v1.7, the .analysisRuns array lets you include failed metrics and causes directly in messages |
| Dual-channel simultaneous delivery | A single trigger handles both Slack (team awareness) and PagerDuty (on-call escalation) |
| Per-team independent management | Annotation-based approach allows distributed subscription management per namespace and resource |

Disadvantages and Caveats

The issue I encountered most often through real-world operations was duplicate notifications. For high-frequency deployment services, I experienced three or four notifications firing back-to-back for a single rollback. In staging, where intentional failure tests are common, channels quickly filled with noise, and team members ended up ignoring the alerts. We resolved this by routing staging to a dedicated channel or adding filter conditions to triggers.

| Item | Details | Mitigation |
|---|---|---|
| Duplicate notifications | A bug causing on-rollout-aborted to fire duplicate notifications in certain situations has been reported (GitHub Discussion #1964) | Thoroughly validate in staging before applying to production. Separate staging into its own channel. |
| Namespace scope | By default the ConfigMap and Secret are read from the namespace where the controller runs (argo-rollouts in this article), not from the Rollout's namespace | Keep both in the controller's namespace; Rollouts in other namespaces such as my-team only need the subscription annotations |
| Notification delay | Notifications Controller processing lag may result in delivery 30 seconds to several minutes after the actual rollback | Set the first-stage wait time in PagerDuty escalation policy to 5 minutes or more |
| Template rendering errors | Go template syntax errors fail silently without sending any notification | Pre-validate with a dry-run using the kubectl argo rollouts notifications template notify command |
| Mixed PagerDuty integration types | V1 (Incidents API) and V2 (Events API) coexist, creating potential for confusion | Always use pagerdutyv2 for new setups |
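
The "filter conditions on triggers" mentioned above can be expressed with a when clause on the trigger. The fragment below is a minimal sketch that assumes staging Rollouts live in a namespace literally named staging; adjust the expression to however your environments are actually distinguished.

```yaml
# Fragment for the data: section of argo-rollouts-notification-configmap.
# Assumes staging Rollouts live in a namespace literally named "staging".
trigger.on-rollout-aborted: |
  - send: [rollout-aborted]
    when: rollout.metadata.namespace != 'staging'
```

Because the filter lives in the shared ConfigMap, staging Rollouts can keep the same annotations as production without paging anyone.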

The Most Common Mistakes in Practice

  1. Secret and ConfigMap are in different namespaces — The $slack-token reference fails silently, causing API calls to go out without a token and resulting in authentication errors. Check with kubectl get configmap,secret -n argo-rollouts to confirm both resources are in the same namespace.

  2. Typos in the trigger name within the annotation key — A typo like on-analysis-run-faild will silently do nothing without any error. It's safer to verify the trigger list with kubectl argo rollouts notifications trigger get before writing the annotation. If the kubectl argo rollouts plugin is not installed, refer to the official installation documentation first.

  3. PagerDuty serviceKey name and annotation channel name mismatch: If the ConfigMap defines serviceKeys.production but the annotation says pagerdutyv2: "prod", nothing will be sent. The key name under serviceKeys in service.pagerdutyv2 and the channel name in the annotation must match exactly. The service segment of the annotation key must likewise be pagerdutyv2, without a hyphen.


Closing Thoughts

With this pipeline in place, when a rollback occurs in the middle of the night, the on-call engineer can immediately see — via PagerDuty mobile notification and Slack message — exactly which service and which image had the problem. No more arriving at work in the morning and typing 'rollback' into the Slack search bar.

Here are three steps you can take right now:

  1. Deploy the ConfigMap and Secret first, then validate with a dry-run. After running kubectl apply -f argo-rollouts-notification-configmap.yaml, use the command below to render the template against a real Rollout. Without --recipient, the result is only printed to the console; with --recipient slack:deploy-failures, an actual Slack message is sent. The kubectl argo rollouts plugin must be installed beforehand.

    bash
    kubectl argo rollouts notifications template notify \
      analysis-run-failed {rollout-name} -n {namespace} \
      --recipient slack:deploy-failures

    If the message arrives in the channel without errors, the critical path is verified.

  2. Add annotations to a staging Rollout and reproduce a failure scenario — Setting failureLimit: 0 in a test AnalysisTemplate causes an immediate failure verdict on the first measurement, letting you quickly validate the notification pipeline.

  3. Wire up PagerDuty escalation policies and gradually apply to production Rollouts. Once staging validation is complete, it's safest to start adding the on-rollout-aborted.pagerdutyv2 annotation to lower-traffic services first. Given the notification delay mentioned earlier (30 seconds to several minutes), setting the first-stage wait time in the PagerDuty escalation policy to 5 minutes or more is also advisable.
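
For the failure reproduction in step 2, a throwaway AnalysisTemplate along these lines is one way to force a failed AnalysisRun on demand. The name always-fail is hypothetical, the Prometheus address is reused from Step 4, and the constant query is just a convenient trick; treat it as a test-only sketch and keep it out of production namespaces.

```yaml
# Test-only sketch: "always-fail" is a hypothetical name. The query returns
# a constant 1, so failureCondition is true on the first measurement.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: always-fail
  namespace: my-team   # assumption: use the namespace of your staging Rollout
spec:
  metrics:
    - name: forced-failure
      count: 1
      failureLimit: 0
      provider:
        prometheus:
          address: http://prometheus-server.monitoring.svc.cluster.local
          query: vector(1)
      failureCondition: result[0] == 1
```

Referencing this template from a staging Rollout's analysis step should produce a failed AnalysisRun and an abort, which exercises both the Slack and PagerDuty paths in one deployment.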


References

  • Argo Rollouts Notifications Official Documentation | Argo Rollouts
  • Slack Integration Official Documentation | Argo Rollouts
  • PagerDuty V2 Integration Official Documentation | Argo Rollouts
  • PagerDuty Integration Official Documentation | Argo Rollouts
  • Best Practices | Argo Rollouts
  • Analysis & Progressive Delivery | Argo Rollouts
  • notifications-install.yaml | GitHub argoproj/argo-rollouts
  • Argo Rollouts v1.7 Release Candidate | Argo Official Blog
  • Notifications for Argo | Argo Official Blog
  • Progressive Delivery with Argo Rollouts: Canary with Analysis | InfraCloud
  • Automating Rollbacks in Spring Boot with Argo Rollouts and Prometheus | Medium
  • Implementing Production-Grade Progressive Delivery with Automated SLO-Based Rollbacks | Medium
  • Stop Watching the Argo CD Dashboard: Set Up Slack and PagerDuty Notifications | burrell.tech
  • Argo Rollouts GitHub Releases | GitHub