Automating Kubernetes Canary Deployments with a Single PR Merge: An ArgoCD + Argo Rollouts Pipeline

Honestly, when I first introduced canary deployments, I was running deployment scripts by hand. I'd type kubectl set image in the terminal, post "canary is now at 5%" in Slack, check the console again 30 minutes later to review metrics, then manually bump the weight again. It was embarrassing to call it automation.

Once I started using ArgoCD and Argo Rollouts together, this flow changed completely. Now, from the moment a PR is merged, canary traffic shifting, Prometheus metric validation, and then either full promotion or automatic rollback all happen with almost no human intervention. By the end of this article, you'll be able to set up a canary pipeline built around four core YAML files: a Rollout, an AnalysisTemplate, an ArgoCD Application, and a GitHub Actions workflow.

This article does assume some familiarity with Kubernetes and hands-on experience writing Deployments and Services. Some PromQL examples appear along the way, but you don't need deep Prometheus knowledge to follow the overall flow.

Core Concepts

GitOps: PR Merge = Deployment Approval

The core idea of GitOps is simple. You treat the Git repository as the Single Source of Truth and declare all infrastructure and application state in Git. To deploy, you commit the change to Git. If you need an approval process, the PR review takes its place.

GitOps: "A methodology for declaring operational state as code and managing changes through Git." Instead of running deployment scripts, a Git commit becomes the deployment trigger.

Another benefit of this approach is that an audit trail is created automatically. Who deployed what, when, and why is all preserved in the Git history.

ArgoCD vs Argo Rollouts: They Have Different Roles

When you first encounter both tools, it's natural to ask, "Aren't they both Argo? What's the difference?" I was confused at first too.

Tool	Role	One-line Summary
ArgoCD	Git → cluster synchronization	Decides what to deploy
Argo Rollouts	Progressive traffic shifting + automatic promotion/rollback	Decides how to deploy

ArgoCD reconciles the cluster to match the state in Git whenever a change occurs. Argo Rollouts, on the other hand, shifts traffic to the new version progressively (5% → 25% → 50% in this article's example), checks metrics from a provider such as Prometheus along the way, and rolls back automatically if something goes wrong. The two tools split responsibilities across separate layers.

The Complete Canary Automation Flow

Here's the sequence of events from Git merge to deployment completion in practice:

text

Developer merges PR
  │
  ▼
GitHub Actions: build image → commit image tag to config repo
  │
  ▼
ArgoCD: detects config repo change → updates Rollout resource (sync)
  │
  ▼
Argo Rollouts controller: starts canary steps
  ├── Step 1: traffic 5% → wait 5 minutes → run AnalysisRun
  ├── Step 2: traffic 25% → wait 10 minutes
  ├── Step 3: traffic 50% → wait 10 minutes
  ├── If all steps pass: controller automatically promotes to stable
  └── If the AnalysisRun fails: canary weight immediately returns to 0% (automatic rollback)

Rollout, AnalysisTemplate, AnalysisRun

Understanding three CRDs (Custom Resource Definitions) provided by Argo Rollouts makes everything else easy to follow.

Rollout: A resource that replaces the standard Deployment. Canary steps and strategy are defined here.

AnalysisTemplate: A template that defines metric-based pass criteria, such as "success rate must be 95% or higher."

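AnalysisRun: An instance of an AnalysisTemplate executed at a specific deployment step. It ends in a phase such as Successful or Failed (Inconclusive and Error are also possible); this article shortens the two main outcomes to PASS and FAIL.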

Practical Implementation

Basic Rollout YAML Structure — Replacing a Deployment with a Canary Strategy

The first thing to do is replace your existing Deployment resource with a Rollout. The apiVersion and kind change, and you declare the canary steps in the strategy block.

yaml

# k8s/rollout.yaml (stored in the config repository)
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  replicas: 10
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: ghcr.io/myorg/my-app:1.0.0   # GitHub Actions automatically updates this value
  strategy:
    canary:
      canaryService: my-app-canary    # Service dedicated to canary
      stableService: my-app-stable    # Service dedicated to stable
      trafficRouting:
        nginx:
          stableIngress: my-app-ingress
      steps:
      - setWeight: 5           # Shift 5% of traffic to canary
      - pause: {duration: 5m}  # Observe for 5 minutes
      - analysis:              # Validate Prometheus metrics
          templates:
          - templateName: success-rate
          args:
          - name: app-name
            value: my-app
      - setWeight: 25
      - pause: {duration: 10m}
      - setWeight: 50
      - pause: {duration: 10m}
      # Once all steps pass, the controller automatically promotes to stable
      # You don't need to explicitly set setWeight: 100

The canaryService and stableService must be created separately ahead of time. Without these two Services, the controller will throw errors and the canary will not start. The minimum configuration looks like this:

yaml

# k8s/services.yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app-stable
spec:
  selector:
    app: my-app
  ports:
  - port: 80
    targetPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: my-app-canary
spec:
  selector:
    app: my-app
  ports:
  - port: 80
    targetPort: 8080

The selectors on both Services can be identical at first. During a rollout, Argo Rollouts adds a rollouts-pod-template-hash entry to each Service's selector so that the canary Service selects only canary Pods and the stable Service selects only stable Pods.
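
To make the selector rewriting concrete, here is a sketch of how my-app-canary might look in the middle of a rollout. The rollouts-pod-template-hash key is the label the controller manages; the hash value below is made up for illustration, and you never write it yourself.

```yaml
# Illustrative only: live state of my-app-canary during a rollout.
# Do not commit this to Git; the controller injects the extra selector key.
apiVersion: v1
kind: Service
metadata:
  name: my-app-canary
spec:
  selector:
    app: my-app
    rollouts-pod-template-hash: 6f7d8c9b5d   # hypothetical hash of the canary ReplicaSet
  ports:
  - port: 80
    targetPort: 8080
```

Running kubectl get service my-app-canary -o yaml during a deployment is a quick way to confirm that the controller has taken over the selector.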

Field	Role
`canaryService` / `stableService`	A pair of Services for traffic separation. They must be created in advance.
`trafficRouting.nginx`	Weight-based routing via Nginx Ingress. Can be swapped for Istio or Linkerd. Nginx adjusts weights at the Ingress Controller level, while Istio/Linkerd use a sidecar mesh approach that enables fine-grained header-based routing but increases operational complexity.
`setWeight`	The percentage of traffic (%) to send to canary Pods
`pause`	Wait time before advancing to the next step. Using `{}` (no duration) creates a manual approval gate.
`analysis`	Runs an AnalysisTemplate at this step. Triggers automatic rollback on FAIL.
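
The Rollout's stableIngress field points at an Ingress named my-app-ingress, which has to exist as well. A minimal sketch, assuming the NGINX Ingress Controller is installed with an IngressClass called nginx and using a hypothetical host name:

```yaml
# k8s/ingress.yaml (hypothetical file name)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app-ingress            # Must match strategy.canary.trafficRouting.nginx.stableIngress
spec:
  ingressClassName: nginx         # Assumption: your IngressClass may be named differently
  rules:
  - host: my-app.example.com      # Hypothetical host
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: my-app-stable   # The backend must be the stableService
            port:
              number: 80
```

With this in place, the controller creates a second, canary Ingress next to it and adjusts the NGINX canary weight annotation on that copy at each setWeight step.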

Using pause: {} as a manual approval gate is actually quite useful in practice. For major deployments, a team lead might sit at this gate, review metrics on the ArgoCD dashboard, and personally issue the kubectl argo rollouts promote my-app command to make the promotion decision.
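
As a sketch, a variation of the steps list with a manual gate after the first traffic shift could look like this. The weights and durations are placeholders rather than recommendations.

```yaml
# Fragment of strategy.canary in k8s/rollout.yaml
steps:
- setWeight: 5
- pause: {}                # No duration: the rollout waits here until someone promotes it
- setWeight: 50
- pause: {duration: 10m}   # Timed pause: advances on its own after 10 minutes
```

The rollout stays at 5% until kubectl argo rollouts promote my-app is run, then continues through the remaining steps automatically.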

Declaring Prometheus Analysis Conditions — Setting Automatic Rollback Criteria with AnalysisTemplate

Declaring metric analysis conditions in a separate template allows them to be reused across multiple Rollouts. Parameterizing the app name via args is standard practice for reusability.

yaml

# k8s/analysis-template.yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
  - name: app-name            # The app name is injected from the Rollout
  metrics:
  - name: success-rate
    interval: 1m              # Measure every 1 minute
    count: 5                  # Stop after 5 measurements so the inline analysis step can finish
    successCondition: result[0] >= 0.95   # Must pass with a success rate of 95% or higher
    failureLimit: 3           # Tolerate up to 3 failed measurements; one more fails the AnalysisRun
    provider:
      prometheus:
        address: http://prometheus:9090
        query: |
          sum(rate(http_requests_total{status!~"5.*",app="{{args.app-name}}"}[2m]))
          /
          sum(rate(http_requests_total{app="{{args.app-name}}"}[2m]))

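This lets you reuse the same AnalysisTemplate across multiple Rollouts (my-app, my-other-service, and so on) by simply changing the app name. Note that the query as written aggregates over every Pod carrying the app label, stable and canary alike; to judge the canary alone, you need a label that distinguishes canary Pods in your metrics.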

failureLimit: 3 is an important setting in practice. Momentary traffic spikes or measurement noise can temporarily push the rate below 95%, so tolerating up to three failed measurements significantly reduces the chance of a false positive. The AnalysisRun fails once that limit is exceeded, and the failures do not have to be consecutive. Early on, I set failureLimit: 1 and experienced rollbacks on perfectly healthy deployments, which was frustrating. The interval matters too: measuring every 30 seconds leaves each result vulnerable to noise because too few samples sit behind it, so starting at 1–2 minute intervals and adjusting based on traffic volume is the right approach.
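
The trade-off between these knobs is easier to see side by side. Below is a sketch of a stricter variant of the metric; the specific numbers are assumptions for illustration, not values from a real workload.

```yaml
# Fragment of spec.metrics in an AnalysisTemplate (illustrative values)
metrics:
- name: success-rate
  interval: 2m          # Longer interval: fewer, less noisy measurements
  count: 5              # Total measurements, so the run lasts roughly 8 to 10 minutes
  failureLimit: 1       # One bad measurement is tolerated; a second one fails the run
  successCondition: result[0] >= 0.95
  provider:
    prometheus:
      address: http://prometheus:9090
      query: |
        sum(rate(http_requests_total{status!~"5.*",app="{{args.app-name}}"}[2m]))
        /
        sum(rate(http_requests_total{app="{{args.app-name}}"}[2m]))
```

A lower failureLimit reacts faster to real regressions but rolls back more often on noise, so it pairs better with a longer interval than with a short one.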

Configuring ArgoCD Auto-Sync — Immediately Applying Git Changes to the Cluster

Declare which Git repository and which path ArgoCD should watch. The repository referenced here is my-app-config — a dedicated Kubernetes config repo, not the application code repo.

yaml

# argocd/application.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd
spec:
  project: default    # Required field; "default" is ArgoCD's built-in project
  source:
    repoURL: https://github.com/myorg/my-app-config   # Dedicated config repository
    targetRevision: HEAD
    path: k8s/    # The Rollout, Services, and AnalysisTemplate all live here
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true       # Resources deleted from Git are also deleted from the cluster
      selfHeal: true    # Automatically restores cluster state if it diverges from Git

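With syncPolicy.automated enabled, ArgoCD syncs the cluster automatically whenever it detects a change in the k8s/ path of the my-app-config repository. By default it polls Git roughly every three minutes; configuring a Git webhook makes the sync start almost immediately after a push.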

Why separate repositories? Splitting the app code repo (source code, Dockerfile) from the config repo (Kubernetes YAML) keeps the deployment history cleanly contained in the config repo's Git log. Application feature commits and deployment commits don't get mixed together, making rollbacks and audit trails much cleaner.

Automatically Updating the Image Tag from CI — Connecting GitHub Actions to the Config Repo

This is the step that builds the image in the CI pipeline and then automatically updates the image tag in the config repo. Once this commit is pushed to the my-app-config repo, ArgoCD detects it on its next poll (or right away via a webhook) and the Rollout begins.

yaml

# .github/workflows/deploy.yml (located in the app code repo)
name: Deploy
 
on:
  push:
    branches: [main]
 
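jobs:
  deploy:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write    # Lets GITHUB_TOKEN push images to ghcr.io
    steps:
      - uses: actions/checkout@v4

      - name: Log in to GHCR
        # Without a login, the docker push below is rejected as unauthenticated
        run: echo "${{ secrets.GITHUB_TOKEN }}" | docker login ghcr.io -u "${{ github.actor }}" --password-stdin

      - name: Build and push image
        run: |
          # Use the full image name including the registry address
          docker build -t ghcr.io/myorg/my-app:${{ github.sha }} .
          docker push ghcr.io/myorg/my-app:${{ github.sha }}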
 
      - name: Checkout config repo
        uses: actions/checkout@v4
        with:
          repository: myorg/my-app-config
          token: ${{ secrets.CONFIG_REPO_TOKEN }}
          path: config-repo
 
      - name: Update image tag in config repo
        run: |
          cd config-repo
          # Note: the sed replacement approach is an illustrative example for concept explanation.
          # In production, Kustomize's images field or Helm values updates are recommended.
          # The sed approach can break immediately if the image tag format changes.
          sed -i "s|image: ghcr.io/myorg/my-app:.*|image: ghcr.io/myorg/my-app:${{ github.sha }}|" k8s/rollout.yaml
          git config user.email "actions@github.com"
          git config user.name "GitHub Actions"
          git add k8s/rollout.yaml
          git commit -m "chore: deploy ${{ github.sha }}"
          # If another commit lands on main first, the push may fail.
          # In production, it's safer to retry after git pull --rebase
          # or to use the peter-evans/create-pull-request action.
          git push

The key to this flow is separating the app code repo (my-app) from the config repo (my-app-config). App code change → image build → update image tag in config repo → ArgoCD detects → Rollout begins. This separation keeps the deployment history cleanly in the config repo, and ArgoCD only needs to watch the config repo.
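
The workflow comments suggest Kustomize's images field as a sturdier alternative to sed. A minimal sketch of what that could look like in the config repo, assuming the three manifests from this article sit next to a kustomization.yaml (a file the original setup does not include):

```yaml
# k8s/kustomization.yaml (hypothetical addition to the config repo)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- rollout.yaml
- services.yaml
- analysis-template.yaml
images:
- name: ghcr.io/myorg/my-app   # Matches the image name used in rollout.yaml
  newTag: "1.0.0"              # CI rewrites only this value on each deploy
```

With this layout, CI can run kustomize edit set image ghcr.io/myorg/my-app:<new tag> inside k8s/ instead of pattern-matching on rollout.yaml, and ArgoCD renders the directory with Kustomize once it finds the kustomization.yaml.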

Pros and Cons

Advantages

Item	Details
Full GitOps consistency	Everything, including deployment strategy, is declared in Git. A PR review is a deployment review.
Metric-based automatic promotion	Prometheus, Datadog, New Relic, etc. can be used directly as deployment gates.
Automatic rollback	On AnalysisRun failure, canary weight is immediately restored to 0. No human monitoring required.
Fine-grained step control	Traffic percentages, wait times, and analysis conditions can be explicitly declared per step.
Manual approval gate	`pause: {}` for indefinite hold → for critical deployments, a human can make the promotion decision directly.
ArgoCD UI integration	Rollout progress, canary steps, and AnalysisRun results can be monitored in real time from the dashboard.

Disadvantages and Caveats

Item	Details	Mitigation
Deployment migration overhead	Existing `Deployment` resources must be replaced with `Rollout`. Existing HPA and monitoring integrations must also be reviewed.	Start with your lowest-traffic service, let the team get comfortable with the pattern, then migrate the rest.
Rollback not reflected in Git	When an automatic rollback occurs, the cluster state diverges from Git state. If ArgoCD attempts a re-sync, it may redeploy the same new version.	Set up Slack webhook alerts on rollback, and agree on a team process where the responsible person immediately reverts the image tag in the config repo to the previous value.
Increased operational complexity	You must operate ArgoCD + Argo Rollouts + Ingress Controller + Prometheus together.	Start with just traffic shifting (no AnalysisTemplate), then add the analysis step incrementally once operations feel comfortable.
Traffic routing infrastructure required	One of Nginx Ingress, Istio, or Linkerd must be configured in advance for weight-based routing.	Most teams already use Nginx Ingress, so you can leverage your existing stack as-is.
DB schema compatibility	Since canary and stable versions run simultaneously, both versions must be able to read the same DB schema.	Use the Expand/Contract pattern as a prerequisite — add new columns first, then remove old columns only after the previous version is fully gone.
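
For the rollback alert mentioned in the table, Argo Rollouts ships a notifications feature that is driven by annotations on the Rollout. The fragment below is a sketch under two assumptions: a Slack notification service has already been configured for the Argo Rollouts controller, and the on-rollout-aborted trigger is available in your installed version. The channel name is hypothetical.

```yaml
# Fragment of k8s/rollout.yaml: only metadata is shown
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
  annotations:
    # Assumption: Slack is configured as a notification service for Argo Rollouts.
    # "deploy-alerts" is a hypothetical channel name.
    notifications.argoproj.io/subscribe.on-rollout-aborted.slack: deploy-alerts
```

The alert only tells the team that a rollback happened. Reverting the image tag in the config repo remains a manual step in this setup.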

Expand/Contract Pattern: A DB migration approach where new columns or tables are added first (Expand) and both versions coexist, then the old schema is removed (Contract) once the previous version is gone. Necessary for maintaining backward compatibility when two versions run simultaneously, as in canary deployments.

The Most Common Mistakes in Practice

Applying the Rollout without creating canaryService and stableService first. Argo Rollouts requires both Services to be created in advance for traffic separation. Without them, the controller throws errors and the canary never starts. When I first set this up, I missed this and spent a long time stuck on ProgressDeadlineExceeded errors.
Setting the interval in AnalysisTemplate too short. Querying Prometheus at 30-second intervals yields insufficient samples and produces statistically meaningless results. Starting at 1–2 minute intervals and adjusting based on traffic volume is the right approach.
Leaving Git state untouched after an automatic rollback. Even after a rollback, the image tag in the config repo still points to the new version. If ArgoCD attempts a re-sync, it may redeploy the same new version. It's important to share rollback alert notifications and config repo recovery procedures with the team ahead of time.

Closing Thoughts

Combining ArgoCD and Argo Rollouts turns the simple act of a Git merge into a trigger that drives an entire safe, automated canary deployment. The initial setup takes time, but once it's up and running, you'll stop dreading weekend deployments, and the system will handle anomaly detection and rollbacks without you manually checking metrics in the middle of the night.

Three steps you can take right now:

Install the Argo Rollouts controller in your cluster. Create the namespace with kubectl create namespace argo-rollouts, then check the official documentation for the latest install.yaml and apply it. (Installation commands can change between versions, so checking the official docs for the latest version is recommended.) If you also install the kubectl plugin (kubectl argo rollouts), you can watch canary progress in real time with kubectl argo rollouts get rollout my-app --watch.
Pick an existing Deployment and replace it with a Rollout. Change the apiVersion and kind, then start with a simple 3-step flow in strategy.canary.steps: setWeight: 10 → pause: {duration: 2m} → automatic controller promotion. It's best to verify traffic shifting first, without an AnalysisTemplate.
Once your Prometheus metrics are ready, attach an AnalysisTemplate. Start with a lenient threshold like successCondition: result[0] >= 0.90, then gradually raise the bar as you gather operational data.
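
The starter flow from the second step can be sketched as the following strategy block. It deliberately leaves out canaryService, stableService, and trafficRouting, which is an assumption made here to keep the first experiment small.

```yaml
# Fragment of spec in a first Rollout (replaces the Deployment's strategy block)
strategy:
  canary:
    steps:
    - setWeight: 10            # About 10% of traffic goes to the new version
    - pause: {duration: 2m}    # Observe for 2 minutes
    # No further steps: the controller promotes the new version to stable
```

Without a traffic router, Argo Rollouts approximates the weight through the ratio of canary to stable Pods, so with 10 replicas a weight of 10 means a single canary Pod.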


