Using Flagger MetricTemplate CRD for automating Datadog and New Relic canary deployments
When deploying a new version in a Kubernetes environment, the most dreaded moment is having to answer, "Is this version actually safe?" Imagine starting a canary deployment at 2 AM, refreshing the Datadog dashboard, and discovering 30 minutes later that the error rate has skyrocketed, after thousands of users have already been affected. If your organization already uses an APM such as Datadog or New Relic, you can let the system make that judgment instead of a human. With Flagger's MetricTemplate CRD, you can wire the actual service metrics stored in your external APM into the automated decision criteria of canary analysis.
Flagger is the official progressive delivery operator of the Flux CD project. A similar tool is Argo Rollouts; however, while Argo Rollouts excels in granular manual control and phased approval, Flagger focuses on metric-based full automation and integration with GitOps workflows. This article covers practical examples of integrating with Datadog and New Relic using Flagger's MetricTemplate, reuse patterns utilizing templateVariables, and common mistakes that occur in production.
Without the need to build a separate new metric stack, you can directly connect the APM queries you are already using today as canary criteria.
Who is this guide for?: This guide is intended for platform and DevOps engineers in teams running Kubernetes clusters, equipped with an L7 traffic management layer such as Istio or NGINX Ingress, and under contract with Datadog or New Relic. For Flagger installation, refer to the Official Installation Guide.
Key Concepts
What is MetricTemplate CRD
MetricTemplate is a Kubernetes CRD that defines the questions Flagger will ask external metric providers during canary analysis. It is a declarative specification stating, "Connect to Datadog and execute this query, and if the result falls outside this criterion, stop the deployment," and the analysis criteria are completed by the Canary object referencing it.
Portion of traffic → canary Pod
  → Flagger runs the MetricTemplate query every interval
  → returned value within thresholdRange → weight increases
  → maximum weight reached → automatic promotion
  → threshold violation count exceeded → automatic rollback

MetricTemplate has three core operating principles.
| Principle | Explanation |
|---|---|
| Returns a single float64 | The query must return a single number, and this value is compared to thresholdRange |
| Template Variable Auto-Injection | {{ namespace }}, {{ target }}, {{ interval }}, etc. are automatically substituted in the canary context |
| Namespace Sharing | Multiple teams and services can reference a single MetricTemplate, unifying analysis criteria across the organization |
Progressive Delivery: A deployment strategy that exposes a new version to a portion of traffic first, rather than the entire audience, and then gradually scales it up while verifying safety based on metrics. Canary deployment is a representative implementation method.
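The three principles above can be sketched as a small decision loop. This is an illustrative Python sketch of the described behavior, not Flagger source code: each interval, every metric query reduces to a single pass/fail, a pass advances the traffic weight, and repeated failures trigger rollback. The function name and signature are my own, not Flagger's.

```python
def analysis_tick(metric_passed: bool, weight: int, failures: int,
                  step_weight: int = 10, max_weight: int = 50, threshold: int = 5):
    """One canary analysis interval: returns (new_weight, new_failures, decision)."""
    if metric_passed:
        weight = min(weight + step_weight, max_weight)
        decision = "promote" if weight >= max_weight else "advance"
        return weight, 0, decision  # a success resets the consecutive-failure count
    failures += 1
    decision = "rollback" if failures >= threshold else "hold"
    return weight, failures, decision

# Five successful checks walk the weight 10 -> 20 -> 30 -> 40 -> 50, then promote.
w, f = 0, 0
for _ in range(5):
    w, f, decision = analysis_tick(True, w, f)
print(w, decision)  # 50 promote
```

The key point this illustrates: the external APM only has to answer one question per interval with a single number; everything else is deterministic state-machine logic inside the operator.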
Overall Architecture Structure
For MetricTemplate to actually function, a layer for splitting traffic is required. Flagger itself handles only the decision logic, while the actual traffic routing is performed by tools such as Istio, Linkerd, and NGINX.
[Developer] → Git commit → [Flux CD] → applied to Kubernetes
                                ↓
                        [Flagger operator]
                         ↙            ↘
                [Istio/NGINX]    [MetricTemplate]
                traffic split    runs APM queries
                     ↓                 ↓
                canary Pod      Datadog / New Relic
                         ↘            ↙
                  decision: promote or roll back

Service Mesh: An infrastructure layer, implemented by tools such as Istio and Linkerd, that intercepts traffic between Pods to control routing, observability, and security. Flagger's canary analysis relies on this layer's traffic-splitting capability.
Practical Application
We cover three examples. Example 1 explains how to analyze mesh layer metrics using Datadog in an environment using an Istio service mesh, and Example 2 explains how to analyze ingress layer metrics using New Relic in an environment using NGINX Ingress. The reasons why these two APMs are specialized for different layers are explained between the examples. Example 3 is a pattern for reusing a single MetricTemplate across multiple services. You may start with the example that best suits your environment.
Example 1: Datadog Integration — Istio Mesh Layer 404 Error Rate-Based Automatic Rollback
Prerequisite: This example is based on an environment using the Istio Service Mesh. The istio.mesh.request.count metric is a mesh layer metric collected by Istio and does not work in an NGINX Ingress environment. If you are using NGINX, please refer to Example 2.
Datadog automatically collects mesh layer metrics generated by Istio sidecar proxies. Because the error rate measured at this layer is accurately attributed to the specific service's canary Pod, it provides high reliability for deployment decisions.
Step 1 — Register API Key as Kubernetes Secret
# Use the kubectl create secret command (no risk of exposing plaintext keys in YAML)
kubectl create secret generic datadog \
  --namespace istio-system \
  --from-literal=datadog_api_key=YOUR_ACTUAL_API_KEY \
  --from-literal=datadog_application_key=YOUR_ACTUAL_APP_KEY

Alternatively, if you manage the Secret directly as YAML, use values encoded with the echo -n "actual-key-value" | base64 command.
apiVersion: v1
kind: Secret
metadata:
  name: datadog
  namespace: istio-system
type: Opaque
data:
  datadog_api_key: <base64-encoded-api-key>          # echo -n "key-value" | base64
  datadog_application_key: <base64-encoded-app-key>

Step 2 — Define MetricTemplate
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: not-found-percentage
  namespace: istio-system
spec:
  provider:
    type: datadog
    address: https://api.datadoghq.com
    secretRef:
      name: datadog
  query: |
    100 - (
      sum:istio.mesh.request.count{
        reporter:destination,
        destination_workload_namespace:{{ namespace }},
        destination_workload:{{ target }},
        !response_code:404
      }.as_count()
      /
      sum:istio.mesh.request.count{
        reporter:destination,
        destination_workload_namespace:{{ namespace }},
        destination_workload:{{ target }}
      }.as_count()
    ) * 100

| Query Element | Role |
|---|---|
| {{ namespace }} | Automatically replaced with the namespace containing the Canary |
| {{ target }} | Automatically replaced with the canary target workload name |
| !response_code:404 | Counts only non-404 requests (valid requests) |
| 100 - (non-404/total * 100) | Ultimately returns the 404 error rate (%) |
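The arithmetic the query performs is straightforward. A minimal Python sketch of the same computation, using made-up request counts (the counts are hypothetical, chosen only to illustrate the formula):

```python
def not_found_percentage(non_404_count: int, total_count: int) -> float:
    """404 error rate (%) = 100 - (non-404 requests / total requests * 100)."""
    return 100 - (non_404_count / total_count * 100)

# 970 of 1000 requests were not 404s, so the 404 error rate is 3%.
print(round(not_found_percentage(970, 1000), 2))  # 3.0, passes a thresholdRange of max: 5
```

Since the query returns a single float, this is exactly the kind of value Flagger compares against thresholdRange each interval.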
Step 3 — Reference the MetricTemplate from the Canary
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: my-app
  namespace: production
spec:
  # Required fields such as targetRef and the service section are omitted; see the official docs for the full spec
  analysis:
    interval: 1m      # run the MetricTemplate query every minute
    threshold: 5      # trigger rollback after 5 consecutive failed analyses
    maxWeight: 50     # canary traffic grows up to 50%
    stepWeight: 10    # traffic increases by 10% per successful analysis
    metrics:
      - name: "not-found-percentage"
        templateRef:
          name: not-found-percentage
          namespace: istio-system   # reference a MetricTemplate in another namespace
        thresholdRange:
          max: 5                    # analysis fails if the 404 error rate exceeds 5%
        interval: 1m

YAML Value Selection Criteria: threshold: 5 means "roll back only after 5 consecutive failures, so that a temporary spike is not mistaken for a real regression." For a payment service where stability is critical, you can lower it to threshold: 3; for an experimental feature, you can raise it to threshold: 7. stepWeight: 10 with maxWeight: 50 yields 5 verification steps, increasing traffic by 10% at each interval up to 50%. If deployment speed matters more, adjust to stepWeight: 20, but be aware of the trade-off: fewer verification steps.
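The stepWeight trade-off described above is simple arithmetic, and it helps to compute it explicitly when tuning. A small sketch (the function is my own, for illustration only):

```python
def rollout_profile(step_weight: int, max_weight: int, interval_minutes: int):
    """Number of verification steps and minimum time to promotion for a config."""
    steps = max_weight // step_weight
    return steps, steps * interval_minutes

print(rollout_profile(10, 50, 1))  # (5, 5): 5 verification steps, about 5 minutes to promotion
print(rollout_profile(20, 50, 1))  # (2, 2): faster, but only 2 verification steps
```

Doubling stepWeight roughly halves the rollout time, but it also halves the number of chances the analysis has to catch a regression before full promotion.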
thresholdRange: If only max is specified, it is considered normal if the value is less than or equal to this value; if only min is specified, it is considered normal if the value is greater than or equal to this value; and if both are specified, it is considered normal if the value is within the range. It is common to use min for the success rate and max for the error rate.
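The min/max semantics described above can be captured in a few lines. This is an assumed sketch of the comparison rule as the text describes it, not Flagger's actual implementation:

```python
def passes_threshold_range(value, min=None, max=None):
    """thresholdRange semantics: max-only caps the value, min-only sets a floor,
    and specifying both defines an acceptable band."""
    if min is not None and value < min:
        return False
    if max is not None and value > max:
        return False
    return True

# error rate with max-only: lower is better
assert passes_threshold_range(3.2, max=5)          # 3.2% error rate is acceptable
# success rate with min-only: higher is better
assert passes_threshold_range(99.4, min=99)        # 99.4% success rate is acceptable
assert not passes_threshold_range(98.1, min=99)    # below the floor, analysis fails
```

This is why the common convention is min for success rates and max for error rates: each metric direction needs only the one bound that matters.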
Cross-Namespace References and RBAC: If templateRef references a MetricTemplate from another namespace, the Flagger operator's ServiceAccount may require a ClusterRole capable of reading the MetricTemplate and Secret from that namespace. If cross-namespace access fails, first check the Flagger pod logs for RBAC-related errors.
How to check operation
# Check the Canary object's status
kubectl describe canary my-app -n production

# Check the analysis-related event stream
kubectl get events -n production --field-selector reason=Synced

# Check MetricTemplate query results in the Flagger operator logs
kubectl logs -n flagger-system deploy/flagger -f | grep "not-found-percentage"

Example 2: New Relic Integration — NGINX Ingress Layer HTTP 5xx Error Rate Analysis
While Datadog excels in the mesh layer (inter-service traffic), New Relic is well-suited for scenarios measuring the ingress layer (client → cluster entry point) thanks to its flexible query language called NRQL. It is a more natural choice for teams using NGINX Ingress.
NRQL Query Validation Recommendation: The query below uses the FROM Metric WHERE metricName = 'nginx_ingress_controller_requests' format. Depending on the environment, directly specifying the event type in the FROM nginx_ingress_controller_requests format may be more common. It is strongly recommended that you execute the query directly in the New Relic Query Builder to verify that values within the expected range are returned before applying it to the MetricTemplate.
Step 1 — Register New Relic Credential Secret
kubectl create secret generic newrelic \
  --namespace ingress-nginx \
  --from-literal=newrelic_account_id=YOUR_ACCOUNT_ID \
  --from-literal=newrelic_query_key=YOUR_QUERY_KEY

Step 2 — Define NRQL MetricTemplate
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: newrelic-error-rate
  namespace: ingress-nginx
spec:
  provider:
    type: newrelic
    secretRef:
      name: newrelic
  query: |
    SELECT
      filter(sum(nginx_ingress_controller_requests), WHERE status >= '500')
      / sum(nginx_ingress_controller_requests) * 100
    FROM Metric
    WHERE metricName = 'nginx_ingress_controller_requests'
      AND ingress = '{{ ingress }}'
      AND namespace = '{{ namespace }}'

Here, {{ ingress }} is a user-defined variable, not a standard Flagger built-in variable. Its value is injected by each service through templateVariables in the next step.
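Conceptually, the substitution is plain string templating. A simplified, assumed sketch of how a Flagger-style renderer would turn the template above into a concrete per-service query (the function is illustrative, not Flagger's code):

```python
def render_query(template: str, variables: dict) -> str:
    """Replace each {{ name }} placeholder with its value from the variables map."""
    for name, value in variables.items():
        template = template.replace("{{ " + name + " }}", value)
    return template

nrql = "... AND ingress = '{{ ingress }}' AND namespace = '{{ namespace }}'"
print(render_query(nrql, {"ingress": "service-a-ingress", "namespace": "production"}))
# ... AND ingress = 'service-a-ingress' AND namespace = 'production'
```

Built-in variables like {{ namespace }} are filled from the canary context, while user-defined ones like {{ ingress }} come from templateVariables; both end up substituted the same way before the query is sent to the provider.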
Example 3: Reusing a single template across multiple services with templateVariables
templateVariables is a feature available in Flagger v1.12.0 and later that allows the same MetricTemplate to be parameterized and shared across different services. Below, two different services reference the newrelic-error-rate template defined above with different ingress names.
# Canary for service A
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: service-a
  namespace: production
spec:
  analysis:
    metrics:
      - name: "newrelic-error-rate"
        templateRef:
          name: newrelic-error-rate
          namespace: ingress-nginx
        templateVariables:
          ingress: service-a-ingress   # inject service A's ingress name
        thresholdRange:
          max: 1                       # fail if the 5xx error rate exceeds 1%
        interval: 2m

# Canary for service B — same template, different variables and threshold
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: service-b
  namespace: production
spec:
  analysis:
    metrics:
      - name: "newrelic-error-rate"
        templateRef:
          name: newrelic-error-rate
          namespace: ingress-nginx
        templateVariables:
          ingress: service-b-ingress   # inject service B's ingress name
        thresholdRange:
          max: 2                       # service B sets a different threshold
        interval: 2m

The practical value of this pattern is significant. Once the platform team defines the MetricTemplate in a central namespace, each service team adheres to the organization's standard analysis criteria while adjusting only templateVariables and thresholdRange. When the query logic needs to change, modifying the single MetricTemplate immediately propagates the change to every service that references it.
Pros and Cons Analysis
Advantages
| Item | Content |
|---|---|
| Immediately Leverage Existing APM | Directly connect Datadog and New Relic, already contracted and in operation, as deployment criteria — No need to build a separate metric stack |
| Organizational Standardization | Placing MetricTemplate in a shared namespace unifies analysis standards across teams and services — deployment standards become code, not verbal conventions |
| Fully Automated | Automatic promotion and rollback decisions based on metrics without human monitoring the dashboard — Reduces the burden of nighttime deployments |
| GitOps Friendly | Both MetricTemplate and Canary specifications are managed in YAML, so PR reviews, change history, and rollbacks are handled within the Git workflow |
| Multiple Metrics AND Conditions | Promotion proceeds only if multiple criteria, such as Error Rate + P99 Latency + Business Conversion Rate, are simultaneously satisfied |
| Flux CD Ecosystem Integration | Configure a fully automated pipeline where Flagger automatically starts canary analysis when Flux detects Git changes and applies them to Kubernetes |
Disadvantages and Precautions
| Item | Content | Response Plan |
|---|---|---|
| Simple Threshold Comparison | Possible misjudgment as thresholds are compared without statistical reliability when traffic is low | Induce sufficient traffic in the canary and set interval to a long value to accumulate sufficient data |
| L7 Traffic Splitting is Essential | Canary itself is impossible without a service mesh or ingress such as Istio, Linkerd, or NGINX | Build the traffic management layer first, then introduce Flagger |
| APM API Call Costs | Datadog and New Relic API queries execute every interval, so costs accumulate with short intervals | Do not set interval too short (minimum 1m recommended) |
| Late Discovery of Query Errors | MetricTemplate query errors surface only at canary analysis time | Validate queries in the APM console first, then apply |
| Cross-Namespace Secret Restriction | secretRef can only reference the same namespace by default | Place MetricTemplate and Secret in the same namespace |
| Cold Start Misanalysis | Analysis reliability drops due to insufficient metric data in the initial canary stage | Stabilize the Pod via progressDeadlineSeconds and canaryReadyThreshold, and skip early analyses with spec.analysis.iterations |
Cold Start: When a canary Pod is just launched, there is almost no metric data in the APM. If analysis is started in this state, a normal service may be misidentified as abnormal. In practice, it is recommended to skip the initial analysis using spec.analysis.iterations or configure it so that analysis starts after checking the Pod's readiness status by setting canaryReadyThreshold.
The Most Common Mistakes in Practice
- Skipping query pre-validation: query syntax errors or empty results are discovered only after deploying the MetricTemplate and running an actual canary analysis. First verify that the query returns values in the expected range by running it directly in Datadog Metrics Explorer or New Relic Query Builder. Flagger flags query errors as "Unable to read metric value," which can lead to an immediate rollback.
- Relying on a single metric: judging solely on the error rate misses cases where there are no errors but P99 latency skyrockets. If you register both the error rate and P99 latency in the analysis.metrics array, promotion proceeds only if both criteria are met. It is safest to add a business conversion-rate metric as well, if possible.
- Overly lenient initial threshold: setting a loose threshold like max: 20 with the mindset of "let's deploy quickly first" means rollback is not triggered during a real failure. The correct order is to start conservatively, such as max: 1, and relax gradually as operational data accumulates.
In Conclusion
Flagger MetricTemplate CRD is the most direct way to turn existing APM investments into a deployment safety net. Introducing this pattern reduces nightly deployment alerts and replaces the time spent "watching for 30 minutes to see if a deployment is safe" with automated analysis. Deployment approval cycles shorten, and with the platform team managing the central MetricTemplate, each service team can focus solely on threshold tuning.
3 Steps to Start Right Now:
- Verify the query in the APM console first: with Datadog, write a query in Metrics Explorer that returns the canary Pod's error rate; with New Relic, use Query Builder. Confirm the values fall in the expected range — this step is essential before applying the MetricTemplate.
- Deploy the MetricTemplate standalone, then connect the pipeline: apply the template first with kubectl apply -f metric-template.yaml, then transition gradually by adding only a templateRef to the existing Canary object. Watch analysis results in real time with kubectl describe canary <name>.
- Reinforce the safety net with multiple metric combinations: once the error-rate MetricTemplate operates successfully, add a P99 latency MetricTemplate and register it alongside in the analysis.metrics array, completing a layered safety net where promotion proceeds only when both criteria are satisfied.
Next Post: Setting up HTTP Header-Based A/B Routing YAML in a Flagger + Istio Environment, and Integrating Business Metrics Like Conversion Rate and Session Length as Analysis Criteria Using New Relic NRQL
Reference Materials
- Flagger Official Documentation - Metrics Analysis | flagger.app
- Flux CD - Flagger Metrics Analysis | fluxcd.io
- fluxcd/flagger GitHub - metrics.md source
- fluxcd/flagger GitHub - CRD definition
- How to Configure Flagger Metrics Analysis with Datadog | OneUptime
- Canary analysis metrics templating - Issue #418 | GitHub
- MetricTemplate secretRef cross-namespace - Issue #716 | GitHub
- Canary deployments with Amazon Managed Prometheus and Flagger | AWS Open Source Blog
- Mastering Progressive Delivery with Istio and Flagger | Medium
- Progressive Delivery: Argo Rollouts vs Flagger | Medium