Using Flagger MetricTemplate CRD for automating Datadog and New Relic canary deployments
When deploying a new version in a Kubernetes environment, the most dreaded moment is having to answer, "Is this version actually safe?" Imagine starting a canary deployment at 2 AM, refreshing the Datadog dashboard, and discovering 30 minutes later that the error rate has skyrocketed, after thousands of users have already been affected. If your organization already uses an APM such as Datadog or New Relic, you can let the system make that judgment instead of a human. With Flagger's MetricTemplate CRD, you can wire the actual service metrics stored in your external APM into the automated decision criteria of canary analysis.
Flagger is the official progressive delivery operator of the Flux CD project. A similar tool is Argo Rollouts; however, while Argo Rollouts excels in granular manual control and phased approval, Flagger focuses on metric-based full automation and integration with GitOps workflows. This article covers practical examples of integrating with Datadog and New Relic using Flagger's MetricTemplate, reuse patterns utilizing templateVariables, and common mistakes that occur in production.
Without the need to build a separate new metric stack, you can directly connect the APM queries you are already using today as canary criteria.
Who is this guide for?: This guide is intended for platform and DevOps engineers in teams running Kubernetes clusters, equipped with an L7 traffic management layer such as Istio or NGINX Ingress, and under contract with Datadog or New Relic. For Flagger installation, refer to the Official Installation Guide.
Key Concepts
What is MetricTemplate CRD
MetricTemplate is a Kubernetes CRD that defines the questions Flagger will ask external metric providers during canary analysis. It is a declarative specification stating, "Connect to Datadog and execute this query, and if the result falls outside this criterion, stop the deployment," and the analysis criteria are completed by the Canary object referencing it.
Portion of traffic → canary Pod
  → Flagger runs the MetricTemplate query every interval
  → returned value within thresholdRange → weight increases
  → maximum weight reached → automatic promotion
  → threshold violation count exceeded → automatic rollback

MetricTemplate has three core operating principles.
| Principle | Explanation |
|---|---|
| Returns a single float64 | The query must return a single number, and this value is compared to thresholdRange |
| Template Variable Auto-Injection | {{ namespace }}, {{ target }}, {{ interval }}, etc. are automatically substituted in the canary context |
| Namespace Sharing | Multiple teams and services can reference a single MetricTemplate, unifying analysis criteria across the organization |
Progressive Delivery: A deployment strategy that exposes a new version to a portion of traffic first, rather than the entire audience, and then gradually scales it up while verifying safety based on metrics. Canary deployment is a representative implementation method.
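The three principles above can be sketched as a small decision loop. This is an illustrative Python sketch of the described behavior, not Flagger source code: each interval, every metric query reduces to a single pass/fail, a pass advances the traffic weight, and repeated failures trigger rollback. The function name and signature are my own, not Flagger's.

```python
def analysis_tick(metric_passed: bool, weight: int, failures: int,
                  step_weight: int = 10, max_weight: int = 50, threshold: int = 5):
    """One canary analysis interval: returns (new_weight, new_failures, decision)."""
    if metric_passed:
        weight = min(weight + step_weight, max_weight)
        decision = "promote" if weight >= max_weight else "advance"
        return weight, 0, decision  # a success resets the consecutive-failure count
    failures += 1
    decision = "rollback" if failures >= threshold else "hold"
    return weight, failures, decision

# Five successful checks walk the weight 10 -> 20 -> 30 -> 40 -> 50, then promote.
w, f = 0, 0
for _ in range(5):
    w, f, decision = analysis_tick(True, w, f)
print(w, decision)  # 50 promote
```

The key point this illustrates: the external APM only has to answer one question per interval with a single number; everything else is deterministic state-machine logic inside the operator.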
Overall Architecture Structure
For MetricTemplate to actually function, a layer for splitting traffic is required. Flagger itself handles only the decision logic, while the actual traffic routing is performed by tools such as Istio, Linkerd, and NGINX.
[Developer] → Git commit → [Flux CD] → applied to Kubernetes
                                ↓
                        [Flagger operator]
                         ↙            ↘
                [Istio/NGINX]    [MetricTemplate]
                traffic split    runs APM queries
                     ↓                 ↓
                canary Pod      Datadog / New Relic
                         ↘            ↙
                  decision: promote or roll back

Service Mesh: An infrastructure layer, implemented by tools such as Istio and Linkerd, that intercepts traffic between Pods to control routing, observability, and security. Flagger's canary analysis relies on this layer's traffic-splitting capability.
Practical Application
We cover three examples. Example 1 explains how to analyze mesh layer metrics using Datadog in an environment using an Istio service mesh, and Example 2 explains how to analyze ingress layer metrics using New Relic in an environment using NGINX Ingress. The reasons why these two APMs are specialized for different layers are explained between the examples. Example 3 is a pattern for reusing a single MetricTemplate across multiple services. You may start with the example that best suits your environment.
Example 1: Datadog Integration — Istio Mesh Layer 404 Error Rate-Based Automatic Rollback
Prerequisite: This example is based on an environment using the Istio Service Mesh. The istio.mesh.request.count metric is a mesh layer metric collected by Istio and does not work in an NGINX Ingress environment. If you are using NGINX, please refer to Example 2.
Datadog automatically collects mesh layer metrics generated by Istio sidecar proxies. Because the error rate measured at this layer is accurately attributed to the specific service's canary Pod, it provides high reliability for deployment decisions.
Step 1 — Register API Key as Kubernetes Secret
# Use the kubectl create secret command (no risk of exposing plaintext keys in YAML)
kubectl create secret generic datadog \
  --namespace istio-system \
  --from-literal=datadog_api_key=YOUR_ACTUAL_API_KEY \
  --from-literal=datadog_application_key=YOUR_ACTUAL_APP_KEY

Alternatively, if you manage the Secret directly as YAML, use values encoded with the echo -n "actual-key-value" | base64 command.
apiVersion: v1
kind: Secret
metadata:
  name: datadog
  namespace: istio-system
type: Opaque
data:
  datadog_api_key: <base64-encoded-api-key>          # echo -n "key-value" | base64
  datadog_application_key: <base64-encoded-app-key>

Step 2 — Define MetricTemplate
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: not-found-percentage
  namespace: istio-system
spec:
  provider:
    type: datadog
    address: https://api.datadoghq.com
    secretRef:
      name: datadog
  query: |
    100 - (
      sum:istio.mesh.request.count{
        reporter:destination,
        destination_workload_namespace:{{ namespace }},
        destination_workload:{{ target }},
        !response_code:404
      }.as_count()
      /
      sum:istio.mesh.request.count{
        reporter:destination,
        destination_workload_namespace:{{ namespace }},
        destination_workload:{{ target }}
      }.as_count()
    ) * 100

| Query Element | Role |
|---|---|
| {{ namespace }} | Automatically replaced with the namespace containing the Canary |
| {{ target }} | Automatically replaced with the canary target workload name |
| !response_code:404 | Counts only non-404 requests (valid requests) |
| 100 - (non-404/total * 100) | Ultimately returns the 404 error rate (%) |
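The arithmetic the query performs is straightforward. A minimal Python sketch of the same computation, using made-up request counts (the counts are hypothetical, chosen only to illustrate the formula):

```python
def not_found_percentage(non_404_count: int, total_count: int) -> float:
    """404 error rate (%) = 100 - (non-404 requests / total requests * 100)."""
    return 100 - (non_404_count / total_count * 100)

# 970 of 1000 requests were not 404s, so the 404 error rate is 3%.
print(round(not_found_percentage(970, 1000), 2))  # 3.0, passes a thresholdRange of max: 5
```

Since the query returns a single float, this is exactly the kind of value Flagger compares against thresholdRange each interval.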
Step 3 — Reference the MetricTemplate from the Canary
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: my-app
  namespace: production
spec:
  # Required fields such as targetRef and the service section are omitted; see the official docs for the full spec
  analysis:
    interval: 1m      # run the MetricTemplate query every minute
    threshold: 5      # trigger rollback after 5 consecutive failed analyses
    maxWeight: 50     # canary traffic grows up to 50%
    stepWeight: 10    # traffic increases by 10% per successful analysis
    metrics:
      - name: "not-found-percentage"
        templateRef:
          name: not-found-percentage
          namespace: istio-system   # reference a MetricTemplate in another namespace
        thresholdRange:
          max: 5                    # analysis fails if the 404 error rate exceeds 5%
        interval: 1m

YAML Value Selection Criteria: threshold: 5 means "roll back only after 5 consecutive failures, so that a temporary spike is not mistaken for a real regression." For a payment service where stability is critical, you can lower it to threshold: 3; for an experimental feature, you can raise it to threshold: 7. stepWeight: 10 with maxWeight: 50 yields 5 verification steps, increasing traffic by 10% at each interval up to 50%. If deployment speed matters more, adjust to stepWeight: 20, but be aware of the trade-off: fewer verification steps.
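The stepWeight trade-off described above is simple arithmetic, and it helps to compute it explicitly when tuning. A small sketch (the function is my own, for illustration only):

```python
def rollout_profile(step_weight: int, max_weight: int, interval_minutes: int):
    """Number of verification steps and minimum time to promotion for a config."""
    steps = max_weight // step_weight
    return steps, steps * interval_minutes

print(rollout_profile(10, 50, 1))  # (5, 5): 5 verification steps, about 5 minutes to promotion
print(rollout_profile(20, 50, 1))  # (2, 2): faster, but only 2 verification steps
```

Doubling stepWeight roughly halves the rollout time, but it also halves the number of chances the analysis has to catch a regression before full promotion.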
thresholdRange: If only max is specified, it is considered normal if the value is less than or equal to this value; if only min is specified, it is considered normal if the value is greater than or equal to this value; and if both are specified, it is considered normal if the value is within the range. It is common to use min for the success rate and max for the error rate.
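The min/max semantics described above can be captured in a few lines. This is an assumed sketch of the comparison rule as the text describes it, not Flagger's actual implementation:

```python
def passes_threshold_range(value, min=None, max=None):
    """thresholdRange semantics: max-only caps the value, min-only sets a floor,
    and specifying both defines an acceptable band."""
    if min is not None and value < min:
        return False
    if max is not None and value > max:
        return False
    return True

# error rate with max-only: lower is better
assert passes_threshold_range(3.2, max=5)          # 3.2% error rate is acceptable
# success rate with min-only: higher is better
assert passes_threshold_range(99.4, min=99)        # 99.4% success rate is acceptable
assert not passes_threshold_range(98.1, min=99)    # below the floor, analysis fails
```

This is why the common convention is min for success rates and max for error rates: each metric direction needs only the one bound that matters.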
Cross-Namespace References and RBAC: If templateRef references a MetricTemplate from another namespace, the Flagger operator's ServiceAccount may require a ClusterRole capable of reading the MetricTemplate and Secret from that namespace. If cross-namespace access fails, first check the Flagger pod logs for RBAC-related errors.
How to check operation
# Check the Canary object's status
kubectl describe canary my-app -n production

# Check the analysis-related event stream
kubectl get events -n production --field-selector reason=Synced

# Check MetricTemplate query results in the Flagger operator logs
kubectl logs -n flagger-system deploy/flagger -f | grep "not-found-percentage"

Example 2: New Relic Integration — NGINX Ingress Layer HTTP 5xx Error Rate Analysis
While Datadog excels in the mesh layer (inter-service traffic), New Relic is well-suited for scenarios measuring the ingress layer (client → cluster entry point) thanks to its flexible query language called NRQL. It is a more natural choice for teams using NGINX Ingress.
NRQL Query Validation Recommendation: The query below uses the FROM Metric WHERE metricName = 'nginx_ingress_controller_requests' format. Depending on the environment, directly specifying the event type in the FROM nginx_ingress_controller_requests format may be more common. It is strongly recommended that you execute the query directly in the New Relic Query Builder to verify that values within the expected range are returned before applying it to the MetricTemplate.
Step 1 — Register New Relic Credential Secret
kubectl create secret generic newrelic \
  --namespace ingress-nginx \
  --from-literal=newrelic_account_id=YOUR_ACCOUNT_ID \
  --from-literal=newrelic_query_key=YOUR_QUERY_KEY

Step 2 — Define NRQL MetricTemplate
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: newrelic-error-rate
  namespace: ingress-nginx
spec:
  provider:
    type: newrelic
    secretRef:
      name: newrelic
  query: |
    SELECT
      filter(sum(nginx_ingress_controller_requests), WHERE status >= '500')
      / sum(nginx_ingress_controller_requests) * 100
    FROM Metric
    WHERE metricName = 'nginx_ingress_controller_requests'
      AND ingress = '{{ ingress }}'
      AND namespace = '{{ namespace }}'

Here, {{ ingress }} is a user-defined variable, not a standard Flagger built-in variable. Its value is injected by each service through templateVariables in the next step.
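Conceptually, the substitution is plain string templating. A simplified, assumed sketch of how a Flagger-style renderer would turn the template above into a concrete per-service query (the function is illustrative, not Flagger's code):

```python
def render_query(template: str, variables: dict) -> str:
    """Replace each {{ name }} placeholder with its value from the variables map."""
    for name, value in variables.items():
        template = template.replace("{{ " + name + " }}", value)
    return template

nrql = "... AND ingress = '{{ ingress }}' AND namespace = '{{ namespace }}'"
print(render_query(nrql, {"ingress": "service-a-ingress", "namespace": "production"}))
# ... AND ingress = 'service-a-ingress' AND namespace = 'production'
```

Built-in variables like {{ namespace }} are filled from the canary context, while user-defined ones like {{ ingress }} come from templateVariables; both end up substituted the same way before the query is sent to the provider.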
Example 3: Reusing a single template across multiple services with templateVariables
templateVariables is a feature available in Flagger v1.12.0 and later that allows the same MetricTemplate to be parameterized and shared across different services. Below, two different services reference the newrelic-error-rate template defined above with different ingress names.
# Canary for service A
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: service-a
  namespace: production
spec:
  analysis:
    metrics:
      - name: "newrelic-error-rate"
        templateRef:
          name: newrelic-error-rate
          namespace: ingress-nginx
        templateVariables:
          ingress: service-a-ingress   # inject service A's ingress name
        thresholdRange:
          max: 1                       # fail if the 5xx error rate exceeds 1%
        interval: 2m

# Canary for service B — same template, different variables and threshold
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: service-b
  namespace: production
spec:
  analysis:
    metrics:
      - name: "newrelic-error-rate"
        templateRef:
          name: newrelic-error-rate
          namespace: ingress-nginx
        templateVariables:
          ingress: service-b-ingress   # inject service B's ingress name
        thresholdRange:
          max: 2                       # service B sets a different threshold
        interval: 2m

The practical value of this pattern is significant. Once the platform team defines the MetricTemplate in a central namespace, each service team adheres to the organization's standard analysis criteria while adjusting only templateVariables and thresholdRange. When the query logic needs to change, modifying the single MetricTemplate immediately propagates the change to every service that references it.
Pros and Cons Analysis
Advantages
| Item | Content |
|---|---|
| Immediately Leverage Existing APM | Directly connect Datadog and New Relic, already contracted and in operation, as deployment criteria — No need to build a separate metric stack |
| Organizational Standardization | Placing MetricTemplate in a shared namespace unifies analysis standards across teams and services — deployment standards become code, not verbal conventions |
| Fully Automated | Automatic promotion and rollback decisions based on metrics without human monitoring the dashboard — Reduces the burden of nighttime deployments |
| GitOps Friendly | Both MetricTemplate and Canary specifications are managed in YAML, so PR reviews, change history, and rollbacks are handled within the Git workflow |
| Multiple Metrics AND Conditions | Promotion proceeds only if multiple criteria, such as Error Rate + P99 Latency + Business Conversion Rate, are simultaneously satisfied |
| Flux CD Ecosystem Integration | Configure a fully automated pipeline where Flagger automatically starts canary analysis when Flux detects Git changes and applies them to Kubernetes |
Disadvantages and Precautions
| Item | Content | Response Plan |
|---|---|---|
| Simple Threshold Comparison | Possible misjudgment as thresholds are compared without statistical reliability when traffic is low | Induce sufficient traffic in the canary and set interval to a long value to accumulate sufficient data |
| L7 Traffic Splitting is Essential | Canary itself is impossible without a service mesh or ingress such as Istio, Linkerd, or NGINX | Build the traffic management layer first, then introduce Flagger |
| APM API Call Costs | Datadog and New Relic API queries execute every interval, so costs accumulate with short intervals | Do not set interval too short (minimum 1m recommended) |
| Late Discovery of Query Errors | MetricTemplate query errors surface only at canary analysis time | Validate queries in the APM console first, then apply |
| Cross-Namespace Secret Restriction | secretRef can only reference the same namespace by default | Place MetricTemplate and Secret in the same namespace |
| Cold Start Misanalysis | Analysis reliability drops due to insufficient metric data in the initial canary stage | Stabilize the Pod via progressDeadlineSeconds and canaryReadyThreshold, and skip early analyses with spec.analysis.iterations |
Cold Start: When a canary Pod is just launched, there is almost no metric data in the APM. If analysis is started in this state, a normal service may be misidentified as abnormal. In practice, it is recommended to skip the initial analysis using spec.analysis.iterations or configure it so that analysis starts after checking the Pod's readiness status by setting canaryReadyThreshold.
The Most Common Mistakes in Practice
- Skipping query pre-validation: query syntax errors or empty results are discovered only after deploying the MetricTemplate and running an actual canary analysis. First verify that the query returns values in the expected range by running it directly in Datadog Metrics Explorer or New Relic Query Builder. Flagger flags query errors as "Unable to read metric value," which can lead to an immediate rollback.
- Relying on a single metric: judging solely on the error rate misses cases where there are no errors but P99 latency skyrockets. If you register both the error rate and P99 latency in the analysis.metrics array, promotion proceeds only if both criteria are met. It is safest to add a business conversion-rate metric as well, if possible.
- Overly lenient initial threshold: setting a loose threshold like max: 20 with the mindset of "let's deploy quickly first" means rollback is not triggered during a real failure. The correct order is to start conservatively, such as max: 1, and relax gradually as operational data accumulates.
In Conclusion
Flagger MetricTemplate CRD is the most direct way to turn existing APM investments into a deployment safety net. Introducing this pattern reduces nightly deployment alerts and replaces the time spent "watching for 30 minutes to see if a deployment is safe" with automated analysis. Deployment approval cycles shorten, and with the platform team managing the central MetricTemplate, each service team can focus solely on threshold tuning.
3 Steps to Start Right Now:
- Verify the query in the APM console first: with Datadog, write a query in Metrics Explorer that returns the canary Pod's error rate; with New Relic, use Query Builder. Confirm the values fall in the expected range — this step is essential before applying the MetricTemplate.
- Deploy the MetricTemplate standalone, then connect the pipeline: apply the template first with kubectl apply -f metric-template.yaml, then transition gradually by adding only a templateRef to the existing Canary object. Watch analysis results in real time with kubectl describe canary <name>.
- Reinforce the safety net with multiple metric combinations: once the error-rate MetricTemplate operates successfully, add a P99 latency MetricTemplate and register it alongside in the analysis.metrics array, completing a layered safety net where promotion proceeds only when both criteria are satisfied.
Next Post: Setting up HTTP Header-Based A/B Routing YAML in a Flagger + Istio Environment, and Integrating Business Metrics Like Conversion Rate and Session Length as Analysis Criteria Using New Relic NRQL
Reference Materials
- Flagger Official Documentation - Metrics Analysis | flagger.app
- Flux CD - Flagger Metrics Analysis | fluxcd.io
- fluxcd/flagger GitHub - metrics.md source
- fluxcd/flagger GitHub - CRD definition
- How to Configure Flagger Metrics Analysis with Datadog | OneUptime
- Canary analysis metrics templating - Issue #418 | GitHub
- MetricTemplate secretRef cross-namespace - Issue #716 | GitHub
- Canary deployments with Amazon Managed Prometheus and Flagger | AWS Open Source Blog
- Mastering Progressive Delivery with Istio and Flagger | Medium
- Progressive Delivery: Argo Rollouts vs Flagger | Medium