Automating Fast-burn/Slow-burn Alerts with Grafana SLO
What if the service you operate fires a "CPU over 80%" alert 50 times a week, but only 2 of those actually lead to an incident? The remaining 48 are meaningless noise, and on-call engineers eventually become desensitized even to the critical 2. This is the fundamental limitation of simple threshold-based alerting: it cannot answer the core question, "How much is this situation actually impacting users right now?"
The systematic strategy that addresses this problem is the Multi-Window, Multi-Burn-Rate (MWMB) alerting strategy, documented by the Google SRE team in the SRE Workbook based on years of operational experience. By configuring Critical and Warning alerts based on the rate at which the Error Budget is being consumed (the Burn Rate), you can significantly reduce noise while precisely catching real threats to service reliability.
Following this guide, you will complete a production-grade SLO alerting pipeline with a dramatically lower false-positive rate compared to traditional threshold-based alerts. It covers everything from how the Grafana SLO plugin (grafana-slo-app) automatically provisions MWMB alert rules, to a complete pipeline that uses Alertmanager routing to branch Fast-burn alerts into immediate on-call pages and Slow-burn alerts into ticket creation. Both Grafana Cloud and Self-hosted Prometheus environments are covered.
Recommended prior knowledge: Basic Prometheus concepts (metric collection, PromQL) and YAML syntax. For the Kubernetes examples, basic `kubectl` usage and an understanding of CRDs are also helpful.
Core Concepts
If you're already familiar with these concepts, feel free to jump directly to the 'Practical Application' section.
SLI, SLO, Error Budget, and Burn Rate
Establishing the relationship between these concepts will make everything that follows much clearer.
- SLI (Service Level Indicator): A metric that measures service reliability. Example: "The ratio of successful responses (2xx) out of all requests"
- SLO (Service Level Objective): A target value for an SLI. Example: "99.9% of monthly requests must succeed"
- Error Budget: The total allowable amount of errors within the bounds of achieving the SLO
For a service with a 99.9% SLO, the allowable error ratio over one month (approximately 720 hours) is 0.1%. This entire allowance is the Error Budget, and Burn Rate represents the speed at which this budget is being consumed.
Burn Rate Calculation Example
Based on a 99.9% SLO, monthly error budget = 720 hours × 0.1% ≈ 43.2 minutes

| Burn Rate | Expected Depletion Time Calculation | Meaning |
|---|---|---|
| 1× | 30 days ÷ 1 = 30 days | Normal consumption rate |
| 14.4× | 30 days ÷ 14.4 ≈ 2.1 days (~50 hours) | Immediate action required |
| 6× | 30 days ÷ 6 = 5 days | Same-day response required |
| 3× | 30 days ÷ 3 = 10 days | Respond via ticket |

Use the formula `Expected Depletion Time = 30 days ÷ Burn Rate` when adjusting thresholds to match your own SLO target. Google's default values (14.4×, 6×, 3×, 1×) are designed for a 99.9% SLO and cannot be applied as-is to a 99.0% SLO service.
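If you want to sanity-check these numbers, or recompute them for a different SLO target, the arithmetic is easy to script. A minimal Python sketch (the function names are illustrative, not part of any Grafana tooling):

```python
def depletion_days(burn_rate: float, budget_days: float = 30) -> float:
    """Days until the error budget is exhausted at a constant burn rate."""
    return budget_days / burn_rate

def alert_threshold(slo: float, burn_rate: float) -> float:
    """Error-ratio threshold used in a burn-rate alert expression:
    fire when the observed error ratio exceeds burn_rate * (1 - SLO)."""
    return burn_rate * (1 - slo)

# A 14.4x burn against a 99.9% SLO depletes ~30 days of budget in ~2.1 days,
# and the corresponding alert threshold is an error ratio of ~1.44%.
print(round(depletion_days(14.4), 1))           # 2.1
print(round(alert_threshold(0.999, 14.4), 4))   # 0.0144
```

Feeding in a 99.0% SLO instead immediately shows why Google's defaults do not transfer: `alert_threshold(0.99, 14.4)` is a 14.4% error ratio, a very different alerting regime.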
Multi-Window, Multi-Burn-Rate (MWMB) Alerting Strategy
Evaluating Burn Rate with a single window creates two problems. A short window produces many false positives from temporary traffic spikes, while a long window makes detection too slow. The MWMB strategy solves both problems by firing an alert only when both the short window AND the long window conditions are simultaneously met.
The thresholds recommended by the Google SRE Workbook are as follows:
| Level | Burn Rate | Short Window | Long Window | Expected Depletion Time | Response |
|---|---|---|---|---|---|
| Critical (Fast-Burn) | ≥ 14.4× | 5 min | 1 hour | ~2 days | Immediate on-call page |
| Critical (Fast-Burn) | ≥ 6× | 30 min | 6 hours | ~5 days | Immediate on-call page |
| Warning (Slow-Burn) | ≥ 3× | 2 hours | 24 hours | ~10 days | Create ticket |
| Warning (Slow-Burn) | ≥ 1× | 6 hours | 72 hours | ~30 days | Create ticket |
The effect of dual-window validation: Evaluating with a 5-minute window alone will fire alerts even during temporary traffic surges. Using the 5-minute AND 1-hour conditions together means an alert only fires when the error rate is elevated both over the most recent 5 minutes and on average across the past hour, that is, a deterioration that is both sharp and sustained.
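A toy simulation makes the suppression effect concrete. This is illustrative Python only (one SLI error-ratio sample per minute is an assumption for the sketch, not how Prometheus actually evaluates range windows):

```python
SLO = 0.999
THRESHOLD = 14.4 * (1 - SLO)  # 14.4x burn rate against a 99.9% SLO

def window_avg(series, minutes):
    """Average error ratio over the trailing window (one sample per minute)."""
    return sum(series[-minutes:]) / minutes

def mwmb_fires(series):
    """Fire only if BOTH the 5m and 1h windows exceed the threshold."""
    return window_avg(series, 5) > THRESHOLD and window_avg(series, 60) > THRESHOLD

spike = [0.0] * 50 + [0.05] * 10   # a 10-minute spike at a 5% error ratio
sustained = [0.05] * 60            # a full hour at a 5% error ratio

print(window_avg(spike, 5) > THRESHOLD)  # True: a naive 5m-only alert would page
print(mwmb_fires(spike))                 # False: the 1h window filters the spike out
print(mwmb_fires(sustained))             # True: sharp AND sustained burn
```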
Recording Rules
The MWMB strategy requires repeatedly running SLI queries across multiple time windows: 5 minutes, 30 minutes, 1 hour, 6 hours, 24 hours, 72 hours, and more. Recomputing these complex queries from raw series at every alert evaluation places a significant load on Prometheus.
Recording Rules are a Prometheus mechanism that solves this problem. By pre-computing complex queries and saving the results as new metrics, the pre-aggregated results can be reused during alert evaluation, simultaneously reducing evaluation latency and storage load.
grafana-slo-app automatically provisions these Recording Rules alongside the SLO when it is created.
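As an illustration, a hand-written recording rule of the shape that gets provisioned might look like the fragment below. The metric and label names are chosen to line up with the `slo:sli_error:ratio_rate5m` series used later in this guide; the real provisioned rules are managed by the plugin and may differ in detail.

```yaml
groups:
  - name: slo-api-gateway-recordings
    rules:
      # Pre-compute the 5m SLI error ratio once, so alert rules can
      # reference the cheap recorded series instead of the raw query.
      - record: slo:sli_error:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{job="api",code=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{job="api"}[5m]))
        labels:
          grafana_slo_uuid: "<uuid>"
```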
What Alertmanager and grafana-slo-app Create Automatically
Alertmanager is the component that receives alerts fired by Prometheus and handles deduplication, grouping, and routing. Which alerts are sent to which channel (PagerDuty, Slack, webhooks, etc.) is controlled via the alertmanager.yml configuration file.
grafana-slo-app is an official Grafana plugin; once you create an SLO through the UI, it automatically provisions all of the following resources:
```
SLO Creation
├── Dashboard (including error budget burndown chart)
├── Recording Rules — optimizes multi-window SLI queries
└── Alert Rules
    ├── [Critical] fast-burn: grafana_slo_severity="critical"
    └── [Warning] slow-burn: grafana_slo_severity="warning"
```

The auto-generated alerts come with labels that can be used directly in Alertmanager routing:

- `grafana_slo_uuid` — Unique SLO identifier
- `grafana_slo_window` — Evaluation window (e.g., `5m`, `1h`)
- `grafana_slo_severity` — `critical` or `warning`
Practical Application
Example 1: Grafana Cloud — Configuring Dual Alertmanager Routing
Looking first at what the alert rules generated by grafana-slo-app actually contain makes it clear which labels the matchers in the Alertmanager routing configuration operate on.
```yaml
# Example Alert Rule auto-generated by grafana-slo-app (simplified)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: slo-api-gateway
spec:
  groups:
    - name: slo-api-gateway.rules
      rules:
        - alert: SLOFastBurn
          expr: |
            (
              slo:sli_error:ratio_rate5m{grafana_slo_uuid="<uuid>"}
              > (14.4 * (1 - 0.999))
            )
            and
            (
              slo:sli_error:ratio_rate1h{grafana_slo_uuid="<uuid>"}
              > (14.4 * (1 - 0.999))
            )
          labels:
            grafana_slo_severity: critical
            grafana_slo_uuid: "<uuid>"
          annotations:
            summary: "Fast-burn: SLO error budget is being depleted rapidly"
```

Alertmanager routing is branched based on the `grafana_slo_severity` label from these auto-generated rules.
```yaml
# alertmanager.yml
route:
  receiver: 'default'
  routes:
    # Fast-Burn: immediate on-call page (always place more specific matchers at the top)
    - matchers:
        - grafana_slo_severity = "critical"
      receiver: 'pagerduty-oncall'
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 1h
    # Slow-Burn: create ticket
    - matchers:
        - grafana_slo_severity = "warning"
      receiver: 'jira-ticket'
      group_wait: 5m
      group_interval: 30m
      repeat_interval: 6h
receivers:
  - name: 'pagerduty-oncall'
    pagerduty_configs:
      - routing_key: '${PAGERDUTY_ROUTING_KEY}'  # inject via environment variable or Secret
        description: '{{ .CommonAnnotations.summary }}'
        severity: 'critical'
  - name: 'jira-ticket'
    webhook_configs:
      - url: '${JIRA_AUTOMATION_WEBHOOK_URL}'  # Jira Automation webhook URL
        send_resolved: true
  - name: 'default'
    slack_configs:
      - api_url: '${SLACK_WEBHOOK_URL}'
        channel: '#alerts'
```

How to inject sensitive information: Hard-coding credentials like `routing_key` directly into YAML can lead to security incidents. In Grafana Cloud environments, use Grafana Secrets Manager or Kubernetes Secrets. In Self-hosted environments, note that Alertmanager does not expand `${ENV_VAR}` placeholders by itself; render them at deploy time (for example with `envsubst` in the container entrypoint) or mount the values from a Secret.
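One minimal way to do that deploy-time rendering, sketched in shell. The file paths and the demo key are assumptions for illustration; in production the key would come from a mounted Secret file, and `envsubst` (from gettext) is the usual tool where available:

```shell
# Render alertmanager.yml from a template at deploy time, because
# Alertmanager does not expand ${ENV_VAR} placeholders on its own.
PAGERDUTY_ROUTING_KEY="demo-routing-key"

cat > /tmp/alertmanager.yml.tmpl <<'EOF'
receivers:
  - name: 'pagerduty-oncall'
    pagerduty_configs:
      - routing_key: '${PAGERDUTY_ROUTING_KEY}'
EOF

# sed substitution shown here for portability; envsubst does the same job
sed "s/\${PAGERDUTY_ROUTING_KEY}/${PAGERDUTY_ROUTING_KEY}/" \
  /tmp/alertmanager.yml.tmpl > /tmp/alertmanager.yml

grep routing_key /tmp/alertmanager.yml
```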
How to apply alertmanager.yml in Grafana Cloud: In Grafana Cloud, navigate to the left menu → Alerting → Alertmanager → Alertmanager configuration tab, where you can paste the full YAML directly or configure via the UI's Contact points / Notification policies. GitOps via the Terraform `grafana_alertmanager_config` resource is also supported.
Jira webhook integration: You can automatically create Jira tickets upon alert receipt using Jira Automation's "Trigger a rule via webhook" feature or an Atlassian Forge app. If you want a generic configuration not tied to a specific environment, using a self-hosted middleware URL in `webhook_configs` is also a valid approach.
| Configuration Item | Fast-Burn Value | Slow-Burn Value | Reason |
|---|---|---|---|
| `group_wait` | 30 seconds | 5 minutes | Minimize aggregation wait for Critical |
| `repeat_interval` | 1 hour | 6 hours | Reduce frequency of Warning repeat notifications |
| `receiver` | PagerDuty | Jira webhook | Separate channels by severity |
Example 2: Self-hosted Environment — Generating MWMB Rules with Sloth
In a Self-hosted Prometheus environment without Grafana Cloud, you can use Sloth to define SLOs in YAML and automatically generate MWMB alert rules.
```yaml
# slo-api-gateway.yaml (Sloth input file)
version: "prometheus/v1"
service: "api-gateway"
slos:
  - name: "requests-availability"
    objective: 99.9
    description: "99.9% of requests should be successful"
    sli:
      events:
        # [{{.window}}] is Sloth's Go template syntax, automatically substituted
        # with each evaluation window value such as 5m, 30m, 1h, 6h during rule generation.
        error_query: sum(rate(http_requests_total{job="api",code=~"5.."}[{{.window}}]))
        total_query: sum(rate(http_requests_total{job="api"}[{{.window}}]))
    alerting:
      name: APIGatewayAvailability
      page_alert:
        labels:
          severity: critical
          team: platform
      ticket_alert:
        labels:
          severity: warning
          team: platform
```

Processing the above file with the Sloth CLI generates recording rule and alert rule YAML ready to apply directly to Prometheus:
```shell
# Generate Prometheus rule files with Sloth
sloth generate -i slo-api-gateway.yaml -o rules/api-gateway-slo.yaml

# In a Kubernetes environment, apply as a PrometheusRule CRD
# (for CRD output, write the Sloth input as a "sloth.slok.dev/v1"
# PrometheusServiceLevel manifest instead of "prometheus/v1")
kubectl apply -f rules/api-gateway-slo.yaml
```

Since the alert rules generated by Sloth also include `severity: critical` / `severity: warning` labels, the Alertmanager routing can be applied with the same `matchers` structure as Example 1.
Example 3: Environment-specific Approaches — Grafana-managed vs Data source-managed
The choice of alert management mode depends on your operational environment and GitOps requirements.
| | Grafana-managed | Data source-managed |
|---|---|---|
| Alert storage location | Grafana internal DB | Stored directly in Prometheus/Mimir |
| Alertmanager | Grafana built-in AM | External Prometheus AM |
| GitOps suitability | API / Terraform | PrometheusRule CRD |
| Recommended environment | Grafana Cloud | Self-hosted Kubernetes |
| Migration note | — | Must reconfigure Contact Points |
When switching to Data source-managed mode, it is recommended to validate the end-to-end routing with the following procedure:
```shell
# Validate external Alertmanager configuration syntax
amtool check-config alertmanager.yml

# Fire a test alert to verify the actual routing path
amtool alert add alertname="TestSLOAlert" grafana_slo_severity="critical" \
  --alertmanager.url=http://localhost:9093
```

Caution when switching to Data source-managed: Notification Policies and Contact Points configured in the existing Grafana Alertmanager must be reconfigured in the external Alertmanager. If you omit this step, SLO alerts will fire but nobody will be notified, creating an operational blind spot. It is strongly recommended to verify the full routing path with a test alert immediately after switching.
Pros and Cons Analysis
Advantages
| Item | Details |
|---|---|
| Reduced alert fatigue | Noise drops sharply because alerts fire only when error budget depletion genuinely threatens the SLO |
| One-click automation | grafana-slo-app provisions recording rules, dashboards, and alert rules all from a single SLO creation |
| False positive suppression | Dual short-window AND long-window validation suppresses false alerts caused by momentary spikes |
| Separated response | Critical triggers immediate on-call pages; Warning goes to tickets — preventing unnecessary late-night calls |
| SLO-as-Code | Can be integrated into a GitOps pipeline using the Terraform grafana_slo resource |
Disadvantages and Caveats
| Item | Details | Mitigation |
|---|---|---|
| Cloud dependency | Full grafana-slo-app functionality is Grafana Cloud only | Use the Sloth + Prometheus combination for Self-hosted environments |
| Alert latency | Slow-Burn uses up to a 72-hour window, so detection can take several hours | Rely on the lower-latency Fast-Burn tiers for urgent regressions, and tune thresholds if Slow-Burn detection is consistently too late |
| Threshold tuning | Google's default values (14.4×, 6×, 3×, 1×) are not optimal for all services | Recalculate thresholds to match your SLO target using `Expected Depletion Time = 30 days ÷ Burn Rate` |
| Label complexity | Maintaining consistency between labels like `grafana_slo_uuid`, `team`, etc. and the routing tree becomes burdensome when managing many SLOs | Document label naming conventions as a team standard |
| Dual AM management | Operational burden of synchronizing Contact Points when switching from Grafana-managed to an external AM | Create a checklist before migration and validate in a staging environment |
Most Common Mistakes in Practice
- Alertmanager routing order errors: If the `grafana_slo_severity = "critical"` rule is positioned below a catch-all rule in the `routes` array, Critical alerts will be delivered to the default channel instead of PagerDuty. Always place the most specific matchers at the top of the array.
- Failing to reconfigure Contact Points after switching to Data source-managed: After switching to an external Alertmanager but neglecting to migrate the Notification Policy, SLO alerts fire but nobody gets notified. It is recommended to validate the end-to-end path with a test alert immediately after switching.
- Applying Google's default thresholds as-is: Default values designed for a 99.9% SLO do not transfer directly to a 99.0% SLO service; the burn-rate scale differs, so alerts fire too frequently or too late. Recalculate thresholds to match your SLO target using the formula `Expected Depletion Time = 30 days ÷ Burn Rate`.
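The routing-order mistake can be made concrete with a toy model of Alertmanager's top-down, first-match route selection. This is illustrative Python only; real Alertmanager also supports nested routes, regex matchers, and a `continue` flag:

```python
def route(alert_labels, routes, default_receiver="default"):
    """Toy first-match routing: return the receiver of the first route
    whose matchers all equal the alert's labels."""
    for r in routes:
        if all(alert_labels.get(k) == v for k, v in r["matchers"].items()):
            return r["receiver"]
    return default_receiver

critical_alert = {"grafana_slo_severity": "critical"}

# Wrong order: the catch-all route (empty matchers) swallows everything
bad_order = [
    {"matchers": {}, "receiver": "slack-default"},
    {"matchers": {"grafana_slo_severity": "critical"}, "receiver": "pagerduty-oncall"},
]
# Correct order: the most specific matcher sits at the top
good_order = list(reversed(bad_order))

print(route(critical_alert, bad_order))   # slack-default: the page is lost
print(route(critical_alert, good_order))  # pagerduty-oncall
```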
Closing Thoughts
The MWMB dual-alert strategy uses "error budget depletion rate" rather than "threshold exceeded" as its criterion, reducing alert noise while catching real threats to service reliability more precisely. When alert fatigue decreases, the team's on-call burden is reduced, and when the error budget becomes visible, discussions about reliability improvement priorities shift from gut feelings to data-driven decisions.
Three steps you can start right now:
- Define your SLO: In Grafana Cloud, create an availability SLO (e.g., 99.9%) for a core API endpoint using `grafana-slo-app`. In Self-hosted environments, you can auto-generate MWMB rule files with `sloth generate -i my-slo.yaml -o rules/my-slo.yaml`.
- Apply Alertmanager routing: Using Example 1's `alertmanager.yml` as a reference, configure routing so `grafana_slo_severity = "critical"` → PagerDuty/IRM and `grafana_slo_severity = "warning"` → webhook ticket, then validate the end-to-end path with a test alert.
- Monitor the error budget dashboard: Share the auto-generated burndown chart in your team channel to make the remaining error budget visible, and use monthly trends to decide whether threshold tuning is needed.
Next post: A practical SLO-as-Code guide for managing SLOs as code using the Terraform `grafana_slo` resource and integrating them into a GitOps pipeline
References
- Introduction to Grafana SLO | Grafana Plugins official docs
- Best Practices for Grafana SLOs | Grafana Plugins official docs
- Create SLOs | Grafana Plugins official docs
- Google SRE Workbook — Alerting on SLOs
- Alerting on Error Budget Burn Rate | Google Cloud official docs
- Sloth — Easy and simple Prometheus SLO generator | GitHub
- How We Use Sloth to do SLO Monitoring and Alerting with Prometheus | Mattermost Blog
- Alerting on SLOs like Pros | SoundCloud Backstage Blog
- Configure Alertmanagers | Grafana official docs
- Configure Notification Policies | Grafana official docs