Automating Fast-burn/Slow-burn Alerts with Grafana SLO
What if the service you operate fires a "CPU over 80%" alert 50 times a week, but only 2 of those actually lead to an incident? The remaining 48 are meaningless noise, and on-call engineers eventually become desensitized even to the critical 2. This is the fundamental limitation of simple threshold-based alerting: it cannot answer the core question, "How much is this situation actually impacting users right now?"
The systematic strategy that addresses this problem is the Multi-Window, Multi-Burn-Rate (MWMB) alerting strategy, documented by the Google SRE team in the SRE Workbook based on years of operational experience. By configuring Critical and Warning alerts based on the rate at which the Error Budget is being consumed (the Burn Rate), you can significantly reduce noise while precisely catching real threats to service reliability.
Following this guide, you will complete a production-grade SLO alerting pipeline with a dramatically lower false-positive rate compared to traditional threshold-based alerts. It covers everything from how the Grafana SLO plugin (grafana-slo-app) automatically provisions MWMB alert rules, to a complete pipeline that uses Alertmanager routing to branch Fast-burn alerts into immediate on-call pages and Slow-burn alerts into ticket creation. Both Grafana Cloud and Self-hosted Prometheus environments are covered.
Recommended prior knowledge: Basic Prometheus concepts (metric collection, PromQL) and YAML syntax. For the Kubernetes examples, basic `kubectl` usage and an understanding of CRDs are also helpful.
Core Concepts
If you're already familiar with these concepts, feel free to jump directly to the 'Practical Application' section.
SLI, SLO, Error Budget, and Burn Rate
Establishing the relationship between these concepts will make everything that follows much clearer.
- SLI (Service Level Indicator): A metric that measures service reliability. Example: "The ratio of successful responses (2xx) out of all requests"
- SLO (Service Level Objective): A target value for an SLI. Example: "99.9% of monthly requests must succeed"
- Error Budget: The total allowable amount of errors within the bounds of achieving the SLO
For a service with a 99.9% SLO, the allowable error ratio over one month (approximately 720 hours) is 0.1%. This entire allowance is the Error Budget, and Burn Rate represents the speed at which this budget is being consumed.
Burn Rate Calculation Example
Based on a 99.9% SLO, monthly error budget = 720 hours × 0.1% ≈ 43.2 minutes

| Burn Rate | Expected Depletion Time Calculation | Meaning |
|---|---|---|
| 1× | 30 days ÷ 1 = 30 days | Normal consumption rate |
| 14.4× | 30 days ÷ 14.4 ≈ 2.1 days (~50 hours) | Immediate action required |
| 6× | 30 days ÷ 6 = 5 days | Same-day response required |
| 3× | 30 days ÷ 3 = 10 days | Respond via ticket |

Use the formula `Expected Depletion Time = 30 days ÷ Burn Rate` when adjusting thresholds to match your own SLO target. Google's default values (14.4×, 6×, 3×, 1×) are designed for a 99.9% SLO and cannot be applied as-is to a 99.0% SLO service.
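If you want to sanity-check these numbers, or recompute them for a different SLO target, the arithmetic is easy to script. A minimal Python sketch (the function names are illustrative, not part of any Grafana tooling):

```python
def depletion_days(burn_rate: float, budget_days: float = 30) -> float:
    """Days until the error budget is exhausted at a constant burn rate."""
    return budget_days / burn_rate

def alert_threshold(slo: float, burn_rate: float) -> float:
    """Error-ratio threshold used in a burn-rate alert expression:
    fire when the observed error ratio exceeds burn_rate * (1 - SLO)."""
    return burn_rate * (1 - slo)

# A 14.4x burn against a 99.9% SLO depletes ~30 days of budget in ~2.1 days,
# and the corresponding alert threshold is an error ratio of ~1.44%.
print(round(depletion_days(14.4), 1))           # 2.1
print(round(alert_threshold(0.999, 14.4), 4))   # 0.0144
```

Feeding in a 99.0% SLO instead immediately shows why Google's defaults do not transfer: `alert_threshold(0.99, 14.4)` is a 14.4% error ratio, a very different alerting regime.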
Multi-Window, Multi-Burn-Rate (MWMB) Alerting Strategy
Evaluating Burn Rate with a single window creates two problems. A short window produces many false positives from temporary traffic spikes, while a long window makes detection too slow. The MWMB strategy solves both problems by firing an alert only when both the short window AND the long window conditions are simultaneously met.
The thresholds recommended by the Google SRE Workbook are as follows:
| Level | Burn Rate | Short Window | Long Window | Expected Depletion Time | Response |
|---|---|---|---|---|---|
| Critical (Fast-Burn) | ≥ 14.4× | 5 min | 1 hour | ~2 days | Immediate on-call page |
| Critical (Fast-Burn) | ≥ 6× | 30 min | 6 hours | ~5 days | Immediate on-call page |
| Warning (Slow-Burn) | ≥ 3× | 2 hours | 24 hours | ~10 days | Create ticket |
| Warning (Slow-Burn) | ≥ 1× | 6 hours | 72 hours | ~30 days | Create ticket |
The effect of dual-window validation: Evaluating with a 5-minute window alone will fire alerts even during temporary traffic surges. Using the 5-minute AND 1-hour conditions together means an alert only fires when the error rate is elevated both over the most recent 5 minutes and on average across the past hour, that is, a deterioration that is both sharp and sustained.
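A toy simulation makes the suppression effect concrete. This is illustrative Python only (one SLI error-ratio sample per minute is an assumption for the sketch, not how Prometheus actually evaluates range windows):

```python
SLO = 0.999
THRESHOLD = 14.4 * (1 - SLO)  # 14.4x burn rate against a 99.9% SLO

def window_avg(series, minutes):
    """Average error ratio over the trailing window (one sample per minute)."""
    return sum(series[-minutes:]) / minutes

def mwmb_fires(series):
    """Fire only if BOTH the 5m and 1h windows exceed the threshold."""
    return window_avg(series, 5) > THRESHOLD and window_avg(series, 60) > THRESHOLD

spike = [0.0] * 50 + [0.05] * 10   # a 10-minute spike at a 5% error ratio
sustained = [0.05] * 60            # a full hour at a 5% error ratio

print(window_avg(spike, 5) > THRESHOLD)  # True: a naive 5m-only alert would page
print(mwmb_fires(spike))                 # False: the 1h window filters the spike out
print(mwmb_fires(sustained))             # True: sharp AND sustained burn
```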
Recording Rules
The MWMB strategy requires repeatedly running SLI queries across multiple time windows: 5 minutes, 30 minutes, 1 hour, 6 hours, 24 hours, 72 hours, and more. Recomputing these complex queries from raw series at every alert evaluation places a significant load on Prometheus.
Recording Rules are a Prometheus mechanism that solves this problem. By pre-computing complex queries and saving the results as new metrics, the pre-aggregated results can be reused during alert evaluation, simultaneously reducing evaluation latency and storage load.
grafana-slo-app automatically provisions these Recording Rules alongside the SLO when it is created.
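As an illustration, a hand-written recording rule of the shape that gets provisioned might look like the fragment below. The metric and label names are chosen to line up with the `slo:sli_error:ratio_rate5m` series used later in this guide; the real provisioned rules are managed by the plugin and may differ in detail.

```yaml
groups:
  - name: slo-api-gateway-recordings
    rules:
      # Pre-compute the 5m SLI error ratio once, so alert rules can
      # reference the cheap recorded series instead of the raw query.
      - record: slo:sli_error:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{job="api",code=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{job="api"}[5m]))
        labels:
          grafana_slo_uuid: "<uuid>"
```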
What Alertmanager and grafana-slo-app Create Automatically
Alertmanager is the component that receives alerts fired by Prometheus and handles deduplication, grouping, and routing. Which alerts are sent to which channel (PagerDuty, Slack, webhooks, etc.) is controlled via the alertmanager.yml configuration file.
grafana-slo-app is an official Grafana plugin; once you create an SLO through the UI, it automatically provisions all of the following resources:
```
SLO Creation
├── Dashboard (including error budget burndown chart)
├── Recording Rules — optimizes multi-window SLI queries
└── Alert Rules
    ├── [Critical] fast-burn: grafana_slo_severity="critical"
    └── [Warning] slow-burn: grafana_slo_severity="warning"
```

The auto-generated alerts come with labels that can be used directly in Alertmanager routing:

- `grafana_slo_uuid` — Unique SLO identifier
- `grafana_slo_window` — Evaluation window (e.g., `5m`, `1h`)
- `grafana_slo_severity` — `critical` or `warning`
Practical Application
Example 1: Grafana Cloud — Configuring Dual Alertmanager Routing
Looking first at what the alert rules generated by grafana-slo-app actually contain makes it clear which labels the matchers in the Alertmanager routing configuration operate on.
```yaml
# Example Alert Rule auto-generated by grafana-slo-app (simplified)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: slo-api-gateway
spec:
  groups:
    - name: slo-api-gateway.rules
      rules:
        - alert: SLOFastBurn
          expr: |
            (
              slo:sli_error:ratio_rate5m{grafana_slo_uuid="<uuid>"}
              > (14.4 * (1 - 0.999))
            )
            and
            (
              slo:sli_error:ratio_rate1h{grafana_slo_uuid="<uuid>"}
              > (14.4 * (1 - 0.999))
            )
          labels:
            grafana_slo_severity: critical
            grafana_slo_uuid: "<uuid>"
          annotations:
            summary: "Fast-burn: SLO error budget is being depleted rapidly"
```

Alertmanager routing is branched based on the `grafana_slo_severity` label from these auto-generated rules.
```yaml
# alertmanager.yml
route:
  receiver: 'default'
  routes:
    # Fast-Burn: immediate on-call page (always place more specific matchers at the top)
    - matchers:
        - grafana_slo_severity = "critical"
      receiver: 'pagerduty-oncall'
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 1h
    # Slow-Burn: create ticket
    - matchers:
        - grafana_slo_severity = "warning"
      receiver: 'jira-ticket'
      group_wait: 5m
      group_interval: 30m
      repeat_interval: 6h
receivers:
  - name: 'pagerduty-oncall'
    pagerduty_configs:
      - routing_key: '${PAGERDUTY_ROUTING_KEY}'  # inject via environment variable or Secret
        description: '{{ .CommonAnnotations.summary }}'
        severity: 'critical'
  - name: 'jira-ticket'
    webhook_configs:
      - url: '${JIRA_AUTOMATION_WEBHOOK_URL}'  # Jira Automation webhook URL
        send_resolved: true
  - name: 'default'
    slack_configs:
      - api_url: '${SLACK_WEBHOOK_URL}'
        channel: '#alerts'
```

How to inject sensitive information: Hard-coding credentials like `routing_key` directly into YAML can lead to security incidents. In Grafana Cloud environments, use Grafana Secrets Manager or Kubernetes Secrets. In Self-hosted environments, note that Alertmanager does not expand `${ENV_VAR}` placeholders by itself; render them at deploy time (for example with `envsubst` in the container entrypoint) or mount the values from a Secret.
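One minimal way to do that deploy-time rendering, sketched in shell. The file paths and the demo key are assumptions for illustration; in production the key would come from a mounted Secret file, and `envsubst` (from gettext) is the usual tool where available:

```shell
# Render alertmanager.yml from a template at deploy time, because
# Alertmanager does not expand ${ENV_VAR} placeholders on its own.
PAGERDUTY_ROUTING_KEY="demo-routing-key"

cat > /tmp/alertmanager.yml.tmpl <<'EOF'
receivers:
  - name: 'pagerduty-oncall'
    pagerduty_configs:
      - routing_key: '${PAGERDUTY_ROUTING_KEY}'
EOF

# sed substitution shown here for portability; envsubst does the same job
sed "s/\${PAGERDUTY_ROUTING_KEY}/${PAGERDUTY_ROUTING_KEY}/" \
  /tmp/alertmanager.yml.tmpl > /tmp/alertmanager.yml

grep routing_key /tmp/alertmanager.yml
```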
How to apply alertmanager.yml in Grafana Cloud: In Grafana Cloud, navigate to the left menu → Alerting → Alertmanager → Alertmanager configuration tab, where you can paste the full YAML directly or configure via the UI's Contact points / Notification policies. GitOps via the Terraform `grafana_alertmanager_config` resource is also supported.
Jira webhook integration: You can automatically create Jira tickets upon alert receipt using Jira Automation's "Trigger a rule via webhook" feature or an Atlassian Forge app. If you want a generic configuration not tied to a specific environment, using a self-hosted middleware URL in `webhook_configs` is also a valid approach.
| Configuration Item | Fast-Burn Value | Slow-Burn Value | Reason |
|---|---|---|---|
| `group_wait` | 30 seconds | 5 minutes | Minimize aggregation wait for Critical |
| `repeat_interval` | 1 hour | 6 hours | Reduce frequency of Warning repeat notifications |
| `receiver` | PagerDuty | Jira webhook | Separate channels by severity |
Example 2: Self-hosted Environment — Generating MWMB Rules with Sloth
In a Self-hosted Prometheus environment without Grafana Cloud, you can use Sloth to define SLOs in YAML and automatically generate MWMB alert rules.
```yaml
# slo-api-gateway.yaml (Sloth input file)
version: "prometheus/v1"
service: "api-gateway"
slos:
  - name: "requests-availability"
    objective: 99.9
    description: "99.9% of requests should be successful"
    sli:
      events:
        # [{{.window}}] is Sloth's Go template syntax, automatically substituted
        # with each evaluation window value such as 5m, 30m, 1h, 6h during rule generation.
        error_query: sum(rate(http_requests_total{job="api",code=~"5.."}[{{.window}}]))
        total_query: sum(rate(http_requests_total{job="api"}[{{.window}}]))
    alerting:
      name: APIGatewayAvailability
      page_alert:
        labels:
          severity: critical
          team: platform
      ticket_alert:
        labels:
          severity: warning
          team: platform
```

Processing the above file with the Sloth CLI generates recording rule and alert rule YAML ready to apply directly to Prometheus:
```shell
# Generate Prometheus rule files with Sloth
sloth generate -i slo-api-gateway.yaml -o rules/api-gateway-slo.yaml

# In a Kubernetes environment, apply as a PrometheusRule CRD
# (for CRD output, write the Sloth input as a "sloth.slok.dev/v1"
# PrometheusServiceLevel manifest instead of "prometheus/v1")
kubectl apply -f rules/api-gateway-slo.yaml
```

Since the alert rules generated by Sloth also include `severity: critical` / `severity: warning` labels, the Alertmanager routing can be applied with the same `matchers` structure as Example 1.
Example 3: Environment-specific Approaches — Grafana-managed vs Data source-managed
The choice of alert management mode depends on your operational environment and GitOps requirements.
| | Grafana-managed | Data source-managed |
|---|---|---|
| Alert storage location | Grafana internal DB | Stored directly in Prometheus/Mimir |
| Alertmanager | Grafana built-in AM | External Prometheus AM |
| GitOps suitability | API / Terraform | PrometheusRule CRD |
| Recommended environment | Grafana Cloud | Self-hosted Kubernetes |
| Migration note | — | Must reconfigure Contact Points |
When switching to Data source-managed mode, it is recommended to validate the end-to-end routing with the following procedure:
```shell
# Validate external Alertmanager configuration syntax
amtool check-config alertmanager.yml

# Fire a test alert to verify the actual routing path
amtool alert add alertname="TestSLOAlert" grafana_slo_severity="critical" \
  --alertmanager.url=http://localhost:9093
```

Caution when switching to Data source-managed: Notification Policies and Contact Points configured in the existing Grafana Alertmanager must be reconfigured in the external Alertmanager. If you omit this step, SLO alerts will fire but nobody will be notified, creating an operational blind spot. It is strongly recommended to verify the full routing path with a test alert immediately after switching.
Pros and Cons Analysis
Advantages
| Item | Details |
|---|---|
| Reduced alert fatigue | Noise drops sharply because alerts fire only when error budget depletion genuinely threatens the SLO |
| One-click automation | grafana-slo-app provisions recording rules, dashboards, and alert rules all from a single SLO creation |
| False positive suppression | Dual short-window AND long-window validation suppresses false alerts caused by momentary spikes |
| Separated response | Critical triggers immediate on-call pages; Warning goes to tickets — preventing unnecessary late-night calls |
| SLO-as-Code | Can be integrated into a GitOps pipeline using the Terraform grafana_slo resource |
Disadvantages and Caveats
| Item | Details | Mitigation |
|---|---|---|
| Cloud dependency | Full grafana-slo-app functionality is Grafana Cloud only | Use the Sloth + Prometheus combination for Self-hosted environments |
| Alert latency | Slow-Burn uses up to a 72-hour window, so detection can take several hours | Rely on the lower-latency Fast-Burn tiers for urgent regressions, and tune thresholds if Slow-Burn detection is consistently too late |
| Threshold tuning | Google's default values (14.4×, 6×, 3×, 1×) are not optimal for all services | Recalculate thresholds to match your SLO target using `Expected Depletion Time = 30 days ÷ Burn Rate` |
| Label complexity | Maintaining consistency between labels like `grafana_slo_uuid`, `team`, etc. and the routing tree becomes burdensome when managing many SLOs | Document label naming conventions as a team standard |
| Dual AM management | Operational burden of synchronizing Contact Points when switching from Grafana-managed to an external AM | Create a checklist before migration and validate in a staging environment |
Most Common Mistakes in Practice
- Alertmanager routing order errors: If the `grafana_slo_severity = "critical"` rule is positioned below a catch-all rule in the `routes` array, Critical alerts will be delivered to the default channel instead of PagerDuty. Always place the most specific matchers at the top of the array.
- Failing to reconfigure Contact Points after switching to Data source-managed: After switching to an external Alertmanager but neglecting to migrate the Notification Policy, SLO alerts fire but nobody gets notified. It is recommended to validate the end-to-end path with a test alert immediately after switching.
- Applying Google's default thresholds as-is: Default values designed for a 99.9% SLO do not transfer directly to a 99.0% SLO service; the burn-rate scale differs, so alerts fire too frequently or too late. Recalculate thresholds to match your SLO target using the formula `Expected Depletion Time = 30 days ÷ Burn Rate`.
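The routing-order mistake can be made concrete with a toy model of Alertmanager's top-down, first-match route selection. This is illustrative Python only; real Alertmanager also supports nested routes, regex matchers, and a `continue` flag:

```python
def route(alert_labels, routes, default_receiver="default"):
    """Toy first-match routing: return the receiver of the first route
    whose matchers all equal the alert's labels."""
    for r in routes:
        if all(alert_labels.get(k) == v for k, v in r["matchers"].items()):
            return r["receiver"]
    return default_receiver

critical_alert = {"grafana_slo_severity": "critical"}

# Wrong order: the catch-all route (empty matchers) swallows everything
bad_order = [
    {"matchers": {}, "receiver": "slack-default"},
    {"matchers": {"grafana_slo_severity": "critical"}, "receiver": "pagerduty-oncall"},
]
# Correct order: the most specific matcher sits at the top
good_order = list(reversed(bad_order))

print(route(critical_alert, bad_order))   # slack-default: the page is lost
print(route(critical_alert, good_order))  # pagerduty-oncall
```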
Closing Thoughts
The MWMB dual-alert strategy uses "error budget depletion rate" rather than "threshold exceeded" as its criterion, reducing alert noise while catching real threats to service reliability more precisely. When alert fatigue decreases, the team's on-call burden is reduced, and when the error budget becomes visible, discussions about reliability improvement priorities shift from gut feelings to data-driven decisions.
Three steps you can start right now:
- Define your SLO: In Grafana Cloud, create an availability SLO (e.g., 99.9%) for a core API endpoint using `grafana-slo-app`. In Self-hosted environments, you can auto-generate MWMB rule files with `sloth generate -i my-slo.yaml -o rules/my-slo.yaml`.
- Apply Alertmanager routing: Using Example 1's `alertmanager.yml` as a reference, configure routing so `grafana_slo_severity = "critical"` → PagerDuty/IRM and `grafana_slo_severity = "warning"` → webhook ticket, then validate the end-to-end path with a test alert.
- Monitor the error budget dashboard: Share the auto-generated burndown chart in your team channel to make the remaining error budget visible, and use monthly trends to decide whether threshold tuning is needed.
Next post: A practical SLO-as-Code guide for managing SLOs as code using the Terraform `grafana_slo` resource and integrating them into a GitOps pipeline
References
- Introduction to Grafana SLO | Grafana Plugins official docs
- Best Practices for Grafana SLOs | Grafana Plugins official docs
- Create SLOs | Grafana Plugins official docs
- Google SRE Workbook — Alerting on SLOs
- Alerting on Error Budget Burn Rate | Google Cloud official docs
- Sloth — Easy and simple Prometheus SLO generator | GitHub
- How We Use Sloth to do SLO Monitoring and Alerting with Prometheus | Mattermost Blog
- Alerting on SLOs like Pros | SoundCloud Backstage Blog
- Configure Alertmanagers | Grafana official docs
- Configure Notification Policies | Grafana official docs