How to Cut LogQL Costs with Loki Recording Rules and Structurally Eliminate Grafana Alerting Noise
After setting up log-based alerting for the first time, I felt pretty satisfied for a while. I had a query counting error logs in Loki, and a structure was in place to send Slack notifications whenever a threshold was exceeded. But before long, two problems emerged. Every time I opened a dashboard, it took dozens of seconds to respond because it had to aggregate tens of gigabytes of logs from scratch each time. And alerts fired on momentary spikes, so even when an on-call page came in at 3 AM, I developed a habit of ignoring it thinking "it's probably just noise again." When alert fatigue builds up, you start missing real incident alerts. That's exactly what happened. One night there was a DB connection error, but because alerts had been coming so frequently, I dismissed it several times and ended up responding too late.
Loki Recording Rules pre-compute costly LogQL queries and store the results as metrics, and when combined with Grafana Alerting's Pending Period and Recovering state, they can structurally eliminate noisy alerts. After reading this post, you'll walk away with practical answers to "what alert configurations generate noise?" and "what changes will reduce on-call pages?" Starting from the principles of Recording Rules through SLO alerting, I'll walk through configuration examples you can put to use immediately in production.
Core Concepts
The Problem Recording Rules Solve
Grafana Loki executes LogQL against log streams. The problem is that when the same aggregation query runs across 5 dashboard panels and 3 alerting rules, Loki scans the raw logs 8 times, once per consumer. Duplicate costs accumulate in proportion to the number of queries, and the larger the log volume, the faster these costs compound.
Recording Rules solve this problem by having Loki's Ruler component evaluate a LogQL expression just once per configured interval and store the result as a Prometheus-compatible time series.
    Log stream (Loki) → Ruler (LogQL evaluation, once) → Remote Write → Mimir/Prometheus
                                                                              ↓
                                                          Grafana Alerting (metric-based alerts)

Ruler: A component in the Loki architecture that periodically evaluates alerting/recording rules. In microservices mode it runs as an independent process; in monolithic mode it operates as a built-in feature.
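To put rough numbers on the duplicate-scan cost, here is a back-of-the-envelope sketch in Python. It assumes every panel and alert rule re-runs its query once per refresh cycle and ignores any query caching; the function names and the 60-second refresh are illustrative assumptions, not Loki settings.

```python
def direct_scans_per_hour(consumers: int, refresh_seconds: int) -> int:
    """Raw-log scans when every consumer runs the LogQL query itself."""
    return consumers * (3600 // refresh_seconds)


def recorded_scans_per_hour(eval_interval_seconds: int) -> int:
    """Raw-log scans when only the Ruler evaluates the query."""
    return 3600 // eval_interval_seconds


consumers = 5 + 3  # 5 dashboard panels + 3 alerting rules
direct = direct_scans_per_hour(consumers, refresh_seconds=60)
recorded = recorded_scans_per_hour(eval_interval_seconds=60)
print(f"direct: {direct}/h, recorded: {recorded}/h, ratio: {direct // recorded}x")
```

Under these assumptions the saving scales linearly with the number of consumers, which is why the rules worth converting first are the ones shared by many panels and alerts.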
Two Types of Rules
There are two kinds of Loki Rules, and in practice you use them together.
| Type | Role | Output |
|---|---|---|
| Recording Rule | Periodically evaluates and stores complex LogQL aggregations | Prometheus time series metrics |
| Alerting Rule | Evaluates conditions (LogQL or metric queries) and fires alerts when thresholds are exceeded | Firing alert events |
I used to think I only needed alerting rules, but in actual operation the pattern of "expensive LogQL → Recording Rule to convert to metrics → write Alerting Rule based on those metrics" is far more stable.
Basic Configuration Structure
The core YAML structure looks like this: a Recording Rule followed by the alert that consumes its output.

    groups:
      - name: http_error_rates
        interval: 1m  # evaluation interval for this group
        rules:
          # Step 1: Pre-compute expensive LogQL as a metric
          - record: job:http_errors:rate5m
            expr: |
              sum by (job, status_code) (
                rate({job="api-server"} | json | status_code >= 400 [5m])
              )
          # Step 2: Configure alerting that references the pre-computed metric
          # (the threshold is in error lines per second, not a ratio)
          - alert: HighErrorRate
            expr: job:http_errors:rate5m > 0.05
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "Error rate {{ $value | humanize }} on {{ $labels.job }}"

The name in the record field follows the Prometheus naming convention level:metric:operations (here, job:http_errors:rate5m). As the number of metrics grows, following this convention makes management much easier.

One caveat: Loki's Ruler evaluates every expr as LogQL, so a rule that reads the recorded metric (Step 2) has to be evaluated where that metric is stored, for example as a Mimir/Prometheus rule or a Grafana-managed alert rule on that data source. The two steps are shown in one file here to make the flow easy to follow.
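If you want to enforce the naming convention in review or CI, a small check is enough. The sketch below is one possible house rule, not something Loki or Prometheus enforces: it assumes exactly three colon-separated parts, and the function name and regular expression are my own.

```python
import re

# Assumed house rule: level:metric:operations, three non-empty parts.
RECORD_NAME = re.compile(
    r"[a-zA-Z_][a-zA-Z0-9_]*:[a-zA-Z_][a-zA-Z0-9_]*:[a-zA-Z0-9_]+"
)


def follows_convention(name: str) -> bool:
    """True if the recorded metric name has the level:metric:operations shape."""
    return RECORD_NAME.fullmatch(name) is not None


for candidate in ("job:http_errors:rate5m", "http_errors_rate5m", "job:http-errors:rate5m"):
    print(candidate, follows_convention(candidate))
```

Names without colons or with characters that are invalid in metric names are rejected, which catches most accidental deviations early.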
Ruler Component Configuration
Enable the Ruler in the Loki configuration file and specify the Remote Write destination.
    ruler:
      enable_api: true
      evaluation_interval: 1m
      storage:
        type: local
        local:
          directory: /loki/rules
      remote_write:
        enabled: true
        client:
          url: http://mimir:9009/api/v1/push

Remote Write: The standard protocol in the Prometheus ecosystem for sending metrics to external storage. Mimir, Thanos, and VictoriaMetrics all support this protocol.
WAL (Write-Ahead Log): A log where the Ruler temporarily stores evaluation results locally before sending them via Remote Write. This allows unsent data to be recovered if the Ruler restarts, which is why a PersistentVolume connection is required.
Recent Features for Reducing Alert Noise (2025–2026)
There are two changes to Grafana Alerting that you must know about alongside Recording Rules. Understanding them upfront will help since they appear directly in the practical examples below.
Recovering state (May 2025): Instead of immediately transitioning to Normal once an alert condition clears, the alert stays in a Recovering state for the duration set in keep_firing_for. This is effective at suppressing flapping (the phenomenon where alerts repeatedly fire and resolve).
NoData/Error Pending Period (February 2026): Previously, an alert would fire immediately when data was missing or an evaluation error occurred. Now the Pending Period applies to these states as well, preventing temporary data gaps from causing false positives.
Practical Application
Example 1: API Error Rate Alert — Before and After Noise Reduction
This is the most common pattern seen in production. Many people start with a configuration like this:
    # Noisy configuration
    - alert: ApiErrors
      expr: count_over_time({job="api"} |= "ERROR" [1m]) > 5
      # no for → fires immediately if just 6 errors occur in a 1-minute window

Without the for parameter, an alert fires the moment the condition becomes true on any evaluation cycle. Things like errors that briefly spike right after a deployment or temporary timeouts from external dependencies all come through as alerts.
    # Improved configuration
    groups:
      - name: api_slo
        interval: 1m
        rules:
          # Step 1: Calculate error rate as a ratio (not absolute count)
          - record: job:api_error_rate:rate5m
            expr: |
              sum by (job) (rate({job="api"} | logfmt | level="error" [5m]))
              /
              sum by (job) (rate({job="api"} | logfmt [5m]))
          # Step 2: Ratio-based alert + Pending Period + Recovering
          - alert: ApiHighErrorRate
            expr: job:api_error_rate:rate5m > 0.01
            for: 10m
            keep_firing_for: 5m
            labels:
              severity: warning
            annotations:
              summary: "API error rate {{ $value | humanizePercentage }} (job={{ $labels.job }})"

The example above assumes the log format is logfmt (e.g., level=error msg="..."). Replace logfmt with the json parser or a |= "ERROR" line filter to match your actual log format. The aggregation uses sum by (job) because a bare sum() drops every label, which would leave {{ $labels.job }} in the summary empty.
| Setting | Role |
|---|---|
| rate()-based ratio | Error ratio remains stable regardless of traffic fluctuations |
| for: 10m | Only fires when sustained above 1% for 10 continuous minutes (ignores momentary spikes) |
| keep_firing_for: 5m | Maintains Recovering state for 5 minutes after the condition clears, preventing flapping |
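The effect of for and keep_firing_for is easiest to see by replaying a sequence of evaluations. The following Python sketch is a simplified model I wrote for illustration, not Grafana's or Loki's actual state machine: it counts evaluation cycles instead of wall-clock time, and the state names are only meant to mirror the ones used above.

```python
def simulate(conditions, for_steps, keep_steps):
    """Return the alert state after each evaluation cycle.

    conditions: one bool per cycle (True = threshold exceeded).
    for_steps / keep_steps: `for` and `keep_firing_for`, in cycles.
    """
    states = []
    true_streak = 0   # consecutive cycles with the condition true
    false_streak = 0  # consecutive false cycles while still firing
    firing = False
    for breached in conditions:
        if breached:
            true_streak += 1
            false_streak = 0
            if firing or true_streak > for_steps:
                firing = True
                states.append("firing")
            else:
                states.append("pending")
        else:
            true_streak = 0
            if firing:
                false_streak += 1
                if false_streak > keep_steps:
                    firing = False
                    states.append("normal")
                else:
                    states.append("recovering")
            else:
                states.append("normal")
    return states


spiky = [True, False, True, False, False]
sustained = [True] * 4 + [False] * 3
print(simulate(spiky, for_steps=0, keep_steps=0))      # fires twice
print(simulate(spiky, for_steps=2, keep_steps=0))      # never fires
print(simulate(sustained, for_steps=2, keep_steps=2))  # fires once, resolves once
```

With both settings at zero, the spiky input produces two separate notifications; with a pending period it produces none, while the sustained breach still fires and then resolves only once.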
Example 2: AWS ALB Access Logs → Metrics Pipeline
A pattern for extracting metrics directly from ALB access logs without a separate Prometheus exporter. Honestly, at first I thought "can this really work?" — but it does, and quite well.
    groups:
      - name: alb_metrics
        interval: 1m
        rules:
          - record: alb:request_count:rate5m
            expr: |
              sum by (target_group, status_code) (
                rate({job="alb-access-logs"}
                  | regexp `(?P<status_code>\d{3}) (?P<target_group>[^ ]+)` [5m])
              )
          # The regexp parser only extracts labels and does not drop lines,
          # so the 4xx/5xx selection is done by the label filter that follows it.
          - record: alb:error_rate:rate5m
            expr: |
              sum by (target_group) (
                rate({job="alb-access-logs"}
                  | regexp `(?P<status_code>\d{3}) (?P<target_group>[^ ]+)`
                  | status_code =~ "[45].." [5m])
              )
              /
              sum by (target_group) (
                rate({job="alb-access-logs"}
                  | regexp `(?P<status_code>\d{3}) (?P<target_group>[^ ]+)` [5m])
              )
          - alert: AlbHighErrorRate
            expr: alb:error_rate:rate5m > 0.05
            for: 5m
            labels:
              severity: critical

Caution: ALB access logs consist of 20+ space-delimited fields, so the regexp pattern above may not accurately extract the desired fields depending on the actual log format. Rather than copying this directly, it's recommended to first parse a real ALB log sample in Grafana Explore and verify that field extraction works correctly before use. Both sides of the error-rate division must extract target_group in the same way; otherwise the two sides of sum by (target_group) will not line up.
regexp pipeline: Extracts portions of a log line as labels using regular expressions in LogQL. Complex regular expressions increase Ruler evaluation costs, so prefer the json or logfmt parsers when possible.
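The caution about the ALB pattern can be checked offline before touching Loki. The sketch below uses Python's re module as a stand-in for LogQL's regexp parser (both accept this named-group syntax, though the engines differ). The sample line is synthetic, written to follow AWS's documented field order, and the anchored pattern is a hypothetical alternative of mine rather than a vetted production expression.

```python
import re

# Synthetic line in the documented ALB field order; every value is made up.
SAMPLE = (
    "https 2024-01-01T00:00:00.000000Z app/my-alb/50dc6c495c0c9188 "
    "192.0.2.10:46532 10.0.0.5:80 0.001 0.002 0.000 502 502 34 366 "
    '"GET https://example.com:443/ HTTP/1.1" "curl/7.46.0"'
)

# Pattern from the example: matches the first "3 digits, space, token" it finds.
NAIVE = re.compile(r"(?P<status_code>\d{3}) (?P<target_group>[^ ]+)")
# Hypothetical alternative: skip the 8 fields that precede elb_status_code.
ANCHORED = re.compile(r"^(?:\S+ ){8}(?P<status_code>\d{3}) ")

naive = NAIVE.search(SAMPLE)
anchored = ANCHORED.search(SAMPLE)
print("naive:", naive.group("status_code"), naive.group("target_group"))
print("anchored:", anchored.group("status_code"))
```

On this sample the unanchored pattern latches onto the tail of the load balancer ID and the client address instead of the status code, which is exactly the kind of silent mis-extraction worth catching before a rule starts writing metrics.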
Example 3: Multi-Window Multi-Burn-Rate SLO Alerting
The following is an SLO alerting pattern at the level applied by SRE teams. If this is your first time implementing it, it's recommended to stabilize Examples 1 and 2 first before referencing this one.
This is a pattern for implementing a 99.9% availability SLO (monthly allowed error budget: approximately 43.8 minutes) using log data. It measures far more precisely than simple thresholds — specifically, "how quickly is the current error rate burning through the error budget?"
The burn rate multipliers may feel unfamiliar at first. Intuitively: 14.4x means the monthly error budget would be exhausted in about 2 days at this rate, and 6x means it would be exhausted in about 5 days. Fast burn rates are detected immediately using short windows, while slow but sustained burn rates are caught with long windows.
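The numbers quoted above follow from simple arithmetic, sketched here in Python. The 30-day window and the function names are assumptions for illustration; the 43.8-minute figure corresponds to an average calendar month (365.25 / 12 days) rather than exactly 30 days.

```python
SLO = 0.999
BUDGET = 1 - SLO  # allowed error ratio, about 0.001


def error_budget_minutes(days: float) -> float:
    """Minutes of full outage the SLO tolerates over the given window."""
    return BUDGET * days * 24 * 60


def days_to_exhaustion(burn_rate: float, window_days: float = 30) -> float:
    """How long the budget lasts if errors keep arriving at this burn rate."""
    return window_days / burn_rate


def alert_threshold(burn_rate: float) -> float:
    """Error ratio that corresponds to the given burn rate."""
    return burn_rate * BUDGET


print(round(error_budget_minutes(30), 1))           # 30-day window
print(round(error_budget_minutes(365.25 / 12), 1))  # average calendar month
for rate in (14.4, 6):
    print(rate, round(days_to_exhaustion(rate), 2), round(alert_threshold(rate), 4))
```

The thresholds this produces, 0.0144 and 0.006, are the same values written as 14.4 * 0.001 and 6 * 0.001 in the alert expressions.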
    groups:
      - name: api_slo_burn_rate
        interval: 1m
        rules:
          # Pre-compute error rate metrics for various time windows
          - record: job:api_error_rate:rate5m
            expr: |
              sum by (job) (rate({job="api"} | logfmt | level="error" [5m]))
              / sum by (job) (rate({job="api"} | logfmt [5m]))
          - record: job:api_error_rate:rate1h
            expr: |
              sum by (job) (rate({job="api"} | logfmt | level="error" [1h]))
              / sum by (job) (rate({job="api"} | logfmt [1h]))
          - record: job:api_error_rate:rate6h
            expr: |
              sum by (job) (rate({job="api"} | logfmt | level="error" [6h]))
              / sum by (job) (rate({job="api"} | logfmt [6h]))
          # Fast-burn: both 5m + 1h windows exceeded → immediate action required
          - alert: SLO_FastBurn
            expr: |
              job:api_error_rate:rate5m > (14.4 * 0.001)
              and
              job:api_error_rate:rate1h > (14.4 * 0.001)
            for: 2m
            labels:
              severity: critical
              slo: api_availability
          # Slow-burn: 6h window exceeded → error budget is slowly draining
          - alert: SLO_SlowBurn
            expr: job:api_error_rate:rate6h > (6 * 0.001)
            for: 1h
            labels:
              severity: warning
              slo: api_availability

| Alert | Window | Multiplier | Meaning |
|---|---|---|---|
| SLO_FastBurn | 5m + 1h | 14.4x | At this rate, the monthly error budget will be exhausted in about 2 days (immediate action required) |
| SLO_SlowBurn | 6h | 6x | At this rate, the monthly error budget will be exhausted in about 5 days (dangerous if left unaddressed) |
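The reason for connecting two windows with and is to prevent false positives from short-term spikes. The 5-minute window alone fires on temporary error bursts, while the 1-hour window alone keeps firing long after an incident has ended, because the old errors remain in its average. Requiring both means a significant share of the budget has burned and is still burning.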
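A small Python sketch shows how the and condition behaves in three typical situations. The error ratios in the scenarios are invented for illustration, and the function is a plain restatement of the SLO_FastBurn expression rather than anything Grafana provides.

```python
BUDGET = 0.001    # allowed error ratio for a 99.9% SLO
FAST_BURN = 14.4  # burn-rate multiplier used by SLO_FastBurn


def fast_burn_fires(rate_5m: float, rate_1h: float) -> bool:
    """Both windows must exceed the same burn-rate threshold."""
    threshold = FAST_BURN * BUDGET
    return rate_5m > threshold and rate_1h > threshold


# (5-minute error ratio, 1-hour error ratio); made-up values
scenarios = {
    "brief spike": (0.05, 0.004),
    "sustained incident": (0.05, 0.03),
    "just recovered": (0.001, 0.03),
}
for name, (r5m, r1h) in scenarios.items():
    print(f"{name}: {fast_burn_fires(r5m, r1h)}")
```

Only the sustained incident satisfies both windows; the brief spike is filtered out by the 1-hour window and the recovered service by the 5-minute window.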
Pros and Cons Analysis
Advantages
| Item | Description |
|---|---|
| Reduced query costs | Reduces log scans to once when the same LogQL is referenced across multiple dashboards and alerts |
| Dashboard response speed | Reads pre-computed time series, so aggregating tens of GB of logs becomes millisecond-level responses |
| Noise suppression | The for + keep_firing_for combination structurally blocks momentary spike alerts and flapping |
| SLO implementation | Can build a Prometheus SLO ecosystem (burn rate, error budget) using only log data |
| Cloud cost reduction | Reduces dependency on external metric services like CloudWatch (Qonto case study) |
| UI management | Recording Rules can be created and managed directly in the Grafana UI (operable without Terraform/YAML) |
Disadvantages and Caveats
| Item | Description | Mitigation |
|---|---|---|
| Additional Ruler infrastructure | Requires operating a separate Ruler component. Setting up the Ruler StatefulSet on initial adoption takes more work than expected. | Recommend StatefulSet + PV (WAL) setup; manage with Helm grafana/loki chart |
| No retroactive computation | Metrics don't exist for periods before the rule was activated | Ensure sufficient warm-up time after rule deployment before making dashboards public |
| Label cardinality | Including high-cardinality labels (request IDs, user IDs, etc.) causes metric storage costs to explode. I learned this the hard way after adding user_id as a label and watching Mimir die from OOM. | Select only the necessary labels in sum by() |
| Evaluation interval design | Short intervals increase Ruler load; long intervals delay alerts | Separate groups by criticality and assign different intervals |
| LogQL complexity | Nesting unwrap, json, and regexp increases Ruler evaluation costs | Narrow the stream with label selectors and line filters before applying parsers |
Label Cardinality: The number of unique possible values that a metric's labels can take. Including labels like user_id whose values grow unboundedly causes the number of time series to explode, driving up both storage and query costs.
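The growth is multiplicative, which a few lines of Python make obvious. The label counts below are invented for illustration, and the result is an upper bound, since not every combination of values necessarily occurs.

```python
from math import prod


def max_series(label_values: dict[str, int]) -> int:
    """Upper bound on time series: product of distinct values per label."""
    return prod(label_values.values())


bounded = {"job": 3, "status_code": 5}          # hypothetical counts
unbounded = {**bounded, "user_id": 100_000}     # one unbounded label added
print(max_series(bounded), max_series(unbounded))
```

A single unbounded label turns a handful of series into more than a million, so it is worth running this kind of estimate before adding any label to sum by().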
Most Common Mistakes in Production
- Omitting the for parameter: Many people leave out for when first writing alerting rules. The default value is 0, so the alert fires the instant the condition becomes true, and every momentary spike becomes an alert. It's recommended to start with at least for: 5m.
- Setting thresholds with absolute counts: Setting thresholds with absolute values like count_over_time(...) > 100 will cause alerts to fire in normal conditions when traffic increases. Switching to an error ratio (rate(error) / rate(total)) produces stable alerts regardless of traffic fluctuations.
- Including high-cardinality labels in Recording Rules: Adding an aggregation like by (user_id, request_id) to a Recording Rule causes the number of metric time series to explode to the order of (number of users) × (number of requests). Recording Rules are best suited for reducing aggregated results to only meaningful dimensions (job, service, status_code level).
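The second mistake above, absolute counts versus ratios, can be illustrated with a short Python comparison. The traffic volumes and thresholds are made-up numbers chosen to show both failure modes of a count-based threshold.

```python
COUNT_THRESHOLD = 100   # errors per window, like count_over_time(...) > 100
RATIO_THRESHOLD = 0.01  # 1% of requests


def evaluate(total_requests: int, error_ratio: float) -> tuple[bool, bool]:
    """Return (count-based alert fires, ratio-based alert fires)."""
    errors = total_requests * error_ratio
    return errors > COUNT_THRESHOLD, error_ratio > RATIO_THRESHOLD


print(evaluate(10_000, 0.005))   # normal traffic, healthy ratio
print(evaluate(100_000, 0.005))  # 10x traffic, same healthy ratio
print(evaluate(1_000, 0.05))     # quiet period, 5% of requests failing
```

The count-based rule fires during a healthy traffic peak and stays silent during a quiet-hours incident, while the ratio-based rule gets both cases right.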
Closing Thoughts
After applying these configurations, the first thing that changed was the frequency of late-night on-call pages. Once the momentary spike alerts disappeared after adding for: 10m and keep_firing_for: 5m, I actually developed a sense of trust that when an alert did fire, "this is something I need to go check." When alert quality goes up, team response speed follows.
Three steps you can start right now:
- Check the for parameter in your current Loki alerting rules. If you have rules with no for or with for: 0, simply changing them to for: 5m will significantly reduce noise. You can run the rule file through lokitool first (for example, lokitool rules lint <rules.yaml>) to catch formatting problems before deploying.
- Move your most frequently queried LogQL expressions to Recording Rules. You can start by finding slow queries in Grafana Explore, converting them to record: blocks, and replacing dashboard panels with references to the pre-computed metrics.
- Replace one simple threshold alert with a Multi-Window Burn Rate alert. Copying the YAML from Example 3 above and changing only the job label to your own service name is enough to experience the basic SLO alerting structure. However, it's recommended to first verify in Grafana Explore that the logfmt parser matches your actual log format.
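For the first step, a short script can list the alerts that still lack a pending period. The Python sketch below assumes the rule file has already been parsed into a dictionary (for instance with a YAML library, not shown here); the function name and the sample rules are hypothetical.

```python
ZERO_DURATIONS = {None, 0, "0", "0s", "0m", "0h"}


def alerts_without_pending_period(rule_file: dict) -> list[str]:
    """Names of alerting rules whose `for` is missing or zero."""
    missing = []
    for group in rule_file.get("groups", []):
        for rule in group.get("rules", []):
            # Recording rules have no `alert` key and are skipped.
            if "alert" in rule and rule.get("for") in ZERO_DURATIONS:
                missing.append(rule["alert"])
    return missing


rules = {
    "groups": [{
        "name": "api_slo",
        "rules": [
            {"record": "job:api_error_rate:rate5m", "expr": "sum by (job) (...)"},
            {"alert": "ApiErrors", "expr": 'count_over_time({job="api"} |= "ERROR" [1m]) > 5'},
            {"alert": "ApiHighErrorRate", "expr": "job:api_error_rate:rate5m > 0.01", "for": "10m"},
        ],
    }]
}
print(alerts_without_pending_period(rules))
```

Running something like this across all rule files gives a concrete to-do list, which is usually easier to act on than a general resolution to "add for everywhere".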
References
- Alerting and recording rules | Grafana Loki documentation
- Manage recording rules | Grafana Loki documentation
- Create recording rules | Grafana documentation
- Alert rule evaluation | Grafana documentation
- Pending period added to alert states | Grafana Labs
- Grafana-managed alert rule "Recovering" state | Grafana Labs
- When SLOs reduce alert noise | Grafana documentation
- How to implement multi-window, multi-burn-rate alerts with Grafana Cloud | Grafana Labs
- Grafana Loki: performance optimization with Recording Rules, caching, and parallel queries
- Grafana Loki: LogQL and Recording Rules for metrics from AWS Load Balancer logs (ITNEXT)
- Building a cost-efficient network observability platform using Grafana Loki | Qonto (Medium)
- Grafana Alerting best practices
- Grafana Loki: Optimising log based metrics — DEV Community