Building a P99 Latency & Error Rate SLO Dashboard — A Practical Guide to Grafana Loki LogQL
Without the cost of adopting an APM tool or modifying your code, you can measure P99 latency and error rates using only the logs you already have, and build an SLO dashboard around them. Grafana Loki provides the ability to transform logs into metrics through a query language called LogQL, with the unwrap operator and the avg_over_time and quantile_over_time functions at its core.
This guide is for developers and SREs who have experience operating Grafana and Loki (2.3+) and want to quickly establish an SLO framework before adopting an APM solution. If you already have structured logs in JSON or logfmt format being collected in Loki, you can run your first P99 query within 5 minutes.
This guide walks you step by step through the entire flow — from HTTP API latency SLOs to Recording Rules based on AWS ALB logs and Grafana dashboard panel configuration — covering how to extract reliable SLIs from logs and visualize them.
Table of Contents
- Understanding the Concepts
- Practical Application
- Operational Considerations
- Closing Thoughts
- References
Understanding the Concepts
The 3 Components of SLO
Before building a Loki-based SLO dashboard, it helps to briefly review three key concepts.
| Component | Description | Example |
|---|---|---|
| SLI (Service Level Indicator) | A measurement metric that represents service quality | P99 response time, error rate |
| SLO (Service Level Objective) | A target value for an SLI | P99 < 500ms, error rate < 1% |
| Error Budget | The allowable margin of failure while meeting the SLO | 43.2 minutes over 30 days (for a 99.9% SLO) |
With Loki, you can extract SLIs directly from existing logs, eliminating the need to add separate instrumentation code.
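The error-budget figure in the table above follows from simple arithmetic; a minimal sketch:

```python
# Error budget for a 99.9% SLO over a 30-day window:
# budget fraction = 1 - SLO target; budget minutes = fraction * window minutes.
slo_target = 0.999
window_days = 30

budget_fraction = 1 - slo_target        # 0.001
window_minutes = window_days * 24 * 60  # 43200
budget_minutes = budget_fraction * window_minutes

print(round(budget_minutes, 1))  # → 43.2
```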
unwrap — The Core Operator for Transforming Log Fields into Metrics
`unwrap` is an operator that extracts numeric values from parsed log fields and passes them to metric functions. It must be placed after a parser such as `| json`, `| logfmt`, or `| regexp`.
```
{app="api-server", env="production"}
| json
| unwrap response_time_ms [5m]
```

> **What is `unwrap`?** It is a LogQL operator that extracts a parsed numeric field from a log line and passes it to range aggregation functions like `avg_over_time` and `quantile_over_time`. Without `unwrap`, you cannot treat logs as numeric metrics.
Depending on your log format, you can choose a parser as shown below.
```
# JSON parser
{app="api"} | json | unwrap response_time_ms

# Logfmt parser
{app="api"} | logfmt | unwrap latency

# Regex parser
{app="api"} | regexp `(?P<duration>\d+)ms` | unwrap duration

# Pattern parser (Nginx access log example)
{app="nginx"} | pattern `<_> - - <_> "<method> <_> <_>" <status> <bytes>` | unwrap bytes
```

avg_over_time and quantile_over_time
Both functions aggregate the values extracted by unwrap over a specified time range.
```
# Average response time
avg_over_time(
  {app="api-server", env="production"}
  | json
  | unwrap response_time_ms [5m]
) by (service)

# P95 latency
quantile_over_time(0.95,
  {app="api-server", env="production"}
  | json
  | unwrap response_time_ms [5m]
) by (service)

# P99 latency + remove parse errors (recommended pattern for production)
quantile_over_time(0.99,
  {container="ingress-nginx"}
  | json
  | unwrap response_latency_seconds
  | __error__="" [1m]
) by (cluster)
```

| Function | Use Case | φ Value Examples |
|---|---|---|
| `avg_over_time(...)` | Calculating averages (average response time, throughput) | — |
| `quantile_over_time(φ, ...)` | Calculating the φ-quantile | P50=0.5, P95=0.95, P99=0.99 |
> **`__error__=""`** If a field cannot be converted to a number during `unwrap`, a parse error log is generated. Adding the `| __error__=""` filter removes these error logs, improving the accuracy of your SLI values. It is recommended to include this in production environments.
`quantile_over_time` computes exact quantiles. However, for endpoints with low traffic, the sample count itself may be insufficient, leading to lower statistical confidence. Before setting SLO thresholds, it is worth checking whether enough samples are available.
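To see why sample count matters, consider how an exact φ-quantile behaves on a small sample. A rough illustration (the nearest-rank convention below is one common choice, not necessarily what Loki uses internally):

```python
import math

def exact_quantile(values, phi):
    """Exact phi-quantile via the nearest-rank method (one common convention)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(phi * len(ordered)))  # 1-based rank
    return ordered[rank - 1]

# With fewer than 100 samples, the P99 is simply the maximum observation,
# so a single outlier fully determines the SLI.
small_sample = [100] * 49 + [5000]  # 50 requests, one slow outlier
print(exact_quantile(small_sample, 0.99))  # → 5000
```

In other words, for a 50-request window the "P99" is really just the slowest request, which is why sparse endpoints produce noisy SLIs.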
Criteria for Choosing an Aggregation Window
When selecting a time range window such as [5m], you need to adjust it to fit the characteristics of your service.
| Window | Suitable Situation | Caveats |
|---|---|---|
| `[1m]` or less | High-frequency APIs with heavy traffic | Low quantile reliability for endpoints with few log entries |
| `[5m]` | Most production services | Good balance between detecting latency spikes and statistical accuracy |
| `[15m]` or more | Low-traffic batch services | Slower to detect sudden latency spikes |
In practice, [5m] is widely used as a default because windows of 1 minute or less suffer from low quantile accuracy due to sparse logs, while windows of 15 minutes or more are slow to detect latency spikes. Adjust this to fit the traffic characteristics of your own service.
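A quick way to sanity-check a window choice is to estimate how many log lines fall inside it. The ~100-sample threshold below is a rule of thumb assumed for illustration, not a Loki requirement:

```python
import math

MIN_SAMPLES_FOR_P99 = 100  # rule-of-thumb threshold (assumption)

def min_window_seconds(requests_per_second, min_samples=MIN_SAMPLES_FOR_P99):
    """Smallest window that captures enough samples at the given traffic rate."""
    return math.ceil(min_samples / requests_per_second)

print(min_window_seconds(10))    # → 10   (heavy traffic: even [1m] is plenty)
print(min_window_seconds(0.25))  # → 400  (sparse traffic: needs more than [5m])
```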
Now let's apply these concepts to real SLO scenarios.
Practical Application
Example 1: HTTP API Latency SLO
Scenario: P99 response time under 500ms, maintained 99.9% of the time over a 30-day period
```
# Replace label names ({app="order-api"}, field name duration_ms, etc.) to match your actual environment.

# SLI: P99 latency
quantile_over_time(0.99,
  {namespace="production", app="order-api"}
  | json
  | unwrap duration_ms
  | __error__="" [5m]
) by (endpoint)

# SLI: P95 latency
quantile_over_time(0.95,
  {namespace="production", app="order-api"}
  | json
  | unwrap duration_ms
  | __error__="" [5m]
) by (endpoint)

# SLI: Average latency
avg_over_time(
  {namespace="production", app="order-api"}
  | json
  | unwrap duration_ms
  | __error__="" [5m]
) by (endpoint)
```

| Query Component | Role |
|---|---|
| `{namespace="production", app="order-api"}` | Selects logs for a specific app in the production environment |
| `\| json` | Parses logs in JSON format |
| `\| unwrap duration_ms` | Extracts the `duration_ms` field as a numeric value |
| `\| __error__=""` | Removes parse error logs |
| `[5m]` | Aggregates over a 5-minute window |
| `by (endpoint)` | Splits results by endpoint |
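What the `| json | unwrap duration_ms | __error__=""` stage of this pipeline does can be approximated in a few lines of Python. This is a simplified mental model, not Loki's implementation:

```python
import json

raw_logs = [
    '{"endpoint": "/orders", "duration_ms": 120}',
    '{"endpoint": "/orders", "duration_ms": 850}',
    '{"endpoint": "/orders", "duration_ms": "n/a"}',  # non-numeric -> __error__
    'not json at all',                                # parse failure -> __error__
]

def unwrap_field(lines, field):
    """Keep only entries whose field converts to a number (__error__ == "")."""
    values = []
    for line in lines:
        try:
            values.append(float(json.loads(line)[field]))
        except (ValueError, KeyError):
            continue  # the | __error__="" filter drops these entries
    return values

print(unwrap_field(raw_logs, "duration_ms"))  # → [120.0, 850.0]
```

Without the error filter, the two malformed entries would surface as parse-error streams and distort any aggregation built on top.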
Example 2: Error Rate SLO
Error rate can be calculated without unwrap by combining the rate() function with label filters.
```
# Numeric label comparison after JSON parsing (>= 500) is supported in Loki 2.3+.
# For older environments, you can substitute with | status_code !~ "1..|2..|3..|4.."
(
  sum(rate(
    {app="order-api"} | json | status_code >= 500 [5m]
  )) by (service)
  /
  clamp_min(
    sum(rate({app="order-api"} | json [5m])) by (service),
    0.001
  )
) * 100
```

> **Handling a zero denominator** When there is no traffic, the denominator can become 0, resulting in `NaN` or infinity values. It is recommended to specify a minimum denominator value with `clamp_min(..., 0.001)`, or to set a default value using `or vector(0)`. Without this handling, alerts may malfunction.
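The effect of the clamped denominator can be mimicked directly; a toy model of the query arithmetic:

```python
def error_rate_percent(errors_per_sec, total_per_sec, floor=0.001):
    """Error rate with a clamped denominator, mirroring clamp_min(..., 0.001)."""
    return errors_per_sec / max(total_per_sec, floor) * 100

# Normal traffic: 2 errors/s out of 100 req/s -> 2%.
print(error_rate_percent(2, 100))  # → 2.0

# No traffic at all: without the clamp this would be 0/0 = NaN;
# with the clamp it degrades gracefully to 0%.
print(error_rate_percent(0, 0))    # → 0.0
```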
Example 3: Recording Rules Based on AWS ALB Logs
Instead of running expensive queries on the dashboard every time, you can significantly improve query performance by using Recording Rules to pre-compute results and store them as time-series metrics.
```yaml
# loki-rules.yaml
groups:
  - name: alb_slo_rules
    interval: 1m  # Recording Rule execution interval
    rules:
      - record: job:alb_request_duration_seconds:p99
        expr: |
          quantile_over_time(0.99,
            {job="alb-logs"}
            | logfmt
            | unwrap target_processing_time
            | __error__="" [5m]
          ) by (target_group)
      - record: job:alb_error_rate:ratio
        expr: |
          sum(rate({job="alb-logs"} | logfmt | elb_status_code >= 500 [5m]))
          /
          clamp_min(
            sum(rate({job="alb-logs"} | logfmt [5m])),
            0.001
          )
```

> **What are Recording Rules?** They are a feature that pre-computes frequently executed, expensive LogQL queries and stores the results as new time-series metrics. This is the same concept as Prometheus Recording Rules — since the dashboard queries already-computed metrics, the query load is greatly reduced.

> **The relationship between `interval` and the range window** With `interval: 1m`, the rule runs every minute. If the query window is `[5m]`, there will be 4 minutes of overlapping data between consecutive execution results. This is intentional behavior, but it can lead to double-counting depending on your aggregation method, so caution is warranted. To avoid overlap, you can set `interval: 5m` to match the window size.
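The overlap between consecutive evaluations is easy to enumerate (illustrative only):

```python
from datetime import datetime, timedelta

interval = timedelta(minutes=1)  # rule execution interval
window = timedelta(minutes=5)    # [5m] range window

start = datetime(2025, 1, 1, 12, 0)
for i in range(3):
    eval_time = start + i * interval
    # Each evaluation looks back over the full window ending at eval_time.
    print(f"eval at {eval_time:%H:%M} covers {eval_time - window:%H:%M}-{eval_time:%H:%M}")

# Consecutive evaluations share (window - interval) of data.
print(window - interval)  # → 0:04:00
```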
Example 4: Grafana Dashboard Panel Configuration
It is recommended to structure your SLO dashboard with the following four panels.
```
┌────────────────────────────────────────────┐
│ SLO Status Panel                           │
│ - Current SLI value (Stat Panel)           │
│ - Compliance rate vs. SLO target (Gauge)   │
├────────────────────────────────────────────┤
│ Latency Trend                              │
│ - P50 / P95 / P99 Time Series              │
│ - avg_over_time average overlay            │
├────────────────────────────────────────────┤
│ Error Budget                               │
│ - Remaining error budget (%) — Stat Panel  │
│ - Burn rate time series — Time Series      │
├────────────────────────────────────────────┤
│ Burn Rate Alert Status                     │
│ - Fast-burn (1h/6h windows)                │
│ - Slow-burn (24h/72h windows)              │
└────────────────────────────────────────────┘
```

For each panel's data source, reference the `job:alb_request_duration_seconds:p99` and `job:alb_error_rate:ratio` metrics stored via Recording Rules in Example 3.
Burn Rate is the multiplier that compares the current rate of error consumption against the SLO's allowance. For example, at a 99.9% SLO, a Burn Rate of 1 means the error budget is being exhausted at exactly the 30-day rate, while a Burn Rate of 10 means it would be exhausted in 3 days. The multi-window burn rate pattern from the Google SRE Workbook is a dual-alerting strategy that uses short windows (1h/6h) to detect sudden outages and long windows (24h/72h) to detect gradual service degradation.
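The burn-rate arithmetic from the paragraph above, as a minimal sketch:

```python
def days_to_exhaustion(burn_rate_value, slo_window_days=30):
    """At burn rate B, a budget sized for the SLO window lasts window / B days."""
    return slo_window_days / burn_rate_value

print(days_to_exhaustion(1))   # → 30.0  (consuming budget exactly on schedule)
print(days_to_exhaustion(10))  # → 3.0   (budget gone in 3 days)

def burn_rate(observed_error_ratio, slo_target=0.999):
    """How many times faster than 'allowed' errors are currently occurring."""
    return observed_error_ratio / (1 - slo_target)

# A 1% observed error rate against a 99.9% SLO burns budget at 10x.
print(round(burn_rate(0.01), 1))  # → 10.0
```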
Operational Considerations
Advantages
| Item | Details |
|---|---|
| No code changes | SLIs can be extracted from existing logs — no app instrumentation required |
| Cost-efficient | Loki can be operated standalone without Prometheus or a separate metrics store |
| Flexible field extraction | Supports diverse log formats via json, logfmt, regexp, and pattern parsers |
| Historical analysis | Past logs can be re-queried to calculate SLIs retroactively |
| Granular SLOs | Easy to segment by service, endpoint, or region using label-based grouping |
Disadvantages and Caveats
| Item | Details | Mitigation |
|---|---|---|
| Performance overhead | `quantile_over_time` is memory-intensive and may time out with high log volumes | Apply pre-computation with Recording Rules |
| Scalability constraints | Because exact quantiles are computed, memory usage grows linearly as the time range widens | Use short aggregation windows (1m–5m) and persist results with Recording Rules |
| Sparse data accuracy | For endpoints with low traffic, insufficient sample counts can lower quantile reliability | Adjust the label aggregation level to the service level |
| Single-field limitation | Only one field can be extracted per `unwrap` expression | Use separate queries when multiple metrics are needed |
| Known bug | Combining with outer aggregation functions (`avg by (...) (quantile_over_time(...))`) causes errors in certain Loki versions (Issue #13793) | Verify your Loki version and avoid outer aggregations |
| Log structure dependency | SLI quality is entirely dependent on the consistency of the log format | Update queries whenever the log format changes |
> **What is Cardinality?** It refers to the number of unique values for a label. The more granular your `by (endpoint)` clause, the more the number of time series explodes. For example, 200 endpoints means a single Recording Rule generates 200 time series, which directly increases memory and index costs in Loki and Grafana. If you have many endpoints, it is recommended to reduce the number of aggregation labels or normalize label values.
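The series-count math behind this warning, using hypothetical label counts:

```python
# Each combination of grouping-label values becomes one stored time series.
endpoints = 200  # cardinality of by (endpoint)        (hypothetical)
regions = 3      # adding by (endpoint, region) multiplies (hypothetical)
rules = 2        # e.g. one p99 rule + one error-rate rule

series_endpoint_only = endpoints * rules
series_with_region = endpoints * regions * rules

print(series_endpoint_only)  # → 400
print(series_with_region)    # → 1200
```

Every extra grouping label is a multiplier, which is why normalizing label values (e.g. collapsing `/orders/123` into `/orders/:id`) pays off quickly.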
The Most Common Mistakes in Practice
- **Omitting the `__error__=""` filter** — Unwrap parse error logs get included in SLI values, distorting the results. This is especially problematic when some log entries are missing the field entirely or contain mixed string values, which can cause values to spike dramatically.
- **Omitting zero-denominator handling in error rate queries** — During low-traffic hours such as late at night, `NaN` or `+Inf` values can appear, causing alerts to malfunction or leaving gaps in the dashboard. Using `clamp_min` or `or vector(0)` prevents this issue.
- **Ignoring the relationship between a Recording Rule's `interval` and the range window size** — Using `interval: 1m` with a `[5m]` window causes overlapping data references. First verify whether this is your intended aggregation behavior, and if you want overlap-free operation, consider aligning the interval with the window size.
Closing Thoughts
By combining Grafana Loki's unwrap, avg_over_time, and quantile_over_time, you can build a reliable SLO dashboard from your existing production logs without any code changes.
Here are 3 steps you can take right now to get started.
- **Identify numeric fields in your existing logs.** In Grafana Explore, run `{app="your-app-name"} | json` and check whether there are parseable numeric fields such as `response_time_ms`, `duration`, or `latency`.
- **Modify the P99 query from Example 1 to fit your app and run it.** You can see your first SLI value simply by replacing the `app`, `namespace`, and the target field name for `unwrap`.
- **Register the validated query as a Recording Rule.** Save it in `loki-rules.yaml` as `record: job:your_api_p99_latency_ms`, then connect this metric to a Time Series panel and a Gauge panel in your Grafana dashboard.
Your logs are already in your infrastructure — from now on, those logs become your SLO instrument panel.
How to automate the Fast-burn/Slow-burn dual-alerting strategy by integrating the Grafana SLO plugin (grafana-slo-app) with Alertmanager will be covered in the next post.
References
- Metric queries | Grafana Loki documentation
- Alerting and recording rules | Grafana Loki documentation
- How to use LogQL range aggregations in Loki | Grafana Labs
- Best practices for Grafana SLOs | Grafana Cloud documentation
- Introduction to Grafana SLO | Grafana Cloud documentation
- Create Grafana Loki SLO | Nobl9 Documentation
- Grafana Loki: LogQL and Recording Rules for metrics from AWS Load Balancer logs | ITNEXT
- Loki Community Call July 2025: How to generate metrics from logs | Grafana Community
- A Comprehensive Guide to Log Query Language (LogQL) | DEV Community
- Avg with quantile_over_time throws an error · Issue #13793 · grafana/loki
- How to Build SLO Dashboards with Loki Logs | OneUptime