Building a P99 Latency & Error Rate SLO Dashboard — A Practical Guide to Grafana Loki LogQL
Without the cost of adopting an APM tool or modifying your code, you can measure P99 latency and error rates using only the logs you already have, and build an SLO dashboard around them. Grafana Loki provides the ability to transform logs into metrics through a query language called LogQL, with the unwrap operator and the avg_over_time and quantile_over_time functions at its core.
This guide is for developers and SREs who have experience operating Grafana and Loki (2.3+) and want to quickly establish an SLO framework before adopting an APM solution. If you already have structured logs in JSON or logfmt format being collected in Loki, you can run your first P99 query within 5 minutes.
This guide walks you step by step through the entire flow — from HTTP API latency SLOs to Recording Rules based on AWS ALB logs and Grafana dashboard panel configuration — covering how to extract reliable SLIs from logs and visualize them.
Table of Contents
- Understanding the Concepts
- Practical Application
- Operational Considerations
- Closing Thoughts
- References
Understanding the Concepts
The 3 Components of SLO
Before building a Loki-based SLO dashboard, it helps to briefly review three key concepts.
| Component | Description | Example |
|---|---|---|
| SLI (Service Level Indicator) | A measurement metric that represents service quality | P99 response time, error rate |
| SLO (Service Level Objective) | A target value for an SLI | P99 < 500ms, error rate < 1% |
| Error Budget | The allowable margin of failure while meeting the SLO | 43.2 minutes over 30 days (for a 99.9% SLO) |
With Loki, you can extract SLIs directly from existing logs, eliminating the need to add separate instrumentation code.
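The error-budget figure in the table above follows from simple arithmetic; a minimal sketch:

```python
# Error budget for a 99.9% SLO over a 30-day window:
# budget fraction = 1 - SLO target; budget minutes = fraction * window minutes.
slo_target = 0.999
window_days = 30

budget_fraction = 1 - slo_target        # 0.001
window_minutes = window_days * 24 * 60  # 43200
budget_minutes = budget_fraction * window_minutes

print(round(budget_minutes, 1))  # → 43.2
```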
unwrap — The Core Operator for Transforming Log Fields into Metrics
`unwrap` is an operator that extracts numeric values from parsed log fields and passes them to metric functions. It must be placed after a parser such as `| json`, `| logfmt`, or `| regexp`.
```
{app="api-server", env="production"}
| json
| unwrap response_time_ms [5m]
```

> **What is `unwrap`?** It is a LogQL operator that extracts a parsed numeric field from a log line and passes it to range aggregation functions like `avg_over_time` and `quantile_over_time`. Without `unwrap`, you cannot treat logs as numeric metrics.
Depending on your log format, you can choose a parser as shown below.
```
# JSON parser
{app="api"} | json | unwrap response_time_ms

# Logfmt parser
{app="api"} | logfmt | unwrap latency

# Regex parser
{app="api"} | regexp `(?P<duration>\d+)ms` | unwrap duration

# Pattern parser (Nginx access log example)
{app="nginx"} | pattern `<_> - - <_> "<method> <_> <_>" <status> <bytes>` | unwrap bytes
```

avg_over_time and quantile_over_time
Both functions aggregate the values extracted by unwrap over a specified time range.
```
# Average response time
avg_over_time(
  {app="api-server", env="production"}
  | json
  | unwrap response_time_ms [5m]
) by (service)

# P95 latency
quantile_over_time(0.95,
  {app="api-server", env="production"}
  | json
  | unwrap response_time_ms [5m]
) by (service)

# P99 latency + remove parse errors (recommended pattern for production)
quantile_over_time(0.99,
  {container="ingress-nginx"}
  | json
  | unwrap response_latency_seconds
  | __error__="" [1m]
) by (cluster)
```

| Function | Use Case | φ Value Examples |
|---|---|---|
| `avg_over_time(...)` | Calculating averages (average response time, throughput) | — |
| `quantile_over_time(φ, ...)` | Calculating the φ-quantile | P50=0.5, P95=0.95, P99=0.99 |
> **`__error__=""`** If a field cannot be converted to a number during `unwrap`, a parse error log is generated. Adding the `| __error__=""` filter removes these error logs, improving the accuracy of your SLI values. It is recommended to include this in production environments.
`quantile_over_time` computes exact quantiles. However, for endpoints with low traffic, the sample count itself may be insufficient, leading to lower statistical confidence. Before setting SLO thresholds, it is worth checking whether enough samples are available.
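To see why sample count matters, consider how an exact φ-quantile behaves on a small sample. A rough illustration (the nearest-rank convention below is one common choice, not necessarily what Loki uses internally):

```python
import math

def exact_quantile(values, phi):
    """Exact phi-quantile via the nearest-rank method (one common convention)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(phi * len(ordered)))  # 1-based rank
    return ordered[rank - 1]

# With fewer than 100 samples, the P99 is simply the maximum observation,
# so a single outlier fully determines the SLI.
small_sample = [100] * 49 + [5000]  # 50 requests, one slow outlier
print(exact_quantile(small_sample, 0.99))  # → 5000
```

In other words, for a 50-request window the "P99" is really just the slowest request, which is why sparse endpoints produce noisy SLIs.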
Criteria for Choosing an Aggregation Window
When selecting a time range window such as [5m], you need to adjust it to fit the characteristics of your service.
| Window | Suitable Situation | Caveats |
|---|---|---|
| `[1m]` or less | High-frequency APIs with heavy traffic | Low quantile reliability for endpoints with few log entries |
| `[5m]` | Most production services | Good balance between detecting latency spikes and statistical accuracy |
| `[15m]` or more | Low-traffic batch services | Slower to detect sudden latency spikes |
In practice, [5m] is widely used as a default because windows of 1 minute or less suffer from low quantile accuracy due to sparse logs, while windows of 15 minutes or more are slow to detect latency spikes. Adjust this to fit the traffic characteristics of your own service.
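A quick way to sanity-check a window choice is to estimate how many log lines fall inside it. The ~100-sample threshold below is a rule of thumb assumed for illustration, not a Loki requirement:

```python
import math

MIN_SAMPLES_FOR_P99 = 100  # rule-of-thumb threshold (assumption)

def min_window_seconds(requests_per_second, min_samples=MIN_SAMPLES_FOR_P99):
    """Smallest window that captures enough samples at the given traffic rate."""
    return math.ceil(min_samples / requests_per_second)

print(min_window_seconds(10))    # → 10   (heavy traffic: even [1m] is plenty)
print(min_window_seconds(0.25))  # → 400  (sparse traffic: needs more than [5m])
```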
Now let's apply these concepts to real SLO scenarios.
Practical Application
Example 1: HTTP API Latency SLO
Scenario: P99 response time under 500ms, maintained 99.9% of the time over a 30-day period
```
# Replace label names ({app="order-api"}, field name duration_ms, etc.) to match your actual environment.

# SLI: P99 latency
quantile_over_time(0.99,
  {namespace="production", app="order-api"}
  | json
  | unwrap duration_ms
  | __error__="" [5m]
) by (endpoint)

# SLI: P95 latency
quantile_over_time(0.95,
  {namespace="production", app="order-api"}
  | json
  | unwrap duration_ms
  | __error__="" [5m]
) by (endpoint)

# SLI: Average latency
avg_over_time(
  {namespace="production", app="order-api"}
  | json
  | unwrap duration_ms
  | __error__="" [5m]
) by (endpoint)
```

| Query Component | Role |
|---|---|
| `{namespace="production", app="order-api"}` | Selects logs for a specific app in the production environment |
| `\| json` | Parses logs in JSON format |
| `\| unwrap duration_ms` | Extracts the `duration_ms` field as a numeric value |
| `\| __error__=""` | Removes parse error logs |
| `[5m]` | Aggregates over a 5-minute window |
| `by (endpoint)` | Splits results by endpoint |
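What the `| json | unwrap duration_ms | __error__=""` stage of this pipeline does can be approximated in a few lines of Python. This is a simplified mental model, not Loki's implementation:

```python
import json

raw_logs = [
    '{"endpoint": "/orders", "duration_ms": 120}',
    '{"endpoint": "/orders", "duration_ms": 850}',
    '{"endpoint": "/orders", "duration_ms": "n/a"}',  # non-numeric -> __error__
    'not json at all',                                # parse failure -> __error__
]

def unwrap_field(lines, field):
    """Keep only entries whose field converts to a number (__error__ == "")."""
    values = []
    for line in lines:
        try:
            values.append(float(json.loads(line)[field]))
        except (ValueError, KeyError):
            continue  # the | __error__="" filter drops these entries
    return values

print(unwrap_field(raw_logs, "duration_ms"))  # → [120.0, 850.0]
```

Without the error filter, the two malformed entries would surface as parse-error streams and distort any aggregation built on top.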
Example 2: Error Rate SLO
Error rate can be calculated without unwrap by combining the rate() function with label filters.
```
# Numeric label comparison after JSON parsing (>= 500) is supported in Loki 2.3+.
# For older environments, you can substitute with | status_code !~ "1..|2..|3..|4.."
(
  sum(rate(
    {app="order-api"} | json | status_code >= 500 [5m]
  )) by (service)
  /
  clamp_min(
    sum(rate({app="order-api"} | json [5m])) by (service),
    0.001
  )
) * 100
```

> **Handling a zero denominator** When there is no traffic, the denominator can become 0, resulting in `NaN` or infinity values. It is recommended to specify a minimum denominator value with `clamp_min(..., 0.001)`, or to set a default value using `or vector(0)`. Without this handling, alerts may malfunction.
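The effect of the clamped denominator can be mimicked directly; a toy model of the query arithmetic:

```python
def error_rate_percent(errors_per_sec, total_per_sec, floor=0.001):
    """Error rate with a clamped denominator, mirroring clamp_min(..., 0.001)."""
    return errors_per_sec / max(total_per_sec, floor) * 100

# Normal traffic: 2 errors/s out of 100 req/s -> 2%.
print(error_rate_percent(2, 100))  # → 2.0

# No traffic at all: without the clamp this would be 0/0 = NaN;
# with the clamp it degrades gracefully to 0%.
print(error_rate_percent(0, 0))    # → 0.0
```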
Example 3: Recording Rules Based on AWS ALB Logs
Instead of running expensive queries on the dashboard every time, you can significantly improve query performance by using Recording Rules to pre-compute results and store them as time-series metrics.
```yaml
# loki-rules.yaml
groups:
  - name: alb_slo_rules
    interval: 1m  # Recording Rule execution interval
    rules:
      - record: job:alb_request_duration_seconds:p99
        expr: |
          quantile_over_time(0.99,
            {job="alb-logs"}
            | logfmt
            | unwrap target_processing_time
            | __error__="" [5m]
          ) by (target_group)
      - record: job:alb_error_rate:ratio
        expr: |
          sum(rate({job="alb-logs"} | logfmt | elb_status_code >= 500 [5m]))
          /
          clamp_min(
            sum(rate({job="alb-logs"} | logfmt [5m])),
            0.001
          )
```

> **What are Recording Rules?** They are a feature that pre-computes frequently executed, expensive LogQL queries and stores the results as new time-series metrics. This is the same concept as Prometheus Recording Rules — since the dashboard queries already-computed metrics, the query load is greatly reduced.

> **The relationship between `interval` and the range window** With `interval: 1m`, the rule runs every minute. If the query window is `[5m]`, there will be 4 minutes of overlapping data between consecutive execution results. This is intentional behavior, but it can lead to double-counting depending on your aggregation method, so caution is warranted. To avoid overlap, you can set `interval: 5m` to match the window size.
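The overlap between consecutive evaluations is easy to enumerate (illustrative only):

```python
from datetime import datetime, timedelta

interval = timedelta(minutes=1)  # rule execution interval
window = timedelta(minutes=5)    # [5m] range window

start = datetime(2025, 1, 1, 12, 0)
for i in range(3):
    eval_time = start + i * interval
    # Each evaluation looks back over the full window ending at eval_time.
    print(f"eval at {eval_time:%H:%M} covers {eval_time - window:%H:%M}-{eval_time:%H:%M}")

# Consecutive evaluations share (window - interval) of data.
print(window - interval)  # → 0:04:00
```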
Example 4: Grafana Dashboard Panel Configuration
It is recommended to structure your SLO dashboard with the following four panels.
```
┌────────────────────────────────────────────┐
│ SLO Status Panel                           │
│ - Current SLI value (Stat Panel)           │
│ - Compliance rate vs. SLO target (Gauge)   │
├────────────────────────────────────────────┤
│ Latency Trend                              │
│ - P50 / P95 / P99 Time Series              │
│ - avg_over_time average overlay            │
├────────────────────────────────────────────┤
│ Error Budget                               │
│ - Remaining error budget (%) — Stat Panel  │
│ - Burn rate time series — Time Series      │
├────────────────────────────────────────────┤
│ Burn Rate Alert Status                     │
│ - Fast-burn (1h/6h windows)                │
│ - Slow-burn (24h/72h windows)              │
└────────────────────────────────────────────┘
```

For each panel's data source, reference the `job:alb_request_duration_seconds:p99` and `job:alb_error_rate:ratio` metrics stored via Recording Rules in Example 3.
Burn Rate is the multiplier that compares the current rate of error consumption against the SLO's allowance. For example, at a 99.9% SLO, a Burn Rate of 1 means the error budget is being exhausted at exactly the 30-day rate, while a Burn Rate of 10 means it would be exhausted in 3 days. The multi-window burn rate pattern from the Google SRE Workbook is a dual-alerting strategy that uses short windows (1h/6h) to detect sudden outages and long windows (24h/72h) to detect gradual service degradation.
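The burn-rate arithmetic from the paragraph above, as a minimal sketch:

```python
def days_to_exhaustion(burn_rate_value, slo_window_days=30):
    """At burn rate B, a budget sized for the SLO window lasts window / B days."""
    return slo_window_days / burn_rate_value

print(days_to_exhaustion(1))   # → 30.0  (consuming budget exactly on schedule)
print(days_to_exhaustion(10))  # → 3.0   (budget gone in 3 days)

def burn_rate(observed_error_ratio, slo_target=0.999):
    """How many times faster than 'allowed' errors are currently occurring."""
    return observed_error_ratio / (1 - slo_target)

# A 1% observed error rate against a 99.9% SLO burns budget at 10x.
print(round(burn_rate(0.01), 1))  # → 10.0
```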
Operational Considerations
Advantages
| Item | Details |
|---|---|
| No code changes | SLIs can be extracted from existing logs — no app instrumentation required |
| Cost-efficient | Loki can be operated standalone without Prometheus or a separate metrics store |
| Flexible field extraction | Supports diverse log formats via json, logfmt, regexp, and pattern parsers |
| Historical analysis | Past logs can be re-queried to calculate SLIs retroactively |
| Granular SLOs | Easy to segment by service, endpoint, or region using label-based grouping |
Disadvantages and Caveats
| Item | Details | Mitigation |
|---|---|---|
| Performance overhead | `quantile_over_time` is memory-intensive and may time out with high log volumes | Apply pre-computation with Recording Rules |
| Scalability constraints | Because exact quantiles are computed, memory usage grows linearly as the time range widens | Use short aggregation windows (1m–5m) and persist results with Recording Rules |
| Sparse data accuracy | For endpoints with low traffic, insufficient sample counts can lower quantile reliability | Adjust the label aggregation level to the service level |
| Single-field limitation | Only one field can be extracted per `unwrap` expression | Use separate queries when multiple metrics are needed |
| Known bug | Combining with outer aggregation functions (`avg by (...) (quantile_over_time(...))`) causes errors in certain Loki versions (Issue #13793) | Verify your Loki version and avoid outer aggregations |
| Log structure dependency | SLI quality is entirely dependent on the consistency of the log format | Update queries whenever the log format changes |
> **What is Cardinality?** It refers to the number of unique values for a label. The more granular your `by (endpoint)` clause, the more the number of time series explodes. For example, 200 endpoints means a single Recording Rule generates 200 time series, which directly increases memory and index costs in Loki and Grafana. If you have many endpoints, it is recommended to reduce the number of aggregation labels or normalize label values.
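The series-count math behind this warning, using hypothetical label counts:

```python
# Each combination of grouping-label values becomes one stored time series.
endpoints = 200  # cardinality of by (endpoint)        (hypothetical)
regions = 3      # adding by (endpoint, region) multiplies (hypothetical)
rules = 2        # e.g. one p99 rule + one error-rate rule

series_endpoint_only = endpoints * rules
series_with_region = endpoints * regions * rules

print(series_endpoint_only)  # → 400
print(series_with_region)    # → 1200
```

Every extra grouping label is a multiplier, which is why normalizing label values (e.g. collapsing `/orders/123` into `/orders/:id`) pays off quickly.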
The Most Common Mistakes in Practice
- **Omitting the `__error__=""` filter** — Unwrap parse error logs get included in SLI values, distorting the results. This is especially problematic when some log entries are missing the field entirely or contain mixed string values, which can cause values to spike dramatically.
- **Omitting zero-denominator handling in error rate queries** — During low-traffic hours such as late at night, `NaN` or `+Inf` values can appear, causing alerts to malfunction or leaving gaps in the dashboard. Using `clamp_min` or `or vector(0)` prevents this issue.
- **Ignoring the relationship between a Recording Rule's `interval` and the range window size** — Using `interval: 1m` with a `[5m]` window causes overlapping data references. First verify whether this is your intended aggregation behavior, and if you want overlap-free operation, consider aligning the interval with the window size.
Closing Thoughts
By combining Grafana Loki's unwrap, avg_over_time, and quantile_over_time, you can build a reliable SLO dashboard from your existing production logs without any code changes.
Here are 3 steps you can take right now to get started.
- **Identify numeric fields in your existing logs.** In Grafana Explore, run `{app="your-app-name"} | json` and check whether there are parseable numeric fields such as `response_time_ms`, `duration`, or `latency`.
- **Modify the P99 query from Example 1 to fit your app and run it.** You can see your first SLI value simply by replacing the `app`, `namespace`, and the target field name for `unwrap`.
- **Register the validated query as a Recording Rule.** Save it in `loki-rules.yaml` as `record: job:your_api_p99_latency_ms`, then connect this metric to a Time Series panel and a Gauge panel in your Grafana dashboard.
Your logs are already in your infrastructure — from now on, those logs become your SLO instrument panel.
How to automate the Fast-burn/Slow-burn dual-alerting strategy by integrating the Grafana SLO plugin (grafana-slo-app) with Alertmanager will be covered in the next post.
References
- Metric queries | Grafana Loki documentation
- Alerting and recording rules | Grafana Loki documentation
- How to use LogQL range aggregations in Loki | Grafana Labs
- Best practices for Grafana SLOs | Grafana Cloud documentation
- Introduction to Grafana SLO | Grafana Cloud documentation
- Create Grafana Loki SLO | Nobl9 Documentation
- Grafana Loki: LogQL and Recording Rules for metrics from AWS Load Balancer logs | ITNEXT
- Loki Community Call July 2025: How to generate metrics from logs | Grafana Community
- A Comprehensive Guide to Log Query Language (LogQL) | DEV Community
- Avg with quantile_over_time throws an error · Issue #13793 · grafana/loki
- How to Build SLO Dashboards with Loki Logs | OneUptime