Complete Guide to Prometheus + Grafana Monitoring — From Docker Compose to Kubernetes

Have you ever experienced your service suddenly slowing down with no idea why? Digging through logs, SSH-ing into servers to run top, and ending up with "let's just restart it" — this vicious cycle is common when you have no monitoring system in place. In modern service operations, the ability to understand in real time what is happening and how has become a necessity, not an option.

Prometheus and Grafana are the de facto industry-standard combination for solving this problem. Prometheus collects, stores, and queries metrics, while Grafana turns that data into human-readable dashboards. Prometheus is a CNCF (Cloud Native Computing Foundation) Graduated project deeply integrated with the Kubernetes ecosystem, Grafana is the open-source dashboard layer most often paired with it, and Grafana Labs is also recognized in the Gartner Observability Platforms category, so this is a mature stack.

This guide is written for backend and DevOps developers introducing monitoring for the first time, covering everything step by step: how the two tools work, setting up a local Docker Compose environment, instrumenting custom metrics in Node.js, and deploying to Kubernetes in production. After reading this, you will be able to build everything from metric collection to Grafana visualization in your local environment. If you have Kubernetes experience, later sections also cover production-level configuration.

TL;DR — Key Summary

Prometheus: A time-series DB + PromQL query engine that periodically scrapes /metrics endpoints (Pull model)

Grafana: A visualization platform connecting Prometheus and 100+ other data sources (does not collect data itself)

Local start: Instantly up and running with a single docker compose up -d

Critical warning: Putting unique values (user_id, request_id) in labels risks memory explosion

Where to run PromQL: Prometheus UI (localhost:9090) or the Grafana Explore tab

Core Concepts

Prometheus: A Pull-Based Metric Collection Engine

The core of how Prometheus works is its Pull model. Target services expose a /metrics HTTP endpoint, and Prometheus periodically "scrapes" that endpoint to fetch data.

Unlike the Push model used by tools like StatsD, where services push data to the monitoring server, in the Pull model Prometheus polls each service. A target that stops responding therefore shows up as a failed scrape (the built-in up metric drops to 0) within one scrape interval, and monitoring configuration is centralized on the Prometheus side for easier management.

yaml

# prometheus.yml — basic scraping configuration example
global:
  scrape_interval: 15s      # collect metrics every 15 seconds
 
scrape_configs:
  - job_name: 'my-app'
    static_configs:
      - targets: ['localhost:8080']  # the service exposing the /metrics endpoint
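
To make the Pull model concrete, here is a minimal TypeScript sketch of one scrape cycle. It is an in-memory illustration, not how Prometheus is implemented: the Target type, the scrapeOnce function, and the second instance on localhost:8081 are hypothetical names introduced for this example, and fetchMetrics stands in for an HTTP GET to /metrics.

```typescript
// Hypothetical in-memory model of pull-based scraping; no real HTTP is involved.
type Target = { job: string; instance: string; fetchMetrics: () => string };
type ScrapeResult = { job: string; instance: string; up: 0 | 1; body: string };

// Scrape every target once. A target that throws is recorded with up = 0,
// mirroring the synthetic `up` series Prometheus writes after each scrape.
function scrapeOnce(targets: Target[]): ScrapeResult[] {
  return targets.map((t): ScrapeResult => {
    try {
      return { job: t.job, instance: t.instance, up: 1, body: t.fetchMetrics() };
    } catch {
      return { job: t.job, instance: t.instance, up: 0, body: "" };
    }
  });
}

const targets: Target[] = [
  { job: "my-app", instance: "localhost:8080", fetchMetrics: () => "orders_total 42\n" },
  {
    job: "my-app",
    instance: "localhost:8081", // hypothetical second instance that is down
    fetchMetrics: () => {
      throw new Error("connection refused");
    },
  },
];

for (const r of scrapeOnce(targets)) {
  console.log(`up{job="${r.job}",instance="${r.instance}"} ${r.up}`);
}
```

The healthy target is reported with up = 1 and the failing one with up = 0, which is the signal an alert on a down service would typically key on.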

Collected data is stored in a built-in time-series DB and can be queried and aggregated with PromQL (Prometheus Query Language). PromQL queries can be run in the Prometheus UI at http://localhost:9090 or in Grafana's Explore tab. If you're just getting started, it's recommended to paste one of the three queries below directly into the Prometheus UI to see the results.

promql

# Average orders per second over the last 5 minutes, per label combination
# (applied to Counter metrics like orders_total defined in the Node.js example below)
rate(orders_total[5m])
 
# 99th percentile latency by route
# (applied to the http_request_duration_seconds histogram, which has method, route, and status_code labels)
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, route))
 
# Error rate (proportion of orders recorded with status="error")
sum(rate(orders_total{status="error"}[5m])) / sum(rate(orders_total[5m]))
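
Beyond the UI, the same queries can be sent to the Prometheus HTTP API at /api/v1/query, which is handy for scripts. The sketch below only builds the request URL and parses an instant-vector response. buildQueryUrl, parseVector, and the sample payload are illustrative names and hand-written values, not real output from a server.

```typescript
// Shape of an instant-vector response from GET /api/v1/query.
type InstantVectorResponse = {
  status: "success" | "error";
  data: {
    resultType: "vector";
    result: { metric: Record<string, string>; value: [number, string] }[];
  };
};

// Build the instant-query URL; the PromQL expression must be URL-encoded.
function buildQueryUrl(baseUrl: string, promql: string): string {
  return `${baseUrl}/api/v1/query?query=${encodeURIComponent(promql)}`;
}

// Sample values arrive as strings; convert them to numbers and keep the labels.
function parseVector(
  res: InstantVectorResponse
): { labels: Record<string, string>; value: number }[] {
  return res.data.result.map((r) => ({ labels: r.metric, value: Number(r.value[1]) }));
}

// Hand-written sample payload (illustrative values only).
const sampleResponse: InstantVectorResponse = {
  status: "success",
  data: {
    resultType: "vector",
    result: [
      { metric: { status: "success", payment_method: "card" }, value: [1700000000, "1.5"] },
      { metric: { status: "error", payment_method: "card" }, value: [1700000000, "0.02"] },
    ],
  },
};

console.log(buildQueryUrl("http://localhost:9090", "rate(orders_total[5m])"));
console.log(parseVector(sampleResponse));
```

In a real script the URL would be passed to an HTTP client and the JSON body handed to parseVector.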

PromQL Key Functions Quick Reference

rate(): Per-second rate of change — always wrap Counter metrics with this

increase(): Total increase over a given period

histogram_quantile(): Calculate percentiles from a histogram

sum() by (label): Aggregate by label
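
To see why a Counter needs rate() or increase(), the following TypeScript sketch computes both from raw samples, including a counter reset after a restart. It is a simplification under stated assumptions: Prometheus additionally extrapolates to the edges of the query window, which is omitted here, and simpleIncrease and simpleRate are hypothetical helper names.

```typescript
type Sample = { t: number; v: number }; // t in seconds, v = raw counter value

// Total increase across the samples. A drop in value is treated as a counter
// reset: the counter restarted from 0, so the new value is itself the increase.
function simpleIncrease(samples: Sample[]): number {
  let total = 0;
  for (let i = 1; i < samples.length; i++) {
    const prev = samples[i - 1].v;
    const cur = samples[i].v;
    total += cur >= prev ? cur - prev : cur;
  }
  return total;
}

// Per-second rate over the time span covered by the samples.
function simpleRate(samples: Sample[]): number {
  if (samples.length < 2) return NaN;
  const seconds = samples[samples.length - 1].t - samples[0].t;
  return simpleIncrease(samples) / seconds;
}

// A counter scraped every 15s; the process restarts between t=30 and t=45.
const counterSamples: Sample[] = [
  { t: 0, v: 100 },
  { t: 15, v: 130 },
  { t: 30, v: 160 },
  { t: 45, v: 20 }, // reset: counter restarted from 0
  { t: 60, v: 50 },
];

console.log(simpleIncrease(counterSamples)); // 30 + 30 + 20 + 30 = 110
console.log(simpleRate(counterSamples)); // 110 / 60, roughly 1.83 per second
```

Graphing the raw values would show a sawtooth that drops at the restart, while the computed rate stays meaningful across the reset.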

Grafana: The Role of the Visualization Layer

Grafana does not collect data itself. It focuses on connecting to Prometheus, Loki, Elasticsearch, and 100+ other data sources to visualize metrics, logs, and traces in a single dashboard.

Metrics  → Prometheus ──────────────────────────┐
Logs     → Loki ────────────────────────────────┤ Grafana (Visualization)
Traces   → Tempo ───────────────────────────────┤
Profiles → Pyroscope ──────────────────────────┘
        ↑
   Grafana Alloy (unified telemetry collector, OpenTelemetry compatible)

What is Grafana Alloy? It is the unified telemetry collector that replaces Grafana Agent, which reached EOL in 2025. It can collect and transform OpenTelemetry, Prometheus, Loki, and Pyroscope pipelines with a single collector, and is deployed as an agent on each server. The Docker Compose examples in this guide use direct scraping for simplicity.

The relationship between Grafana and Prometheus is like that of a database and a visualization tool. Grafana works without Prometheus, and Prometheus works without Grafana. But when the two tools work together, a powerful monitoring environment is complete.

Understanding the 4 Metric Types

Prometheus handles four types of metrics. Choosing the wrong type can distort your data, so understanding them is essential.

Type	Characteristics	Typical Use Cases
Counter	Monotonically increasing, resets to 0 on restart	Total HTTP requests, error occurrence count
Gauge	Can increase or decrease, represents current state	CPU usage, current connection count, memory usage
Histogram	Distribution by bucket + sum + count	Response time distribution, request size distribution
Summary	Client-side percentile calculation	SLA-based latency measurement (rarely used)
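
Because a Histogram stores only cumulative bucket counts, percentiles are estimated rather than measured exactly. The TypeScript sketch below shows the linear interpolation idea behind histogram_quantile(). It is a simplified illustration that assumes 0 < q <= 1 and at least one finite bucket. The Bucket type, the function name, and the bucket counts are made up for this example.

```typescript
type Bucket = { le: number; count: number }; // cumulative count of observations <= le

function histogramQuantile(q: number, buckets: Bucket[]): number {
  const sorted = [...buckets].sort((a, b) => a.le - b.le);
  const total = sorted[sorted.length - 1].count; // the +Inf bucket holds the total
  if (total === 0) return NaN;

  const rank = q * total;
  const idx = sorted.findIndex((b) => b.count >= rank);
  const upper = sorted[idx].le;

  // If the rank lands in the +Inf bucket, report the highest finite bound.
  if (upper === Infinity) return idx > 0 ? sorted[idx - 1].le : NaN;

  // Interpolate linearly inside the bucket that contains the rank.
  const lower = idx === 0 ? 0 : sorted[idx - 1].le;
  const below = idx === 0 ? 0 : sorted[idx - 1].count;
  const inBucket = sorted[idx].count - below;
  return lower + (upper - lower) * ((rank - below) / inBucket);
}

// Illustrative latency buckets in seconds (hand-written counts).
const latencyBuckets: Bucket[] = [
  { le: 0.1, count: 50 },
  { le: 0.3, count: 80 },
  { le: 0.5, count: 95 },
  { le: 1, count: 99 },
  { le: Infinity, count: 100 },
];

console.log(histogramQuantile(0.5, latencyBuckets)); // 0.1
console.log(histogramQuantile(0.9, latencyBuckets)); // roughly 0.433
console.log(histogramQuantile(0.999, latencyBuckets)); // 1 (capped at the highest finite bound)
```

The last call shows why bucket boundaries matter: any percentile that falls beyond the largest finite bucket is reported as that bucket's bound, so buckets should extend past the latencies you care about.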

Practical Application

Local Development Environment Setup (Docker Compose)

Docker Compose is the fastest way to run the Prometheus + Grafana stack in a development environment.

First, create a .env file in the project root to separate secrets.

bash

# .env
GRAFANA_PASSWORD=enter_a_secure_password_here

yaml

# docker-compose.yml
 
services:
  prometheus:
    image: prom/prometheus:v3.1.0
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=15d'   # retain for 15 days
 
  grafana:
    image: grafana/grafana:11.4.0
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}
    volumes:
      - grafana-data:/var/lib/grafana
    depends_on:
      - prometheus
 
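  node-exporter:
    image: prom/node-exporter:latest
    ports:
      - "9100:9100"
    # network_mode: host is deliberately not set here: port mappings are ignored or
    # rejected in host mode, and the container would leave the Compose network, so
    # Prometheus could not resolve node-exporter:9100. The trade-off is that network
    # statistics describe the container's network namespace rather than the host's.
    pid: host            # required for process information collection (Linux environment)
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'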
 
volumes:
  grafana-data:

Note for macOS users: host-level options such as network_mode: host and pid: host are designed for Linux hosts. Docker Desktop on macOS runs containers inside a Linux VM, so with these options node-exporter reports the VM's metrics rather than those of macOS itself. In a local macOS development environment, you can remove them and still collect basic metrics, though some network statistics may be missing.

yaml

# prometheus.yml — including node-exporter scraping
global:
  scrape_interval: 15s
 
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
 
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

Component	Role	Access URL
Prometheus	Metric collection and storage	`http://localhost:9090`
Grafana	Dashboard visualization	`http://localhost:3000`
node-exporter	Expose host system metrics	`http://localhost:9100/metrics`

After running docker compose up -d, open Grafana at http://localhost:3000 and log in as admin with the password from .env. Add Prometheus as a data source (URL: http://prometheus:9090), then import the community dashboard Node Exporter Full (ID 1860) to immediately view a host monitoring dashboard.

Node.js Application Instrumentation

This example defines your service's business metrics and exposes them for Prometheus to collect. First, install the client library.

bash

pnpm add prom-client

typescript

// metrics.ts — using the prom-client library
import { Registry, Counter, Histogram, Gauge } from 'prom-client';
 
export const registry = new Registry();
 
// Order processing counter
export const ordersTotal = new Counter({
  name: 'orders_total',
  help: 'Total number of orders processed',
  labelNames: ['status', 'payment_method'],
  registers: [registry],
});
 
// API latency histogram
// bucket values are set to cover typical web API response time distribution (10ms~5s)
// it is recommended to adjust intervals more finely around thresholds based on your service SLA
export const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request processing time (seconds)',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.01, 0.05, 0.1, 0.3, 0.5, 1, 2, 5],
  registers: [registry],
});
 
// Current active user count
export const activeUsers = new Gauge({
  name: 'active_users_current',
  help: 'Number of users with active sessions',
  registers: [registry],
});
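
It helps to know what these definitions turn into on the wire. The sketch below renders a counter in the Prometheus text exposition format by hand. It does not use prom-client, and formatSample, formatCounter, and the sample values are hypothetical, written only to show the shape of the output that /metrics returns.

```typescript
type LabelSet = Record<string, string>;

// Escape backslash, double quote, and newline in label values,
// as the text exposition format requires.
function escapeLabelValue(v: string): string {
  return v.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

// Render one sample line: name{label="value",...} value
function formatSample(name: string, labels: LabelSet, value: number): string {
  const pairs = Object.keys(labels).map((k) => `${k}="${escapeLabelValue(labels[k])}"`);
  const labelPart = pairs.length > 0 ? `{${pairs.join(",")}}` : "";
  return `${name}${labelPart} ${value}`;
}

// Render a whole counter family with its HELP and TYPE header lines.
function formatCounter(
  name: string,
  help: string,
  samples: { labels: LabelSet; value: number }[]
): string {
  const lines = [`# HELP ${name} ${help}`, `# TYPE ${name} counter`];
  for (const s of samples) lines.push(formatSample(name, s.labels, s.value));
  return lines.join("\n") + "\n";
}

console.log(
  formatCounter("orders_total", "Total number of orders processed", [
    { labels: { status: "success", payment_method: "card" }, value: 3 },
    { labels: { status: "error", payment_method: "card" }, value: 1 },
  ])
);
```

Each distinct label combination becomes its own line, and therefore its own time series, which is why label design directly affects cardinality.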

typescript

// app.ts — Express middleware and /metrics endpoint
import express from 'express';
import { registry, httpRequestDuration, ordersTotal } from './metrics';
 
const app = express();
 
// Middleware to measure latency of all requests
// Only `method` is known when the middleware runs: req.route is still undefined
// because route matching has not happened yet. The route label is therefore
// added inside res.on('finish'), after matching, so the route template
// (e.g. /users/:id) is recorded instead of the raw request path.
app.use((req, res, next) => {
  const end = httpRequestDuration.startTimer({ method: req.method });
  res.on('finish', () => {
    end({
      route: req.route?.path ?? 'unknown',
      status_code: res.statusCode,
    });
  });
  next();
});
 
// Prometheus scraping endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', registry.contentType);
  res.send(await registry.metrics());
});
 
// Example order processing route
app.post('/orders', async (req, res) => {
  // ... order processing logic ...
  ordersTotal.inc({ status: 'success', payment_method: 'card' });
  res.json({ success: true });
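});
 
// Listen on the port that prometheus.yml scrapes (localhost:8080 in the basic example)
app.listen(8080);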

Warning: If you set the route label when the middleware first runs, for example with req.route?.path ?? req.path, route matching has not happened yet, so the fallback records raw request paths such as /api/users/123 and causes a cardinality explosion. Read req.route.path inside the res.on('finish') callback as shown in the example above, or consider using the express-prom-bundle library.
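
Where no route template is available at all, one common fallback is to normalize the raw path before using it as a label. The TypeScript sketch below is one possible approach, not part of Express or prom-client: normalizePath and its placeholder names (:id, :uuid) are hypothetical, and the patterns would need adjusting to the ID formats your service actually uses.

```typescript
// Matches a canonical UUID such as 3f2b8c1e-9a4d-4e6f-8b7a-1c2d3e4f5a6b.
const UUID_PATTERN = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Collapse ID-like path segments into fixed placeholders so that
// /api/users/123 and /api/users/456 map to the same label value.
function normalizePath(path: string): string {
  const pathname = path.split("?")[0]; // drop the query string
  return pathname
    .split("/")
    .map((segment) => {
      if (/^\d+$/.test(segment)) return ":id";
      if (UUID_PATTERN.test(segment)) return ":uuid";
      return segment;
    })
    .join("/");
}

console.log(normalizePath("/api/users/123")); // /api/users/:id
console.log(normalizePath("/orders/3f2b8c1e-9a4d-4e6f-8b7a-1c2d3e4f5a6b?x=1")); // /orders/:uuid
console.log(normalizePath("/metrics")); // /metrics
```

This keeps the number of distinct route values bounded by the number of URL shapes rather than the number of entities.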

Kubernetes Production Environment Setup (Advanced)

Prerequisites: A Kubernetes cluster and Helm 3.x must be installed. For local setups, you can create a cluster using kind or minikube.

In a Kubernetes environment, a single Helm chart can configure the entire monitoring stack. kube-prometheus-stack installs Prometheus, Grafana, Alertmanager, node-exporter, and kube-state-metrics all at once.

bash

# Add Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
 
# Install kube-prometheus-stack
helm install kube-prometheus-stack \
  prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set grafana.adminPassword=your-secure-password \
  --set prometheus.prometheusSpec.retention=30d

bash

# Check installation status — verify all pods are in Running state
kubectl get pods -n monitoring
 
# Access Grafana dashboard via local port-forwarding
kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80 -n monitoring

These commands alone will automatically install the basic dashboards needed for Kubernetes operations, including node CPU and memory, pod status, and resource usage by namespace.

Pros and Cons Analysis

Advantages

Item	Details
Pull-based collection	A service that goes down is detected within one scrape interval. Prometheus polls each service, so a failed scrape sets the up metric to 0 and an alert rule on that metric can fire
Powerful query language	PromQL enables complex aggregation, filtering, and rate calculations. Supports multi-dimensional data analysis based on labels
Rich exporter ecosystem	Official and community-maintained exporters are available for major middleware including MySQL, Redis, Nginx, and PostgreSQL
Versatile visualization	Grafana alone lets you view metrics (Prometheus), logs (Loki), and traces (Tempo) in a unified dashboard
Cloud-native standard	Prometheus is a CNCF Graduated project with native integration into the Kubernetes ecosystem
Fully managed option	Using Grafana Cloud allows you to start immediately without operating any infrastructure

Disadvantages and Caveats

Item	Details	Mitigation
Short-term retention limits	The default storage is not suitable for retention of months to years	Separate long-term storage with Thanos* or Grafana Mimir**
Complex HA setup	A Prometheus server is a single node with no built-in clustering. High availability means running identical replicas that scrape the same targets, which produces duplicate data	Use Thanos Receive or Mimir's distributed ingestion layer
Cardinality explosion	A rapid increase in label combinations causes memory and storage usage to spike	Exclude high-cardinality labels like `user_id` and `request_id` at the label design stage
Weak default security	`/metrics` endpoints have no authentication by default	Network isolation (internal network only) or TLS + Basic Auth configuration
PromQL learning curve	`rate`, `histogram_quantile`, etc. are not intuitive at first	Start with Grafana's query builder UI, then progressively learn PromQL

* Thanos: An open-source project providing a global query layer across Prometheus clusters and long-term object storage (S3, etc.) integration

** Grafana Mimir: A remote_write backend for Prometheus supporting horizontal scaling and long-term retention. Mimir 3.0, released in 2025, reduced memory usage by up to 92% with a new query engine.

The Most Common Mistakes in Practice

Graphing a Counter directly without rate() — A Counter is a cumulative value, so it should always be wrapped with rate() or increase() to view the rate of change over time. A cumulative graph almost always results in a meaningless upward-sloping straight line.
Putting unique values in labels — Using values that change per request as labels, like {user_id="12345"} or {request_id="abc-xyz"}, can cause a cardinality explosion that drives Prometheus into an out-of-memory state within hours. It is recommended to use labels only for "classification of state," such as {status="success"} or {region="ap-northeast-2"}.
Setting the scrape interval too short — Setting scrape_interval: 1s will overload the Prometheus server when there are many targets. 15–30 seconds is appropriate for most production environments, and shortening the interval should be done only as an exception when precise SLA measurement is required.
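
The cost of a short scrape interval is easy to estimate with simple arithmetic. The TypeScript sketch below computes the ingestion rate in samples per second. The function name and the fleet size (200 targets with 1,000 series each) are illustrative assumptions, not measurements.

```typescript
// Samples ingested per second = total active series / scrape interval.
function samplesPerSecond(
  targets: number,
  seriesPerTarget: number,
  scrapeIntervalSeconds: number
): number {
  return (targets * seriesPerTarget) / scrapeIntervalSeconds;
}

// Hypothetical fleet: 200 targets exposing 1,000 series each.
console.log(Math.round(samplesPerSecond(200, 1000, 15))); // 13333
console.log(samplesPerSecond(200, 1000, 1)); // 200000
```

Moving from 15s to 1s multiplies ingestion load by 15 for the same fleet, which is why the shorter interval is best reserved for the few targets that genuinely need it.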

What is Cardinality? It refers to the number of unique label value combinations, each of which is stored as a separate time series. For example, if a user_id label takes the IDs of 1 million users, at least 1 million time series are created for that one metric. Because Prometheus keeps every active series in memory, high-cardinality labels are a primary cause of OOM (Out of Memory).
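
A quick way to check a label design is to multiply the number of values each label can take. The TypeScript sketch below gives a worst-case upper bound, since not every combination necessarily occurs in practice. The helper names and the per-label counts are assumptions chosen for illustration.

```typescript
// Upper bound on series for one metric: product of per-label value counts.
function seriesCount(labelCardinalities: Record<string, number>): number {
  return Object.keys(labelCardinalities).reduce(
    (acc, label) => acc * labelCardinalities[label],
    1
  );
}

// A histogram emits, per label combination, one series per finite bucket,
// one for the +Inf bucket, plus the _sum and _count series.
function histogramSeriesCount(
  labelCardinalities: Record<string, number>,
  finiteBuckets: number
): number {
  return seriesCount(labelCardinalities) * (finiteBuckets + 1 + 2);
}

// orders_total with bounded labels (illustrative counts).
console.log(seriesCount({ status: 2, payment_method: 3 })); // 6

// The same metric after adding a user_id label for 1 million users.
console.log(seriesCount({ status: 2, payment_method: 3, user_id: 1_000_000 })); // 6000000

// http_request_duration_seconds with 8 finite buckets.
console.log(histogramSeriesCount({ method: 4, route: 10, status_code: 5 }, 8)); // 2200
```

A single unbounded label turns a handful of series into millions, while the histogram example shows that even bounded labels add up quickly once buckets are multiplied in.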

Closing Thoughts

The combination of Prometheus and Grafana goes beyond a simple server monitoring tool — it is core modern engineering infrastructure that transforms the state of your service into real-time business intelligence.

It may feel complex, but you can start with small steps. Here are 3 steps you can take right now.

Start the local stack: Run docker compose up -d with the docker-compose.yml provided above. If Prometheus and Grafana fail to connect after the stack comes up, first check the container status with docker compose ps and verify that the Grafana data source URL is set to http://prometheus:9090 (the Compose service name, which resolves on the Compose network).
Instrument your application: Add a client library such as prom-client (Node.js), prometheus_client (Python), or micrometer (Java/Spring) to the service you are currently developing, and expose the single most important metric (e.g., API request count) via /metrics. It is recommended to read the cardinality explosion warnings before designing your labels.
Set up your first alert: Use Grafana's Unified Alerting feature to set up a rule that sends a Slack or email notification when the error rate exceeds 1%. The moment you receive that alert, you will feel monitoring transform from a simple dashboard into a tool that actively guards your service.
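
For the alert in the last step, it is usually worth requiring the condition to hold for several evaluations so that a single spike does not page anyone. The TypeScript sketch below models that idea only. It is not Grafana's or Prometheus's implementation, and evaluateAlert, the state names as used here, and the sample ratios are assumptions for illustration.

```typescript
type AlertState = "inactive" | "pending" | "firing";

// Walk through successive evaluations of an error ratio. The alert becomes
// "pending" on the first breach and "firing" only once the threshold has been
// exceeded for `requiredBreaches` consecutive evaluations.
function evaluateAlert(
  errorRatios: number[],
  threshold: number,
  requiredBreaches: number
): AlertState[] {
  let consecutive = 0;
  return errorRatios.map((ratio): AlertState => {
    consecutive = ratio > threshold ? consecutive + 1 : 0;
    if (consecutive === 0) return "inactive";
    return consecutive >= requiredBreaches ? "firing" : "pending";
  });
}

// Error ratio sampled at each evaluation; the threshold is 1% (0.01).
const ratios = [0.005, 0.02, 0.03, 0.02, 0.004, 0.05];
console.log(evaluateAlert(ratios, 0.01, 3));
// [ 'inactive', 'pending', 'pending', 'firing', 'inactive', 'pending' ]
```

The notification would be sent only on the transition to firing, so the isolated breach at the end of the series stays pending and produces no message.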

Next post: We plan to cover how to connect logs and traces with Grafana Loki and Tempo to build true Full-Stack Observability.

References

Complete Guide to Prometheus + Grafana Monitoring: From Docker Compose to Kubernetes | DEV BAK Tech Blog


