Complete Guide to Prometheus + Grafana Monitoring — From Docker Compose to Kubernetes
Have you ever experienced your service suddenly slowing down with no idea why? Digging through logs, SSH-ing into servers to run top, and ending up with "let's just restart it" — this vicious cycle is common when you have no monitoring system in place. In modern service operations, the ability to understand in real time what is happening and how has become a necessity, not an option.
Prometheus and Grafana are the de facto industry-standard combination for solving this problem. Prometheus collects, stores, and analyzes metrics, while Grafana transforms that data into human-readable dashboards. Prometheus is a CNCF (Cloud Native Computing Foundation) Graduated project deeply integrated with the Kubernetes ecosystem, and Grafana Labs is consistently recognized in Gartner's observability platform evaluations; this is a mature stack.
This guide is written for backend and DevOps developers introducing monitoring for the first time, covering everything step by step: how the two tools work, setting up a local Docker Compose environment, instrumenting custom metrics in Node.js, and deploying to Kubernetes in production. After reading this, you will be able to build everything from metric collection to Grafana visualization in your local environment. If you have Kubernetes experience, later sections also cover production-level configuration.
TL;DR — Key Summary
- Prometheus: A time-series DB + PromQL query engine that periodically scrapes `/metrics` endpoints (Pull model)
- Grafana: A visualization platform connecting Prometheus and 100+ other data sources (does not collect data itself)
- Local start: Instantly up and running with a single `docker compose up -d`
- Critical warning: Putting unique values (`user_id`, `request_id`) in labels risks memory explosion
- Where to run PromQL: Prometheus UI (`localhost:9090`) or the Grafana Explore tab
Core Concepts
Prometheus: A Pull-Based Metric Collection Engine
The core of how Prometheus works is its Pull model. Target services expose a /metrics HTTP endpoint, and Prometheus periodically "scrapes" that endpoint to fetch data.
Unlike the Push model used by tools like StatsD — where services push data to the monitoring server — in the Pull model Prometheus polls each service. This means anomalies are detected immediately when a service stops responding, and monitoring configuration is centralized on the Prometheus side for easier management.
```yaml
# prometheus.yml — basic scraping configuration example
global:
  scrape_interval: 15s  # collect metrics every 15 seconds
scrape_configs:
  - job_name: 'my-app'
    static_configs:
      - targets: ['localhost:8080']  # the service exposing the /metrics endpoint
```

Collected data is stored in a built-in time-series DB and can be queried and aggregated with PromQL (Prometheus Query Language). PromQL queries can be run in the Prometheus UI at http://localhost:9090 or in Grafana's Explore tab. If you're just getting started, it's recommended to paste one of the three queries below directly into the Prometheus UI to see the results.
```promql
# Average requests per second over the last 5 minutes
# (applied to Counter metrics like orders_total defined in the Node.js example below)
rate(orders_total[5m])

# 99th percentile latency by service
# (applied to http_request_duration_seconds histogram metrics)
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))

# Error rate (proportion of orders with status="error")
sum(rate(orders_total{status="error"}[5m])) / sum(rate(orders_total[5m]))
```

PromQL Key Functions Quick Reference
- `rate()`: Per-second rate of change — always wrap Counter metrics with this
- `increase()`: Total increase over a given period
- `histogram_quantile()`: Calculate percentiles from a histogram
- `sum() by (label)`: Aggregate by label
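To complement the quick reference, here are two more examples covering `increase()` and `sum() by (...)`, applied to the `orders_total` counter defined in the Node.js section later in this guide:

```promql
# Total number of orders in the last hour (total increase over the window)
increase(orders_total[1h])

# Orders per second, broken down by payment method
sum(rate(orders_total[5m])) by (payment_method)
```

`increase()` is essentially `rate()` multiplied by the window length, so it answers "how many in total?" while `rate()` answers "how fast right now?".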
Grafana: The Role of the Visualization Layer
Grafana does not collect data itself. It focuses on connecting to Prometheus, Loki, Elasticsearch, and 100+ other data sources to visualize metrics, logs, and traces in a single dashboard.
```text
Metrics  → Prometheus ─────────────────────────┐
Logs     → Loki ───────────────────────────────┤ Grafana (Visualization)
Traces   → Tempo ──────────────────────────────┤
Profiles → Pyroscope ──────────────────────────┘
                     ↑
       Grafana Alloy (unified telemetry collector, OpenTelemetry compatible)
```

What is Grafana Alloy? It is the unified telemetry collector that replaces Grafana Agent, which reached EOL in 2025. It can collect and transform OpenTelemetry, Prometheus, Loki, and Pyroscope pipelines with a single collector, and is deployed as an agent on each server. The Docker Compose examples in this guide use direct scraping for simplicity.
The relationship between Grafana and Prometheus is like that of a database and a visualization tool. Grafana works without Prometheus, and Prometheus works without Grafana. But when the two tools work together, a powerful monitoring environment is complete.
Understanding the 4 Metric Types
Prometheus handles four types of metrics. Choosing the wrong type can distort your data, so understanding them is essential.
| Type | Characteristics | Typical Use Cases |
|---|---|---|
| Counter | Monotonically increasing, resets to 0 on restart | Total HTTP requests, error occurrence count |
| Gauge | Can increase or decrease, represents current state | CPU usage, current connection count, memory usage |
| Histogram | Distribution by bucket + sum + count | Response time distribution, request size distribution |
| Summary | Client-side percentile calculation | SLA-based latency measurement (rarely used) |
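To make the table concrete, here is a sketch of how the first three types appear in the Prometheus text exposition format when you open a `/metrics` endpoint. The metric names match the Node.js examples later in this guide; the values are illustrative:

```text
# HELP orders_total Total number of orders processed
# TYPE orders_total counter
orders_total{status="success",payment_method="card"} 1027

# HELP active_users_current Number of users with active sessions
# TYPE active_users_current gauge
active_users_current 43

# HELP http_request_duration_seconds HTTP request processing time (seconds)
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.1"} 240
http_request_duration_seconds_bucket{le="0.5"} 310
http_request_duration_seconds_bucket{le="+Inf"} 320
http_request_duration_seconds_sum 48.7
http_request_duration_seconds_count 320
```

Note that a Histogram expands into one `_bucket` series per `le` bound plus `_sum` and `_count`, which is exactly the shape `histogram_quantile()` consumes.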
Practical Application
Local Development Environment Setup (Docker Compose)
The fastest way to run the Prometheus + Grafana stack in a development environment.
First, create a .env file in the project root to separate secrets.
```shell
# .env
GRAFANA_PASSWORD=enter_a_secure_password_here
```

```yaml
# docker-compose.yml
services:
  prometheus:
    image: prom/prometheus:v3.1.0
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=15d'  # retain for 15 days
  grafana:
    image: grafana/grafana:11.4.0
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}
    volumes:
      - grafana-data:/var/lib/grafana
    depends_on:
      - prometheus
  node-exporter:
    image: prom/node-exporter:latest
    # host networking is required to accurately collect certain metrics like
    # network stats (Linux only); it exposes port 9100 directly on the host,
    # so no ports mapping is needed (Compose ignores ports under host networking)
    network_mode: host
    pid: host  # required for process information collection (Linux only)
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'

volumes:
  grafana-data:
```

Note for macOS users: `network_mode: host` and `pid: host` are Linux-only options and do not work with Docker Desktop on macOS. In a local macOS development environment, you can remove both options (and publish the port with `ports: ["9100:9100"]` instead) and still collect basic metrics, though some network statistics may be missing.
```yaml
# prometheus.yml — including node-exporter scraping
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
        # note: if node-exporter runs with network_mode: host, it leaves the
        # compose network, so scrape the host's address instead (e.g. the host
        # IP on Linux, or host.docker.internal:9100 on Docker Desktop)
```

| Component | Role | Access URL |
|---|---|---|
| Prometheus | Metric collection and storage | http://localhost:9090 |
| Grafana | Dashboard visualization | http://localhost:3000 |
| node-exporter | Expose host system metrics | http://localhost:9100/metrics |
After running docker compose up -d, connect to Grafana, add Prometheus as a data source (URL: http://prometheus:9090), and import Grafana's official dashboard ID 1860 to immediately view a host monitoring dashboard.
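Instead of clicking through the UI, the data source can also be registered automatically at startup via Grafana's provisioning mechanism. A minimal sketch, assuming you mount the file into the Grafana container at `/etc/grafana/provisioning/datasources/prometheus.yml` (the local path `./grafana/provisioning` is an assumption for this example):

```yaml
# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                # Grafana proxies queries server-side
    url: http://prometheus:9090  # container name on the compose network
    isDefault: true
```

To wire it up, add `- ./grafana/provisioning:/etc/grafana/provisioning` to the `volumes` of the `grafana` service; Grafana reads the directory on startup, which keeps the data source configuration in version control alongside the rest of the stack.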
Node.js Application Instrumentation
An example of defining your service's business metrics and collecting them with Prometheus. First, install the client library.
```shell
pnpm add prom-client
```

```typescript
// metrics.ts — using the prom-client library
import { Registry, Counter, Histogram, Gauge } from 'prom-client';

export const registry = new Registry();

// Order processing counter
export const ordersTotal = new Counter({
  name: 'orders_total',
  help: 'Total number of orders processed',
  labelNames: ['status', 'payment_method'],
  registers: [registry],
});

// API latency histogram.
// Bucket values cover a typical web API response time distribution (10ms–5s);
// adjust the intervals more finely around the thresholds your service SLA cares about.
export const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request processing time (seconds)',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.01, 0.05, 0.1, 0.3, 0.5, 1, 2, 5],
  registers: [registry],
});

// Current active user count
export const activeUsers = new Gauge({
  name: 'active_users_current',
  help: 'Number of users with active sessions',
  registers: [registry],
});
```

```typescript
// app.ts — Express middleware and /metrics endpoint
import express from 'express';
import { registry, httpRequestDuration, ordersTotal } from './metrics';

const app = express();

// Middleware to measure latency of all requests.
// req.route.path must be read inside res.on('finish') so the matched route
// template is recorded after routing is complete: req.route is undefined at
// middleware execution time, and using the raw path there would explode cardinality.
app.use((req, res, next) => {
  const end = httpRequestDuration.startTimer({ method: req.method });
  res.on('finish', () => {
    end({
      route: req.route?.path ?? 'unknown',
      status_code: res.statusCode,
    });
  });
  next();
});

// Prometheus scraping endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', registry.contentType);
  res.send(await registry.metrics());
});

// Example order processing route
app.post('/orders', async (req, res) => {
  // ... order processing logic ...
  ordersTotal.inc({ status: 'success', payment_method: 'card' });
  res.json({ success: true });
});

app.listen(8080); // matches the localhost:8080 target in prometheus.yml
```

Warning: Using `req.route?.path ?? req.path` as the route label directly at middleware execution time will record individual request paths like `/api/users/123` as-is because route matching hasn't happened yet, causing a cardinality explosion. Reference `req.route.path` inside the `res.on('finish')` callback as shown in the example above, or consider using the `express-prom-bundle` library.
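If you cannot rely on `req.route` at all (for example, on 404 responses, where it stays `undefined`), one fallback is to normalize raw paths yourself before using them as a label. A minimal sketch: the `normalizeRoute` helper below is hypothetical, not part of `prom-client` or Express, and only collapses purely numeric segments.

```typescript
// Collapse numeric path segments ("/api/users/123" → "/api/users/:id")
// so that raw request paths can be used as a bounded-cardinality label.
export function normalizeRoute(path: string): string {
  return path
    .split('/')
    .map((segment) => (/^\d+$/.test(segment) ? ':id' : segment))
    .join('/');
}
```

A real service may also need to collapse UUIDs, hashes, or slugs; `express-prom-bundle` ships similar path normalization out of the box, which is often the safer choice.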
Kubernetes Production Environment Setup (Advanced)
Prerequisites: A Kubernetes cluster and Helm 3.x must be installed. For local setups, you can create a cluster using kind or minikube.
In a Kubernetes environment, a single Helm chart can configure the entire monitoring stack. kube-prometheus-stack installs Prometheus, Grafana, Alertmanager, node-exporter, and kube-state-metrics all at once.
```shell
# Add Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install kube-prometheus-stack
helm install kube-prometheus-stack \
  prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set grafana.adminPassword=your-secure-password \
  --set prometheus.prometheusSpec.retention=30d
```

```shell
# Check installation status — verify all pods are in Running state
kubectl get pods -n monitoring

# Access Grafana dashboard via local port-forwarding
kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80 -n monitoring
```

These commands alone will automatically install the basic dashboards needed for Kubernetes operations, including node CPU and memory, pod status, and resource usage by namespace.
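To have the stack scrape your own application, kube-prometheus-stack watches for `ServiceMonitor` resources. A hedged sketch, assuming your app's Service carries the label `app: my-app` and exposes a named port `metrics`, and that the chart was installed with the release name `kube-prometheus-stack` (by default, the operator only picks up ServiceMonitors whose `release` label matches):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  labels:
    release: kube-prometheus-stack  # must match the Helm release name
spec:
  selector:
    matchLabels:
      app: my-app        # label on your app's Service (assumption)
  endpoints:
    - port: metrics      # named Service port exposing /metrics
      interval: 15s
```

Apply it with `kubectl apply -f`, then confirm the new target appears under Status → Targets in the Prometheus UI.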
Pros and Cons Analysis
Advantages
| Item | Details |
|---|---|
| Pull-based collection | Immediate detection when a service goes down. Prometheus polls each service, so an alert fires if there's no response |
| Powerful query language | PromQL enables complex aggregation, filtering, and rate calculations. Supports multi-dimensional data analysis based on labels |
| Rich exporter ecosystem | Official exporters provided for major middleware including MySQL, Redis, Nginx, and PostgreSQL |
| Versatile visualization | Grafana alone lets you view Prometheus, logs (Loki), and traces (Tempo) in a unified dashboard |
| Cloud-native standard | CNCF Graduated project. Native integration with the Kubernetes ecosystem |
| Fully managed option | Using Grafana Cloud allows you to start immediately without operating any infrastructure |
Disadvantages and Caveats
| Item | Details | Mitigation |
|---|---|---|
| Short-term retention limits | The default storage is not suitable for retention of months to years | Separate long-term storage with Thanos* or Grafana Mimir** |
| Complex HA setup | Prometheus itself is a single node. Duplicate scraping occurs when configuring high availability | Use Thanos Receive or Mimir's distributed ingestion layer |
| Cardinality explosion | A rapid increase in label combinations causes memory and storage usage to spike | Exclude high-cardinality labels like user_id and request_id at the label design stage |
| Weak default security | `/metrics` endpoints have no authentication by default | Network isolation (internal network only) or TLS + Basic Auth configuration |
| PromQL learning curve | `rate`, `histogram_quantile`, etc. are not intuitive at first | Start with Grafana's query builder UI, then progressively learn PromQL |
* Thanos: An open-source project providing a global query layer across Prometheus clusters and long-term object storage (S3, etc.) integration
** Grafana Mimir: A `remote_write` backend for Prometheus supporting horizontal scaling and long-term retention. Grafana Labs reports that Mimir 3.0, released in 2025, reduced memory usage by up to 92% with a new query engine.
The Most Common Mistakes in Practice
- Graphing a Counter directly without `rate()` — A Counter is a cumulative value, so it should always be wrapped with `rate()` or `increase()` to view the rate of change over time. A raw cumulative graph almost always results in a meaningless upward-sloping straight line.

- Putting unique values in labels — Using values that change per request as labels, like `{user_id="12345"}` or `{request_id="abc-xyz"}`, can cause a cardinality explosion that drives Prometheus into an out-of-memory state within hours. It is recommended to use labels only for "classification of state," such as `{status="success"}` or `{region="ap-northeast-2"}`.

- Setting the scrape interval too short — Setting `scrape_interval: 1s` will overload the Prometheus server when there are many targets. 15–30 seconds is appropriate for most production environments, and shortening the interval should be done only as an exception when precise SLA measurement is required.

What is Cardinality? It refers to the number of unique label value combinations. For example, if a `user_id` label contains the IDs of 1 million users, 1 million time series are created. Because Prometheus keeps all of these time series in memory, high-cardinality labels are a primary cause of OOM (Out of Memory).
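To check whether cardinality is already getting out of hand, you can ask Prometheus itself which metric names carry the most active time series. The following diagnostic query is a common starting point:

```promql
# Top 10 metric names by number of active time series
topk(10, count by (__name__)({__name__=~".+"}))
```

If one of your own metrics appears near the top with tens of thousands of series, inspect its labels before it inspects your memory.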
Closing Thoughts
The combination of Prometheus and Grafana goes beyond a simple server monitoring tool — it is core modern engineering infrastructure that transforms the state of your service into real-time business intelligence.
It may feel complex, but you can start with small steps. Here are 3 steps you can take right now.
1. Start the local stack: Run `docker compose up -d` with the `docker-compose.yml` provided above. If Prometheus and Grafana fail to connect after the stack comes up, first check the container status with `docker compose ps` and verify that the Grafana data source URL is set to `http://prometheus:9090` (based on the container name).

2. Instrument your application: Add an official client library such as `prom-client` (Node.js), `prometheus_client` (Python), or `micrometer` (Java/Spring) to the service you are currently developing, and expose the single most important metric (e.g., API request count) via `/metrics`. It is recommended to read the cardinality explosion warnings before designing your labels.

3. Set up your first alert: Use Grafana's Unified Alerting feature to set up a rule that sends a Slack or email notification when the error rate exceeds 1%. The moment you receive that alert, you will feel monitoring transform from a simple dashboard into a tool that actively guards your service.
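For step 3, the PromQL expression is the same whether you define the alert in Grafana's UI or as a Prometheus alerting rule. Here is a sketch as a Prometheus-style rule file, reusing the error-rate query from earlier; the 1% threshold and the `for` duration are assumptions you should tune to your service:

```yaml
# alert-rules.yml — Prometheus alerting rule (the expr works in Grafana Unified Alerting too)
groups:
  - name: order-alerts
    rules:
      - alert: HighOrderErrorRate
        expr: |
          sum(rate(orders_total{status="error"}[5m]))
            / sum(rate(orders_total[5m])) > 0.01
        for: 5m  # must stay above the threshold for 5 minutes before firing
        labels:
          severity: warning
        annotations:
          summary: "Order error rate has exceeded 1% for 5 minutes"
```

The `for` clause guards against paging on a single noisy scrape, a common first-alert mistake.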
Next post: We plan to cover how to connect logs and traces with Grafana Loki and Tempo to build true Full-Stack Observability.