What I Was Running Before (And Why I Switched)
Before settling on Prometheus + Grafana, I spent a few weeks evaluating monitoring stacks. Three contenders made the shortlist: the ELK stack (Elasticsearch + Logstash + Kibana), Datadog, and the Prometheus + Grafana combo. Picking the wrong one isn’t just inconvenient — it means migrating dashboards, retraining muscle memory, and re-instrumenting services. Weeks of pain.
Here’s how I framed the comparison at the time.
Approach 1: Managed SaaS (Datadog, New Relic)
Zero setup overhead. Ship an agent, metrics flow in, dashboards appear. For teams that don’t want to think about infrastructure, it works well. The catch is cost. Datadog runs roughly $23/host/month for base infrastructure monitoring — add APM or custom metrics and you’re pushing $40–60/host. At 50 servers, that’s $2,000–3,000/month before you’ve added a single custom metric series. For a startup or lean team, that math gets uncomfortable fast.
Approach 2: ELK Stack
Elasticsearch is a search engine. People bolt it onto metrics pipelines because it can store time-series data, but it’s optimized for log search, not metric aggregation. The memory footprint alone is painful — a modest Elasticsearch cluster for metrics might need 16–32GB RAM versus 4GB for a comparable Prometheus setup. Kibana’s visualization has improved a lot, but building a dashboard still feels awkward compared to Grafana’s interface.
Approach 3: Prometheus + Grafana
This is what I run now. Prometheus is designed specifically for metrics: it scrapes endpoints on a pull model, stores data in a compressed time-series format, and ships with PromQL — a query language that maps naturally to how you think about system behavior. Grafana connects to Prometheus (and dozens of other data sources) and gives you dashboards that are actually good to use.
Adoption is fast because the exporter ecosystem covers almost everything. PostgreSQL, Redis, Nginx, Kubernetes, RabbitMQ — nearly every serious open-source project ships a /metrics endpoint or has a community exporter. You’re monitoring your stack within hours, not days.
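To see what those /metrics endpoints actually serve, here's a hypothetical sample of the text-based exposition format (metric values are made up; a real Node Exporter emits hundreds of series, but the shape is exactly this):

```shell
# Write a small sample of the Prometheus exposition format to a file.
# Lines starting with '#' are HELP/TYPE metadata; the rest are samples.
cat <<'EOF' > /tmp/sample_metrics.txt
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 12345.67
node_cpu_seconds_total{cpu="0",mode="user"} 234.56
# HELP node_memory_MemAvailable_bytes Memory available in bytes.
# TYPE node_memory_MemAvailable_bytes gauge
node_memory_MemAvailable_bytes 8.123456e+09
EOF

# Count the samples (non-comment lines) Prometheus would ingest
grep -cv '^#' /tmp/sample_metrics.txt
```

Against a live exporter, `curl -s http://server1:9100/metrics` returns the same format, just much longer.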
Honest Pros and Cons After Six Months
What Prometheus + Grafana Gets Right
- Pull-based scraping is operationally sane. Services expose metrics; Prometheus fetches them on schedule. If a service dies, you know immediately — the scrape fails and alerts fire. Push-based systems can mask failures silently if the client simply stops sending data.
- PromQL is powerful once it clicks. The learning curve is real. But once you internalize rate(), irate(), histogram_quantile(), and label matchers, you can answer nearly any question about your system in seconds.
- The exporter ecosystem covers almost everything. Node Exporter, cAdvisor, PostgreSQL Exporter, Redis Exporter, Nginx Exporter — if you run it, there’s probably an exporter for it. No custom instrumentation needed for standard infrastructure components.
- Community dashboards save real time. Grafana.com/dashboards has thousands of community-published setups. Import an ID and you have a production-ready dashboard in under a minute.
- Resource footprint is manageable. Monitoring 50 hosts at 15-second scrape intervals runs comfortably on 2 cores and 4GB RAM. Storage stays reasonable too — 30 days of metrics for 50 hosts typically lands under 20GB with default compression settings.
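To make the PromQL point concrete, here are the kinds of queries that become second nature. Metric names like http_requests_total are illustrative — yours depend on how your services are instrumented:

```promql
# Per-second request rate, averaged over the last 5 minutes
rate(http_requests_total[5m])

# 95th-percentile latency from a histogram, aggregated per service
histogram_quantile(0.95,
  sum by (service, le) (rate(http_request_duration_seconds_bucket[5m])))

# Error ratio: 5xx responses as a fraction of all responses
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))
```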
Where It Falls Short
- Long-term storage requires extra work. Prometheus’s local TSDB isn’t designed for multi-year retention. Beyond 15–30 days, you need a remote write target: Thanos, Cortex, or VictoriaMetrics.
- Alertmanager has a steep config curve. The routing tree is flexible but not intuitive. Budget a few hours to get silence rules and receiver routing working the way you expect.
- No built-in log correlation. Prometheus is metrics-only. Adding logs means adding Loki — which integrates well with Grafana, but it’s another component to operate and debug.
- PromQL has edge cases that will surprise you. Staleness handling, range selectors, and counter resets behave in ways that aren’t obvious at first. Alerts firing unexpectedly during service restarts — due to how Prometheus handles counter resets — is a common early gotcha. It takes some reading to fully understand.
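For reference, the Alertmanager routing tree that takes a few hours to internalize looks roughly like this — a minimal sketch, assuming a Slack webhook and a PagerDuty integration key (both placeholders):

```yaml
route:
  receiver: slack-default          # fallback receiver for everything
  group_by: ['alertname', 'instance']
  group_wait: 30s                  # wait briefly to batch related alerts
  repeat_interval: 4h              # re-notify unresolved alerts every 4h
  routes:
    - matchers:
        - severity="critical"      # criticals escalate to PagerDuty
      receiver: pagerduty

receivers:
  - name: slack-default
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX'  # placeholder
        channel: '#alerts'
  - name: pagerduty
    pagerduty_configs:
      - service_key: 'YOUR-INTEGRATION-KEY'  # placeholder
```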
Recommended Setup for Most Teams
For a team running 10–100 servers with standard Linux workloads, containers, and a few databases, here’s what I’d deploy:
- Prometheus — core metrics collection, 30-day local retention
- Node Exporter — system-level metrics on every host (CPU, memory, disk, network)
- cAdvisor — container metrics if you’re running Docker or Kubernetes
- Alertmanager — alert routing to Slack/PagerDuty/email
- Grafana — dashboards and visualization
Need longer retention or Prometheus HA? Add VictoriaMetrics as a remote write target. It’s simpler to operate than Thanos and stores metrics at roughly a fifth the disk space of raw Prometheus.
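Wiring that up is a small addition to prometheus.yml. A sketch, assuming a single-node VictoriaMetrics instance on the same box listening on its default port 8428:

```yaml
# Append to /etc/prometheus/prometheus.yml
remote_write:
  - url: http://localhost:8428/api/v1/write
```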
Implementation Guide
Step 1: Install Prometheus
Download and install Prometheus on your monitoring server:
# Create a dedicated user
sudo useradd --no-create-home --shell /bin/false prometheus
# Download Prometheus (check latest version at prometheus.io)
wget https://github.com/prometheus/prometheus/releases/download/v2.51.0/prometheus-2.51.0.linux-amd64.tar.gz
tar xvf prometheus-2.51.0.linux-amd64.tar.gz
cd prometheus-2.51.0.linux-amd64
# Install binaries
sudo cp prometheus /usr/local/bin/
sudo cp promtool /usr/local/bin/
sudo chown prometheus:prometheus /usr/local/bin/prometheus
sudo chown prometheus:prometheus /usr/local/bin/promtool
# Set up directories
sudo mkdir /etc/prometheus /var/lib/prometheus
sudo chown prometheus:prometheus /etc/prometheus /var/lib/prometheus
Create a basic configuration at /etc/prometheus/prometheus.yml:
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

rule_files:
  - "/etc/prometheus/rules/*.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
      - targets:
          - 'server1:9100'
          - 'server2:9100'
Create a systemd service file at /etc/systemd/system/prometheus.service:
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target
[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
    --config.file /etc/prometheus/prometheus.yml \
    --storage.tsdb.path /var/lib/prometheus/ \
    --storage.tsdb.retention.time=30d \
    --web.enable-lifecycle \
    --web.listen-address=0.0.0.0:9090
[Install]
WantedBy=multi-user.target
sudo systemctl daemon-reload
sudo systemctl enable prometheus
sudo systemctl start prometheus
Step 2: Install Node Exporter on Each Host
wget https://github.com/prometheus/node_exporter/releases/download/v1.8.0/node_exporter-1.8.0.linux-amd64.tar.gz
tar xvf node_exporter-1.8.0.linux-amd64.tar.gz
sudo cp node_exporter-1.8.0.linux-amd64/node_exporter /usr/local/bin/
sudo useradd --no-create-home --shell /bin/false node_exporter
Create /etc/systemd/system/node_exporter.service:
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target
[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter
[Install]
WantedBy=multi-user.target
sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter
Step 3: Install Grafana
# Ubuntu/Debian
sudo apt-get install -y apt-transport-https software-properties-common wget
sudo mkdir -p /etc/apt/keyrings/
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt-get update
sudo apt-get install grafana
sudo systemctl enable grafana-server
sudo systemctl start grafana-server
Grafana runs on port 3000. Default login is admin / admin — change it on first login.
Step 4: Connect Prometheus to Grafana
- Open Grafana at http://your-server:3000
- Go to Connections → Data Sources → Add data source
- Select Prometheus
- Set URL to http://localhost:9090
- Click Save & Test
Import dashboard ID 1860 — the Node Exporter Full dashboard. It covers CPU, memory, disk, and network for every host in your Prometheus config. Zero additional configuration required.
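If you want to see what a dashboard like that is doing under the hood — or build your own panels — the queries follow a common pattern. These use standard Node Exporter metric names:

```promql
# Memory utilization as a percentage, per host
100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

# Root filesystem usage as a percentage
100 * (1 - node_filesystem_avail_bytes{mountpoint="/"}
           / node_filesystem_size_bytes{mountpoint="/"})

# Network receive throughput in bytes/sec, per interface
rate(node_network_receive_bytes_total[5m])
```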
Step 5: A Useful Alert Rule to Start With
Create /etc/prometheus/rules/node.yml:
groups:
  - name: node_alerts
    rules:
      - alert: HighCPULoad
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU load on {{ $labels.instance }}"
          description: "CPU usage is {{ $value | printf \"%.1f\" }}%"

      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 15
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Root filesystem has {{ $value | printf \"%.1f\" }}% free"
Reload the Prometheus config without a restart (the /-/reload endpoint only responds when Prometheus was started with the --web.enable-lifecycle flag):
curl -X POST http://localhost:9090/-/reload
Six Months Later
No major incidents caused by monitoring gaps. Three disk-full situations caught and resolved before they became outages. A Node.js service with a slow memory leak — growing roughly 200MB per day — identified and patched before it took down production. Monitoring costs dropped about 70% compared to what Datadog would have cost at this scale.
Initial setup runs 4–6 hours end-to-end. That covers Prometheus, Node Exporter on every host, Grafana, and a first pass at alert rules. Most teams recoup that time within the first week.
Running more than a handful of servers without structured metrics means flying blind. This stack fixes that without a $3,000/month bill.

