What I Was Running Before (And Why I Switched)
Before settling on Prometheus + Grafana, I spent a few weeks evaluating monitoring stacks. Three contenders made the shortlist: the ELK stack (Elasticsearch + Logstash + Kibana), Datadog, and the Prometheus + Grafana combo. Picking the wrong one isn’t just inconvenient — it means migrating dashboards, retraining muscle memory, and re-instrumenting services. Weeks of pain.
Here’s how I framed the comparison at the time.
Approach 1: Managed SaaS (Datadog, New Relic)
Zero setup overhead. Ship an agent, metrics flow in, dashboards appear. For teams that don’t want to think about infrastructure, it works well. The catch is cost. Datadog runs roughly $23/host/month for base infrastructure monitoring — add APM or custom metrics and you’re pushing $40–60/host. At 50 servers, that’s $2,000–3,000/month before you’ve added a single custom metric series. For a startup or lean team, that math gets uncomfortable fast.
Approach 2: ELK Stack
Elasticsearch is a search engine. People bolt it onto metrics pipelines because it can store time-series data, but it’s optimized for log search, not metric aggregation. The memory footprint alone is painful — a modest Elasticsearch cluster for metrics might need 16–32GB RAM versus 4GB for a comparable Prometheus setup. Kibana’s visualization has improved a lot, but building a dashboard still feels awkward compared to Grafana’s interface.
Approach 3: Prometheus + Grafana
This is what I run now. Prometheus is designed specifically for metrics: it scrapes endpoints on a pull model, stores data in a compressed time-series format, and ships with PromQL — a query language that maps naturally to how you think about system behavior. Grafana connects to Prometheus (and dozens of other data sources) and gives you dashboards that are actually good to use.
Adoption is fast because the exporter ecosystem covers almost everything. PostgreSQL, Redis, Nginx, Kubernetes, RabbitMQ — nearly every serious open-source project ships a /metrics endpoint or has a community exporter. You’re monitoring your stack within hours, not days.
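To see what those /metrics endpoints actually serve, here's a hypothetical sample of the text-based exposition format (metric values are made up; a real Node Exporter emits hundreds of series, but the shape is exactly this):

```shell
# Write a small sample of the Prometheus exposition format to a file.
# Lines starting with '#' are HELP/TYPE metadata; the rest are samples.
cat <<'EOF' > /tmp/sample_metrics.txt
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 12345.67
node_cpu_seconds_total{cpu="0",mode="user"} 234.56
# HELP node_memory_MemAvailable_bytes Memory available in bytes.
# TYPE node_memory_MemAvailable_bytes gauge
node_memory_MemAvailable_bytes 8.123456e+09
EOF

# Count the samples (non-comment lines) Prometheus would ingest
grep -cv '^#' /tmp/sample_metrics.txt
```

Against a live exporter, `curl -s http://server1:9100/metrics` returns the same format, just much longer.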
Honest Pros and Cons After Six Months
What Prometheus + Grafana Gets Right
- Pull-based scraping is operationally sane. Services expose metrics; Prometheus fetches them on schedule. If a service dies, you know immediately — the scrape fails and alerts fire. Push-based systems can mask failures silently if the client simply stops sending data.
- PromQL is powerful once it clicks. The learning curve is real. But once you internalize rate(), irate(), histogram_quantile(), and label matchers, you can answer nearly any question about your system in seconds.
- The exporter ecosystem covers almost everything. Node Exporter, cAdvisor, PostgreSQL Exporter, Redis Exporter, Nginx Exporter — if you run it, there’s probably an exporter for it. No custom instrumentation needed for standard infrastructure components.
- Community dashboards save real time. Grafana.com/dashboards has thousands of community-published setups. Import an ID and you have a production-ready dashboard in under a minute.
- Resource footprint is manageable. Monitoring 50 hosts at 15-second scrape intervals runs comfortably on 2 cores and 4GB RAM. Storage stays reasonable too — 30 days of metrics for 50 hosts typically lands under 20GB with default compression settings.
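To make the PromQL point concrete, here are the kinds of queries that become second nature. Metric names like http_requests_total are illustrative — yours depend on how your services are instrumented:

```promql
# Per-second request rate, averaged over the last 5 minutes
rate(http_requests_total[5m])

# 95th-percentile latency from a histogram, aggregated per service
histogram_quantile(0.95,
  sum by (service, le) (rate(http_request_duration_seconds_bucket[5m])))

# Error ratio: 5xx responses as a fraction of all responses
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))
```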
Where It Falls Short
- Long-term storage requires extra work. Prometheus’s local TSDB isn’t designed for multi-year retention. Beyond 15–30 days, you need a remote write target: Thanos, Cortex, or VictoriaMetrics.
- Alertmanager has a steep config curve. The routing tree is flexible but not intuitive. Budget a few hours to get silence rules and receiver routing working the way you expect.
- No built-in log correlation. Prometheus is metrics-only. Adding logs means adding Loki — which integrates well with Grafana, but it’s another component to operate and debug.
- PromQL has edge cases that will surprise you. Staleness handling, range selectors, and counter resets behave in ways that aren’t obvious at first. Alerts firing unexpectedly during service restarts — due to how Prometheus handles counter resets — is a common early gotcha. It takes some reading to fully understand.
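For reference, the Alertmanager routing tree that takes a few hours to internalize looks roughly like this — a minimal sketch, assuming a Slack webhook and a PagerDuty integration key (both placeholders):

```yaml
route:
  receiver: slack-default          # fallback receiver for everything
  group_by: ['alertname', 'instance']
  group_wait: 30s                  # wait briefly to batch related alerts
  repeat_interval: 4h              # re-notify unresolved alerts every 4h
  routes:
    - matchers:
        - severity="critical"      # criticals escalate to PagerDuty
      receiver: pagerduty

receivers:
  - name: slack-default
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX'  # placeholder
        channel: '#alerts'
  - name: pagerduty
    pagerduty_configs:
      - service_key: 'YOUR-INTEGRATION-KEY'  # placeholder
```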
Recommended Setup for Most Teams
For a team running 10–100 servers with standard Linux workloads, containers, and a few databases, here’s what I’d deploy:
- Prometheus — core metrics collection, 30-day local retention
- Node Exporter — system-level metrics on every host (CPU, memory, disk, network)
- cAdvisor — container metrics if you’re running Docker or Kubernetes
- Alertmanager — alert routing to Slack/PagerDuty/email
- Grafana — dashboards and visualization
Need longer retention or Prometheus HA? Add VictoriaMetrics as a remote write target. It’s simpler to operate than Thanos and stores metrics at roughly a fifth the disk space of raw Prometheus.
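Wiring that up is a small addition to prometheus.yml. A sketch, assuming a single-node VictoriaMetrics instance on the same box listening on its default port 8428:

```yaml
# Append to /etc/prometheus/prometheus.yml
remote_write:
  - url: http://localhost:8428/api/v1/write
```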
Implementation Guide
Step 1: Install Prometheus
Download and install Prometheus on your monitoring server:
# Create a dedicated user
sudo useradd --no-create-home --shell /bin/false prometheus
# Download Prometheus (check latest version at prometheus.io)
wget https://github.com/prometheus/prometheus/releases/download/v2.51.0/prometheus-2.51.0.linux-amd64.tar.gz
tar xvf prometheus-2.51.0.linux-amd64.tar.gz
cd prometheus-2.51.0.linux-amd64
# Install binaries
sudo cp prometheus /usr/local/bin/
sudo cp promtool /usr/local/bin/
sudo chown prometheus:prometheus /usr/local/bin/prometheus
sudo chown prometheus:prometheus /usr/local/bin/promtool
# Set up directories
sudo mkdir /etc/prometheus /var/lib/prometheus
sudo chown prometheus:prometheus /etc/prometheus /var/lib/prometheus
Create a basic configuration at /etc/prometheus/prometheus.yml:
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

rule_files:
  - "/etc/prometheus/rules/*.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
      - targets:
          - 'server1:9100'
          - 'server2:9100'
Create a systemd service file at /etc/systemd/system/prometheus.service:
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target
[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
    --config.file /etc/prometheus/prometheus.yml \
    --storage.tsdb.path /var/lib/prometheus/ \
    --storage.tsdb.retention.time=30d \
    --web.enable-lifecycle \
    --web.listen-address=0.0.0.0:9090
[Install]
WantedBy=multi-user.target
sudo systemctl daemon-reload
sudo systemctl enable prometheus
sudo systemctl start prometheus
Step 2: Install Node Exporter on Each Host
wget https://github.com/prometheus/node_exporter/releases/download/v1.8.0/node_exporter-1.8.0.linux-amd64.tar.gz
tar xvf node_exporter-1.8.0.linux-amd64.tar.gz
sudo cp node_exporter-1.8.0.linux-amd64/node_exporter /usr/local/bin/
sudo useradd --no-create-home --shell /bin/false node_exporter
Create /etc/systemd/system/node_exporter.service:
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target
[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter
[Install]
WantedBy=multi-user.target
sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter
Step 3: Install Grafana
# Ubuntu/Debian
sudo apt-get install -y apt-transport-https software-properties-common wget
sudo mkdir -p /etc/apt/keyrings/
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt-get update
sudo apt-get install grafana
sudo systemctl enable grafana-server
sudo systemctl start grafana-server
Grafana runs on port 3000. Default login is admin / admin — change it on first login.
Step 4: Connect Prometheus to Grafana
- Open Grafana at http://your-server:3000
- Go to Connections → Data Sources → Add data source
- Select Prometheus
- Set URL to http://localhost:9090
- Click Save & Test
Import dashboard ID 1860 — the Node Exporter Full dashboard. It covers CPU, memory, disk, and network for every host in your Prometheus config. Zero additional configuration required.
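If you want to see what a dashboard like that is doing under the hood — or build your own panels — the queries follow a common pattern. These use standard Node Exporter metric names:

```promql
# Memory utilization as a percentage, per host
100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

# Root filesystem usage as a percentage
100 * (1 - node_filesystem_avail_bytes{mountpoint="/"}
           / node_filesystem_size_bytes{mountpoint="/"})

# Network receive throughput in bytes/sec, per interface
rate(node_network_receive_bytes_total[5m])
```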
Step 5: A Useful Alert Rule to Start With
Create /etc/prometheus/rules/node.yml:
groups:
  - name: node_alerts
    rules:
      - alert: HighCPULoad
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU load on {{ $labels.instance }}"
          description: "CPU usage is {{ $value | printf \"%.1f\" }}%"

      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 15
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Root filesystem has {{ $value | printf \"%.1f\" }}% free"
Reload the Prometheus config without a restart (the /-/reload endpoint only responds when Prometheus was started with the --web.enable-lifecycle flag):
curl -X POST http://localhost:9090/-/reload
Six Months Later
No major incidents caused by monitoring gaps. Three disk-full situations caught and resolved before they became outages. A Node.js service with a slow memory leak — growing roughly 200MB per day — identified and patched before it took down production. Monitoring costs dropped about 70% compared to what Datadog would have cost at this scale.
Initial setup runs 4–6 hours end-to-end. That covers Prometheus, Node Exporter on every host, Grafana, and a first pass at alert rules. Most teams recoup that time within the first week.
Running more than a handful of servers without structured metrics means flying blind. This stack fixes that without a $3,000/month bill.

