The 2 AM Wake-up Call
It’s 2:14 AM. My phone is vibrating off the nightstand with a PagerDuty alert that says one thing: API Latency > 5s. My first instinct is to check the application logs, but everything there looks clean. The CPU on the web servers is sitting at a comfortable 15% utilization. Then I check the database metrics. High lock contention. A rogue migration script is holding a metadata lock on the most active table in the system, and requests are stacking up behind it like a multi-car pileup on the highway.
Without a solid monitoring stack, I would have spent the next two hours tailing logs and guessing. Whether you use MySQL, PostgreSQL, or MongoDB, these databases share a common trait: they fail silently until the entire application grinds to a halt. This guide walks through setting up a battle-tested monitoring pipeline using Prometheus and Grafana. We want to catch these issues while they are still minor hiccups, not full-blown outages.
Quick Start: Up and Running in 5 Minutes
If you need visibility right now, the “Exporter” pattern is your best bet. Prometheus doesn’t talk to databases directly. Instead, it scrapes metrics from an exporter—a small sidecar service that translates database internals into a format Prometheus can digest.
1. Deploy the Exporter
For PostgreSQL, use postgres_exporter. For MySQL, mysqld_exporter is the standard. Here is a docker-compose.yml snippet to get a PostgreSQL exporter running alongside your database immediately:
services:
  db:
    image: postgres:15
    environment:
      POSTGRES_PASSWORD: password123
  postgres-exporter:
    image: prometheuscommunity/postgres-exporter
    environment:
      DATA_SOURCE_NAME: "postgresql://postgres:password123@db:5432/postgres?sslmode=disable"
    ports:
      - "9187:9187"
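Before wiring up Prometheus, it’s worth confirming the exporter actually serves metrics. A quick check, assuming the compose stack above is running on the same host:

```shell
# The exporter answers on port 9187 with plain-text metrics.
# pg_up 1 means it can reach and authenticate against the database;
# pg_up 0 usually points at a bad DATA_SOURCE_NAME.
curl -s http://localhost:9187/metrics | grep '^pg_up' \
  || echo "exporter not reachable yet"
```

If you see hundreds of `pg_*` lines, the scrape side is done and the rest is plumbing.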
2. Configure Prometheus
Next, update your prometheus.yml file. This tells Prometheus where to look for the raw data every few seconds.
scrape_configs:
  - job_name: 'postgres_metrics'
    static_configs:
      - targets: ['postgres-exporter:9187']
3. Import a Dashboard
Once Prometheus starts pulling data, open Grafana and add Prometheus as a data source. Instead of building graphs from scratch, import Dashboard ID 9628 for Postgres or 7362 for MySQL. You’ll instantly see a live view of your database health.
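If you manage Grafana as code, the data source can be provisioned from a file instead of clicked together in the UI. A minimal sketch; the file path and datasource name are conventions, not requirements:

```yaml
# /etc/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
```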
Deep Dive: How the Monitoring Stack Actually Works
Spinning up tools is easy, but you need to understand the underlying architecture to troubleshoot effectively when things break. The system follows a pull-based model where Prometheus is the active collector.
The Role of the Exporter
The exporter acts as a translator. It runs SQL queries against the database’s internal statistics tables. In PostgreSQL, it’s looking at pg_stat_activity; in MySQL, it’s hitting the performance_schema. These exporters expose a /metrics endpoint that Prometheus scrapes at a configurable interval, usually every 15 to 60 seconds.
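To make the translation concrete, this is roughly the kind of statistics query a Postgres exporter runs under the hood (a sketch, not the exporter’s exact SQL):

```sql
-- Count backends per state so Prometheus can graph active vs. idle
-- connections over time. Each row becomes a labeled time series.
SELECT state, count(*)
FROM pg_stat_activity
WHERE state IS NOT NULL
GROUP BY state;
```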
Essential Metrics to Watch
Avoid the temptation to monitor every single variable. Focus on these “Golden Signals” to keep your sanity:
- Connections: Are you nearing your max_connections limit? If your DB allows 100 connections and you’re already sitting at 98, the next traffic spike will have Postgres rejecting new sessions with “sorry, too many clients already.”
- Buffer Cache Hit Ratio: For Postgres, aim for 99% or higher. If this dips to 90%, your database is reading from disk instead of RAM, which can turn a 1ms query into a 100ms nightmare.
- Transaction ID Wraparound (Postgres): This is the ultimate silent killer. If this reaches its limit, the database will literally stop accepting writes to prevent data corruption.
- Slow Queries: Track queries taking longer than 500ms. These are usually the first sign of missing indexes.
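Each of these signals reduces to a short PromQL expression. A sketch of recording rules using postgres_exporter’s default metric names (verify the names against your exporter version, as defaults vary across releases):

```yaml
groups:
  - name: golden_signal_recordings
    rules:
      # Connection saturation as a fraction of max_connections (0.0–1.0).
      - record: pg:connection_saturation
        expr: sum(pg_stat_database_numbackends) / pg_settings_max_connections
      # Buffer cache hit ratio over the last 5 minutes; should stay above 0.99.
      - record: pg:cache_hit_ratio
        expr: |
          sum(rate(pg_stat_database_blks_hit[5m]))
          / (sum(rate(pg_stat_database_blks_hit[5m]))
             + sum(rate(pg_stat_database_blks_read[5m])))
```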
Advanced Usage: Custom Metrics and Alerting
Standard exporters are great for hardware health, but they don’t know your business logic. You might need to know how many “Unprocessed Payments” are sitting in a table, as a spike there suggests your background workers are failing.
Custom Queries with SQL Exporter
The sql_exporter is perfect for this. It allows you to turn any SQL SELECT COUNT(*) into a Prometheus gauge without writing a single line of Go or Python code.
# custom_queries.yml
jobs:
  - name: "business_metrics"
    interval: "1m"
    queries:
      - name: "unprocessed_orders"
        help: "Orders stuck in pending status"
        values: [count]
        query: "SELECT count(*) as count FROM orders WHERE status = 'pending';"
Setting Up Alerts
Monitoring is useless if you have to stare at a screen all day. You need Prometheus Alertmanager to do the heavy lifting. Here is a rule I use to detect connection spikes before they crash the app:
groups:
  - name: database_alerts
    rules:
      - alert: HighConnectionCount
        expr: pg_stat_database_numbackends > 80
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High connections on {{ $labels.instance }}"
          description: "Database connections are at {{ $value }}, above the warning threshold of 80 for over 2 minutes."
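Firing alerts still need somewhere to go. A minimal Alertmanager routing sketch, where the receiver names, Slack channel, and PagerDuty key are all placeholders for your own integrations:

```yaml
# alertmanager.yml
route:
  receiver: slack-warnings        # default: everything goes to Slack
  routes:
    - matchers:
        - severity = "critical"   # page a human only for critical alerts
      receiver: pagerduty-oncall
receivers:
  - name: slack-warnings
    slack_configs:
      - channel: '#db-alerts'
  - name: pagerduty-oncall
    pagerduty_configs:
      - service_key: '<your-pagerduty-key>'
```

Routing warnings to chat and reserving pages for critical severity is what keeps the 2 AM wake-ups rare.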
Practical Tips for Production
After years of managing production clusters, I’ve learned that a poorly configured monitoring system can be just as dangerous as no monitoring at all. Here is how to stay safe.
Don’t Leave the Front Door Open
Exporters often run with high-level read access to your metadata. Never expose exporter ports like 9187 to the public internet. Use a private VPC or a Kubernetes sidecar pattern so only Prometheus can reach the endpoint. Always create a dedicated, restricted user for the exporter:
-- For PostgreSQL 10+
CREATE USER monitoring WITH PASSWORD 'use_a_strong_password';
GRANT pg_monitor TO monitoring;
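The MySQL side follows the same principle. These grants mirror what the mysqld_exporter documentation recommends, with the username and password as placeholders:

```sql
-- For MySQL: a capped, read-only monitoring account.
CREATE USER 'exporter'@'localhost' IDENTIFIED BY 'use_a_strong_password'
  WITH MAX_USER_CONNECTIONS 3;
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'localhost';
```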
Avoid the “Over-Scraping” Trap
Scraping every 1 second might feel like a good idea for high-resolution data, but it’s usually overkill. Querying performance schemas adds measurable load to your CPU. For most production workloads, a 15-second or 30-second interval provides plenty of detail without impacting query performance.
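The scrape interval can be tuned per job instead of globally, so a hot primary can be scraped more gently than a quiet replica. Extending the earlier scrape config:

```yaml
scrape_configs:
  - job_name: 'postgres_metrics'
    scrape_interval: 30s   # override the global default for this job only
    static_configs:
      - targets: ['postgres-exporter:9187']
```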
Curing Dashboard Fatigue
I have seen teams with 50 different graphs on a single page. When an incident occurs, they freeze because they don’t know which spike matters. Organize your Grafana dashboards by priority. Keep one “Executive Summary” dashboard with just the red/green status of your clusters, and link to “Deep Dive” dashboards for granular troubleshooting.
Effective monitoring is ultimately about getting more sleep. The more you automate the detection of bottlenecks, the less time you’ll spend playing detective at 2 AM. Start with the basic exporters, get your alerts tuned, and refine the system as you learn the unique quirks of your traffic.

