The Problem That Woke Me Up at 3 AM
A few months into running a production microservices stack, I started getting random user complaints — pages loading slowly, API timeouts, dropped connections. CPU and RAM metrics looked fine on our old Zabbix setup. Disk I/O? Normal. Nothing obvious.
Turned out our servers were silently saturating their network interfaces during peak hours. Packets were being dropped, TCP retransmits were spiking, and we had zero visibility into any of it. Our monitoring simply wasn’t capturing network-layer metrics with enough granularity or speed.
That experience pushed me to rebuild our observability stack from scratch — Prometheus and Grafana, focused specifically on network. Three months later: no more mystery slowdowns, and alerts fire two to four minutes before users notice anything wrong.
Root Cause: What Most Monitoring Setups Miss
Standard system monitoring covers CPU, memory, and disk. Network gets surface-level treatment — bandwidth in/out, maybe ping latency. That’s nowhere near enough for diagnosing real issues. You need metrics like:
- TCP retransmits — high values mean congestion or packet loss upstream
- Network errors and drops per interface — driver-level issues, bad cables, misconfigured NICs
- Connection states — too many TIME_WAIT or CLOSE_WAIT sockets can exhaust ephemeral ports
- Bandwidth utilization per interface — not just totals, but per-interface trends over time
- Socket queue depths — receive/send buffer overflows silently drop data
Linux already exposes all of this through /proc/net/ and /proc/sys/net/. node_exporter reads those files and publishes the values as Prometheus metrics — you just need to wire it up correctly.
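If you want to see the raw counters behind those metrics, they're plain text files. A quick look (Linux only; column layout is stable, values obviously vary per host):

```shell
# Per-interface byte/packet/error/drop counters (source of node_network_*)
head -4 /proc/net/dev

# Protocol-level TCP counters, including RetransSegs (source of node_netstat_Tcp_*)
# Line 1 is the header, line 2 the values; find the RetransSegs column and print it
grep '^Tcp:' /proc/net/snmp | \
  awk 'NR==1 {for (i = 1; i <= NF; i++) if ($i == "RetransSegs") col = i}
       NR==2 {print "RetransSegs:", $col}'
```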
Solutions Compared: What to Use and Why
Option 1: netdata
Netdata installs fast and has beautiful built-in dashboards. I used it early on. The catch: it’s optimized for real-time viewing, not long-term retention or alerting integration. Correlating network events with application logs becomes painful without a real query language.
Option 2: Telegraf + InfluxDB + Grafana (TIG Stack)
Solid option, especially if you’re already on InfluxDB. Telegraf has strong network input plugins. But InfluxDB’s licensing changed in 2023, and the operational overhead of a separate time-series DB alongside Prometheus wasn’t worth it for us.
Option 3: Prometheus + node_exporter + Grafana (Recommended)
This is what stuck. node_exporter is maintained by the Prometheus project itself — it exposes hundreds of Linux kernel metrics including full network stats, and integrates natively with Grafana and Alertmanager. PromQL gives you the flexibility to build any query you need, from simple bandwidth tracking to complex retransmit ratio calculations.
The Best Approach: Step-by-Step Setup
Step 1: Install node_exporter on Your Linux Host
Download the latest release from GitHub and run it as a systemd service:
# Download and extract
wget https://github.com/prometheus/node_exporter/releases/download/v1.8.1/node_exporter-1.8.1.linux-amd64.tar.gz
tar xvf node_exporter-1.8.1.linux-amd64.tar.gz
sudo mv node_exporter-1.8.1.linux-amd64/node_exporter /usr/local/bin/
# Create systemd service
sudo useradd --no-create-home --shell /bin/false node_exporter
cat <<EOF | sudo tee /etc/systemd/system/node_exporter.service
[Unit]
Description=Prometheus Node Exporter
After=network.target
[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter \\
--collector.netstat \\
--collector.netdev \\
--collector.conntrack \\
--collector.sockstat
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter
Verify it’s running and exposing metrics:
curl http://localhost:9100/metrics | grep node_network
Step 2: Configure Prometheus to Scrape node_exporter
Add a scrape job to your prometheus.yml:
scrape_configs:
  - job_name: 'linux-nodes'
    scrape_interval: 15s
    static_configs:
      - targets:
          - '192.168.1.10:9100'
          - '192.168.1.11:9100'
        labels:
          env: 'production'
          region: 'ap-northeast-1'
Reload Prometheus after editing:
sudo systemctl reload prometheus
# Or via HTTP API (requires Prometheus started with --web.enable-lifecycle):
curl -X POST http://localhost:9090/-/reload
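Once the reload succeeds, a quick way to confirm every node is actually reachable is Prometheus's built-in up metric (the job label here matches the scrape config above):

```promql
up{job="linux-nodes"} == 0
```

Any series this returns is a target Prometheus cannot scrape; an empty result means all nodes are healthy.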
Step 3: Key Network Metrics to Query
These are the PromQL queries I reach for first when diagnosing network issues:
Network receive/transmit bandwidth per interface (bits/sec; the trailing * 8 converts bytes to bits):
rate(node_network_receive_bytes_total{device!~"lo|veth.*"}[5m]) * 8
Packet drop rate (critical — should be near zero):
rate(node_network_receive_drop_total[5m])
rate(node_network_transmit_drop_total[5m])
TCP retransmit rate:
rate(node_netstat_Tcp_RetransSegs[5m])
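An absolute retransmit rate is hard to judge without knowing total traffic. For the retransmit ratio calculation mentioned earlier, divide by sent segments (node_netstat_Tcp_OutSegs is the matching counter from the same netstat collector):

```promql
rate(node_netstat_Tcp_RetransSegs[5m])
  / rate(node_netstat_Tcp_OutSegs[5m])
```

As a rough rule of thumb, a ratio sustained above about 0.01 (1% of segments retransmitted) is worth investigating.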
Active connections by state (TIME_WAIT, ESTABLISHED, etc.):
node_sockstat_TCP_tw         # TIME_WAIT sockets
node_sockstat_TCP_inuse      # TCP sockets in use
node_sockstat_sockets_used   # Total sockets in use
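These node_sockstat_* values come straight from /proc/net/sockstat, which is handy for cross-checking on the box itself when Prometheus and reality seem to disagree:

```shell
# Same counters node_exporter's sockstat collector reads:
# "tw" = TIME_WAIT, "inuse" = TCP sockets in use
cat /proc/net/sockstat
```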
NIC errors — useful for catching hardware or driver problems:
rate(node_network_receive_errs_total[5m])
rate(node_network_transmit_errs_total[5m])
Step 4: Build Your Grafana Dashboard
Skip building from scratch — import dashboard ID 1860 (Node Exporter Full) from Grafana’s official library. It covers most Linux system metrics including network, and gives you a solid baseline to customize from.
For a network-focused dashboard, build panels around these configurations:
- Panel 1 — Bandwidth per interface: Use the rate(node_network_receive_bytes_total) query, visualization type: Time series, unit: bytes/sec
- Panel 2 — Drop/Error rate: Stack receive drops and transmit drops, threshold line at 1 pkt/s
- Panel 3 — TCP retransmits: Single stat showing 5m rate, color red above 10/s
- Panel 4 — Socket states: Gauge for TIME_WAIT with threshold at 10,000
Step 5: Set Up Alerting Rules
Drop a network_alerts.yml file into your Prometheus config directory:
groups:
  - name: network
    rules:
      - alert: HighPacketDropRate
        expr: rate(node_network_receive_drop_total[5m]) > 10
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High packet drop rate on {{ $labels.instance }}"
          description: "Interface {{ $labels.device }} is dropping {{ $value | humanize }} packets/sec"
      - alert: HighTCPRetransmits
        expr: rate(node_netstat_Tcp_RetransSegs[5m]) > 50
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "TCP retransmit storm on {{ $labels.instance }}"
      - alert: NetworkInterfaceSaturated
        # node_network_speed_bytes is in bytes/sec, so compare byte rates directly
        expr: |
          rate(node_network_transmit_bytes_total{device!~"lo|veth.*"}[5m])
            / node_network_speed_bytes{device!~"lo|veth.*"} > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "NIC approaching saturation on {{ $labels.instance }} / {{ $labels.device }}"
      - alert: TooManyTimeWaitSockets
        expr: node_sockstat_TCP_tw > 15000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Excessive TIME_WAIT sockets — possible connection leak"
Reference this file in your prometheus.yml:
rule_files:
  - "/etc/prometheus/network_alerts.yml"
Tips From Real-World Operation
Exclude Virtual Interfaces From Network Panels
Docker and systemd-networkd create dozens of veth, br-, and docker0 interfaces. They’ll clutter every dashboard. Exclude them with a regex in your queries:
rate(node_network_receive_bytes_total{device!~"lo|veth.*|br-.*|docker.*"}[5m])
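The same regex can be sanity-checked on a host before you bake it into every panel. This lists the interfaces from /proc/net/dev that would survive the exclusion (purely illustrative; adjust the pattern to your naming scheme):

```shell
# Interface names are the first colon-delimited field, after two header lines
awk -F: 'NR > 2 {gsub(/ /, "", $1); print $1}' /proc/net/dev |
  grep -Ev '^(lo|veth.*|br-.*|docker.*)$' || echo "(only excluded interfaces present)"
```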
Correlate Network Drops with System Load
When packet drops spike, the first question is: NIC problem or CPU problem? High softirq CPU time alongside packet drops usually means the interrupt handler can’t keep up. At that point, check NIC interrupt affinity or enable RSS (Receive Side Scaling):
# Check interrupt distribution across CPUs
grep eth0 /proc/interrupts
# Set interrupt affinity (example: spread each eth0 IRQ across cores 0-3;
# a NIC usually has several queue IRQs, so loop over all of them)
for irq in $(grep eth0 /proc/interrupts | awk '{print $1}' | tr -d ':'); do
  echo 0f | sudo tee /proc/irq/$irq/smp_affinity
done
Watch for Conntrack Table Exhaustion
On busy servers, the netfilter connection tracking table fills up and silently drops new connections. It’s one of the nastiest failure modes I’ve hit — no obvious error, just requests mysteriously failing. Monitor the usage ratio:
# Alert if above 80%
node_nf_conntrack_entries / node_nf_conntrack_entries_limit
Consistently above 70%? Raise the limit:
sudo sysctl -w net.netfilter.nf_conntrack_max=262144
echo 'net.netfilter.nf_conntrack_max = 262144' | sudo tee /etc/sysctl.d/99-conntrack.conf
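If the node_nf_conntrack_* metrics aren't showing up at all (they only exist when the nf_conntrack module is loaded), you can check the same numbers directly on the host. A minimal sketch:

```shell
# Current vs. maximum conntrack entries; the files exist only when nf_conntrack is loaded
if [ -r /proc/sys/net/netfilter/nf_conntrack_count ]; then
  count=$(cat /proc/sys/net/netfilter/nf_conntrack_count)
  max=$(cat /proc/sys/net/netfilter/nf_conntrack_max)
  awk -v c="$count" -v m="$max" 'BEGIN { printf "conntrack usage: %d/%d (%.1f%%)\n", c, m, 100 * c / m }'
else
  echo "nf_conntrack not loaded"
fi
```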
Use Recording Rules for Heavy Queries
Network metrics generate high cardinality fast — many interfaces across many hosts. Pre-compute expensive queries as recording rules so Grafana dashboards stay snappy:
groups:
  - name: network_recording
    interval: 30s
    rules:
      - record: instance:node_network_transmit_bytes:rate5m
        expr: sum by (instance) (rate(node_network_transmit_bytes_total{device!~"lo|veth.*"}[5m]))
What the Stack Looks Like After a Few Months
Running this for a few months changes how you think about infrastructure. It’s not just about responding to incidents faster — you start doing actual capacity planning. I can see which servers are approaching NIC saturation a week before it becomes a problem, spot services with abnormal TCP retransmit patterns after a deploy, and track whether a kernel update shifted network behavior.
node_exporter covers the breadth, PromQL gives you the flexibility to slice any dimension you need, and Grafana ties it together visually. The whole stack runs on a single 2-vCPU/4GB VM handling scrapes from 20+ nodes without breaking a sweat — well worth the one-time setup cost.

