Linux Network Performance Monitoring with Prometheus and Grafana: A to Z Guide


The Problem That Woke Me Up at 3 AM

A few months into running a production microservices stack, I started getting random user complaints — pages loading slowly, API timeouts, dropped connections. CPU and RAM metrics looked fine on our old Zabbix setup. Disk I/O? Normal. Nothing obvious.

Turned out our servers were silently saturating their network interfaces during peak hours. Packets were being dropped, TCP retransmits were spiking, and we had zero visibility into any of it. Our monitoring simply wasn’t capturing network-layer metrics with enough granularity or speed.

That experience pushed me to rebuild our observability stack from scratch around Prometheus and Grafana, focused specifically on the network layer. Three months later: no more mystery slowdowns, and alerts fire two to four minutes before users notice anything wrong.

Root Cause: What Most Monitoring Setups Miss

Standard system monitoring covers CPU, memory, and disk. Network gets surface-level treatment — bandwidth in/out, maybe ping latency. That’s nowhere near enough for diagnosing real issues. You need metrics like:

  • TCP retransmits — high values mean congestion or packet loss upstream
  • Network errors and drops per interface — driver-level issues, bad cables, misconfigured NICs
  • Connection states — too many TIME_WAIT or CLOSE_WAIT sockets can exhaust ephemeral ports
  • Bandwidth utilization per interface — not just totals, but per-interface trends over time
  • Socket queue depths — receive/send buffer overflows silently drop data

Linux already exposes all of this through /proc/net/ and /proc/sys/net/. Prometheus's node_exporter reads those files and publishes them as metrics automatically. You just need to wire it up correctly.
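If you want to see exactly where those numbers come from before installing anything, the underlying counters are plain text files you can read on any Linux box:

cat /proc/net/dev        # per-interface bytes, packets, errors, drops
cat /proc/net/snmp       # protocol counters, including Tcp RetransSegs
cat /proc/net/sockstat   # socket counts: TCP inuse, tw (TIME_WAIT), and more

node_exporter parses these same files on every scrape, so anything you can spot here will show up as a metric.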

Solutions Compared: What to Use and Why

Option 1: netdata

Netdata installs fast and has beautiful built-in dashboards. I used it early on. The catch: it’s optimized for real-time viewing, not long-term retention or alerting integration. Correlating network events with application logs becomes painful without a real query language.

Option 2: Telegraf + InfluxDB + Grafana (TIG Stack)

Solid option, especially if you’re already on InfluxDB. Telegraf has strong network input plugins. But InfluxDB’s licensing changed in 2023, and the operational overhead of a separate time-series DB alongside Prometheus wasn’t worth it for us.

Option 3: Prometheus + node_exporter + Grafana (Recommended)

This is what stuck. node_exporter is maintained by the Prometheus project itself — it exposes hundreds of Linux kernel metrics including full network stats, and integrates natively with Grafana and Alertmanager. PromQL gives you the flexibility to build any query you need, from simple bandwidth tracking to complex retransmit ratio calculations.

The Best Approach: Step-by-Step Setup

Step 1: Install node_exporter on Your Linux Host

Download the latest release from GitHub and run it as a systemd service:

# Download and extract
wget https://github.com/prometheus/node_exporter/releases/download/v1.8.1/node_exporter-1.8.1.linux-amd64.tar.gz
tar xvf node_exporter-1.8.1.linux-amd64.tar.gz
sudo mv node_exporter-1.8.1.linux-amd64/node_exporter /usr/local/bin/

# Create systemd service
sudo useradd --no-create-home --shell /bin/false node_exporter

cat <<EOF | sudo tee /etc/systemd/system/node_exporter.service
[Unit]
Description=Prometheus Node Exporter
After=network.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter \\
  --collector.netstat \\
  --collector.netdev \\
  --collector.conntrack \\
  --collector.sockstat
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter

Verify it’s running and exposing metrics:

curl http://localhost:9100/metrics | grep node_network
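If the collectors are active, the output includes per-interface counters roughly like this (device names and values will obviously differ on your hosts):

node_network_receive_bytes_total{device="eth0"} 8.6421345e+09
node_network_receive_drop_total{device="eth0"} 0
node_network_transmit_bytes_total{device="eth0"} 5.1200873e+09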

Step 2: Configure Prometheus to Scrape node_exporter

Add a scrape job to your prometheus.yml:

scrape_configs:
  - job_name: 'linux-nodes'
    scrape_interval: 15s
    static_configs:
      - targets:
          - '192.168.1.10:9100'
          - '192.168.1.11:9100'
        labels:
          env: 'production'
          region: 'ap-northeast-1'

Reload Prometheus after editing:

sudo systemctl reload prometheus
# Or via the HTTP API (requires Prometheus to be started with --web.enable-lifecycle):
curl -X POST http://localhost:9090/-/reload
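Before reloading, I always run the config through promtool, which ships in the Prometheus release tarball; adjust the path if your prometheus.yml lives somewhere else:

promtool check config /etc/prometheus/prometheus.yml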

Step 3: Key Network Metrics to Query

These are the PromQL queries I reach for first when diagnosing network issues:

Network receive/transmit bandwidth per interface, in bits/sec (the * 8 converts the byte counter to bits):

rate(node_network_receive_bytes_total{device!~"lo|veth.*"}[5m]) * 8

Packet drop rate (critical — should be near zero):

rate(node_network_receive_drop_total[5m])
rate(node_network_transmit_drop_total[5m])

TCP retransmit rate:

rate(node_netstat_Tcp_RetransSegs[5m])
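The raw rate is fine for alerting, but when comparing hosts with very different traffic volumes I prefer the retransmit ratio: retransmitted segments as a fraction of all segments sent. A query along these lines works, since both counters are in the netstat collector's default field set:

rate(node_netstat_Tcp_RetransSegs[5m]) / rate(node_netstat_Tcp_OutSegs[5m])

As a rough rule of thumb, anything consistently above 1% is worth investigating.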

Active connections by state (TIME_WAIT, ESTABLISHED, etc.):

node_sockstat_TCP_tw          # TIME_WAIT
node_sockstat_TCP_inuse       # TCP sockets in use (not just ESTABLISHED)
node_sockstat_sockets_used    # Total sockets in use

NIC errors — useful for catching hardware or driver problems:

rate(node_network_receive_errs_total[5m])
rate(node_network_transmit_errs_total[5m])
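When these counters climb, ethtool's driver-level statistics usually tell you whether to blame the cable, the ring buffers, or the driver itself (swap eth0 for your interface name):

ethtool -S eth0 | grep -iE 'err|drop|miss'   # per-driver error and drop counters
ethtool eth0 | grep -E 'Speed|Duplex'        # catch speed/duplex mismatches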

Step 4: Build Your Grafana Dashboard

Skip building from scratch — import dashboard ID 1860 (Node Exporter Full) from Grafana’s official library. It covers most Linux system metrics including network, and gives you a solid baseline to customize from.

For a network-focused dashboard, build panels around these configurations:

  • Panel 1 — Bandwidth per interface: Use the rate(node_network_receive_bytes_total) query (example below), visualization type: Time series, unit: bytes/sec
  • Panel 2 — Drop/Error rate: Stack receive drops and transmit drops, threshold line at 1 pkt/s
  • Panel 3 — TCP retransmits: Single stat showing 5m rate, color red above 10/s
  • Panel 4 — Socket states: Gauge for TIME_WAIT with threshold at 10,000
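For Panel 1, the query as I enter it in the panel editor looks something like this, assuming a dashboard variable called $instance for host selection; setting the legend format to {{device}} splits the graph per interface:

rate(node_network_receive_bytes_total{instance="$instance", device!~"lo|veth.*"}[5m])

Duplicate the panel with node_network_transmit_bytes_total for the outbound side.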

Step 5: Set Up Alerting Rules

Drop a network_alerts.yml file into your Prometheus config directory:

groups:
  - name: network
    rules:
      - alert: HighPacketDropRate
        expr: rate(node_network_receive_drop_total[5m]) > 10
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High packet drop rate on {{ $labels.instance }}"
          description: "Interface {{ $labels.device }} is dropping {{ $value | humanize }} packets/sec"

      - alert: HighTCPRetransmits
        expr: rate(node_netstat_Tcp_RetransSegs[5m]) > 50
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "TCP retransmit storm on {{ $labels.instance }}"

      - alert: NetworkInterfaceSaturated
        expr: |
          rate(node_network_transmit_bytes_total{device!~"lo|veth.*"}[5m])
          / node_network_speed_bytes{device!~"lo|veth.*"} > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "NIC approaching saturation on {{ $labels.instance }} / {{ $labels.device }}"

      - alert: TooManyTimeWaitSockets
        expr: node_sockstat_TCP_tw > 15000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Excessive TIME_WAIT sockets — possible connection leak"

Reference this file in your prometheus.yml:

rule_files:
  - "/etc/prometheus/network_alerts.yml"

Tips From Real-World Operation

Exclude Virtual Interfaces From Network Panels

Docker and systemd-networkd create dozens of veth, br-, and docker0 interfaces. They’ll clutter every dashboard. Exclude them with a regex in your queries:

rate(node_network_receive_bytes_total{device!~"lo|veth.*|br-.*|docker.*"}[5m])
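If you never want those series at all, you can drop them at scrape time instead of filtering in every query. A sketch of what that looks like with metric_relabel_configs on the scrape job from Step 2 (be aware this discards the data permanently):

    metric_relabel_configs:
      - source_labels: [device]
        regex: 'veth.*|br-.*|docker.*'
        action: drop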

Correlate Network Drops with System Load

When packet drops spike, the first question is: NIC problem or CPU problem? High softirq CPU time alongside packet drops usually means the interrupt handler can’t keep up. At that point, check NIC interrupt affinity or enable RSS (Receive Side Scaling):

# Check interrupt distribution across CPUs
grep eth0 /proc/interrupts

# Allow each eth0 IRQ on cores 0-3 (mask 0f); multi-queue NICs expose several IRQs
for irq in $(grep eth0 /proc/interrupts | awk '{print $1}' | tr -d ':'); do
  echo 0f | sudo tee /proc/irq/$irq/smp_affinity
done
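On the Prometheus side, softirq pressure shows up in the standard CPU metric; graphing something like this next to the drop rate answers the NIC-versus-CPU question at a glance:

sum by (instance) (rate(node_cpu_seconds_total{mode="softirq"}[5m]))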

Watch for Conntrack Table Exhaustion

On busy servers, the netfilter connection tracking table fills up and silently drops new connections. It’s one of the nastiest failure modes I’ve hit — no obvious error, just requests mysteriously failing. Monitor the usage ratio:

# Alert if above 80%
node_nf_conntrack_entries / node_nf_conntrack_entries_limit

Consistently above 70%? Raise the limit:

sudo sysctl -w net.netfilter.nf_conntrack_max=262144
echo 'net.netfilter.nf_conntrack_max = 262144' | sudo tee /etc/sysctl.d/99-conntrack.conf
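This check belongs in the same network_alerts.yml from Step 5. A sketch of the rule I'd add (the 80% threshold and timing are judgment calls, so tune them for your traffic):

      - alert: ConntrackNearLimit
        expr: node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Conntrack table above 80% on {{ $labels.instance }}"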

Use Recording Rules for Heavy Queries

Network metrics generate high cardinality fast — many interfaces across many hosts. Pre-compute expensive queries as recording rules so Grafana dashboards stay snappy:

groups:
  - name: network_recording
    interval: 30s
    rules:
      - record: instance:node_network_transmit_bytes:rate5m
        expr: sum by (instance) (rate(node_network_transmit_bytes_total{device!~"lo|veth.*"}[5m]))
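Register the recording file in prometheus.yml the same way as the alert rules (the file name here is just my convention), then point Grafana panels at the pre-computed series instead of the raw rate():

rule_files:
  - "/etc/prometheus/network_alerts.yml"
  - "/etc/prometheus/network_recording.yml"

In panels, querying instance:node_network_transmit_bytes:rate5m directly is noticeably faster than re-evaluating the sum and rate across every interface on every dashboard refresh.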

What the Stack Looks Like After a Few Months

Running this for a few months changes how you think about infrastructure. It’s not just about responding to incidents faster — you start doing actual capacity planning. I can see which servers are approaching NIC saturation a week before it becomes a problem, spot services with abnormal TCP retransmit patterns after a deploy, and track whether a kernel update shifted network behavior.

node_exporter covers the breadth, PromQL gives you the flexibility to slice any dimension you need, and Grafana ties it together visually. The whole stack runs on a single 2-vCPU/4GB VM handling scrapes from 20+ nodes without breaking a sweat — well worth the one-time setup cost.
