The 2 AM Bandwidth Spike: Why Logs Aren’t Enough
Last month, an 850 Mbps bandwidth spike hit one of my production clusters at 2:14 AM. The billing alert from AWS arrived before my standard monitoring dashboard even blinked. I spent an hour digging through logs and trying to run tcpdump on various interfaces, only to realize I was looking for a needle in a haystack of millions of packets. Standard tools like iftop or vnstat told me the interface was saturated. They couldn’t tell me who was doing it or where the traffic was going.
This is where flow-based analysis changes the game. Instead of looking at every individual packet—which is computationally expensive and impossible to scale on 10Gbps+ links—we look at the metadata. This approach lets you see the ‘big picture’ across your entire infrastructure from a single pane of glass.
Approach Comparison: Packet Capture vs. Flow Analysis
When I talk to other engineers about network visibility, they often default to packet capture tools like Wireshark or tcpdump. Those are indispensable for debugging protocol errors. However, they fail when you need to monitor high-traffic environments for long periods.
Packet Capture (Deep Packet Inspection)
Think of packet capture as opening every single envelope in a post office to read the letters inside. It gives you 100% visibility, including the payload. However, if 100,000 envelopes arrive per second, you’ll need an army to read them. You will also run out of space to store them almost immediately. For a 10Gbps link at full tilt, you’d be writing roughly 1.25 GB of data to disk every second. That is about 4.5 TB of data per hour just for one interface.
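If you want to check that arithmetic yourself, a throwaway awk calculation does it — no capture tooling involved, just the numbers:

```shell
# Sanity-check the storage math for full packet capture on a saturated 10G link
awk 'BEGIN {
  gbps          = 10
  bytes_per_sec = gbps * 1e9 / 8            # 10 Gbit/s = 1.25e9 bytes/s
  tb_per_hour   = bytes_per_sec * 3600 / 1e12
  printf "%.2f GB/s, %.1f TB/hour\n", bytes_per_sec / 1e9, tb_per_hour
}'
```

And that is before you account for pcap framing overhead or the IOPS needed to write it.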
Flow Analysis (Metadata Inspection)
Flow analysis is like looking at the outside of the envelope. You record the sender’s address, the recipient’s address, the timestamp, and the weight. You don’t know what the letter says. You do know exactly who is talking to whom and how much data they are exchanging. This is what NetFlow and sFlow do. They aggregate packets into “flows” based on a 5-tuple: Source IP, Destination IP, Source Port, Destination Port, and Protocol.
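To make the 5-tuple idea concrete, here is a toy sketch that aggregates fake “packet” records into flows with awk. The input format and addresses are invented for illustration; real exporters do this in the kernel or on the router ASIC:

```shell
# Toy flow aggregation: group packets by the 5-tuple, sum packets and bytes.
# Columns (made up for this example): src_ip dst_ip src_port dst_port proto bytes
printf '%s\n' \
  '10.0.0.5 93.184.216.34 44123 443 tcp 1500' \
  '10.0.0.5 93.184.216.34 44123 443 tcp 900' \
  '10.0.0.7 8.8.8.8 53011 53 udp 64' |
awk '{
  key = $1 FS $2 FS $3 FS $4 FS $5        # the 5-tuple
  pkts[key]++; bytes[key] += $6
}
END { for (k in pkts) printf "%s -> %d pkts, %d bytes\n", k, pkts[k], bytes[k] }'
```

Two of the three packets share a 5-tuple, so they collapse into a single flow record — that collapse is the entire efficiency win of flow analysis.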
Mastering this skill is essential. It allows you to maintain a stable, high-performance infrastructure without breaking the bank on monitoring hardware.
Pros and Cons: NetFlow, sFlow, and IPFIX
Choosing the right protocol depends heavily on your hardware. I’ve worked with all three in production. Here is how they stack up.
- NetFlow (v5/v9): Originally a Cisco proprietary protocol, it’s now widely supported. NetFlow is stateful. The router keeps a cache of active flows and exports them when they expire.
- Pros: Extremely detailed and excellent for security auditing.
- Cons: High resource usage. On a busy router, a flow table with 100,000 active entries can easily chew up 200MB+ of RAM.
- sFlow (Sampled Flow): A multi-vendor standard. Unlike NetFlow, sFlow is stateless. It takes 1 packet out of every N—for example, 1 out of 2,048—and sends the header to the collector.
- Pros: Near-zero overhead on switches. It is perfect for 40G or 100G backbone links.
- Cons: It is a statistical approximation. You might miss a single-packet ‘ping of death’ or very small, short-lived flows.
- IPFIX (IP Flow Information Export): The IETF standard based on NetFlow v9. It is the modern, vendor-neutral version of NetFlow.
- Pros: Highly extensible. It can export non-IP data and custom fields like HTTP URLs or TCP window sizes.
- Cons: Support varies. You won’t find it on older legacy hardware.
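Because sFlow is statistical, it helps to know how far off its numbers can be. Scaling up is just samples × N, and the relative error shrinks roughly with the square root of the sample count. The 1-in-2,048 rate below matches the example above; the sample count is made up:

```shell
# Estimate true packet counts from sFlow samples, with a rough error bound
samples=4096    # packets actually sampled for a given flow (hypothetical)
rate=2048       # 1-in-N sampling ratio configured on the switch
awk -v s="$samples" -v n="$rate" 'BEGIN {
  printf "estimated packets: %d\n", s * n           # scale up by the ratio
  printf "relative error:    ~%.1f%%\n", 100 / sqrt(s)   # ~1/sqrt(samples)
}'
```

A flow that yields thousands of samples is estimated within a couple of percent; a flow that yields one or two samples is barely more than a rumor. That is exactly why sFlow suits capacity planning on backbone links but not single-packet forensics.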
The Setup: ntopng, nProbe, and softflowd
For a robust, open-source stack, I recommend ntopng combined with nProbe or softflowd. ntopng handles the visualization, nProbe acts as the collector/proxy, and softflowd is a lightweight exporter. You don’t need an expensive Cisco switch to start. You can install a software exporter on your Linux gateways to turn them into flow-reporting nodes.
The Architecture
- Exporters: Your routers or Linux servers running softflowd. They watch the traffic and send flow records to the collector.
- Collector/Analyzer: ntopng. It receives these records, stores them in a database like ClickHouse, and serves the web interface.
Implementation: Setting Up ntopng on Ubuntu
Let’s get this running on an Ubuntu system. We will use ntopng for the dashboard and softflowd as our exporter.
1. Install ntopng
Add the official ntop repository to get the latest stable version. Avoid the versions in the default Ubuntu repos. They are often three years out of date and lack modern features.
# Add the ntop repository
wget http://apt.ntop.org/$(lsb_release -sc)/all/apt-ntop.deb
sudo dpkg -i apt-ntop.deb
# Update and install
sudo apt update
sudo apt install ntopng nprobe
2. Configure ntopng
Edit the configuration file. We want ntopng to act as a collector rather than just sniffing a local interface.
sudo nano /etc/ntopng/ntopng.conf
Add these lines to collect data on port 5555 via ZeroMQ:
-i=view:all
-i=tcp://127.0.0.1:5555
-w=3000
-d=/var/lib/ntopng
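While you are in the file, it is also worth telling ntopng which address ranges count as “local” via its -m (--local-networks) option, so the dashboard can separate internal hosts from remote ones. The prefixes below are placeholders; substitute your own:

```
# Treat these prefixes as local hosts (example ranges, adjust to your network)
-m=10.0.0.0/8,192.168.0.0/16
```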
3. Setting up softflowd
If your router doesn’t support NetFlow, use softflowd on the server itself. This tool turns your Linux box into a NetFlow-capable device.
sudo apt install softflowd
# Export eth0 traffic to our collector on port 2055
sudo softflowd -i eth0 -n 127.0.0.1:2055 -v 9
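Once softflowd is running, you can sanity-check it with its companion control tool, softflowctl, which ships in the same package. This queries the running daemon over its control socket, so it only produces output on a box where softflowd is actually active:

```
# Ask the running softflowd daemon for its flow-tracking statistics
sudo softflowctl statistics
```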
4. Bridging with nProbe
ntopng typically uses nProbe as a bridge. nProbe collects raw NetFlow on port 2055 and forwards it to ntopng over ZeroMQ on port 5555, in a format ntopng understands natively.
# Collect NetFlow on 2055 and forward to ntopng on 5555
sudo nprobe --zmq "tcp://*:5555" -i none -n none --collector-port 2055
Real-World Troubleshooting
Once you log into the dashboard at http://your-server-ip:3000, you’ll see a wealth of data. Here is how I use it when things go wrong:
- Finding ‘Top Talkers’: Sort the Hosts section by throughput. I once found a misconfigured backup script trying to sync 1.5 TB of data during peak business hours this way. It took 30 seconds to find.
- Detecting DDoS: Check the Flows tab. If you see 50,000 flows from unique external IPs to one internal IP—all with low byte counts—you are facing a SYN flood.
- App Visibility: ntopng uses nDPI to guess the application. It can tell you if that encrypted traffic is an AWS S3 upload or just someone watching Netflix in the office.
Keep an eye on the ‘Flow Expiration’ settings. If flows expire too quickly, your graphs look choppy. If they stay active too long, your ‘real-time’ data feels lagged. Set active flows to expire every 60 seconds for the best balance.
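If your exporter is softflowd, the matching knob lives in its -t timeout options. The line below is a sketch of the 60-second guidance; consult softflowd(8) for the full list of timeout names, and note that maxlife is my assumption of the right one for hard-expiring long-lived flows:

```
# Hard-expire any flow after 60s so the collector sees fresh records
sudo softflowd -i eth0 -n 127.0.0.1:2055 -v 9 -t maxlife=60
```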
Moving to flow-based monitoring gives you the high-level visibility needed for modern networks. While packet capture explains the ‘how’, NetFlow tells you the ‘who, what, and where’. Adding both to your toolkit ensures you aren’t flying blind when the next spike hits.