Simulating Poor Network Conditions with tc netem on Linux: Test Latency, Packet Loss, and Jitter Before Production

The Bug That Only Appeared in Production

Six months ago, my team pushed a microservices update through staging without a single failure. Load tests passed. Integration tests passed. Then we deployed to production — and within 20 minutes, users in Southeast Asia were seeing timeouts and dropped API calls.

The culprit? Our application had never been tested under realistic network conditions. Staging lived on a gigabit LAN. Production traffic crossed oceans. The RTT between our Singapore nodes and the Tokyo backend ran consistently around 80ms, with occasional spikes above 200ms and a 1–2% packet loss rate on one upstream link.

That incident forced a hard question: how do you reproduce bad network conditions locally, reliably, before shipping code? The answer I kept landing on was tc netem — the Linux Traffic Control network emulator. I’ve been running it in pre-deployment pipelines ever since. Our post-deploy timeout rate dropped by over 80% within two sprints.

What tc netem Actually Is

tc is the Linux command for manipulating traffic control settings. It ships as part of the iproute2 package, so it’s already present on virtually every modern Linux server. netem (Network Emulator) is a queuing discipline (qdisc) you attach to a network interface to introduce artificial impairments.

Picture it as a programmable degradation layer sitting between your application and the network stack. You tell the kernel: “When packets leave this interface, randomly drop 2% of them, add 80ms of delay with 20ms of jitter, and occasionally reorder them.” The kernel does exactly that — transparently, with no changes needed in your application.

Tools like iperf3 or mtr measure existing network conditions. tc netem creates those conditions on demand. That’s the key distinction.

The Four Impairments Worth Knowing

  • Delay — adds fixed or variable latency to outgoing packets
  • Packet loss — randomly drops a percentage of packets
  • Jitter — introduces variation in delay (the enemy of real-time applications)
  • Packet reordering and duplication — simulates unreliable links

Setting Up Your First Netem Rule

Before anything, verify tc is available:

tc -V

Most distros ship it. If not: apt install iproute2 or yum install iproute.

Basic syntax to add a netem qdisc to an interface:

tc qdisc add dev eth0 root netem delay 100ms

All outgoing traffic on eth0 now gets an artificial 100ms one-way delay. To confirm it’s applied:

tc qdisc show dev eth0

You’ll see output like:

qdisc netem 8001: root refcnt 2 limit 1000 delay 100ms

To remove it when you’re done:

tc qdisc del dev eth0 root

Always clean up after testing. A forgotten netem rule has caused more than one confused debugging session on my team.
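
One habit that helps: wrap test runs so cleanup happens even if the run is interrupted. Here is a minimal sketch using a bash trap (the script name and the eth0 interface are assumptions to adapt):

#!/bin/bash
# run-under-netem.sh: apply an impairment, run a command, always clean up

IFACE="eth0"

cleanup() {
  # Remove the netem rule; ignore errors if it was already removed
  tc qdisc del dev "$IFACE" root 2>/dev/null
}
trap cleanup EXIT INT TERM

tc qdisc add dev "$IFACE" root netem delay 100ms

# Run whatever command was passed in, e.g. ./run-under-netem.sh ./integration-tests.sh
"$@"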

Hands-On: Realistic Test Scenarios

Scenario 1: Simulating Cross-Region Latency (80ms + Jitter)

This closely matches the Singapore-to-Tokyo path that caught us off guard:

tc qdisc add dev eth0 root netem delay 80ms 20ms distribution normal

The second value, 20ms, is the jitter. Adding distribution normal makes the variation follow a bell curve (with the 20ms acting roughly as the standard deviation) instead of being spread uniformly across the range, which is closer to how real networks actually behave. Most packets land around 80ms; occasional spikes reach 100ms or higher.

Verify with a quick ping to your target host:

ping -c 20 your-backend-host.example.com

Scenario 2: Packet Loss

Even 1% packet loss can destroy application performance if your retry logic isn’t solid. Start conservative:

tc qdisc add dev eth0 root netem loss 1%

For correlated loss — where dropped packets cluster together, like during a real link degradation event:

tc qdisc add dev eth0 root netem loss 2% 25%

That 25% is the correlation coefficient. If one packet is dropped, there’s a 25% chance the next one drops too. Correlated loss burns through retry budgets far faster than random loss. It’s what actually happens during upstream failures.

Scenario 3: The Full Degraded Link

To simulate mobile users on congested 4G, or a VPN tunnel through a saturated ISP, combine several impairments at once:

tc qdisc add dev eth0 root netem \
  delay 150ms 50ms distribution normal \
  loss 3% \
  duplicate 0.5% \
  reorder 5% 50%

Combined: 150ms average latency with high jitter, 3% packet loss, 0.5% duplicate packets, and 5% reordering. Run your application against this for ten minutes and broken timeout values will announce themselves.
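
A quick way to watch the effect without touching application code is to hammer an endpoint in a loop with a hard client-side timeout and count the failures. A rough sketch with curl (the health URL and the 2-second budget are placeholders):

# Probe an endpoint 100 times with a 2-second timeout and count failures
FAILS=0
for i in $(seq 1 100); do
  curl -s -o /dev/null --max-time 2 https://your-backend.example.com/health || FAILS=$((FAILS+1))
done
echo "Failed requests: $FAILS/100"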

I applied this to the environment mimicking our Singapore-Tokyo path. Within 10 minutes of integration testing, we found three separate gRPC clients with zero retry policy configured. Those had been silent production failures for weeks.

Scenario 4: Bandwidth Limiting with netem + tbf

Netem handles delay and loss, but bandwidth caps require chaining with a second qdisc — tbf (Token Bucket Filter):

# First, add netem as root qdisc
tc qdisc add dev eth0 root handle 1: netem delay 80ms loss 1%

# Then add tbf as a child to cap at ~1Mbit/s (simulate mobile)
tc qdisc add dev eth0 parent 1:1 handle 10: tbf rate 1mbit burst 32kbit latency 400ms

Chaining these two qdiscs gives you latency and loss with a hard bandwidth cap — a solid approximation of a mobile user with a weak signal.
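
Before trusting the setup, confirm both qdiscs are in the chain and actually passing traffic; the statistics view shows per-qdisc packet and drop counters:

# netem should appear as handle 1:, with tbf 10: attached beneath it
tc -s qdisc show dev eth0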

Modifying Rules Without Removing Them

During testing you often want to tweak parameters without tearing everything down. Use change:

# Bump loss to 5% while keeping the existing delay rule
tc qdisc change dev eth0 root netem delay 80ms 20ms loss 5%

Building a Test Script for Pre-Deployment Checks

After a few weeks of ad-hoc netem usage, I wrapped the common scenarios into a small shell script that lives in our repo and runs as part of the pre-deployment checklist:

#!/bin/bash
# network-chaos.sh — Apply/remove netem impairments for pre-deploy testing
# Usage: ./network-chaos.sh [apply|remove] [scenario]

IFACE="eth0"

apply_scenario() {
  case "$1" in
    regional)
      echo "Applying: 80ms delay + 20ms jitter + 1% loss"
      tc qdisc add dev $IFACE root netem delay 80ms 20ms distribution normal loss 1%
      ;;
    degraded)
      echo "Applying: 150ms delay + 50ms jitter + 3% loss + reorder"
      tc qdisc add dev $IFACE root netem \
        delay 150ms 50ms distribution normal \
        loss 3% reorder 5% 50%
      ;;
    mobile)
      tc qdisc add dev $IFACE root handle 1: netem delay 120ms 40ms loss 2%
      tc qdisc add dev $IFACE parent 1:1 handle 10: tbf rate 2mbit burst 32kbit latency 400ms
      echo "Applying: mobile simulation (2Mbit, 120ms, 2% loss)"
      ;;
    *)
      echo "Unknown scenario: $1"
      exit 1
      ;;
  esac
}

remove_all() {
  tc qdisc del dev $IFACE root 2>/dev/null && echo "Netem rules removed" || echo "Nothing to remove"
}

case "$1" in
  apply) apply_scenario "$2" ;;
  remove) remove_all ;;
  *) echo "Usage: $0 [apply|remove] [regional|degraded|mobile]" ;;
esac

Root or CAP_NET_ADMIN capability is required. In Docker, pass --cap-add NET_ADMIN to the container running the test environment.
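
For example, a throwaway container with the capability granted (the image name is just a placeholder):

# Grant NET_ADMIN so tc can modify qdiscs inside the container
docker run --rm -it --cap-add NET_ADMIN your-test-image:latest bash

# Inside the container, the usual commands apply to its own eth0
tc qdisc add dev eth0 root netem delay 80ms 20ms loss 1%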

Validating Your Setup

After applying rules, confirm with a sanity check before trusting your results:

# Show current qdisc
tc qdisc show dev eth0

# Measure actual latency with ping statistics
ping -c 50 -q 8.8.8.8

# Or use hping3 for TCP-level stats (more relevant for app testing)
hping3 -c 100 -S -p 80 your-backend.example.com

What This Catches in Practice

Six months of running every deployment candidate through at least the regional scenario has produced a consistent list of pre-production catches:

  • HTTP clients with no timeout configured — they’d hang indefinitely under packet loss
  • Database connection pools that didn’t handle reconnects after a jitter spike triggered a TCP RST
  • WebSocket clients that silently stopped receiving messages when reordered packets hit an edge case in the message reassembly logic
  • Queue consumers that starved under high latency because the prefetch window was too small

Zero of these surfaced in LAN-based testing. All of them appeared within minutes under the degraded network scenario.

A Few Practical Notes

Netem only affects outgoing traffic on the interface. Impairing both directions requires rules on both sides of the connection, or a dedicated network namespace or VM acting as an intermediary.
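
If you only control one end of the connection, a common workaround is to redirect incoming traffic through an ifb (Intermediate Functional Block) device and attach netem there. A sketch, assuming the ifb kernel module is available:

# Create an ifb device and bring it up
modprobe ifb numifbs=1
ip link set dev ifb0 up

# Redirect all ingress traffic on eth0 to ifb0
tc qdisc add dev eth0 handle ffff: ingress
tc filter add dev eth0 parent ffff: protocol ip u32 match u32 0 0 \
  action mirred egress redirect dev ifb0

# Impair the redirected (incoming) traffic
tc qdisc add dev ifb0 root netem delay 80ms 20ms loss 1%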

For container-based setups, run netem inside the container’s network namespace, or use a sidecar container as a network proxy with impairments applied to its interface.
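
If you'd rather not bake tc into the container image, you can enter its network namespace from the host. A sketch assuming a Docker container named app:

# Find the container's PID and run tc inside its network namespace
PID=$(docker inspect -f '{{.State.Pid}}' app)
nsenter --target "$PID" --net tc qdisc add dev eth0 root netem delay 80ms 20ms loss 1%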

One more thing: netem rules survive process restarts but not reboots. If your test VM cycles mid-session, the rules are gone. Keep the apply/remove script somewhere obvious.

Making This Part of Your Standard Process

The real shift wasn’t the tooling — it was treating degraded network conditions as a normal part of the testing matrix, not an edge case. Every non-trivial network-facing change in our codebase now goes through at least the regional scenario before it gets a deploy approval.

First-time setup takes about five minutes. The bugs it surfaces would take hours to diagnose in production — if you can even reproduce them at all. An hour of pre-deploy testing beats a 2 AM war room every time.

Start with a simple 80ms delay rule, run your application, watch the logs, and see what breaks. You’ll find something — and you’ll be glad you found it now.
