Surviving 'Neighbor Table Overflow': Tuning Linux ARP for High-Density Networks

The 2 AM PagerDuty Wake-up Call

My phone went off at 2 AM. The monitoring dashboard for our core gateway was glowing red. Users couldn’t reach the application, pings were dropping 80% of packets, and SSH sessions kept hanging. Once I finally caught a stable terminal session, the kernel logs revealed the culprit. The screen was flooded with one message over and over:

[84321.123456] net_ratelimit: 542 callbacks suppressed
[84321.123457] neighbor table overflow!
[84321.123458] neighbor table overflow!

This is a classic failure in large Layer 2 networks, like the high-density virtualization clusters we run. The server had hit a hard limit it was never tuned to handle. Essentially, the Linux kernel was forgetting its neighbors because it had no more room in its address book.

How the Neighbor Table Works

Think of the Neighbor Table (the ARP Cache in IPv4) as a mapping of IP addresses to MAC addresses. To talk to any device on the local network, your server needs that device’s MAC address. It broadcasts an ARP request, receives a reply, and saves the mapping so it doesn’t have to ask again.

Standard Linux distributions like Ubuntu 22.04 or Debian come tuned for general-purpose environments. They expect a few hundred neighbors at most. However, modern setups such as flat Kubernetes networks or VLANs with 2,000+ nodes overwhelm these defaults. When the entry count reaches the kernel’s hard limit and garbage collection cannot free any space, the kernel refuses new entries. From that point on, traffic to any host that is not already in the table fails, and that is when your network connectivity dies.

Checking Your Limits

To diagnose the bottleneck, start by counting your current neighbor entries:

ip neighbor show | wc -l
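
That single number mixes IPv4 and IPv6 entries, but the kernel keeps a separate table (with separate limits) for each family. As a rough sketch, the helper below splits the count by family. The function name count_by_family and the sample addresses are my own illustration, and the parsing assumes the usual `ip neigh show` layout where the address is the first field.

```shell
#!/bin/sh
# Count neighbor entries per address family from `ip neigh show` output on stdin.
# An address containing ':' is treated as IPv6; everything else as IPv4.
count_by_family() {
    awk '
        $1 ~ /:/ { v6++; next }
        NF       { v4++ }
        END      { printf "ipv4=%d ipv6=%d\n", v4, v6 }
    '
}

# Canned sample (hypothetical addresses) so the sketch runs anywhere.
sample='192.168.1.1 dev eth0 lladdr 00:11:22:33:44:55 REACHABLE
192.168.1.50 dev eth0 FAILED
fe80::1 dev eth0 lladdr 00:11:22:33:44:66 router STALE'
printf '%s\n' "$sample" | count_by_family

# On a live host, something like:
#   ip neigh show | count_by_family
```

If you only care about one family, `ip -4 neigh show` and `ip -6 neigh show` give the same split without any parsing.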

Next, look at the three sysctl thresholds that govern the neighbor table’s garbage collector (GC):

sysctl -a | grep gc_thresh

Under net.ipv4.neigh.default you will find three values (IPv6 has its own matching set under net.ipv6.neigh.default):

gc_thresh1 (Default 128): The floor. The kernel won’t even think about cleaning the cache if you have fewer than 128 entries.
gc_thresh2 (Default 512): The soft ceiling. If the cache stays above 512 for more than 5 seconds, the garbage collector starts purging.
gc_thresh3 (Default 1024): The hard wall. If you hit 1,024, the kernel immediately tries to clear entries. If it fails, you get the “overflow” error, the new neighbor is not added, and packets to that address are dropped.

In our case, the network had spiked to 2,400 active devices. Since gc_thresh3 was stuck at the default 1024, the server was trying to squeeze a crowd into a tiny room. It was never going to work.
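
To make the three zones concrete, here is a small sketch that takes a neighbor count and the three thresholds and reports where the host stands. The function name classify_usage and its wording are my own; the boundaries simply follow the descriptions above and are a simplification of what the kernel does.

```shell
#!/bin/sh
# classify_usage COUNT THRESH1 THRESH2 THRESH3
# COUNT is the number of neighbors the host needs to track.
classify_usage() {
    count=$1; t1=$2; t2=$3; t3=$4
    if [ "$count" -lt "$t1" ]; then
        echo "below gc_thresh1: no garbage collection"
    elif [ "$count" -le "$t2" ]; then
        echo "between gc_thresh1 and gc_thresh2: normal range"
    elif [ "$count" -lt "$t3" ]; then
        echo "above gc_thresh2: garbage collector is purging"
    else
        echo "at or above gc_thresh3: overflow risk"
    fi
}

# Our outage: 2,400 devices against the defaults.
classify_usage 2400 128 512 1024
# A small office network for comparison.
classify_usage 300 128 512 1024

# On a live host, something like:
#   d=/proc/sys/net/ipv4/neigh/default
#   classify_usage "$(ip -4 neigh show | wc -l)" \
#       "$(cat $d/gc_thresh1)" "$(cat $d/gc_thresh2)" "$(cat $d/gc_thresh3)"
```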

Two Ways to Fix It

1. The “Quick Fix” (Use with Caution)

If you are in a production outage, you might consider flushing the ARP table:

sudo ip -s -s neigh flush all

The risk: This forces the server to re-resolve every neighbor at once. On a high-traffic gateway, that burst of ARP broadcasts adds load to a network that is already struggling. It is also only temporary relief: if the number of live neighbors exceeds gc_thresh3, the table simply fills up again. Use this only as a last resort.

2. The Permanent Sysctl Tuning

The better way is to tell the kernel that a crowded neighborhood is normal. For most modern data centers, I recommend raising the defaults eightfold, which puts the hard limit at 8,192 entries. Even at that size, the memory footprint is negligible. Each ARP entry takes roughly 256 bytes, so 8,192 entries use only about 2 MB of RAM. That is a tiny price for network stability.
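
The memory estimate is simple multiplication, and it is worth redoing for whatever limit you pick. A minimal sketch, assuming the rough 256 bytes per entry quoted above (the real size depends on the kernel version and build, and the function name neigh_mem_kib is just for illustration):

```shell
#!/bin/sh
# neigh_mem_kib ENTRIES [BYTES_PER_ENTRY]
# Rough memory estimate in KiB. 256 bytes per entry is an approximation,
# not a measured value for any particular kernel.
neigh_mem_kib() {
    entries=$1
    bytes=${2:-256}
    echo $(( entries * bytes / 1024 ))
}

neigh_mem_kib 8192        # 8192 * 256 / 1024 = 2048 KiB, i.e. 2 MiB
neigh_mem_kib 8192 512    # pessimistic per-entry size: 4096 KiB
```

Even with the per-entry size doubled, the table stays in the low megabytes.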

Production-Ready Configuration

Don’t modify /etc/sysctl.conf directly. Instead, create a dedicated file in /etc/sysctl.d/ for better organization.

Apply these settings to handle high-density environments. I have used this specific configuration across clusters of 5,000+ nodes with zero issues.

# Save to /etc/sysctl.d/20-neighbor-limits.conf

# IPv4 Neighbor Table Tuning
net.ipv4.neigh.default.gc_thresh1 = 1024
net.ipv4.neigh.default.gc_thresh2 = 4096
net.ipv4.neigh.default.gc_thresh3 = 8192

# IPv6 Neighbor Table Tuning
net.ipv6.neigh.default.gc_thresh1 = 1024
net.ipv6.neigh.default.gc_thresh2 = 4096
net.ipv6.neigh.default.gc_thresh3 = 8192

# Let unused stale entries linger for 1 hour before garbage collection removes them (default: 60 seconds)
net.ipv4.neigh.default.gc_stale_time = 3600
net.ipv6.neigh.default.gc_stale_time = 3600

Load the new settings immediately without rebooting:

sudo sysctl -p /etc/sysctl.d/20-neighbor-limits.conf
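
It is worth confirming that the running kernel actually picked up the values. The sketch below reads a sysctl file and compares each key against its live value under /proc/sys. The helper names key_to_path and verify_sysctl_file are my own, and the key-to-path mapping assumes no key component contains a literal dot, which holds for the "default" keys used here.

```shell
#!/bin/sh
# Map a sysctl key to its /proc/sys path (dots become slashes).
key_to_path() {
    printf '/proc/sys/%s\n' "$(printf '%s' "$1" | tr . /)"
}

# Compare each "key = value" line in a sysctl file with the live value.
verify_sysctl_file() {
    # Strip comments and blank lines, then split each line on '='.
    sed -e 's/#.*//' -e '/^[[:space:]]*$/d' "$1" |
    while IFS='=' read -r key want; do
        key=$(echo $key); want=$(echo $want)   # trim surrounding whitespace
        path=$(key_to_path "$key")
        if [ -r "$path" ]; then
            have=$(cat "$path")
            if [ "$have" = "$want" ]; then
                status="ok"
            else
                status="MISMATCH (live: $have)"
            fi
        else
            status="not present on this host"
        fi
        echo "$key = $want : $status"
    done
}

key_to_path net.ipv4.neigh.default.gc_thresh3

# On the tuned host:
#   verify_sysctl_file /etc/sysctl.d/20-neighbor-limits.conf
```

For a one-off check, `sysctl -n net.ipv4.neigh.default.gc_thresh3` prints the live value directly.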

Why these values?

Setting gc_thresh1 to 1,024 stops the kernel from wasting CPU cycles on garbage collection for small tables. The 8,192 limit on gc_thresh3 provides a large buffer for sudden bursts of new containers, VMs, or network scans without triggering an overflow.

Verifying the Results

Check your kernel logs (dmesg) after applying the change. New “overflow” messages should stop appearing. To see the table grow naturally, use a watch command:

watch -n 1 "ip neigh show | wc -l"

The count should now climb comfortably past the old 1,024 limit and level off at your actual device count. If it continues to rise indefinitely without stopping, you may be dealing with a network loop or an ARP sweep from a malicious actor.
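
When the count keeps climbing, a breakdown by entry state can hint at the cause. In my experience a pile of FAILED or INCOMPLETE entries often means something is probing addresses that do not answer, though that is a heuristic and not proof of a sweep. A minimal sketch, assuming the state is the last field of each `ip neigh show` line (the function name count_by_state and the sample data are illustrative):

```shell
#!/bin/sh
# Tally entries by neighbor state (the last field of each line on stdin).
count_by_state() {
    awk 'NF { n[$NF]++ } END { for (s in n) print n[s], s }' | sort -k2
}

# Canned sample (hypothetical addresses) so the sketch runs anywhere.
sample='10.0.0.1 dev eth0 lladdr 00:11:22:33:44:55 REACHABLE
10.0.0.7 dev eth0 FAILED
10.0.0.8 dev eth0 FAILED
10.0.0.9 dev eth0 INCOMPLETE'
printf '%s\n' "$sample" | count_by_state

# On a live host, something like:
#   ip neigh show | count_by_state
```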

Wrapping Up

Networking at scale requires moving past “out-of-the-box” settings. The ‘Neighbor Table Overflow’ is a prime example of how a small default value can cause a massive headache. If you manage servers in a modern data center or large VLAN, don’t wait for a 2 AM page. Check your neighbor counts today and tune your limits before the table hits its ceiling.