The 10Gbps Link That Only Hits 2Gbps
Linux defaults aim for compatibility, not raw performance. These settings work fine for a personal laptop or a basic web server.
However, they quickly become a straitjacket when your application scales to handle 10,000+ concurrent connections. I’ve seen top-tier hardware plateau far below its physical limits simply because the kernel was playing it too safe. After applying these tweaks in a high-traffic production environment, we saw a 40% reduction in p99 latency and finally saturated our 10Gbps NICs.
TCP handles the heavy lifting of data reliability. It manages flow control and congestion based on kernel-defined parameters. If these limits are too conservative, your server wastes cycles waiting for acknowledgments. It chokes on tiny buffers instead of moving data. The goal is to open the pipe without exhausting your system memory.
Do the Math: Bandwidth Delay Product (BDP)
Don’t change numbers blindly. You need to understand the Bandwidth Delay Product (BDP). This represents the total data “in flight”—packets sent but not yet acknowledged.
BDP = Bandwidth (bits/sec) * Round Trip Time (seconds)
Here is a real-world example. With a 10Gbps link and 50ms of round-trip latency to your database, the BDP is roughly 62.5 megabytes. Throughput is capped at window / RTT, so a TCP buffer stuck at a 4MB default tops out at 4MB / 50ms = 640Mbps. You are paying for a 10Gbps link, using about 6% of it, and leaving the sender idle roughly 94% of the time. We must increase buffer sizes to keep that pipe full.
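You can sanity-check that arithmetic straight from the shell (assuming bc is installed):
# BDP in bytes: 10Gbps link times 0.05s RTT, divided by 8 bits per byte
echo '10 * 10^9 * 0.05 / 8' | bc
# 62500000, i.e. about 62.5MB must be in flight to keep the link busy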
Installation and Tools
Tuning the stack is straightforward. You don’t need heavy third-party software. Everything happens through the sysctl interface and iproute2. Most modern distros like Ubuntu 22.04 or RHEL 9 include these by default.
Start by verifying your current tools. If you’re on a bare-bones image, grab them here:
# For Debian/Ubuntu
sudo apt update && sudo apt install iproute2 procps -y
# For RHEL/CentOS/Fedora
sudo dnf install iproute procps-ng -y
You will use sysctl to tweak the kernel in real-time. Use ss (socket statistics) to monitor the actual impact on your connections.
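A quick sanity check that both tools respond:
# Print the sysctl version and a summary of current socket usage
sysctl --version
ss -s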
Step-by-Step Configuration
For testing, use sysctl -w to apply changes instantly. Once you verify the performance jump, we will make them permanent.
1. Aggressive TCP Buffer Sizes
Linux enforces two layers of limits: the core socket buffer caps (net.core.*) and the TCP autotuning ranges (net.ipv4.tcp_*). We need to raise both. For 1Gbps+ networks, the old 4MB defaults won’t cut it.
# Check current limits first
sysctl net.core.rmem_max net.ipv4.tcp_rmem
Apply these settings for high-bandwidth environments:
# Set max receive/send buffer size to 16MB
sudo sysctl -w net.core.rmem_max=16777216
sudo sysctl -w net.core.wmem_max=16777216
# Set TCP autotuning: min (4KB), default (87KB), max (16MB)
sudo sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sudo sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"
The three tcp_rmem values are the minimum, default, and maximum. The kernel starts each socket at the default and autotunes up to the maximum only when the connection demands it.
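Re-run the read commands to confirm the kernel accepted the new values:
# Verify the new limits took effect
sysctl net.core.rmem_max net.core.wmem_max
sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem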
2. Switch to Google’s BBR
Most kernels, old and new, default to CUBIC. It’s reliable, but it treats every packet loss as congestion and backs off hard on long-distance links. Google’s BBR (Bottleneck Bandwidth and RTT) is a game-changer: it models the link’s actual bandwidth and round-trip time instead of overreacting to random loss.
BBR depends on packet pacing, which the Fair Queuing (fq) discipline provides, so enable them together:
# Enable BBR
sudo sysctl -w net.core.default_qdisc=fq
sudo sysctl -w net.ipv4.tcp_congestion_control=bbr
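If the second command fails, confirm your kernel actually ships BBR (mainline since Linux 4.9) and that the module is loaded:
# Load the BBR module and list the congestion algorithms the kernel offers
sudo modprobe tcp_bbr
sysctl net.ipv4.tcp_available_congestion_control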
In our cross-region database replications, BBR maintained 90% throughput even when the link experienced 1-2% packet loss. Cubic would have collapsed to 10%.
3. Harden Socket Options
High-traffic APIs often suffer from “ephemeral port exhaustion.” This happens when thousands of sockets sit in a TIME_WAIT state. Let’s fix that and open up the limits.
# Reuse sockets in TIME_WAIT for new outbound connections
sudo sysctl -w net.ipv4.tcp_tw_reuse=1
# Support up to 2 million open file descriptors
sudo sysctl -w fs.file-max=2097152
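# Note: fs.file-max is the system-wide ceiling; per-process limits
# still come from ulimit -n (or LimitNOFILE in systemd units)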
# Expand the available port range
sudo sysctl -w net.ipv4.ip_local_port_range="1024 65535"
Also, enable TCP Fast Open. It allows data transfer to start during the initial handshake, saving one full round-trip for returning visitors:
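# 3 = enable Fast Open for both outgoing (1) and incoming (2) connections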
sudo sysctl -w net.ipv4.tcp_fastopen=3
4. Save Your Work
Don’t lose your gains on reboot. Add these to /etc/sysctl.d/99-performance.conf:
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.core.default_qdisc = fq
net.ipv4.tcp_congestion_control = bbr
net.ipv4.tcp_tw_reuse = 1
net.ipv4.ip_local_port_range = 1024 65535
net.ipv4.tcp_fastopen = 3
Apply with sudo sysctl --system. Note that plain sysctl -p only reads /etc/sysctl.conf; the --system flag loads every file under /etc/sysctl.d/ as well.
Verification
How do you know it’s working? Use ss -ti to inspect active connections.
# Check if BBR is active on a specific connection
ss -ti
Look for bbr in the output and check the wscale values. Larger window-scaling factors mean your connections can actually advertise the bigger windows the new buffers allow. Monitor your retransmissions with netstat -s. If retransmissions drop while throughput climbs, you’ve won.
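A quick way to compare those counters before and after a load test:
# Snapshot TCP retransmission stats; run again after the test and diff
# (netstat is part of net-tools; nstat from iproute2 works as well)
netstat -s | grep -i retrans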
One warning: RAM isn’t free. If 100,000 connections each balloon to a 16MB buffer, that is roughly 1.6TB in the worst case. Ensure your server has enough headroom to handle your peak connection count multiplied by these new limits.
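The kernel also has a global guardrail worth checking against your RAM budget:
# Global TCP memory thresholds: low, pressure, max (counted in pages, usually 4KB)
sysctl net.ipv4.tcp_mem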
Final Thoughts
Stack tuning is an iterative process. Start with these 16MB limits and monitor your p99 latency. By ditching conservative defaults, you stop leaving hardware performance on the table. Your Linux server is finally ready for the pressure of production traffic.

