Stop the Linux OOM Killer from Crashing Your Production Apps

When Processes Vanish: The 2 AM PagerDuty Nightmare

Imagine you are monitoring a production environment. Everything looks stable until a mission-critical service, perhaps your primary PostgreSQL database or a high-traffic Nginx instance, simply disappears. You check the status, but the process is gone. There are no application crash logs and no warning signs in your Grafana dashboards. This is the telltale sign of the Linux Out of Memory (OOM) Killer.

Linux treats idle RAM as wasted RAM, so it fills spare memory with caches and hands out allocations generously. When memory is exhausted and nothing more can be reclaimed, the kernel must make a cold-blooded choice: kill a process to prevent the entire operating system from locking up. Unfortunately, the kernel usually targets the heaviest process, which is often the very application you need most. Mastering these memory dynamics is essential for anyone running high-availability systems.

I recently managed a fleet of Ubuntu 22.04 nodes with only 4GB of RAM. By teaching the kernel which processes were expendable, I reduced unexpected service interruptions by roughly 85% without upgrading the hardware.

The Mechanics of Overcommit

To fix the OOM Killer, we first need to understand why it triggers. By default, Linux uses a strategy called “Overcommit.” The kernel allows applications to request more memory than is physically available, betting that most processes won’t use their full allocation simultaneously. It’s like an airline overbooking a flight, assuming a few passengers won’t show up.

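But when every process tries to claim its RAM at once, the system hits a wall. At this point, the kernel's OOM handler (out_of_memory(), which scores candidates with oom_badness()) calculates an oom_score for every process. On current kernels the score comes down to two main factors:

Memory Footprint: Resident memory, swap usage, and page tables are added together, so high-memory consumers are the first targets.
Manual Adjustment: The per-process oom_score_adj value is added on top, which lets you move a process up or down the hit list.

Older kernels also preferred to kill newer processes and gave root-owned processes a slight protection bonus. The age heuristic disappeared with the OOM Killer rewrite in 2.6.36, and the root bonus was removed in later kernels (around 4.17), so neither should be relied on with a modern distribution such as Ubuntu 22.04.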
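You do not have to guess how the kernel currently ranks your processes, because each one exposes its score in /proc/[PID]/oom_score. The following is a minimal sketch that lists the likeliest victims; the helper name top_oom_candidates is only illustrative.

```shell
# Sketch: print "score<TAB>pid<TAB>name" for the processes with the highest oom_score.
# top_oom_candidates is an illustrative helper name, not a standard tool.
top_oom_candidates() {
    limit=${1:-10}
    for dir in /proc/[0-9]*; do
        pid=${dir#/proc/}
        # A process can exit between the glob and the read, so skip read errors.
        score=$(cat "$dir/oom_score" 2>/dev/null) || continue
        name=$(cat "$dir/comm" 2>/dev/null) || continue
        printf '%s\t%s\t%s\n' "$score" "$pid" "$name"
    done | sort -rn | head -n "$limit"
}

top_oom_candidates 10
```

Run it during a busy period: whatever sits at the top of this list is what the kernel would most likely sacrifice first.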

Step 1: Hunting for the Evidence

Don’t start changing configs until you have proof. The kernel records every OOM kill in the system logs. If your database crashed at 3:15 AM, that is where you should look first.

# Check the kernel buffer for recent OOM events
dmesg | grep -i oom

# Search historical logs on Debian or Ubuntu
grep -i 'killed process' /var/log/syslog

# Search logs on RHEL, CentOS, or AlmaLinux
grep -i 'killed process' /var/log/messages

Look for a line like Out of memory: Kill process 1234 (mysqld) score 500. Newer kernels word it as Out of memory: Killed process 1234 (mysqld), followed by memory figures. Either form confirms the OOM Killer was the culprit. The surrounding lines also contain a memory dump, revealing how much RAM and swap were left when the axe fell.
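If the logs contain many of these lines, a small filter makes the victims easier to tally. This is a sketch only: oom_victims is an illustrative name, and the pattern assumes the "Kill process" or "Killed process" wording (it also accepts a lowercase "out of memory", which I believe the cgroup variant uses).

```shell
# Sketch: extract "PID name" from kernel OOM messages read on stdin.
# Accepts both the older "Kill process" and the newer "Killed process" wording.
# oom_victims is an illustrative helper name.
oom_victims() {
    sed -n 's/.*[Oo]ut of memory: Kill\(ed\)\{0,1\} process \([0-9][0-9]*\) (\([^)]*\)).*/\2 \3/p'
}

# Example with the sample line from the text:
echo 'Out of memory: Kill process 1234 (mysqld) score 500' | oom_victims
# prints: 1234 mysqld
```

On a live host you could feed it real logs, for example dmesg | oom_victims | cut -d' ' -f2- | sort | uniq -c, to count how often each service was killed.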

Step 2: Shielding Critical Services

While you shouldn’t disable the OOM Killer entirely, you can influence its hit list. Every process has a score adjustment file located at /proc/[PID]/oom_score_adj. This value ranges from -1000 (never kill) to 1000 (kill this first).

Suppose you have a vital SSH daemon and a background log-processing script. You want to ensure SSH stays alive so you can actually log in to fix things. Here is how to protect a process manually:

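# Get the PIDs of your service (sshd usually runs several processes)
pidof sshd

# Exempt every sshd process from the OOM Killer (requires root)
for pid in $(pidof sshd); do
    echo -1000 | sudo tee "/proc/$pid/oom_score_adj" > /dev/null
done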
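After writing the value, it is worth reading it back to confirm it took effect. Here is a small sketch that prints the adjustment and the resulting score for any PIDs you pass in; show_oom_adj is an illustrative name.

```shell
# Sketch: print PID, oom_score_adj and oom_score for each PID given as an argument.
# show_oom_adj is an illustrative helper name.
show_oom_adj() {
    for pid in "$@"; do
        # Skip PIDs that no longer exist or cannot be read.
        adj=$(cat "/proc/$pid/oom_score_adj" 2>/dev/null) || continue
        score=$(cat "/proc/$pid/oom_score" 2>/dev/null) || continue
        printf '%s adj=%s score=%s\n' "$pid" "$adj" "$score"
    done
}

# Demonstrated on the current shell; for a real check pass $(pidof sshd) instead.
show_oom_adj "$$"
```

A protected process should report adj=-1000 and a score of 0.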

Manual changes are lost when the process restarts, so they never survive a reboot. For a permanent fix under systemd, add the adjustment directly to your service file. This is the cleanest way to manage priorities.

# Edit via: sudo systemctl edit mysqld
# Then apply with: sudo systemctl restart mysqld
[Service]
OOMScoreAdjust=-500
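On a fleet of nodes, editing each unit interactively gets tedious, so you may prefer to generate the drop-in from a script. The sketch below assumes the standard drop-in location /etc/systemd/system/UNIT.d/; the function name write_oom_dropin and the file name oom.conf are my own illustrative choices.

```shell
# Sketch: write a systemd drop-in that sets OOMScoreAdjust for one unit.
# write_oom_dropin and the file name oom.conf are illustrative choices.
# Usage: write_oom_dropin UNIT ADJ [UNIT_DIR]
write_oom_dropin() {
    unit=$1
    adj=$2
    unit_dir=${3:-/etc/systemd/system}
    # Reject anything that is not an integer between -1000 and 1000.
    if ! [ "$adj" -ge -1000 ] 2>/dev/null || ! [ "$adj" -le 1000 ] 2>/dev/null; then
        echo "invalid adjustment: $adj" >&2
        return 1
    fi
    mkdir -p "$unit_dir/$unit.d" || return 1
    printf '[Service]\nOOMScoreAdjust=%s\n' "$adj" > "$unit_dir/$unit.d/oom.conf"
}

# Dry run into a scratch directory so nothing on the host changes.
scratch=$(mktemp -d)
write_oom_dropin mysqld.service -500 "$scratch"
cat "$scratch/mysqld.service.d/oom.conf"
```

To use it for real, run it as root without the third argument, then run sudo systemctl daemon-reload and restart the unit. systemctl show -p OOMScoreAdjust mysqld reports the value systemd will apply.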

Step 3: Hardening Overcommit Policies

If your server frequently runs out of breath, consider changing how the kernel handles memory requests. You can tune this via the vm.overcommit_memory parameter. It supports three modes:

0 (Heuristic): The default mode where the kernel “guesses” if it has enough RAM.
1 (Always): The kernel always grants memory requests. This is useful for specialized scientific apps but dangerous for general servers.
2 (Strict): The kernel refuses allocations beyond a fixed commit limit, which is swap plus a percentage of RAM set by vm.overcommit_ratio (default 50).
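The strict-mode ceiling is easy to estimate before you switch: roughly swap plus RAM multiplied by vm.overcommit_ratio / 100. The sketch below does that arithmetic; commit_limit_kb is an illustrative helper, and it ignores huge pages, which the kernel excludes from the calculation.

```shell
# Sketch: approximate the mode-2 commit limit in kB.
# Formula: swap + RAM * overcommit_ratio / 100 (huge pages are ignored here).
# Usage: commit_limit_kb RAM_KB SWAP_KB [RATIO]
commit_limit_kb() {
    ram_kb=$1
    swap_kb=$2
    ratio=${3:-50}   # 50 is the kernel default for vm.overcommit_ratio
    echo $(( swap_kb + ram_kb * ratio / 100 ))
}

commit_limit_kb 4194304 0            # 4GB RAM, no swap  -> 2097152 (2GB)
commit_limit_kb 4194304 2097152      # 4GB RAM, 2GB swap -> 4194304 (4GB)
commit_limit_kb 4194304 2097152 80   # same, ratio 80    -> 5452595
```

The first line is the one to watch: a 4GB server with no swap and the default ratio can only commit about 2GB in strict mode.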

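I recommend Mode 2 for production databases where stability is the priority. It causes malloc calls to fail up front rather than letting the kernel kill processes later. Be careful with the defaults, though: with vm.overcommit_ratio at 50 and no swap, the commit limit is only half of your RAM, so raise the ratio or add swap before you switch. Use these commands to apply it: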

# Apply immediately
sudo sysctl -w vm.overcommit_memory=2

# Persist after reboot
echo "vm.overcommit_memory = 2" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
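Once strict mode is on, /proc/meminfo tells you how close you are to the ceiling: CommitLimit is the cap and Committed_AS is the memory already promised to processes. A rough sketch for turning those two fields into a percentage (commit_headroom is an illustrative name):

```shell
# Sketch: show how much of the commit limit is already promised.
# Reads meminfo-formatted text on stdin; commit_headroom is an illustrative name.
commit_headroom() {
    awk '/^CommitLimit:/  {limit = $2}
         /^Committed_AS:/ {used = $2}
         END {
             if (limit > 0)
                 printf "%.0f%% of CommitLimit committed (%s of %s kB)\n", used * 100 / limit, used, limit
         }'
}

if [ -r /proc/meminfo ]; then
    commit_headroom < /proc/meminfo
fi
```

The kernel reports both fields in every mode, but CommitLimit is only enforced when vm.overcommit_memory is 2. If the percentage hovers near 100, new allocations are about to start failing.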

Step 4: The Swap Safety Net

Sometimes the simplest fix is the most effective. If your 4GB RAM server is redlining, adding a 2GB swap file provides a vital buffer. Disk-based swap is significantly slower than RAM, but it acts as a “pressure valve” that prevents the OOM Killer from triggering during minor spikes.

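# Create and activate a 2GB swap file
# (if swapon rejects a fallocate'd file on your filesystem, use:
#  sudo dd if=/dev/zero of=/swapfile bs=1M count=2048)
sudo fallocate -l 2G /swapfile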
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

# Add to fstab for persistence
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
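One caveat with tee -a is that running the setup twice leaves duplicate entries in /etc/fstab. A small guard avoids that. The helper below is a sketch, and ensure_line is an illustrative name; it appends a line only when an identical one is not already present.

```shell
# Sketch: append LINE to FILE only if an identical line is not already there.
# ensure_line is an illustrative helper name.
ensure_line() {
    file=$1
    line=$2
    # -x matches whole lines, -F treats the text literally, -q stays silent.
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Demonstrated on a scratch file; the second call is a no-op.
scratch=$(mktemp)
ensure_line "$scratch" '/swapfile none swap sw 0 0'
ensure_line "$scratch" '/swapfile none swap sw 0 0'
cat "$scratch"
# prints: /swapfile none swap sw 0 0
```

Against the real /etc/fstab the function has to run as root. Afterwards, swapon --show or free -h confirms that the swap file is active.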

Once swap is active, check your swappiness. This value (0-100, or 0-200 on kernel 5.8 and later) tells the kernel how strongly to prefer swapping out application memory over dropping file cache. For most modern servers, a low value like 10 is a good starting point.

# Set swappiness to 10 for better performance
sudo sysctl -w vm.swappiness=10

Tuning the VFS Cache

The vm.vfs_cache_pressure setting is another hidden gem. It controls how aggressively the kernel reclaims memory used for caching directory and file metadata. The default is 100. If you lower this to 50, the kernel keeps that metadata in RAM longer. This reduces disk I/O at the cost of a little extra memory usage.

sudo sysctl -w vm.vfs_cache_pressure=50
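Keep in mind that sysctl -w only changes the running kernel, so these two values reset at the next boot. One way to persist them is a drop-in file under /etc/sysctl.d. The sketch below writes such a file; the function name write_vm_tuning and the file name 90-memory-tuning.conf are arbitrary, illustrative choices.

```shell
# Sketch: write the swappiness and cache-pressure settings to a sysctl drop-in.
# write_vm_tuning and 90-memory-tuning.conf are illustrative names.
# Usage: write_vm_tuning [CONF_DIR]
write_vm_tuning() {
    conf_dir=${1:-/etc/sysctl.d}
    cat > "$conf_dir/90-memory-tuning.conf" <<'EOF'
vm.swappiness = 10
vm.vfs_cache_pressure = 50
EOF
}

# Dry run into a scratch directory so nothing on the host changes.
scratch=$(mktemp -d)
write_vm_tuning "$scratch"
cat "$scratch/90-memory-tuning.conf"
```

Run it as root without an argument to write the real file, then load it with sudo sysctl --system.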

Final Thoughts

Managing Linux memory is about setting boundaries. By identifying OOM events in the logs, shielding critical apps with oom_score_adj, and configuring strict overcommit policies, you build a system that handles stress gracefully.

Always keep an eye on your baseline usage with htop. If your server is constantly leaning on its swap file, no amount of kernel tuning can replace the need for more physical RAM. Use these tweaks to buy time and stability, but listen when your hardware tells you it is overwhelmed.
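To tell whether a server is actively leaning on swap rather than merely holding a few idle pages there, look at the swap traffic instead of the swap size. The kernel keeps running totals in /proc/vmstat (pswpin and pswpout, in pages since boot). A rough sketch that samples them twice, with illustrative helper names:

```shell
# Sketch: report pages swapped in and out over a short interval.
# swap_counters and swap_delta are illustrative helper names.

# Print "pswpin pswpout" from a vmstat-formatted file (default: /proc/vmstat).
swap_counters() {
    awk 'BEGIN {i = 0; o = 0}
         $1 == "pswpin"  {i = $2}
         $1 == "pswpout" {o = $2}
         END {print i, o}' "${1:-/proc/vmstat}"
}

# Sample the counters twice, INTERVAL seconds apart, and print the difference.
swap_delta() {
    interval=${1:-5}
    set -- $(swap_counters)
    before_in=$1
    before_out=$2
    sleep "$interval"
    set -- $(swap_counters)
    echo "swapped in: $(( $1 - before_in )) pages, swapped out: $(( $2 - before_out )) pages in ${interval}s"
}

if [ -r /proc/vmstat ]; then
    swap_delta 1
fi
```

Numbers that stay at zero mean swap is only a parking lot. Sustained non-zero values in both directions mean the working set no longer fits in RAM. vmstat 5 shows the same activity in its si and so columns.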