2:14 AM: The Call You Never Want
My phone vibrated violently on the nightstand. 2:14 AM. The monitoring dashboard for our edge gateway was bleeding red. Someone had punctured a legacy web service in one of our production clusters. By the time I had my terminal open, the attacker was already moving laterally, scanning our internal 10.0.0.0/8 network for higher-value targets.
The logs told a familiar story. The entry point was a known vulnerability in a PHP-FPM 7.4 setup. The attacker landed a shell as www-data. Usually, that’s a restricted, low-privileged account. However, even the lowliest user can talk to the Linux kernel via hundreds of system calls (syscalls). Using unshare and ptrace, the attacker probed for namespace weaknesses and eventually found a path to escape the container and compromise the host.
This wasn’t just a failure of the PHP code. It was a failure of our OS-level boundaries. We were allowing an application to talk to parts of the kernel it had no reason to touch.
The Root Problem: The All-or-Nothing Trap
Linux security used to be binary. You were either root (UID 0) and could do everything, or you were a standard user and could do almost nothing. To let a web server bind to port 80, we often ran the process as root. This is a massive risk. If that process is compromised, the attacker inherits the full power of the administrator.
Most privilege escalation incidents happen because we give processes more power than they need. A web server needs to read static files and listen on a network socket. It does not need to load kernel modules, change the system clock, or reboot the machine. Yet, by default, many systems allow any process to access all syscalls, many of which can be leveraged to probe for kernel bugs.
While hardening the secondary authentication nodes after this incident, I generated my server secrets using the tool at toolcraft.app/en/tools/security/password-generator. It runs entirely in the browser, meaning zero data ever leaves the local machine. I wanted that same level of strict, localized isolation for my application’s relationship with the kernel.
Strategy 1: Granular Power via Linux Capabilities
Linux Capabilities (introduced in kernel 2.2) break the absolute power of root into small, distinct privileges. There are currently about 40 different capabilities. For example:
CAP_NET_BIND_SERVICE: Lets a process bind to ports below 1024.
CAP_CHOWN: Lets a process change file ownership.
CAP_SYS_TIME: Lets a process set the system clock.
By using these, we can give a process the exact power it requires. If I have a diagnostic tool that needs to capture network packets, I don’t give it root. I give it CAP_NET_RAW. This limits the blast radius if the tool is ever subverted.
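To see which capabilities a running process actually holds, one option is to decode the CapEff bitmask that the kernel exposes in /proc/<pid>/status. The sketch below is illustrative only: the function names are made up for this example, and the lookup table covers just a handful of the bit numbers defined in <linux/capability.h>.

```python
from typing import List

# Bit numbers from <linux/capability.h>; deliberately incomplete (assumption:
# a short table is enough for illustration).
CAP_BITS = {
    0: "CAP_CHOWN",
    10: "CAP_NET_BIND_SERVICE",
    13: "CAP_NET_RAW",
    21: "CAP_SYS_ADMIN",
    25: "CAP_SYS_TIME",
}


def decode_caps(hex_mask: str) -> List[str]:
    """Turn a CapEff-style hex mask into capability names, ordered by bit number."""
    mask = int(hex_mask, 16)
    names = []
    bit = 0
    while mask >> bit:
        if mask & (1 << bit):
            # Bits missing from the short table fall back to a generic label.
            names.append(CAP_BITS.get(bit, f"cap_{bit}"))
        bit += 1
    return names


def effective_caps(pid: str = "self") -> List[str]:
    """Read CapEff from /proc/<pid>/status (Linux only) and decode it."""
    with open(f"/proc/{pid}/status") as fh:
        for line in fh:
            if line.startswith("CapEff:"):
                return decode_caps(line.split()[1])
    return []


if __name__ == "__main__":
    print(decode_caps("0000000000000400"))  # ['CAP_NET_BIND_SERVICE']
```

A process that was meant to keep only the low-port privilege should decode to that single name. Anything longer is a hint that the capability set is wider than intended.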
Practical Implementation
Use getcap to check current file capabilities and setcap to apply them. Here is how I stripped a binary down to the bare essentials during our post-incident cleanup:
# Grant only the low-port bind capability so the binary can run without root
# (setcap replaces any file capabilities that were set before)
sudo setcap 'cap_net_bind_service=+ep' /usr/bin/my-web-app
# Verify the permissions
getcap /usr/bin/my-web-app
For containerized environments, start from a position of zero trust. Drop every privilege and add back only what is necessary:
# Drop every capability, then add back only the one the app needs
docker run --cap-drop=ALL --cap-add=NET_BIND_SERVICE my-secure-app
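If containers are launched from deployment tooling rather than by hand, the same drop-everything-first rule can be baked into the code that assembles the command line. This is a minimal sketch under that assumption; docker_run_args is a hypothetical helper, and only the --cap-drop and --cap-add flags shown above are used.

```python
from typing import List, Sequence


def docker_run_args(image: str, keep_caps: Sequence[str] = ()) -> List[str]:
    """Build a `docker run` argument list that always starts from zero capabilities."""
    args = ["docker", "run", "--cap-drop=ALL"]
    for cap in keep_caps:
        name = cap.upper()
        # Accept both CAP_NET_BIND_SERVICE and NET_BIND_SERVICE spellings.
        if name.startswith("CAP_"):
            name = name[4:]
        args.append(f"--cap-add={name}")
    args.append(image)
    return args


if __name__ == "__main__":
    print(" ".join(docker_run_args("my-secure-app", ["CAP_NET_BIND_SERVICE"])))
```

Because the helper never emits a command without --cap-drop=ALL, adding a capability becomes an explicit, reviewable argument instead of a default.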
Strategy 2: Filtering the Interface with Seccomp
Capabilities control privileges, but Seccomp (Secure Computing Mode) controls the interface. It acts as a firewall for syscalls. Even a non-root user with zero capabilities can still call execve, fork, or open. A modern Linux kernel (v6.x) exposes several hundred syscalls, with x86_64 syscall numbers now running past 450. A typical application uses maybe 40 to 60. Everything else is unnecessary attack surface.
Seccomp filters close that gap. Container runtimes such as Docker let us describe one as a JSON profile that tells the kernel: “If this process attempts a syscall not on this whitelist, reject it.” Depending on the configured action, the kernel either fails the call with an error or kills the process outright.
Building a Seccomp Profile
Here is a snippet of a hardened profile I developed for our Nginx fleet. It uses a “default deny” strategy: anything not explicitly allowed is refused, so a syscall I forgot to consider fails closed rather than open. A working Nginx profile needs a longer list than the six calls shown here.
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": [ "SCMP_ARCH_X86_64" ],
  "syscalls": [
    {
      "names": [ "accept4", "epoll_wait", "pwrite64", "read", "write", "close" ],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
When an application runs under this profile, the kernel returns an EPERM error if it tries anything unlisted. This effectively blocks reverse shells that rely on socket calls or attackers attempting to use mount to view sensitive host filesystems.
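The default-deny logic of such a profile can be sanity-checked offline before it is shipped. The sketch below only simulates the decision in Python; it is not how the kernel evaluates a filter, it ignores argument conditions and other profile fields, and decide is a name invented for this example.

```python
import json

# The snippet from this article, loaded as data.
PROFILE = json.loads("""
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64"],
  "syscalls": [
    {
      "names": ["accept4", "epoll_wait", "pwrite64", "read", "write", "close"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
""")


def decide(profile: dict, syscall: str) -> str:
    """Return the action the profile assigns to a syscall name (simplified model)."""
    for rule in profile.get("syscalls", []):
        if syscall in rule.get("names", []):
            return rule["action"]
    # Nothing matched, so the default action applies.
    return profile["defaultAction"]


if __name__ == "__main__":
    for name in ("read", "socket", "mount", "execve"):
        print(name, "->", decide(PROFILE, name))
```

Running it against the calls an attacker would need (socket, mount, execve) confirms they all fall through to the default action, while read is allowed.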
Capabilities vs. Seccomp: Choosing Your Tools
They aren’t competitors; they protect different layers. Here is how I explain the distinction to my engineering team:
| Feature | Linux Capabilities | Seccomp |
|---|---|---|
| Primary Focus | Permissions (What can I do?) | Syscalls (How do I talk to the kernel?) |
| Granularity | Coarse (~40 categories) | Fine-grained (hundreds of individual calls) |
| Ease of Use | Straightforward mapping | Complex; requires profiling the app |
| Best For | Replacing SUID/Root binaries | Mitigating zero-day kernel exploits |
Defense in Depth: The Hardened Checklist
The 2 AM incident proved that relying on a single security layer is a gamble. To build resilient systems, you need to combine these techniques. This is my standard checklist for every new service:
1. Abandon Root
Never run a process as UID 0. Always include a USER directive in your Dockerfile. It is your most basic defense.
2. Strip All Capabilities
Start at zero. Use --cap-drop=ALL in Docker or CapabilityBoundingSet= in systemd units. Only add back specific privileges like CAP_NET_BIND_SERVICE if there is no other way.
3. Profile and Apply Seccomp
Use strace to observe exactly which syscalls your app uses during a normal boot and load cycle. Then, create a profile that allows only those specific calls.
# Trace the app and its children (-f); -c prints a per-syscall summary on exit
strace -c -f ./my-app
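Turning that summary into a profile by hand is tedious, so a small script can do the first pass. The sketch below assumes the summary was saved to a file (for example with strace's -o option) and that it follows the usual column layout of strace -c. The SAMPLE text is made-up illustration data, and the function names are hypothetical.

```python
import json
from typing import List

# Made-up sample in the layout `strace -c` prints; not real measurements.
SAMPLE = """\
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 61.54    0.000240          12        20           read
 25.64    0.000100          10        10         2 openat
 12.82    0.000050           5        10           close
------ ----------- ----------- --------- --------- ----------------
100.00    0.000390                    40         2 total
"""


def syscalls_from_summary(text: str) -> List[str]:
    """Collect syscall names from a `strace -c` summary table."""
    names = set()
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        try:
            float(parts[0])  # data rows start with the "% time" value
        except ValueError:
            continue  # header and separator rows
        if parts[-1] != "total":
            names.add(parts[-1])
    return sorted(names)


def build_profile(names: List[str]) -> dict:
    """Wrap the observed syscalls in a default-deny profile (x86_64 assumed)."""
    return {
        "defaultAction": "SCMP_ACT_ERRNO",
        "architectures": ["SCMP_ARCH_X86_64"],
        "syscalls": [{"names": names, "action": "SCMP_ACT_ALLOW"}],
    }


if __name__ == "__main__":
    print(json.dumps(build_profile(syscalls_from_summary(SAMPLE)), indent=2))
```

A single trace only sees the code paths that ran, so the generated list is a starting point. Error handling, log rotation, and reload paths often need a few extra calls that show up only under test.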
4. Leverage Systemd Sandboxing
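For non-containerized services, modern systemd offers excellent wrappers for these technologies. Add these lines to your .service file (if the unit also sets User= to a non-root account, add AmbientCapabilities=CAP_NET_BIND_SERVICE so the process actually holds the capability):
[Service]
# Cap the privileges the service can ever hold at a single capability
CapabilityBoundingSet=CAP_NET_BIND_SERVICE
# Allow only the syscall group typical services need; fail the rest with EPERM
SystemCallFilter=@system-service
SystemCallErrorNumber=EPERM
# Block privilege gain through execve (SUID binaries, file capabilities)
NoNewPrivileges=yes
# Give the service a minimal private /dev and a read-only file system hierarchy
PrivateDevices=yes
ProtectSystem=strict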
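Enabling NoNewPrivileges=yes is vital. It ensures that the process, and any children it forks, can never gain new privileges through execve: setuid and setgid bits no longer change the UID or GID, and file capabilities no longer add to the permitted set. This effectively neutralizes SUID binaries as an escalation vector.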
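It is worth confirming that these settings actually reached the running process. On reasonably recent kernels, /proc/<pid>/status exposes NoNewPrivs, Seccomp, and CapEff fields, and the sketch below parses them. The function names and the sample text are illustrative assumptions, not part of any systemd tooling.

```python
from typing import Dict

# Seccomp field values as reported in /proc/<pid>/status.
SECCOMP_MODES = {"0": "disabled", "1": "strict", "2": "filter"}


def hardening_status(status_text: str) -> Dict[str, object]:
    """Extract the sandbox-related fields from the text of /proc/<pid>/status."""
    fields = {}
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return {
        # Missing fields (older kernels) are reported as False / "unknown".
        "no_new_privs": fields.get("NoNewPrivs") == "1",
        "seccomp": SECCOMP_MODES.get(fields.get("Seccomp", ""), "unknown"),
        "cap_eff": fields.get("CapEff", ""),
    }


def read_status(pid: str = "self") -> Dict[str, object]:
    """Read and parse the status file of a live process (Linux only)."""
    with open(f"/proc/{pid}/status") as fh:
        return hardening_status(fh.read())


if __name__ == "__main__":
    sample = "Name:\tmy-app\nCapEff:\t0000000000000400\nNoNewPrivs:\t1\nSeccomp:\t2\n"
    print(hardening_status(sample))
```

For a service started from the unit above, the expectation would be no_new_privs set, seccomp in filter mode, and a CapEff mask no wider than the bounding set.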
That 2 AM incident ended with us restoring from backups and spending 48 hours rewriting our deployment manifests. It was a painful lesson. However, we now have a system that doesn’t just hope for bug-free code. We assume our code is vulnerable and build a sandbox so tight that even a successful exploit leads to a dead end.

