Moving Beyond iptables: Why We Switched to Cilium
Most Kubernetes clusters start with kube-proxy in iptables or IPVS mode to route service traffic. These tools work fine at small scale, but iptables in particular wasn't built for the churn of 10,000+ service endpoints. When our production cluster hit 200 nodes, we noticed significant CPU overhead: every endpoint change forced the kernel to reprogram thousands of sequential rules, and every packet had to traverse them linearly. It was inefficient and made troubleshooting a chore.
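You can gauge the rule bloat yourself on any node still running kube-proxy in iptables mode; the exact count depends on how many services and endpoints you run:

# Count the service-handling rules kube-proxy has programmed on this node
sudo iptables-save -t nat | grep -c 'KUBE-'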
Six months ago, our network latency began creeping up by 15-20ms during peak loads. At the same time, our security team demanded granular visibility into pod-to-pod communication that standard logs couldn't provide. We decided to migrate to Cilium. Unlike legacy CNIs, Cilium is built on eBPF (extended Berkeley Packet Filter), which lets it run sandboxed programs directly in the Linux kernel and bypass the slow iptables stack entirely.
The results in our production environment have been rock solid. Shifting to an eBPF-based CNI fundamentally changed how we handle security and monitoring. This guide outlines the practical steps for deploying Cilium and explains why it is a strategic choice for modern infrastructure.
Step 1: Preparing the Environment and Installation
To get the best performance, you should deploy Cilium on a cluster without an existing CNI. If you are using a managed service like EKS or GKE, you can often disable the default provider during creation or remove it afterwards.
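On EKS, for example, one common approach is to remove the AWS VPC CNI agent before installing Cilium; check your provider's documentation before doing this on a live cluster:

# Remove the default VPC CNI DaemonSet so Cilium can take over pod networking
kubectl -n kube-system delete daemonset aws-node

For initial validation, I recommend the Cilium CLI.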
First, fetch and install the binary. This tool performs pre-flight checks to ensure your nodes are compatible with eBPF.
# Resolve the latest stable CLI release and pick the right architecture
CILIUM_CLI_VERSION=$(curl -s https://raw.githubusercontent.com/cilium/cilium-cli/main/stable.txt)
CLI_ARCH=amd64
if [ "$(uname -m)" = "aarch64" ]; then CLI_ARCH=arm64; fi
# Download the tarball plus its checksum, and verify before installing
curl -L --fail --remote-name-all "https://github.com/cilium/cilium-cli/releases/download/${CILIUM_CLI_VERSION}/cilium-linux-${CLI_ARCH}.tar.gz{,.sha256sum}"
sha256sum --check cilium-linux-${CLI_ARCH}.tar.gz.sha256sum
# Unpack into /usr/local/bin and clean up
sudo tar xzvfC cilium-linux-${CLI_ARCH}.tar.gz /usr/local/bin
rm cilium-linux-${CLI_ARCH}.tar.gz{,.sha256sum}
I prefer using Helm for the actual deployment. It integrates better with GitOps pipelines and makes version pinning easier. We are using version 1.16.0 here to take advantage of the latest performance patches.
helm repo add cilium https://helm.cilium.io/
helm install cilium cilium/cilium --version 1.16.0 \
  --namespace kube-system \
  --set hubble.enabled=true \
  --set hubble.relay.enabled=true \
  --set hubble.ui.enabled=true \
  --set prometheus.enabled=true \
  --set operator.prometheus.enabled=true \
  --set kubeProxyReplacement=true
Setting kubeProxyReplacement=true is the critical step. It tells Cilium to handle ClusterIP, NodePort, and LoadBalancer services directly in eBPF, removing the need for kube-proxy. Once the pods are running, verify the health of your mesh:
cilium status --wait
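Once Cilium reports healthy, kube-proxy itself can be retired. A minimal cleanup sketch, assuming kube-proxy was deployed the standard way as a DaemonSet; on a fully kube-proxy-free cluster you should also pass --set k8sServiceHost and --set k8sServicePort to the Helm install so the agent can reach the API server without relying on the kube-proxy-managed ClusterIP:

# Remove kube-proxy and its config once Cilium handles all service traffic
kubectl -n kube-system delete daemonset kube-proxy
kubectl -n kube-system delete configmap kube-proxy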
Step 2: Implementing L7 Policies for Granular Security
Standard Kubernetes NetworkPolicies are blunt instruments. They operate at Layers 3 and 4, meaning you can only match on IPs and ports. If you want to allow a pod to access api.github.com while blocking other external traffic, standard policies fall short because GitHub's IP addresses change frequently.
Cilium provides FQDN-aware policies and Layer 7 (HTTP) filtering. In our setup, we restricted the frontend so it could only execute GET requests on a specific API path. This prevents lateral movement if a pod is compromised.
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "restrict-api-access"
  namespace: production
spec:
  endpointSelector:
    matchLabels:
      app: backend-service
  ingress:
  - fromEndpoints:
    - matchLabels:
        app: frontend-web
    toPorts:
    - ports:
      - port: "8080"
        protocol: TCP
      rules:
        http:
        - method: "GET"
          path: "/api/v1/public/.*"
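To confirm enforcement, exec into a frontend pod and try both verbs. The deployment name and path below are illustrative, and assume curl is available in the image; the allowed GET returns a normal status code, while the POST comes back as a 403 from Cilium's proxy:

# Allowed: GET on the permitted path
kubectl -n production exec deploy/frontend-web -- curl -s -o /dev/null -w "%{http_code}\n" http://backend-service:8080/api/v1/public/status
# Denied: POST to the same path is rejected with 403
kubectl -n production exec deploy/frontend-web -- curl -s -o /dev/null -w "%{http_code}\n" -X POST http://backend-service:8080/api/v1/public/status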
This level of control is vital. If an attacker tries to POST malicious data or probe an /admin endpoint, the request is denied before it ever reaches the application: L3/L4 filtering happens in eBPF inside the kernel, while HTTP rules are enforced by Cilium's embedded Envoy proxy. Even with the proxy in the path, we saw no measurable increase in request latency.
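The FQDN side mentioned earlier works the same way. Here is a minimal sketch for the api.github.com case; the policy name is made up, the frontend-web label matches the example above, and the first egress rule is required so Cilium can observe DNS answers and learn which IPs the name currently maps to:

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "allow-github-api-egress"
  namespace: production
spec:
  endpointSelector:
    matchLabels:
      app: frontend-web
  egress:
  # Allow DNS lookups so Cilium can record name-to-IP mappings
  - toEndpoints:
    - matchLabels:
        k8s:io.kubernetes.pod.namespace: kube-system
        k8s-app: kube-dns
    toPorts:
    - ports:
      - port: "53"
        protocol: ANY
      rules:
        dns:
        - matchPattern: "*"
  # Allow HTTPS only to the resolved addresses of api.github.com
  - toFQDNs:
    - matchName: "api.github.com"
    toPorts:
    - ports:
      - port: "443"
        protocol: TCP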
Step 3: Real-time Observability with Hubble
The most tangible benefit of Cilium is Hubble. Debugging “Connection Refused” errors in a distributed system usually requires tcpdump and a lot of patience. Hubble changes that by providing a high-level stream of every flow in the cluster.
You can watch traffic live from your terminal to see exactly which policy is dropping a packet:
cilium hubble port-forward &
hubble observe --namespace production --follow --verdict DROPPED
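You can also filter by protocol to watch the L7 flows passing through the proxy:

hubble observe --namespace production --protocol http --follow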
If you prefer a GUI, the Hubble UI generates a dynamic service map. It automatically draws the dependencies between your microservices based on actual traffic. This was a massive win for our team. We used it to identify several legacy services that were still making unauthorized calls to our database. To launch the dashboard, simply run:
cilium hubble ui
Navigating to localhost:12000 gives you a visual representation of your cluster’s health. Seeing the “red lines” representing blocked traffic provides immediate confirmation that your security policies are working as intended.
Lessons from 6 Months in Production
Maintaining Cilium is straightforward, but it does require a shift in mindset regarding node resources. Here are three key takeaways from our journey:
- Upgrade Your Kernel: While Cilium supports older kernels, you really want Linux 5.10 or newer. We moved to 5.15 to use the most efficient eBPF helpers, which shaved another 5% off our node CPU usage.
- Watch Your Memory: The Cilium agent is more resource-hungry than Flannel. Expect to allocate roughly 200MB to 500MB of RAM per node, especially with high flow concurrency and Hubble enabled; it is worth setting explicit requests and limits (see the sketch after this list).
- Ditch kube-proxy: Don’t run both. Using Cilium in full replacement mode simplifies your networking stack and removes the complexity of managing large iptables rule sets.
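As a starting point, we cap the agent's footprint via Helm. The numbers below are what fit our nodes, not a universal recommendation; the top-level resources value targets the cilium-agent DaemonSet:

helm upgrade cilium cilium/cilium --version 1.16.0 \
  --namespace kube-system \
  --reuse-values \
  --set resources.requests.memory=256Mi \
  --set resources.limits.memory=1Gi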
Cilium is more than just a CNI; it is a security and observability platform built into the kernel. If you are scaling your Kubernetes environment, eBPF is the most performant path forward.

