Deploying Knative on Kubernetes: Build Serverless Infrastructure that Scales to Zero

DevOps tutorial - IT technology blog

The Hidden Cost of Idle Clusters

A few years ago, I spent my Friday afternoons auditing our Kubernetes clusters, and I felt a pang of guilt every time I looked at the metrics. We were running 40+ microservices, yet most of them—like a Slack notification webhook or a weekly report generator—did absolutely nothing 99% of the time. We were essentially paying $600 a month to keep idle containers warm, their memory and CPU reservations held 24/7.

This “idle pod syndrome” is a silent budget killer. In a standard Kubernetes setup, you typically keep at least one replica per service to ensure it can respond to requests immediately. If you have 50 small services, that’s 50 pods hogging space even at 3:00 AM when traffic is non-existent. DevOps teams eventually hit a wall where they are paying for capacity rather than performance.

Why Traditional Kubernetes Autoscaling Falls Short

You might consider using the standard Horizontal Pod Autoscaler (HPA), but it’s an uphill battle for serverless workloads. The HPA scales based on metrics like CPU or memory usage. Even an idle pod still holds its baseline memory reservation—often 64Mi or 128Mi—just to stay alive. Crucially, the vanilla HPA cannot scale down to zero replicas because it has no way to “wake up” a service when a new request arrives at an empty endpoint.

The limitation is baked into how Kubernetes handles networking. A standard Service is an internal load balancer that points at existing pod IPs. If no pods exist, there is nothing to route to: the connection is refused, or your ingress answers with a 503. Kubernetes has no native “waiting room” to buffer requests while it spins up a container. This is the exact gap that Knative fills.
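For comparison, this is what a vanilla HPA manifest looks like. The Deployment name here is illustrative; the point is the floor on replicas, since the API only permits a zero minimum behind the alpha HPAScaleToZero feature gate:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: hello-world
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: hello-world   # illustrative target Deployment
  minReplicas: 1        # the floor: one pod stays warm around the clock
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

Even at 3:00 AM with zero traffic, that `minReplicas: 1` pod keeps its full resource reservation.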

Comparing Your Serverless Options

When I first tackled this, I evaluated three main paths:

  • Cloud Functions (AWS Lambda, GCF): These scale to zero beautifully, but vendor lock-in is a massive risk. Moving 100 functions from AWS to Azure isn’t a weekend task; it’s a multi-month migration nightmare.
  • KEDA with HPA: You can set minReplicas: 0 using KEDA, but managing the custom metrics and event triggers adds significant configuration overhead. It often feels like you’re fighting the platform rather than using it.
  • Knative: This adds a request-driven abstraction layer on top of your existing cluster. It handles the “Scale-to-Zero” logic automatically by intercepting traffic, making it the most balanced choice for container-native teams.
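To make the KEDA trade-off concrete, here is a rough sketch of a ScaledObject for a hypothetical queue-driven worker (the Deployment name, queue name, and connection string are all illustrative). Note that for plain HTTP services, KEDA core has no request interceptor; that requires a separate add-on, which is part of the overhead mentioned above:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: report-generator
spec:
  scaleTargetRef:
    name: report-generator       # illustrative existing Deployment
  minReplicaCount: 0             # KEDA does allow a zero floor
  maxReplicaCount: 5
  triggers:
    - type: rabbitmq             # event source, not HTTP traffic
      metadata:
        queueName: reports
        mode: QueueLength
        value: "10"              # scale up per 10 queued messages
        host: amqp://guest:guest@rabbitmq.default:5672/
```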

The Knative Solution: Performance Without the Waste

I’ve found that Knative is the most reliable way to achieve a serverless experience without losing the flexibility of Docker. After rolling this out in our dev environments, we reduced our cluster overhead by nearly 40%. It allows you to package any app as a standard image while running it with the efficiency of a function. Let’s get it running.

You’ll need a functional Kubernetes cluster and kubectl access to follow along.

Step 1: Install Knative Serving CRDs

First, we apply the Custom Resource Definitions (CRDs) that allow Knative to manage its unique service types.

kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.12.0/serving-crds.yaml

Step 2: Deploy Core Serving Components

Next, we install the controllers that handle the actual serverless logic and pod lifecycle management.

kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.12.0/serving-core.yaml

Step 3: Simplify Networking with Kourier

Knative needs a gateway to route traffic. While Istio is the industry standard, it’s often too heavy for smaller projects. Kourier is a lightweight alternative that gets the job done without the extra sidecars.

kubectl apply -f https://github.com/knative/net-kourier/releases/download/knative-v1.12.0/kourier.yaml

# Configure Kourier as the default ingress
kubectl patch configmap/config-network \
  --namespace knative-serving \
  --type merge \
  --patch '{"data":{"ingress-class":"kourier.ingress.networking.knative.dev"}}'

Step 4: Configure Automatic DNS

For testing, we use sslip.io to handle DNS automatically. This saves us from editing /etc/hosts files every time we deploy a new service.

kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.12.0/serving-default-domain.yaml

Deploying a Service That Scales to Zero

Now for the payoff. We’ll deploy a web application using a Knative Service object. This is not the same as a standard K8s Service; it combines deployment, routing, and scaling into one resource.

Create hello-knative.yaml:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello-world
  namespace: default
spec:
  template:
    spec:
      containers:
        - image: gcr.io/knative-samples/helloworld-go
          env:
            - name: TARGET
              value: "Knative Expert"
          resources:
            limits:
              cpu: "200m"
              memory: "128Mi"

Deploy it with a single command:

kubectl apply -f hello-knative.yaml

Watch the Scale-to-Zero Magic

Once the deployment finishes, keep an eye on your cluster:

kubectl get pods -w

A pod will spin up immediately to handle the initial setup. Now, wait about 60 to 90 seconds. You’ll see the pod enter Terminating and eventually disappear. Your service is now scaled to zero, consuming exactly zero CPU and RAM on your worker nodes.

To wake it up, fetch your service URL:

kubectl get ksvc hello-world

Send a curl request to that URL. You’ll notice a brief delay—usually between 1.5 and 2.5 seconds. This is the “cold start.” During this window, Knative’s Activator buffers the incoming request and signals the Autoscaler to boot a new pod. As soon as the container is healthy, the request is released and served.

Hard-Won Production Tips

Running Knative in production taught me that scale-to-zero isn’t always the right answer for every workload. Here is how I tune it:

  • Request-Based Scaling: Instead of scaling on CPU, use concurrency targets. Setting autoscaling.knative.dev/target: "10" tells the Autoscaler to aim for roughly 10 concurrent requests per pod, so it adds capacity as soon as sustained concurrency pushes past that target. This is far more responsive for web traffic.
  • Eliminating Cold Starts: For high-priority APIs where a 2-second delay is unacceptable, set autoscaling.knative.dev/min-scale: "1". You still get the simplified Knative deployment model, but the service stays warm.
  • Tight Resource Limits: Serverless doesn’t mean infinite resources. Always define your limits to prevent a single buggy function from eating your entire cluster’s budget.
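Putting the first two tips together: the autoscaling annotations go on the revision template’s metadata, not the Service’s top-level metadata. A sketch based on the hello-world service from earlier:

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello-world
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/target: "10"    # aim for ~10 concurrent requests per pod
        autoscaling.knative.dev/min-scale: "1"  # keep one pod warm; disables scale-to-zero
    spec:
      containers:
        - image: gcr.io/knative-samples/helloworld-go
          resources:
            limits:
              cpu: "200m"
              memory: "128Mi"
```

Applying this creates a new revision, so the change rolls out without downtime.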

The Bottom Line

Switching to Knative has been one of the most effective ways I’ve optimized our infrastructure. It removes the mental burden of managing HPA thresholds and allows developers to focus on shipping code. By implementing scale-to-zero, you transform your cluster into a dynamic engine that only works when there is real work to be done. It’s efficient, it’s portable, and it just works.
