The 3 AM Certificate Expiry Incident
After my server got hit by SSH brute-force attacks at midnight, security became my first concern on every new project. So when I joined a team running Kubernetes in production, my opening question was: “Who’s managing your SSL certificates?” The answer was a shared Google Sheet with expiry dates and a Slack reminder bot. Three weeks later, a certificate expired over a weekend and took down the payment gateway for six hours.
Six hours of downtime. From a cert that everyone knew would expire. That’s when it clicked: manual certificate management at scale isn’t a process — it’s a countdown timer you forgot to check.
Why Manual Certificate Management Breaks Down
The problem isn’t laziness. It’s that complexity grows faster than any spreadsheet can track.
A small cluster might have 10 services with TLS. A medium production environment can easily have 50–100 certificates across multiple namespaces, multiple domains, internal services, and wildcard certs. Each one has its own expiry date, renewal window, CA issuer, and secret location. No human tracks that reliably — not without mistakes, and definitely not over a long weekend.
Three failure modes show up again and again:
- Missed renewals — The reminder fires, gets buried in Slack, and nobody acts on it before Friday evening.
- Secret name mismatch — Someone manually updates the cert but the Kubernetes Secret name doesn’t match what the Ingress controller expects. Traffic breaks silently.
- Private CA drift — Internal services use a self-signed CA, but the CA cert itself expires after 2 years. Nobody tracked it.
Every one of these is preventable with automation.
What Are Your Options?
cert-manager isn’t the only tool here. Understanding the alternatives makes it easier to justify the choice.
Option 1: Certbot on each node
The classic approach. Run certbot renew as a cron job on every VM. Works fine for static servers, but it’s a poor fit for Kubernetes. Certificates live outside the cluster and need to be manually synced into Secrets. If a pod restarts and mounts a stale Secret, you’re debugging TLS errors at the worst possible time.
Option 2: External secrets + Vault
HashiCorp Vault can issue and renew certificates through its PKI secrets engine. It’s excellent for large enterprises with complex CA hierarchies. For smaller teams, though, Vault itself needs to be deployed, hardened, unsealed, and maintained — significant overhead before you get any benefit from it.
Option 3: cert-manager
cert-manager runs inside your cluster as a Kubernetes-native controller. It watches custom resources (Certificate, Issuer, ClusterIssuer) and handles the entire lifecycle: request, issuance, storage as a Kubernetes Secret, and automatic renewal. No manual steps. It supports Let’s Encrypt (ACME), private CAs, Vault, and Venafi out of the box.
For most teams on Kubernetes, cert-manager is the obvious fit. It integrates naturally with Ingress annotations and requires almost no ongoing maintenance once it’s running.
Setting Up cert-manager: Step by Step
Step 1: Install cert-manager
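Use the official Helm chart. You’ll need Kubernetes 1.22+ and Helm 3; newer cert-manager releases may require a newer Kubernetes version, so check the supported-releases page for the chart you install. The crds.enabled value used below exists in chart v1.15 and later; older charts use installCRDs=true instead.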
# Add the Jetstack Helm repo
helm repo add jetstack https://charts.jetstack.io
helm repo update
# Install cert-manager with CRDs
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --set crds.enabled=true
Before moving on, confirm the pods are healthy:
kubectl get pods -n cert-manager
You should see three pods: cert-manager, cert-manager-cainjector, and cert-manager-webhook, all in Running state. If the webhook pod is stuck, wait 60 seconds — it initializes last.
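Rather than eyeballing the pod list, a script can block until the Deployments are available. This is a minimal sketch; the function name and the 120-second default timeout are my own choices, not something cert-manager prescribes.

```shell
# Block until every Deployment in the cert-manager namespace reports
# the Available condition, or give up after the timeout.
# $1 = optional timeout (default 120s, an arbitrary value for this sketch).
wait_for_cert_manager() {
  kubectl wait deployment --all \
    --for=condition=Available \
    --namespace cert-manager \
    --timeout="${1:-120s}"
}

# Usage: wait_for_cert_manager        (or: wait_for_cert_manager 300s)
```

The function returns non-zero on timeout, so it can gate the later steps in a CI pipeline.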
Step 2: Configure a Let’s Encrypt ClusterIssuer
A ClusterIssuer is cluster-wide. A plain Issuer is namespace-scoped. For public-facing services, Let’s Encrypt with HTTP-01 challenge is the simplest starting point.
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    email: admin@example.com  # placeholder; use an address you actually monitor
    server: https://acme-v02.api.letsencrypt.org/directory
    privateKeySecretRef:
      name: letsencrypt-prod-account-key
    solvers:
      - http01:
          ingress:
            ingressClassName: nginx
Apply it, then check the status:
kubectl apply -f clusterissuer-letsencrypt.yaml
kubectl get clusterissuer letsencrypt-prod -o yaml
Look for a condition with type: Ready and status: "True" under status.conditions. If it’s stuck, check the controller logs with kubectl logs -n cert-manager deploy/cert-manager. Nine times out of ten it’s a typo in the email or server URL.
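While you are still experimenting, it can be safer to point a second issuer at the Let’s Encrypt staging endpoint (the one listed under Common Mistakes), since the production endpoint is rate limited. The sketch below prints such a manifest; the names letsencrypt-staging and letsencrypt-staging-account-key and the email address are illustrative assumptions.

```shell
# Print a ClusterIssuer manifest that targets the Let's Encrypt staging endpoint.
# Resource names and the email address are placeholders for this sketch.
staging_issuer_manifest() {
  cat <<'EOF'
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging
spec:
  acme:
    email: admin@example.com  # placeholder address
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    privateKeySecretRef:
      name: letsencrypt-staging-account-key
    solvers:
      - http01:
          ingress:
            ingressClassName: nginx
EOF
}

# Usage: staging_issuer_manifest | kubectl apply -f -
```

Certificates from this issuer are not trusted by browsers, so switch the Ingress annotation back to letsencrypt-prod before going live.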
Step 3: Issue Your First Certificate via Ingress Annotation
Add a single annotation to your Ingress. cert-manager detects it and automatically creates a Certificate resource — no extra manifests required.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app-ingress
  namespace: production
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - app.example.com
      secretName: my-app-tls
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-app
                port:
                  number: 80
Once the HTTP-01 challenge passes, usually within 60–90 seconds, cert-manager creates the Secret my-app-tls in the production namespace. Watch it happen in real time:
kubectl get certificate -n production -w
kubectl describe certificate my-app-tls -n production
Renewal kicks in automatically when the certificate is 30 days from expiry (for 90-day Let’s Encrypt certs). You don’t touch it again.
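If the Certificate never becomes Ready, the intermediate objects usually say why: a Certificate produces a CertificateRequest, which for ACME issuers produces an Order and one or more Challenges. The helper below is a small sketch for listing all of them at once; the function name is hypothetical.

```shell
# List every object in the ACME issuance chain for one namespace.
# $1 = namespace (required).
show_issuance_chain() {
  kubectl get certificate,certificaterequest,order,challenge \
    --namespace "${1:?usage: show_issuance_chain <namespace>}"
}

# Usage: show_issuance_chain production
```

A Challenge that stays pending typically points at the HTTP-01 path not being reachable from the internet; kubectl describe on that Challenge shows the reason.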
Step 4: Private CA for Internal Services
Internal microservices on mTLS need certificates too. They shouldn’t use Let’s Encrypt — that’s for public domains. Use a private CA issuer instead.
Generate a root CA first:
# Generate CA private key
openssl genrsa -out ca.key 4096
# Generate CA certificate (valid 10 years)
openssl req -new -x509 -days 3650 -key ca.key -out ca.crt \
  -subj "/CN=Internal Cluster CA/O=MyOrg"
Store it as a Kubernetes Secret:
kubectl create secret tls internal-ca-secret \
  --cert=ca.crt \
  --key=ca.key \
  -n cert-manager
Create a ClusterIssuer backed by that CA:
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: internal-ca
spec:
  ca:
    secretName: internal-ca-secret
Internal services now request certificates from your private CA using the same Certificate resource pattern — just pointing to internal-ca as the issuer.
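Because this CA was generated with openssl and uploaded by hand, nothing in the cluster is tracking its expiry date. One low-tech option is a scheduled check against the ca.crt file itself. The sketch below wraps openssl's -checkend flag; the function name and the 90-day threshold in the usage line are assumptions.

```shell
# Succeed (exit 0) when the certificate in file $1 expires within $2 days.
# openssl's -checkend exits 0 when the cert is still valid after N seconds,
# so its result is inverted here. Returns 2 if the file cannot be read.
cert_expires_within_days() {
  [ -r "$1" ] || { echo "cannot read $1" >&2; return 2; }
  ! openssl x509 -in "$1" -noout -checkend "$(( $2 * 86400 ))" >/dev/null
}

# Usage (ca.crt is the file generated above):
#   if cert_expires_within_days ca.crt 90; then echo "rotate the internal CA soon"; fi
```

Run from cron or a CI job, this gives the CA the same early warning the leaf certificates get from cert-manager.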
Step 5: Explicit Certificate Resources for Non-Ingress Workloads
Some services never touch an Ingress — gRPC backends, PostgreSQL with TLS, internal APIs. For those, create a Certificate resource directly:
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: postgres-tls
  namespace: database
spec:
  secretName: postgres-tls-secret
  issuerRef:
    name: internal-ca
    kind: ClusterIssuer
  dnsNames:
    - postgres.database.svc.cluster.local
  duration: 720h # 30 days
  renewBefore: 168h # renew 7 days before expiry
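To see what duration and renewBefore mean in practice, it helps to work out when renewal actually starts. The sketch below does that arithmetic in hours; when renewBefore is omitted it assumes cert-manager's documented default of one third of the certificate lifetime, which is where the 30-day figure for 90-day Let’s Encrypt certificates comes from. The function name is hypothetical.

```shell
# Print the hour, counted from issuance, at which renewal should begin.
# $1 = duration in hours, $2 = renewBefore in hours (optional).
# Without $2, assume the default of one third of the duration.
renewal_starts_at_hour() {
  local duration=$1
  local renew_before=${2:-$(( duration / 3 ))}
  echo $(( duration - renew_before ))
}

renewal_starts_at_hour 720 168   # postgres-tls: prints 552 (day 23 of 30)
renewal_starts_at_hour 2160      # 90-day cert, default window: prints 1440 (day 60)
```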
Monitoring Certificate Health
Automation handles renewals, but you still want visibility. cert-manager exposes Prometheus metrics by default, though your Prometheus still has to be configured to scrape them. Start with a quick audit across all namespaces:
# List all certificates
kubectl get certificates --all-namespaces
# Flag anything that isn't Ready
kubectl get certificates --all-namespaces | grep -v True
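The grep -v True filter is quick, but it also prints the header row and would hide a broken certificate whose name happens to contain "True". A slightly stricter sketch checks the READY column only; it assumes the default column order NAMESPACE, NAME, READY, SECRET, AGE, and the function name is my own.

```shell
# Read the table from `kubectl get certificates --all-namespaces` on stdin
# and print namespace/name for each row whose third column is not exactly "True".
# The header row (line 1) is skipped.
not_ready_certs() {
  awk 'NR > 1 && $3 != "True" { print $1 "/" $2 }'
}

# Usage: kubectl get certificates --all-namespaces | not_ready_certs
```

Empty output means every certificate is Ready, which makes the function easy to use in a scheduled health check.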
For alerting, add a Prometheus rule that fires when a certificate is under 14 days from expiry and hasn’t renewed yet:
- alert: CertificateExpirationWarning
  expr: certmanager_certificate_expiration_timestamp_seconds - time() < 1209600
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "Certificate expiring in less than 14 days"
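14 days is two calendar weeks, roughly ten business days, which is enough to investigate without a fire drill. Keep the threshold below each certificate’s renewBefore window, though: the postgres-tls example above renews only 7 days before expiry, so a 14-day threshold would fire on every renewal cycle for it. Either raise renewBefore on short-lived certificates or give them a separate, lower threshold.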
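The 1209600 in the expression is just 14 days in seconds, and the threshold only stays quiet if it sits below a certificate's renewBefore window. Both checks are easy to script; the function names below are hypothetical and the comparison is a rule of thumb rather than anything cert-manager enforces.

```shell
# Convert a day count to the seconds value used in the alert expression.
days_to_seconds() {
  echo $(( $1 * 24 * 60 * 60 ))
}

# Succeed when the alert threshold is strictly below the renewBefore window,
# meaning a healthy certificate is renewed before the alert could fire.
# $1 = alert threshold in days, $2 = renewBefore in days.
threshold_is_quiet() {
  [ "$1" -lt "$2" ]
}

days_to_seconds 14   # prints 1209600
threshold_is_quiet 14 30 && echo "90-day Let's Encrypt cert: alert stays quiet"
threshold_is_quiet 14 7 || echo "postgres-tls (renewBefore 168h): alert fires every cycle"
```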
Common Mistakes to Avoid
- Using staging Let’s Encrypt in production: The staging ACME endpoint (https://acme-staging-v02.api.letsencrypt.org/directory) issues certificates that browsers don’t trust. Use it for testing, then swap the server URL before going live.
- HTTP-01 behind a firewall: HTTP-01 requires your domain to be publicly reachable on port 80. Clusters that aren’t reachable from the internet need the DNS-01 challenge instead; cert-manager supports Route53, Cloudflare, and others. Truly air-gapped environments can’t reach Let’s Encrypt at all and should use a private CA.
- Mixing Issuer and ClusterIssuer scope: A plain Issuer only works within its own namespace. Reference it from a different namespace and you’ll get a cryptic “issuer not found” error. When in doubt, use ClusterIssuer.
- Forgetting your private CA expiry: cert-manager manages the leaf certificates it issues. It won’t warn you when the CA cert itself expires. Set a calendar reminder for your CA expiry. The Prometheus alert above only covers certificates that exist as Certificate resources, so it catches the CA only if the CA certificate is itself issued through cert-manager.
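For the firewall case, a DNS-01 issuer looks much like the HTTP-01 one with a different solver. The sketch below prints a manifest for Cloudflare; the issuer name, the email address, and the Secret cloudflare-api-token (with key api-token) are assumptions, and that Secret has to exist in the cert-manager namespace before the issuer can work.

```shell
# Print a ClusterIssuer manifest that solves ACME challenges over DNS-01
# using Cloudflare. All names and the email address are placeholders.
dns01_issuer_manifest() {
  cat <<'EOF'
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-dns01
spec:
  acme:
    email: admin@example.com  # placeholder address
    server: https://acme-v02.api.letsencrypt.org/directory
    privateKeySecretRef:
      name: letsencrypt-dns01-account-key
    solvers:
      - dns01:
          cloudflare:
            apiTokenSecretRef:
              name: cloudflare-api-token  # hypothetical Secret holding the API token
              key: api-token
EOF
}

# Usage: dns01_issuer_manifest | kubectl apply -f -
```

DNS-01 is also the only ACME challenge type that can issue wildcard certificates, which is another reason to keep an issuer like this around.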
The Result: Zero-Touch Certificate Management
After rolling this out across our cluster, certificate management disappeared as a recurring task. No spreadsheet. No Slack reminders. No weekend pages.
The Prometheus alert fired twice in eight months. Both times it was for services we’d temporarily moved outside cert-manager’s control. Both times we caught it with more than 10 days to spare — enough to fix it calmly during business hours.
The setup is minimal: install cert-manager, create one ClusterIssuer for public domains and one for internal services, add a single annotation to your Ingress resources. Renewals, Secret updates, and Ingress reloads happen without you. Your job becomes reviewing the occasional alert rather than hunting down expiry dates.
That’s a much better place to be at 3 AM.

