The 3:00 AM Namespace Nightmare
It happens in an instant. A tired engineer runs kubectl delete namespace prod when they meant dev. In five seconds, your production API, databases, and ConfigMaps vanish. While your CI/CD pipeline might redeploy the application code, it cannot resurrect the data. If you lose 500GB of customer uploads or the specific state of a legacy database, a YAML manifest won’t save you.
GitOps is powerful, but it is not a backup strategy for state. Many teams assume that because their infrastructure is defined in code, they are protected. They aren’t. Git defines how your cluster should look, but it doesn’t hold the actual bits inside your Persistent Volumes (PVs). When a cloud region goes dark or a volume becomes corrupted, you need a way to recover the data, not just the deployment.
Why etcd Backups Aren’t Enough
Kubernetes is ephemeral and distributed by design. Unlike a traditional VM where you snapshot a single disk, a K8s app is a fragmented collection of Services, Secrets, and PVCs spread across different nodes. Relying solely on etcd backups is risky and inefficient.
Restoring a 200MB etcd snapshot just to recover one specific namespace is overkill. It is like using a sledgehammer to crack a nut. More importantly, etcd only stores metadata. If the underlying EBS or Azure Disk is physically gone, etcd has no record of the actual data. This is where Velero fills the gap.
Velero: The Industry Standard for Cluster Recovery
Velero (formerly Heptio Ark) is the go-to tool for this exact problem. It communicates directly with the Kubernetes API to back up resources while simultaneously triggering snapshots on your storage provider. This creates a unified recovery point that bundles configuration and data together.
I have deployed Velero across clusters managing hundreds of nodes, and the stability is impressive. When you trigger a backup, Velero grabs the YAML definitions and, where the backend supports it, tells your storage provider (AWS EBS, GCP Persistent Disk, or any snapshot-capable CSI driver) to snapshot the volumes immediately. The resulting metadata and archives land in object storage, which can be AWS S3, GCS, or a local MinIO instance.
Core Components
- Velero Client: The CLI tool used to manage backups.
- Velero Server: A set of pods that orchestrate the backup and restore processes.
- Object Storage: A bucket (like S3) that stores the backup archives, metadata, and logs.
- Volume Snapshotter: A plugin that interfaces with your storage backend (EBS, Longhorn, etc.) to handle PV data.
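Under the hood, the CLI and server communicate through custom resources. As a rough sketch (the names here are illustrative, and the velero namespace is the installer's default), a backup request is just a Backup object that the server reconciles:

```yaml
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: example-backup        # hypothetical name
  namespace: velero           # default install namespace
spec:
  includedNamespaces:
    - app-production          # which namespaces to capture
  storageLocation: default    # BackupStorageLocation to write archives to
  volumeSnapshotLocations:
    - default                 # where PV snapshots are recorded
  ttl: 720h0m0s               # expire this backup after 30 days
```

This is why everything the CLI does can also be driven declaratively, for example from a GitOps repository.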
Hands-on Practice: Setting Up Velero
For this setup, we will use an S3-compatible backend. Ensure you have kubectl access to a running cluster before starting.
1. Install the Velero CLI
Download the latest release to your local machine. Keeping your CLI version aligned with the server is critical for compatibility.
# Example for Linux v1.15.0
wget https://github.com/vmware-tanzu/velero/releases/download/v1.15.0/velero-v1.15.0-linux-amd64.tar.gz
tar -xvf velero-v1.15.0-linux-amd64.tar.gz
sudo mv velero-v1.15.0-linux-amd64/velero /usr/local/bin/
velero version --client-only   # confirm the CLI is on your PATH
2. Configure Storage Credentials
Velero needs permission to write to your object storage. Create a credentials-velero file with your keys:
[default]
aws_access_key_id = AKIAIOSFODNN7EXAMPLE
aws_secret_access_key = wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
3. Deploy the Velero Server
This command installs the server components into your cluster. We will use MinIO as the object-storage backend for this example; the AWS S3 process is nearly identical. One caveat: MinIO only stores the backup archives and metadata. Volume snapshots still require a snapshot-capable backend such as EBS or a CSI driver. On a pure MinIO setup, set --use-volume-snapshots=false and rely on the Node Agent covered later.
velero install \
--provider aws \
--plugins velero/velero-plugin-for-aws:v1.9.0 \
--bucket k8s-backups \
--secret-file ./credentials-velero \
--use-volume-snapshots=true \
--backup-location-config region=minio,s3ForcePathStyle="true",s3Url=http://minio.storage.svc.cluster.local:9000 \
--snapshot-location-config region=default
Check the health of your installation by monitoring the pods in the velero namespace:
kubectl get pods -n velero
velero backup-location get   # PHASE should report "Available"
Real-World Backup and Restore
Let’s protect a production namespace called app-production containing a stateful database.
Triggering a Manual Backup
Run this command to create a point-in-time snapshot of your entire namespace and its volumes:
velero backup create prod-snapshot-01 --include-namespaces app-production
Verify the status and confirm the backup phase reports Completed:
velero backup describe prod-snapshot-01
velero backup get
Simulating Disaster
Delete the entire namespace. If your StorageClass uses the default Delete reclaim policy, as most CSI drivers do, the underlying PVs will be removed along with the pods.
kubectl delete namespace app-production
The Restore Workflow
Velero will now recreate the entire environment from your S3 snapshot. It restores the namespace first, then calls the cloud provider to recreate the PVs, and finally deploys the resources in the correct dependency order.
velero restore create --from-backup prod-snapshot-01
velero restore get   # watch the STATUS column until it shows Completed
Within 2 to 5 minutes, depending on volume size, your application will be back online with no data lost since the backup was taken.
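Restores are themselves custom resources, which makes it easy to rehearse recovery without touching production. As a sketch (the restore name and target namespace here are hypothetical), Velero's namespaceMapping field lets you restore the backup into a scratch namespace for verification:

```yaml
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: prod-snapshot-01-drill   # hypothetical name
  namespace: velero
spec:
  backupName: prod-snapshot-01   # the backup created above
  namespaceMapping:
    # restore app-production's contents into a test namespace
    app-production: app-restore-drill
```

Running a drill like this regularly is the only way to know your backups actually restore.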
Advanced Strategy: RPO and File-Level Backups
Manual backups are unreliable because someone has to remember to run them. In production, I always configure hourly schedules to achieve a 60-minute Recovery Point Objective (RPO).
velero schedule create hourly-prod --schedule="0 * * * *" --include-namespaces app-production --ttl 720h0m0s
The --ttl (Time To Live) flag is essential. It automatically purges backups older than 30 days to keep your cloud storage costs under control.
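If you manage your cluster declaratively, the same schedule can live in Git as a Schedule resource instead of a CLI invocation. A minimal sketch, assuming the default velero install namespace:

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: hourly-prod
  namespace: velero
spec:
  schedule: "0 * * * *"          # standard cron syntax: top of every hour
  template:                       # an embedded Backup spec
    includedNamespaces:
      - app-production
    ttl: 720h0m0s                 # purge each backup after 30 days
```

Each firing of the schedule produces a Backup object named hourly-prod-<timestamp>, so velero backup get shows the full history.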
Using the Node Agent (Restic/Kopia)
What if your storage provider doesn’t support snapshots? You can use the Velero Node Agent, which performs file-system level backups using Kopia (or, in older releases, Restic). It is slower than storage-level snapshots but works on any infrastructure, including on-premise bare metal.
Enable it during installation with --use-node-agent, then annotate your pods to opt their volumes in (or pass --default-volumes-to-fs-backup to back up every volume without annotations):
kubectl annotate pod <pod-name> backup.velero.io/backup-volumes=data-storage
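Annotating live pods by hand doesn’t survive a redeploy, so in practice the annotation belongs in the workload’s pod template. A minimal sketch, where the Deployment, image, and claim names are hypothetical and data-storage matches the volume name from the annotation above:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: db                        # hypothetical workload
  namespace: app-production
spec:
  replicas: 1
  selector:
    matchLabels:
      app: db
  template:
    metadata:
      labels:
        app: db
      annotations:
        # Node Agent backs up the volume(s) named here (comma-separated)
        backup.velero.io/backup-volumes: data-storage
    spec:
      containers:
        - name: db
          image: postgres:16      # example image
          volumeMounts:
            - name: data-storage
              mountPath: /var/lib/postgresql/data
      volumes:
        - name: data-storage
          persistentVolumeClaim:
            claimName: db-data    # hypothetical PVC
```

Because the annotation sits in the template, every replacement pod is automatically covered by the next scheduled backup.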
Final Verdict: Reliability is a Process
Kubernetes scales your apps, but it doesn’t automatically protect your business data. Moving from a ‘hope-based’ strategy to a verified recovery plan is a major milestone for any DevOps team. By implementing Velero, you build a safety net that allows your engineers to move faster without the constant fear of data loss.
Start small. Pick one critical namespace, set up a local MinIO backend, and practice a full restore. Once you see your data reappear after a kubectl delete, you will sleep much better during your on-call shifts. Reliability isn’t just about the software you use; it’s about the recovery processes you actually test.

