When Backups Fail at the Worst Time
A backup job running without errors every night feels like protection. Until the day you actually need it. The dump is corrupted. Or the restore takes six hours on a database you expected back in thirty minutes. Or the most recent clean backup is four days old, because the job had been silently failing and nobody caught it.
None of this is bad luck. It’s the predictable result of treating the backup as the finish line instead of the recovery. Most teams design their schedule carefully, set it, and forget it. Nobody runs a test restore. The root cause is simple: someone owns the backup job, but nobody owns the recovery.
Comparing Backup Approaches
No single strategy fits every setup. Each one trades off speed, recovery granularity, storage cost, and operational complexity differently. Getting the mix right starts with knowing what you’re actually choosing between.
Logical Dumps (pg_dump, mysqldump)
The most common starting point — and for good reason. Logical dumps export your data as SQL statements (plain format) or a portable binary archive (custom format, `-Fc`). Plain dumps are human-readable; both are portable across major versions and easy to automate with a single command.
# Create a compressed PostgreSQL dump
pg_dump -U postgres -d mydb -Fc -f mydb_backup.dump
# Restore it to a new database
createdb mydb_restored
pg_restore -U postgres -d mydb_restored mydb_backup.dump
The catch: dumps are slow at scale. A 300GB PostgreSQL database takes 3–4 hours to dump and often longer to restore. They also don’t support point-in-time recovery — you can only go back to the exact moment the dump ran. Fine for smaller setups, but a real gap for high-traffic production systems.
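If dump time itself is the bottleneck, pg_dump's directory format can parallelize the work across workers, which often cuts wall-clock time substantially on multi-core hosts. A sketch, guarded so it only attempts anything where a reachable PostgreSQL server exists; the database names, paths, and worker count are placeholders:

```shell
#!/usr/bin/env bash
set -u
# Parallel dump/restore sketch: directory format (-Fd) is required for
# parallel jobs (-j). "mydb", the target paths, and 4 workers are
# example values; mydb_restored is assumed to exist (createdb mydb_restored).
if command -v pg_dump >/dev/null 2>&1 && pg_isready -q 2>/dev/null; then
  pg_dump -U postgres -d mydb -Fd -j 4 -f /backups/mydb_dir
  pg_restore -U postgres -d mydb_restored -j 4 /backups/mydb_dir
  result="ran"
else
  result="skipped (no reachable PostgreSQL server on this host)"
fi
echo "parallel dump: $result"
```

Pick the job count from CPU and I/O headroom on the backup host, not the server's core count alone.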
Physical Backups (pg_basebackup, Percona XtraBackup)
Instead of exporting SQL, physical backups copy the raw data directory directly. That same 300GB PostgreSQL database takes 20–30 minutes with pg_basebackup — versus hours with pg_dump. For production systems, that gap is hard to ignore.
# PostgreSQL physical backup
pg_basebackup -U postgres -D /backups/base -Fp -Xs -P
# Percona XtraBackup for MySQL
xtrabackup --backup --target-dir=/backups/mysql_base
xtrabackup --prepare --target-dir=/backups/mysql_base
One constraint: physical backups require matching the major database version on restore, and they’re not human-readable. But when downtime directly costs money, the speed difference is usually worth the added complexity.
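The restore side of a physical backup is essentially stop, swap the data directory, start. A hedged sketch with example paths, version (14), and service name, guarded so it does nothing on a host without the backup directory:

```shell
#!/usr/bin/env bash
set -u
# Restore sketch for a pg_basebackup copy. Paths and the "postgresql"
# service name are examples; the server must be stopped first, and the
# major version must match the one that took the backup.
if [ -d /backups/base ] && command -v systemctl >/dev/null 2>&1; then
  systemctl stop postgresql
  mv /var/lib/postgresql/14/main /var/lib/postgresql/14/main.old
  cp -a /backups/base /var/lib/postgresql/14/main
  chown -R postgres:postgres /var/lib/postgresql/14/main
  systemctl start postgresql
  result="restored"
else
  result="skipped (no /backups/base on this host)"
fi
echo "basebackup restore: $result"
```

Keeping the old data directory as `main.old` until the restored server is verified gives you a way back if the backup turns out to be bad.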
WAL / Binary Log Streaming (Point-in-Time Recovery)
WAL streaming is where backup gets serious for production. Take a physical base backup, then continuously ship Write-Ahead Log files (PostgreSQL) or binary logs (MySQL) to a separate location. The result: recovery to any specific second in time — not just the last scheduled backup window.
# postgresql.conf: enable WAL archiving
wal_level = replica
archive_mode = on
# 'test ! -f' refuses to overwrite a segment that was already archived
archive_command = 'test ! -f /wal_archive/%f && cp %p /wal_archive/%f'
# Also in postgresql.conf (PostgreSQL 12+ dropped recovery.conf):
# set the recovery target, then create an empty recovery.signal file
# in the data directory and start the server
restore_command = 'cp /wal_archive/%f %p'
recovery_target_time = '2026-03-14 14:30:00'
recovery_target_action = 'promote'
Tools like pgBackRest and Barman handle the entire WAL lifecycle — compression, verification, retention — without manual scripting.
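On the MySQL side, the equivalent prerequisite is the binary log. A minimal my.cnf sketch, assuming MySQL 8.0; the paths and the 7-day retention are example values to align with your base-backup cadence:

```ini
# /etc/mysql/my.cnf — enable binary logging for point-in-time recovery
[mysqld]
server_id = 1
log_bin = /var/log/mysql/mysql-bin
binlog_format = ROW
# Keep 7 days of binlogs (seconds); must cover the gap between base backups
binlog_expire_logs_seconds = 604800
```

Replaying binlogs against a restored base backup with mysqlbinlog is what gives MySQL the same "recover to a specific second" capability described above.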
Snapshot-Based Backups
Cloud volume snapshots — AWS EBS, GCP Persistent Disk — and filesystem-level snapshots like ZFS or LVM complete in seconds regardless of database size. A 10TB volume snapshots as fast as a 10GB one. That speed makes them a solid infrastructure-level safety net. Two tradeoffs: you restore an entire volume, not a specific table or transaction, and a snapshot taken while the database is running is only crash-consistent unless you quiesce writes first, so the database replays its WAL on startup exactly as after a power loss. Useful as one layer in a broader strategy, but not as your primary recovery path.
Pros and Cons of Each Approach
- Logical dumps: Simple to set up, portable across versions, works well for databases under 50GB or cross-version migrations. Recovery is slow and PITR isn’t supported. Don’t rely on this alone for large production databases.
- Physical backups: Fast, consistent, well-suited for large databases. Requires matching the major version on restore and doesn’t port across engines. The right primary backup for production PostgreSQL or MySQL.
- WAL/binlog streaming: Near-zero RPO, flexible recovery targets. Higher storage and operational overhead. Essential when losing more than a few minutes of data isn’t acceptable.
- Snapshots: Near-instant at any database size, solid infrastructure protection. Not granular, tied to a specific cloud or filesystem technology. Best as a complementary layer alongside database-native strategies.
Recommended Setup: 3-2-1 Plus Recovery Testing
The 3-2-1 rule — three copies of data, two different media types, one copy offsite — translates to this for most production databases:
- Daily physical backup stored locally for fast recovery from common failures
- Continuous WAL or binary log streaming to object storage (S3, GCS) for point-in-time recovery
- Weekly full backup replicated to a different cloud region or provider for regional disasters
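The local tier above also needs a retention policy, or it quietly fills the disk. A pruning sketch built around find's -mtime test; the 14-day window and directory layout are assumptions, demonstrated here against a scratch directory:

```shell
#!/usr/bin/env bash
set -euo pipefail

# prune_backups DIR DAYS — delete backup files older than DAYS days.
prune_backups() {
  local dir="$1" days="$2"
  # -mtime +N matches files last modified more than N*24h ago
  find "$dir" -type f -mtime "+$days" -print -delete
}

# Demo against a scratch directory with one fresh and one stale file
demo=$(mktemp -d)
touch "$demo/fresh.dump"
touch -d "20 days ago" "$demo/stale.dump"
prune_backups "$demo" 14
ls "$demo"   # only fresh.dump remains
```

In production you would call `prune_backups /backups/daily 14` from the same cron job that takes the backup, after the backup succeeds, so a broken job never prunes your last good copies.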
What most teams skip: actually testing the restore. Run a monthly recovery drill to a separate environment and verify the data. For critical systems, automate that weekly. A backup nobody has ever restored is a theory. Run it once and you’ll know exactly where the gaps are.
Implementation Guide
Setting Up pgBackRest for PostgreSQL
pgBackRest handles physical backups, WAL archiving, encryption, and parallel processing in one tool. After installing the package, configure it like this:
# /etc/pgbackrest.conf
[global]
repo1-path=/backups/pgbackrest
repo1-retention-full=2
repo1-cipher-type=aes-256-cbc
repo1-cipher-pass=your-strong-encryption-key
[mydb]
pg1-path=/var/lib/postgresql/14/main

# postgresql.conf must also route WAL archiving through pgBackRest:
#   archive_mode = on
#   archive_command = 'pgbackrest --stanza=mydb archive-push %p'
# Initialize and run the first full backup
pgbackrest --stanza=mydb stanza-create
pgbackrest --stanza=mydb --log-level-console=info backup --type=full
# Daily incremental backup
pgbackrest --stanza=mydb backup --type=incr
# Restore to a specific point in time
pgbackrest --stanza=mydb --delta restore \
--type=time --target="2026-03-14 14:30:00" \
--target-action=promote
Automating MySQL Backups with XtraBackup
#!/bin/bash
# /usr/local/bin/mysql_backup.sh
set -euo pipefail  # stop on any failed step so a partial backup never uploads
BACKUP_DIR="/backups/mysql/$(date +%Y%m%d_%H%M%S)"
mkdir -p "$BACKUP_DIR"
xtrabackup --backup --target-dir="$BACKUP_DIR" \
--user=backup_user --password="$MYSQL_BACKUP_PASSWORD"
xtrabackup --prepare --target-dir="$BACKUP_DIR"
# Upload to S3 and remove local copy after successful sync
aws s3 sync "$BACKUP_DIR" "s3://my-bucket/mysql-backups/$(date +%Y%m%d)/" \
--delete && rm -rf "$BACKUP_DIR"
echo "$(date): Backup completed" >> /var/log/mysql_backup.log
# crontab -e
# Full backup at 2am daily
0 2 * * * /usr/local/bin/mysql_backup.sh >> /var/log/mysql_backup.log 2>&1
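The restore path for these XtraBackup copies is --copy-back into an empty data directory. A guarded sketch with example paths; MySQL must be stopped and /var/lib/mysql empty before the copy:

```shell
#!/usr/bin/env bash
set -u
# XtraBackup restore sketch: --copy-back moves a prepared backup into
# the datadir. The backup path and "mysql" service name are examples.
if command -v xtrabackup >/dev/null 2>&1 && [ -d /backups/mysql_base ]; then
  systemctl stop mysql
  xtrabackup --copy-back --target-dir=/backups/mysql_base
  chown -R mysql:mysql /var/lib/mysql
  systemctl start mysql
  result="restored"
else
  result="skipped (xtrabackup or backup directory not present)"
fi
echo "xtrabackup restore: $result"
```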
Verifying Backup Integrity
A backup file existing is not the same as it being restorable. For logical dumps, run a listing check and a test restore to a scratch database at minimum:
# Quick integrity check on a pg_dump file
pg_restore --list mydb_backup.dump | head -30
# Full verification: restore to an isolated test database
createdb mydb_verify
pg_restore -U postgres -d mydb_verify mydb_backup.dump
# Spot-check row counts match production
psql -U postgres -d mydb_verify -c "SELECT COUNT(*) FROM users;"
psql -U postgres -d mydb -c "SELECT COUNT(*) FROM users;"
# Clean up after verification
dropdb mydb_verify
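For copies that travel (S3, another region), a checksum manifest catches corruption in transit before you waste hours on a failed restore. A sketch using scratch files in place of a real dump:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Write a SHA-256 manifest next to a backup, then verify it later.
# The files here are stand-ins for a real dump.
workdir=$(mktemp -d)
printf 'pretend this is a dump\n' > "$workdir/mydb_backup.dump"

# At backup time: record the checksum alongside the file
( cd "$workdir" && sha256sum mydb_backup.dump > mydb_backup.dump.sha256 )

# At verify time (e.g. after downloading from S3): -c re-checks it
( cd "$workdir" && sha256sum -c mydb_backup.dump.sha256 )
# → mydb_backup.dump: OK
```

Upload the .sha256 file with the backup; verifying after download proves the bytes survived the round trip, though only a test restore proves the dump itself was good.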
After restoring, I sometimes export a sample of records to CSV and run a quick sanity-check script against expected values; spot-checking real rows catches subtle corruption that row counts alone miss. Keep that processing local to the machine you’re working on: during a recovery drill you’re often handling real production data, and it shouldn’t leave your environment.
Monitoring for Silent Failures
Silent failures are the biggest risk most teams aren’t actively watching for. Three signals worth setting up alerts on:
- Backup job exit code — any non-zero exit should trigger an alert immediately
- Backup file age — if the latest backup is older than 25 hours, something broke
- Backup size anomaly — a sudden drop in size often means empty tables or a partial run
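The third signal can be scripted with plain shell arithmetic. A sketch of the size comparison; the 50% threshold is an assumption to tune against your data's normal day-to-day variance:

```shell
#!/usr/bin/env bash
set -euo pipefail

# check_size_drop PREV_BYTES CURR_BYTES — warn if the newest backup is
# less than half the size of the previous one. The 50% threshold is an
# example; in practice read both sizes with stat -c %s from disk.
check_size_drop() {
  local prev_bytes="$1" curr_bytes="$2"
  if [ "$curr_bytes" -lt $(( prev_bytes / 2 )) ]; then
    echo "WARNING: backup shrank from ${prev_bytes}B to ${curr_bytes}B"
    return 1
  fi
  echo "OK: size within expected range"
}

check_size_drop 1000000 950000        # normal nightly variance
check_size_drop 1000000 200000 || true  # 80% drop triggers the warning
```

Wiring the warning branch into the same mail alert used below keeps all three signals in one place.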
#!/bin/bash
# /usr/local/bin/check_backup_age.sh
BACKUP_FILE="/backups/mydb_latest.dump"
MAX_AGE=90000 # 25 hours in seconds
if [ ! -f "$BACKUP_FILE" ]; then
echo "CRITICAL: Backup file not found" | mail -s "DB Backup Alert" [email protected]
exit 1
fi
FILE_AGE=$(( $(date +%s) - $(stat -c %Y "$BACKUP_FILE") ))
if [ "$FILE_AGE" -gt "$MAX_AGE" ]; then
echo "WARNING: Backup is stale (${FILE_AGE}s old, max ${MAX_AGE}s)" | \
mail -s "DB Backup Alert" [email protected]
fi
# Run check every hour via cron
0 * * * * /usr/local/bin/check_backup_age.sh
Recovery Is the Real Measure
Start with logical dumps if you’re setting this up for the first time. They’re forgiving, easy to reason about, and you can have something working in under an hour. Once your database crosses 50GB, move to physical backups with WAL or binary log streaming. Add cloud snapshots for infrastructure-level protection at any size.
The teams that handle incidents well share one thing: they’ve practiced. They know their actual recovery time — not the estimate in a runbook, but the real number from a drill. They trust their backups because they’ve watched them restore. Build that confidence before you’re doing it at 2am under pressure.

