I Built a Personal Wayback Machine: 6 Months with ArchiveBox on Docker

The 5-Minute Quick Start

Spinning up ArchiveBox on Docker is one of those rare HomeLab wins that works almost immediately. Docker Compose is my go-to here: the whole setup is a single YAML file plus a data directory, so moving it to another server comes down to copying one folder. I keep the configuration lean so it stays manageable.

First, create a dedicated directory for your archive and a docker-compose.yml file:

mkdir ~/archivebox && cd ~/archivebox
touch docker-compose.yml

Next, paste this configuration into the file. Note that I’ve capped MEDIA_MAX_SIZE at 750MB to prevent a single YouTube video from eating my entire boot drive:

version: '3.9'

services:
  archivebox:
    image: archivebox/archivebox:latest
    command: server --bind 0.0.0.0:8000
    ports:
      - "8000:8000"
    volumes:
      - ./data:/data
    environment:
      - ALLOW_REGISTRATION=False
      - ADMIN_USERNAME=admin
      - ADMIN_PASSWORD=your_secure_password
      - MEDIA_MAX_SIZE=750m

Initialization is straightforward. Run these commands to set up the database and start the engine:

docker compose run archivebox init --setup
docker compose up -d

Your new archive is now live at http://localhost:8000. To test it out, try adding a URL directly from the CLI:

docker compose run archivebox add 'https://itfromzero.com'

The Reality of Digital Preservation

We’ve all felt the sting of “Link Rot.” You find the perfect fix for an obscure Linux kernel panic, bookmark it, and return six months later only to find a 404 error or a parked domain. ArchiveBox solves this by orchestrating a heavy-duty toolkit—Headless Chrome, Wget, Curl, and SingleFile—to grab a forensic snapshot of any page.

ArchiveBox doesn’t just save a link; it creates a multi-layered fallback system. Every time you add a URL, the system generates a PDF, a high-res PNG screenshot, a WARC file, and a localized HTML file. If one format fails to render in 2030, you have three others waiting in the wings. This is foundational for digital sovereignty.

Performance-wise, the default SQLite database is surprisingly resilient. Even after 2,140 archived items, my dashboard remains snappy. The snapshots themselves are stored as plain files under /data/archive inside the container, which is ./data/archive on the host with the volume mapping above. This means your snapshots are searchable using grep or find, even if the ArchiveBox container is completely stopped.
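As a minimal sketch of that container-free search, the helper below greps the text-based outputs of every snapshot. It assumes the ./data:/data mapping from the compose file (so the archive sits at ./data/archive on the host) and a grep that supports --include, as GNU and BSD grep do. The function name search_archive is my own, not an ArchiveBox command.

```shell
# search_archive PATTERN [ARCHIVE_DIR]
# Lists every HTML or text file under the archive that contains PATTERN.
# -r recurse, -I skip binary files, -i ignore case, -l print file names only.
search_archive() {
  pattern=$1
  archive_dir=${2:-./data/archive}   # assumed host path for the /data/archive volume
  grep -rIil --include='*.html' --include='*.htm' --include='*.txt' \
    -e "$pattern" "$archive_dir"
}

# Example: search_archive 'kernel panic'
```

Each hit is a path inside one snapshot folder, so the parent directory of a match holds the other formats (PDF, screenshot, WARC) for the same page.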

Advanced Usage: Putting Archiving on Autopilot

Manually adding URLs is for beginners. The real power of ArchiveBox comes from weaving it into your existing daily workflow through automation and deep integration.

1. Browser and Bookmark Sync

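While browser extensions are handy, syncing an entire link library is better. If you use Pocket or Wallabag to save articles, you can bulk-import their exports in one go. I exported 500 links from my browser, saved the file into the data directory, and imported them via the command below. The -T flag disables the pseudo-TTY so the file can be piped in on stdin. Because the redirect is handled by the host shell, the path is the host-side ./data, not the container's /data:

docker compose run -T archivebox add --depth=0 < ./data/bookmarks_export.html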
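Before a bulk import like this, it can help to know how many distinct URLs the export actually contains, since each one becomes a full snapshot. The sketch below counts unique http(s) links in an HTML bookmarks file. The function name count_links is hypothetical, and the pattern assumes links appear as double-quoted href attributes, which is how browser exports usually look.

```shell
# count_links FILE
# Prints the number of unique http(s) URLs found in href="..." attributes.
# -o prints only the matching part, -i also matches the uppercase HREF
# used by Netscape-style bookmark exports, -E enables extended regex.
count_links() {
  grep -oiE 'href="https?://[^"]+"' "$1" | sort -u | wc -l | tr -d ' '
}

# Example: count_links ./data/bookmarks_export.html
```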

2. Scheduled RSS Pulls

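I track several high-signal technical blogs that update weekly. Instead of checking them manually, I use the built-in scheduler to pull new content from their RSS feeds automatically. Use --depth=1 so the articles linked from the feed are archived, not just the feed file itself. Setting up a daily sync is a single command:

docker compose run archivebox schedule --every=day --depth=1 'https://example.com/rss.xml'

Keep in mind that a one-off run container exits as soon as the command returns, so the schedule only fires if something keeps the scheduler process alive. The schedule command has a --foreground flag for running it as a long-lived service.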

3. Authentication: The ‘Final Boss’

Paywalls and login screens are the biggest obstacles to a clean archive. To get past them, ArchiveBox can reuse your browser cookies. I export my active session cookies to cookies.txt in the data volume and point the COOKIES_FILE setting at it. This allows the system to archive private documentation or paywalled articles from sites like Medium or O’Reilly that would otherwise be blocked.
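A malformed cookies.txt tends to fail quietly, so a quick sanity check before archiving can save some head-scratching. The sketch below assumes the file is in the Netscape cookie format (seven tab-separated fields per cookie), which is what wget and curl read. The function name check_cookies is my own, and the config command in the trailing comment reflects my understanding of the ArchiveBox CLI rather than anything shown in this post.

```shell
# check_cookies FILE
# Counts well-formed cookie lines and reports malformed ones.
# Returns non-zero if any malformed line is found.
check_cookies() {
  awk -F '\t' '
    # "#HttpOnly_" prefixed lines are real cookies, not comments
    /^#HttpOnly_/ { sub(/^#HttpOnly_/, "") }
    # skip comments and blank lines
    /^#/ || /^[[:space:]]*$/ { next }
    NF != 7 {
      bad++
      printf "line %d: expected 7 tab-separated fields, got %d\n", NR, NF
      next
    }
    { ok++ }
    END {
      printf "%d valid cookie lines, %d malformed\n", ok, bad
      exit (bad > 0)
    }
  ' "$1"
}

# Example: check_cookies ./data/cookies.txt
# Then, assuming the /data volume from the compose file:
#   docker compose run archivebox config --set COOKIES_FILE=/data/cookies.txt
```

Since the file holds live session credentials, it deserves the same care as a password file.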

Hard Lessons from 6 Months in Production

Running a personal internet archive is resource-intensive. Here are the specific bottlenecks I encountered while scaling my instance.

Storage Strategy

ArchiveBox eats storage for breakfast. A “light” blog post can balloon to 50MB once you capture the PDF, high-res PNGs, and the media extracts. My archive hit 124.5GB in just six months. I strongly recommend mounting your /data folder on a high-capacity HDD or a NAS share. If you are tight on space, edit ArchiveBox.conf to disable the heaviest extractors, such as media downloads (SAVE_MEDIA) or the Chrome-based PDF and screenshot outputs (SAVE_PDF, SAVE_SCREENSHOT).
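If the 2,140 items mentioned earlier account for most of that 124.5GB, the average works out to roughly 58MB per snapshot, but a handful of media-heavy pages usually dominate. A rough way to find them is to rank snapshot folders by size on disk. The sketch below assumes the host path ./data/archive. The function name largest_snapshots is hypothetical.

```shell
# largest_snapshots [ARCHIVE_DIR] [N]
# Prints the N largest snapshot folders, biggest first, with sizes in KiB.
largest_snapshots() {
  dir=${1:-./data/archive}   # assumed host path for the /data/archive volume
  n=${2:-10}
  # The trailing slash restricts the glob to directories (one per snapshot).
  du -sk "$dir"/*/ 2>/dev/null | sort -rn | head -n "$n"
}

# Example: largest_snapshots ./data/archive 5
```

Looking inside the top entries shows which output type is responsible, which in turn tells you which extractor is worth disabling.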

Security and Exposure

The setup above is meant for local use, though note that the "8000:8000" mapping publishes the port on every host interface, so anyone on your LAN can reach it. Bind to "127.0.0.1:8000:8000" if you want it reachable from the host only. If you expose it to the internet to add links on the go, the built-in admin password isn’t enough protection. I shield my instance behind a Traefik reverse proxy with an added Authelia layer for 2FA. Also, remember that your archived data, including potentially sensitive pages, is stored unencrypted on the disk. Full-disk encryption on your host machine is a must.

Maintenance and Updates

The ArchiveBox team moves fast, and helper tools like yt-dlp need frequent updates as sites like YouTube change their structure. I pull new images monthly. My routine is simple:

docker compose pull
docker compose up -d
docker compose run archivebox init

Never skip that init command; it handles database migrations and ensures all helper binaries are correctly mapped within the new container version.
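Because init may migrate the database, keeping a copy of the SQLite index from before the upgrade gives you a way back if a migration goes wrong. The sketch below is my own addition, not part of the routine above. It assumes the index lives at ./data/index.sqlite3 on the host and that the container is stopped while copying, so the file is not mid-write. The function name backup_index is hypothetical.

```shell
# backup_index [DATA_DIR]
# Copies index.sqlite3 to a timestamped .bak file next to it and prints
# the path of the copy. Returns non-zero if no index is found.
backup_index() {
  data_dir=${1:-./data}
  src="$data_dir/index.sqlite3"
  if [ ! -f "$src" ]; then
    echo "no index found at $src" >&2
    return 1
  fi
  dest="$src.$(date +%Y%m%d-%H%M%S).bak"
  # -p preserves timestamps and permissions on the copy
  cp -p "$src" "$dest" && echo "$dest"
}

# Example: backup_index ./data
```

This only covers the index; the snapshot files under the archive folder are not modified by migrations in the same way and are far too large to copy on every update.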

The Search Gap

The built-in search focuses on titles and URLs. Full-text indexing of 2,000+ PDFs and HTML files will send your CPU usage through the roof—I saw spikes to 80% on a 4-core VPS during indexing. For most HomeLabs, a solid tagging system (e.g., #linux, #networking) is more efficient than trying to index every word of every page.
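Tagging is easiest when it happens at the moment a link is added. The sketch below wraps the compose invocation used throughout this post and passes tags through the add command's --tag option, which takes a comma-separated list. The names abx and add_tagged, and the ABX_RUNNER override for dry runs, are hypothetical helpers of mine.

```shell
# Thin wrapper around the compose invocation used in this post.
# ABX_RUNNER is a hypothetical override: set it to "echo" to print the
# command instead of executing it.
abx() {
  ${ABX_RUNNER:-docker} compose run archivebox "$@"
}

# add_tagged TAGS URL...
# Adds one or more URLs with a comma-separated tag list.
add_tagged() {
  tags=$1
  shift
  abx add --tag="$tags" "$@"
}

# Example: add_tagged linux,networking 'https://example.com/kernel-panic-fix'
# Dry run: ABX_RUNNER=echo add_tagged linux 'https://example.com/kernel-panic-fix'
```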