Stop Losing YouTube Videos: Self-Host Tube Archivist for Your Private Archive

HomeLab tutorial - IT technology blog

The 2 AM Error: Why YouTube Isn’t a Permanent Library

It was 2:14 AM on a Tuesday. I was troubleshooting a critical 503 error on a production server, trying to recall a specific kernel tuning trick from a niche 2019 tutorial. I clicked my bookmark, but instead of the solution, I got a digital slap in the face: “This video is private.” Or worse: “This account has been terminated.”

That moment changed how I view ‘cloud’ knowledge. Reliance on a third-party platform is a house of cards. Digital rot is inevitable. Creators delete channels, copyright strikes happen, and algorithms shift. If you don’t own the bits, you don’t own the knowledge. I realized I needed to pull my learning resources off the cloud and into my own server rack.

The Archive Strategy: yt-dlp vs. Tube Archivist

Choosing a method for archiving video usually boils down to two paths. I’ve tested both extensively, and the difference is massive once you hit triple-digit video counts.

The Manual Route: yt-dlp and Folders

Most engineers start by writing a simple cron job around yt-dlp. You dump files into nested folders and call it a day. This works for a handful of Linux tutorials. It fails at 500 videos. You can’t search through transcripts, you lose the link between a video and its specific channel, and tracking what you’ve already downloaded becomes a tedious manual chore.
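For reference, the manual route usually looks something like this. The storage path, script name, and channel URL here are placeholders, not a recommended setup:

```shell
#!/bin/sh
# A naive cron-driven archive script of the kind described above.
ARCHIVE_DIR="/mnt/storage/youtube"

# --download-archive records every downloaded video ID so reruns skip it;
# the -o template at least keeps files grouped per channel.
yt-dlp \
  --download-archive "$ARCHIVE_DIR/archive.txt" \
  --write-subs --embed-metadata \
  -o "$ARCHIVE_DIR/%(channel)s/%(title)s [%(id)s].%(ext)s" \
  "https://www.youtube.com/@SomeChannel/videos"
```

Even with --download-archive handling duplicate tracking, there is still no search: the subtitles sit next to the videos as .vtt files that nothing indexes, which is exactly the wall you hit at a few hundred videos.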

The Systematic Route: Tube Archivist

Tube Archivist is a professional-grade indexing engine. It uses Elasticsearch to index every single word of metadata and subtitles. Redis manages the background worker queues, while a clean Python-based UI ties it together. It treats your YouTube content like a local library rather than just a pile of .mp4 files.

The Engine Under the Hood: Pros and Cons

Before committing your drive space, understand the trade-offs. This isn’t a lightweight container that sips resources.

  • Full-Text Search. This is the standout capability. I can search for a specific terminal command, and Tube Archivist finds the exact timestamp where the creator mentioned it in the subtitles.
  • Automatic Sync. Point the tool at a playlist or channel. It checks for new uploads on a schedule you configure and grabs them automatically while you sleep.
  • Metadata Preservation. It saves comments, descriptions, and view counts from the moment of download, preserving the context of the video.
  • Resource Heavy. Running Elasticsearch requires a dedicated allocation of at least 2GB of RAM just for the index.
  • Stack Complexity. This is a multi-container environment. If the Redis connection drops or the index gets corrupted, you’ll need to be comfortable reading Docker logs.

Hardware Requirements: What You Actually Need

Don’t try to host this on a cheap VPS or an old Raspberry Pi 3. To keep the UI responsive, use a machine with at least 8GB of total system RAM and a modern 4-core CPU. Speed matters for the database. Keep your metadata (Elasticsearch) on an SSD, but store the actual video files on cheaper, high-capacity HDD arrays.
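A quick sanity check of the host before you pull any images. The thresholds match the recommendations above; /mnt/storage is the example media path used later in this guide:

```shell
# Verify the host meets the suggested minimums
free -h                # total RAM: aim for 8 GB or more
nproc                  # CPU cores: aim for 4 or more
df -h /mnt/storage     # free space on the array that will hold the videos
lsblk -d -o NAME,ROTA  # ROTA=0 means SSD; put the Elasticsearch index there
```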

The Setup: Implementation Guide

We’ll use Docker Compose. It is the most reliable way to manage the three moving parts (the core app, Redis, and Elasticsearch) without a headache. The config below is trimmed to the essentials but still production-ready.

1. Create the Directory Structure

First, set up your storage paths. The stack needs four volumes: the media directory, the download cache, the Redis data, and the Elasticsearch index.

mkdir -p tubearchivist/{cache,es,redis}
sudo mkdir -p /mnt/storage/youtube
cd tubearchivist
touch docker-compose.yml
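One assumption worth making explicit: the compose file passes HOST_UID and HOST_GID of 1000, so the data paths must be writable by that UID or downloads will fail with permission errors. Run this from inside the tubearchivist directory, adjusting the IDs if your user isn’t 1000:

```shell
# /mnt/storage/youtube is the media path used in this guide; create it if needed
sudo mkdir -p /mnt/storage/youtube

# Align ownership with the HOST_UID/HOST_GID the container runs as
sudo chown -R 1000:1000 . /mnt/storage/youtube
```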

2. Docker Compose Configuration

Paste the following into your docker-compose.yml. Pay attention to the ES_JAVA_OPTS; this keeps Elasticsearch from devouring your entire server’s memory pool.

version: '3.3'

services:
  tubearchivist:
    container_name: tubearchivist
    restart: unless-stopped
    image: bbilly1/tubearchivist
    ports:
      - 8000:8000
    volumes:
      - /mnt/storage/youtube:/youtube
      - ./cache:/cache
    environment:
      - TA_HOST=192.168.1.50 # Change to your local IP
      - TA_USERNAME=admin
      - TA_PASSWORD=secure_pass_123
      - ELASTIC_PASSWORD=es_pass_456
      - HOST_UID=1000
      - HOST_GID=1000
    depends_on:
      - archivist-es
      - archivist-redis

  archivist-redis:
    container_name: archivist-redis
    restart: unless-stopped
    image: redis/redis-stack-server
    volumes:
      - ./redis:/data

  archivist-es:
    container_name: archivist-es
    restart: unless-stopped
    image: bbilly1/tubearchivist-es
    environment:
      - "ELASTIC_PASSWORD=es_pass_456"
      - "discovery.type=single-node"
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
      - "xpack.security.enabled=true"
    volumes:
      - ./es:/usr/share/elasticsearch/data

3. Booting the Stack

Ensure your vm.max_map_count is high enough before starting, or Elasticsearch will crash on boot. This is the most common first-run pitfall.

# Apply the setting immediately
sudo sysctl -w vm.max_map_count=262144

# Make the change persist after a reboot
echo "vm.max_map_count=262144" | sudo tee -a /etc/sysctl.conf

# Launch the containers
docker-compose up -d
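Once the containers are up, it’s worth confirming all three services are healthy before opening the UI. The IP here matches the TA_HOST example from the compose file; substitute your own:

```shell
# All three containers should show "Up"
docker-compose ps

# Watch the core app connect to Redis and Elasticsearch on first boot
docker-compose logs -f tubearchivist

# The UI should answer on port 8000
curl -I http://192.168.1.50:8000
```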

Practical Tips for a Better Archive

Once the UI is live on port 8000, resist the urge to subscribe to 50 channels instantly. Start small. The initial run is CPU-intensive because the system must generate thumbnails and index thousands of lines of subtitles simultaneously.

Begin with your 5 most critical channels. Check the ‘Settings’ page and set your ‘Download Format’. I recommend 1080p. A single 10-minute 4K video can take up 1.2GB, whereas a 1080p version might only be 200MB. Only enable ‘Auto-delete watched’ if you’re using this as a DVR; for a permanent archive, keep it off.

Managing your own data requires discipline. But the next time a vital tutorial vanishes from the web, you won’t be panicking. You’ll just open your local dashboard and hit play.
