Ditch the Filing Cabinet: Self-Hosting Paperless-ngx on Docker

HomeLab tutorial - IT technology blog

The Paper Mountain Problem

Physical filing systems are where productivity goes to die. Most of us spend years accumulating birth certificates, tax returns, and appliance warranties, only to realize that finding a specific document from 2019 requires a 30-minute archaeological dig through dusty folders. For anyone running a HomeLab, the manual chore of scanning a PDF, renaming it ‘Invoice_Electricity_March.pdf’, and dragging it into a folder is a process that eventually fails due to human laziness.

I hit my limit last year when a dishwasher repairman asked for a warranty card I knew I had but couldn’t find. That frustrating hour of searching prompted me to find a solution that didn’t just store files, but actually understood them. Paperless-ngx is the community-maintained successor to Paperless-ng, which was itself a fork of the original Paperless project. It acts as a digital brain for your documents, using OCR (Optical Character Recognition) to index every word and automatically tag files based on their content.

Since moving my archive to this setup, I have processed over 1,200 documents. My physical paper trail is now almost non-existent. Every record I own is searchable via a simple web interface that feels as fast as a Google search.

How the System Handles Your Data

Paperless-ngx isn’t a single monolithic application. It is a coordinated suite of services that work together within Docker to handle the heavy lifting of document processing.

  • The Webserver: A Django-based core that serves the UI and manages the logic.
  • The Consumption Folder: This is the ‘magic’ directory. Drop a file here, and the system immediately begins processing it.
  • Redis: The message broker. If you dump 50 PDFs into the system at once, Redis holds the jobs in a queue while the background workers process them in turn, so a large batch doesn’t overwhelm your CPU.
  • PostgreSQL: While the app supports SQLite, I recommend Postgres for long-term stability. It handles large indexes much better as your library grows.
  • Tesseract OCR: The engine that ‘reads’ your images. It can turn a blurry smartphone photo of a receipt into searchable text, often in about 5 to 10 seconds, though the time depends on your hardware and the page count.

Preparing Your Environment

You will need a machine with Docker and Docker Compose. I suggest creating a dedicated directory to keep your data portable and easy to back up. I usually map everything under ~/homelab/paperless to keep it isolated from the rest of the system.

Start by creating the directory structure:

mkdir -p ~/homelab/paperless/{config,data,db,consume,export}

The consume folder is your entry point. Whether you use a network scanner or a phone app, this is where the raw files land before they are ingested into the database. The export folder remains empty until you run a backup command.
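Before the first start, it can help to confirm that the folders exist and are writable by your user. The function below is a minimal sketch, not part of Paperless-ngx; the name check_paperless_dirs is hypothetical, and it assumes the folder names from the mkdir command above.

```shell
#!/bin/sh
# Report which of the expected Paperless folders are missing or not writable.
# Usage: check_paperless_dirs BASE_DIR
check_paperless_dirs() {
    base=$1
    rc=0
    for name in config data db consume export; do
        if [ ! -d "$base/$name" ]; then
            echo "missing: $base/$name"
            rc=1
        elif [ ! -w "$base/$name" ]; then
            echo "not writable: $base/$name"
            rc=1
        fi
    done
    # Non-zero exit status when at least one folder needs attention
    return "$rc"
}
```

Calling check_paperless_dirs ~/homelab/paperless prints nothing when everything is in place. Once the database container has started, the db folder is typically owned by the Postgres user, so a "not writable" line for it at that point is expected.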

The Docker Compose Configuration

Deploying via docker-compose.yml is the standard approach. This file links the web app, database, and broker into a single functional unit. I have optimized this configuration to prevent port conflicts and ensure the OCR engine knows which languages to prioritize.

version: "3.9"
services:
  broker:
    image: docker.io/library/redis:7
    restart: unless-stopped
    volumes:
      - ./redisdata:/data

  db:
    image: docker.io/library/postgres:15
    restart: unless-stopped
    volumes:
      - ./db:/var/lib/postgresql/data
    environment:
      POSTGRES_DB: paperless
      POSTGRES_USER: paperless
      POSTGRES_PASSWORD: your_secure_password

  webserver:
    image: ghcr.io/paperless-ngx/paperless-ngx:latest
    restart: unless-stopped
    depends_on:
      - db
      - broker
    ports:
      - "8010:8000"
    volumes:
      - ./config:/usr/src/paperless/data
      - ./data:/usr/src/paperless/media
      - ./export:/usr/src/paperless/export
      - ./consume:/usr/src/paperless/consume
    environment:
      PAPERLESS_REDIS: redis://broker:6379
      PAPERLESS_DBHOST: db
      PAPERLESS_DBNAME: paperless
      PAPERLESS_DBUSER: paperless
      PAPERLESS_DBPASS: your_secure_password
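      # Extra Tesseract language packs to install (English ships with the image)
      PAPERLESS_OCR_LANGUAGES: vie
      # Languages Tesseract should actually use, joined with "+"
      PAPERLESS_OCR_LANGUAGE: eng+vie
      PAPERLESS_TIME_ZONE: Asia/Ho_Chi_Minh
      # The image reads USERMAP_UID/USERMAP_GID to match host file ownership
      USERMAP_UID: 1000
      USERMAP_GID: 1000

Note the two OCR variables. PAPERLESS_OCR_LANGUAGES tells the Docker image which additional Tesseract language packs to install, while PAPERLESS_OCR_LANGUAGE selects the languages used during recognition. If you have documents in both English and Vietnamese, setting the latter to eng+vie ensures the Tesseract engine uses the correct character sets. The database password must be identical in POSTGRES_PASSWORD and PAPERLESS_DBPASS. I’ve also mapped the internal port 8000 to 8010 on the host to avoid clashing with other web services you might be running.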
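Rather than typing a password into the compose file twice, one option is to generate it once and let Compose substitute it. The sketch below assumes you replace both password values in docker-compose.yml with ${DB_PASSWORD} and keep a .env file next to it; the variable name DB_PASSWORD and the function name are my own, not something Paperless-ngx defines.

```shell
#!/bin/sh
# Write a random database password into an env file readable only by you.
# Usage: write_paperless_env PATH_TO_ENV_FILE
write_paperless_env() {
    env_file=$1
    # 24 random bytes rendered as 48 hexadecimal characters
    password=$(head -c 24 /dev/urandom | od -An -tx1 | tr -d ' \n')
    # The subshell keeps the restrictive umask from leaking into the caller
    ( umask 077; printf 'DB_PASSWORD=%s\n' "$password" > "$env_file" )
    chmod 600 "$env_file"
}
```

Run it as write_paperless_env ~/homelab/paperless/.env before the first docker compose up. Postgres only applies POSTGRES_PASSWORD when it initialises an empty data directory, so regenerating the file later will not change the password of an existing database.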

Spinning Up the Instance

With your configuration saved, pull the images and launch the containers in detached mode:

docker compose up -d

The containers are now running, but you can’t log in yet. You need to create an admin account by executing a command directly inside the running webserver container:

docker compose exec webserver python3 manage.py createsuperuser

Once you set your credentials, open your browser and head to http://your-server-ip:8010. You are officially ready to start digitizing.
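On a first start the web interface may take a little while to answer. If you script the deployment, a small polling helper avoids guessing how long to wait. This is a sketch that assumes curl is installed on the host; wait_for_http is a hypothetical helper name.

```shell
#!/bin/sh
# Poll a URL until the server returns any HTTP response, or give up.
# Usage: wait_for_http URL [ATTEMPTS] [DELAY_SECONDS]
wait_for_http() {
    url=$1
    attempts=${2:-30}
    delay=${3:-2}
    i=1
    while [ "$i" -le "$attempts" ]; do
        # -s silent, -o discards the body, --max-time bounds each attempt
        if curl -s -o /dev/null --max-time 5 "$url"; then
            echo "up after $i attempt(s): $url"
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    echo "no answer from $url after $attempts attempt(s)" >&2
    return 1
}
```

For the setup in this post, the call would look like wait_for_http http://your-server-ip:8010 with your own address filled in.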

Automating the Workflow

Software setup is only half the battle. The real magic happens when you eliminate friction. After six months of use, I found that Matching Algorithms are the secret to a clean archive. These rules tell Paperless-ngx how to categorize files without your intervention.

For example, I created a ‘Utility’ tag with a matching rule looking for the words ‘Power Company’ or ‘Kilowatt’, plus a similar rule on the ‘Monthly Bills’ document type. Now, when I drop a PDF from the electric company into the folder, the system automatically tags it, assigns it to the ‘Monthly Bills’ document type, and picks up the date it found inside the text. With a filename format configured (PAPERLESS_FILENAME_FORMAT), that date can also drive the name of the stored file. No manual typing required.
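To get a feel for what an ‘any of these words’ rule does, you can approximate it on the command line. This is an illustration only, not how Paperless-ngx implements matching; the function name is hypothetical and the pattern simply reuses the two phrases from the example above.

```shell
#!/bin/sh
# Illustration only: mimic an "any of these phrases" rule with grep.
# Usage: matches_utility_rule "some OCR text"
matches_utility_rule() {
    # -q quiet, -i case-insensitive, -E extended regex with alternation
    printf '%s\n' "$1" | grep -qiE 'power company|kilowatt'
}
```

The function succeeds for text such as "Total usage: 312 Kilowatt hours" and fails for "Dishwasher warranty card", which mirrors how the tag would or would not be applied.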

To make this truly seamless, use a mobile app like Microsoft Lens. Configure it to save scans directly to a folder on your phone that syncs with your server via Syncthing. This creates a ‘one-tap’ path from a physical receipt to a fully indexed digital record.

Reliable Backups

Since this system holds your life’s paperwork, losing data is not an option. Don’t just back up the mapped data folders. Use the built-in exporter tool to create a ‘hard’ copy of your library. I run a weekly cron job that executes the following command from the directory containing docker-compose.yml. The -T flag matters here because cron does not provide a terminal:

docker compose exec -T webserver document_exporter ../export

This command exports every document along with a manifest.json file containing all the metadata. If your Docker setup ever fails, you still have a perfectly organized folder of PDFs that any computer can read.
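For the cron job itself, a small wrapper keeps the working directory and the no-terminal flag in one place. The sketch below assumes the compose file lives in the directory you pass in; run_paperless_export and the DOCKER_CMD override are hypothetical names introduced here so the command can be dry-run.

```shell
#!/bin/sh
# Export wrapper intended to be called from cron.
# Set DOCKER_CMD=echo to print the command instead of running it.
# Usage: run_paperless_export COMPOSE_DIR
run_paperless_export() {
    compose_dir=$1
    docker_cmd=${DOCKER_CMD:-docker}
    # Subshell: the cd does not affect the caller.
    # -T disables TTY allocation, which cron cannot provide.
    ( cd "$compose_dir" &&
      "$docker_cmd" compose exec -T webserver document_exporter ../export )
}
```

A weekly crontab entry could then source the script and call run_paperless_export ~/homelab/paperless. Copying the export folder to another disk or machine afterwards is what turns the export into an actual backup.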

The Verdict

Moving from a physical ‘paper mountain’ to a searchable digital archive is one of the most rewarding HomeLab projects you can undertake. It transitions your server from a hobbyist toy to a critical piece of household infrastructure. In my experience, Paperless-ngx has been stable and needs very little maintenance once the matching rules are set. If you are tired of losing track of your important records, deploy this today and never dig through a dusty folder again.
