Scalable ELT: How We Replaced 2,500 Lines of SQL with dbt and Postgres

Database tutorial - IT technology blog
Database tutorial - IT technology blog

The 2:14 AM Wake-Up Call: When Stored Procedures Break

At 2:14 AM, the PagerDuty alert is a physical blow. I’m staring at a 2,500-line legacy stored procedure in PostgreSQL that just timed out for the third night in a row. The logic is a mess of temporary tables, nested loops, and implicit dependencies. Nobody on the team wants to touch it because one wrong edit could break the entire reporting layer.

My experience with MySQL, PostgreSQL, and MongoDB has taught me that every tool has its niche. For structured analytical transformations, PostgreSQL is an absolute powerhouse—if you respect its query planner. The bottleneck isn’t usually the database itself. It’s the workflow. We’ve been hiding critical business logic inside a black box. To fix this, we need to move toward a professional ELT (Extract, Load, Transform) approach using dbt (data build tool).

Wiring dbt to Postgres in Under Five Minutes

You can have dbt communicating with your Postgres instance before you finish your first cup of coffee. Think of dbt as the orchestration layer. It sits on top of your database, allowing you to write modular SQL that it compiles and executes in the correct order.

Start by grabbing the Postgres adapter. I always use a virtual environment to avoid dependency hell:

python3 -m venv dbt-env
source dbt-env/bin/activate
pip install dbt-postgres

Next, initialize your project. This scaffolds the directory structure you’ll use to organize your data models:

dbt init analytics_pipeline

Configuration happens in your profiles.yml (usually tucked away in ~/.dbt/). This file tells dbt how to authenticate. For production, never hardcode passwords. Use environment variables instead.

analytics_pipeline:
  outputs:
    dev:
      type: postgres
      threads: 4
      host: localhost
      port: 5432
      user: transform_user
      pass: "{{ env_var('DB_PASSWORD') }}"
      dbname: analytics_db
      schema: analytics_main
  target: dev

Run a quick dbt debug. If the terminal returns green text, you’re ready to stop writing monolithic scripts and start building models.

The Shift to Modular Modeling

At its core, dbt treats every .sql file as a model. A model is just a SELECT statement. You write the logic; dbt handles the CREATE TABLE AS or CREATE VIEW AS boilerplate automatically.

Dependency management used to be a headache. Now, the ref() function handles it. Instead of guessing which table must be built first, you reference the model name directly. This creates a clear lineage that dbt follows during execution.

Consider a staging model for raw e-commerce data (models/staging/stg_orders.sql):

with raw_orders as (
    select * from {{ source('raw_data', 'orders') }}
)

select
    id as order_id,
    user_id,
    order_date,
    lower(status) as order_status -- Standardizing raw strings
from raw_orders

In your final transformation (models/marts/fct_orders.sql), you simply pull from those staged files:

with orders as (
    select * from {{ ref('stg_orders') }}
),

order_payments as (
    select * from {{ ref('stg_payments') }}
)

select
    orders.order_id,
    orders.user_id,
    order_payments.amount
from orders
left join order_payments using (order_id)

Behind the scenes, dbt builds a Directed Acyclic Graph (DAG). It ensures stg_orders finishes before fct_orders even starts. You no longer have to manually schedule a sequence of five different Python scripts.

Testing: The End of Broken Dashboards

Modular code is only half the battle. Testing is what actually keeps the pager quiet. Previously, we only noticed a primary key duplication when a dashboard inflated our monthly recurring revenue (MRR) by 115%. With dbt, I catch those errors during the build phase.

Define your tests in a schema.yml file:

version: 2

models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'returned']

Executing dbt test verifies your data integrity. If a record contains a null ID or an unsupported status, the pipeline halts. This prevents corrupted data from ever reaching your production BI tools.

Documentation shouldn’t be an afterthought. By running dbt docs generate, you create a web-based interface. It shows your entire data lineage and table descriptions, making it easy to explain data origins to a product manager without digging through code.

Hard-Won Lessons for Production

Scaling Postgres requires more than just clean SQL. After migrating three legacy systems to this stack, I’ve found these four strategies to be essential:

  1. Embrace Incremental Models: Stop rebuilding 500-million-row tables from scratch. Use incremental materialization to append only the new data. This slashed our runtimes from 45 minutes to 120 seconds.
  2. Materialization Strategy: Default to using views for development. They are instant to build. Switch to tables only when your BI tool performance starts to lag.
  3. Isolate Your Raw Data: Keep your raw source data in a dedicated schema. Give the dbt user read-only access to raw and full permissions for analytics. This simple wall prevents accidental DROP TABLE disasters on your source data.
  4. Post-Run Hooks: Automatically run ANALYZE on your Postgres tables after dbt finishes. This keeps the query planner optimized and ensures your joins remain fast as data volume grows.

Professional data engineering is about applying software discipline to SQL. Version control, automated testing, and modularity aren’t just for app developers anymore. My stress levels have dropped significantly since we ditched stored procedures. If you are still manually managing dependencies in Postgres, you are working much harder than you need to.

Share: