Build a Text-to-SQL Tool with Vanna.ai: Talk to Your Data in Plain English

AI tutorial - IT technology blog

Breaking the SQL Bottleneck

I’ve spent years watching talented data teams drown in simple, repetitive requests. It’s a classic friction point: a marketing manager needs to know, “Which 15 customers in Berlin spent the most on electronics last month?” but they don’t know a JOIN from a GROUP BY. The result? A mounting ticket queue and frustrated stakeholders waiting three days for a five-minute query.

General LLMs like GPT-4 are impressive, but they stumble over messy, real-world schemas. They don’t know that your user_id maps to legacy_id in the billing table, or that “active” means a login within the last 14 days.

This is where Vanna.ai changes the math. Rather than just tossing a prompt at an AI, Vanna uses a Retrieval-Augmented Generation (RAG) framework built specifically for SQL. It searches your actual schema definitions and previous successful queries to give the AI the context it needs to be accurate.
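To make the retrieval idea concrete, here is a minimal, self-contained sketch of the pattern. It uses plain keyword overlap instead of the embedding-based vector search Vanna actually performs, and every snippet and name in it is illustrative:

```python
# Toy illustration of RAG-for-SQL: retrieve the most relevant stored
# snippets (DDL, docs, past queries) and assemble them into an LLM prompt.
# Real systems like Vanna use embeddings + a vector store, not word overlap.

KNOWLEDGE = [
    "CREATE TABLE orders (id INT, customer_id INT, city TEXT, amount DECIMAL)",
    "'active' means the user logged in within the last 14 days",
    "SELECT city, SUM(amount) FROM orders GROUP BY city",
]

def retrieve(question: str, top_k: int = 2) -> list[str]:
    """Rank snippets by naive word overlap with the question."""
    q_words = set(question.lower().split())
    return sorted(
        KNOWLEDGE,
        key=lambda s: len(q_words & set(s.lower().split())),
        reverse=True,
    )[:top_k]

def build_prompt(question: str) -> str:
    """Assemble retrieved context plus the question into one prompt."""
    context = "\n".join(retrieve(question))
    return f"Context:\n{context}\n\nWrite SQL for: {question}"

print(build_prompt("Total amount of orders per city"))
```

The LLM now sees the table definition and a related past query alongside the question, which is why a schema-aware RAG setup beats a bare prompt.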

On a recent project for a mid-sized e-commerce platform, we deployed this setup and watched manual reporting requests drop by 58% in the first quarter. Here is the blueprint for that system and the hard-won lessons we gathered during the build.

Setting Up the Environment

Vanna is highly modular. You can use their hosted tier for a 10-minute prototype or go fully local for strict data privacy. For professional work, I prefer a balanced setup that keeps things clean and reproducible.

Start with a dedicated virtual environment. I recommend Python 3.10 or newer to handle the latest async LLM wrappers without a hitch.

# Standard environment setup
python -m venv vanna-env
source vanna-env/bin/activate  # Windows: `vanna-env\Scripts\activate`

# Core dependencies
pip install vanna pandas
# Choose your database driver: psycopg2-binary (Postgres) or pymysql (MySQL)
# Note: sqlite3 ships with Python's standard library -- no pip install needed
pip install psycopg2-binary

One detail to watch for: Vanna has specific extensions for different LLM providers. If you are leaning toward Claude 3.5 or GPT-4o, make sure to install the corresponding package extensions mentioned in their documentation to avoid runtime import errors.
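One way to fail fast is a small preflight check before your app boots. A sketch (the dotted module names below are assumptions; check Vanna's docs for the exact extras your provider needs):

```python
import importlib.util

def has_module(name: str) -> bool:
    """Return True if a module (including a dotted submodule) is importable."""
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        # The parent package itself is missing
        return False

# Hypothetical preflight before wiring up a provider-specific Vanna class
for mod in ("vanna.openai", "vanna.chromadb"):
    if not has_module(mod):
        print(f"Missing optional dependency: {mod} -- install the matching extra")
```

Running this at startup turns a confusing mid-request ImportError into an actionable message at deploy time.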

The Engine Room: Training the AI

The heavy lifting in Vanna happens within the vector database, not the LLM. This is where you store your DDL (Data Definition Language), business documentation, and what I call “Golden SQL”—queries you know are 100% correct. When a user asks a question, Vanna pulls these snippets to build a precise prompt.

Initializing the Connection

For this walkthrough, we’ll use the default Vanna wrapper. If you’re in an enterprise setting, you would swap VannaDefault for a custom class using ChromaDB and your own Anthropic or OpenAI API keys.

from vanna.remote import VannaDefault

# Get your key from vanna.ai/account
vn = VannaDefault(model='itfromzero-demo-v1', api_key='your_vanna_api_key')

# Connect to your data source
vn.connect_to_sqlite('ecommerce_data.db')

Feeding High-Quality Metadata

Accuracy isn’t an accident. It’s a product of three training layers:

  1. The Skeleton (DDL): Defines the table structures.
  2. The Dictionary (Documentation): Defines the business definitions.
  3. The Examples (SQL Samples): This is the most critical layer. It shows the AI how you prefer to handle complex logic.

# 1. Teach it the structure
vn.train(ddl="""
CREATE TABLE subscriptions (
    id UUID PRIMARY KEY,
    user_id UUID,
    monthly_price DECIMAL(12, 2),
    tier TEXT,
    is_active BOOLEAN
)
""")

# 2. Add the 'Human' context
vn.train(documentation="'tier' can be 'Basic', 'Pro', or 'Enterprise'. Active users must have is_active=True.")

# 3. Provide Golden SQL
vn.train(question="What's our monthly recurring revenue from Pro users?", 
         sql="SELECT SUM(monthly_price) FROM subscriptions WHERE tier = 'Pro' AND is_active = True")

Instead of manual entry, I recommend a weekly script that pulls your top 20 most-used production queries and feeds them back into Vanna. This creates a loop where the system actually improves as your business logic evolves.
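A sketch of that feedback loop, assuming you keep a query log somewhere queryable. The query_log schema and the vn.train wiring are illustrative; swap in your warehouse's actual audit log:

```python
import sqlite3

def top_production_queries(conn: sqlite3.Connection, limit: int = 20):
    """Pull the most-run (question, sql) pairs from a hypothetical query log."""
    return conn.execute(
        """
        SELECT question, sql_text
        FROM query_log
        GROUP BY question, sql_text
        ORDER BY COUNT(*) DESC
        LIMIT ?
        """,
        (limit,),
    ).fetchall()

# Self-contained demo with an in-memory log
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE query_log (question TEXT, sql_text TEXT)")
demo.executemany(
    "INSERT INTO query_log VALUES (?, ?)",
    [("MRR?", "SELECT SUM(monthly_price) FROM subscriptions")] * 3
    + [("Active users?", "SELECT COUNT(*) FROM subscriptions WHERE is_active")],
)
print(top_production_queries(demo, limit=1))

# The real weekly job would then feed the winners back into Vanna:
#   for question, sql in top_production_queries(conn):
#       vn.train(question=question, sql=sql)
```

Scheduled with cron or your orchestrator of choice, this keeps the Golden SQL layer current without anyone hand-curating examples.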

Security and Guardrails

Handing an AI a database connection is inherently risky. Never use an admin account. Create a read-only user with access only to the tables the business team actually needs. If you don’t want the AI looking at password_hashes or ssn_numbers, don’t even let the database user see those columns.
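Belt and suspenders: on top of the restricted database user, you can open the connection itself read-only. For SQLite that is a one-line URI flag (Postgres or MySQL would use a restricted role instead); the file path here is a throwaway created just so the sketch runs standalone:

```python
import os
import sqlite3
import tempfile

# Create a throwaway database so this sketch is self-contained
path = os.path.join(tempfile.mkdtemp(), "ecommerce_data.db")
rw = sqlite3.connect(path)
rw.execute("CREATE TABLE subscriptions (id INTEGER, tier TEXT)")
rw.commit()
rw.close()

# Reopen the same file read-only via a URI; writes now raise OperationalError
ro = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
try:
    ro.execute("INSERT INTO subscriptions VALUES (1, 'Pro')")
except sqlite3.OperationalError as e:
    print(f"Write blocked: {e}")
```

Even if every prompt-level guard fails, the connection physically cannot mutate data.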

Hallucinations happen. An AI might invent a total_profit column that doesn’t exist. When this occurs, don’t fight the prompt—fix the training. Usually, it means you missed a piece of documentation or a DDL definition.

# A simple wrapper to prevent 'runaway' queries
def execute_safely(user_query):
    generated_sql = vn.generate_sql(user_query)
    
    # Basic safety check
    if not generated_sql.strip().upper().startswith("SELECT"):
        return "Error: Only SELECT queries are permitted."
        
    # Enforce a hard limit on result sets (strip any trailing semicolon first,
    # otherwise the appended clause would produce invalid SQL)
    if "LIMIT" not in generated_sql.upper():
        generated_sql = generated_sql.rstrip().rstrip(";") + " LIMIT 50"
        
    return vn.run_sql(generated_sql)
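String prefix checks like the one above catch the obvious cases but miss stacked statements such as SELECT 1; DROP TABLE users, and they reject legitimate WITH ... SELECT CTEs. A slightly stricter, still string-based validator, offered as a sketch and never a substitute for database-level permissions:

```python
# Write/DDL keywords that should never appear in a read-only statement
FORBIDDEN = {"INSERT", "UPDATE", "DELETE", "DROP", "ALTER", "CREATE",
             "TRUNCATE", "GRANT", "ATTACH", "PRAGMA"}

def is_safe_select(sql: str) -> bool:
    """Allow a single read-only statement: SELECT or a WITH...SELECT CTE."""
    stripped = sql.strip().rstrip(";")
    # Reject stacked statements outright
    if ";" in stripped:
        return False
    upper = stripped.upper()
    if not upper.startswith(("SELECT", "WITH")):
        return False
    # Reject write/DDL keywords anywhere in the statement (exact-token match,
    # so column names like update_count still pass)
    tokens = set(upper.replace("(", " ").replace(")", " ").split())
    return not (tokens & FORBIDDEN)

print(is_safe_select("WITH t AS (SELECT 1 AS x) SELECT x FROM t"))  # True
print(is_safe_select("SELECT 1; DROP TABLE users"))                 # False
```

Keyword filters can still be fooled (a string literal containing "DROP" would trip this one), which is exactly why the read-only database user remains the real safety net.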

Final Thoughts

Successful Text-to-SQL isn’t a one-and-done setup; it’s a garden you tend. If your database has inconsistent naming conventions, the AI will struggle. However, Vanna.ai provides the best framework I’ve found to manage that complexity without building a massive RAG pipeline from scratch.

Start small. Pick one department—like Sales or Marketing—and give them access to the simple Flask interface Vanna ships (vanna.flask.VannaFlaskApp). Their feedback will quickly point out where your documentation is weak, allowing you to iterate toward a truly self-service data culture.
