Build a Local Voice Assistant with Whisper and Ollama: Fully Offline Speech Recognition and LLM Response

AI tutorial - IT technology blog

Why I Stopped Using Cloud Voice APIs After Six Months

Six months ago, every voice query I made went through cloud APIs — Google Speech-to-Text for transcription, OpenAI’s API for responses. Costs were manageable at first. But the latency killed the experience. Every interaction had a noticeable 1.5–2 second round-trip just from network overhead, before the model even started thinking.

Then there was the privacy problem. I work with internal documentation and occasionally ask questions about systems that shouldn’t leave my network. Sending that audio to a third-party service felt wrong — and honestly, it was.

Whisper (OpenAI’s open-source speech recognition model) combined with Ollama (a local LLM runner) solved both issues at once. Six months of running this stack in production: zero quota limits, zero API bills, and response times competitive with cloud solutions on mid-range hardware.

Here’s exactly how I set it up.

What You Actually Need

Hardware requirements are worth addressing upfront. Whisper’s medium model runs fine on CPU, but anything larger benefits significantly from a GPU. For Ollama, a 7B parameter model like Mistral or LLaMA 3 needs at least 8GB of RAM — 16GB gives you more breathing room.

My setup: Ubuntu 22.04, 32GB RAM, NVIDIA RTX 3060 with 12GB VRAM. Running macOS with Apple Silicon? Ollama’s Metal support is excellent there — performance is genuinely impressive on an M2 or M3.

Required packages:

  • Python 3.10+
  • ffmpeg (audio processing)
  • portaudio (microphone input)
  • Ollama

Installation

Step 1: Install Ollama

Ollama ships a one-liner that handles everything, including the systemd service:

curl -fsSL https://ollama.com/install.sh | sh

Pull the model you want. For consumer hardware, mistral hits the sweet spot of speed and capability. If you’re on a machine with less than 8GB VRAM, phi3 is noticeably lighter:

ollama pull mistral
# Or for something lighter:
ollama pull phi3

Verify it’s actually working before moving on:

ollama list
# Should show your downloaded models

curl http://localhost:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "Hello, are you working?",
  "stream": false
}'

Step 2: Set Up Python Environment

python3 -m venv voice-assistant-env
source voice-assistant-env/bin/activate

# Install system dependencies first
sudo apt install -y ffmpeg portaudio19-dev python3-dev

# Install Python packages
pip install openai-whisper pyaudio requests numpy

On macOS:

brew install ffmpeg portaudio
pip install openai-whisper pyaudio requests numpy

Step 3: Download the Whisper Model

Whisper fetches models on first use automatically. Pre-downloading avoids the wait during your first real run:

import whisper
# This downloads and caches the model (~1.5GB for 'medium')
model = whisper.load_model("medium")
print("Model loaded successfully")

Model sizes available: tiny, base, small, medium, large. For conversational use, small or medium is the practical choice — large is overkill unless accuracy is critical.

Configuration

The Core Voice Assistant Script

Below is the complete script I’ve been running in production. It records audio, transcribes with Whisper, fires the text at Ollama, and prints the response:

import whisper
import pyaudio
import wave
import requests
import json
import os
import tempfile
import numpy as np

# ── Configuration ──────────────────────────────────────────────
WHISPER_MODEL = "medium"        # Change to "small" for faster response
OLLAMA_MODEL = "mistral"        # Must match what you pulled with ollama
OLLAMA_URL = "http://localhost:11434/api/generate"

# Audio recording settings
SAMPLE_RATE = 16000
CHUNK_SIZE = 1024
CHANNELS = 1
FORMAT = pyaudio.paInt16
RECORD_SECONDS = 5              # Adjust based on your typical query length
SILENCE_THRESHOLD = 500         # Adjust for your microphone sensitivity

# ── Load Whisper model once at startup ─────────────────────────
print(f"Loading Whisper model: {WHISPER_MODEL}")
whisper_model = whisper.load_model(WHISPER_MODEL)
print("Ready. Press Enter to start recording.")


def record_audio() -> str:
    """Record audio from microphone and save to temp file."""
    audio = pyaudio.PyAudio()
    stream = audio.open(
        format=FORMAT,
        channels=CHANNELS,
        rate=SAMPLE_RATE,
        input=True,
        frames_per_buffer=CHUNK_SIZE
    )

    print("Recording... (speak now)")
    frames = []

    for _ in range(0, int(SAMPLE_RATE / CHUNK_SIZE * RECORD_SECONDS)):
        data = stream.read(CHUNK_SIZE, exception_on_overflow=False)  # don't crash on buffer overruns
        frames.append(data)

    print("Recording complete.")
    stream.stop_stream()
    stream.close()
    audio.terminate()

    # Save to temp WAV file
    tmp = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
    with wave.open(tmp.name, 'wb') as wf:
        wf.setnchannels(CHANNELS)
        wf.setsampwidth(audio.get_sample_size(FORMAT))
        wf.setframerate(SAMPLE_RATE)
        wf.writeframes(b''.join(frames))

    return tmp.name


def transcribe(audio_path: str) -> str:
    """Transcribe audio file using Whisper."""
    result = whisper_model.transcribe(audio_path, language="en")
    os.unlink(audio_path)  # Clean up temp file
    return result["text"].strip()


def ask_ollama(prompt: str) -> str:
    """Send prompt to local Ollama instance and return response."""
    payload = {
        "model": OLLAMA_MODEL,
        "prompt": prompt,
        "stream": False,
        "options": {
            "temperature": 0.7,
            "num_predict": 256  # Keep responses concise for voice
        }
    }
    response = requests.post(OLLAMA_URL, json=payload, timeout=60)
    response.raise_for_status()
    return response.json()["response"].strip()


def main():
    while True:
        input("\nPress Enter to speak (Ctrl+C to quit)...")
        audio_path = record_audio()

        print("Transcribing...")
        text = transcribe(audio_path)

        if not text:
            print("No speech detected. Try again.")
            continue

        print(f"You said: {text}")
        print("Thinking...")

        response = ask_ollama(text)
        print(f"\nAssistant: {response}\n")


if __name__ == "__main__":
    main()

Tuning for Your Hardware

Three settings make a real difference in practice:

  • Whisper language setting: Always specify the language when you know it (language="en"). Auto-detection adds latency and occasionally picks the wrong language entirely — I’ve seen it transcribe English as French when the audio had background music.
  • num_predict: Capping Ollama’s output at 256 tokens keeps voice responses short and listenable. A 2000-token essay is painful to read on screen; spoken aloud, it’s unbearable.
  • Record seconds: 5 seconds handles most short commands. Bump to 8–10 for back-and-forth conversation.
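If a fixed RECORD_SECONDS bothers you, the recording loop can instead stop after a trailing window of quiet chunks. Here is a rough sketch of the helpers — the one-second window is my own choice, and the cutoff reuses the SILENCE_THRESHOLD constant from the main script:

```python
import numpy as np

SAMPLE_RATE = 16000
CHUNK_SIZE = 1024
SILENCE_THRESHOLD = 500   # same constant as the main script
SILENCE_SECONDS = 1.0     # stop after this much trailing quiet (my choice)


def chunk_rms(chunk: bytes) -> float:
    """RMS energy of one int16 audio buffer."""
    samples = np.frombuffer(chunk, dtype=np.int16).astype(np.float64)
    return float(np.sqrt(np.mean(samples ** 2))) if samples.size else 0.0


def should_stop(recent_rms: list) -> bool:
    """True once every chunk in the trailing window is below threshold."""
    window = int(SILENCE_SECONDS * SAMPLE_RATE / CHUNK_SIZE)
    return len(recent_rms) >= window and all(
        r < SILENCE_THRESHOLD for r in recent_rms[-window:]
    )
```

In record_audio, append chunk_rms(data) to a list on each iteration and break out of the loop when should_stop returns True — plus a hard cap, so a noisy room can't record forever.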

Adding a System Prompt for Consistent Behavior

A system persona nudges Ollama toward shorter, voice-appropriate answers. Without it, responses tend to ramble and include markdown that looks terrible in terminal output:

SYSTEM_PROMPT = (
    "You are a concise voice assistant. Keep all responses under 3 sentences. "
    "No markdown formatting. Speak as if answering out loud."
)

def ask_ollama(user_input: str) -> str:
    full_prompt = f"{SYSTEM_PROMPT}\n\nUser: {user_input}\nAssistant:"
    payload = {
        "model": OLLAMA_MODEL,
        "prompt": full_prompt,
        "stream": False,
        "options": {"temperature": 0.7, "num_predict": 256}
    }
    response = requests.post(OLLAMA_URL, json=payload, timeout=60)
    response.raise_for_status()
    return response.json()["response"].strip()
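One variant worth trying: the payload above sets "stream": False, so you see nothing until the whole completion is done. Ollama can also stream the response as newline-delimited JSON objects, each carrying a "response" fragment, which makes long answers feel much faster. A sketch under that assumption (function names are mine, not from the main script):

```python
import json
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"
OLLAMA_MODEL = "mistral"


def parse_stream_chunk(line: bytes) -> str:
    """Extract the text fragment from one streamed JSON line."""
    return json.loads(line).get("response", "")


def ask_ollama_streaming(prompt: str) -> str:
    """Print fragments as they arrive; return the assembled response."""
    payload = {"model": OLLAMA_MODEL, "prompt": prompt, "stream": True}
    parts = []
    with requests.post(OLLAMA_URL, json=payload, stream=True, timeout=60) as r:
        r.raise_for_status()
        for line in r.iter_lines():
            if line:
                fragment = parse_stream_chunk(line)
                print(fragment, end="", flush=True)
                parts.append(fragment)
    print()
    return "".join(parts)
```

Total latency is unchanged; what improves is the time to first visible token, which matters more for a conversational feel.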

Verification and Monitoring

Testing Each Component Independently

Test each piece in isolation before running the full pipeline. Doing this upfront saved me a couple of hours of debugging when I first set this up — chasing a Whisper issue through Ollama logs is not fun.

Test Whisper alone:

# Record a 5-second clip (needs extra packages: pip install sounddevice scipy)
python3 -c "
import sounddevice as sd
import scipy.io.wavfile as wav
import numpy as np

fs = 16000
duration = 5
print('Recording...')
audio = sd.rec(int(duration * fs), samplerate=fs, channels=1, dtype='int16')
sd.wait()
wav.write('test.wav', fs, audio)
print('Saved test.wav')
"

# Transcribe it
python3 -c "
import whisper
m = whisper.load_model('medium')
result = m.transcribe('test.wav', language='en')
print(result['text'])
"

Test Ollama independently:

curl http://localhost:11434/api/generate \
  -d '{"model": "mistral", "prompt": "What is 2+2?", "stream": false}' \
  | python3 -c "import sys,json; print(json.load(sys.stdin)['response'])"

Measuring Latency

Six months of runtime on an RTX 3060 gives me a consistent baseline:

  • Whisper medium on GPU: ~0.8–1.2 seconds for a 5-second clip
  • Ollama (Mistral 7B, GPU): ~1.5–3 seconds for a 256-token response
  • Total round-trip: roughly 2.5–4 seconds

Once you factor in network latency, that’s competitive with cloud APIs. And entirely local.

Wire up basic timing to track this over time:

import time

start = time.time()
text = transcribe(audio_path)
print(f"Transcription: {time.time() - start:.2f}s")

start = time.time()
response = ask_ollama(text)
print(f"LLM response: {time.time() - start:.2f}s")

Running as a Background Service

For a permanent setup, wrap the Python script in a systemd service. Ollama already runs as one by default — this just adds your assistant alongside it. One caveat: the script blocks on input(), which has no terminal under systemd, so swap the Enter trigger for a wake word or hotkey listener before running it headless:

sudo nano /etc/systemd/system/voice-assistant.service

[Unit]
Description=Local Voice Assistant
After=ollama.service

[Service]
Type=simple
User=youruser
WorkingDirectory=/home/youruser/voice-assistant
ExecStart=/home/youruser/voice-assistant/voice-assistant-env/bin/python assistant.py
Restart=on-failure

[Install]
WantedBy=multi-user.target

sudo systemctl daemon-reload
sudo systemctl enable --now voice-assistant

What to Watch For

Three issues come up repeatedly in production:

  • Ollama running out of VRAM: Running other GPU workloads alongside Ollama can push it off the GPU entirely. Check with nvidia-smi — you want to see the model loaded in GPU memory, not spilled to system RAM.
  • Whisper misfire on silence: When nobody speaks during recording, Whisper returns garbage text rather than an empty string. Add a simple RMS energy check before sending anything to Ollama — skip the request if the audio is silent.
  • Memory creep in long sessions: I restart the service weekly as a precaution. Two months in I noticed RAM usage had drifted up ~800MB from a cold start. Weekly restart keeps it clean.
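The RMS energy check from the second bullet is only a few lines. This sketch reads the recorded WAV back and reports whether the clip carries any real energy — the 500 cutoff mirrors SILENCE_THRESHOLD in the main script:

```python
import wave

import numpy as np


def is_silent(audio_path: str, threshold: float = 500.0) -> bool:
    """True if the WAV file's RMS energy falls below the threshold."""
    with wave.open(audio_path, "rb") as wf:
        frames = wf.readframes(wf.getnframes())
    samples = np.frombuffer(frames, dtype=np.int16).astype(np.float64)
    if samples.size == 0:
        return True
    rms = float(np.sqrt(np.mean(samples ** 2)))
    return rms < threshold
```

Call is_silent(audio_path) in main() before transcribe(); if it returns True, delete the temp file and continue to the next loop iteration instead of burning a Whisper and Ollama round-trip on dead air.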

The full stack — Whisper medium plus Mistral 7B — holds around 6GB of VRAM and 4GB of system RAM when both are loaded. In six months of production use, I’ve had exactly two unplanned restarts, both from power outages. That’s a reliability record I’m happy with.
