Why I Stopped Using Cloud Voice APIs After Six Months
Six months ago, every voice query I made went through cloud APIs — Google Speech-to-Text for transcription, OpenAI’s API for responses. Costs were manageable at first. But the latency killed the experience. Every interaction had a noticeable 1.5–2 second round-trip just from network overhead, before the model even started thinking.
Then there was the privacy problem. I work with internal documentation and occasionally ask questions about systems that shouldn’t leave my network. Sending that audio to a third-party service felt wrong — and honestly, it was.
Whisper (OpenAI’s open-source speech recognition model) combined with Ollama (a local LLM runner) solved both issues at once. Six months of running this stack in production: zero quota limits, zero API bills, and response times competitive with cloud solutions on mid-range hardware.
Here’s exactly how I set it up.
What You Actually Need
Hardware requirements are worth addressing upfront. Whisper’s medium model runs fine on CPU, but anything larger benefits significantly from a GPU. For Ollama, a 7B-class model like Mistral 7B or Llama 3 8B needs at least 8GB of RAM — 16GB gives you more breathing room.
My setup: Ubuntu 22.04, 32GB RAM, NVIDIA RTX 3060 with 12GB VRAM. Running macOS with Apple Silicon? Ollama’s Metal support is excellent there — performance is genuinely impressive on an M2 or M3.
Required packages:
- Python 3.10+
- ffmpeg (audio processing)
- portaudio (microphone input)
- Ollama
Installation
Step 1: Install Ollama
Ollama ships a one-liner that handles everything, including the systemd service:
curl -fsSL https://ollama.com/install.sh | sh
Pull the model you want. For consumer hardware, mistral hits the sweet spot of speed and capability. If you’re on a machine with less than 8GB VRAM, phi3 is noticeably lighter:
ollama pull mistral
# Or for something lighter:
ollama pull phi3
Verify it’s actually working before moving on:
ollama list
# Should show your downloaded models
curl http://localhost:11434/api/generate -d '{
"model": "mistral",
"prompt": "Hello, are you working?",
"stream": false
}'
Step 2: Set Up Python Environment
python3 -m venv voice-assistant-env
source voice-assistant-env/bin/activate
# Install system dependencies first
sudo apt install -y ffmpeg portaudio19-dev python3-dev
# Install Python packages
pip install openai-whisper pyaudio requests numpy
On macOS:
brew install ffmpeg portaudio
pip install openai-whisper pyaudio requests numpy
Step 3: Download the Whisper Model
Whisper fetches models on first use automatically. Pre-downloading avoids the wait during your first real run:
import whisper
# This downloads and caches the model (~1.5GB for 'medium')
model = whisper.load_model("medium")
print("Model loaded successfully")
Model sizes available: tiny, base, small, medium, large. For conversational use, small or medium is the practical sweet spot — large is overkill unless accuracy is critical.
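If you want to be explicit about where the model runs, or to confirm it actually landed on the GPU, load_model accepts a device argument. A minimal sketch, falling back to CPU when no CUDA device is visible:
import torch
import whisper
# PyTorch ships as a dependency of openai-whisper, so this check adds nothing extra
device = "cuda" if torch.cuda.is_available() else "cpu"
model = whisper.load_model("medium", device=device)
print(f"Whisper loaded on: {device}")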
Configuration
The Core Voice Assistant Script
Below is the complete script I’ve been running in production. It records audio, transcribes with Whisper, fires the text at Ollama, and prints the response:
import whisper
import pyaudio
import wave
import requests
import json
import os
import tempfile
import numpy as np
# ── Configuration ──────────────────────────────────────────────
WHISPER_MODEL = "medium" # Change to "small" for faster response
OLLAMA_MODEL = "mistral" # Must match what you pulled with ollama
OLLAMA_URL = "http://localhost:11434/api/generate"
# Audio recording settings
SAMPLE_RATE = 16000
CHUNK_SIZE = 1024
CHANNELS = 1
FORMAT = pyaudio.paInt16
RECORD_SECONDS = 5 # Adjust based on your typical query length
SILENCE_THRESHOLD = 500 # Adjust for your microphone sensitivity
# ── Load Whisper model once at startup ─────────────────────────
print(f"Loading Whisper model: {WHISPER_MODEL}")
whisper_model = whisper.load_model(WHISPER_MODEL)
print("Ready. Press Enter to start recording.")
def record_audio() -> str:
"""Record audio from microphone and save to temp file."""
audio = pyaudio.PyAudio()
stream = audio.open(
format=FORMAT,
channels=CHANNELS,
rate=SAMPLE_RATE,
input=True,
frames_per_buffer=CHUNK_SIZE
)
print("Recording... (speak now)")
frames = []
for _ in range(0, int(SAMPLE_RATE / CHUNK_SIZE * RECORD_SECONDS)):
data = stream.read(CHUNK_SIZE)
frames.append(data)
print("Recording complete.")
stream.stop_stream()
stream.close()
audio.terminate()
# Save to temp WAV file
tmp = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
with wave.open(tmp.name, 'wb') as wf:
wf.setnchannels(CHANNELS)
wf.setsampwidth(audio.get_sample_size(FORMAT))
wf.setframerate(SAMPLE_RATE)
wf.writeframes(b''.join(frames))
return tmp.name
def transcribe(audio_path: str) -> str:
"""Transcribe audio file using Whisper."""
result = whisper_model.transcribe(audio_path, language="en")
os.unlink(audio_path) # Clean up temp file
return result["text"].strip()
def ask_ollama(prompt: str) -> str:
"""Send prompt to local Ollama instance and return response."""
payload = {
"model": OLLAMA_MODEL,
"prompt": prompt,
"stream": False,
"options": {
"temperature": 0.7,
"num_predict": 256 # Keep responses concise for voice
}
}
response = requests.post(OLLAMA_URL, json=payload, timeout=60)
response.raise_for_status()
return response.json()["response"].strip()
def main():
while True:
input("\nPress Enter to speak (Ctrl+C to quit)...")
audio_path = record_audio()
print("Transcribing...")
text = transcribe(audio_path)
if not text:
print("No speech detected. Try again.")
continue
print(f"You said: {text}")
print("Thinking...")
response = ask_ollama(text)
print(f"\nAssistant: {response}\n")
if __name__ == "__main__":
main()
Tuning for Your Hardware
Three settings make a real difference in practice:
- Whisper language setting: Always specify the language when you know it (language="en"). Auto-detection adds latency and occasionally picks the wrong language entirely — I’ve seen it transcribe English as French when the audio had background music.
- num_predict: Capping Ollama’s output at 256 tokens keeps voice responses short and listenable. A 2000-token essay is painful to read on screen; spoken aloud, it’s unbearable.
- Record seconds: 5 seconds handles most short commands. Bump to 8–10 for back-and-forth conversation. A tuned example combining these settings is sketched below.
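As a concrete starting point on slower hardware, the relevant constants in the script might look like this (treat the values as assumptions to tune against your own latency numbers, not measured optima):
WHISPER_MODEL = "small"    # faster transcription, slightly lower accuracy
RECORD_SECONDS = 8         # longer window for conversational back-and-forth
# and inside ask_ollama's options:
# "num_predict": 128       # tighter cap for spoken-style answers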
Adding a System Prompt for Consistent Behavior
A system persona nudges Ollama toward shorter, voice-appropriate answers. Without it, responses tend to ramble and include markdown that looks terrible in terminal output:
SYSTEM_PROMPT = (
"You are a concise voice assistant. Keep all responses under 3 sentences. "
"No markdown formatting. Speak as if answering out loud."
)
def ask_ollama(user_input: str) -> str:
full_prompt = f"{SYSTEM_PROMPT}\n\nUser: {user_input}\nAssistant:"
payload = {
"model": OLLAMA_MODEL,
"prompt": full_prompt,
"stream": False,
"options": {"temperature": 0.7, "num_predict": 256}
}
response = requests.post(OLLAMA_URL, json=payload, timeout=60)
return response.json()["response"].strip()
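If you’re running a recent Ollama, the /api/generate endpoint also accepts a separate system field, which keeps the user prompt clean instead of concatenating strings. A sketch of the same request using it:
payload = {
    "model": OLLAMA_MODEL,
    "prompt": user_input,
    "system": SYSTEM_PROMPT,   # applied as the system prompt for this request
    "stream": False,
    "options": {"temperature": 0.7, "num_predict": 256}
}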
Verification and Monitoring
Testing Each Component Independently
Test each piece in isolation before running the full pipeline. Doing this upfront saved me a couple hours of debugging when I first set this up — chasing a Whisper issue through Ollama logs is not fun.
Test Whisper alone:
# Record a 5-second clip (this snippet uses sounddevice and scipy: pip install sounddevice scipy)
python3 -c "
import sounddevice as sd
import scipy.io.wavfile as wav
import numpy as np
fs = 16000
duration = 5
print('Recording...')
audio = sd.rec(int(duration * fs), samplerate=fs, channels=1, dtype='int16')
sd.wait()
wav.write('test.wav', fs, audio)
print('Saved test.wav')
"
# Transcribe it
python3 -c "
import whisper
m = whisper.load_model('medium')
result = m.transcribe('test.wav', language='en')
print(result['text'])
"
Test Ollama independently:
curl http://localhost:11434/api/generate \
-d '{"model": "mistral", "prompt": "What is 2+2?", "stream": false}' \
| python3 -c "import sys,json; print(json.load(sys.stdin)['response'])"
Measuring Latency
Six months of runtime on an RTX 3060 gives me a consistent baseline:
- Whisper medium on GPU: ~0.8–1.2 seconds for a 5-second clip
- Ollama (Mistral 7B, GPU): ~1.5–3 seconds for a 256-token response
- Total round-trip: roughly 2.5–4 seconds
Factor in the 1.5–2 seconds of network overhead that cloud APIs add before the model even starts, and that’s competitive. And entirely local.
Wire up basic timing to track this over time:
import time
start = time.time()
text = transcribe(audio_path)
print(f"Transcription: {time.time() - start:.2f}s")
start = time.time()
response = ask_ollama(text)
print(f"LLM response: {time.time() - start:.2f}s")
Running as a Background Service
For a permanent setup, wrap the Python script in a systemd service. Ollama already runs as one by default — this just adds your assistant alongside it:
sudo nano /etc/systemd/system/voice-assistant.service
[Unit]
Description=Local Voice Assistant
After=ollama.service
[Service]
Type=simple
User=youruser
WorkingDirectory=/home/youruser/voice-assistant
ExecStart=/home/youruser/voice-assistant/voice-assistant-env/bin/python assistant.py
Restart=on-failure
[Install]
WantedBy=multi-user.target
sudo systemctl daemon-reload
sudo systemctl enable --now voice-assistant
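To confirm the service came up and to watch its output, the standard systemd tooling is enough:
systemctl status voice-assistant
journalctl -u voice-assistant -f
# Follows the assistant's stdout/stderr live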
What to Watch For
Three issues come up repeatedly in production:
- Ollama running out of VRAM: Running other GPU workloads alongside Ollama can push it off the GPU entirely. Check with nvidia-smi — you want to see the model loaded in GPU memory, not spilled to system RAM.
- Whisper misfiring on silence: When nobody speaks during recording, Whisper returns garbage text rather than an empty string. Add a simple RMS energy check before sending anything to Ollama — skip the request if the audio is silent (a sketch follows this list).
- Memory creep in long sessions: I restart the service weekly as a precaution. Two months in I noticed RAM usage had drifted up ~800MB from a cold start. Weekly restart keeps it clean.
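Here is a minimal sketch of that silence check. It’s an addition to the script above (not part of the original listing) and reuses its wave, numpy, and SILENCE_THRESHOLD definitions:
def is_silent(audio_path: str, threshold: int = SILENCE_THRESHOLD) -> bool:
    """Return True if the recording's RMS energy falls below the threshold."""
    with wave.open(audio_path, 'rb') as wf:
        samples = np.frombuffer(wf.readframes(wf.getnframes()), dtype=np.int16)
    if samples.size == 0:
        return True
    rms = np.sqrt(np.mean(samples.astype(np.float64) ** 2))
    return rms < threshold

# In main(), before calling transcribe():
#     if is_silent(audio_path):
#         os.unlink(audio_path)
#         print("No speech detected. Try again.")
#         continue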
The full stack — Whisper medium plus Mistral 7B — holds around 6GB of VRAM and 4GB of system RAM when both are loaded. In six months of production use, I’ve had exactly two unplanned restarts, both from power outages. That’s a reliability record I’m happy with.

