6 Months with Local AI: How I Automated Meeting Minutes Using Whisper and Ollama

AI tutorial - IT technology blog

The Manual Note-Taking Tax

Six months ago, our internal documentation was a black hole. We were losing roughly 12 hours every week to ‘sync-up debt’—time spent in architecture reviews and client calls where the resulting notes were either half-baked or arrived two days late.

Like most engineering leads, I experimented with cloud-based AI note-takers. They were polished, but we hit two hard walls: $30-per-user monthly subscriptions and, more critically, security policies that flatly prohibited sending raw client data to external servers.

I decided to stop outsourcing our privacy and built a local-first pipeline instead. After running this system in production for half a year, the results are clear. We’ve replaced the manual grind with a Python-based workflow that uses OpenAI’s Whisper for transcription and Ollama for summarization. It isn’t just a technical curiosity; it’s a practical way to reclaim your calendar while keeping every byte of data on your own hardware.

The Architecture: Whisper + Ollama

The workflow is straightforward: take an audio file (MP4, MP3, or WAV) and turn it into a structured Markdown document. To make this work without the cloud, you need two specialized engines running on your machine.

1. Speech-to-Text: OpenAI Whisper

Whisper is the industry benchmark for local transcription. While older libraries crumbled when faced with heavy accents or coffee shop background noise, Whisper’s Transformer architecture handles them with surprising ease. It offers several model sizes. In my tests, the ‘small’ model provides a 95% accuracy rate on clear English audio and runs at roughly 5x real-time speed on a mid-range GPU.

2. The Intelligence: Local LLMs via Ollama

Raw transcripts are messy. People stutter, talk over one another, and fill gaps with ‘um’ and ‘ah’. To turn this chaos into logic, you need a Large Language Model (LLM). I use Ollama to manage our models because it wraps Llama 3.1 or Mistral into a background service with a simple API, making it easy to swap models as better ones are released.

Hands-on Practice: Building the Pipeline

Here is the exact stack I’m running. We use Python as the glue. For hardware, an NVIDIA GPU with at least 8GB of VRAM is the sweet spot, though you can run this on a modern Mac M-series chip with excellent results.

Step 1: Environment Setup

First, grab the core libraries. You will need the openai-whisper engine and the ollama client.

# Install Whisper and its dependencies
pip install openai-whisper setuptools-rust

# Install Ollama Python client
pip install ollama

# ffmpeg is the engine under the hood for audio processing
# Linux: sudo apt install ffmpeg | Mac: brew install ffmpeg
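Before moving on, a quick sanity check saves debugging later. `missing_tools` is a small helper of my own (not part of either library) that reports which command-line dependencies are absent:

```python
import shutil

def missing_tools(commands):
    """Return the subset of command-line tools not found on PATH."""
    return [cmd for cmd in commands if shutil.which(cmd) is None]

# Whisper shells out to ffmpeg for decoding, so it must be on PATH
missing = missing_tools(["ffmpeg"])
if missing:
    print(f"Install these before continuing: {', '.join(missing)}")
```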

Step 2: Transcribing Audio with Whisper

This script initializes the Whisper model and processes your recording. I recommend ‘base’ for quick debugging and ‘small’ for your actual daily notes.

import whisper

def transcribe_audio(file_path):
    # 'small' is the best balance of VRAM and accuracy
    model = whisper.load_model("small")
    
    print("Transcribing... (this usually takes 1/5th of the audio length)")
    result = model.transcribe(file_path)
    
    return result["text"]

# Example usage
# transcript = transcribe_audio("sprint_planning.mp4")

Step 3: Extracting Value with Ollama

Once you have the text, you need to clean it. The secret is in the prompt. You want the AI to act as a focused technical secretary, not a creative writer.

import ollama

def generate_minutes(transcript):
    prompt = f"""
    Task: Summarize this technical meeting transcript into Markdown.
    Rules: Be concise. Prioritize decisions over small talk.
    
    Include:
    - Primary Objective
    - Key Technical Decisions
    - Unresolved Blockers
    - Action Items (Owner + Deadline)
    
    Transcript:
    {transcript}
    """
    
    response = ollama.generate(model='llama3.1', prompt=prompt)
    return response['response']

Step 4: The Batch Processor

I typically run this as a CLI tool. It watches a specific folder and processes recordings overnight. This way, when I log in at 9:00 AM, my inbox is already full of Markdown summaries from the previous day’s calls.
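For reference, a minimal sketch of that batch step. `pending_recordings` and `process_folder` are illustrative names of my own; the processor takes the transcribe and summarize callables as parameters so it plugs into `transcribe_audio` and `generate_minutes` from Steps 2 and 3:

```python
from pathlib import Path

AUDIO_EXTS = {".mp3", ".mp4", ".wav"}

def pending_recordings(folder):
    """Audio files that don't yet have a matching .md summary next to them."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.suffix.lower() in AUDIO_EXTS and not p.with_suffix(".md").exists()
    )

def process_folder(folder, transcribe, summarize):
    """Turn every unprocessed recording into a Markdown file alongside it."""
    for audio in pending_recordings(folder):
        minutes = summarize(transcribe(str(audio)))
        audio.with_suffix(".md").write_text(minutes)
        print(f"Wrote {audio.with_suffix('.md').name}")
```

A nightly cron entry then just calls `process_folder("recordings", transcribe_audio, generate_minutes)`; because finished recordings already have a `.md` sibling, reruns are idempotent.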

Lessons from 6 Months in Production

Building the tool is easy; making it reliable for a team is harder. Here are the three main friction points we solved:

1. The VRAM Wall

Whisper and LLMs are memory hogs. If you try to run them simultaneously on an 8GB card, your system will likely hang. I found that loading the Whisper model, finishing the transcription, and then explicitly deleting the model object before calling Ollama prevents 99% of crashes.
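In code, that teardown looks roughly like this. This is a sketch: `transcribe_then_release` and `free_cuda_cache` are my own wrapper names, and the `torch.cuda.empty_cache()` call assumes the CUDA build of PyTorch that Whisper runs on (it degrades to a plain GC pass elsewhere):

```python
import gc

def free_cuda_cache():
    """Force a GC pass and, when PyTorch with CUDA is present, release cached VRAM."""
    gc.collect()
    try:
        import torch  # Whisper's backend; optional here so CPU boxes still work
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    except ImportError:
        pass

def transcribe_then_release(file_path):
    """Run Whisper, then free its VRAM before handing off to Ollama."""
    import whisper

    model = whisper.load_model("small")
    text = model.transcribe(file_path)["text"]

    del model          # drop the only reference to the weights
    free_cuda_cache()  # without this, the 8GB card stays full
    return text
```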

2. The ‘Who Said What’ Problem

Standard Whisper generates a ‘wall of text’ without speaker names. For 2-3 person syncs, Llama 3.1 is remarkably good at identifying speakers based on context (e.g., ‘I will push the Docker image’ is clearly the DevOps lead). For larger groups, I recommend integrating faster-whisper with pyannote-audio to add proper speaker diarization.
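A rough sketch of that combination, hedged heavily: `dominant_speaker` and `diarized_transcript` are illustrative helpers of my own, and the pyannote pipeline name and authentication depend on your Hugging Face setup. The idea is simply to label each transcript segment with whichever diarization turn overlaps it most:

```python
def dominant_speaker(seg_start, seg_end, turns):
    """Pick the speaker whose diarization turn overlaps the segment most.

    turns: list of (start, end, speaker_label) tuples in seconds.
    """
    best, best_overlap = "UNKNOWN", 0.0
    for t_start, t_end, speaker in turns:
        overlap = min(seg_end, t_end) - max(seg_start, t_start)
        if overlap > best_overlap:
            best, best_overlap = speaker, overlap
    return best

def diarized_transcript(file_path):
    # Heavy imports deferred: both need model downloads and real hardware
    from faster_whisper import WhisperModel
    from pyannote.audio import Pipeline

    pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")
    turns = [
        (turn.start, turn.end, speaker)
        for turn, _, speaker in pipeline(file_path).itertracks(yield_label=True)
    ]

    segments, _ = WhisperModel("small").transcribe(file_path)
    return "\n".join(
        f"{dominant_speaker(s.start, s.end, turns)}: {s.text.strip()}"
        for s in segments
    )
```

The generic `SPEAKER_00`-style labels can then be mapped to real names by the LLM in the summarization step, or by a simple find-and-replace once you know who opened the call.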

3. Brutal Prompting

My early summaries were fluff. They included things like ‘The team exchanged pleasantries.’ No one needs to read that. I updated the system prompt to: ‘Ignore all social filler. If a sentence doesn’t result in a decision or an action item, discard it.’ The utility of our notes tripled overnight.
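That rule now lives in a system prompt rather than the per-request prompt, so it applies to every call. A sketch using the Ollama client's `system` parameter; the exact wording below is my current prompt, and `generate_minutes_strict` is just an illustrative name:

```python
SYSTEM_PROMPT = (
    "You are a technical secretary, not a creative writer. "
    "Ignore all social filler. If a sentence doesn't result in a "
    "decision or an action item, discard it."
)

def generate_minutes_strict(transcript):
    import ollama  # deferred so the prompt constant is importable on its own

    response = ollama.generate(
        model="llama3.1",
        system=SYSTEM_PROMPT,  # applied to every request, separate from the task prompt
        prompt=f"Summarize this meeting transcript into Markdown:\n\n{transcript}",
    )
    return response["response"]
```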

Conclusion

Ditching the cloud for a local Whisper and Ollama stack was the single best workflow improvement I made last year. It killed my privacy anxiety and ended the Monday morning scramble to remember what happened in Friday’s meetings. While it requires an upfront investment in a decent GPU and a few hours of Python scripting, the payoff is a perpetual, private secretary that doesn’t charge a monthly fee. If you handle sensitive code or client data, this isn’t just an option—it’s a necessity.
