The Problem: LLMs Are Too Big to Run Locally
You found a great open-source model on Hugging Face — maybe Mistral 7B, LLaMA 3, or Qwen2. You download it, try to load it, and your machine either freezes or throws an out-of-memory error. A 7B parameter model in float16 takes roughly 14 GB of VRAM. A 13B model? Over 26 GB. Most developers simply don’t have that.
The root cause is precision. Model weights are stored as 16-bit or 32-bit floats by default. That’s 2–4 bytes per parameter — far more precision than inference actually needs. The fix is quantization: reducing the bit-width of each weight to shrink the model without gutting its accuracy.
Learning this pipeline changed how I work with local AI. With a properly quantized model, Mistral 7B runs at 10–20 tokens per second on a modern laptop CPU — no GPU, no datacenter, no cloud bill.
Core Concepts You Need to Know
What Is Quantization?
Quantization maps high-precision float values to lower-precision integers. Instead of 16 bits per weight, you use 4 or 8 bits. The model loses some numerical detail, but output quality degrades very little in practice — especially at Q4 and above.
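To make that concrete, here is a deliberately simplified sketch of the core idea: store one scale per small block of weights, keep each weight as a small integer, and multiply back by the scale at inference time. llama.cpp's K-quant formats are more elaborate (blocks, super-blocks, per-block minimums), but the principle is the same.
python3 - <<'EOF'
# Toy 4-bit-style quantization of a single block of weights (illustration only).
block = [0.12, -0.43, 0.07, 0.88]             # four example weights
scale = max(abs(w) for w in block) / 7         # signed 4-bit integers span roughly -8..7
quantized = [round(w / scale) for w in block]  # the small integers that get stored
restored = [q * scale for q in quantized]      # what the runtime reconstructs at inference
print("stored ints:", quantized)
print("restored   :", [f"{x:.3f}" for x in restored])
EOF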
Here’s how the common quantization levels compare:
- Q2_K — ~2 bits/weight. Smallest file, noticeably degraded quality. Only useful under extreme memory constraints.
- Q4_K_M — ~4 bits/weight. The sweet spot. Fast inference, quality close to full precision.
- Q5_K_M — ~5 bits/weight. Better quality, slightly larger. Worth it when you have the RAM.
- Q8_0 — ~8 bits/weight. Near-lossless. Best quality among quantized formats, but the largest and slowest to generate.
Concretely: Q4_K_M brings a 7B model down to around 4.1 GB. That fits in 8 GB of system RAM with room to spare — no GPU required.
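Those sizes fall straight out of bits-per-weight arithmetic. A rough check, assuming a 7-billion-parameter model and noting that Q4_K_M works out to a little under 5 effective bits per weight in practice, since each block of weights also stores its scale:
python3 -c "print(f'fp16:   {7e9 * 2 / 1e9:.0f} GB')"           # 16 bits = 2 bytes per weight
python3 -c "print(f'Q4_K_M: {7e9 * 4.85 / 8 / 1e9:.1f} GB')"    # roughly 4.85 effective bits per weight
Small differences from published file sizes come down to exact parameter counts and metadata.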
What Is GGUF?
GGUF (GPT-Generated Unified Format) is the file format used by llama.cpp. It replaced the older GGML format and bundles everything the runtime needs — weights, tokenizer, hyperparameters — into a single binary. One file. No separate config to track down.
GGUF is what you see on Hugging Face with names like model-Q4_K_M.gguf. Many popular models are pre-converted, but when a new release lands and the community hasn’t packaged it yet, you convert it yourself. That’s exactly what this guide covers.
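Once you have a .gguf file on disk (any of the ones created below will do), you can confirm this bundling for yourself. The gguf Python package that llama.cpp ships under gguf-py includes a small dump utility; the console-script name here is my assumption from the pip package, so adjust if your install differs:
pip install gguf
gguf-dump model-Q4_K_M.gguf    # prints the metadata keys, tokenizer settings, and tensor list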
How llama.cpp Fits In
llama.cpp is a C++ inference engine optimized for running GGUF models on CPU and GPU. The conversion pipeline uses two tools. convert_hf_to_gguf.py transforms a Hugging Face model directory into GGUF format. Then llama-quantize compresses it to your chosen bit-width. You run them in sequence: convert first, quantize second.
Hands-On: Converting and Quantizing a Model
Step 1: Install Dependencies
You’ll need Python 3.10+, Git, and a C++ compiler. On Ubuntu/Debian:
sudo apt update && sudo apt install -y git build-essential cmake python3 python3-pip
pip install torch transformers sentencepiece huggingface_hub
Step 2: Clone llama.cpp and Build the Tools
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make -j$(nproc)
This compiles llama-quantize, llama-cli, llama-server, and the other utilities. If you have CUDA and want GPU acceleration during inference:
make LLAMA_CUDA=1 -j$(nproc)
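Recent llama.cpp revisions have been moving from the Makefile to CMake as the primary build path, so if make complains about a missing or deprecated Makefile, the CMake equivalent is roughly the following (binaries land in build/bin/ instead of the repo root):
cmake -B build
cmake --build build --config Release -j$(nproc)
# with CUDA support:
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)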
Step 3: Install Python Requirements for the Conversion Script
pip install -r requirements.txt
Step 4: Download the Source Model from Hugging Face
Use huggingface-cli to pull a model. Here we use mistralai/Mistral-7B-v0.3:
huggingface-cli login # required if the model is gated
huggingface-cli download mistralai/Mistral-7B-v0.3 \
--local-dir ./models/Mistral-7B-v0.3 \
--local-dir-use-symlinks False
This pulls all .safetensors shards, the tokenizer, and the config files. Mistral 7B is around 14 GB total at this stage.
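A quick sanity check that everything landed:
du -sh ./models/Mistral-7B-v0.3    # expect roughly 14 GB
ls ./models/Mistral-7B-v0.3        # should show the .safetensors shards, tokenizer files, and config.json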
Step 5: Convert to GGUF (Float16)
The conversion script produces a full-precision GGUF file as an intermediate step:
python convert_hf_to_gguf.py ./models/Mistral-7B-v0.3 \
--outfile ./models/Mistral-7B-v0.3-F16.gguf \
--outtype f16
Depending on disk speed, this takes 2–5 minutes. The output is around 14 GB — same precision as the original, repackaged as GGUF.
Step 6: Quantize to Your Target Format
Now run llama-quantize. Q4_K_M is almost always the right starting point:
./llama-quantize \
./models/Mistral-7B-v0.3-F16.gguf \
./models/Mistral-7B-Q4_K_M.gguf \
Q4_K_M
This takes 2–4 minutes and outputs a 4.1 GB file. Once you've produced every quantization level you want, delete the F16 intermediate to reclaim ~10 GB of disk space.
Want to generate multiple quantization levels for comparison?
for QTYPE in Q2_K Q4_K_M Q5_K_M Q8_0; do
./llama-quantize \
./models/Mistral-7B-v0.3-F16.gguf \
./models/Mistral-7B-${QTYPE}.gguf \
$QTYPE
done
Step 7: Run Inference to Verify
Test the quantized model with llama.cpp’s built-in CLI:
./llama-cli \
-m ./models/Mistral-7B-Q4_K_M.gguf \
-p "Explain what Docker volumes are in simple terms." \
-n 300 \
--temp 0.7
Coherent output means the conversion worked. Garbage tokens usually mean the tokenizer wasn’t packaged correctly — rerun the conversion step and check for warnings about missing tokenizer files.
Perplexity Check (Optional but Recommended)
Perplexity measures how well the model predicts text. Lower is better. Run this to measure exactly how much quality you traded for file size:
# Download a test dataset
wget https://huggingface.co/datasets/ggml-org/ci/resolve/main/wikitext-2-raw-v1.zip
unzip wikitext-2-raw-v1.zip
# Run perplexity evaluation
./llama-perplexity \
-m ./models/Mistral-7B-Q4_K_M.gguf \
-f wikitext-2-raw/wiki.test.raw
Q4_K_M typically adds less than 0.5 perplexity points versus F16 on most 7B models. That’s a negligible accuracy hit for a 70% reduction in file size.
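If you generated several quantization levels in Step 6, the same evaluation gives you a quick side-by-side comparison:
for QTYPE in Q2_K Q4_K_M Q5_K_M Q8_0; do
  echo "=== ${QTYPE} ==="
  ./llama-perplexity \
    -m ./models/Mistral-7B-${QTYPE}.gguf \
    -f wikitext-2-raw/wiki.test.raw
done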
Loading into Ollama or Open WebUI
Prefer a chat interface over the raw CLI? Load the GGUF directly into Ollama:
# Create a Modelfile
cat > Modelfile <<EOF
FROM ./models/Mistral-7B-Q4_K_M.gguf
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
EOF
ollama create mistral-local -f Modelfile
ollama run mistral-local
Choosing the Right Quantization Level
The right format depends on your hardware. Here’s a practical decision tree:
- Only 4–6 GB RAM available: Q2_K or Q3_K_M. Expect noticeable quality loss on complex reasoning.
- 8 GB RAM: Q4_K_M for 7B models. This is the default choice for most setups.
- 16 GB RAM: Q5_K_M for 7B, or Q4_K_M for 13B.
- 32 GB+ RAM: Q8_0 for 7B/13B, or run 30B+ models at Q4.
What Comes Next
Quantized model in hand, you have several natural directions. Run llama-server to get an OpenAI-compatible REST API; anything that speaks the OpenAI API (LangChain, Open WebUI, LlamaIndex) can point at it by swapping the base URL, with no other code changes. Or build a RAG pipeline on top of your local model. Or fine-tune the base model first, then convert the fine-tuned weights.
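A minimal sketch of the llama-server route (model path and port are just examples; the server exposes an OpenAI-style /v1/chat/completions endpoint):
./llama-server -m ./models/Mistral-7B-Q4_K_M.gguf -c 4096 --port 8080
# from another terminal:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Give me one use case for a local LLM."}]}'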
The llama.cpp project ships updates almost daily and adds support for new architectures within days of a model release. LLaMA, Mistral, Qwen, Phi, Gemma — if it’s open-weight, it almost certainly has GGUF support.
Start with Q4_K_M. Run a quick perplexity check. Then decide whether to trade up to Q5 for better quality or down to Q3 for lower memory usage. The quantization step takes minutes — iteration is cheap, and the right format is worth finding.

