The Bottleneck: Why Your Local LLM Crawls
Waiting for a Llama-3-70B or Mixtral 8x7B model to spit out text at a sluggish 2 tokens per second feels like watching paint dry. Even if you are rocking an RTX 4090, the math is working against you. The problem isn’t that your GPU is weak; it’s that LLMs are auto-regressive: they generate text one token at a time, requiring a full pass through billions of parameters for every single one.
Under the hood, inference is a memory-bandwidth game. During generation, your GPU spends roughly 95% of its time streaming massive model weights from VRAM to the compute cores and only about 5% actually doing the math. To generate one token on a 70B model, you must read roughly 40GB of data (at 4-bit quantization), and you do that again for every token. It is the definition of an efficiency nightmare.
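You can sanity-check that bottleneck with a back-of-the-envelope calculation. The sketch below uses illustrative assumptions (a ~40GB weight footprint and the 4090’s ~1TB/s memory bandwidth) and pretends the whole model fits in one card’s VRAM, which a 70B does not on a single 24GB GPU:
# Back-of-the-envelope decode ceiling: batch-1 generation is bounded by
# how fast the GPU can stream the weights past the compute units.
weights_gb = 40          # ~70B parameters at 4-bit quantization
bandwidth_gb_s = 1008    # RTX 4090 memory bandwidth (~1 TB/s)

max_tps = bandwidth_gb_s / weights_gb
print(f"Theoretical ceiling: ~{max_tps:.0f} tokens/sec")
# ~25 tokens/sec at best -- and far less once weights spill into system RAM.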
The Solution: Decoding with a ‘Fast-Thinking’ Assistant
Speculative decoding flips the script. Instead of forcing the massive, slow “Target Model” to do all the heavy lifting, we hire a tiny, agile “Draft Model” to act as a scout.
Here is how the hand-off works:
- The Guess (Speculation): A 1B or 3B parameter model—which fits entirely in a tiny corner of your VRAM—rapidly predicts the next 5 to 8 tokens. It is fast because it is small.
- The Audit (Verification): The 70B Target Model looks at that entire string of predicted text in one single parallel pass. It asks: “Would I have said the same thing?”
If the scout gets 4 out of 5 tokens right, the expert accepts them instantly. You just got 4 tokens for the “price” of one pass of the big model. If the scout fails, the expert corrects the first mistake and takes over. In my testing, even a mediocre scout provides a nearly “free” speed boost with zero loss in output quality, because the expert still has the final say on every token.
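Here is a minimal Python sketch of one draft/verify cycle. It shows the simplest greedy-acceptance variant; real implementations (and the original papers) use probabilistic rejection sampling so the sampled output matches the target’s distribution exactly. draft_model and target_model are hypothetical callables, not a real library API:
# Conceptual sketch of one speculative decoding step (greedy acceptance).
def speculative_step(target_model, draft_model, tokens, lookahead=5):
    # 1. The scout guesses `lookahead` tokens, one cheap pass each.
    guesses = []
    ctx = list(tokens)
    for _ in range(lookahead):
        tok = draft_model(ctx)
        guesses.append(tok)
        ctx.append(tok)

    # 2. The expert checks every guess in a single parallel pass:
    #    for each prefix, what token would the target have emitted next?
    verdicts = target_model(tokens, guesses)  # len(guesses) + 1 predictions

    # 3. Accept guesses until the first disagreement, then substitute the
    #    expert's own token there -- so output quality is never degraded.
    accepted = []
    for guess, verdict in zip(guesses, verdicts):
        if guess == verdict:
            accepted.append(guess)
        else:
            accepted.append(verdict)
            break
    else:
        accepted.append(verdicts[-1])  # every guess matched: free bonus token
    return tokens + accepted
The key point is step 2: the target scores all the guesses in one forward pass, so a run of correct guesses costs barely more than generating a single token would.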
Setting Up Your Environment
To make this work, your models must share the same vocabulary. If you are running Llama 3 70B, use Llama 3 8B or a Llama-3-distilled 1B model as your draft. Mixing families—like using a Gemma draft for a Llama target—usually breaks the logic because they “speak” different internal languages.
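If you are unsure whether two checkpoints really share a vocabulary, a quick check with the Hugging Face transformers library (assumed installed, with access to both repos) settles it before you download 40GB of weights:
# Do the target and draft tokenize text identically? If not, don't pair them.
from transformers import AutoTokenizer

target = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-70B-Instruct")
draft = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

sample = "Speculative decoding only works if both models speak the same token language."
print(target.encode(sample) == draft.encode(sample))  # True -> safe to pair
print(target.get_vocab() == draft.get_vocab())        # stricter, full-vocab check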
We will focus on the two heavy hitters: llama.cpp (the gold standard for GGUF and consumer hardware) and vLLM (the go-to for high-speed API serving).
Hardware Requirements
- OS: Linux or macOS are preferred; Windows users should stick to WSL2 for better performance.
- VRAM: You need enough space for both models plus their KV caches. For a 70B Q4_K_M target (~42GB) and an 8B Q4 draft (~5GB), a 48GB setup (dual 3090s or 4090s) is the sweet spot, but the fit is tight: keep the context window modest or drop to a smaller quant if you run out of room.
Configuring llama.cpp for Speed
llama.cpp makes speculative decoding easy with the --draft flag. Ensure both your target and draft are in GGUF format before starting; flag names have shifted between releases, so check --help on your build if a flag is not recognized.
# Speeding up Llama-3-70B with an 8B draft
./llama-cli -m models/llama-3-70b-q4.gguf \
--draft 5 \
-md models/llama-3-8b-q4.gguf \
-p "Write a Python script to scrape news headlines." \
-n 512 \
--n-gpu-layers 81
Key parameters to tune:
- -md: Points to your small draft model.
- --draft 5: Tells the scout to look 5 tokens ahead. If your scout is smart, try 8; if it’s guessing poorly, drop to 4.
Keep both models on the GPU. If you offload the draft model to your CPU while the target sits on the GPU, the PCIe latency will eat all your gains.
Implementing vLLM for API Serving
vLLM is the better choice for building web interfaces. It handles speculative decoding dynamically and supports advanced heads like Medusa.
# Start the vLLM server with speculation
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3-70B-Instruct \
--tensor-parallel-size 2 \
--speculative-model meta-llama/Meta-Llama-3-8B-Instruct \
--num-speculative-tokens 5 \
--gpu-memory-utilization 0.95
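Once the server is up, speculative decoding is completely transparent to clients: you query vLLM’s OpenAI-compatible endpoint exactly as you would without it. A minimal example, assuming the default port 8000 and the openai Python package:
# Query the vLLM server started above via its OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    messages=[{"role": "user", "content": "Write a Python script to scrape news headlines."}],
    max_tokens=512,
)
print(response.choices[0].message.content)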
If you are running a high-end setup, look into “Medusa” or “EAGLE” heads. These are specialized draft heads trained onto the target model itself that can offer 2x to 3x speedups without needing a separate draft model file.
Measuring Success: The Acceptance Rate
Once you are running, don’t just trust your eyes. Watch the statistics at the end of your generation.
- Tokens per Second (TPS): Your bottom line. If you go from 3 TPS to 7 TPS, you have won.
- Acceptance Rate: This is the percentage of the scout’s guesses that the expert kept.
An acceptance rate of 70-80% is the gold standard for general chat. If you see it dip below 40%, your scout is essentially guessing blindly. This usually happens when your draft model is too small or wasn’t trained on similar data to your target.
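Acceptance rate and lookahead depth together determine your real-world speedup. The toy model below is not a benchmark: it assumes each draft token is accepted independently with probability alpha and that a draft pass costs about 5% of a target pass, in the spirit of the original speculative decoding analysis:
# Rough speedup estimate from acceptance rate (alpha) and lookahead depth (k).
def expected_speedup(alpha, k, draft_cost=0.05):
    # Expected tokens produced per verification pass of the big model,
    # including the "bonus" token the target emits when every guess matches.
    tokens_per_pass = (1 - alpha ** (k + 1)) / (1 - alpha)
    # Relative cost of one cycle: k cheap draft passes plus 1 full target pass.
    cycle_cost = 1 + k * draft_cost
    return tokens_per_pass / cycle_cost

print(expected_speedup(alpha=0.75, k=5))  # ~2.6x -- healthy acceptance rate
print(expected_speedup(alpha=0.35, k=5))  # ~1.2x -- barely worth the overhead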
Troubleshooting Tips
Still not seeing a speedup? Check these three things:
- VRAM Overload: If the models spill out of VRAM into system RAM, your performance will drop to near zero. Quantize your models further if needed.
- Tokenizer Mismatch: If the scout uses a different tokenizer, the expert will reject every single guess. Stick to the same model family.
- Lookahead Depth: Don’t be greedy. Setting --draft 15 sounds fast, but every wrong guess forces the expert to re-calculate, which can actually make it slower than standard inference.
By mastering these configurations, you can turn a sluggish local setup into a responsive assistant that rivals commercial APIs—all while keeping your data private on your own iron.

