Context & Why Run Local LLMs?
Large Language Models (LLMs) have revolutionized our interaction with technology. They now power everything from advanced search engines to sophisticated content creation tools.
Cloud-based LLMs, such as OpenAI’s GPT series or Google’s Gemini, offer immense power and convenience. However, they come with trade-offs, especially concerning data privacy, cost, and customization.
This is where running LLMs locally becomes so beneficial. By hosting these models on your own hardware, you gain:
- Enhanced Privacy: Your data remains securely on your device. This is crucial for sensitive projects or personal use where data security is paramount.
- Cost Efficiency: Say goodbye to API usage fees. Once you’ve downloaded the model, interactions are free.
- Offline Access: Develop and experiment with LLMs even without an internet connection.
- Complete Control: You can tweak, customize, and integrate models into your applications exactly as needed. There are no external API limitations.
- Learning and Experimentation: A local setup acts as a sandbox. It lets you understand how LLMs work, test prompts, and explore various models without worrying about usage costs.
For modern developers, mastering local LLM execution is an essential skill. It empowers extensive experimentation and secure application development, particularly when handling proprietary data.
Ollama greatly simplifies this process. This open-source tool streamlines running large language models directly on your machine. Ollama manages complex model weights, dependencies, and execution, freeing you to focus on building applications and experimenting with AI.
Installation
Getting Ollama up and running is a quick process. We’ll cover the steps for common operating systems.
Download & Install Ollama
For Linux
Open your terminal and run the following command. This script will download and install Ollama, setting it up as a system service.
curl -fsSL https://ollama.com/install.sh | sh
After execution, Ollama should be installed and running in the background.
For macOS
For macOS users, the simplest method is to download the application directly from the official Ollama website:
Download the .dmg file, open it, and drag Ollama to your Applications folder. Alternatively, if you use Homebrew, you can install it via the command line:
brew install ollama
For Windows
Windows users can also download the installer directly from the Ollama website:
Run the .exe installer and follow the on-screen prompts. Ollama will install itself and typically start automatically after installation.
Verify Installation
Once you’ve completed the installation steps for your operating system, open a new terminal or command prompt and type:
ollama --version
You should see output similar to this, indicating that Ollama is correctly installed:
ollama version is 0.1.XX
The exact version number might differ, but seeing it confirms that the ollama command is recognized.
Configuration
With Ollama installed, it’s time to download and interact with some large language models.
Downloading Your First LLM
Ollama simplifies model downloads. It automatically handles locating model weights and preparing them for use. Let’s start with Llama 2, a popular open-source model:
ollama run llama2
The first time you execute this command, Ollama detects that llama2 isn’t installed locally. It then automatically begins downloading the model. Be aware that this can take some time, depending on your internet speed and the model’s size (many LLMs are several gigabytes, like Llama 2 at approximately 3.8 GB). A progress indicator will appear in your terminal.
Once the download is complete, Ollama will automatically load the model and present you with a prompt to start interacting with it.
You can explore other available models on the Ollama library. For example, to download and run Mistral, another excellent model, you would use:
ollama run mistral
Interacting with Models via CLI
When you run a model using ollama run <model_name>, you enter an interactive chat session. You can type your prompts directly into the terminal.
ollama run llama2
>>> Hi there!
Hello! How can I assist you today?
>>> Explain the concept of recursion in programming.
Recursion in programming is a technique where a function calls itself, directly or indirectly, to solve a problem. Think of it like a set of Russian nesting dolls, each doll containing a smaller version of itself. ...
>>> Generate a Python function to reverse a string.
def reverse_string(s):
    return s[::-1]

# Example usage:
my_string = "hello"
reversed_my_string = reverse_string(my_string)
print(reversed_my_string)  # Output: olleh
To exit the interactive session, simply type /bye or press Ctrl + D (on Linux/macOS) or Ctrl + Z followed by Enter (on Windows).
Using the Ollama API
Ollama includes a powerful built-in API. When Ollama runs, it automatically launches a local server, typically at http://localhost:11434, which exposes a REST API. This API enables seamless integration of your local LLMs into custom applications via simple HTTP requests.
Basic API Interaction with cURL
You can test the API using curl from your terminal:
curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?"
}'
The response will be a stream of JSON objects, each containing a part of the model’s answer.
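Each line of that stream is a standalone JSON object carrying a fragment of text in its response field, with a done flag marking the final chunk. The fragments can be reassembled client-side; here is a minimal sketch (the helper name and the sample data are mine, for illustration):

```python
import json

def assemble_stream(lines):
    """Concatenate the 'response' fragments from a stream of
    newline-delimited JSON objects, as returned by /api/generate."""
    parts = []
    for line in lines:
        if not line.strip():
            continue
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):  # the final object signals completion
            break
    return "".join(parts)

# Simulated stream, one JSON object per line (same shape as the API output):
sample = [
    '{"model":"llama2","response":"The sky ","done":false}',
    '{"model":"llama2","response":"is blue.","done":true}',
]
print(assemble_stream(sample))  # -> The sky is blue.
```

In a real client you would feed this function the lines of the HTTP response body instead of the simulated list.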
Python Example for API Interaction
Here’s a simple Python script to interact with your local Ollama API. You’ll need the requests library (pip install requests).
import requests
import json

def generate_response(prompt, model="llama2"):
    url = "http://localhost:11434/api/generate"
    headers = {'Content-Type': 'application/json'}
    data = {
        "model": model,
        "prompt": prompt,
        "stream": False  # Set to True for streaming responses
    }
    try:
        response = requests.post(url, headers=headers, data=json.dumps(data))
        response.raise_for_status()  # Raise an HTTPError for bad responses (4xx or 5xx)
        # If not streaming, the response is a single JSON object
        return response.json()['response']
    except requests.exceptions.RequestException as e:
        print(f"Error communicating with Ollama API: {e}")
        return None

if __name__ == "__main__":
    user_prompt = "Write a short poem about a sunny day."
    print(f"Prompt: {user_prompt}")
    llm_response = generate_response(user_prompt, model="mistral")  # You can change the model here
    if llm_response:
        print("\nLLM Response:")
        print(llm_response)

    user_prompt_2 = "Give me three ideas for a simple Python CLI tool."
    print(f"\nPrompt: {user_prompt_2}")
    llm_response_2 = generate_response(user_prompt_2)
    if llm_response_2:
        print("\nLLM Response:")
        print(llm_response_2)
This script demonstrates how to send a prompt to your local Ollama instance and receive a response. You can adapt this for various applications, from chatbots to content generators.
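For multi-turn conversations, Ollama also exposes a /api/chat endpoint that accepts a list of role-tagged messages instead of a single prompt string. A minimal sketch of the request shape follows; the helper function name is mine, and the message contents are placeholders:

```python
def build_chat_payload(messages, model="llama2", stream=False):
    """Build a request body for Ollama's /api/chat endpoint.
    Each message is a dict with a 'role' ('system', 'user', or
    'assistant') and a 'content' string."""
    return {"model": model, "messages": messages, "stream": stream}

payload = build_chat_payload([
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Name one benefit of local LLMs."},
])
# Send it with: requests.post("http://localhost:11434/api/chat", json=payload)
```

Appending each assistant reply back onto the messages list is what gives the model conversational memory across turns.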
Verification & Monitoring
Keeping an eye on your Ollama instance ensures everything is running smoothly and helps with troubleshooting.
Checking Ollama Service Status
It’s helpful to know if the Ollama service is active in the background.
On Linux
systemctl status ollama
You should see output indicating an active (running) status.
On macOS
Ollama usually runs as a user agent. You can often see its status by checking processes or looking in Activity Monitor. For a quick check, see if the API endpoint is responding:
curl http://localhost:11434
If Ollama is active, you’ll receive a short plain-text reply: Ollama is running. Any HTTP response at all, even an error status for an invalid path, confirms the server is listening; only a connection failure means it isn’t running.
On Windows
Check the Windows Services manager (search for “Services” in the Start menu) and look for an Ollama service. You can also verify the API endpoint as shown for macOS.
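The same "is anything answering on port 11434?" check works from code on any platform, which is handy before your application starts issuing requests. A small sketch using the requests library (the function name and port-probing approach are mine):

```python
import requests

def ollama_is_up(base_url="http://localhost:11434", timeout=2.0):
    """Return True if an HTTP server answers at the Ollama address.
    Any HTTP response (even an error status) means the server is
    listening; only a connection failure means it is not."""
    try:
        requests.get(base_url, timeout=timeout)
        return True
    except requests.exceptions.RequestException:
        return False

print(ollama_is_up())  # True if Ollama is running locally
```

Calling this once at startup gives your application a clear error path ("start Ollama first") instead of a confusing mid-request failure.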
Listing Installed Models
To see which models you’ve downloaded and are available to run:
ollama list
This command displays each model’s name, ID, size, and when it was last modified:
NAME            ID            SIZE    MODIFIED
llama2:latest   f6b15d2a...   3.8 GB  5 minutes ago
mistral:latest  294e7737...   4.1 GB  2 hours ago
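If you want to use this inventory from a script, the tabular output can be parsed into dictionaries. The sketch below splits columns on runs of two or more spaces; the helper name and sample text are mine, and the column format is an assumption based on the output shown above:

```python
import re

def parse_model_list(output):
    """Parse the table printed by `ollama list` into a list of dicts,
    keyed by the lower-cased column headers."""
    lines = [l for l in output.strip().splitlines() if l.strip()]
    headers = [h.lower() for h in re.split(r"\s{2,}", lines[0])]
    return [dict(zip(headers, re.split(r"\s{2,}", l))) for l in lines[1:]]

sample = """NAME            ID            SIZE    MODIFIED
llama2:latest   f6b15d2a...   3.8 GB  5 minutes ago
mistral:latest  294e7737...   4.1 GB  2 hours ago"""

models = parse_model_list(sample)
print(models[0]["name"])  # -> llama2:latest
```

In practice you would capture the command's output with subprocess and feed it to this parser.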
Monitoring Resource Usage
LLMs can be resource-intensive, especially on consumer hardware. Keeping an eye on your system’s CPU, GPU (if available and utilized by Ollama), and RAM usage is crucial.
On Linux/macOS
Use tools like htop (brew install htop or sudo apt install htop if not installed) or top in your terminal:
htop
Locate processes named ollama or those related to your active model. Monitor the CPU and memory columns. If you have a compatible GPU, Ollama will attempt to offload computations to it. You can monitor this GPU usage with specific tools, such as nvidia-smi for NVIDIA GPUs.
On Windows
Open Task Manager (Ctrl + Shift + Esc) and navigate to the “Processes” or “Performance” tab. You can monitor CPU, Memory, and GPU usage there. Look for the Ollama process.
Should you face performance issues or out-of-memory errors, consider downloading smaller models; many are published in several sizes, such as `llama2:7b` versus `llama2:13b`. Alternatively, ensure your system meets the recommended specifications for the models you intend to run.
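As a rough back-of-the-envelope check before downloading, a quantized model’s footprint scales with its parameter count times the bits stored per weight. The helper below is an illustrative estimate of mine, not an official formula; real file sizes run somewhat higher due to metadata and non-quantized layers (Llama 2 7B, for instance, ships at about 3.8 GB):

```python
def approx_model_size_gb(n_params, bits_per_weight=4):
    """Rough rule of thumb: a quantized model occupies about
    n_params * bits_per_weight / 8 bytes, plus some overhead."""
    return n_params * bits_per_weight / 8 / 1e9

print(f"{approx_model_size_gb(7e9):.1f} GB")   # ~3.5 GB for a 4-bit 7B model
print(f"{approx_model_size_gb(13e9):.1f} GB")  # ~6.5 GB for a 4-bit 13B model
```

Comparing that estimate against your free RAM (and VRAM, if offloading to a GPU) is a quick way to rule out models your hardware cannot hold.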
Running LLMs locally with Ollama places powerful AI capabilities directly on your desktop. This approach empowers you to learn, develop, and innovate with language models, all while retaining full control and privacy over your data.

