Introduction: Bringing LLMs Home
Welcome back, future AI architect! So far in our any-llm journey, we’ve largely focused on interacting with powerful cloud-based LLMs like OpenAI, Anthropic, or Mistral. These services are incredible for their scale and performance, but what if you need more privacy, lower latency, or simply want to experiment without incurring API costs?
This chapter is all about bringing the power of Large Language Models directly to your machine. We’ll dive into the exciting world of Local LLMs and learn how to run them efficiently using a fantastic tool called Ollama. Best of all, we’ll see how any-llm seamlessly integrates with Ollama, allowing you to switch between local and cloud models with minimal code changes. Pretty neat, right?
By the end of this chapter, you’ll be able to:
- Understand the benefits of running LLMs locally.
- Install and set up Ollama on your system.
- Download and manage various open-source LLMs using Ollama.
- Integrate `any-llm` with your local Ollama instance to perform completions.
- Confidently run and experiment with LLMs right on your own hardware!
Ready to take control of your LLMs? Let’s dive in!
Core Concepts: Why Go Local?
Before we start typing commands, let’s understand why running LLMs locally is such a powerful option and how Ollama makes it accessible.
The Appeal of Local LLMs
While cloud-based LLM services offer convenience and scalability, local LLMs provide distinct advantages:
- Privacy and Security: Your data never leaves your machine. This is crucial for sensitive applications or when working with proprietary information.
- Cost-Effectiveness: Once the model is downloaded, there are no per-token API costs. Your only “cost” is your hardware’s power consumption.
- Low Latency: Interactions with a local model often feel snappier because requests don’t need to travel over the internet.
- Offline Capability: No internet? No problem! Your local LLM will still be there for you.
- Customization and Control: You have more direct control over the model’s environment and can more easily explore fine-tuning or specialized deployments.
Introducing Ollama: Your Local LLM Companion
Running LLMs locally used to be a complex dance of dependencies, CUDA setups, and specific model formats. Enter Ollama!
Ollama is a fantastic open-source tool that simplifies running large language models on your local machine. It provides a simple command-line interface and a local API server that handles:
- Model Management: Easily download, pull, and manage various open-source LLMs (like Llama 3, Mistral, Gemma, etc.).
- GPU Acceleration: Automatically leverages your GPU (if available and configured) for faster inference, falling back to CPU if not.
- API Endpoint: Exposes a simple HTTP API (usually on `localhost:11434`) that other applications, like `any-llm`, can connect to.
Think of Ollama as a friendly local server that speaks “LLM.” It takes care of the heavy lifting so you can focus on building your applications.
Figure 11.1: How any-llm interacts with Ollama for local LLM inference.
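Under the hood, that endpoint is ordinary HTTP. Here's a sketch of calling it directly with only the standard library — the `/api/generate` route, payload fields, and `"response"` key follow Ollama's API, but double-check them against the current documentation before relying on them:

```python
import json
import urllib.request

# Sketch: talking to Ollama's HTTP API directly, without any client library.
# Assumes the default /api/generate endpoint on localhost:11434.
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(model: str, prompt: str) -> bytes:
    """Build the JSON body Ollama expects for a single (non-streaming) generation."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")

def ask_ollama(model: str, prompt: str, timeout: float = 60.0) -> str:
    """Send a prompt to the local Ollama server and return the generated text."""
    request = urllib.request.Request(
        OLLAMA_URL,
        data=build_payload(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]

# The payload itself is plain JSON, so we can inspect it without a running server:
payload = json.loads(build_payload("llama3", "What is the capital of France?"))
print(payload["model"])   # llama3
print(payload["stream"])  # False
```

In practice you won't need any of this: `any-llm` speaks this protocol for you, which is exactly the point of the next section.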
any-llm and Ollama: A Seamless Integration
This is where any-llm shines! Just as it abstracts various cloud providers, any-llm also provides a unified interface to interact with your local Ollama server. You don’t need to learn Ollama’s specific API; you just tell any-llm to use the "ollama" provider, and it handles the rest. This means you can develop your application logic once and effortlessly switch between cloud and local models as needed!
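To make the "develop once, switch anytime" idea concrete, here's a tiny sketch of selecting a provider/model pair at runtime and building the keyword arguments for a `completion()` call. The cloud entry is a made-up placeholder; only the Ollama pair comes from this chapter:

```python
# Sketch: the same application logic targeting either a local or a cloud model.
# The "cloud" provider/model pair below is an illustrative placeholder only.
PROVIDERS = {
    "local": {"provider": "ollama", "model": "llama3"},
    "cloud": {"provider": "mistral", "model": "mistral-small-latest"},
}

def completion_kwargs(target: str, prompt: str) -> dict:
    """Build keyword arguments for an any-llm completion() call for a given target."""
    config = PROVIDERS[target]
    return {**config, "prompt": prompt}

kwargs = completion_kwargs("local", "Hello!")
print(kwargs)  # {'provider': 'ollama', 'model': 'llama3', 'prompt': 'Hello!'}
```

The rest of your application never needs to know which target was chosen — it just passes the resulting kwargs to `completion()`.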
Step-by-Step Implementation: Getting Hands-On
Let’s get our hands dirty and set up Ollama, pull a model, and then connect any-llm to it.
Step 1: Install Ollama
First, we need to install Ollama itself. Ollama evolves rapidly, with stable releases frequently adding new models and features, so always check the official Ollama website for the latest stable release and installation instructions specific to your operating system.
Visit the Official Ollama Website: Navigate your web browser to ollama.com.
Download the Installer: On the homepage, you’ll find download options for macOS, Windows, and Linux. Choose the one appropriate for your system.
Run the Installer:
- macOS: Open the downloaded `.dmg` file and drag the Ollama application to your Applications folder, then run it.
- Windows: Run the downloaded `.exe` installer and follow the prompts.
- Linux: Open your terminal and follow the instructions on the Ollama website. Typically this involves a `curl` command that downloads and runs an installation script, for example:

curl -fsSL https://ollama.com/install.sh | sh
Once installed, Ollama will usually start a background service automatically. You can verify it’s running by opening a new terminal and typing:
ollama --version

You should see output similar to `ollama version is 0.1.25` (or a newer version). If you encounter issues, refer to the official Ollama documentation for troubleshooting.

Official download link: https://ollama.com/download
Step 2: Pull Your First Local LLM
With Ollama installed, let’s download an actual LLM. We’ll start with llama3, a powerful and popular open-source model.
Open your terminal and type:
ollama pull llama3
You’ll see a progress bar as Ollama downloads the model layers. This might take a few minutes depending on your internet speed and the model size. llama3 is a significant download, often several gigabytes.
Once downloaded, you can list your available models:
ollama list
You should see llama3 (and potentially others if you’ve pulled them before) listed.
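If you ever want your application to check what's installed before calling a model, the output of `ollama list` can be parsed programmatically. A minimal sketch, assuming the tabular layout with the model name in the first column — the sample rows and IDs below are made up:

```python
# Sketch: extracting model names from `ollama list`-style output.
# Assumes the first whitespace-separated column is the model name and the
# first row is a header; the sample data below is illustrative only.
sample_output = """\
NAME            ID              SIZE    MODIFIED
llama3:latest   365c0bd3c000    4.7 GB  2 days ago
mistral:latest  61e88e884507    4.1 GB  5 hours ago
"""

def installed_models(listing: str) -> list[str]:
    """Return model names from `ollama list`-style output, skipping the header."""
    lines = listing.strip().splitlines()[1:]  # drop the header row
    return [line.split()[0] for line in lines if line.strip()]

print(installed_models(sample_output))  # ['llama3:latest', 'mistral:latest']
```

In a real script you might feed this the output of `subprocess.run(["ollama", "list"], capture_output=True, text=True).stdout` to verify a model is pulled before calling it.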
Step 3: Install any-llm with Ollama Support
If you haven’t already, install any-llm with its Ollama extras so all the necessary components are available.
pip install 'any-llm-sdk[ollama]'
This command installs the core any-llm-sdk along with the specific dependencies needed to communicate with Ollama.
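A quick sanity check that the package is importable, without actually calling a model, can be done with the standard library (the `any_llm` module name matches the import used later in this chapter):

```python
import importlib.util

# Sketch: checking that a package is importable before running the app.
def sdk_installed(module: str = "any_llm") -> bool:
    """Return True if the given module can be found on the current Python path."""
    return importlib.util.find_spec(module) is not None

print(sdk_installed())  # True once `pip install 'any-llm-sdk[ollama]'` has run
```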
Step 4: Interact with Ollama using any-llm
Now for the fun part! Let’s write some Python code to talk to our local llama3 model via any-llm.
Create a new Python file, say local_llm_app.py.
First, we’ll import completion from any_llm and make a simple request.
```python
# local_llm_app.py
from any_llm import completion

def run_local_completion():
    print("--- Running Local LLM Completion (Llama 3 via Ollama) ---")

    # We specify the provider as 'ollama' and the model as 'llama3'.
    # The 'llama3' model name corresponds to the one we pulled with 'ollama pull llama3'.
    response = completion(
        provider="ollama",
        model="llama3",
        prompt="What is the capital of France?",
        temperature=0.7,  # A bit of creativity, but keep it factual for this query
    )

    # The response object is similar to what you get from other providers
    print(f"Model: {response.model}")
    print(f"Provider: {response.provider}")
    print(f"Response: {response.text}")

if __name__ == "__main__":
    run_local_completion()
```
Explanation of the code:
- `from any_llm import completion`: Imports the core function for making LLM requests.
- `provider="ollama"`: This is the magic! It tells `any-llm` to route this request to your local Ollama server.
- `model="llama3"`: Specifies which local model Ollama should use. It must match a model you’ve pulled (e.g., `ollama pull llama3`).
- `prompt="What is the capital of France?"`: Your question for the LLM.
- `temperature=0.7`: Controls the randomness of the output. A value of 0 makes the output more deterministic; higher values make it more creative.
- `response.text`: Extracts the generated text from the LLM’s response.
Run this script from your terminal:
python local_llm_app.py
You should see output similar to:
--- Running Local LLM Completion (Llama 3 via Ollama) ---
Model: llama3
Provider: ollama
Response: Paris is the capital of France.
How cool is that? You just ran an LLM locally on your machine using any-llm!
Step 5: Streaming Responses from Local LLMs
Just like with cloud providers, any-llm supports streaming responses from Ollama. This is fantastic for user experience, as it allows you to display the LLM’s output as it’s generated, rather than waiting for the entire response.
Let’s modify our local_llm_app.py file to include streaming:
```python
# local_llm_app.py (continued)
from any_llm import completion

def run_local_completion():
    print("--- Running Local LLM Completion (Llama 3 via Ollama) ---")
    response = completion(
        provider="ollama",
        model="llama3",
        prompt="What is the capital of France?",
        temperature=0.7,
    )
    print(f"Model: {response.model}")
    print(f"Provider: {response.provider}")
    print(f"Response: {response.text}")

def run_local_streaming_completion():
    print("\n--- Running Local LLM Streaming Completion (Llama 3 via Ollama) ---")

    # Add stream=True to enable streaming
    streamed_response = completion(
        provider="ollama",
        model="llama3",
        prompt="Explain the concept of quantum entanglement in simple terms.",
        stream=True,  # This is the key for streaming!
        temperature=0.5,
    )

    print("Streaming Response (chunks):")
    full_response_text = ""
    for chunk in streamed_response:
        # Each chunk carries the newly streamed part in its 'text' attribute
        print(chunk.text, end="", flush=True)  # Print each chunk immediately
        full_response_text += chunk.text
    print("\n--- End of Stream ---")

    # You can still access the full response text if you concatenate the chunks
    # print(f"\nFull Streamed Response: {full_response_text}")

if __name__ == "__main__":
    run_local_completion()
    run_local_streaming_completion()
```
Explanation of the new code:
- `stream=True`: This crucial parameter tells `any-llm` to return a generator, yielding chunks of text as they become available.
- `for chunk in streamed_response:`: We iterate through the `streamed_response` object. Each `chunk` has a `text` attribute containing a small part of the LLM’s reply.
- `print(chunk.text, end="", flush=True)`: We print each `chunk.text` without a newline (`end=""`) and immediately flush the output buffer (`flush=True`) so it appears in real time.
Run this updated script:
python local_llm_app.py
You’ll now see the “Quantum entanglement” explanation appearing word by word or sentence by sentence, demonstrating the power of streaming!
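If you want to see the accumulate-while-printing pattern in isolation, here is the same loop run against a stand-in generator — `fake_stream` below is purely illustrative; the real chunks come from `completion(stream=True)`:

```python
# Sketch: the consume-and-accumulate streaming pattern, with a fake stream
# standing in for the generator a streaming completion returns.
def fake_stream():
    """Yield text chunks the way a streaming completion yields them."""
    for piece in ["Quantum ", "entanglement ", "links ", "particles."]:
        yield piece

full_text = ""
for chunk in fake_stream():
    print(chunk, end="", flush=True)  # display each piece as it "arrives"
    full_text += chunk                # and keep building the complete reply
print()

print(full_text)  # Quantum entanglement links particles.
```

The same two moves — print immediately, append to a buffer — give you both a responsive UI and the complete text at the end.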
Mini-Challenge: Explore Another Local Model
You’ve successfully integrated llama3 with any-llm locally. Now it’s your turn to explore!
Challenge:
- Pull a different model using Ollama. A good choice would be `mistral`.
- Modify your `local_llm_app.py` script to use this new `mistral` model for a streaming completion.
- Ask the `mistral` model a new, interesting question (e.g., “Write a short poem about a cat watching birds from a window.”).
- Observe the differences in output style or speed between `llama3` and `mistral`.
Hint: Remember to run `ollama pull mistral` in your terminal first! Then simply change the `model` parameter in your `any-llm` completion call.
What to Observe/Learn:
- How easy it is to switch between different local models using Ollama and `any-llm`.
- The distinct “personalities” or strengths of different open-source LLMs.
- The performance characteristics (speed) of different models on your hardware.
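For the speed comparison, a tiny timing harness helps. This sketch wraps any callable with `time.perf_counter`; the lambda below is just a stand-in for a real `any-llm` completion call:

```python
import time

# Sketch: a minimal timing harness for comparing models. `generate` stands in
# for whatever actually produces text (e.g. an any-llm completion call).
def timed(generate, *args):
    """Call generate(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = generate(*args)
    return result, time.perf_counter() - start

text, seconds = timed(lambda prompt: prompt.upper(), "hello")
print(text)  # HELLO
print(f"took {seconds:.6f}s")
```

Run the same prompt through `llama3` and `mistral` with this wrapper and compare the elapsed times on your hardware.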
Take your time, experiment, and have fun!
Common Pitfalls & Troubleshooting
Working with local LLMs can sometimes introduce unique challenges. Here are a few common issues and how to resolve them:
Ollama Server Not Running:
- Symptom: Your `any-llm` script might hang, time out, or throw connection errors (e.g., `ConnectionRefusedError`, `requests.exceptions.ConnectionError`).
- Cause: The Ollama background service isn’t active.
- Solution: Ensure Ollama is running. On macOS/Windows, check your system tray or applications folder. On Linux, you might need to start the service manually (e.g., `ollama serve` in a dedicated terminal window), or check systemd if you configured it as a service.
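A quick way to rule this out from Python is a plain TCP reachability check against Ollama's default port — a diagnostic sketch, not part of `any-llm`:

```python
import socket

# Sketch: a quick reachability check for the local Ollama server before making
# any completion calls. 11434 is Ollama's default port.
def ollama_is_up(host: str = "localhost", port: int = 11434, timeout: float = 1.0) -> bool:
    """Return True if something is accepting TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers ConnectionRefusedError, timeouts, DNS failures
        return False

print(ollama_is_up())  # True only while the Ollama service is running
```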
Model Not Downloaded or Incorrect Name:
- Symptom: `any-llm` might return an error like “model ‘xxx’ not found” or “Ollama: model ‘xxx’ not found.”
- Cause: You haven’t pulled the specified model with `ollama pull <model_name>`, or there’s a typo in the model name.
- Solution:
  - Run `ollama list` in your terminal to see all currently downloaded models and their exact names.
  - If the model isn’t listed, run `ollama pull <model_name>` (e.g., `ollama pull mistral`).
  - Double-check the `model` parameter in your `any-llm` `completion()` call to ensure it matches exactly.
Resource Constraints (CPU/RAM/GPU):
- Symptom: The LLM responses are extremely slow, your computer becomes unresponsive, or the Ollama process crashes.
- Cause: LLMs, especially larger ones, consume significant CPU, RAM, and GPU resources. Your system might not have enough.
- Solution:
  - Use smaller models: Ollama offers smaller and quantized variants via tags (e.g., `llama3:8b` rather than `llama3:70b`); `mistral` is generally lighter than `llama3`.
  - Close other applications: Free up RAM and GPU memory.
  - Check GPU drivers: Ensure your GPU drivers are up-to-date for optimal performance with Ollama.
  - Consider upgrading hardware: For serious local LLM work, a powerful CPU and a decent GPU (8GB+ VRAM recommended) are beneficial.
Network Configuration (Ollama API Port):
- Symptom: Even with Ollama running, `any-llm` can’t connect, reporting connection errors.
- Cause: Ollama is running on a non-default port (the default is `11434`), or a firewall is blocking the connection.
- Solution:
  - Ollama defaults to `localhost:11434`. If you’ve configured it differently (e.g., via the `OLLAMA_HOST` environment variable), `any-llm` needs to know; you can often specify the base URL for Ollama if needed, though `any-llm` usually finds it automatically.
  - Check your firewall settings to ensure `localhost:11434` is not blocked.
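If you do run Ollama on a non-default address, resolving the base URL from `OLLAMA_HOST` might look like the sketch below. How `any-llm` itself resolves the address is not shown in this chapter, so treat the fallback logic as illustrative:

```python
import os

# Sketch: resolving the Ollama base URL, honoring the OLLAMA_HOST environment
# variable and falling back to the documented default. Illustrative only.
DEFAULT_OLLAMA_URL = "http://localhost:11434"

def ollama_base_url() -> str:
    """Return OLLAMA_HOST if set, otherwise Ollama's default local address."""
    host = os.environ.get("OLLAMA_HOST", "").strip()
    if not host:
        return DEFAULT_OLLAMA_URL
    # Allow values like "0.0.0.0:8080" as well as full URLs.
    return host if host.startswith("http") else f"http://{host}"

os.environ.pop("OLLAMA_HOST", None)
print(ollama_base_url())  # http://localhost:11434

os.environ["OLLAMA_HOST"] = "0.0.0.0:8080"
print(ollama_base_url())  # http://0.0.0.0:8080
```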
Summary: Your Local AI Powerhouse
Congratulations! You’ve taken a significant step in your any-llm journey by mastering local LLM integration with Ollama. You now have the flexibility to run powerful AI models right on your machine, offering unparalleled privacy, cost control, and offline capabilities.
Here are the key takeaways from this chapter:
- Local LLMs offer benefits like privacy, cost savings, low latency, and offline access compared to cloud-based alternatives.
- Ollama simplifies the process of downloading, running, and managing open-source LLMs locally, providing a convenient API.
- `any-llm` seamlessly integrates with Ollama by simply setting `provider="ollama"` and specifying the model name.
- You learned to install Ollama, pull models like `llama3` and `mistral`, and perform both single-shot and streaming completions using `any-llm`.
- Troubleshooting common issues like the Ollama server not running or resource constraints is crucial for a smooth local LLM experience.
The ability to switch between cloud and local providers with any-llm empowers you to build robust and flexible AI applications. In the next chapter, we’ll delve deeper into more advanced any-llm features, such as handling different output formats and perhaps even exploring custom model configurations!
References
- Ollama Official Website: The primary resource for downloading, installing, and learning about Ollama. https://ollama.com/
- mozilla-ai/any-llm GitHub Repository: The official source for the `any-llm` library, including installation and usage details. https://github.com/mozilla-ai/any-llm
- Introducing any-llm: A unified API to access any LLM provider (Mozilla.ai Blog): Provides context and vision behind the `any-llm` project. https://blog.mozilla.ai/introducing-any-llm-a-unified-api-to-access-any-llm-provider/
This page is AI-assisted and reviewed. It references official documentation and recognized resources where relevant.