Author’s note: this header image was generated by my own personal GenAI assistant, and very poorly I might add.
Welcome back to the second part of our journey to build a self-hosted, ChatGPT-style AI assistant! In Part 1, we covered the foundational setup: getting Ubuntu, NVIDIA drivers, Docker, Portainer, OpenWebUI, and Ollama all up and running on your local machine. You now have a working chat interface, even if it’s currently running a basic model.
In this installment, our focus shifts to the “brains” of your local LLM setup: the models themselves. We’ll dive deep into different methods for acquiring these models and, crucially, how to make the right choices based on your available hardware. Getting this right is key to unlocking the full potential of your self-hosted AI.
Downloading Models to Ollama
One of the greatest advantages of running Ollama is its direct integration with a growing ecosystem of open-source models. You have several avenues for populating your local LLM library.
From Ollama’s Model Library
The easiest way to get started is by using models directly available from the Ollama library. These models are pre-packaged and optimized for Ollama, making installation a breeze.
You can browse the full list on Ollama’s website.
To download a model using the Ollama CLI (which you can access via Docker if Ollama is running in a container, as set up in Part 1), run ollama pull followed by the model’s name.
For example, to pull a popular model like Mistral 7B, the name to pass is mistral.
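As a minimal sketch of the pull step: the snippet below prints the commands instead of running them, using a small hypothetical `run` helper (not part of Ollama) so you can check them first. The container name `ollama` is an assumption; substitute whatever name you used in Part 1.

```shell
# Dry-run helper (hypothetical, for illustration): prints the command by
# default; set DRY_RUN=0 in your environment to actually execute it.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# General form: ollama pull <model-name>
run ollama pull mistral

# Same pull when Ollama runs inside a Docker container.
# The container name "ollama" is an assumption.
run docker exec -it ollama ollama pull mistral
```

With `DRY_RUN=0`, the same script performs the download, and `ollama list` should then show the model.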
Once downloaded, the model will be immediately available in your OpenWebUI instance. Alternatively, you can do this from the OpenWebUI interface: in the model selector at the top of the page, search for a model by name and click “Pull”.
Importing Models from Hugging Face
Hugging Face is the central hub for machine learning models, offering an enormous variety of LLMs. While Ollama’s library is convenient, tapping into Hugging Face gives you access to an even broader selection, including newer or more specialized models.
To use a Hugging Face model with Ollama, you’ll typically need to:
- Download the model file: Usually in GGUF format, a single-file format designed for efficient local inference (on CPU or GPU) and compatible with many local LLM runtimes, including Ollama.
- Create a Modelfile: Ollama uses “Modelfiles” to define how a model should be served. This is a simple text file that specifies the base model and any custom instructions or parameters (like system prompts, temperature settings, etc.).
- Import the model into Ollama: Using the ollama create command with your Modelfile.
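The Modelfile step might look roughly like the sketch below. The GGUF filename, the model name `my-model`, the system prompt, and the temperature are all placeholders chosen for illustration, not values from this series.

```shell
# Write a minimal Modelfile next to the downloaded GGUF file.
# Filename, temperature, and system prompt are placeholder assumptions.
cat > Modelfile <<'EOF'
FROM ./my-model.Q4_K_M.gguf
PARAMETER temperature 0.7
SYSTEM """You are a concise, helpful assistant."""
EOF

# Show what was written, then the command that would register it with Ollama.
cat Modelfile
echo "Next step: ollama create my-model -f Modelfile"
```

After `ollama create` finishes, the new name (here the hypothetical `my-model`) should appear alongside your other models.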
To make this a bit easier, Ollama has limited support for pulling Hugging Face models directly from the CLI, by prefixing the repository path with hf.co/. Note that this method only works for models in GGUF format.
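A small sketch of how such a reference is put together, assuming the `hf.co/<username>/<repository>` form with an optional `:<quantization>` tag. The helper `hf_ref` and the names `some-user` and `some-model-GGUF` are hypothetical; replace them with a real GGUF repository.

```shell
# Build an Ollama model reference for a GGUF repository on Hugging Face.
# Form: hf.co/<username>/<repository>[:<quantization>]
hf_ref() {
    user="$1"
    repo="$2"
    quant="${3:-}"
    if [ -n "$quant" ]; then
        echo "hf.co/$user/$repo:$quant"
    else
        echo "hf.co/$user/$repo"
    fi
}

# "some-user" and "some-model-GGUF" are hypothetical placeholders.
echo "ollama pull $(hf_ref some-user some-model-GGUF Q4_K_M)"
```

Leaving off the quantization tag lets the default file for that repository be chosen; naming it explicitly is useful once you know which Q-level your hardware can handle.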
Choosing the Right Model for Your Hardware
Selecting an LLM isn’t just about finding the “smartest” model; it’s about finding the right model that fits your hardware constraints and specific use cases. The primary factor here will be your GPU’s Video RAM (VRAM).
Understanding Model Sizes (Parameters)
LLMs are often described by their number of parameters (e.g., 7B, 13B, 70B). Generally, more parameters mean a more capable (smarter) model, but also significantly higher VRAM requirements.
| Model Size | Approx VRAM / RAM Needed | Typical Capabilities |
|---|---|---|
| 3B - 7B | 4GB-8GB VRAM / 8GB-16GB RAM | Good for basic tasks, summarization, creative text. Fast inference. |
| 13B - 20B | 10GB-16GB VRAM / 24GB-32GB RAM | Stronger reasoning, better code generation, more nuanced responses. Moderate inference speed. |
| 30B - 40B | 20GB-32GB VRAM / 48GB-64GB RAM | Very capable, approaching cloud models for many tasks. Slower inference. |
| 70B+ | 40GB+ VRAM / 128GB+ RAM | State-of-the-art local performance, often requiring high-end GPUs or multiple GPUs. Very slow on CPU. |
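The VRAM figures above can be sanity-checked with rough arithmetic: parameters times bits per weight, divided by 8 to get bytes, plus some headroom. The sketch below assumes a 1.2x headroom factor for context cache and runtime buffers; that factor is a guess for illustration, not a measured value, and real usage varies with context length.

```shell
# Rough memory estimate in GB: billions of parameters x bits per weight / 8,
# times a headroom factor. The 1.2 factor is an assumption, not a measurement.
estimate_gb() {
    awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 * 1.2 }'
}

# Exit status 0 if the estimate fits within the given VRAM budget (in GB).
fits_in_vram() {
    awk -v need="$(estimate_gb "$1" "$2")" -v have="$3" \
        'BEGIN { exit !(need <= have) }'
}

echo "7B at 4-bit needs roughly $(estimate_gb 7 4) GB"

if fits_in_vram 13 8 12; then
    echo "13B at 8-bit fits in 12 GB"
else
    echo "13B at 8-bit does not fit in 12 GB; try a lower Q-level"
fi
```

Treat the result as a first filter only: if a model is borderline, the practical test is to load it and watch actual memory use.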
GPU vs. CPU Inference
While Part 1 focused on NVIDIA GPUs for performance, it’s worth noting options for other setups:
- NVIDIA GPU: Still the gold standard for speed. Aim to load models that fit within your GPU VRAM, as this offers the fastest inference.
- AMD/Intel GPUs: Ollama is continuously improving support for other GPU manufacturers. Check Ollama’s documentation for the latest compatibility.
- CPU-Only Inference: If you lack a discrete GPU or sufficient VRAM, Ollama can run models entirely on your CPU. This will be significantly slower, but still functional. Look for smaller, highly quantized models for the best CPU performance.
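To see which of these cases applies to your machine, a quick check along these lines can help. It is a sketch that assumes an NVIDIA setup; the helper name `check_gpu` is made up for this example.

```shell
# Report GPU name and memory if an NVIDIA driver is installed; otherwise
# say so, since Ollama would then fall back to CPU inference.
check_gpu() {
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi --query-gpu=name,memory.total,memory.used \
            --format=csv,noheader \
            || echo "nvidia-smi is installed but failed; check the driver"
    else
        echo "nvidia-smi not found: expect CPU-only inference"
    fi
}

check_gpu
```

Once a model is loaded, `ollama ps` also reports whether it is running on the GPU, the CPU, or split across both.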
Quantization (Q-levels)
Quantization is a vital optimization technique for running large language models on consumer-grade hardware. In essence, it’s the process of reducing the precision of a model’s weights (the numerical values that define its learned knowledge).
Imagine a model that normally stores its weights using 16-bit floating-point numbers (FP16). Quantization might convert these to 8-bit, 4-bit, or even lower integer representations. This drastically reduces the model’s footprint in terms of both disk space and required VRAM/RAM for inference. Often, this can lead to significant improvements in inference speed with only a minor, sometimes imperceptible, drop in output quality.
Common quantization levels (often seen with GGUF models on Hugging Face, optimized for tools like Ollama) include:
- Q8_0: 8-bit quantization. Offers a good balance between size and quality, often very close to FP16 performance.
- Q5_K_M (or Q5_K_S): 5-bit quantization using medium (M) or small (S) K-quants. A popular sweet spot for many users, offering a great balance of size, speed, and quality. “K-quants” refer to a newer quantization method that uses different quantization types for different parts of the neural network, minimizing quality loss.
- Q4_K_M (or Q4_K_S): 4-bit quantization using K-quants. Provides further memory savings and speed boosts, with a slightly larger potential impact on quality compared to Q5, but still excellent for most use cases.
- Q2_K: 2-bit quantization (or similar very low bit rates). Offers maximum memory savings and speed but can lead to a more noticeable increase in perplexity (a measure of how well the model predicts text, where lower is better) and a drop in response coherence.
The “Q” stands for quantized, and the number after it is the approximate number of bits used per weight. The “_K_M” or “_K_S” suffix denotes specific, more advanced quantization algorithms (like those from Georgi Gerganov’s GGML/GGUF project) that aim to retain as much quality as possible at lower bitrates.
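To make the savings concrete, here is a rough sketch of weight sizes for a 7B model at each nominal bit width. These are back-of-the-envelope numbers: real GGUF files usually come out somewhat larger, because the formats store scaling data and K-quants keep some tensors at higher precision.

```shell
# Approximate weight size in GB: billions of parameters x nominal bits / 8.
# Nominal bit widths are a simplification of the real formats.
weights_gb() {
    awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

# Each entry is <label>:<nominal bits per weight>.
for level in FP16:16 Q8_0:8 Q5_K_M:5 Q4_K_M:4 Q2_K:2; do
    name="${level%%:*}"
    bits="${level##*:}"
    printf '%-7s ~%s GB\n' "$name" "$(weights_gb 7 "$bits")"
done
```

Change the `7` to your model’s parameter count (in billions) to get a comparable table for other sizes.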
Experimenting with different Q-levels is highly recommended to find the best balance for your specific hardware and desired output quality.
My Experience with Different Local LLMs
Beyond the technical specifications and hardware considerations, a significant part of choosing an LLM comes down to its personality and performance in real-world use. Having experimented with several models on my local setup, here are some of my personal observations and insights:
Google Models (e.g., Gemma): I’ve found that these models tend to have a very upbeat, affirmative, and eager-to-please personality. As the Gen Z’ers might call them, they can sometimes feel like “pick-mes”: always positive and trying to be agreeable. This can be fantastic for certain creative tasks or when you want a very encouraging tone, but it’s something to be aware of if you’re looking for more neutral or critical responses.
Llama Models: These are generally solid contenders for everyday queries and perform reliably across a broad range of tasks. However, the Llama models I’ve run sometimes struggle with up-to-date information. Quantization does not change what a model knows, so this most likely reflects an older training cutoff, and it leads to outdated responses on current events or rapidly evolving topics. If cutting-edge factual recall is paramount, a Llama model might not always be the first choice.
Qwen Reasoning Models: When it comes to pure reasoning capabilities, Qwen models have consistently impressed me. They are truly powerful in their ability to process complex prompts and deliver logical, structured answers. The trade-off, however, is often speed. These models can take a significant amount of time to “think” and generate responses, making them less ideal for real-time conversational flows but excellent for tasks where depth and accuracy are priorities.
Deepseek Reasoning Models: About a year ago, Deepseek’s reasoning models were at the forefront, offering truly remarkable performance. This highlights just how quickly the LLM landscape evolves. While they were once top-tier, the rapid release of new models means that Deepseek, though still capable, has fallen behind the curve. It serves as a stark reminder that what’s impressive today might be merely functional tomorrow.
These personal experiences underscore the importance of not just looking at benchmark numbers, but also considering the model’s “personality” and how its training and optimization impact its utility for your specific needs.
Summary
By understanding how to acquire models and select them wisely based on your hardware and quantization levels, you’re now empowered to truly customize your local LLM chatbot. In the next part of this series, we’ll explore advanced configurations and integrations, including local text-to-speech, speech-to-text, web search, image generation tools like ComfyUI, and techniques for auto-memory, to make your self-hosted AI even more powerful.
