Build Your Own Local ChatGPT - Part 3: Voice, Vision, and Memory

Author’s note: yes, this header image was once again generated by my own personal GenAI assistant. I swear it is getting slightly better each iteration.

Welcome back to Part 3 of our journey to build a self-hosted, ChatGPT-style AI assistant. In Part 1 we got the hardware and base stack running (Ubuntu, Docker, Portainer, Ollama, OpenWebUI). In Part 2 we dove into models - where to find them, how to pick them, and what quantization actually means for your VRAM budget.

By this point, you should have a perfectly functional chat interface talking to a local LLM. But let’s be honest: a plain text chat window is only a fraction of what makes ChatGPT feel like “the future.” What really sells the experience is the stuff around the chat: being able to talk to it, having it talk back, letting it see the web, generating images on demand, and having it remember you between sessions.

That’s what this part is about. We’re going to layer in speech-to-text, text-to-speech, web search, image generation via ComfyUI, and finally, auto-memory.

Fair warning: this is the longest installment yet, because this is where things get fun.

Speech-to-Text (STT)

Let’s start with getting your voice into the chat. Here’s the thing: OpenWebUI already ships with Whisper built in. You don’t need to set up a separate container to get speech recognition working. Just go to Admin Panel -> Settings -> Audio, make sure the STT engine is set to “Whisper”, pick a model size, and the microphone icon in the chat composer will just work. That’s it.

So why am I even writing this section? Because the built-in Whisper runs on your CPU, and on a large model it is slow. We’re talking several seconds of wait before your words appear as text. If you’re on a fast machine with spare VRAM, offloading Whisper to its own GPU-accelerated container is worth the extra setup.

The easiest way to do that is to run faster-whisper behind a small OpenAI-compatible API wrapper. There are several community images for this, but I’ve had the best luck with onerahmet/openai-whisper-asr-webservice.

Add this service to your existing Portainer stack:

whisper:
    container_name: whisper
    image: onerahmet/openai-whisper-asr-webservice:latest-gpu
    restart: unless-stopped
    gpus: all
    environment:
    - ASR_MODEL=large-v3
    - ASR_ENGINE=faster_whisper
    volumes:
    - whisper_cache:/root/.cache
    ports:
    - "9000:9000"
    networks:
    - ollama-network

And don’t forget to add whisper_cache: under the volumes: block at the bottom of your compose file.

The first time this container starts, it will download the Whisper model (large-v3 is around 3GB and uses roughly 4-6GB of VRAM during inference). If you’re tight on VRAM, drop down to medium or small; the accuracy hit is real but manageable for short voice commands.

Once it’s running, go back to Admin Panel -> Settings -> Audio and switch the STT engine from “Whisper” to “OpenAI” (yes, counterintuitive, but this just means “OpenAI-compatible API”), and point it at http://whisper:9000/v1. OpenWebUI appends /audio/transcriptions to that base URL, so the wrapper has to answer on that path. Save, then test it by clicking the microphone icon in the chat composer.
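If the microphone icon does nothing useful, it helps to take OpenWebUI out of the loop and hit the endpoint yourself. Here is a minimal sketch using only the Python standard library. It assumes the container really does serve the OpenAI-style path /v1/audio/transcriptions (worth confirming for whichever image you run), that you call it from the Docker host on the published port 9000, and that the wrapper accepts or ignores the model name "whisper-1". The function names are mine, not part of any library.

```python
import urllib.request
import uuid


def build_transcription_request(base_url, audio_bytes, filename="sample.wav", model="whisper-1"):
    """Build (but do not send) a multipart POST to <base_url>/audio/transcriptions."""
    boundary = uuid.uuid4().hex
    # Two form fields: the model name (assumed; local wrappers may ignore it)
    # and the audio file itself.
    head = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{model}\r\n"
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/audio/transcriptions",
        data=head + audio_bytes + tail,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def transcribe(base_url, wav_path):
    """Send a local audio file and return the raw response body as text."""
    with open(wav_path, "rb") as f:
        request = build_transcription_request(base_url, f.read())
    with urllib.request.urlopen(request, timeout=120) as response:
        return response.read().decode("utf-8")
```

Calling transcribe("http://localhost:9000/v1", "sample.wav") with a short recording should return JSON containing the transcript. A 404 here suggests the container does not expose that path, which is a different problem from a slow model.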

tl;dr: if you just want it to work, use the built-in Whisper. If you want it to be fast, run the separate container.

Text-to-Speech (TTS)

Now for the reverse direction. You want the model to actually speak its replies, not just render them as text.

There are several local TTS options, and I went through most of them before landing on one I liked:

Piper: extremely fast, CPU-friendly, but the voices are noticeably robotic. Fine for notifications, not great for conversation.
Coqui XTTS v2: high quality, supports voice cloning, but the project itself is in limbo and the container ecosystem around it has gotten messy.
Kokoro: my current favorite. Small (82M parameter) TTS model, astonishingly natural-sounding voices, and runs happily on a modest GPU or even CPU.

I’ll show the Kokoro setup here since it’s what I actually use day-to-day. The remsky/kokoro-fastapi image wraps it in an OpenAI-compatible endpoint, which is exactly what OpenWebUI wants.

kokoro:
    container_name: kokoro
    image: ghcr.io/remsky/kokoro-fastapi-gpu:latest
    restart: unless-stopped
    gpus: all
    ports:
    - "8880:8880"
    networks:
    - ollama-network

Back in OpenWebUI, go to Admin Panel -> Settings -> Audio, set the TTS engine to “OpenAI”, and point it at http://kokoro:8880/v1. For the API key, you can enter anything; Kokoro doesn’t check it, but OpenWebUI requires the field to be non-empty.

You’ll get a dropdown of available voices. My personal favorite is af_bella for general use; it has a warm, conversational quality that doesn’t feel like it’s reading a textbook at you. Try a few; it’s a matter of taste.
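To audition voices without clicking through the UI, you can call the speech endpoint directly. This is a rough sketch, assuming Kokoro is reachable from the Docker host on the published port 8880, that it accepts the OpenAI-style /v1/audio/speech request with the model name "kokoro", and that mp3 is an acceptable output format. The helper names and the placeholder API key are mine.

```python
import json
import urllib.request


def build_speech_request(base_url, text, voice="af_bella", model="kokoro", api_key="not-checked"):
    """Build (but do not send) a POST to <base_url>/audio/speech."""
    payload = {"model": model, "input": text, "voice": voice, "response_format": "mp3"}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Any non-empty key; the post notes that Kokoro does not check it.
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def synthesize(base_url, text, out_path="reply.mp3", voice="af_bella"):
    """Request speech for `text`, write it to `out_path`, return the byte count."""
    request = build_speech_request(base_url, text, voice=voice)
    with urllib.request.urlopen(request, timeout=60) as response:
        audio = response.read()
    with open(out_path, "wb") as f:
        f.write(audio)
    return len(audio)
```

Something like synthesize("http://localhost:8880/v1", "Hello from my own hardware.") should leave a playable reply.mp3 in the current directory. Swap the voice argument to compare voices side by side.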

At this point, you can click the mic, speak your prompt, and have the model respond out loud. Combined with a decent set of speakers, this genuinely starts to feel like Her.

Web Search

One of the biggest weaknesses of a purely local LLM is that its knowledge is frozen at training time. If you ask it what happened last Tuesday, you’re going to get a polite shrug or, worse, a confident hallucination.

OpenWebUI has first-class support for web search, and the options have gotten much better over the last year. The easiest to self-host is SearXNG, a metasearch engine that aggregates results from Google, Bing, DuckDuckGo, and others without exposing you to any of them directly.

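Add this to your stack, and declare searxng_config: under the volumes: block at the bottom of your compose file, just as you did for whisper_cache: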

searxng:
    container_name: searxng
    image: searxng/searxng:latest
    restart: unless-stopped
    environment:
    - SEARXNG_BASE_URL=http://searxng:8080/
    - INSTANCE_NAME=local-searxng
    volumes:
    - searxng_config:/etc/searxng
    ports:
    - "8888:8080"
    networks:
    - ollama-network

After it starts, you’ll need to edit the generated settings.yml inside the searxng_config volume to enable the JSON output format that OpenWebUI needs. Find the formats: section and add json to the list:

search:
  formats:
    - html
    - json

Restart the container, then head into OpenWebUI under Admin Panel -> Settings -> Web Search. Enable it, choose SearXNG as the engine, and set the query URL to http://searxng:8080/search?q=<query>.
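Before blaming OpenWebUI for empty search results, it is worth confirming that SearXNG actually returns JSON. A small sketch, assuming you run it on the Docker host against the published port 8888 and that the response carries a top-level "results" list whose entries have "title" and "url" keys. The function names are mine.

```python
import json
import urllib.parse
import urllib.request


def build_search_url(base_url, query):
    """Return the SearXNG search URL with JSON output requested."""
    params = urllib.parse.urlencode({"q": query, "format": "json"})
    return base_url.rstrip("/") + "/search?" + params


def extract_results(json_text, limit=5):
    """Pull (title, url) pairs out of a SearXNG JSON response body."""
    data = json.loads(json_text)
    return [(r.get("title", ""), r.get("url", "")) for r in data.get("results", [])[:limit]]


def search(base_url, query):
    """Run one query against SearXNG and return the top results."""
    with urllib.request.urlopen(build_search_url(base_url, query), timeout=30) as response:
        return extract_results(response.read().decode("utf-8"))
```

Running search("http://localhost:8888", "local llm") should return a handful of title and URL pairs. If the request is rejected instead (typically with a 403), the json entry in settings.yml most likely did not take effect, or the container was not restarted.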

Once enabled, there will be a little globe icon in your chat composer. Toggle it on before sending a message and the model will search the web, feed the results into its context, and answer based on what it found. This alone fixes probably 70% of the “my local model feels dumb” complaints I had before setting it up.

If you want something more powerful, paid APIs like Tavily or Brave Search plug in just as easily via their API keys, and give substantially better results than SearXNG for complex queries. I use Tavily for anything technical and SearXNG for everything else.

Image Generation with ComfyUI

Alright, this one is the most involved, but also the most impressive when you get it working. We want to be able to say “draw me a watercolor of a sleepy corgi” in chat and have an image come back.

The cleanest local stack for this is ComfyUI, a node-based interface for running Stable Diffusion and its many descendants (SDXL, Flux, SD3, etc.). OpenWebUI has native support for ComfyUI as an image backend.

comfyui:
    container_name: comfyui
    image: yanwk/comfyui-boot:cu124-slim
    restart: unless-stopped
    gpus: all
    volumes:
    - comfyui_data:/root
    ports:
    - "8188:8188"
    networks:
    - ollama-network

As with the other services, declare comfyui_data: under the volumes: block at the bottom of your compose file. The first startup takes a while because ComfyUI will pull down its dependencies and a base model. Once it’s up, visit http://<host>:8188 and you’ll see the raw ComfyUI node editor. Don’t worry, you’ll never need to touch it if you don’t want to.

The real work is picking a model. You’ll want to drop a checkpoint into the comfyui_data/ComfyUI/models/checkpoints/ directory. My recommendations, in order of preference:

Flux.1 Dev: current state of the art for prompt adherence. Needs around 12-16GB VRAM, though quantized GGUF variants will run on 8GB.
SDXL Base + Refiner: older but still very capable, runs happily on 8GB VRAM.
SD 1.5: tiny, fast, and the quality is genuinely not great by 2026 standards. Only use this if you’re on something like a 4GB card.
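The list above boils down to a simple decision on how much VRAM is left once everything else is loaded. Here is that decision as a tiny sketch. The thresholds are just the rough figures from the list, and the card size and resident model sizes in the example are made-up numbers, so treat the output as a starting point rather than a guarantee.

```python
def free_vram_gb(card_gb, resident_gb):
    """VRAM left after everything that stays loaded (LLM, Whisper, and so on)."""
    return card_gb - sum(resident_gb)


def pick_image_model(free_gb):
    """Map free VRAM in GB to the recommendations above, best option first."""
    if free_gb >= 12:
        return "Flux.1 Dev"
    if free_gb >= 8:
        return "SDXL Base + Refiner, or a quantized GGUF build of Flux.1 Dev"
    if free_gb >= 4:
        return "SD 1.5"
    return None  # Below roughly 4 GB, none of the listed options is a comfortable fit.


if __name__ == "__main__":
    # Hypothetical 24 GB card with a 9 GB LLM and a 5 GB Whisper model resident.
    left = free_vram_gb(24, [9, 5])
    print(left, "GB free ->", pick_image_model(left))
```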

After dropping your checkpoint in, restart the container and go back to OpenWebUI. Admin Panel -> Settings -> Images. Set the engine to ComfyUI, the URL to http://comfyui:8188, and pick your model from the dropdown.

Now in chat, you can type something like “generate an image of a foggy mountain pass at dawn” and the model will call ComfyUI, which will produce the image and drop it inline in the conversation. It is legitimately magical the first time it works.

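One footgun to be aware of: image generation is VRAM-hungry. If you’re already loading a 13B+ model into Ollama, ComfyUI may not have enough memory to run. You have three options: shrink the LLM, shrink the image model, or stop Ollama from holding on to VRAM. By default Ollama keeps a model loaded for a few minutes after the last request. Setting OLLAMA_KEEP_ALIVE=0 on the Ollama container makes it unload the model as soon as each response finishes, which frees the VRAM for ComfyUI at the cost of a reload delay on your next prompt.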
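If you would rather keep the LLM warm most of the time and only evict it right before a big image job, Ollama's API accepts a keep_alive value per request: sending a generate request with no prompt and keep_alive set to 0 asks it to unload that model. A minimal sketch, assuming Ollama is reachable on its default port 11434 from wherever you run this. The helper names are mine, and the model name is whatever you pulled in Part 2.

```python
import json
import urllib.request


def build_unload_request(ollama_url, model):
    """Build (but do not send) a request asking Ollama to unload `model` now."""
    # No prompt plus keep_alive=0 means "load nothing new, drop this model".
    payload = {"model": model, "keep_alive": 0}
    return urllib.request.Request(
        url=ollama_url.rstrip("/") + "/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def unload_model(ollama_url, model):
    """Send the unload request and return Ollama's JSON reply as a dict."""
    with urllib.request.urlopen(build_unload_request(ollama_url, model), timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```

This keeps chat snappy between image jobs, since the model only gets evicted when you explicitly ask, instead of after every single reply.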

Auto-Memory

This is the feature that really tipped the whole setup over the edge from “neat toy” to “thing I actually use every day.”

Out of the box, every new conversation with your local LLM is a blank slate. It doesn’t remember that you’re a vegetarian, or that you have a dog named Biscuit, or that you prefer terse answers over verbose ones. Every conversation starts from zero.

OpenWebUI has a built-in memory feature (found under User Settings -> Personalization -> Memory) where you can manually add facts. These get injected into every chat automatically - useful, but manually curating a memory list is tedious and you’ll inevitably forget to do it.

The real unlock is one setting: go to your model’s configuration and set Function Calling to Native. That’s it. With native function calling enabled, the model can automatically read from and write to your memories mid-conversation. Say something worth remembering, and it’ll save it. It’ll also reach into your notes and knowledge bases when relevant, without you having to tell it to. The whole thing becomes a lot more alive.

I spent a while messing around with community filter functions that attempted to do this same thing in a more manual, fragile way. They mostly worked. Until they didn’t. Missed memories, duplicate entries, wrong model invocations. Native function calling sidesteps all of that by letting the model itself decide when to use these tools, which is both more reliable and more natural.
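To make “letting the model itself decide” a bit more concrete, here is a toy sketch of what the receiving end of a tool call could look like. This is not OpenWebUI's actual code or API: the tool names add_memory and search_memories, the call format, and the MemoryStore class are all hypothetical, invented purely to illustrate the idea. The model emits a structured call, a dispatcher routes it, and the store refuses duplicates.

```python
class MemoryStore:
    """Hypothetical in-memory fact store with simple de-duplication."""

    def __init__(self):
        self._facts = []

    @staticmethod
    def _normalize(text):
        # Case- and whitespace-insensitive comparison key.
        return " ".join(text.lower().split())

    def add(self, text):
        """Store a fact; return False if an equivalent fact already exists."""
        key = self._normalize(text)
        if any(self._normalize(f) == key for f in self._facts):
            return False
        self._facts.append(text.strip())
        return True

    def search(self, query):
        """Return facts sharing at least one word with the query."""
        words = set(query.lower().split())
        return [f for f in self._facts if words & set(f.lower().split())]


def handle_tool_call(store, call):
    """Route one model-issued tool call (hypothetical format) to the store."""
    name, args = call["name"], call.get("arguments", {})
    if name == "add_memory":
        return {"saved": store.add(args["text"])}
    if name == "search_memories":
        return {"matches": store.search(args["query"])}
    return {"error": f"unknown tool: {name}"}
```

The point of the sketch is the division of labor: the model decides when a fact is worth a call, and the plumbing underneath only has to store, de-duplicate, and look things up.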

A couple of things worth knowing:

Not all models support native function calling well. Smaller 7B models can be hit-or-miss. In my experience, anything 14B and above handles it reliably.
It’s worth reviewing your saved memories occasionally. The model is pretty good at knowing what to keep, but it will sometimes save things that felt important in context and look silly a week later (“user is stressed about a dentist appointment”). You can view and prune your memory list anytime under User Settings -> Personalization -> Memory.

Pulling It All Together

If you’ve followed along through all three parts, your stack now does all of this:

Chats with a local LLM of your choice
Listens to you speak via Whisper
Talks back via Kokoro
Searches the web via SearXNG
Generates images via ComfyUI
Remembers facts about you between sessions

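All running on your own hardware, with nothing leaving your network except the explicit web search queries. SearXNG forwards those without your cookies or browser profile, though the upstream engines still see your server’s IP address, and anything you send through a paid API like Tavily goes to that provider directly.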

Your compose file is probably getting long at this point. Mine is pushing 200 lines. It’s worth splitting into multiple Portainer stacks grouped by purpose - core (Ollama, OpenWebUI), audio (Whisper, Kokoro), tools (SearXNG, ComfyUI). That way you can restart one group without knocking the others offline.

What’s Next

In the final installment, Part 4, I’ll cover the stuff that’s less about the core capabilities and more about living with the thing long-term: exposing it securely over the internet with NGINX as a reverse proxy, multi-user setup for family members, GPU sharing strategies when you want to run more than the VRAM can hold, and my honest-to-god assessment of how this whole local AI stack stacks up against GPT-5 and Claude after six months of daily use.

Short version of the spoiler: it’s not as good. But the gap is much, much smaller than you’d think, and the privacy and control trade-offs are, for me, overwhelmingly worth it.

See you in Part 4.