I was approached by a prospective client to build a chat application that works with a locally hosted LLM. In this blog post, I’ll share my notes on exploring different approaches to serve a local LLM and to build a simple chat UI to interact with it.

Application architecture

Before we do any coding, let’s consider the various components we need and how they interact with each other:

  1. LLM: a self-hosted model that is suitable for chat applications. Use instruction-tuned models for better results, as they are trained to grasp user intent and respond conversationally.
  2. Backend application: a RESTful API that exposes our LLM functionality to the chat client.
  3. Frontend client: a simple web-based chat UI that interfaces with the backend application.

Architecture diagram of an AI chat application

We’ll containerize the application and use Docker Compose to easily run everything with a single command.

Setting up a new Python project

Create a new Python project using uv:

$ uv init llm-chat --python 3.13

Note that Python 3.13 is used even though 3.14 had been released at the time of writing, since many packages don’t yet support the latest version.

Organizing dependencies with uv workspace members

This is a good opportunity to set up the project as a workspace and try out different libraries in separate folders/packages called workspace members. It allows us to cleanly separate dependencies for each member while using a single venv for the entire project. Most IDEs (PyCharm/VS Code/etc.) expect a Python project to have a single venv, so this makes development easier while keeping dependencies organized.
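
By the end of this post, the project layout will look roughly like this (a single shared .venv at the root and one pyproject.toml per workspace member):

llm-chat/
├── pyproject.toml            # workspace root
├── .venv/                    # single shared virtual environment
├── playground/
│   ├── transformers/         # pyproject.toml + example.py
│   ├── vllm/                 # pyproject.toml + example.py
│   └── ollama/               # Dockerfile + docker-compose.yml
├── backend/                  # pyproject.toml, main.py, Dockerfile
├── frontend/                 # index.html, main.js, style.css
└── docker-compose.yml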

Running a local LLM

There are thousands of open source models that you can download from Hugging Face. Popular ones include Llama 3.2 from Meta, Gemma 3 from Google and Phi 4 from Microsoft. Note that some of these, while free to download, require a Hugging Face account and an access token.

For this demo application, I’m using Qwen2.5 0.5B Instruct as it’s relatively small (~1 GB in file size) with a good balance between performance and output quality.

Screenshot of the Qwen2.5 0.5B Instruct model card on the Hugging Face Hub

When you’re viewing the model card on the Hugging Face Hub, on the right side you’ll see a dropdown button labeled “Use this model”. Clicking this dropdown reveals several options including Transformers, vLLM and Text Generation Inference. Let’s start with the first one.

Transformers

Transformers is a Python library provided by Hugging Face for model inference and training. Real Python has an excellent introductory blog post on this library.

From the project root directory, create a new workspace member:

$ uv init playground/transformers --bare --name llm-chat-transformers

This creates a subdirectory playground/transformers with only a pyproject.toml file in it. uv automatically detects that this is a workspace member and adds the following lines to the parent pyproject.toml accordingly:

[tool.uv.workspace]
members = [
    "playground/transformers",
]

Now add Transformers as a dependency (including PyTorch for model inference) to the playground/transformers workspace member:

$ cd playground/transformers
$ uv add "transformers[torch]"

Note that the transformers[torch] dependency is only added to the playground/transformers/pyproject.toml workspace member, not to the parent pyproject.toml.

The model card on Hugging Face Hub provides a high-level code example for using Transformers. Let’s modify this a little bit and try it out:

# playground/transformers/example.py

from transformers import pipeline

text_generator = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")
messages = [
    {"role": "user", "content": "What's the capital of France?"},
]
output = text_generator(messages)
print(output)
# [{'generated_text': [{'role': 'user', 'content': "What's the capital of France?"}, {'role': 'assistant', 'content': 'The capital of France is Paris.'}]}]

With just a few lines of code, you can generate text from a local LLM! Here’s what’s happening:

  1. We initialize a TextGenerationPipeline pipeline using the pipeline function from the Transformers library.
  2. It will download the model from the Hugging Face Hub and cache it locally.
  3. As we’re using a conversational model, we can pass a list of messages to the text_generator pipeline. Otherwise, a string would be accepted.
  4. The text_generator pipeline returns a list of generated text, including the user’s message and the model’s response.

While pipeline() works for simple use cases, if you want to customize the output (for instance, to enable streaming responses) you need to write your own API on top of low-level classes such as AutoTokenizer and AutoModelForCausalLM. It’s also not designed with scalability in mind. Fortunately, there are alternatives that are more developer-friendly out of the box.
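
Still, to give a sense of the extra work involved, here’s a rough sketch of what streaming with those lower-level classes could look like (illustrative only; the file name and max_new_tokens value are my own choices):

# playground/transformers/streaming_example.py

from threading import Thread

from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

messages = [{"role": "user", "content": "What's the capital of France?"}]

# Convert the chat messages into the model's prompt format
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt")

# The streamer yields decoded text chunks as soon as they are generated
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

# Run generation in a background thread so we can consume the stream here
thread = Thread(target=model.generate, kwargs={**inputs, "streamer": streamer, "max_new_tokens": 128})
thread.start()

for chunk in streamer:
    print(chunk, end="", flush=True)
thread.join()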

vLLM

vLLM is also a Python library and has the tagline “Easy, fast, and cheap LLM serving for everyone”. It’s designed around an easy-to-use LLM class and provides an OpenAI-compatible API server (using FastAPI) so you can interact with the model using the official OpenAI Python client (or any other HTTP client).

From the project root directory, create a new workspace member and install vLLM there:

$ uv init playground/vllm --bare --name llm-chat-vllm
$ cd playground/vllm
$ uv add vllm

Be aware that vLLM collects anonymous usage stats by default. Opt out by setting the VLLM_NO_USAGE_STATS environment variable to 1.

Let’s try out vLLM’s Python API with the following example and run it with uv run example.py:

# playground/vllm/example.py

from vllm import LLM


def main():
    llm = LLM("Qwen/Qwen2.5-0.5B-Instruct", max_num_batched_tokens=32768)
    messages = [
        {"role": "user", "content": "What is the capital of France?"},
    ]
    outputs = llm.chat(messages)
    for output in outputs:
        print(output.outputs[0].text)
        # The capital of France is Paris.


if __name__ == "__main__":
    main()

Compared to the Transformers example, vLLM’s API is a bit more intuitive as it’s specifically designed around LLMs. Let’s walk through it:

  1. We initialize an LLM object with a Hugging Face model name. vLLM downloads and caches the model locally.
  2. For this model, we need to set either max_model_len or max_num_batched_tokens to tune performance and memory usage. Here it’s set lower than the default, which is sufficient for this example.
  3. The LLM object has a chat method that takes a list of messages and returns the outputs.
  4. Note that the code needs to be wrapped in a main() function as vLLM uses multiprocessing.
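
If you want more control over generation (temperature, maximum response length, and so on), you can pass sampling parameters to chat(). A minimal sketch, with arbitrary values and a hypothetical file name:

# playground/vllm/sampling_example.py

from vllm import LLM, SamplingParams


def main():
    llm = LLM("Qwen/Qwen2.5-0.5B-Instruct", max_num_batched_tokens=32768)
    # Control randomness and response length per request
    params = SamplingParams(temperature=0.7, max_tokens=256)
    outputs = llm.chat(
        [{"role": "user", "content": "What is the capital of France?"}],
        params,
    )
    print(outputs[0].outputs[0].text)


if __name__ == "__main__":
    main()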

But you don’t even need to write any Python code to interact with a local LLM. You can serve a model with an OpenAI-compatible API with a single vllm serve command:

$ uv run vllm serve Qwen/Qwen2.5-0.5B-Instruct --max-model-len=4096

You can test this API in another terminal with curl:

$ curl http://localhost:8000/v1/models
{"object":"list","data":[{"id":"Qwen/Qwen2.5-0.5B-Instruct","object":"model","created":1761858671,"owned_by":"vllm","root":"Qwen/Qwen2.5-0.5B-Instruct","parent":null,"max_model_len":2048,"permission":[{"id":"modelperm-befeda6f7da24484a58b42c5a08bb5d2","object":"model_permission","created":1761858671,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}

Installing vLLM with uv and running uv run vllm serve is effortless. However, containerizing it is a different story. vLLM does not provide an official Docker image for Apple Silicon, as Docker containers cannot access the host’s GPU on the Mac.

vLLM provides a Dockerfile.cpu that can be used to build a Docker image that runs inference on the CPU. I’ve managed to build it, but requesting a chat completion caused an obscure error in vLLM’s EngineCore which halted the server. So containerizing vLLM on Apple Silicon is currently a no-go.

Text Generation Inference (TGI)

Text Generation Inference (TGI) is a toolkit by Hugging Face for deploying and serving LLMs in production environments. Launching TGI is as simple as running:

$ docker run -p 8080:80 ghcr.io/huggingface/text-generation-inference:latest --model-id <your-model-id>

This will serve the model through an OpenAI-compatible API. But like vLLM, TGI is currently unsupported on Apple Silicon, so we’ll move on to the next option.

Ollama

Ollama is a user-friendly tool for running LLMs locally, with a simple CLI to download, run and manage models. It also provides an API with OpenAI-compatible endpoints, for easy integration with other applications.

Ollama has a list of available models that you can pull, but you can also use GGUF models from Hugging Face. Under the hood, Ollama uses llama.cpp as its inference engine, which only supports models in the GGUF format.

To create a Docker image with a downloaded model, we can use the official ollama/ollama image:

# playground/ollama/Dockerfile

FROM ollama/ollama:latest

ARG MODEL

# `ollama pull` only works if the server is running in the background,
# so we wait for it to be ready before pulling the model
RUN ollama serve & \
    until ollama list > /dev/null 2>&1; do \
      sleep 1; \
    done && \
    ollama pull $MODEL

Use this image to run Ollama as a container with Docker Compose:

# playground/ollama/docker-compose.yml

services:
  ollama:
    build:
      context: .
      args:
        - MODEL=hf.co/bartowski/Qwen2.5-0.5B-Instruct-GGUF
    ports:
      - 11434:11434

Note that Hugging Face models are passed as an argument to ollama run and ollama pull in the format hf.co/{username}/{repository}.

Start the container with docker compose up --build and then verify that Ollama is running and has the model installed:

$ curl http://localhost:11434/api/tags
{"models":[{"name":"hf.co/bartowski/Qwen2.5-0.5B-Instruct-GGUF:latest", ...}

We now have a way to build and run containers that serve a local LLM on the Mac. However, running Ollama in Docker instead of natively on the Mac host has a major downside: the container cannot access the GPU, so inference runs on the CPU, which is significantly slower.

Luckily, there is a way to run local models with Docker on the Mac with GPU acceleration enabled.

Docker Model Runner (DMR)

Docker Model Runner (DMR) was released in April 2025 as part of Docker Desktop 4.40. It’s a way to run AI models locally as easily as any other Docker container and is well integrated with the Docker ecosystem. Most importantly, because the inference engine is running on the host instead of in the container, it enables GPU-accelerated inference on Macs with Apple Silicon.

You can pull models from Docker Hub or GGUF models from Hugging Face. Like Ollama, DMR uses llama.cpp as its inference engine, which only supports GGUF models, but there are plans to support multiple inference engines in the future.

With a compatible Docker Desktop version, running a local model in a container couldn’t be easier:

$ docker model run hf.co/bartowski/Qwen2.5-0.5B-Instruct-GGUF
Unable to find model 'hf.co/bartowski/Qwen2.5-0.5B-Instruct-GGUF' locally. Pulling from the server.
Downloaded 397.81MB of 397.81MB
Model pulled successfully
> what's the capital of France?
The capital of France is Paris

It’s also very easy to serve models in your Docker Compose application, where other services can access them through an OpenAI-compatible API. We’ll use this approach for our chat application.

Building the backend application

Now that we have chosen a method to run an LLM locally, we can start building the chat application on top of it. The backend layer sits between the LLM and the frontend chat client and contains our application logic. Decoupling the client from the LLM using an intermediate layer (facade pattern) offers several benefits:

  • Model flexibility: swap the model for a different one or use a cloud-hosted model without needing to modify the chat client.
  • Unified interface: the client only needs to interface with a single backend API regardless of the underlying model implementation.
  • Control LLM output: filter or transform model responses to only return relevant data to the client.

We’ll use FastAPI to build our backend application, as it’s the fastest way to build production-ready APIs with minimal boilerplate in Python.

First, we set up a new workspace member in the backend directory and add our dependencies:

$ uv init backend --bare --name llm-chat-backend
$ cd backend
$ uv add "fastapi[standard]" openai

Now we’ll create the FastAPI application. To keep things simple, we’ll only expose a single /chat endpoint that accepts a list of messages and returns a single message. Create a main.py file and add the following code:

# backend/main.py

import os

from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from openai import OpenAI
from pydantic import BaseModel

app = FastAPI()

# Backend runs on localhost:8001 and frontend on localhost:8000,
# so we need to allow cross-origin requests:
# https://fastapi.tiangolo.com/tutorial/cors/
app.add_middleware(
    CORSMiddleware,
    allow_origins=["http://localhost:8000"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

# `LLM_URL` and `LLM_MODEL` environment variables are set by Docker:
# https://docs.docker.com/ai/compose/models-and-compose/#short-syntax
client = OpenAI(
    api_key="",
    base_url=os.environ["LLM_URL"],
)

# The `LLM_MODEL` value sometimes doesn't exactly match the model ID
# that's accepted by the API. In this case, the lowercase value works,
# but otherwise use `client.models.list()` to fetch the correct model ID.
model = os.environ["LLM_MODEL"].lower()


class Message(BaseModel):
    role: str
    content: str


class ChatRequest(BaseModel):
    messages: list[Message]


@app.post("/chat")
def chat(request: ChatRequest) -> Message:
    # Include a system prompt to guide the model's behavior
    messages = [{"role": "system", "content": "You are a helpful assistant."}] + [
        message.model_dump() for message in request.messages
    ]
    chat_completion = client.chat.completions.create(messages=messages, model=model)
    message = chat_completion.choices[0].message
    return Message(role=message.role, content=message.content)

The application sets up an OpenAI client configured to use our local LLM (via the LLM_URL environment variable), uses Pydantic models to define the request and response schemas and returns a chat completion from the LLM.

In a real-world application, you might want to add authentication, rate limiting, logging, etc. You’ll probably also want to store conversations on the server, so users can view and resume earlier chats.

To containerize this FastAPI application, create this Dockerfile:

# backend/Dockerfile

FROM python:3.13-slim

# Create the virtual environment in a directory separate from the application code,
# to prevent accidentally overriding the container's environment
# when mounting the project directory that contains a local `.venv` directory.
RUN --mount=from=ghcr.io/astral-sh/uv,source=/uv,target=/bin/uv \
    uv venv /opt/venv

ENV UV_PROJECT_ENVIRONMENT=/opt/venv
ENV PATH="/opt/venv/bin:$PATH"

RUN --mount=from=ghcr.io/astral-sh/uv,source=/uv,target=/bin/uv \
    --mount=type=bind,source=pyproject.toml,target=pyproject.toml \
    uv sync

WORKDIR /app

COPY main.py .

CMD ["fastapi", "run"]

Next, we’ll create the frontend chat UI that will communicate with our backend API.

Creating the frontend chat UI

Nowadays, there are many open-source AI chatbot components available. Which library you can use depends on your LLM and backend and the features they support: thinking mode, tool calling, and so on.

To better understand the underlying mechanics of a chat UI, we’ll build a simple one from scratch with vanilla HTML, CSS and JavaScript.

Create a directory for the frontend application:

$ mkdir frontend

Add the following index.html file to that directory:

<!doctype html>
<html>
<head>
    <title>LLM Chat</title>
    <link rel="stylesheet" href="style.css">
    <script type="module" src="main.js"></script>
</head>
<body>
    <main id="chat"></main>
    <form>
        <input type="text" placeholder="Ask me anything">
    </form>
    <template id="user-message-template">
        <div class="message user-message">
            <div class="message-role">You</div>
            <div class="message-content"></div>
        </div>
    </template>
    <template id="assistant-message-template">
        <div class="message assistant-message">
            <div class="message-role">AI</div>
            <div class="message-content"></div>
        </div>
    </template>
</body>
</html>

This HTML is very minimal, with the only notable thing being the use of <template> elements that provide reusable HTML snippets for chat messages.

To make it interactive, we should at a minimum implement the following behavior:

  1. Allow the user to submit chat messages.
  2. Keep track of the conversation in memory.
  3. Request chat completion based on the messages from the backend API.
  4. Render the chat messages in the UI.

Create a main.js file with the following code:

// frontend/main.js

// Keep track of the conversation history in memory
const messages = [];

const input = document.querySelector("input");
const chat = document.querySelector("#chat");
const userMessageTemplate = document.querySelector("#user-message-template");
const assistantMessageTemplate = document.querySelector("#assistant-message-template");

function receiveMessage(message) {
    // Add the assistant message to the conversation history
    messages.push(message);

    // Add the assistant message to the chat UI
    const assistantMessage = assistantMessageTemplate.content.cloneNode(true);
    assistantMessage.querySelector('.message-content').textContent = message.content;
    chat.appendChild(assistantMessage);
}

function sendMessage(message) {
    // Add the user message to the conversation history
    messages.push({ role: "user", content: message });

    // Add the user message to the chat UI
    const userMessage = userMessageTemplate.content.cloneNode(true);
    userMessage.querySelector('.message-content').textContent = message;
    chat.appendChild(userMessage);

    // Clear the input field
    input.value = "";

    // Request a chat completion based on the current conversation from the backend API
    fetch("http://localhost:8001/chat", {
        method: "POST",
        headers: {
            "Content-Type": "application/json",
        },
        body: JSON.stringify({ messages }),
    })
    .then(response => response.json())
    .then(receiveMessage);
}

// Focus the input field when the page loads
input.focus();

document.querySelector("form").addEventListener("submit", (event) => {
    event.preventDefault();
    sendMessage(input.value);
});

Don’t forget to add some styling to style.css to make it look like a chat interface.

Bringing it all together

We have our local LLM, backend application and frontend chat UI. Now we can bring it all together with this docker-compose.yml file:

services:
  backend:
    build:
      context: backend
    command: fastapi dev --host 0.0.0.0
    ports:
      - 8001:8000
    volumes:
      - ./backend:/app
    models:
      - llm
  frontend:
    depends_on:
      - backend
    image: busybox:latest
    ports:
      - 8000:80
    volumes:
      - ./frontend:/www
    command: httpd -f -v -h /www

models:
  llm:
    model: hf.co/bartowski/Qwen2.5-0.5B-Instruct-GGUF

This will serve our static frontend on http://localhost:8000 and the FastAPI backend at http://localhost:8001. Start the application with docker compose up and open http://localhost:8000 in your browser to chat with your local LLM:

Screenshot of the chat application
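
You can also exercise the backend directly, without the UI (the exact response wording will vary):

$ curl http://localhost:8001/chat -H "Content-Type: application/json" -d '{"messages": [{"role": "user", "content": "What is the capital of France?"}]}'
{"role":"assistant","content":"The capital of France is Paris."}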

Wrapping up

I hope this demo chat application shows that it’s fairly easy to try out open source LLMs and start tinkering with them in containerized applications using Docker Model Runner.

While the chat UI itself is very minimal, I hope it prompts you to think about features common in AI chat apps — streaming partial responses, “thinking” indicators, formatting different kinds of LLM output, and more — that need to be designed and built for a production-ready application.

The full source code of the chat application, along with all code examples from this post, is available on GitHub.