
Running Models Locally

If you want to keep your data entirely on your own hardware, avoid per-token costs, or use Errand without an internet connection, you can run AI models locally. This page covers the main options and helps you set realistic expectations.

Running models locally is very different from using a cloud API. Here is what you need to know:

AI models are computationally demanding. The quality of results you can achieve locally depends directly on the hardware you have available:

  • GPU strongly recommended. Modern AI models run dramatically faster on a GPU. Without one, you will be waiting minutes rather than seconds for each response. For the agent model, a dedicated GPU is effectively a requirement.
  • Memory matters. A model needs to fit in your GPU’s VRAM (or system RAM for CPU inference). Larger models produce better results but need more memory. A 70B parameter model — the minimum we recommend for agent use — typically requires 40GB+ of VRAM.
  • Storage. Model files are large. Expect to download 4-40GB per model depending on size and quantisation.

Be honest with yourself about what local models can deliver:

  • Smaller models are less capable. A 7B or 13B parameter model is fine for title generation and Hindsight, but it will struggle with the complex tool-calling and multi-step reasoning that the agent needs. See Choosing the Right Models for minimum tier recommendations.
  • Slower responses. Even with good hardware, local inference is generally slower than cloud APIs. This is especially noticeable for the agent model, which makes multiple calls during each task.
  • Quality trade-off. The most capable cloud models (Frontier tier) have no local equivalent. If you need the very best quality, cloud is still the way to get it.

None of this means local models are not useful — they absolutely are, especially for the simpler model slots and for the hybrid approach described below.

Ollama is the easiest way to run models locally. It handles downloading, configuring, and serving models with a simple command-line interface. If you have used Docker, the experience is similar — you pull a model and it just works.

  1. Install Ollama from ollama.com
  2. Pull a model: ollama pull llama3.3:70b
  3. Ollama automatically starts a local API server (http://localhost:11434 by default)
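Once the server is up, any HTTP client can talk to it. A minimal sketch, assuming Ollama's default address and its /api/generate route (the prompt and model tag are just examples):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_request(model: str, prompt: str) -> dict:
    """Payload for Ollama's /api/generate route; stream=False returns one JSON object."""
    return {"model": model, "prompt": prompt, "stream": False}

payload = build_request("llama3.3:70b", "Reply with the single word: ready")

# With Ollama running, POST the payload and read the "response" field:
#   req = urllib.request.Request(OLLAMA_URL, json.dumps(payload).encode(),
#                                headers={"Content-Type": "application/json"})
#   print(json.load(urllib.request.urlopen(req))["response"])
```

If the request hangs or errors, check that `ollama pull` finished and that nothing else is bound to port 11434.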
Errand Slot        Recommended Model             Size     Notes
Agent              llama3.3:70b or qwen2.5:72b   ~40GB    Minimum for reliable tool calling. Needs a powerful GPU.
Title Generation   llama3.2:3b or qwen2.5:3b     ~2GB     Any small model works well for this.
Hindsight          llama3.2:3b or qwen2.5:7b     ~2-4GB   Small models are fine; step up to 7B for richer memory.
Transcription      whisper:large-v3              ~3GB     Standard Whisper model for speech-to-text.

These are starting points. The open-source model landscape evolves rapidly — check the Ollama model library for the latest options.

LiteLLM has native support for Ollama. Once Ollama is running, add it as a provider in your LiteLLM configuration and your locally hosted models will appear alongside any cloud models you have configured. Errand sees them all the same way.
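As a sketch (the model_name value is yours to choose), an Ollama-backed entry in a LiteLLM config.yaml looks like this:

```yaml
model_list:
  - model_name: local-titles             # the name clients will request through LiteLLM
    litellm_params:
      model: ollama/llama3.2:3b          # "ollama/" prefix routes the call to Ollama
      api_base: http://localhost:11434   # Ollama's default address; change if remote
```

Repeat the pattern with one entry per slot you want served locally.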

vLLM is a high-performance inference engine designed for production deployments. It is more complex to set up than Ollama but delivers significantly better throughput, especially when serving multiple concurrent requests. It is the better choice for:

  • Production deployments where you need consistent performance under load
  • GPU clusters with multiple GPUs that you want to use efficiently
  • Higher throughput — vLLM’s PagedAttention engine is optimised for serving many requests
  • Team environments where multiple users or Errand instances share the same model server

For a single user on a single machine, Ollama is simpler and works well. For anything larger, vLLM is worth the additional setup effort.

vLLM can be installed via pip or run as a Docker container. See the vLLM documentation for detailed setup instructions.

vLLM exposes an OpenAI-compatible API, so LiteLLM can connect to it directly. Add your vLLM endpoint as an OpenAI-compatible provider in LiteLLM and configure the models you are serving.
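For example (the model name below is illustrative, and vLLM's OpenAI-compatible server listens on port 8000 by default), a LiteLLM entry for a vLLM backend might look like:

```yaml
model_list:
  - model_name: vllm-agent                 # illustrative name for the agent slot
    litellm_params:
      model: openai/qwen2.5-72b            # "openai/" prefix = generic OpenAI-compatible
                                           # backend; the name after the prefix must match
                                           # the model vLLM is actually serving
      api_base: http://localhost:8000/v1   # vLLM's default OpenAI-compatible endpoint
      api_key: "none"                      # vLLM requires no key unless you configure one
```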

Several other tools can serve models locally. These are worth mentioning if you have specific needs:

  • llama.cpp — The low-level inference engine that powers Ollama under the hood. Use it directly if you want maximum control over quantisation, context sizes, and performance tuning. Command-line focused with no GUI.

  • LM Studio — A desktop application with a graphical interface for downloading and running models. Good for experimentation and trying out different models before committing to one. Exposes an OpenAI-compatible API that LiteLLM can connect to.

  • LocalAI — An OpenAI-compatible API wrapper that can serve multiple model types (language, image, audio). Useful if you want a single local server that handles all your AI needs.

For most users interested in local models, a hybrid setup delivers the best experience: run affordable local models for the simpler tasks, and use a cloud provider for the agent where quality matters most.

[Diagram: hybrid approach in which the cloud provider handles the agent model while Ollama runs the title generation, Hindsight, and Whisper models locally, all routed through LiteLLM]

This gives you:

  • Privacy where it matters. Your voice recordings and memory data never leave your machine.
  • Low cost for simple tasks. Title generation and memory operations run locally at no per-token cost.
  • High quality for the agent. The agent — where capability matters most — uses a capable cloud model.
  • Simplicity. LiteLLM routes requests to the right backend automatically. Errand does not know or care which models are local and which are cloud-hosted.
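Concretely, the routing can be expressed in a single LiteLLM config.yaml. A sketch, assuming Ollama on its default port and a cloud API key in the environment (model names are illustrative):

```yaml
model_list:
  - model_name: agent                      # cloud-hosted: quality matters most here
    litellm_params:
      model: openai/gpt-4o                 # swap in whichever cloud provider/model you use
      api_key: os.environ/OPENAI_API_KEY   # LiteLLM reads keys from the environment
  - model_name: titles                     # local: cheap and private
    litellm_params:
      model: ollama/llama3.2:3b
      api_base: http://localhost:11434
  - model_name: hindsight                  # local: memory data never leaves the machine
    litellm_params:
      model: ollama/qwen2.5:7b
      api_base: http://localhost:11434
```

Errand simply asks LiteLLM for a model by name; which backend answers is invisible to it.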

Local models, especially larger ones, can take longer to load into memory and generate responses. The default LLM timeout of 30 seconds may not be enough.

Go to Settings > Task Management and increase the LLM Timeout to at least 120 seconds for local models. If you see timeout errors, increase it further. See the Task Management documentation for details.

Most local model tools support quantised versions of models — smaller, faster files that trade a small amount of quality for significantly reduced memory requirements. For example, a 70B model quantised to 4 bits needs roughly a quarter of the VRAM of the 16-bit original: around 40GB instead of roughly 140GB.
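The arithmetic behind those figures is straightforward. A rough sketch (the 20% overhead factor for KV cache and activations is an assumption; real usage varies with context length):

```python
def vram_gb(params_billions: float, bits_per_weight: int, overhead: float = 1.2) -> float:
    """Rough VRAM estimate: weight storage plus ~20% overhead (assumed) for KV cache."""
    weight_bytes = params_billions * 1e9 * bits_per_weight / 8
    return round(weight_bytes * overhead / 1e9, 1)

print(vram_gb(70, 16))  # 16-bit 70B: ~168 GB, out of reach for a single consumer GPU
print(vram_gb(70, 4))   # 4-bit 70B: ~42 GB, in line with the ~40GB figure quoted above
print(vram_gb(3, 4))    # 4-bit 3B: well under 2 GB of weights, trivial for most GPUs
```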

For Errand’s Efficient-tier tasks (title generation, Hindsight), quantised models work extremely well. For the agent model, use the least aggressive quantisation your hardware can handle to preserve tool-calling reliability.

The first request to a local model may take significantly longer as the model loads into GPU memory. Subsequent requests are fast. If you notice long delays on the first task after starting your system, this is normal — the model is warming up.
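If the reload delay becomes a nuisance, you can ask Ollama to keep models resident between requests via its OLLAMA_KEEP_ALIVE environment variable (supported in recent Ollama releases; the duration below is just an example):

```shell
# Keep models loaded for an hour of idle time instead of unloading them
# shortly after each request, avoiding repeated warm-up delays.
export OLLAMA_KEEP_ALIVE=1h
```

Set this in the environment of the Ollama server process (e.g. its systemd unit), not the client shell, for it to take effect.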