Ollama and Local LLMs¶

Ollama provides a simple way to run open-source LLMs locally. Complete data privacy, no API costs after download, offline capability, and full customization. The tradeoff is hardware requirements and lower capability than frontier closed-source models.

Key Facts¶

Full data privacy - nothing leaves your machine
No cost after download - free forever
Works offline - no internet required
Server endpoint: http://localhost:11434
Start with Q4 8B model - if it runs smoothly, try larger
Open-source models are not yet at GPT-4/Claude level but the gap is narrowing rapidly

Setup¶

# Install from ollama.com, then:
ollama run llama3.1:8b          # Download + run interactively
ollama list                      # Show downloaded models
ollama serve                     # Start API server (port 11434)
/bye                             # Exit interactive session

Hardware Requirements¶

Model Size	VRAM Needed	Hardware
8B params	8-16 GB	Consumer GPUs (RTX 4060+)
70B params	40+ GB	Professional workstations
405B params	Multiple H100s	Impractical for local use

Quantization Levels¶

Quant	Quality	Size (8B)	Recommendation
Q2	Very low	Smallest	Not recommended
Q4	Good	~4GB	Best starting point
Q5	Better	~5GB	If Q4 runs well
Q6	High	~6GB	Strong GPU
Q8	Near-original	~8GB	RTX 4060+
fp16	Original	~16GB	Needs >= 16GB VRAM

ollama run llama3.1:8b-instruct-q4_0  # Q4 quantized
ollama run llama3.1:8b-instruct-q8_0  # Q8 quantized

Recommended Models¶

Model	Strength
Llama 3.1 8B	Best open-source for local use, competitive with GPT-3.5
Llama 3.1 70B	Beats GPT-3.5 in benchmarks
Gemma 2	Strong alternative from Google
Mistral	Good general-purpose
Qwen 2	Strong multilingual
DeepSeek Coder	Specialized for code
Phi-3	Small but capable (Microsoft)
Dolphin-Llama3	Uncensored variant

Dolphin models: fine-tuned to remove alignment/censorship. Useful for unrestricted local use.

Groq API - Fast Inference Alternative¶

When local hardware is insufficient, Groq provides fast inference via LPU (Language Processing Unit): - Much faster than GPU-based inference - Works with open-source models (Llama 3.1, etc.) - Free tier available, cheap paid pricing - OpenAI-compatible API format

Integration Patterns¶

FlowWise¶

Add "Chat Ollama" node
Set base URL: http://localhost:11434
Set model name (e.g., llama3.1:8b)
Connect to supervisor/chain as usual

LangChain¶

from langchain_community.chat_models import ChatOllama

llm = ChatOllama(model="llama3.1:8b", temperature=0)
response = llm.invoke("Hello, world!")

Local RAG Stack¶

Chat model: Chat Ollama
Embeddings: Ollama Embeddings (same server)
Vector store: In-memory or Chroma
Document loader: PDFs, text files, web scraper
Text splitter: Recursive character (chunk 700-1000, overlap 50-100)
Memory: Buffer memory for conversation context

Entirely local, free, private.

Local Email Agent¶

Three-agent pipeline on Ollama: 1. Email Summarizer - reads email, produces summary 2. Email Responder - generates reply in user's style (via RAG on sample emails) 3. Email Formatter - formats and saves to file

Complete privacy for email processing.

Gotchas¶

Small quantized models produce errors with complex agent workflows - use capable models for agents
Must upsert documents into vector store before RAG works - the #1 FlowWise mistake
Local models are significantly slower than API models without GPU acceleration
CPU-only inference works but is very slow for models > 3B parameters
Ollama downloads models on first use - plan for download time on slow connections
Quality degrades noticeably below Q4 quantization for most tasks
Function calling is less reliable with local models than with GPT-4 or Claude