Skip to main content
Ollama · LM Studio · vLLM · Cody CLI Local Mode

Self-Hosted LLM — Run AI Models Locally with AINative

Full privacy, zero API costs, and sub-50ms latency. AINative Cody CLI connects to any local runtime — Ollama, LM Studio, or vLLM — and ZeroDB Local gives your self-hosted agent persistent memory with no cloud dependency.

Why self-host your LLM?

Cloud LLM APIs are convenient but come with trade-offs: data leaves your environment, costs scale with every token, and you depend on a vendor's uptime. Self-hosting eliminates all three.

🔒

Complete data privacy

Your prompts, context, and outputs never leave your machine. Essential for healthcare, legal, finance, and any regulated industry where data residency matters.

💰

Zero marginal cost

After hardware setup, inference is free. High-volume workloads — batch summarization, code analysis, document processing — cost nothing per token.

Sub-50ms latency

Local inference eliminates the network round-trip. On a modern GPU, Llama 3.1 8B serves responses at 80–150 tokens/sec — faster than most cloud APIs.

🎛️

Full control

Customize system prompts, context lengths, temperature defaults, and quantization levels. No rate limits, no API quotas, no vendor lock-in.

Supported Runtimes

AINative Cody CLI is runtime-agnostic. It speaks the OpenAI-compatible API format that all three major local runtimes support.

Ollama

Easiest setup — one command to start

Ollama wraps llama.cpp into a single binary with a REST API that mirrors the OpenAI chat completions spec. Best for developer workstations and quick prototyping.

Platforms:macOS, Linux, Windows
Llama 3.3MistralGemma 2Qwen 2.5DeepSeek R1

Quick start:

curl -fsSL https://ollama.com/install.sh | sh && ollama run llama3.3

LM Studio

Desktop GUI — download and run in minutes

LM Studio provides a drag-and-drop desktop app with model discovery, built-in chat, and a local API server. No terminal required. Best for non-technical users and testing.

Platforms:macOS, Windows
Any GGUF model from HuggingFace

Quick start:

Download from lmstudio.ai, enable local server in settings

vLLM

Production-grade serving on GPU clusters

vLLM uses PagedAttention to serve large models at maximum throughput. Handles concurrent users efficiently. Best for internal APIs serving hundreds of requests/sec.

Platforms:Linux (NVIDIA GPU required)
Llama 3.1/3.3MistralMixtralQwenall HuggingFace-compatible

Quick start:

pip install vllm && python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.3-70B-Instruct

Cody CLI Local Mode

Cody CLI is AINative's open-source AI coding assistant. In local mode it connects to Ollama or LM Studio instead of cloud APIs — giving you the same agentic capabilities with zero data leaving your machine.

Agentic workflows

Run multi-step coding tasks, file edits, and terminal commands using your local model — no cloud API calls.

MCP tool use

69+ MCP tools work in local mode. Connect your local Llama to filesystem, git, databases, and custom tools.

Offline memory

ZeroDB Local provides SQLite-backed vector memory that persists across sessions with no network dependency.

# Install Cody CLI
npm install -g @ainative/cody-cli

# Configure for local Ollama
cody config set provider ollama
cody config set model llama3.3
cody config set endpoint http://localhost:11434

# Run an agentic task locally — no API key needed
cody run "Refactor the auth module to use JWT refresh tokens"

# Or start an interactive session
cody chat --local

ZeroDB Local

ZeroDB Local is a fully offline version of AINative's ZeroDB memory database. It uses SQLite for structured storage and FAISS for vector search — the same API as cloud ZeroDB, with zero network calls.

Storage engine
SQLite (zero config)
Vector index
FAISS (IVF + HNSW)
Embeddings
Inline via sentence-transformers
Memory API
remember / recall / forget / reflect
Network required
No — fully offline
Install
pip install zerodb-local
from zerodb_local import ZeroDBLocal

# Opens (or creates) a local SQLite database
db = ZeroDBLocal(path="~/.cody/memory.db")

# Store a memory — embeddings computed locally
db.memory.remember(
    user_id="dev_alice",
    content="Alice is working on the billing refactor — uses Stripe not Paddle.",
    metadata={"project": "billing", "importance": 0.9},
)

# Recall relevant context
results = db.memory.recall(
    user_id="dev_alice",
    query="Which payment provider does Alice prefer?",
    limit=3,
)
# Returns: Alice is working on the billing refactor — uses Stripe not Paddle.

Setup Guide

Full local AI stack — Ollama + Cody CLI + ZeroDB Local — in under 5 minutes.

1

Install Ollama

curl -fsSL https://ollama.com/install.sh | sh

Or download from ollama.com for macOS/Windows

2

Pull a model

ollama pull llama3.3
ollama pull mistral
ollama pull gemma2

Models are cached locally after first download

3

Install Cody CLI

npm install -g @ainative/cody-cli
cody config set model ollama/llama3.3
cody config set endpoint http://localhost:11434

Cody CLI connects to any local Ollama or LM Studio instance

4

Add persistent memory (optional)

pip install zerodb-local
zerodb serve --port 8765
cody config set memory zerodb-local

ZeroDB Local stores memory in SQLite + FAISS — fully offline

Frequently Asked Questions

What is a self-hosted LLM?

A self-hosted LLM is a large language model that runs on your own hardware or private cloud, rather than a third-party API. Data never leaves your environment, and there are no per-token costs after setup.

What hardware do I need to run an LLM locally?

For Llama 3.1 8B you need at least 8GB RAM (CPU only) or 6GB VRAM (GPU). For Llama 3.3 70B you need 40GB+ RAM or a 40GB+ VRAM GPU. Quantized models (Q4) roughly halve memory requirements.

Does Cody CLI work offline?

Yes. Cody CLI local mode connects to any Ollama or LM Studio instance on localhost. With ZeroDB Local, memory and context also persist offline with no cloud dependency.

What is the difference between Ollama, LM Studio, and vLLM?

Ollama is a command-line tool focused on ease of use. LM Studio provides a desktop GUI. vLLM is a high-throughput inference engine designed for production serving on GPU clusters. AINative Cody CLI works with all three.

Can I use ZeroDB memory with a self-hosted LLM?

Yes. ZeroDB Local (pip install zerodb-local) runs entirely offline using SQLite + FAISS. It provides the same memory/recall/reflect API as cloud ZeroDB, with no network calls.

Run your first local LLM in 5 minutes

Download Cody CLI, point it at Ollama, and you have a full agentic AI stack running on your own hardware — no API keys, no data leaving your machine.