Self-Hosted LLM — Run AI Models Locally with AINative
Full privacy, zero API costs, and sub-50ms latency. AINative Cody CLI connects to any local runtime — Ollama, LM Studio, or vLLM — and ZeroDB Local gives your self-hosted agent persistent memory with no cloud dependency.
Why self-host your LLM?
Cloud LLM APIs are convenient but come with trade-offs: data leaves your environment, costs scale with every token, and you depend on a vendor's uptime. Self-hosting eliminates all three.
Complete data privacy
Your prompts, context, and outputs never leave your machine. Essential for healthcare, legal, finance, and any regulated industry where data residency matters.
Zero marginal cost
After hardware setup, inference is free. High-volume workloads — batch summarization, code analysis, document processing — cost nothing per token.
Sub-50ms latency
Local inference eliminates the network round-trip. On a modern GPU, Llama 3.1 8B serves responses at 80–150 tokens/sec — faster than most cloud APIs.
Full control
Customize system prompts, context lengths, temperature defaults, and quantization levels. No rate limits, no API quotas, no vendor lock-in.
Supported Runtimes
AINative Cody CLI is runtime-agnostic. It speaks the OpenAI-compatible API format that all three major local runtimes support.
Ollama
Easiest setup — one command to startOllama wraps llama.cpp into a single binary with a REST API that mirrors the OpenAI chat completions spec. Best for developer workstations and quick prototyping.
Quick start:
curl -fsSL https://ollama.com/install.sh | sh && ollama run llama3.3LM Studio
Desktop GUI — download and run in minutesLM Studio provides a drag-and-drop desktop app with model discovery, built-in chat, and a local API server. No terminal required. Best for non-technical users and testing.
Quick start:
Download from lmstudio.ai, enable local server in settingsvLLM
Production-grade serving on GPU clustersvLLM uses PagedAttention to serve large models at maximum throughput. Handles concurrent users efficiently. Best for internal APIs serving hundreds of requests/sec.
Quick start:
pip install vllm && python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.3-70B-InstructCody CLI Local Mode
Cody CLI is AINative's open-source AI coding assistant. In local mode it connects to Ollama or LM Studio instead of cloud APIs — giving you the same agentic capabilities with zero data leaving your machine.
Agentic workflows
Run multi-step coding tasks, file edits, and terminal commands using your local model — no cloud API calls.
MCP tool use
69+ MCP tools work in local mode. Connect your local Llama to filesystem, git, databases, and custom tools.
Offline memory
ZeroDB Local provides SQLite-backed vector memory that persists across sessions with no network dependency.
# Install Cody CLI
npm install -g @ainative/cody-cli
# Configure for local Ollama
cody config set provider ollama
cody config set model llama3.3
cody config set endpoint http://localhost:11434
# Run an agentic task locally — no API key needed
cody run "Refactor the auth module to use JWT refresh tokens"
# Or start an interactive session
cody chat --localZeroDB Local
ZeroDB Local is a fully offline version of AINative's ZeroDB memory database. It uses SQLite for structured storage and FAISS for vector search — the same API as cloud ZeroDB, with zero network calls.
from zerodb_local import ZeroDBLocal
# Opens (or creates) a local SQLite database
db = ZeroDBLocal(path="~/.cody/memory.db")
# Store a memory — embeddings computed locally
db.memory.remember(
user_id="dev_alice",
content="Alice is working on the billing refactor — uses Stripe not Paddle.",
metadata={"project": "billing", "importance": 0.9},
)
# Recall relevant context
results = db.memory.recall(
user_id="dev_alice",
query="Which payment provider does Alice prefer?",
limit=3,
)
# Returns: Alice is working on the billing refactor — uses Stripe not Paddle.Setup Guide
Full local AI stack — Ollama + Cody CLI + ZeroDB Local — in under 5 minutes.
Install Ollama
curl -fsSL https://ollama.com/install.sh | shOr download from ollama.com for macOS/Windows
Pull a model
ollama pull llama3.3
ollama pull mistral
ollama pull gemma2Models are cached locally after first download
Install Cody CLI
npm install -g @ainative/cody-cli
cody config set model ollama/llama3.3
cody config set endpoint http://localhost:11434Cody CLI connects to any local Ollama or LM Studio instance
Add persistent memory (optional)
pip install zerodb-local
zerodb serve --port 8765
cody config set memory zerodb-localZeroDB Local stores memory in SQLite + FAISS — fully offline
Frequently Asked Questions
What is a self-hosted LLM?
A self-hosted LLM is a large language model that runs on your own hardware or private cloud, rather than a third-party API. Data never leaves your environment, and there are no per-token costs after setup.
What hardware do I need to run an LLM locally?
For Llama 3.1 8B you need at least 8GB RAM (CPU only) or 6GB VRAM (GPU). For Llama 3.3 70B you need 40GB+ RAM or a 40GB+ VRAM GPU. Quantized models (Q4) roughly halve memory requirements.
Does Cody CLI work offline?
Yes. Cody CLI local mode connects to any Ollama or LM Studio instance on localhost. With ZeroDB Local, memory and context also persist offline with no cloud dependency.
What is the difference between Ollama, LM Studio, and vLLM?
Ollama is a command-line tool focused on ease of use. LM Studio provides a desktop GUI. vLLM is a high-throughput inference engine designed for production serving on GPU clusters. AINative Cody CLI works with all three.
Can I use ZeroDB memory with a self-hosted LLM?
Yes. ZeroDB Local (pip install zerodb-local) runs entirely offline using SQLite + FAISS. It provides the same memory/recall/reflect API as cloud ZeroDB, with no network calls.
Run your first local LLM in 5 minutes
Download Cody CLI, point it at Ollama, and you have a full agentic AI stack running on your own hardware — no API keys, no data leaving your machine.