This guide is for IT administrators, DevOps engineers, and technical teams deploying a private AI infrastructure for confidential business documents. It assumes you're comfortable with the command line, Docker, and basic networking. If you're a non-technical user looking to get started quickly, see the companion guide: How to Run a Private AI Assistant on Windows Without Any Technical Experience.
What follows covers the full production stack: GPU-accelerated inference with LocalAI, semantic document retrieval via LocalRecall, autonomous agent workflows with LocalAGI, and drop-in API compatibility so your existing tooling requires zero rewrites.
Why LocalAI over simpler alternatives at this level
Ollama and AnythingLLM are excellent for individuals and small teams. At the enterprise level, LocalAI's advantages become decisive:
- Full OpenAI and Anthropic API compatibility — swap the base URL in any existing tool and it works. No SDK changes, no wrapper libraries.
- LocalRecall — a built-in, persistent semantic memory layer. Documents are chunked, embedded locally, and stored in a vector database. Queries retrieve relevant chunks before inference, keeping context accurate across thousands of documents.
- LocalAGI — autonomous agent framework for scheduled or event-driven workflows: auto-summarize contracts on upload, extract structured data from invoices, flag compliance risks against a custom ruleset.
- Multi-modal support — vision models, audio transcription, and image generation all run through the same API endpoint.
- MIT licensed — no vendor lock-in, no usage fees, full auditability of the codebase.
Hardware recommendations by workload
| Workload | CPU | RAM | GPU | Storage |
|---|---|---|---|---|
| Light Q&A, 7B models | Any modern 8-core | 16 GB | None (CPU inference) | 40 GB SSD |
| Standard document intelligence, 13B models | 12+ core (Ryzen 7 / Core i7+) | 32 GB | RTX 3090 / 4080 | 100 GB NVMe |
| Heavy analysis, 70B models | 16+ core workstation | 64–128 GB | RTX 4090 / A6000 (24 GB VRAM) | 200 GB NVMe |
| Multi-user server | Server-grade CPU | 128 GB+ | Dual A100 / H100 | RAID NVMe |
For most business deployments (contract review, financial document analysis, policy Q&A), a 13B quantized model on a single RTX 4080 or 4090 delivers near-GPT-4 quality at interactive generation speeds.
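As a rough sizing sanity check before buying hardware, VRAM demand can be approximated from parameter count and quantization width. The sketch below is a back-of-the-envelope estimate only: the ~4.5 bits/weight figure for Q4_K_M and the fixed overhead are assumptions, and real usage varies with context length and backend.

```python
def estimate_vram_gb(params_billions: float, bits_per_weight: float = 4.5,
                     overhead_gb: float = 2.0) -> float:
    """Rough VRAM estimate for a quantized GGUF model.

    bits_per_weight ~4.5 approximates Q4_K_M; overhead_gb stands in for
    the KV cache and CUDA buffers at moderate context sizes.
    """
    weights_gb = params_billions * 1e9 * bits_per_weight / 8 / 1e9
    return round(weights_gb + overhead_gb, 1)

print(estimate_vram_gb(13))  # ~9.3 GB -- fits a 16 GB RTX 4080
print(estimate_vram_gb(70))  # ~41 GB -- needs partial CPU offload on 24 GB cards
```

This matches the table above: 13B quantized models sit comfortably on a single consumer GPU, while 70B models spill past 24 GB of VRAM unless some layers are offloaded to system RAM.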
Windows: Full GPU-Accelerated Installation
Prerequisites
- Windows 10/11 (64-bit)
- Docker Desktop with WSL2 backend enabled
- NVIDIA drivers version 527+ (check with nvidia-smi in PowerShell)
- NVIDIA Container Toolkit for Docker GPU passthrough
Verify your GPU is visible to Docker before proceeding:
docker run --rm --gpus all nvidia/cuda:12.0-base-ubuntu22.04 nvidia-smi
You should see your GPU listed. If you get an error, reinstall the NVIDIA Container Toolkit and restart Docker Desktop.
Install LocalAI with GPU support
Pull and run the CUDA-enabled image:
docker run -d `
--gpus all `
-p 8080:8080 `
-v C:\localai\models:/build/models `
-v C:\localai\config:/build/config `
--name localai `
localai/localai:latest-aio-gpu-nvidia-cuda-12
This mounts two local directories — C:\localai\models for model files and C:\localai\config for configuration. Create them first:
mkdir C:\localai\models
mkdir C:\localai\config
The aio (all-in-one) image bundles a curated model gallery, LocalRecall, and common backends. Once running, the API is available at http://localhost:8080/v1.
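Before wiring up other tools, it is worth confirming the endpoint actually answers. A minimal stdlib sketch against the OpenAI-style /v1/models route (the port assumes the default mapping above):

```python
import json
import urllib.request

def list_model_ids(models_json: str) -> list[str]:
    """Extract model ids from an OpenAI-style /v1/models response body."""
    return [m["id"] for m in json.loads(models_json).get("data", [])]

def check_localai(base_url: str = "http://localhost:8080/v1") -> list[str]:
    """Query a running LocalAI instance and return the loaded model ids."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=10) as resp:
        return list_model_ids(resp.read().decode())
```

Call `check_localai()` once the container reports healthy; an empty list usually just means no models have been pulled yet.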
Pull production models
Navigate to http://localhost:8080 and use the model gallery UI, or pull via the LocalAI CLI:
# Strong reasoning model — good for contract analysis, financial Q&A
docker exec localai local-ai run llama-3.1-8b-instruct:q4_k_m
# Maximum quality for complex multi-document tasks (requires 24 GB VRAM)
docker exec localai local-ai run llama-3.1-70b-instruct:q4_k_m
# Embedding model for LocalRecall (required for RAG)
docker exec localai local-ai run nomic-embed-text
Persistent configuration
For production, manage model configuration with YAML files in C:\localai\config\. Example config for a business Q&A model:
name: business-assistant
backend: llama-cpp
model: llama-3.1-8b-instruct.Q4_K_M.gguf
context_size: 8192
gpu_layers: 40
parameters:
  temperature: 0.2
  top_p: 0.9
  repeat_penalty: 1.1
Lower temperature (0.1–0.3) improves factual accuracy on document retrieval tasks — important for contracts and compliance work where hallucination is a liability.
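The same low-temperature setting can also be passed per request through the OpenAI-compatible API rather than baked into the YAML. A minimal stdlib sketch, reusing the endpoint and model name from the config above:

```python
import json
import urllib.request

def chat_payload(model: str, user_msg: str, temperature: float = 0.2) -> dict:
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": model,
        "temperature": temperature,  # low temperature for factual document work
        "messages": [{"role": "user", "content": user_msg}],
    }

def ask(base_url: str, model: str, question: str) -> str:
    """POST a chat completion to LocalAI and return the answer text."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(chat_payload(model, question)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```

A per-request override wins over the YAML default, which is handy when one model serves both creative drafting and strict compliance Q&A.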
Linux: Full GPU-Accelerated Installation
Linux is the preferred platform for server deployments due to better Docker GPU support, lower overhead, and direct NVIDIA driver access.
Prerequisites
- Ubuntu 22.04 LTS or Debian 12 (recommended); RHEL 8+ also supported
- Docker Engine (not Docker Desktop)
- NVIDIA drivers and NVIDIA Container Toolkit
Install NVIDIA Container Toolkit on Ubuntu:
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Install LocalAI
mkdir -p /opt/localai/{models,config}
docker run -d --gpus all --restart unless-stopped -p 8080:8080 -v /opt/localai/models:/build/models -v /opt/localai/config:/build/config --name localai localai/localai:latest-aio-gpu-nvidia-cuda-12
The --restart unless-stopped flag ensures LocalAI starts automatically on reboot — important for always-on server deployments.
Apple Silicon (ARM Linux / macOS)
For M-series Macs or ARM Linux servers:
docker run -d --restart unless-stopped -p 8080:8080 -v /opt/localai/models:/build/models -v /opt/localai/config:/build/config --name localai localai/localai:latest-aio-cpu
LocalAI on Apple Silicon uses Metal acceleration automatically when run natively (binary install). The Docker image uses CPU inference — for full Metal support, use the binary install on macOS.
CPU-only deployment (air-gapped or low-budget)
For strict air-gap environments without GPU access:
docker run -d --restart unless-stopped -p 8080:8080 -v /opt/localai/models:/build/models -e THREADS=8 --name localai localai/localai:latest-aio-cpu
Set THREADS to the number of physical CPU cores. 7B–13B quantized models (Q4_K_M format) run acceptably on 8-core CPUs for low-concurrency document Q&A.
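If you are unsure of your physical core count, note that Python's `os.cpu_count()` reports logical CPUs; on SMT/Hyper-Threading hardware a workable starting heuristic is to halve it. This is an approximation, not real topology detection (verify against lscpu):

```python
import os

def suggested_threads() -> int:
    """Heuristic starting value for LocalAI's THREADS setting.

    os.cpu_count() reports logical CPUs; typical SMT hardware exposes two
    logical CPUs per physical core, so halve it (approximation only).
    """
    logical = os.cpu_count() or 1
    return max(1, logical // 2)
```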
Enabling LocalRecall for enterprise document intelligence
LocalRecall is LocalAI's semantic memory and RAG system. It embeds your documents locally, stores them in a vector database, and retrieves relevant chunks before each inference call — grounding answers in your actual content rather than model weights.
Configuration
In your LocalAI config directory, create localrecall.yaml:
localrecall:
  enabled: true
  embedding_model: nomic-embed-text
  vector_db: chromadb # local ChromaDB instance, no external dependencies
  chunk_size: 512
  chunk_overlap: 64
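To build intuition for the chunk_size and chunk_overlap settings, here is a simplified chunker. It works on characters where LocalRecall works on tokens, so it is illustrative only; the point is that the overlap keeps text that straddles a chunk boundary intact in at least one chunk.

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into fixed-size chunks where consecutive chunks share
    `overlap` characters (character-based stand-in for token chunking)."""
    step = chunk_size - overlap  # advance less than a full chunk each time
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]
```

Larger overlap improves recall on boundary-spanning clauses at the cost of more stored chunks; 64 of 512 (12.5%) is a reasonable default for contract-length paragraphs.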
Ingesting documents
LocalAI exposes a REST API for document management. Ingest a directory of files programmatically:
# Upload a single document
curl -X POST http://localhost:8080/v1/localrecall/documents -F "file=@/path/to/contract.pdf" -F "collection=legal-contracts"
# Query across a collection
curl -X POST http://localhost:8080/v1/localrecall/query -H "Content-Type: application/json" -d '{
  "collection": "legal-contracts",
  "query": "What are the termination clauses across all contracts?",
  "limit": 10
}'
Collections act as namespaces — use one per document category (legal, finance, HR policy) for clean retrieval boundaries and access control.
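For bulk ingestion, a small script can map a directory tree onto collections before uploading. The folder-to-collection mapping below is hypothetical; adjust it to your own layout, then feed the resulting pairs to the curl upload shown above.

```python
import pathlib

# Hypothetical mapping of top-level folders to LocalRecall collections --
# one namespace per document category, as suggested above.
COLLECTIONS = {"legal": "legal-contracts", "finance": "finance-reports", "hr": "hr-policy"}

def ingest_tree(root: str) -> list[tuple[str, str]]:
    """Pair every supported document under root with its target collection."""
    pairs = []
    for folder, collection in COLLECTIONS.items():
        base = pathlib.Path(root, folder)
        if not base.is_dir():
            continue  # skip categories that don't exist yet
        for path in sorted(base.glob("**/*")):
            if path.suffix.lower() in {".pdf", ".docx", ".txt"}:
                pairs.append((str(path), collection))
    return pairs
```

Each (path, collection) pair then becomes one POST to /v1/localrecall/documents, keeping retrieval boundaries aligned with your folder structure.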
Integrating RAG into chat completions
To automatically inject retrieved context into chat requests, reference the collection in your system prompt or use LocalAI's built-in RAG-aware endpoint:
curl -X POST http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "business-assistant",
  "localrecall_collection": "legal-contracts",
  "messages": [
    {"role": "user", "content": "Summarize indemnification clauses in the Acme and Bluebell contracts."}
  ]
}'
LocalAI retrieves the most semantically relevant chunks from your collection before sending to the model — no external API calls, no data leaving the machine.
Connecting existing business tools
The OpenAI-compatible API (http://localhost:8080/v1) means zero rewrites for tools that already integrate with ChatGPT.
Python / LangChain
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="business-assistant",
    base_url="http://localhost:8080/v1",
    api_key="not-needed",  # LocalAI doesn't require a real key
    temperature=0.2,
)

response = llm.invoke("Summarize the key risks in the Q3 financial report.")
print(response.content)
LlamaIndex (document intelligence)
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import Settings

Settings.llm = OpenAI(
    model="business-assistant",
    api_base="http://localhost:8080/v1",
    api_key="not-needed",
)
Settings.embed_model = OpenAIEmbedding(
    model="nomic-embed-text",
    api_base="http://localhost:8080/v1",
    api_key="not-needed",
)
Microsoft Office / Word add-ins
Any Word add-in or Copilot alternative that accepts a custom OpenAI endpoint can point at http://localhost:8080/v1. No network changes or VPN required for on-premises deployments.
Continue.dev (VS Code / JetBrains)
In .continue/config.json:
{
  "models": [{
    "title": "LocalAI — Business Assistant",
    "provider": "openai",
    "model": "business-assistant",
    "apiBase": "http://localhost:8080/v1",
    "apiKey": "not-needed"
  }]
}
This gives your developers fully local AI assistance over proprietary codebases — nothing leaves the machine.
Autonomous workflows with LocalAGI
LocalAGI extends LocalAI with a planning-and-execution loop. Agents can watch directories, call external tools, and chain multi-step tasks without human intervention.
Example: auto-summarize new contracts
# localagi/agents/contract-monitor.yaml
name: contract-monitor
trigger:
  type: file-watch
  path: /opt/documents/incoming-contracts
  patterns: ["*.pdf", "*.docx"]
tasks:
  - name: summarize
    prompt: |
      You are a legal document analyst. Summarize the key terms of the uploaded contract,
      including: parties, duration, payment terms, termination clauses, and any unusual clauses.
      Output as structured JSON.
  - name: flag-risks
    prompt: |
      Review the summary and flag any clauses that deviate from standard terms.
      Reference the company playbook at /opt/documents/legal-playbook.pdf.
  - name: save-output
    action: write-file
    path: /opt/documents/processed/{filename}_summary.json
Example: weekly financial risk report
name: finance-weekly
trigger:
  type: schedule
  cron: "0 8 * * 1" # Every Monday at 8 AM
tasks:
  - name: analyze
    collection: finance-reports
    prompt: |
      Analyze all financial documents from the past 7 days.
      Identify: cash flow risks, outstanding receivables over 90 days,
      budget variances greater than 10%, and any flagged compliance items.
  - name: email-report
    action: send-email
    to: "cfo@company.com"
    subject: "Weekly Financial Risk Summary — {date}"
Verifying your air-gap
Before processing any sensitive documents, confirm no external traffic is being generated.
Windows:
# Watch active connections while running a query
netstat -an | findstr "ESTABLISHED" | findstr /V "127.0.0.1"
No lines should appear. All connections should be loopback only.
Linux:
# Monitor outbound traffic from the LocalAI container
sudo nsenter -t $(docker inspect -f '{{.State.Pid}}' localai) -n ss -tunp | grep ESTABLISHED | grep -v 127.0.0.1
Again, no external connections should appear during inference.
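The manual ss check can be scripted for scheduled audits. The parser below assumes the modern iproute2 column layout (Netid State Recv-Q Send-Q Local:Port Peer:Port) and flags any established peer that is not a loopback address:

```python
import ipaddress

def external_peers(ss_output: str) -> list[str]:
    """Return peer endpoints from `ss -tun` output that are not loopback."""
    peers = []
    for line in ss_output.splitlines()[1:]:  # skip the header row
        parts = line.split()
        if len(parts) < 6 or parts[1] != "ESTAB":
            continue
        host, _, _port = parts[5].rpartition(":")  # strip the port
        host = host.strip("[]")                    # unwrap IPv6 brackets
        try:
            if not ipaddress.ip_address(host).is_loopback:
                peers.append(parts[5])
        except ValueError:
            peers.append(parts[5])  # unparseable address -> flag for review
    return peers
```

Feed it the output of `ss -tun` captured inside the container's network namespace (as in the nsenter command above); a non-empty return value during inference means the air-gap claim fails.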
For stricter environments, apply an iptables rule to block all outbound traffic from the container:
# Block all outbound from LocalAI except loopback
iptables -I DOCKER-USER -s $(docker inspect -f '{{.NetworkSettings.IPAddress}}' localai) -j DROP
iptables -I DOCKER-USER -s $(docker inspect -f '{{.NetworkSettings.IPAddress}}' localai) -d 127.0.0.0/8 -j ACCEPT
Production checklist
- GPU visible to Docker — docker run --rm --gpus all nvidia/cuda:12.0-base-ubuntu22.04 nvidia-smi returns expected output
- LocalAI API responding — curl http://localhost:8080/v1/models returns model list
- Embedding model loaded — nomic-embed-text or equivalent appears in model list
- LocalRecall enabled and collection created — test with a sample document upload and query
- No external network traffic during inference — confirmed via netstat or ss
- --restart unless-stopped set on container (Linux) or Docker Desktop set to start on login (Windows)
- Model config YAML committed to version control for reproducibility
- Backup strategy for /opt/localai/models and the vector DB data directory