Why run Claude Code locally?
The appeal is straightforward. If you're reviewing proprietary code, working under security-sensitive constraints, or simply want to avoid cloud-dependent tooling, running code analysis locally offers three concrete advantages:
- Full privacy. Your code never leaves your machine. No transmission, no logging, no third-party access.
- No subscriptions. Once you set up the infrastructure, there are no per-request costs or API fees.
- No lock-in. You control which models run where, and can easily swap or customise them.
The tradeoff is straightforward too: local open-source models tend to be less capable than Claude's cloud models. They hold up well on code generation and explanation, but are noticeably weaker at complex multi-step reasoning. For many code review tasks (spotting obvious bugs, checking style, catching security red flags) they're more than adequate. For nuanced architectural analysis, you might still prefer cloud-based reasoning.
What you need: tools and hardware
You'll need three components: Ollama (the local model runtime), a code-capable LLM, and Claude Code (the agent framework from Anthropic). Let's look at the hardware baseline first, because it makes or breaks the experience.
Hardware requirements
For actually usable code review — not toy demos — you should aim for:
| Component | Minimum | Recommended | Comfortable |
|---|---|---|---|
| RAM | 16 GB | 32 GB | 32 GB+ |
| GPU VRAM | 8 GB NVIDIA/AMD | 16 GB | 24 GB+ |
| GPU Type | Any CUDA/ROCm card | RTX 3080 or better | RTX 4090 or M-series Mac 32GB+ |
| Storage | 50 GB free | 100 GB free | 150+ GB |
An Apple Silicon Mac with 32 GB unified memory works quite well. On the PC side, an RTX 3090 or RTX 4080 gives you comfortable headroom for larger models. If you only have 8 GB VRAM, you can still run smaller 7B models, but expect slower inference and occasional stalls on large files.
Model choice
Not all open models are equally useful for code work. Two stand out in early 2026:
- Qwen 2.5 Coder. Available in 7B and 32B variants. Excellent code generation and analysis. The 7B version fits comfortably on 8 GB VRAM; the 32B needs 16-20 GB. Fast inference, good instruction-following.
- GLM 4.7 Flash. Very fast, 128K context window. Good for reviewing large files in one go. Lighter than Qwen in terms of memory but slightly less precise on code-specific tasks.
For most developers, Qwen 2.5 Coder (7B) is the best starting point. It's fast enough for real-time feedback, capable enough for meaningful code analysis, and fits on modest hardware.
Setting up: step by step
1. Install Ollama
Download Ollama from ollama.com. It's available for macOS, Linux, and Windows. Installation is straightforward — just run the installer and let it complete.
Verify installation by opening a terminal and typing:
ollama --version
You should see a version number. On Linux with an NVIDIA card, also confirm the driver can see your GPU (Ollama falls back to CPU if it can't, which is much slower):
nvidia-smi
2. Pull a model
Pull Qwen 2.5 Coder:
ollama pull qwen2.5-coder:7b
This downloads roughly 5 GB of quantised model weights. On a typical broadband connection, expect 5-15 minutes. Ollama handles everything — decompression, format conversion, optimisation.
If you have 16 GB of VRAM, you can try the 32B version instead:
ollama pull qwen2.5-coder:32b
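Before wiring up Claude Code, it's worth a quick sanity check against the model itself. The snippet in the prompt below is just an illustration; any small piece of code will do:

```
# Quick sanity check: ask the model to review a small snippet directly
ollama run qwen2.5-coder:7b "Review this Python function for bugs:
def dedupe(items):
    seen = []
    for i in items:
        if i not in seen:
            seen.append(i)
    return seen"
```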
3. Start the Ollama server
Ollama runs as a background service. Start it with:
ollama serve
The server listens on http://localhost:11434 by default. Leave this terminal running, or configure Ollama to auto-start as a system service (varies by OS).
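You can confirm the server is reachable by hitting it directly:

```
# The root endpoint replies "Ollama is running" when the server is up
curl http://localhost:11434/
# /api/tags lists the models you've pulled, as JSON
curl http://localhost:11434/api/tags
```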
4. Configure Claude Code
Claude Code reads three environment variables to connect to your local Ollama instance. Set them before launching Claude Code:
export ANTHROPIC_AUTH_TOKEN="ollama"
export ANTHROPIC_API_KEY=""
export ANTHROPIC_BASE_URL="http://localhost:11434"
On Windows (PowerShell), use:
$env:ANTHROPIC_AUTH_TOKEN="ollama"
$env:ANTHROPIC_API_KEY=""
$env:ANTHROPIC_BASE_URL="http://localhost:11434"
These settings tell Claude Code to use your local Ollama instance instead of Anthropic's cloud API.
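Setting these by hand every session gets old quickly. A minimal way to persist them on macOS or Linux is to add them to your shell profile (adjust the file name to your shell):

```
# ~/.bashrc or ~/.zshrc: route Claude Code to the local Ollama server
export ANTHROPIC_AUTH_TOKEN="ollama"
export ANTHROPIC_API_KEY=""
export ANTHROPIC_BASE_URL="http://localhost:11434"
```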
5. Launch Claude Code
With those environment variables set, start Claude Code as normal:
claude
Claude Code will now route requests to your local Ollama instance. On first use, it may take 10-30 seconds to load the model into VRAM. Subsequent requests are faster.
Your first code review
Once everything is running, code review feels natural. Point Claude Code at a directory and ask it to review a specific file or feature:
- Review this function for security issues
- Check if this code follows our style guide
- What's the time complexity of this algorithm?
- Refactor this for readability
Claude Code can read files, execute commands, and iterate on suggestions. For code review specifically, the /loop command is useful for batch tasks — running a review across multiple files, checking PRs, or automating repetitive analysis.
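If you'd rather script reviews than run them interactively, Claude Code's print mode can also be driven from a plain shell loop. A minimal sketch, assuming the -p (print) flag is available in your version; the file pattern and prompt are placeholders to adapt:

```
# Review every Python file under src/ and write the findings alongside each file
for f in src/*.py; do
  claude -p "Review $f for bugs, style issues, and security problems" > "$f.review.md"
done
```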
Local vs cloud: the real tradeoffs
Running locally isn't always better. It's different, with genuine advantages and real limitations.
Local wins
- Complete privacy: your code stays on your machine.
- No network latency for small files.
- Works offline.
- No per-request billing.
Cloud wins
- Significantly better reasoning for complex problems.
- Faster for very large files (cloud servers have more resources).
- Automatic updates; no manual model management.
- Better handling of edge cases and unusual code patterns.
The privacy argument deserves specific attention. When you send code to a cloud service, two things happen: your request is transmitted over the network, and the provider may log it (for debugging, abuse prevention, or model improvement). And if your code is ever used for training, the patterns it contains become part of the model's weights; deleting the original files doesn't undo that. For proprietary code, trade secrets, or regulated data, local execution genuinely eliminates that risk category.
Practical tips and common gotchas
Quantisation affects quality
Ollama pulls quantised models by default. A 7B model at Q4_K_M quantisation uses roughly 4-5 GB VRAM; the same model at full 16-bit precision needs 14 GB. You lose some reasoning precision, but usually not enough to matter for code review. If quality degrades, experiment with higher-precision variants (look for :q6 or :q8 tags).
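Higher-precision variants are pulled by tag. The exact tag names differ from model to model, so check the model's page on ollama.com before copying these; the lines below are illustrative only:

```
# Illustrative tags; confirm what's actually published for your model
ollama pull qwen2.5-coder:7b-instruct-q8_0
ollama run qwen2.5-coder:7b-instruct-q8_0
```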
Model loading takes time
First inference after a restart may pause for 10-30 seconds whilst the model loads into VRAM. This is normal. Subsequent requests use the cached model and are much faster. If you review code frequently, keep Ollama running in the background.
Context windows matter
Qwen 2.5 Coder has a 32K context window; GLM 4.7 Flash has 128K. For large files or many-file reviews, a larger context window helps the model "see" more code at once. If you're reviewing enormous codebases, consider the 32B Qwen variant or GLM.
Temperature settings
For code review, lower temperature (more deterministic) is usually better than high temperature. Claude Code defaults to reasonable values, but if responses feel too creative or unstable, check your local configuration.
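If you want to pin a lower temperature at the model level rather than relying on defaults, Ollama's Modelfile mechanism handles it. A minimal sketch; the base model and the 0.2 value are just sensible starting points, not canonical settings:

```
# Create a review-oriented variant of the base model with a lower temperature
cat > Modelfile <<'EOF'
FROM qwen2.5-coder:7b
PARAMETER temperature 0.2
EOF
ollama create qwen-review -f Modelfile
```

You can then use qwen-review like any other local model and compare its answers against the stock tag.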
Frequently asked questions
Can I use Claude Code with Ollama without Anthropic's API?
Yes. That's the entire point of this setup. Ollama implements Anthropic's API format, so Claude Code treats it as a drop-in replacement. No Anthropic account or API key required.
What if my GPU is too small for Qwen 7B?
Fall back to smaller models like Mistral 7B or Orca Mini (3B). They're less capable but still useful for basic code review. Alternatively, run in CPU-only mode (much slower), or rely on Ollama's partial offloading, which automatically splits layers between GPU and CPU when the full model doesn't fit in VRAM.
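Both are available in the Ollama library; confirm the exact tags on ollama.com, since published variants change over time:

```
# Smaller models for constrained GPUs
ollama pull mistral         # 7B general-purpose model
ollama pull orca-mini:3b    # ~3B, very light
```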
Can I switch models on the fly?
Yes. If you have multiple models pulled, you can specify which one Ollama uses via environment variables or CLI flags. See Ollama's documentation for the exact syntax. In practice, most developers stick with one good model to avoid context switching overhead.
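One approach worth trying, assuming your Claude Code version reads the ANTHROPIC_MODEL variable and passes the model name through to Ollama unchanged (verify this before relying on it):

```
# Sketch: select a different locally pulled model for Claude Code to use
export ANTHROPIC_MODEL="qwen2.5-coder:32b"
claude
```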
Is local code review secure against code extraction?
More secure than cloud, yes. A local LLM can't exfiltrate your code over the network, and inference doesn't update the model's weights, so the model itself retains nothing of what it reviews. The remaining exposure is the machine itself: prompts and conversation history may be cached on disk, and anyone with access to the machine can read those, or your code directly. For truly sensitive code (cryptographic keys, security algorithms), review should still be manual or happen in hardware-isolated environments.
What about running Ollama on a separate machine?
You can. Ollama supports network connections. Configure ANTHROPIC_BASE_URL to point to a remote machine's IP:port. This is useful if you want a beefy GPU server and lighter client machines. Network latency does increase, though.
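A minimal sketch of the two sides; 192.168.1.50 is a placeholder for your server's address. OLLAMA_HOST controls which interface the server binds to (the default is localhost only):

```
# On the GPU server: listen on all interfaces instead of just localhost
OLLAMA_HOST=0.0.0.0:11434 ollama serve

# On each client: point Claude Code at the server instead of localhost
export ANTHROPIC_BASE_URL="http://192.168.1.50:11434"
```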
Where to go from here
Once you're comfortable with basic setup, consider:
- Setting up Ollama as a system service so it auto-starts (see the commands after this list).
- Experimenting with multiple models and comparing their code review quality.
- Creating custom prompts or review templates for your team's style guide.
- Running Ollama on a shared GPU server if you work with other developers.
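On Linux, the official install script usually creates a systemd unit for Ollama, so auto-start is often just a matter of enabling it. Assuming that unit exists:

```
# Enable and start the Ollama service created by the Linux installer
sudo systemctl enable --now ollama
sudo systemctl status ollama
```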
The local code review toolchain is now practical for serious development work. It won't replace human review, but it's a capable first pass, and it's entirely under your control.