How to Use Cursor with a Local LLM?

TL;DR

  • Ollama and LM Studio are the standard local model servers; the recommended models have changed significantly since 2025
  • Cursor's coordination logic runs in a cloud-bound sandbox — a public tunnel is architecturally required, not just a workaround
  • Ollama defaults to a 4k–8k token context window; Cursor regularly sends 30k+ tokens — set num_ctx to 32k or higher or your model will silently drop codebase context
  • Local models via the OpenAI Override work for Chat and Cmd+K — Cursor Tab autocomplete still defaults to cloud models
  • Model Context Protocol (MCP), adopted by Cursor in early 2025, lets local models act as agents — reading files, querying databases, and using tools beyond text generation

What Has Changed About Using Cursor with a Local LLM in 2026?

The core method — overriding Cursor's OpenAI Base URL — still works, but the models, tunneling options, proxy tools, and capabilities available in February 2026 are substantially different from what earlier guides describe.

Three shifts define the current landscape.

First, reasoning models have replaced standard chat models as the benchmark for local coding. Models that think before they output now produce meaningfully better code than their predecessors.

Second, MCP transformed local LLMs from text generators into agents. A local model connected to an MCP server can read your filesystem, query your database, and take actions — not just respond to prompts.

Third, context window management is now a required configuration step. Without it, your local model will silently forget the beginning of large code files mid-session.


Which Local Models Should You Use with Cursor in 2026?

The current leaders for local coding in February 2026 are Qwen3-Coder and DeepSeek-V3.2, with GLM-4.7 Thinking as a strong reasoning-first option.

Gemma 3 12B and DeepSeek Coder 6.7B, while still functional, are now considered legacy picks for serious coding work.

| Model | Type | Context |
|---|---|---|
| Qwen3-Coder 32B/7B | MoE Coding | 256K (1M+) |
| DeepSeek-V3.2 | Agentic Reasoning | 128K |
| GLM-4.7 Thinking | Self-Correcting | 1M (Full Repo) |
| Gemma 3 12B | Multimodal Vision | 128K |

For cloud-backed work within the same Cursor session, Claude Opus 4.6 (released February 5, 2026) is the current frontier model for agentic coding. Claude Sonnet 4.6 (released February 17, 2026) adds a 1M token context window and improved computer use capabilities.

Performance varies by hardware. Speed figures you find online reflect specific machine configurations — do not treat them as universal benchmarks.


Why Do You Still Need a Public Tunnel to Use Cursor with a Local LLM?

Cursor's coordination logic — indexing, prompt construction, and context assembly — runs in a cloud-bound sandbox that cannot access your private network by design.

This is not simply Cursor blocking your IP address. It is an architectural constraint: the part of Cursor that builds your prompts does not run on your local machine. It requires a publicly reachable HTTPS endpoint to communicate with your local model server.

Three tunnel options are available in 2026:

  • ngrok — easiest to set up; free tier tunnels are typically ephemeral and generate a new URL on each restart
  • Cloudflare Tunnel (cloudflared) — free, permanent static URLs; more stable for daily use than ephemeral ngrok links
  • Tailscale Funnel — exposes the endpoint through Tailscale's relay infrastructure and pairs with the access controls of an existing Tailscale network; the usual pick for teams already running Tailscale

All three produce a public HTTPS URL you paste into Cursor's Override Base URL field with /v1 appended. The configuration steps in Cursor are identical regardless of which tunnel you choose. Verify your specific plan limits and behavior directly with whichever provider you use.
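As a sketch of what starting a tunnel looks like, assuming Ollama is serving on its default port 11434 (LM Studio defaults to 1234 — swap the port accordingly). The URL each command prints is what you paste into Cursor:

```shell
# ngrok: free-tier URLs are ephemeral and change on each restart
ngrok http 11434

# Cloudflare quick tunnel: prints a trycloudflare.com URL; a named tunnel
# (cloudflared tunnel create ...) gives you a permanent hostname instead
cloudflared tunnel --url http://localhost:11434
```

Both commands run in the foreground, so keep them in their own terminal while you work.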


How Do You Configure Cursor to Use a Local LLM?

Open Cursor Settings → Models → OpenAI Configuration, paste your tunnel URL with /v1 appended into the Override Base URL field, enter any non-empty API key placeholder, and add your model using its exact server-side tag.

Full configuration steps:

  1. Start your local model server — Ollama or LM Studio — and confirm it is running
  2. Start your tunnel and copy the public HTTPS URL it generates
  3. Open Cursor → Settings → Models → OpenAI Configuration
  4. Enable the API Key field and enter any non-empty placeholder: "ollama", "1234", or "local"
  5. Paste your tunnel URL into Override OpenAI Base URL and append /v1 — example: https://abcd1234.ngrok.io/v1
  6. Click Add Custom Model and enter the exact model tag from your server

For Ollama: run ollama list to confirm the exact tag before entering it in Cursor. If you pulled qwen3-coder:7b, type qwen3-coder:7b — any variation returns a 404 error.

For LM Studio on Windows: enter the model name exactly as LM Studio reports it in its interface. This is a JSON field matching requirement — Cursor sends the model name in the request body and LM Studio must match it precisely.

When debugging, check your local server logs first. LM Studio has a console view and Ollama outputs logs to the terminal. These show exactly what requests are arriving and where they fail — faster than troubleshooting from Cursor's side.
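To confirm the chain end to end, you can replay the same kind of request Cursor sends. This sketch targets Ollama's OpenAI-compatible endpoint on its default local port — substitute your tunnel URL to test the public leg:

```shell
# List the model tags the server will accept (must match Cursor's model field)
curl http://localhost:11434/v1/models

# Same request shape Cursor sends; a 404 here means the model tag is wrong
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3-coder:7b", "messages": [{"role": "user", "content": "Say hi"}]}'
```

If the local curl succeeds but the tunnel URL fails, the problem is the tunnel, not your model server.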


What Is the Context Window Setting You Cannot Skip?

Ollama defaults to a 4k–8k token context window, but Cursor regularly sends 30k or more tokens when working with large codebases — without a manual override, your local model will silently drop earlier context and produce hallucinated output.

To fix this, set num_ctx when running your Ollama model:

```
ollama run qwen3-coder:7b --num-ctx 32768
```

Or set it persistently in a Modelfile:

```
FROM qwen3-coder:7b
PARAMETER num_ctx 32768
```
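With the Modelfile saved, you can bake the setting into a reusable tag — a sketch assuming the Modelfile above (the tag name qwen3-coder-32k is just an example):

```shell
ollama create qwen3-coder-32k -f Modelfile   # registers a new tag with num_ctx baked in
ollama run qwen3-coder-32k                   # serve it; use this exact tag in Cursor
```

After this, add qwen3-coder-32k (not the base tag) as the custom model in Cursor.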
For large projects, 65,536 tokens is a safer target if your hardware supports it. Qwen3-Coder's native 256K context window makes it particularly well-suited for large codebase work without hitting this ceiling.

LM Studio users can adjust context length directly in the model settings panel before starting the server.


What Can You Actually Do with Cursor and a Local LLM?

Via the OpenAI Override, local models support Cursor Chat and Cmd+K inline generation — but Cursor Tab autocomplete still defaults to cloud models and is not supported by local setups.

This is a hard limitation worth understanding before committing to the setup.

What local models give you:

  • Full Cursor Chat conversations with your codebase
  • Cmd+K inline code generation and edits
  • MCP-connected agent actions when configured

What local models do not give you:

  • Cursor Tab autocomplete — this requires sub-100ms latency that local models cannot reliably deliver over a tunnel
  • Consistent Composer multi-file edits — behavior varies by configuration

What Is MCP and How Does It Work with a Local LLM in Cursor?

Model Context Protocol (MCP), launched by Anthropic in November 2024 and adopted by Cursor in early 2025, allows a local LLM to act as a coding agent — reading files, querying databases, and using tools rather than just generating text.

Without MCP, a local model only sees what you paste into the chat. With MCP, it can read your actual filesystem, query your local database, pull from APIs, and take actions on your behalf — without any of that data leaving your machine.

To enable MCP in Cursor with a local model:

  1. Configure your MCP server — filesystem, database, or custom tool server
  2. Connect the MCP server endpoint in Cursor settings
  3. Select your local model via the Override method as described above
  4. Cursor routes tool calls through the MCP server while using your local model for reasoning
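One way to sketch step 1 from the shell, using the reference filesystem server (@modelcontextprotocol/server-filesystem) and a project-level .cursor/mcp.json — the server name "project-files" and the scoped path are illustrative choices, not requirements:

```shell
# Write a minimal project-level MCP config; Cursor reads .cursor/mcp.json
mkdir -p .cursor
cat > .cursor/mcp.json <<'EOF'
{
  "mcpServers": {
    "project-files": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "."]
    }
  }
}
EOF
```

Restricting the filesystem server to the project directory (the "." argument) keeps the agent's file access scoped to the code it is actually working on.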

This is the step that moves a local LLM from a text generator to a genuine coding agent with real project context.


Is There a Better Way to Manage Local and Cloud Models Together in Cursor?

LiteLLM is a proxy layer that sits between Cursor and your local model server, giving you unified model management and the ability to switch between local and cloud backends without touching Cursor's settings.

Instead of pointing Cursor directly at your Ollama or LM Studio server, you point it at LiteLLM. LiteLLM routes to whichever backend you configure — local models, cloud APIs, or both.
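A minimal LiteLLM proxy config sketch under these assumptions: Ollama on its default port, and the model names used earlier in this guide. The aliases local-qwen and claude-sonnet are arbitrary labels you choose:

```yaml
model_list:
  - model_name: local-qwen                      # the name you select in Cursor
    litellm_params:
      model: ollama/qwen3-coder:7b              # confirm the tag with `ollama list`
      api_base: http://localhost:11434
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-4-6-20260217
      api_key: os.environ/ANTHROPIC_API_KEY     # read from the environment
```

Start the proxy with litellm --config config.yaml and point your tunnel at the proxy's port (4000 by default) instead of Ollama's — Cursor then switches backends by model name alone.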

For users who want a simpler option, LLM-Router remains a lightweight Mac-compatible tool for switching between a local Ollama model and Claude without reconfiguring Cursor. Use the model name prefix syntax:

  • Claude Sonnet 4.6: anthropic/claude-sonnet-4-6-20260217 or just claude-sonnet-4-6-20260217
  • Ollama: ollama/qwen3-coder:7b (confirm your exact tag with ollama list)

What Are the Limitations of Using Cursor with a Local LLM?

The two most impactful limitations are that Cursor Tab autocomplete does not work with local models, and that Ollama's default context window is too small for serious codebase work without a manual override.

Full limitation summary:

  • Cursor Tab not supported: Autocomplete requires cloud model latency. Local models via Override cover Chat and Cmd+K only
  • Context window too small by default: Always set num_ctx to at least 32,768 in Ollama — without this, large codebase sessions degrade silently
  • Ephemeral tunnel URLs: ngrok free tier generates a new URL on each restart; use Cloudflare Tunnel for a permanent static URL
  • Exact model name matching required: Any mismatch between Cursor's model field and your server tag returns a 404 — confirm with ollama list or LM Studio's interface
  • Active internet required: Cursor's coordination layer is cloud-bound; local inference does not mean offline operation
  • MCP requires separate configuration: Agent capabilities do not work out of the box — you must set up and connect an MCP server
  • Hardware determines performance: All benchmark figures are machine-specific — test on your own hardware before committing to a model


Frequently Asked Questions

How do I use Cursor with a local LLM without paying for API access?

Install Ollama, pull a model like Qwen3-Coder, expose it with a free Cloudflare Tunnel, and override Cursor's OpenAI Base URL with the tunnel address plus /v1. Enter any placeholder as the API key — local servers do not validate it. You pay nothing for inference; your only cost is the hardware running the model.

Why does my local LLM give wrong answers halfway through a large file in Cursor?

Almost certainly a context window issue. Ollama defaults to 4k–8k tokens and Cursor sends 30k or more for large codebases. Set num_ctx to at least 32,768 when starting your Ollama model. Without this, the model silently drops earlier context and fills the gap with hallucinated output.

Does Cursor Tab autocomplete work with a local LLM?

No. Cursor Tab requires sub-100ms response latency that local models served over a tunnel cannot reliably deliver. The OpenAI Override method supports Chat and Cmd+K only. Autocomplete continues to use cloud models regardless of your local model configuration.