Local LLM Providers

Run Limit with local models using Ollama, LM Studio, vLLM, or any OpenAI-compatible server.

Quick Start

# ~/.limit/config.toml
provider = "local"

[providers.local]
model = "llama3.2"  # Your model name
base_url = "http://localhost:11434/v1/chat/completions"

That's it! The local provider requires no API key and uses sensible defaults for local servers.
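
If you want to confirm the server is reachable before launching Limit, most OpenAI-compatible servers (including Ollama on its default port, shown here) expose a model listing at /v1/models. This is just a sanity check, not something Limit requires:

# Should return a JSON list of the models the server can serve
curl http://localhost:11434/v1/models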

Supported Providers

Provider    Default Port   Status
Ollama      11434          Full support
LM Studio   1234           Full support
vLLM        8000           Full support
Other       Varies         OpenAI-compatible

Provider Aliases

Limit accepts these provider names (all use the same OpenAI-compatible protocol):

  • local - Generic local provider (recommended)
  • ollama - Ollama-specific alias
  • lmstudio - LM Studio-specific alias
  • vllm - vLLM-specific alias

# All equivalent:
provider = "local"
provider = "ollama"
provider = "lmstudio"
provider = "vllm"

Ollama

Ollama is the most popular way to run LLMs locally.

Installation

# macOS/Linux
curl -fsSL https://ollama.ai/install.sh | sh

# Or via Homebrew
brew install ollama

Start Server

ollama serve
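
By default the server listens on port 11434. If you need a different host or port, Ollama reads the OLLAMA_HOST environment variable; the address below is only an example, and base_url in your config must be updated to match whatever you choose.

# Example: serve on port 11500 instead of the default 11434
OLLAMA_HOST=127.0.0.1:11500 ollama serve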

Pull a Model

ollama pull llama3.2
ollama pull qwen2.5-coder:7b
ollama pull deepseek-coder:6.7b

Configuration

provider = "ollama"

[providers.ollama]
model = "llama3.2"
base_url = "http://localhost:11434/v1/chat/completions"
# api_key not required for Ollama

List Available Models

ollama list

Recommended Models for Coding

Model                 Size   Best For
qwen2.5-coder:7b      7B     General coding, fast
deepseek-coder:6.7b   6.7B   Code generation
codellama:7b          7B     Code completion
llama3.2:3b           3B     Lightweight, fast responses
llama3.1:8b           8B     General purpose

LM Studio

LM Studio provides a GUI to run local models.

Setup

  1. Download from lmstudio.ai
  2. Open LM Studio
  3. Go to the "Local Server" tab
  4. Start the server (default: http://localhost:1234)
  5. Load a model

Configuration

provider = "lmstudio"

[providers.lmstudio]
model = "local-model"  # Model name shown in LM Studio
base_url = "http://localhost:1234/v1/chat/completions"

Notes

  • LM Studio must be running with a model loaded
  • The model name in config should match what's shown in LM Studio (see the quick check below)
  • Supports GGUF format models from Hugging Face
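
If you're unsure what to put in model, you can ask the running server for the exact identifiers it advertises (default LM Studio port shown; adjust if you changed it):

# Lists the model IDs LM Studio is currently serving
curl http://localhost:1234/v1/models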

vLLM

vLLM is a high-performance inference server.

Installation

pip install vllm

Start Server

vllm serve meta-llama/Llama-3.2-3B-Instruct --port 8000

Configuration

provider = "vllm"

[providers.vllm]
model = "meta-llama/Llama-3.2-3B-Instruct"
base_url = "http://localhost:8000/v1/chat/completions"

With API Token (Hugging Face)

[providers.vllm]
model = "meta-llama/Llama-3.2-3B-Instruct"
base_url = "http://localhost:8000/v1/chat/completions"
api_key = "hf_xxx"  # If server requires auth

Custom Servers

Any OpenAI-compatible API server works with the local provider.

Configuration Template

provider = "local"

[providers.local]
model = "your-model-name"
base_url = "http://your-server:port/v1/chat/completions"
api_key = ""  # Optional, if server requires auth
max_tokens = 4096
timeout = 120

Common Endpoints

Server                  Typical Endpoint
Ollama                  /v1/chat/completions
LM Studio               /v1/chat/completions
vLLM                    /v1/chat/completions
text-generation-webui   /v1/chat/completions
LocalAI                 /v1/chat/completions

Important: Always include the full endpoint path in base_url. Limit does not auto-append paths.
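
Before pointing Limit at a custom server, it can help to send the same kind of request Limit will send, a plain OpenAI-style chat completion. Replace the host, port, and model name with your own values:

# Minimal OpenAI-compatible chat completion request (no auth header shown)
curl http://your-server:port/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "your-model-name", "messages": [{"role": "user", "content": "Say hello"}]}'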


Advanced Configuration

All Options

[providers.local]
model = "llama3.2"           # Required: model identifier
base_url = "http://..."       # Required: full API endpoint
api_key = ""                  # Optional: auth key if needed
max_tokens = 4096             # Optional: max output tokens (default: 4096)
timeout = 120                 # Optional: request timeout in seconds (default: 60)
max_iterations = 100          # Optional: agent loop limit (default: 100)

Environment Variable

You can also supply the API key through an environment variable (most local servers don't need one):

export LOCAL_API_KEY=""  # Not needed for most local servers
lim

Troubleshooting

"Connection refused"

  • Ensure your local server is running (a quick check is shown below)
  • Check the port matches your server
  • Verify base_url includes the full path
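
A quick way to rule out the first two causes (swap in whatever port your server uses):

# If this fails or hangs, nothing is listening where base_url points
curl http://localhost:11434/v1/models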

"HTTP 404 Not Found"

  • Verify base_url path is correct
  • Check server logs for the correct endpoint
  • Some servers expose a different path (for example, only the legacy /v1/completions endpoint) instead of /v1/chat/completions

Slow Responses

  • Try a smaller model (e.g., llama3.2:3b instead of llama3.1:8b)
  • Increase timeout if model is slow to generate
  • Check GPU/CPU utilization

Out of Memory

  • Use a quantized model (GGUF Q4_K_M or similar)
  • Reduce model size (fewer parameters)
  • Close other applications

Testing Your Setup

# Start Limit
lim

# Check current model
lim> /model

# Simple test
lim> hello, can you help me with code?

See Also