This guide explains how to configure and use local Large Language Models (LLMs) with Redstring's Wizard agent. Running models locally provides privacy, offline capability, zero API costs, and lower latency.
Redstring supports any OpenAI-compatible local LLM server, including:
- Ollama - Easy-to-use local LLM runtime
- LM Studio - User-friendly desktop app for running models
- LocalAI - Self-hosted AI inference server
- vLLM - High-performance inference engine
- Custom servers - Any OpenAI-compatible endpoint
All local providers use the same OpenAI /v1/chat/completions API format, making them compatible with Redstring's Wizard agent.
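For example, a minimal request to Ollama's default endpoint looks like this (assuming the `llama2` model has been pulled; other servers differ only in port and model name):

```bash
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama2",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}]
  }'
```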
macOS:
```bash
brew install ollama
```
Linux:
```bash
curl -fsSL https://ollama.com/install.sh | sh
```
Windows: Download the installer from ollama.com

Start the server:
```bash
ollama serve
```
The server starts on http://localhost:11434 by default.
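To confirm the server is reachable, list its models through the same OpenAI-compatible API:

```bash
curl http://localhost:11434/v1/models
```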
Pull models:
```bash
# Popular models
ollama pull llama2
ollama pull llama3
ollama pull mistral
ollama pull codellama
ollama pull phi
ollama pull gemma
```

Configure in Redstring:
- Open Redstring → AI Panel → Click the 🔑 icon
- Select "💻 Local LLM Server" from the provider dropdown
- Click the "Ollama" preset button
- Verify endpoint: `http://localhost:11434/v1/chat/completions`
- Enter the model name (e.g., `llama2`)
- Click "Test Connection" to verify
- Click "Save Configuration"
Default Port: 11434
Common Models:
- `llama2` - Meta's Llama 2 (7B, 13B, 70B variants)
- `llama3` - Meta's Llama 3 (8B, 70B variants)
- `mistral` - Mistral 7B
- `codellama` - Code-focused Llama variant
- `phi` - Microsoft Phi models
- `gemma` - Google Gemma models
Setup Steps:
- Install Ollama from ollama.com
- Run `ollama serve` in a terminal
- Pull the desired model: `ollama pull <model-name>`
- Configure in Redstring using the Ollama preset
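After pulling, `ollama list` shows every installed model and its size:

```bash
ollama list
```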
Documentation: ollama.com
Default Port: 1234
Setup Steps:
- Download LM Studio from lmstudio.ai
- Install and launch the application
- Download a model through the LM Studio UI
- Start the local server (Settings → Local Server → Start Server)
- Configure in Redstring:
  - Endpoint: `http://localhost:1234/v1/chat/completions`
  - Model: Use the model name from LM Studio
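The model name Redstring expects is whatever identifier LM Studio reports for the loaded model; you can read it from the server's models route:

```bash
curl http://localhost:1234/v1/models
```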
Documentation: lmstudio.ai
Default Port: 8080
Setup Steps:
- Install LocalAI via Docker:
  `docker run -p 8080:8080 -ti localai/localai:latest`
- Or download binary from localai.io
- Configure in Redstring:
  - Endpoint: `http://localhost:8080/v1/chat/completions`
  - Model: `gpt-3.5-turbo` or the model name configured in LocalAI
Documentation: localai.io
Default Port: 8000
Setup Steps:
- Install vLLM: `pip install vllm`
- Start the server: `python -m vllm.entrypoints.openai.api_server --model <model-name>`
- Configure in Redstring:
  - Endpoint: `http://localhost:8000/v1/chat/completions`
  - Model: Use the model name you started vLLM with
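For example, a concrete launch (the Hugging Face model id is illustrative; choose one that fits your hardware):

```bash
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-7B-Instruct-v0.2 \
  --port 8000
```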
Documentation: docs.vllm.ai
If you have a custom server that implements the OpenAI API format:
- Ensure your server exposes a `/v1/chat/completions` endpoint
- Configure in Redstring:
  - Endpoint: `http://localhost:<port>/v1/chat/completions`
  - Model: Model name as recognized by your server
  - API Key: Only if your server requires authentication
Using a Preset:
- Open AI Panel → Click 🔑 icon
- Select "💻 Local LLM Server"
- Click a preset button (Ollama, LM Studio, etc.)
- Endpoint and model suggestions will auto-fill
- Adjust if needed, then test connection
- Save configuration
- Select "💻 Local LLM Server" from provider dropdown
- Enter the endpoint URL manually (e.g., `http://localhost:11434/v1/chat/completions`)
- Enter the model name
- Click "Test Connection" to verify
- Save configuration
The "Test Connection" button will:
- Check if the server is running
- Verify the endpoint is accessible
- List available models (if supported)
- Show clear error messages if connection fails
Recommended Models:
- llama3:8b - Good balance of quality and speed
- mistral - Fast and capable
- llama2:13b - Better quality, slower
Minimum Requirements:
- 8GB RAM for 7B models
- 16GB RAM for 13B models
- 32GB+ RAM for 70B models
Performance Tips:
- Use smaller models for faster responses (7B-8B parameters)
- Close other applications to free up RAM
- Use GPU acceleration if available (CUDA, Metal, etc.)
- Monitor system resources - local models can be CPU/RAM intensive
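With Ollama, for example, you can check which models are loaded and whether they are running on GPU or CPU (recent Ollama versions include this subcommand):

```bash
ollama ps
```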
Problem: Connection failed

Solutions:
- Verify the LLM server is running (check terminal/process list)
- Check that the port number matches your configuration
- Try accessing the endpoint directly: `curl http://localhost:11434/v1/models`
- Restart the server
Problem: Model not found

Solutions:
- Verify the model name matches exactly (case-sensitive)
- For Ollama: run `ollama list` to see installed models
- Pull the model if missing: `ollama pull <model-name>`
- Check server logs for model loading errors
Problem: Slow responses

Solutions:
- Use a smaller model (7B instead of 13B+)
- Close other applications to free RAM
- Enable GPU acceleration if available
- Reduce `max_tokens` in Advanced Settings
- Check system CPU/RAM usage
Problem: Out-of-memory errors or crashes

Solutions:
- Use a smaller model
- Reduce the `max_tokens` parameter
- Close other applications
- Restart the server
- Check system RAM availability
Problem: Port conflicts

Solutions:
- Stop other services using the port
- Change the port in your LLM server configuration
- Update the Redstring endpoint URL to match the new port
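On macOS/Linux, `lsof` shows which process holds a port, and Ollama's listen address can be overridden via the `OLLAMA_HOST` environment variable (the alternate port below is arbitrary):

```bash
# See which process is using Ollama's default port
lsof -i :11434

# Run Ollama on a different port instead
OLLAMA_HOST=127.0.0.1:11500 ollama serve
```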
Privacy:
- All data stays local - No API calls leave your machine
- No cloud processing - Everything runs on your hardware
- No data collection - Your conversations remain private
Security Notes:
- Local servers typically don't require API keys
- If your server requires authentication, configure it in Redstring
- Firewall rules may block localhost connections - adjust if needed
- Keep your LLM server software updated
Local Advantages:
- ✅ Zero API costs
- ✅ Complete privacy
- ✅ Works offline
- ✅ Lower latency (no network)
- ✅ No rate limits
Local Disadvantages:
- ❌ Requires powerful hardware
- ❌ Slower inference (CPU vs cloud GPU)
- ❌ Limited model selection
- ❌ Higher system resource usage
Cloud Advantages:
- ✅ No hardware requirements
- ✅ Fast inference (cloud GPUs)
- ✅ Access to latest models
- ✅ No system resource usage
Cloud Disadvantages:
- ❌ API costs
- ❌ Data sent to external servers
- ❌ Requires internet connection
- ❌ Rate limits
You can configure custom endpoints for:
- Remote servers on your network
- Docker containers
- Cloud instances with OpenAI-compatible APIs
- Reverse proxies
Example: http://192.168.1.100:11434/v1/chat/completions
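Most servers bind to localhost by default. With Ollama, for instance, reaching it from another machine means having it listen on all interfaces (and allowing the port through your firewall):

```bash
OLLAMA_HOST=0.0.0.0 ollama serve
```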
Most local servers don't require API keys. If your server does:
- Enter the API key in Redstring configuration
- The key will be stored locally (obfuscated)
- Sent in the `Authorization: Bearer <key>` header
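To verify authentication outside Redstring, a hand-written request carries the same header (the endpoint, model, and key below are placeholders for your setup):

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $LOCAL_LLM_API_KEY" \
  -d '{"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "ping"}]}'
```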
Adjust in Advanced Settings:
- Temperature - Controls randomness (0.0-1.0)
- Max Tokens - Maximum response length
- System Prompt - Customize Wizard behavior
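These settings map onto standard fields in the chat completions request body; roughly what goes over the wire (values are illustrative):

```bash
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama2",
    "temperature": 0.2,
    "max_tokens": 512,
    "messages": [
      {"role": "system", "content": "You are a concise assistant."},
      {"role": "user", "content": "Suggest three nodes for a graph about rivers."}
    ]
  }'
```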
- Start with Ollama - Easiest to set up and use
- Test connection before using - Verify server is accessible
- Monitor resources - Local models can be resource-intensive
- Use appropriate models - Match model size to your hardware
- Keep servers updated - Get latest features and fixes
- Document your setup - Note which models work best for your use case
- Server won't start - Check installation, ports, and logs
- Models won't load - Verify disk space and model files
- Slow performance - Check system resources and model size
- Connection errors - Verify endpoint URL and server status
- Ollama: ollama.com | GitHub
- LM Studio: lmstudio.ai
- LocalAI: localai.io | GitHub
- vLLM: docs.vllm.ai | GitHub
Example: Quick Start with Ollama
- Start Ollama: `ollama serve`
- Pull model: `ollama pull llama2`
- Configure in Redstring (Ollama preset)
- Ask Wizard: "Create a graph about renewable energy"
- Wizard generates nodes and edges using local model
Example: Mixing Local and Cloud Profiles
- Configure multiple profiles:
- Profile 1: Ollama (local, llama2)
- Profile 2: OpenRouter (cloud, claude-3-sonnet)
- Switch profiles as needed
- Each profile maintains its own configuration
Example: Comparing Models
- Pull multiple models: `ollama pull llama2`, `ollama pull llama3`, `ollama pull mistral`
- Test each model with the same prompt (see the timing loop after this list)
- Compare response quality and speed
- Choose best model for your use case
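A rough way to compare speed is to send the same prompt to each model and time it (model names are examples; response quality still needs a manual read):

```bash
for m in llama2 llama3 mistral; do
  echo "== $m =="
  time curl -s http://localhost:11434/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"$m\", \"messages\": [{\"role\": \"user\", \"content\": \"Explain photosynthesis in two sentences.\"}]}" \
    -o /dev/null
done
```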
Local LLM integration provides a powerful, private alternative to cloud-based AI services. With Redstring's support for OpenAI-compatible endpoints, you can use any local LLM server that fits your needs.
Start with Ollama for the easiest setup, then explore other providers as needed. Remember to test connections, monitor system resources, and choose models appropriate for your hardware.