If you want AI coding assistance without sending your code to the cloud, this setup is for you. Since VS Code 1.113, GitHub Copilot Chat natively supports custom language model providers — which means you can point it directly at a locally running Ollama instance and use whatever model you want.
This post covers two things: how I organize my Ollama models in a dedicated repository, and how to wire them up to VS Code.
Part 1: Organizing Ollama Models with Modelfiles
Rather than pulling models ad-hoc and tweaking parameters every time, I maintain a dedicated Git repository for my Ollama setup: cebor/ollama_models.
The core idea is a clean separation between two model roles — coding and planning — each tuned with different parameters and matched to the available hardware.
Hardware
The setup targets two machines with quite different capabilities:
| Machine | Chip | RAM | VRAM |
|---|---|---|---|
| MacBook | Apple M3 | 24 GB Unified Memory | — |
| PC | NVIDIA RTX 5090 | 64 GB RAM | 32 GB VRAM |
Context window sizes and which models are available on each machine differ accordingly.
The Models
| Model | Type | Size (Q4_K_M) | Mac ctx | PC ctx |
|---|---|---|---|---|
| gemma4:26b-a4b-it-q4_K_M | MoE | ~18 GB | 16384 | 65536 |
| gemma4:31b-it-q4_K_M | Dense | ~21 GB | 8192 | 32768 |
| qwen3.6:27b-q4_K_M | Dense | ~17 GB | 8192 | 32768 |
| qwen3.6:35b-a3b-q4_K_M | MoE | ~24 GB | — | 65536 |
A few things worth noting here:
- Gemma4 26b is MoE. Even though it's among the smaller models on disk (~18 GB), it's used as the coding model: MoE (Mixture-of-Experts) architectures only activate a subset of parameters per token, making them significantly faster at inference, and that speed advantage matters a lot during active coding sessions.
- Gemma4 31b and Qwen3.6 27b are Dense and are used for planning. Dense models reason more thoroughly per parameter, which makes them better suited for architectural discussions, reviewing context, and exploratory conversations.
- Qwen3.6 35b is PC-only due to its size — it doesn’t fit comfortably on the MacBook’s unified memory.
Custom Model Names
After pulling the base models, each is created with a custom name via ollama create:
Mac:
ollama create gemma4-26b-coding -f ./mac-m3-24gb/gemma4-26b-a4b-it-q4_K_M.txt
ollama create gemma4-31b-planning -f ./mac-m3-24gb/gemma4-31b-it-q4_K_M.txt
ollama create qwen3.6-27b-planning -f ./mac-m3-24gb/qwen3.6-27b-q4_K_M.txt
PC:
ollama create gemma4-26b-coding -f ./pc-rtx5090-32gb/gemma4-26b-a4b-it-q4_K_M.txt
ollama create gemma4-31b-planning -f ./pc-rtx5090-32gb/gemma4-31b-it-q4_K_M.txt
ollama create qwen3.6-27b-planning -f ./pc-rtx5090-32gb/qwen3.6-27b-q4_K_M.txt
ollama create qwen3.6-35b-coding -f ./pc-rtx5090-32gb/qwen3.6-35b-a3b-q4_K_M.txt
The result is a consistent set of named models (gemma4-26b-coding, gemma4-31b-planning, etc.) that show up identically in VS Code’s model picker on both machines, regardless of the underlying hardware-specific configuration.
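To verify that the custom names registered correctly, ollama list shows them alongside the base models, and ollama show can print the generated Modelfile back out. Both are standard Ollama commands, shown here with one of the model names created above:
# Check that the custom models exist and inspect their baked-in parameters
ollama list
ollama show gemma4-26b-coding --modelfile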
Modelfile Parameters
Each Modelfile sets a handful of parameters tuned for the model’s role:
| Parameter | Value | Description |
|---|---|---|
| num_ctx | varies by model/hardware | Maximum context length in tokens |
| num_predict | 2048 / 4096 | Maximum response length |
| temperature | 0.2 (coding) / 0.5 (planning) | Lower = more deterministic; higher = more creative |
| repeat_penalty | 1.1 | Prevents repetitive outputs |
The temperature difference is intentional: coding benefits from deterministic, predictable output while planning is better served by a model that explores ideas more freely.
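For illustration, a Modelfile for the coding role on the Mac might look roughly like the sketch below. The values are assembled from the tables above rather than copied from the repository, and num_predict 2048 is an assumption for this particular file:
# mac-m3-24gb/gemma4-26b-a4b-it-q4_K_M.txt (illustrative values, not the repository version)
FROM gemma4:26b-a4b-it-q4_K_M
PARAMETER num_ctx 16384
PARAMETER num_predict 2048
PARAMETER temperature 0.2
PARAMETER repeat_penalty 1.1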
Repository Structure
ollama_models/
├── README.md
├── mac-m3-24gb/
│ ├── gemma4-26b-a4b-it-q4_K_M.txt
│ ├── gemma4-31b-it-q4_K_M.txt
│ └── qwen3.6-27b-q4_K_M.txt
├── pc-rtx5090-32gb/
│ ├── gemma4-26b-a4b-it-q4_K_M.txt
│ ├── gemma4-31b-it-q4_K_M.txt
│ ├── qwen3.6-27b-q4_K_M.txt
│ └── qwen3.6-35b-a3b-q4_K_M.txt
└── scripts/
└── ollama-network-expose.ps1
The scripts/ folder also contains a PowerShell script for exposing the Ollama API on the local network — useful if you want to access your PC’s Ollama instance from other devices:
# Run from repo root in an elevated PowerShell session
powershell -ExecutionPolicy Bypass -File .\scripts\ollama-network-expose.ps1
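The script itself isn't reproduced here, but the usual recipe on Windows is to bind Ollama to all interfaces via the OLLAMA_HOST environment variable and open the default port in the firewall. A minimal sketch of that approach, with an illustrative rule name that isn't taken from the repository script:
# Bind the Ollama API to all interfaces on the default port (persists for the current user)
[Environment]::SetEnvironmentVariable("OLLAMA_HOST", "0.0.0.0:11434", "User")
# Allow inbound traffic to the Ollama port; "Ollama API" is just an example rule name
New-NetFirewallRule -DisplayName "Ollama API" -Direction Inbound -Protocol TCP -LocalPort 11434 -Action Allow
# Restart Ollama afterwards so the new environment variable takes effect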
Part 2: Integrating Ollama with VS Code Copilot Chat
Prerequisites
- Ollama v0.18.3+
- VS Code 1.113+
- GitHub Copilot Chat extension 0.41.0+
Note on GitHub login: VS Code requires you to be signed in with a GitHub account to use the model selector, even for fully local, custom models. However, no paid GitHub Copilot subscription is required; the GitHub Copilot Free tier is sufficient to enable custom model selection.
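Before wiring anything into VS Code, it's also worth confirming that the Ollama API is reachable. A request to the standard /api/tags endpoint on the default port should return JSON listing the models created in Part 1:
curl http://localhost:11434/api/tags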
Manual Setup
- Open the Copilot Chat sidebar from the top-right activity bar.
- Click the settings gear icon (⚙) to open the Language Models window.
- Click Add Models and select Ollama from the provider list — VS Code will load all locally available Ollama models.
- Click Unhide next to your Ollama models to make them selectable in chat.
- Make sure Local is selected at the bottom of the Copilot Chat panel.
That’s it. Your locally hosted models are now available directly in the Copilot Chat model picker.
Switching Between Models
Once set up, switching models is just a dropdown in the chat panel. In practice:
- Coding / Agent / Edit sessions → gemma4-26b-coding or qwen3.6-35b-coding (PC only)
- Planning / exploratory sessions → gemma4-31b-planning or qwen3.6-27b-planning
This mirrors the role separation maintained in the Modelfile repository and keeps the workflow consistent across both machines.
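When switching back and forth, ollama ps (a standard Ollama command) shows which models are currently loaded and how long they will stay resident, which is handy on the memory-constrained MacBook:
ollama ps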
Wrapping Up
Running local models in VS Code Copilot Chat is surprisingly seamless once Ollama is set up. Maintaining Modelfiles in version control is worth the small upfront effort — a single ollama create command restores a known-good configuration on any machine.
The full repository with all Modelfiles and the network expose script is available at: github.com/cebor/ollama_models.
Have questions or want to share your own model setup? Feel free to reach out.