If you want AI coding assistance without sending your code to the cloud, this setup is for you. Since VS Code 1.113, GitHub Copilot Chat natively supports custom language model providers — which means you can point it directly at a locally running Ollama instance and use whatever model you want.

This post covers two things: how I organize my Ollama models in a dedicated repository, and how to wire them up to VS Code.


Part 1: Organizing Ollama Models with Modelfiles

Rather than pulling models ad-hoc and tweaking parameters every time, I maintain a dedicated Git repository for my Ollama setup: cebor/ollama_models.

The core idea is a clean separation between two model roles — coding and planning — each tuned with different parameters and matched to the available hardware.

Hardware

The setup targets two machines with quite different capabilities:

Machine    Chip               RAM                       VRAM
MacBook    Apple M3           24 GB (unified memory)    —
PC         NVIDIA RTX 5090    64 GB                     32 GB

Context window sizes and model availability differ accordingly.

The Models

Model                       Type     Size (Q4_K_M)    Mac ctx    PC ctx
gemma4:26b-a4b-it-q4_K_M    MoE      ~18 GB           16384      65536
gemma4:31b-it-q4_K_M        Dense    ~21 GB           8192       32768
qwen3.6:27b-q4_K_M          Dense    ~17 GB           8192       32768
qwen3.6:35b-a3b-q4_K_M      MoE      ~24 GB           —          65536

A few things worth noting here:

  • Gemma4 26b is MoE — despite being smaller on disk (~18 GB), it’s used as the coding model. MoE (Mixture-of-Experts) architectures only activate a subset of parameters per token, making them significantly faster at inference. That speed advantage matters a lot during active coding sessions.
  • Gemma4 31b and Qwen3.6 27b are Dense — used for planning. Dense models reason more thoroughly per parameter, which is better suited for architectural discussions, reviewing context, and exploratory conversations.
  • Qwen3.6 35b is PC-only due to its size — it doesn’t fit comfortably on the MacBook’s unified memory.
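As a rough sanity check on the sizes above: Q4_K_M averages about 4.8 bits per weight, so on-disk size is roughly total parameters × 4.8 / 8 bytes. A minimal sketch (the bits-per-weight figure is an approximation, and real GGUF files run a bit larger because some tensors are kept at higher precision):

```python
def quantized_size_gb(params_billion: float, bits_per_weight: float = 4.8) -> float:
    """Rough on-disk size of a quantized model in GB (decimal).

    Underestimates slightly: ignores file metadata and the tensors
    that quantization schemes keep at higher precision.
    """
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# Total parameter counts (billions) from the table above
for name, params in [("gemma4 26b", 26), ("gemma4 31b", 31),
                     ("qwen3.6 27b", 27), ("qwen3.6 35b", 35)]:
    print(f"{name}: ~{quantized_size_gb(params):.1f} GB")
```

Note that only the total parameter count determines disk and memory footprint; the smaller active-parameter count of the MoE models (the a4b/a3b suffix) is what makes their inference fast.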

Custom Model Names

After pulling the base models, each is created with a custom name via ollama create:

Mac:

ollama create gemma4-26b-coding -f ./mac-m3-24gb/gemma4-26b-a4b-it-q4_K_M.txt
ollama create gemma4-31b-planning -f ./mac-m3-24gb/gemma4-31b-it-q4_K_M.txt
ollama create qwen3.6-27b-planning -f ./mac-m3-24gb/qwen3.6-27b-q4_K_M.txt

PC:

ollama create gemma4-26b-coding -f ./pc-rtx5090-32gb/gemma4-26b-a4b-it-q4_K_M.txt
ollama create gemma4-31b-planning -f ./pc-rtx5090-32gb/gemma4-31b-it-q4_K_M.txt
ollama create qwen3.6-27b-planning -f ./pc-rtx5090-32gb/qwen3.6-27b-q4_K_M.txt
ollama create qwen3.6-35b-coding -f ./pc-rtx5090-32gb/qwen3.6-35b-a3b-q4_K_M.txt

The result is a consistent set of named models (gemma4-26b-coding, gemma4-31b-planning, etc.) that show up identically in VS Code’s model picker on both machines, regardless of the underlying hardware-specific configuration.
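A quick way to verify the result on either machine is to list the models and inspect one of the generated configurations:

```shell
# Custom names should appear alongside the pulled base models
ollama list

# Print the Modelfile (FROM line plus PARAMETER settings) baked into a model
ollama show --modelfile gemma4-26b-coding
```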

Modelfile Parameters

Each Modelfile sets a handful of parameters tuned for the model’s role:

Parameter        Value                            Description
num_ctx          varies by model/hardware         Maximum context length in tokens
num_predict      2048 / 4096                      Maximum response length in tokens
temperature      0.2 (coding) / 0.5 (planning)    Lower = more deterministic; higher = more creative
repeat_penalty   1.1                              Discourages repetitive output

The temperature difference is intentional: coding benefits from deterministic, predictable output while planning is better served by a model that explores ideas more freely.
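For illustration, the Mac coding Modelfile might look like the sketch below. The parameter values are taken from the tables above; check the repository for the authoritative files:

```
FROM gemma4:26b-a4b-it-q4_K_M

PARAMETER num_ctx 16384
PARAMETER num_predict 2048
PARAMETER temperature 0.2
PARAMETER repeat_penalty 1.1
```

The PC variant of the same file would differ only in num_ctx (and num_predict), which is exactly why the per-machine folders exist.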

Repository Structure

ollama-modelfiles/
├── README.md
├── mac-m3-24gb/
│   ├── gemma4-26b-a4b-it-q4_K_M.txt
│   ├── gemma4-31b-it-q4_K_M.txt
│   └── qwen3.6-27b-q4_K_M.txt
├── pc-rtx5090-32gb/
│   ├── gemma4-26b-a4b-it-q4_K_M.txt
│   ├── gemma4-31b-it-q4_K_M.txt
│   ├── qwen3.6-27b-q4_K_M.txt
│   └── qwen3.6-35b-a3b-q4_K_M.txt
└── scripts/
    └── ollama-network-expose.ps1

The scripts/ folder also contains a PowerShell script for exposing the Ollama API on the local network — useful if you want to access your PC’s Ollama instance from other devices:

# Run from repo root in an elevated PowerShell session
powershell -ExecutionPolicy Bypass -File .\scripts\ollama-network-expose.ps1
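The PowerShell script isn't reproduced here, but the underlying mechanism in Ollama is the standard OLLAMA_HOST environment variable, which controls the listen address (127.0.0.1:11434 by default). On Linux or macOS the equivalent would be a one-line override:

```
# Bind the Ollama API to all interfaces instead of loopback only
OLLAMA_HOST=0.0.0.0:11434 ollama serve
```

Exposing the API this way makes it reachable by anyone on the local network, so only do this on networks you trust.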

Part 2: Integrating Ollama with VS Code Copilot Chat

Prerequisites

  • Ollama v0.18.3+
  • VS Code 1.113+
  • GitHub Copilot Chat extension 0.41.0+

Note on GitHub login: VS Code requires you to be signed in with a GitHub account to use the model selector, even for fully local, custom models. However, no paid subscription is required: the GitHub Copilot Free tier is sufficient to enable custom model selection.

Manual Setup

  1. Open the Copilot Chat sidebar from the top-right activity bar.
  2. Click the settings gear icon (⚙) to open the Language Models window.
  3. Click Add Models and select Ollama from the provider list — VS Code will load all locally available Ollama models.
  4. Click Unhide next to your Ollama models to make them selectable in chat.
  5. Make sure Local is selected at the bottom of the Copilot Chat panel.

That’s it. Your locally hosted models are now available directly in the Copilot Chat model picker.
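By default the Ollama provider connects to http://localhost:11434. To reach a remote instance instead, for example the PC's Ollama exposed via the script from Part 1, Copilot Chat lets you override the endpoint in settings.json. The setting key below matches current extension versions but may change, and the IP address is a placeholder:

```
{
  // Hypothetical LAN address of the machine running Ollama
  "github.copilot.chat.byok.ollamaEndpoint": "http://192.168.0.42:11434"
}
```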

Switching Between Models

Once set up, switching models is just a dropdown in the chat panel. In practice:

  • Coding / Agent / Edit sessions → gemma4-26b-coding or qwen3.6-35b-coding (PC only)
  • Planning / exploratory sessions → gemma4-31b-planning or qwen3.6-27b-planning

This mirrors the role separation maintained in the Modelfile repository and keeps the workflow consistent across both machines.


Wrapping Up

Running local models in VS Code Copilot Chat is surprisingly seamless once Ollama is set up. Maintaining Modelfiles in version control is worth the small upfront effort — a single ollama create command restores a known-good configuration on any machine.

The full repository with all Modelfiles and the network expose script is available at: github.com/cebor/ollama_models.


Have questions or want to share your own model setup? Feel free to reach out.