Ollama now includes a significantly improved model scheduling system. Before running a model, Ollama’s new engine now measures the exact amount of memory required, rather than relying on the estimates used in previous versions of Ollama. This has several benefits:
- nvidia-smi will now match ollama ps, making it easy to track memory utilization on your system

All models implemented in Ollama’s new engine now have this new feature enabled by default, with more models coming soon as they transition to Ollama’s new engine.
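One rough way to see this on an NVIDIA system is to compare the VRAM Ollama reports for loaded models with what the driver reports. The sketch below is illustrative only: it assumes Ollama is serving on the default localhost:11434 port, uses the size_vram field from the /api/ps endpoint, and relies on nvidia-smi's --query-gpu flag.

```python
# Sketch: compare Ollama's reported VRAM usage with nvidia-smi.
# Assumes Ollama is serving on the default http://localhost:11434
# and an NVIDIA GPU with nvidia-smi is present.
import json
import subprocess
import urllib.request

# Ask Ollama which models are currently loaded and how much VRAM each uses.
with urllib.request.urlopen("http://localhost:11434/api/ps") as resp:
    models = json.load(resp).get("models", [])

ollama_vram = sum(m.get("size_vram", 0) for m in models)  # bytes

# Ask nvidia-smi how much GPU memory is in use overall (reported in MiB).
out = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
    text=True,
)
gpu_used_mib = sum(int(line) for line in out.splitlines() if line.strip())

print(f"ollama ps total: {ollama_vram / 2**30:.1f} GiB")
print(f"nvidia-smi used: {gpu_used_mib / 1024:.1f} GiB")
```

With the new scheduler the two totals should line up closely, aside from a small amount of fixed driver overhead.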
gemma3:12b

| Old | New |
|---|---|
| 52.02 tokens/s token generation speed | 85.54 tokens/s token generation speed |
| 19.9GiB of VRAM | 21.4GiB of VRAM |
| 48/49 layers loaded on GPU | 49/49 layers loaded on GPU |
mistral-small3.2

| Old | New |
|---|---|
| 127.84 tokens/s prompt evaluation speed | 1380.24 tokens/s prompt evaluation speed |
| 43.15 tokens/s token generation speed | 55.61 tokens/s token generation speed |
| 19.9GiB of VRAM | 21.4GiB of VRAM |
| 40/41 layers loaded on GPU | 41/41 layers loaded on GPU + vision model |
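For context on how throughput numbers like these can be reproduced, prompt evaluation and token generation speed can be derived from the timing fields Ollama returns with a generation request. The sketch below is a rough illustration, not the benchmark setup used above: it assumes a local server on the default port, uses the prompt_eval_count/prompt_eval_duration and eval_count/eval_duration fields (reported in nanoseconds) from the /api/generate response, and the model name is just an example.

```python
# Sketch: measure prompt evaluation and token generation speed for one request.
# Assumes Ollama is serving on the default http://localhost:11434.
import json
import urllib.request

payload = {
    "model": "gemma3:12b",  # example model; any pulled model works
    "prompt": "Explain GPU memory scheduling in one paragraph.",
    "stream": False,        # return a single JSON object that includes timings
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.load(resp)

# Durations are reported in nanoseconds; convert to tokens per second.
prompt_tps = result["prompt_eval_count"] / result["prompt_eval_duration"] * 1e9
gen_tps = result["eval_count"] / result["eval_duration"] * 1e9

print(f"prompt evaluation: {prompt_tps:.2f} tokens/s")
print(f"token generation:  {gen_tps:.2f} tokens/s")
```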
All models implemented in Ollama’s new engine use the new memory management features:
- gpt-oss
- llama4, llama3.2-vision (soon: llama3.2, llama3.1, llama3)
- gemma3, embeddinggemma, gemma3n
- qwen3, qwen2.5vl (soon: qwen3-coder)
- mistral-small3.2
- all-minilm and other embedding models