This is a summary of how I built and tuned a local OpenCode-style coding setup on AMD ROCm using Lemonade + llama.cpp, moving through multiple models, context issues, and performance tuning before settling on Gamma 26B A4B Instruct (IT).
Started with Lemonade running llama.cpp on ROCm via Docker:
docker run -d \
--name lemonade-server \
-p 13305:13305 \
-v lemonade-cache:/root/.cache/huggingface \
-v lemonade-llama:/opt/lemonade/llama \
ghcr.io/lemonade-sdk/lemonade-server:latest
Then moved to ROCm-enabled mode:
docker run -d \
--name lemonade-server \
-p 13305:13305 \
-v lemonade-cache:/root/.cache/huggingface \
-v lemonade-llama:/opt/lemonade/llama \
-v lemonade-recipe:/root/.cache/lemonade \
-e LEMONADE_LLAMACPP=rocm \
--device=/dev/kfd \
--device=/dev/dri \
ghcr.io/lemonade-sdk/lemonade-server:latest
ROCm stack confirmed:
Verified via:
rocminfo
rocm-smi
GPU utilization was high and stable, but performance depended heavily on llama.cpp flags and KV behaviour.
Even with:
--ctx-size 65536
Runtime still reported:
n_ctx_slot = 16384
Which eventually led to:
request (72859 tokens) exceeds available context size (16384 tokens)
So despite the CLI flag, actual slot context was effectively capped at 16K due to:
I cycled through:
Issues encountered:
<tool_call> tokensEventually it became clear that model behavior mattered less than runtime stability.
First meaningful gains came from:
--flash-attn on \
--parallel 4 \
--threads 4 \
--no-mmap \
--keep 32
This alone pushed throughput from baseline into the mid-30 TPS range.
Batch tuning had a major effect:
| Batch config | Behaviour |
|---|---|
| 2048 | high spikes, unstable latency |
| 512 | balanced |
| 256 | most stable |
Final direction:
--batch-size 256 --ubatch-size 64
Further tuning:
--kv-unified \
--cache-reuse 256 \
--no-warmup
And in some tests:
--no-context-shift (for stability)Key observation: KV reuse had more impact than raw compute tuning.
Logs consistently showed:
This caused:
This is the actual measured TPS comparison you provided:
| Configuration | Time | Tokens | TPS |
|---|---|---|---|
| Batch 512 | 39.457 | 768 | — |
| Batch 512 | 44.806 | 2048 | 239.29 |
| Batch 2048 | — | 16881 | — |
| Batch 2048 | 95 | 23025 | 64.67 |
After tuning, the stable configuration became:
--flash-attn on \
--parallel 4 \
--threads 4 \
--no-mmap \
--keep 32 \
--kv-unified \
--no-warmup \
--cache-reuse 256 \
--batch-size 256 \
--ubatch-size 64
This gave:
Even with:
--ctx-size 65536
The system still enforced:
Conclusion:
context size is not just a flag — it’s enforced by model + KV + runtime alignment
After all experiments, I settled on:
Gamma 26B A4B Instruct (IT)
Why:
Compared to earlier models:
Final system:
The key takeaway from the whole setup:
Most gains came not from the model itself, but from fixing context + slot behavior and stabilizing KV cache handling.