OpenCode + Local LLM Setup (ROCm / Lemonade / llama.cpp)

blog.mattsbit.co.uk

This is a summary of how I built and tuned a local OpenCode-style coding setup on AMD ROCm using Lemonade + llama.cpp, moving through multiple models, context issues, and performance tuning before settling on Gamma 26B A4B Instruct (IT).

1. Base Setup

Started with Lemonade running llama.cpp on ROCm via Docker:

docker run -d \
  --name lemonade-server \
  -p 13305:13305 \
  -v lemonade-cache:/root/.cache/huggingface \
  -v lemonade-llama:/opt/lemonade/llama \
  ghcr.io/lemonade-sdk/lemonade-server:latest

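Then moved to ROCm-enabled mode, passing the GPU devices through (the first container has to be removed before this one can reuse the name):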

docker run -d \
  --name lemonade-server \
  -p 13305:13305 \
  -v lemonade-cache:/root/.cache/huggingface \
  -v lemonade-llama:/opt/lemonade/llama \
  -v lemonade-recipe:/root/.cache/lemonade \
  -e LEMONADE_LLAMACPP=rocm \
  --device=/dev/kfd \
  --device=/dev/dri \
  ghcr.io/lemonade-sdk/lemonade-server:latest
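
Once the container is up, a quick smoke test helps confirm that the API answers and that the GPU device nodes are visible inside it. A minimal sketch, assuming Lemonade serves its OpenAI-compatible API under /api/v1 on the mapped port; the models route and the helper names are assumptions, not something taken from the setup above:

```shell
#!/bin/sh
# Hypothetical helpers for smoke-testing the container started above.
LEMONADE_HOST="${LEMONADE_HOST:-localhost}"
LEMONADE_PORT="${LEMONADE_PORT:-13305}"

# Build an API URL; the /api/v1 prefix is an assumption about Lemonade's routes.
lemonade_url() {
  printf 'http://%s:%s/api/v1/%s\n' "$LEMONADE_HOST" "$LEMONADE_PORT" "$1"
}

# List the models the server knows about; returns non-zero if the API is unreachable.
check_api() {
  curl -fsS "$(lemonade_url models)"
}

# Confirm the ROCm device nodes were passed through to the container.
check_devices() {
  docker exec lemonade-server ls -d /dev/kfd /dev/dri
}
```

Running check_api && check_devices after startup should fail fast if either the port mapping or the device passthrough is wrong.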

2. Hardware Check

ROCm stack confirmed:

AMD Ryzen AI MAX+ 395
Radeon 8060S (gfx1151)
Unified memory pool (~128GB)

Verified via:

rocminfo
rocm-smi
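
The rocminfo output is long, so it helps to reduce it to the one thing that matters here: the gfx target. A small sketch (the function name is hypothetical) that reads rocminfo output on stdin and prints the unique targets; on this machine the expected target is gfx1151:

```shell
# Read rocminfo output on stdin and print the unique gfx targets it mentions.
gfx_targets() {
  grep -oE 'gfx[0-9a-f]+' | sort -u
}

# Usage (requires the ROCm tools on the host):
#   rocminfo | gfx_targets
```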

GPU utilization was high and stable, but throughput depended heavily on llama.cpp flags and KV cache behaviour.

3. First Issue: Context Mismatch

Even with:

--ctx-size 65536

Runtime still reported:

n_ctx_slot = 16384

Which eventually led to:

request (72859 tokens) exceeds available context size (16384 tokens)

So despite the CLI flag, the effective per-slot context was capped at 16K, apparently due to some combination of:

model GGUF constraints
KV cache limits
Lemonade slot handling
context shifting behavior
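
One factor worth checking here, offered as a likely explanation rather than a confirmed diagnosis: without a unified KV cache, llama-server divides --ctx-size evenly across its --parallel slots. With the four slots used later in this post, 65536 / 4 is exactly the 16384 reported as n_ctx_slot. A small sketch of that arithmetic (the function names are hypothetical):

```shell
# Per-slot context when the total context is split evenly across slots.
ctx_per_slot() {
  echo $(( $1 / $2 ))
}

# Total --ctx-size needed so that each slot gets the requested window.
ctx_for_slots() {
  echo $(( $1 * $2 ))
}

ctx_per_slot 65536 4    # prints 16384
ctx_for_slots 65536 4   # prints 262144
```

If that is the cause, the options are to lower --parallel, raise --ctx-size, or share the cache across slots with --kv-unified, memory and model limits permitting.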

4. Model Experiments

I cycled through:

GLM-4.7-Flash-GGUF
Qwen3 8B (tool-call instability issues)
Qwen3 30B Q2/Q4 variants

Issues encountered:

unexpected <tool_call> tokens
context exhaustion at high prompt sizes
unstable KV reuse across slots

Eventually it became clear that model behavior mattered less than runtime stability.

5. Initial Performance Tuning

First meaningful gains came from:

--flash-attn on \
--parallel 4 \
--threads 4 \
--no-mmap \
--keep 32

This alone pushed generation throughput from baseline into the mid-30s tokens per second (TPS).

6. Batch Size Experiments

Batch tuning had a major effect:

Batch config	Behaviour
2048	high spikes, unstable latency
512	balanced
256	most stable

Final direction:

--batch-size 256 --ubatch-size 64
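
A sweep like this is easier to repeat with a small script. The sketch below only prints llama-bench commands, one per batch size, so they can be reviewed before running. It assumes llama.cpp's llama-bench binary is available; the model path, prompt length (-p 2048) and generation length (-n 128) are placeholders, not values from these tests. Note that llama-bench measures raw throughput, not the latency behaviour of a running server.

```shell
# Print one llama-bench command per batch size; the micro-batch is capped at the batch size.
batch_sweep() {
  model="$1"; shift
  for b in "$@"; do
    ub=64
    if [ "$b" -lt "$ub" ]; then ub="$b"; fi
    printf 'llama-bench -m %s -b %s -ub %s -fa 1 -p 2048 -n 128\n' "$model" "$b" "$ub"
  done
}

batch_sweep model.gguf 256 512 2048
```

Piping the output to sh runs the sweep once the commands look right.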

7. KV + Cache Tuning

Further tuning:

--kv-unified \
--cache-reuse 256 \
--no-warmup

And in some tests:

--no-context-shift (for stability)
reduced parallelism when slot issues appeared

Key observation: KV reuse had more impact than raw compute tuning.

8. Slot + Context Issues

Logs consistently showed:

LCP similarity-based slot reuse
KV cache accumulation across prompts
inflated memory state from previous sessions

This caused:

unpredictable context sizes
stale prompt reuse
occasional silent degradation before hard failure
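
These symptoms are easiest to spot by filtering the server log for the relevant lines. A minimal sketch that matches the strings quoted in this post (n_ctx_slot and the "exceeds ... context size" error) plus a loose "similarity" match for slot selection; the exact log wording may differ between llama.cpp versions, and the function name is hypothetical:

```shell
# Keep only log lines about slot context size, context overflow, or slot selection.
slot_events() {
  grep -iE 'n_ctx_slot|exceeds.*context size|similarity'
}

# Usage against the running container:
#   docker logs lemonade-server 2>&1 | slot_events
```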

9. Performance Comparison (Measured Data)

These are the TPS figures I measured:

Configuration	Time	Tokens	TPS
Batch 512	39.457	768	—
Batch 512	44.806	2048	239.29
Batch 2048	—	16881	—
Batch 2048	95	23025	64.67
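
For reference, throughput can be recomputed from the raw columns. A small sketch, assuming the Time column is wall-clock seconds (the table does not state the unit) and defining TPS as tokens divided by seconds. Values computed this way will not necessarily match the TPS column if that column came from a different counter, such as prompt processing speed rather than generation speed.

```shell
# tokens-per-second = tokens / seconds, printed to one decimal place.
tps() {
  awk -v tokens="$1" -v secs="$2" 'BEGIN { printf "%.1f\n", tokens / secs }'
}

tps 768 39.457     # 768 tokens over 39.457 s
tps 2048 44.806    # 2048 tokens over 44.806 s
tps 23025 95       # 23025 tokens over 95 s
```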

10. Final Stable Runtime Configuration

After tuning, the stable configuration became:

--flash-attn on \
--parallel 4 \
--threads 4 \
--no-mmap \
--keep 32 \
--kv-unified \
--no-warmup \
--cache-reuse 256 \
--batch-size 256 \
--ubatch-size 64

This gave:

stable throughput (~30–36 TPS typical range depending on load)
reduced KV churn
fewer slot reuse artifacts
more predictable latency under long contexts
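
Put together as a single command, the configuration might look like the sketch below. It assumes a direct llama-server invocation rather than Lemonade's own launcher; MODEL and PORT are placeholders, and --ctx-size 65536 is the value from the context tests earlier in this post.

```shell
MODEL="${MODEL:-/path/to/model.gguf}"   # placeholder path
PORT="${PORT:-8080}"                    # placeholder port

# The tuned flags from this section, one token per line.
tuned_flags() {
  printf '%s\n' \
    --flash-attn on \
    --parallel 4 \
    --threads 4 \
    --no-mmap \
    --keep 32 \
    --kv-unified \
    --no-warmup \
    --cache-reuse 256 \
    --batch-size 256 \
    --ubatch-size 64
}

# Start the server with the tuned flags (word splitting of the flag list is intentional).
start_server() {
  # shellcheck disable=SC2046
  llama-server -m "$MODEL" --port "$PORT" --ctx-size 65536 $(tuned_flags)
}
```

Keeping the flags in one function makes it easier to diff configurations between test runs.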

11. Context Failure Reality

Even with:

--ctx-size 65536

The system still enforced:

effective slot limit (~16K in some cases)
hard failures at ~72K token requests

Conclusion:

Context size is not just a flag: the effective limit comes from the model, the KV cache configuration, and the runtime's slot handling together.
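
One practical consequence is to check a request against the per-slot limit before sending it, rather than waiting for the server to reject it. A minimal sketch: the 16384 default is the n_ctx_slot value from the logs, while the 1024-token reserve for the reply and the function name are illustrative assumptions.

```shell
# Succeed only if the prompt tokens plus a reserve for the reply fit in one slot.
fits_slot() {
  prompt_tokens="$1"
  slot_ctx="${2:-16384}"
  reserve="${3:-1024}"
  [ $(( prompt_tokens + reserve )) -le "$slot_ctx" ]
}

if fits_slot 72859; then
  echo "ok to send"
else
  echo "too large: trim or summarise the context first"
fi
```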

12. Final Model Choice

After all experiments, I settled on:

Gamma 26B A4B Instruct (IT)

Why:

more stable under long sessions
fewer tool-call artifacts than Qwen3
consistent KV behavior in llama.cpp
better balance of speed vs reliability on ROCm APU

Compared to earlier models:

GLM → fast but unstable under long context
Qwen3 → powerful but noisy / tool-call prone
Gamma 26B A4B IT → stable and predictable

End State

Final system:

Lemonade + llama.cpp (ROCm backend)
tuned batching + KV settings
strict context control
Gamma 26B A4B IT as primary coding model
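
To point OpenCode at this stack, a project-level config along these lines is one option. This is a sketch under stated assumptions, not my exact file: it assumes OpenCode's custom-provider format with the @ai-sdk/openai-compatible package and Lemonade's API under /api/v1; "your-model-id" is a placeholder for whatever ID the server reports, and the output limit of 4096 is illustrative. The context limit is set to the per-slot value seen in the logs so the client does not overshoot it.

```shell
# Write an opencode.json that points OpenCode at the local server (path is optional).
write_opencode_config() {
  cat > "${1:-opencode.json}" <<'EOF'
{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "lemonade": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "Lemonade (local)",
      "options": { "baseURL": "http://localhost:13305/api/v1" },
      "models": {
        "your-model-id": {
          "name": "Local coding model",
          "limit": { "context": 16384, "output": 4096 }
        }
      }
    }
  }
}
EOF
}

# Usage: write_opencode_config ./opencode.json
```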

The key takeaway from the whole setup:

Most gains came not from the model itself, but from fixing context + slot behavior and stabilizing KV cache handling.
