Running Qwen3.6-A3B-Heretic on dual-GPU: the Mark VII deployment
For six months our chat brain was a 31B dense Gemma 4. It ran fine — about 50 tokens/sec of decode, 64K context — but we were starting to feel the ceiling. Long conversations with several agents in flight. Voice mode. Tool calls. Reasoning queries. The dense model was working hard.
Two weeks ago we swapped it for Qwen3.6-35B-A3B-Heretic, an abliterated MoE with 35B total parameters and roughly 3B activated per token. On the same dual-GPU rig, decode jumped to ~200 tok/s — four times faster — while quality on our internal evals went up. This post is the full deployment: the math that makes 35B feel like 3B, the tensor split, the KV-cache settings, the 262K-per-slot context, the adversarial battery, and the things that broke.
KULVEX is the self-hosted AI platform we're running this on. Most of the post is generic enough to apply to any llama.cpp deployment, but the integration details (numbered Mark releases, the model alias trick, the llama-server config) are KULVEX-flavoured.
Why MoE was the right move for chat
Mixture-of-Experts is the architecture nobody loves theoretically and everyone loves practically. The model has 35B parameters in total weight, but for any given token the router activates only a small subset of the experts in each layer; the active slice, plus the shared attention weights, comes to roughly 3B. The math:
Dense 31B    ── every token: 31B params * 2 ops/param = 62 GFLOPs
MoE 35B-A3B  ── every token:  3B params * 2 ops/param =  6 GFLOPs
Same hardware, same memory bandwidth, ~10× less compute per token. That's where the 4× decode speed-up comes from; the gap between 10× less compute and 4× more speed goes to routing cost, activation overhead, and the attention/KV work that doesn't shrink with sparsity. The total weights still need to live in VRAM, so you're paying for 35B of memory, but on a dual-GPU rig with 56 GB combined, that fits comfortably at Q4_K_M with room for a 262K-token context per slot.
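A quick sanity check on the memory side. The ~4.8 bits/weight figure for Q4_K_M is a rule of thumb, not an exact spec; actual GGUF size varies with the per-tensor quant mix:

# Approximate weight footprint at Q4_K_M (~4.8 bits/weight is a rule of
# thumb; the real GGUF size depends on the per-tensor quant mix)
echo "scale=1; 35 * 4.8 / 8" | bc    # ≈ 21.0 GB of weights
# leaves ~35 GB of the combined 56 GB for KV-cache, activations, mmproj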
For chat, this trade is exactly right. Dense parameter count buys you knowledge breadth and stylistic consistency, but tokens-per-second is what makes a conversation feel alive. MoE keeps the breadth and gives you the speed.
Why the "Heretic" abliterated variant
Qwen3.6's base release behaves well on benign prompts but refuses across a wide surface, including a number of cases where the right answer for an owner-controlled agent is to engage. Refusal is a behaviour like any other, and abliteration is a technique for removing it surgically.
The community has converged on doing abliteration in full bf16 precision (mlabonne, huihui-ai, mradermacher) before quantising. Doing it the other way round — quantising first, then trying to abliterate the quantised weights — produces garbled output (PPL in the 300k+ range; we tried). The Heretic variant is one of the bf16-abliterated checkpoints, then quantised cleanly.
We ran an adversarial battery against it after deployment: ten prompts of the kind that triggered refusals on the base Qwen3.6. Refusal rate: 0/10. The model engages with whatever the owner asks. That's the contract for a personal assistant — the operator decides the limits, not the model vendor.
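The battery runner itself is nothing exotic. A minimal sketch of the idea, with a hypothetical prompts.txt and illustrative refusal phrases (the real battery and its matching logic live in KULVEX's test suite, and /v1/chat/completions is llama-server's standard OpenAI-compatible endpoint):

# Hypothetical refusal-rate check: one prompt per line in prompts.txt,
# counts replies that open with a stock refusal phrase
fails=0
while IFS= read -r p; do
  body=$(jq -n --arg c "$p" '{messages:[{role:"user",content:$c}]}')
  reply=$(curl -s http://localhost:8090/v1/chat/completions \
    -H 'Content-Type: application/json' -d "$body" \
    | jq -r '.choices[0].message.content')
  echo "$reply" | grep -qiE "^(I can.t|I cannot|I.m sorry)" && fails=$((fails+1))
done < prompts.txt
echo "refusals: $fails/$(wc -l < prompts.txt)"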
The hardware: 4090 + 5090
The lab box is a single workstation:
- Ryzen 7 7700X, 64 GB DDR5
- RTX 4090 (24 GB VRAM, 1 TB/s memory bandwidth)
- RTX 5090 (32 GB VRAM, 1.79 TB/s memory bandwidth)
- Combined 56 GB VRAM across two PCIe x16 slots
- 2 TB NVMe for model weights
- 1500 W PSU (the 5090 spikes hard under MoE load)
The asymmetry between the two cards matters. The 5090 has 79% more bandwidth (1.79 vs 1.0 TB/s) and 33% more memory (32 vs 24 GB) than the 4090. For a tensor-parallel split you want proportionally more weights on the faster card, so that neither card sits waiting for the other to finish its slice.
The llama-server config that ships
Mark VII runs as a systemd user service. The launch line condenses a lot of choices:
# ~/.config/systemd/user/mnemo-llama.service (excerpt)
ExecStart=/opt/llama.cpp/llama-server \
--model data/models/Qwen3.6-35B-A3B-Abliterated-Heretic-Q4_K_M/\
Qwen3.6-35B-A3B-Heretic-Q4_K_M.gguf \
--mmproj data/models/Qwen3.6-35B-A3B-Abliterated-Heretic-Q4_K_M/\
mmproj-Qwen3.6-35B-A3B-Abliterated-Heretic.gguf \
--tensor-split 18,38 \
--n-gpu-layers 99 \
--ctx-size 524288 \
--parallel 2 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--port 8090 \
--host 0.0.0.0 \
--jinja \
--chat-template-kwargs '{"enable_thinking": false}'

The choices, one by one:
--tensor-split 18,38
--tensor-split takes relative proportions, not absolute layer counts: 18,38 puts roughly a third of the weights on the 4090 and two-thirds on the 5090. That roughly tracks the bandwidth ratio (1 : 1.79), shaded slightly further toward the 5090. An equal split leaves the 4090 as the bottleneck while the 5090 waits; this split keeps both cards busy.
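If you want the analytic starting point rather than eyeballing, the bandwidth-proportional share is a one-liner (we then shaded further toward the 5090 empirically, as described later in this post):

# Bandwidth-proportional share of the weights for the 4090
echo "scale=3; 1.0 / (1.0 + 1.79)" | bc -l    # ≈ .358 → ~36% on the 4090
# The shipped 18,38 gives the 4090 18/56 ≈ 32%, a bit less than
# proportional; that's where the empirical tuning settled.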
--ctx-size 524288 --parallel 2
Half a million tokens of context, divided across two parallel slots. Each slot gets 262,144 tokens: enough for a chat agent and a tool-using subagent in flight at the same time. The combined value is what llama.cpp's docs call n_ctx; with --parallel 2 it's divided evenly between the slots.
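You can confirm the per-slot figure after startup. Recent llama-server builds expose a /props endpoint; the exact field names vary across versions, so check your build:

# Per-slot context as the server sees it (jq path from a recent build)
curl -s http://localhost:8090/props | jq '.default_generation_settings.n_ctx'
# expected: 262144  (524288 / 2 slots)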
--cache-type-k q8_0 --cache-type-v q8_0
8-bit KV-cache. Halves the cache's memory footprint relative to the default f16, with negligible quality loss. At 524K total context, the f16 cache would otherwise eat 30+ GB by itself.
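Where the 30+ GB figure comes from: KV-cache size scales linearly in layer count, context length, KV heads, and head dimension. The dims below are illustrative placeholders, not this checkpoint's published architecture:

# KV bytes = 2 (K and V) * layers * ctx * kv_heads * head_dim * bytes/elem
# LAYERS/KV_HEADS/HEAD_DIM are placeholders for illustration
LAYERS=48; KV_HEADS=4; HEAD_DIM=128; CTX=524288
echo "f16 : $(( 2 * LAYERS * CTX * KV_HEADS * HEAD_DIM * 2 / 1073741824 )) GiB"
echo "q8_0: $(( 2 * LAYERS * CTX * KV_HEADS * HEAD_DIM * 1 / 1073741824 )) GiB"  # q8_0 ≈ 1 byte/elem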
--chat-template-kwargs '{"enable_thinking": false}'
Qwen3.6 supports an optional "thinking" mode that generates a <think> block before the answer. Useful for hard reasoning, but it leaks into chat output if anything in the wrapper doesn't strip it cleanly. We disable it for chat and re-enable it programmatically for code-style queries via the API. The output filter (covered here) is the second line of defence when it leaks anyway.
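Re-enabling it per request looks roughly like this. Recent llama.cpp server builds accept chat_template_kwargs in the request body; older servers ignore it silently, so verify against your build before relying on it:

# Per-request override: turn thinking back on for one hard query
curl -s http://localhost:8090/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"messages": [{"role": "user", "content": "Why is the sky blue?"}],
       "chat_template_kwargs": {"enable_thinking": true}}' \
  | jq -r '.choices[0].message.content'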
The numbers: Mark VI vs Mark VII, same hardware
Decode throughput and context are the numbers that moved (quality on our internal evals also went up, as noted above):

| Metric | Mark VI (Gemma 4 31B dense) | Mark VII (Qwen3.6-35B-A3B) |
|---|---|---|
| Decode speed | ~50 tok/s | ~200 tok/s |
| Context | 64K total | 262,144 per slot (524,288 total) |
| Active params per token | 31B | ~3B |

The 4× decode speedup is the headline. The other change that surprised us positively was tool-use schema compliance: Qwen3.6 emits cleaner JSON tool-call envelopes than Gemma 4 did, which cut down on retry-on-parse-failure paths in the orchestrator.
Things that broke during the swap
Thinking-mode leaks into chat
With thinking ON, ~5% of chat replies leaked the <think> block visibly. Disabling thinking via chat_template_kwargs fixed the bulk; the output filter caught the residual.
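The residual catch amounts to deleting any <think>...</think> span before text reaches the client. A shell-level sketch of the same idea (the real filter lives in the orchestrator, not in a pipe):

# Strip leaked <think>...</think> blocks, including multi-line ones
perl -0777 -pe 's/<think>.*?<\/think>\s*//gs' reply.txt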
PSU shutdowns under sustained load
We initially tried Qwen3-Coder-Next 80B at Q8_0. The 5090 spiked hard enough to trip the PSU rail twice in ten minutes. Mark VII at Q4_K_M draws much less peak power (the active 3B-per-token means fewer tensors lit up at once) and is rock-stable.
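If you suspect the same failure mode on your rig, log per-card draw during a long generation. Keep in mind that one-second sampling can still miss the millisecond transients that actually trip a PSU rail:

# Per-GPU power draw, once a second, during a sustained generation
nvidia-smi --query-gpu=index,power.draw,power.limit --format=csv -l 1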
Hardcoded model-name references
Our codebase had ~12 places hardcoded with the old mnemo:mark6-31b alias. Rather than churn all of them at swap time, we kept the alias as a stable identifier and let it point to the new model. The runtime doesn't care — the alias just routes to whichever weights are loaded on port 8090.
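If you'd rather pin the alias at the server than in the orchestrator, llama-server has an --alias flag for exactly this: it sets the model name reported by the REST API, so clients sending the old name keep working:

# Excerpt: report the old alias to API clients regardless of loaded weights
ExecStart=/opt/llama.cpp/llama-server \
    --alias mnemo:mark6-31b \
    ...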
Tensor-split miscalibration
Our first split was 24,32, eyeballed proportional to each card's VRAM. That overloads the 4090 relative to its bandwidth: the 5090 finished its slice and idled on every step while the 4090 caught up. The corrected 18,38 split came from replaying the same prompt and watching nvidia-smi until both cards saturated together.
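The calibration loop itself is mundane: replay one long prompt and watch both cards until neither idles while the other works:

# Side-by-side utilisation while replaying the calibration prompt
watch -n 1 'nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv,noheader'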
What this stack still can't do
- Not a coding model. Qwen3.6-A3B is good at chat and reasoning; for code generation we keep a dedicated GLM-4.7-Flash or Qwen3-Coder model ready to swap in. KULVEX's model selector handles the routing automatically when a task is code-heavy.
- Vision is competent, not great. The mmproj file makes Qwen3.6 multimodal: image input works, OCR works, basic scene description works. For dense diagrams or complex visual reasoning, the dedicated vision pipeline (YOLO + a smaller VLM) does better.
- The 56 GB rig is enthusiast-tier. The KULVEX installer auto-picks smaller models for lighter hardware (Qwen3-14B for single-24-GB cards, Qwen3-8B-MoE for 12 GB, etc.). Mark VII is the recommended-tier configuration (readiness label "maximum" in the UI), not the only option.
Run Mark VII (or auto-pick) on your hardware
KULVEX's installer probes your GPU(s) on first run and selects the best-fit model from a curated catalogue. On a 56 GB-class rig you'll get Mark VII; on lighter hardware you'll get a smaller model that fits with margin for context.
# Linux / macOS — single-line install
curl -fsSL https://kulvex.ai/install.sh | bash

# To force-swap to Mark VII manually:
bash /home/curly/jarvis/scripts/swap-to-qwen36.sh

# To roll back to Mark VI (Gemma 4 31B):
bash /home/curly/jarvis/scripts/swap-to-gemma4.sh
For pricing tiers and the recommended-hardware table, see kulvex.ai/pricing.
Related reading
- The output filter — catching the thinking-block leaks Mark VII still produces sometimes.
- Replacing Home Assistant — what Mark VII drives in the home stack.
- Private-by-default — why running a 35B model on your own GPU matters.
What we're looking for
If you're running Qwen3.6-A3B-Heretic on a different GPU configuration and your numbers diverge significantly from the table, we'd like to compare. Asymmetric tensor-split tuning across non-NVLink rigs is the specific corner where consumer hardware diverges most.