DGX GB10 / llama.cpp / 2026-05-13

Gemma4 + Qwen3.6 MTP Inference Optimization Report

The goal is to keep PPL low for the Gemma4 family on llama.cpp while raising decode TPS through the MTP / ngram-mod / TBQ KV / FP4 candidate routes. Qwen3.6 27B MTP serves as the native-MTP control group.

Master goal: NOT DONE

PPL and the 26B canary coherence check are done; promoted-candidate coherence for 31B/E4B/26B, clean TPS, fallback, and RC deploy remain.

Gemma4 31B candidate: pure K3/V3 + MTP b4

PPL `30.26`, ASR-active long-output median `25.10 t/s`, KV about `359 MiB`.

Gemma4 26B candidate: K3/V4 + MTP b3

PPL `25.71`, balanced default; pure K3/V3 can serve as the memory lane.

Qwen3.6 control group: native MTP d4, 1.67x

Unsloth Q4_K_M + `mtp-clean` server: 512-token avg `19.30 t/s` vs baseline `11.54`.

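Current limitation: ASR-active, noisy

Most new TPS numbers are still affected by ASR / resident services; the candidate ranking is reliable, but clean absolute TPS awaits a rerun on an idle machine.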

Master Goal Checkpoint

After each sub-goal is completed, re-check the master goal: unless all criteria are met, it cannot be declared done.

| Area | Status | Next action |
| --- | --- | --- |
| PPL | Done for promoted Q4/TBQ candidates | Do not rerun unless artifact/kernel changes. |
| Output quality | Partial | Run promoted-candidate UTF-8 gates for 31B, 26B, and E4B. |
| TPS / accept | Blocked for clean final | Repeat clean gate only when ASR is idle or explicitly paused. |
| KV final table | Partial | Finish candidate-level 8K/128K FP16/INT8/TBQ comparison. |
| Deploy safety | Not done | Back up/restore `:8080` fallback before RC service promotion. |

New runner ready: `ops/dgx/run_gemma4_promoted_coherence_gate.py`. Current dry-run correctly refuses to start candidates because ASR is ~428% CPU and resident Gemma `:8092` is active.
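The refusal behavior in the dry-run can be illustrated with a small preflight check. This is only a sketch and does not reproduce the contents of `ops/dgx/run_gemma4_promoted_coherence_gate.py`; the `HostState` shape, the function name, and the 50% CPU threshold are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class HostState:
    """Snapshot of the host before a gate run (hypothetical shape)."""
    asr_cpu_percent: float  # summed CPU% of ASR processes
    active_resident_ports: list = field(default_factory=list)


def preflight_blockers(state: HostState, max_asr_cpu: float = 50.0) -> list:
    """Return the reasons to refuse starting candidates; empty means clean."""
    blockers = []
    if state.asr_cpu_percent > max_asr_cpu:  # threshold is an assumption
        blockers.append(f"ASR busy: {state.asr_cpu_percent:.0f}% CPU")
    for port in state.active_resident_ports:
        blockers.append(f"resident service active on :{port}")
    return blockers


# Values taken from the dry-run described above.
current = HostState(asr_cpu_percent=428.0, active_resident_ports=[8092])
blockers = preflight_blockers(current)
for reason in blockers:
    print("REFUSE:", reason)
```

Returning a list of reasons rather than a boolean keeps the dry-run output self-explanatory when more than one condition blocks a clean run.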

One-Page Decision

Ranking for production / canary / the next test round.

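1. 31B Q4_K_M + pure K3/V3 + MTP/ngram b4

Currently the strongest large-model route. PPL is safe, KV is smaller than K3/V4, and long-output TPS leads. Suited to non-real-time high-quality inference and long-context tasks.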

2. E4B Q4_K_M + pure K3/V3 or K3/V4 + MTP/ngram b3

pure K3/V3 has the best PPL and a small KV; K3/V4 has a slight speed margin. Both can enter the coherence / clean TPS gate.

3. 26B-A4B Q4_K_M + K3/V4 + MTP/ngram b3

Balanced default. pure K3/V3 has higher PPL but is not broken, so it can be kept as the 128K / memory-first lane.

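4. E2B: do not prioritize MTP / pure K3V3

E2B pure K3/V3 PPL gets worse, and the earlier MTP TPS gain was not significant; it is not a mainline unless a special small-model prompt class appears.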

Gemma4 PPL Gate

Canonical chat-template strided PPL, `-c 256 -b 256 -ub 256 --ppl-stride 64 --chunks 16`. Values are the cumulative PPL at chunk 16.

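| Model | q8/q8 | K3/V4 | pure K3/V3 | Readout |
| --- | --- | --- | --- | --- |
| 31B | 31.5688 | 31.8968 | 30.2595 | pure K3/V3 is PPL-safe and slightly better. |
| 26B-A4B | 29.0967 | 25.7098 | 38.2839 | K3/V4 is balanced; pure K3/V3 can serve as the memory lane. |
| E4B | 26.2416 | 25.8709 | 24.6580 | pure K3/V3 is PPL-safe and the best. |
| E2B | 90.3992 | 92.6161 | 105.1485 | pure K3/V3 hurts quality; not prioritized. |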
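To make the readouts in the table above easier to compare across models, the PPL values can be expressed as relative change against the q8/q8 column. The sketch below assumes q8/q8 is the reference configuration (the report does not state this explicitly) and sets no pass/fail threshold; the function name is illustrative.

```python
# Chunk-16 cumulative PPL values copied from the PPL gate table.
PPL = {
    "31B":     {"q8/q8": 31.5688, "K3/V4": 31.8968, "pure K3/V3": 30.2595},
    "26B-A4B": {"q8/q8": 29.0967, "K3/V4": 25.7098, "pure K3/V3": 38.2839},
    "E4B":     {"q8/q8": 26.2416, "K3/V4": 25.8709, "pure K3/V3": 24.6580},
    "E2B":     {"q8/q8": 90.3992, "K3/V4": 92.6161, "pure K3/V3": 105.1485},
}


def ppl_delta_percent(model: str, kv_mode: str) -> float:
    """Relative PPL change of kv_mode against q8/q8, in percent."""
    base = PPL[model]["q8/q8"]  # assumed reference column
    return (PPL[model][kv_mode] / base - 1.0) * 100.0


for model in PPL:
    for mode in ("K3/V4", "pure K3/V3"):
        print(f"{model:8s} {mode:10s} {ppl_delta_percent(model, mode):+6.1f}%")
```

Negative values mean the TBQ configuration scored a lower PPL than q8/q8 on this short 16-chunk run; with so few chunks, small differences should be read as noise rather than a real improvement.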

Gemma4 UTF-8 / Coherence Gate

2026-05-13 23:55, run against the existing `:8092` Gemma4 26B 128K canary. This is a small ASR-active coherence / accept smoke test, not a clean TPS measurement.

| Prompt | Pass | t/s | Accept | Readout |
| --- | --- | --- | --- | --- |
| zh_marker | True | 45.62 | 77.78% | Traditional Chinese marker, no mojibake. |
| embedded_facts | True | 176.11 | 100.00% | Correctly restates Gemma4 + llama.cpp + MTP/ngram + TBQ3. |
| repeat_line | True | 190.73 | 76.56% | Six exact Traditional Chinese lines. |

Conclusion: the 19:40 canary failure was caused by request mojibake and the self-introspection gate, not by an incoherent Gemma4 26B MTP+ngram+TBQ canary. Future promotion gates should use an embedded-fact prompt plus server args/timings rather than asking the model to identify its own runtime flags.
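A minimal sketch of an embedded-fact check with a mojibake heuristic follows. It is not the gate used for the run above: the function names, the fact list, and the sample reply are illustrative assumptions. The mojibake test tries to undo a Latin-1 or CP1252 mis-decoding and flags the text if that round trip changes it.

```python
def looks_like_mojibake(text: str) -> bool:
    """Heuristic: text is suspect if it holds U+FFFD or 'repairs' to other text."""
    if "\ufffd" in text:
        return True
    for codec in ("latin-1", "cp1252"):
        try:
            repaired = text.encode(codec).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue  # not representable in this codec, so not this failure mode
        if repaired != text:
            return True
    return False


def embedded_fact_gate(response: str, required_facts: list) -> dict:
    """Pass only if every embedded fact is restated and the text is clean."""
    lowered = response.lower()
    missing = [fact for fact in required_facts if fact.lower() not in lowered]
    mojibake = looks_like_mojibake(response)
    return {"pass": not missing and not mojibake,
            "missing": missing, "mojibake": mojibake}


# Facts embedded in the prompt (from the embedded_facts row); reply is made up.
FACTS = ["Gemma4", "llama.cpp", "MTP", "TBQ3"]
good_reply = "本次測試使用 Gemma4 與 llama.cpp,啟用 MTP/ngram 及 TBQ3。"
broken_reply = good_reply.encode("utf-8").decode("latin-1")  # simulated mojibake

print(embedded_fact_gate(good_reply, FACTS))
print(embedded_fact_gate(broken_reply, FACTS))
```

Runtime flags are deliberately not checked against the model's reply here; they would come from the server's own args and timings, as the conclusion above recommends.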

Gemma4 TPS / Accept Matrix

ASR-active core gate. Full rows are expanded below; the candidate ranking is reliable, but absolute TPS awaits a clean repeat.

| Model | Case | p10 t/s | median t/s | accept med | KV MiB |
| --- | --- | --- | --- | --- | --- |
| 31B | pure K3/V3 base | 9.36 | 9.42 | - | 359.38 |
| 31B | pure K3/V3 MTP b4 | 18.40 | 25.10 | 65.97% | 359.38 |
| 31B | K3/V4 MTP b4 | 18.56 | 23.76 | 66.89% | 424.06 |
| 26B-A4B | pure K3/V3 base | 49.31 | 49.68 | - | 89.84 |
| 26B-A4B | pure K3/V3 MTP b3 | 52.84 | 61.39 | 68.03% | 89.84 |
| 26B-A4B | K3/V4 MTP b3 | 51.91 | 61.62 | 71.04% | 106.02 |
| E4B | pure K3/V3 base | 45.74 | 46.18 | - | 32.81 |
| E4B | pure K3/V3 MTP b3 | 57.37 | 61.91 | 51.39% | 32.81 |
| E4B | K3/V4 MTP b3 | 56.82 | 63.12 | 51.46% | 38.72 |
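The matrix above reports raw t/s only. A short sketch for deriving median speedup over the base row is shown below. Note that the only base rows are pure K3/V3, so the K3/V4 ratios compare across KV modes; that pairing is an assumption of this sketch, not something the report states.

```python
# (model, case) -> (p10 t/s, median t/s), copied from the matrix above.
ROWS = {
    ("31B", "pure K3/V3 base"): (9.36, 9.42),
    ("31B", "pure K3/V3 MTP b4"): (18.40, 25.10),
    ("31B", "K3/V4 MTP b4"): (18.56, 23.76),
    ("26B-A4B", "pure K3/V3 base"): (49.31, 49.68),
    ("26B-A4B", "pure K3/V3 MTP b3"): (52.84, 61.39),
    ("26B-A4B", "K3/V4 MTP b3"): (51.91, 61.62),
    ("E4B", "pure K3/V3 base"): (45.74, 46.18),
    ("E4B", "pure K3/V3 MTP b3"): (57.37, 61.91),
    ("E4B", "K3/V4 MTP b3"): (56.82, 63.12),
}


def median_speedups(rows: dict) -> dict:
    """Median t/s of each MTP case divided by the model's base median t/s."""
    base = {model: med for (model, case), (_, med) in rows.items()
            if case.endswith("base")}
    return {(model, case): med / base[model]
            for (model, case), (_, med) in rows.items()
            if not case.endswith("base")}


speedups = median_speedups(ROWS)
for (model, case), ratio in speedups.items():
    print(f"{model:8s} {case:18s} {ratio:.2f}x")
```

Because these rows are ASR-active, the ratios are useful for ranking candidates and should not be quoted as clean speedups.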

Note: the earlier family pure K3/V3 gate included E2B rows: pure base `79.00 t/s`, pure MTP `67.80 t/s`, K3/V4 MTP `83.86 t/s`. Neither the E2B PPL nor its MTP gain supports listing it as a main candidate.

Qwen3.6 Native MTP

Control group: Unsloth Qwen3.6 27B MTP GGUF + `am17an/llama.cpp:mtp-clean`.

Model
`Qwen3.6-27B-Q4_K_M.gguf`
Runtime
`llama-server --spec-type draft-mtp`
512 baseline
`11.54 t/s` avg
512 MTP d4
`19.30 t/s` avg
Speedup
`1.67x`
Acceptance
`72.9-75.4%`

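Qwen3.6 Side-Lane Summary

Provides a reference for Gemma4 MTP behavior; it does not replace the Gemma mainline.

| Lane | Status | Data / observation | Decision |
| --- | --- | --- | --- |
| Native MTP | Usable | 512 tokens: `11.54 -> 19.30 t/s`, about `1.67x`. | Use as the llama.cpp single-file MTP reference. |
| DFlash draft | Runs, but match-sensitive | NVFP4 + DFlash first smoke `19.9-21.3 t/s`, accept `28-30%`; UD-Q4 target accept can reach `41%+`. | Not the cleanest control group; requires a target/drafter artifact match. |
| NVFP4 GGUF | Not MTP-preserved | `Qwen3.6-27B-NVFP4-Q4_K_M.gguf` has no `nextn`; target-only tg64 `11.88`. | Usable as an FP4 target smoke, not a native MTP artifact. |
| Old feat/mtp-runtime | Not applicable | Unsloth clean GGUF still reports `expected 866, got 857`. | Do not use this branch to test Qwen3.6 MTP. |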

Full Test Data Appendix

Below are all known test rows compiled into this report so far. ASR-active / noisy data keeps its caveat and is not used as clean absolute TPS.

Gemma4 KV Cache Size: FP16 / INT8 / TBQ3

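Based on the KV self size from the existing bench, assuming that table was measured at 8K context; 128K is a linear x16 conversion. FP16/INT8 are estimates back-calculated by bit width from the measured pure K3/V3 size; K3/V4 is measured. Savings are relative to FP16 KV as the 100% baseline.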

| Model | FP16 KV 8K | INT8 KV 8K | K3/V4 KV 8K | pure K3/V3 KV 8K | FP16 KV 128K | INT8 KV 128K | K3/V4 KV 128K | pure K3/V3 KV 128K | TBQ3 saving vs FP16 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| E2B | 62.51 MiB | 31.25 MiB | 13.83 MiB | 11.72 MiB | 0.98 GiB | 0.49 GiB | 0.22 GiB | 0.18 GiB | K3/V3 saves 81.3%; K3/V4 saves 77.9% |
| E4B | 174.99 MiB | 87.49 MiB | 38.72 MiB | 32.81 MiB | 2.73 GiB | 1.37 GiB | 0.61 GiB | 0.51 GiB | K3/V3 saves 81.3%; K3/V4 saves 77.9% |
| 26B-A4B | 479.15 MiB | 239.57 MiB | 106.02 MiB | 89.84 MiB | 7.49 GiB | 3.74 GiB | 1.66 GiB | 1.40 GiB | K3/V3 saves 81.3%; K3/V4 saves 77.9% |
| 31B | 1,916.69 MiB | 958.35 MiB | 424.06 MiB | 359.38 MiB | 29.95 GiB | 14.97 GiB | 6.63 GiB | 5.62 GiB | K3/V3 saves 81.3%; K3/V4 saves 77.9% |
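The FP16 and INT8 columns above can be reproduced from the measured pure K3/V3 sizes. The sketch below assumes pure K3/V3 is treated as exactly 3 bits per element, which is what the table's numbers imply; the function and constant names are illustrative.

```python
MIB_PER_GIB = 1024

# Measured pure K3/V3 KV self size at the assumed 8K context, in MiB.
PURE_K3V3_8K_MIB = {"E2B": 11.72, "E4B": 32.81, "26B-A4B": 89.84, "31B": 359.38}


def estimate_kv_mib(pure_k3v3_mib: float, bits: float, context_scale: int = 1) -> float:
    """Scale a 3-bit KV measurement to another bit width and context multiple."""
    return pure_k3v3_mib * (bits / 3.0) * context_scale


def saving_vs_fp16(kv_mib: float, fp16_mib: float) -> float:
    """Percent of FP16 KV memory saved by a smaller KV format."""
    return (1.0 - kv_mib / fp16_mib) * 100.0


for model, measured in PURE_K3V3_8K_MIB.items():
    fp16_8k = estimate_kv_mib(measured, bits=16)
    int8_8k = estimate_kv_mib(measured, bits=8)
    fp16_128k_gib = estimate_kv_mib(measured, bits=16, context_scale=16) / MIB_PER_GIB
    print(f"{model:8s} FP16 8K {fp16_8k:9.2f} MiB  INT8 8K {int8_8k:8.2f} MiB  "
          f"FP16 128K {fp16_128k_gib:6.2f} GiB  "
          f"K3/V3 saving {saving_vs_fp16(measured, fp16_8k):.1f}%")
```

The K3/V4 column is measured rather than derived, so it is not recomputed here; its 77.9% saving follows from the measured size divided by the FP16 estimate.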

Gemma4 Family Pure K3/V3 Gate

| Model | pure K3/V3 base (t/s) | pure K3/V3 MTP (t/s) | pure speedup | K3/V4 MTP (t/s) | pure KV | K3/V4 KV | policy |
| --- | --- | --- | --- | --- | --- | --- | --- |
| E2B | 79.00 | 67.80 | 0.86x | 83.86 | 11.72 MiB | 13.83 MiB | avoid MTP/pure K3V3 for speed |
| E4B | 45.58 | 55.50 | 1.22x | 55.40 | 32.81 MiB | 38.72 MiB | pure K3/V3 b3 candidate |
| 26B-A4B | 48.58 | 49.02 | 1.01x | 45.16 | 89.84 MiB | 106.02 MiB | pure K3/V3 b3 candidate, rerun PPL |
| 31B | 9.60 | 21.26 | 2.21x | 15.41 | 359.38 MiB | 424.06 MiB | strongest pure K3/V3 b4 candidate |

Gemma4 Long-token / No-CUDA-Graphs Gate

| Model | Recommended long-token mode | Speed | Accept | KV | Note |
| --- | --- | --- | --- | --- | --- |
| 31B | pure K3/V3 MTP b4 + no graphs | 23.79 t/s | 65.97% | 359.38 MiB | same speed as K3/V4, lower KV |
| 26B-A4B | pure K3/V3 MTP b3 + no graphs | 62.14 t/s | 68.03% | 89.84 MiB | same/slightly better speed than K3/V4, lower KV |
| E4B | K3/V4 MTP b3 for speed, pure K3/V3 b3 for memory | 63.82 K3/V4 / 61.89 pure | ~51% | 38.72 / 32.81 MiB | small trade-off |
| E2B | no MTP priority | - | - | - | previous gate showed MTP not useful |

Gemma4 26B Artifact Baseline Probe

| Artifact | tg512 graph-on | tg512 no-graphs | Readout |
| --- | --- | --- | --- |
| current symlink / Bartowski Q4 | 42.96 | 43.84 | current MTP target remains better in chat MTP |
| official-dir Q4 | 55.68 | 47.17 | target-only faster, but did not transfer to chat MTP |
| uncensored Q4 | 53.85 | 47.94 | target-only strong |
| MXFP4_MOE | 46.10 | 45.74 | quality smoke passed, speed not default |
| IQ3_XXS | 54.79 | 47.83 | target-only strong, not current MTP default |
| NVFP4 ggufbench | 31.43 | 30.06 | quality fixed but slow |
| NVFP4 catlilface | 29.38 | 27.09 | external NVFP4 smoke passed, not fast |
| Heretic Q4 | fail | fail | failed to load |

Gemma4 26B Official Target Chat MTP Compare

| Target | pure K3/V3 base | pure K3/V3 MTP | speedup | accept | K3/V4 MTP |
| --- | --- | --- | --- | --- | --- |
| current / Bartowski | 43.67 | 47.34 | 1.08x | 68.03% | 42.29 |
| official-dir | 46.53 | 39.72 | 0.85x | 67.97% | 41.18 |

Gemma4 31B Artifact Baseline Probe

| Artifact | tg512 graph-on | tg512 no-graphs | Readout |
| --- | --- | --- | --- |
| official Q4 | 8.04 | 7.73 | Q4 ceiling confirmed |
| base-path Q4 | 8.24 | 7.69 | same class as official |
| abliterated Q4 | 8.82 | 8.61 | ~7-10% faster due to smaller file |
| local NVFP4 | 6.30 | 6.12 | quality fixed but slower |
| DFlash v1/v2/v3 | fail | fail | draft/special files, not target replacements here |

Gemma4 FP4 / Quality Gates

| Case | Result | Decision |
| --- | --- | --- |
| 26B NVFP4 fuller PPL | NVFP4 2.8483 vs Q4 2.8107, 16 chunks | PPL safe, not speed default |
| 31B NVFP4 fuller PPL | NVFP4 1.1847 vs Q4 1.1327, 16 chunks | PPL safe, not speed default |
| 26B external NVFP4 smoke | Q4 15.4698 vs external NVFP4 14.8021 | download not corrupt; quality smoke passed |
| 26B MXFP4 fuller PPL | MXFP4 44.5990 vs Q4 67.4635 | dequant/quality smoke passed |
| 26B MXFP4 TPS smoke | MXFP4 K3/V4 MTP b3 62.67 t/s, accept 65.37%; Q4 K3/V4 b3 68.17 t/s, accept 73.28% | MXFP4 slower than Q4 default |

Qwen3.6 Target-only / DFlash First Data

| Case | pp / tg / speed | accept | n_accept / n_drafted | Note |
| --- | --- | --- | --- | --- |
| NVFP4-Q4 target-only | pp128 610.03 / tg64 11.88 | - | - | ASR-active llama-bench |
| Q4_K_M MTP-preserved target-only | pp128 651.12 / tg64 10.80 | - | - | ASR-active llama-bench |
| NVFP4-Q4 + DFlash default p-min | 65 tokens in 3.263s = 19.919 t/s | 28.125% | 45 / 160 | rough 1.68x vs target tg64 |
| NVFP4-Q4 + DFlash p-min 0.0 | 65 tokens in 3.052s = 21.299 t/s | 30.263% | 46 / 152 | rough 1.79x vs target tg64 |

Qwen3.6 DFlash Acceptance A/B

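| Case | Target | Draft | decoded t/s | accept | n_accept / n_drafted |
| --- | --- | --- | --- | --- | --- |
| nv_d8_p0 | NVFP4-Q4_K_M | 8 | 9.355 | 19.444% | 42 / 216 |
| ud_d8_p0 | UD-Q4_K_XL | 8 | 23.525 | 41.250% | 99 / 240 |
| ud_d4_p0 | UD-Q4_K_XL | 4 | 14.939 | 39.815% | 43 / 108 |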

Qwen3.6 DFlash Follow-up Sweep

| Case | decoded t/s | accept | n_accept / n_drafted | Note |
| --- | --- | --- | --- | --- |
| UD raw prompt d4 | 25.082 | 65.278% | 94 / 144 | highest accept |
| UD raw prompt d8 | 26.101 | 43.534% | 101 / 232 | good short-output TPS |
| UD raw prompt d12 | 28.930 | 31.790% | 103 / 324 | highest short-output TPS |
| UD raw prompt d16 | 22.245 | 25.432% | 103 / 405 | too much draft overhead |
| UD no-think d8 | 11.360 | 20.192% | 21 / 104 | prompt format likely hurts |
| UD no-think d12 | 12.841 | 13.462% | 21 / 156 | prompt format likely hurts |
| UD d8 n256 | 16.955 | 28.323% | 179 / 632 | longer output degraded |
| UD d12 n256 | 16.522 | 19.846% | 181 / 912 | longer output degraded |
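The accept column in the sweep above is the ratio of accepted to drafted tokens. The sketch below recomputes it for the four raw-prompt rows and shows that the depth with the highest acceptance is not the depth with the highest decoded t/s. Variable and function names are illustrative.

```python
# (case, n_accept, n_drafted, decoded t/s) copied from the raw-prompt sweep rows.
SWEEP = [
    ("UD raw prompt d4", 94, 144, 25.082),
    ("UD raw prompt d8", 101, 232, 26.101),
    ("UD raw prompt d12", 103, 324, 28.930),
    ("UD raw prompt d16", 103, 405, 22.245),
]


def acceptance_percent(n_accept: int, n_drafted: int) -> float:
    """Share of drafted tokens that the target model accepted, in percent."""
    return 100.0 * n_accept / n_drafted


best_accept = max(SWEEP, key=lambda row: acceptance_percent(row[1], row[2]))
best_tps = max(SWEEP, key=lambda row: row[3])

for case, n_accept, n_drafted, tps in SWEEP:
    print(f"{case:18s} accept {acceptance_percent(n_accept, n_drafted):6.3f}%  {tps:.3f} t/s")
print("highest accept:", best_accept[0])
print("highest t/s:   ", best_tps[0])
```

Accepted-token counts flatten out between d8 and d16 (101 to 103) while drafted tokens keep growing, which is consistent with the "too much draft overhead" note on the d16 row.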

Qwen3.6 Native MTP Sweep

| Case | tokens | t/s | accept | n_accept / n_generated |
| --- | --- | --- | --- | --- |
| baseline server | 128 | 11.66 | - | - |
| MTP d3 server | 128 | 19.25 | 80.2% | 89 / 111 |
| baseline server | 256 | 11.63 | - | - |
| MTP d2 server | 256 | 19.09 | 92.7% | 165 / 178 |
| MTP d3 server | 256 | 18.16 | 76.3% | 177 / 232 |
| MTP d4 server | 256 | 21.13 | 77.4% | 192 / 248 |
| MTP d3 + ngram-mod | 256 | 17.77 | 76.3% | ngram generated 0 tokens |
| baseline server avg x3 | 512 | 11.54 | - | - |
| MTP d4 server avg x3 | 512 | 19.30 | 72.9-75.4% | ~380 / ~510 |

Risks and Next Steps

PPL will not be rerun from here on unless the artifact or kernel changes.

Immediate next step

Run a small TPS / accept / output-coherence matrix for the candidates: 31B pure K3/V3 b4, E4B pure/K3V4 b3, 26B K3/V4 b3.

Clean repeat

Once ASR is idle or paused with approval, rerun the clean TPS gate and update absolute TPS.

FP4

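Gemma4 NVFP4/MXFP4 have already passed quality smoke tests but are not the default; any further work should target Blackwell-native FP4 / converter alignment.

Deployment strategy

Keep and back up the legacy fallback first, then bring up the RC service; do not replace production directly.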