PPL and 26B coherence are done; 31B/E4B/26B candidate coherence, clean TPS, fallback, and RC deploy remain.
DGX GB10 / llama.cpp / 2026-05-13
Gemma4 + Qwen3.6 MTP Inference Optimization Report
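The goal is to keep PPL low for the Gemma4 family on llama.cpp while raising decode TPS through the MTP / ngram-mod / TBQ KV / FP4 candidate routes. Qwen3.6 27B MTP serves as the native-MTP control group.
PPL `30.26`, ASR-active long-output median `25.10 t/s`, KV about `359 MiB`.
PPL `25.71`, balanced default; pure K3/V3 can serve as the memory lane.
Unsloth Q4_K_M + `mtp-clean` server: 512-token avg `19.30 t/s` vs baseline `11.54`.
Most new TPS figures are still affected by ASR / resident services; the candidate ranking is trustworthy, but clean absolute TPS awaits a rerun on an idle machine.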
Master Goal Checkpoint
After each sub-goal is completed, re-check it against the master goal: unless every criterion is met, the work cannot be declared done.
| Area | Status | Next action |
|---|---|---|
| PPL | Done for promoted Q4/TBQ candidates | Do not rerun unless artifact/kernel changes. |
| Output quality | Partial | Run promoted-candidate UTF-8 gates for 31B, 26B, and E4B. |
| TPS / accept | Blocked for clean final | Repeat clean gate only when ASR is idle or explicitly paused. |
| KV final table | Partial | Finish candidate-level 8K/128K FP16/INT8/TBQ comparison. |
| Deploy safety | Not done | Back up/restore `:8080` fallback before RC service promotion. |
New runner ready: `ops/dgx/run_gemma4_promoted_coherence_gate.py`. Current dry-run correctly refuses to start candidates because ASR is ~428% CPU and resident Gemma `:8092` is active.
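The refusal behaviour described above can be pictured as a small pre-flight check. The following is only an illustrative sketch, not the logic of `run_gemma4_promoted_coherence_gate.py`: the `HostState` type, the `clean_gate_blockers` function, and the 50% CPU limit are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class HostState:
    """Snapshot of host load; how it is collected is out of scope here."""
    asr_cpu_percent: float                      # summed CPU% of the ASR process
    active_resident_ports: list = field(default_factory=list)  # resident servers still up


def clean_gate_blockers(state, max_asr_cpu_percent=50.0):
    """Return reasons why a clean run must not start; an empty list means OK.

    The 50% default is an assumed threshold, not a value from the report.
    """
    blockers = []
    if state.asr_cpu_percent > max_asr_cpu_percent:
        blockers.append(
            f"ASR at {state.asr_cpu_percent:.0f}% CPU "
            f"(limit {max_asr_cpu_percent:.0f}%)"
        )
    for port in state.active_resident_ports:
        blockers.append(f"resident server active on :{port}")
    return blockers


# The situation reported by the dry run: ASR busy, resident Gemma on :8092.
for reason in clean_gate_blockers(HostState(428.0, [8092])):
    print("refuse:", reason)
```

Returning a list of reasons rather than a bare boolean keeps the dry-run output explainable, which matters when the gate is rerun after ASR is paused.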
One-Page Decision
Ranking for production / canary / the next test round.
31B Q4_K_M + pure K3/V3 + MTP/ngram b4
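Currently the strongest large-model route. It is PPL-safe, has a smaller KV than K3/V4, and leads on long-output TPS. Suited to non-real-time high-quality inference and long-context tasks.
E4B Q4_K_M + pure K3/V3 or K3/V4 + MTP/ngram b3
pure K3/V3 has the best PPL and a small KV; K3/V4 has a slight speed edge. Both can proceed to the coherence / clean TPS gate.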
26B-A4B Q4_K_M + K3/V4 + MTP/ngram b3
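Balanced default. pure K3/V3 has higher PPL but is not broken, so it can be kept as the 128K / memory-first lane.
E2B: do not prioritize MTP / pure K3V3
E2B pure K3/V3 degrades PPL, and the earlier MTP TPS gain was not evident; unless a special small-model prompt class comes up, it is not a main line.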
Gemma4 PPL Gate
Canonical chat-template strided PPL, `-c 256 -b 256 -ub 256 --ppl-stride 64 --chunks 16`. Values are the cumulative PPL at chunk 16.
| Model | q8/q8 | K3/V4 | pure K3/V3 | Readout |
|---|---|---|---|---|
| 31B | 31.5688 | 31.8968 | 30.2595 | pure K3/V3 is PPL-safe and slightly better. |
| 26B-A4B | 29.0967 | 25.7098 | 38.2839 | K3/V4 is balanced; pure K3/V3 can serve as the memory lane. |
| E4B | 26.2416 | 25.8709 | 24.6580 | pure K3/V3 is PPL-safe and the best. |
| E2B | 90.3992 | 92.6161 | 105.1485 | pure K3/V3 hurts quality; not prioritized. |
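To make the readout column easier to compare across models, the PPL values in the table can be expressed as a relative change against the q8/q8 column. This is a minimal sketch using only the numbers above; the `ppl_delta_pct` helper and treating q8/q8 as the reference are choices made for this example.

```python
# Chunk-16 cumulative PPL values copied from the table above.
PPL = {
    "31B":     {"q8/q8": 31.5688, "K3/V4": 31.8968, "pure K3/V3": 30.2595},
    "26B-A4B": {"q8/q8": 29.0967, "K3/V4": 25.7098, "pure K3/V3": 38.2839},
    "E4B":     {"q8/q8": 26.2416, "K3/V4": 25.8709, "pure K3/V3": 24.6580},
    "E2B":     {"q8/q8": 90.3992, "K3/V4": 92.6161, "pure K3/V3": 105.1485},
}


def ppl_delta_pct(candidate, reference):
    """Relative PPL change in percent; negative means lower (better) PPL."""
    return (candidate - reference) / reference * 100.0


for model, row in PPL.items():
    ref = row["q8/q8"]
    deltas = {
        kv: ppl_delta_pct(value, ref)
        for kv, value in row.items()
        if kv != "q8/q8"
    }
    summary = ", ".join(f"{kv} {d:+.2f}%" for kv, d in deltas.items())
    print(f"{model}: {summary}")
```

Seen this way, pure K3/V3 lands below the q8/q8 reference for 31B and E4B and well above it for 26B-A4B and E2B, which matches the readouts in the table.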
Gemma4 UTF-8 / Coherence Gate
2026-05-13 23:55, run against the existing `:8092` Gemma4 26B 128K canary. This is a small ASR-active coherence / accept smoke test, not clean TPS.
| Prompt | Pass | t/s | Accept | Readout |
|---|---|---|---|---|
| zh_marker | True | 45.62 | 77.78% | Traditional Chinese marker, no mojibake. |
| embedded_facts | True | 176.11 | 100.00% | Correctly restates Gemma4 + llama.cpp + MTP/ngram + TBQ3. |
| repeat_line | True | 190.73 | 76.56% | Six exact Traditional Chinese lines. |
Conclusion: the 19:40 canary failure was a request-mojibake / self-introspection gate problem, not an incoherent Gemma4 26B MTP+ngram+TBQ canary. Future promotion gates should use an embedded-fact prompt plus server args/timings rather than asking the model to self-identify its runtime flags.
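An embedded-fact gate of the kind recommended above can be checked mechanically: the prompt carries the facts, and the response passes only if it restates them and contains no decoding damage. The sketch below is an assumption-laden illustration, not the promoted runner: `coherence_gate` is a hypothetical helper, substring matching is a deliberately crude fact check, and treating U+FFFD as the mojibake signal is a heuristic that will not catch every encoding fault.

```python
REPLACEMENT_CHAR = "\ufffd"  # emitted when bytes fail to decode as UTF-8


def coherence_gate(response, required_facts):
    """Check a response against facts that were embedded in the prompt.

    Passes only when every fact string appears (case-insensitive) and the
    text contains no Unicode replacement characters.
    """
    lowered = response.lower()
    missing = [fact for fact in required_facts if fact.lower() not in lowered]
    has_mojibake = REPLACEMENT_CHAR in response
    return {
        "pass": not missing and not has_mojibake,
        "missing": missing,
        "mojibake": has_mojibake,
    }


# Facts of the kind the embedded_facts prompt row restates.
facts = ["Gemma4", "llama.cpp", "MTP", "TBQ3"]
good = "This canary runs Gemma4 on llama.cpp with MTP/ngram and TBQ3 KV."
bad = "This canary runs Gemma4 on llama.cpp with \ufffd\ufffd KV."

print(coherence_gate(good, facts))
print(coherence_gate(bad, facts))
```

Because the facts come from the prompt rather than from the model's view of its own runtime, a failure here points at the output path and not at self-introspection.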
Gemma4 TPS / Accept Matrix
ASR-active core gate. The full rows are expanded below; the candidate ranking is trustworthy, while absolute TPS awaits a clean repeat.
| Model | Case | p10 t/s | median t/s | accept med | KV MiB |
|---|---|---|---|---|---|
| 31B | pure K3/V3 base | 9.36 | 9.42 | - | 359.38 |
| 31B | pure K3/V3 MTP b4 | 18.40 | 25.10 | 65.97% | 359.38 |
| 31B | K3/V4 MTP b4 | 18.56 | 23.76 | 66.89% | 424.06 |
| 26B-A4B | pure K3/V3 base | 49.31 | 49.68 | - | 89.84 |
| 26B-A4B | pure K3/V3 MTP b3 | 52.84 | 61.39 | 68.03% | 89.84 |
| 26B-A4B | K3/V4 MTP b3 | 51.91 | 61.62 | 71.04% | 106.02 |
| E4B | pure K3/V3 base | 45.74 | 46.18 | - | 32.81 |
| E4B | pure K3/V3 MTP b3 | 57.37 | 61.91 | 51.39% | 32.81 |
| E4B | K3/V4 MTP b3 | 56.82 | 63.12 | 51.46% | 38.72 |
Note: the earlier family pure K3/V3 gate includes E2B rows: pure base `79.00 t/s`, pure MTP `67.80 t/s`, K3/V4 MTP `83.86 t/s`; however, neither E2B's PPL nor its MTP gain supports listing it as a main candidate.
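The matrix lists absolute medians but not the MTP gain itself. As a small worked example, the speedup of each pure K3/V3 MTP row over its pure K3/V3 base row can be derived from the median column. The numbers are copied from the matrix; the dictionary layout and the `mtp_speedup` helper are illustrative, and the results inherit the ASR-active caveat.

```python
# Median t/s from the ASR-active matrix (pure K3/V3 base vs pure K3/V3 MTP).
MEDIAN_TPS = {
    "31B":     {"base": 9.42,  "mtp": 25.10},
    "26B-A4B": {"base": 49.68, "mtp": 61.39},
    "E4B":     {"base": 46.18, "mtp": 61.91},
}


def mtp_speedup(row):
    """Ratio of MTP median decode speed to the non-speculative base."""
    return row["mtp"] / row["base"]


for model, row in MEDIAN_TPS.items():
    print(f"{model}: {mtp_speedup(row):.2f}x")
```

Ratios are less sensitive to background load than absolute t/s, but they are still taken from noisy runs and should be recomputed after the clean repeat.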
Qwen3.6 Native MTP
Control group: Unsloth Qwen3.6 27B MTP GGUF + `am17an/llama.cpp:mtp-clean`.
- Model: `Qwen3.6-27B-Q4_K_M.gguf`
- Runtime: `llama-server --spec-type draft-mtp`
- 512 baseline: `11.54 t/s` avg
- 512 MTP d4: `19.30 t/s` avg
- Speedup: `1.67x`
- Acceptance: `72.9-75.4%`
Qwen3.6 Side Lanes
Provides a reference for Gemma4 MTP behavior; it does not replace the Gemma main line.
| Lane | Status | Data / Observations | Decision |
|---|---|---|---|
| Native MTP | Usable | 512 tokens: `11.54 -> 19.30 t/s`, about `1.67x`. | Use as the llama.cpp single-file MTP reference. |
| DFlash draft | Runs, but match-sensitive | NVFP4 + DFlash first smoke `19.9-21.3 t/s`, accept `28-30%`; UD-Q4 target accept can reach `41%+`. | Not the cleanest control; needs a target/drafter artifact match. |
| NVFP4 GGUF | Not MTP-preserved | `Qwen3.6-27B-NVFP4-Q4_K_M.gguf` has no `nextn`; target-only tg64 `11.88`. | Usable as an FP4 target smoke, not a native MTP artifact. |
| Old feat/mtp-runtime | Not applicable | Unsloth clean GGUF still reports `expected 866, got 857`. | Do not use this branch to test Qwen3.6 MTP. |
Full Test Data Appendix
Below are all known test rows collected into this report so far. ASR-active / noisy data keeps its caveat and is not used as clean absolute TPS.
Gemma4 KV Cache Size: FP16 / INT8 / TBQ3
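Based on the KV self size from the existing bench, assuming that table was measured at 8K context; 128K is a linear x16 conversion. FP16/INT8 are estimates back-computed by bit-width from the measured pure K3/V3 values, while K3/V4 is measured. Savings use FP16 KV as the 100% baseline.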
| Model | FP16 KV 8K | INT8 KV 8K | K3/V4 KV 8K | pure K3/V3 KV 8K | FP16 KV 128K | INT8 KV 128K | K3/V4 KV 128K | pure K3/V3 KV 128K | TBQ3 saving vs FP16 |
|---|---|---|---|---|---|---|---|---|---|
| E2B | 62.51 MiB | 31.25 MiB | 13.83 MiB | 11.72 MiB | 0.98 GiB | 0.49 GiB | 0.22 GiB | 0.18 GiB | K3/V3 saves 81.3%; K3/V4 saves 77.9% |
| E4B | 174.99 MiB | 87.49 MiB | 38.72 MiB | 32.81 MiB | 2.73 GiB | 1.37 GiB | 0.61 GiB | 0.51 GiB | K3/V3 saves 81.3%; K3/V4 saves 77.9% |
| 26B-A4B | 479.15 MiB | 239.57 MiB | 106.02 MiB | 89.84 MiB | 7.49 GiB | 3.74 GiB | 1.66 GiB | 1.40 GiB | K3/V3 saves 81.3%; K3/V4 saves 77.9% |
| 31B | 1,916.69 MiB | 958.35 MiB | 424.06 MiB | 359.38 MiB | 29.95 GiB | 14.97 GiB | 6.63 GiB | 5.62 GiB | K3/V3 saves 81.3%; K3/V4 saves 77.9% |
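The back-computation behind the FP16 and INT8 columns can be reproduced in a few lines. This sketch assumes, as the table values imply, that pure K3/V3 is treated as exactly 3 bits per element (block-scale overhead ignored) and that KV size scales linearly with context length; the function names are illustrative.

```python
# Measured pure K3/V3 KV self size in MiB, assumed to be at 8K context.
PURE_K3V3_MIB = {"E2B": 11.72, "E4B": 32.81, "26B-A4B": 89.84, "31B": 359.38}


def estimate_kv_mib(measured_mib, measured_bits, target_bits):
    """Scale a measured KV size to another element width by bit ratio."""
    return measured_mib * target_bits / measured_bits


def scale_context(kv_mib, from_ctx=8192, to_ctx=131072):
    """Linear context scaling; 8K -> 128K is a factor of 16."""
    return kv_mib * to_ctx / from_ctx


for model, k3v3 in PURE_K3V3_MIB.items():
    fp16 = estimate_kv_mib(k3v3, measured_bits=3, target_bits=16)
    int8 = estimate_kv_mib(k3v3, measured_bits=3, target_bits=8)
    fp16_128k_gib = scale_context(fp16) / 1024
    saving = 1 - k3v3 / fp16  # fraction saved relative to FP16
    print(
        f"{model}: FP16 {fp16:.2f} MiB, INT8 {int8:.2f} MiB, "
        f"FP16@128K {fp16_128k_gib:.2f} GiB, K3/V3 saves {saving:.1%}"
    )
```

Under these assumptions the pure K3/V3 saving is fixed at 1 - 3/16 for every model, which is why the last column repeats the same percentage; the K3/V4 saving differs because that column is measured rather than derived.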
Gemma4 Family Pure K3/V3 Gate
| Model | pure K3/V3 base | pure K3/V3 MTP | pure speedup | K3/V4 MTP | pure KV | K3/V4 KV | policy |
|---|---|---|---|---|---|---|---|
| E2B | 79.00 | 67.80 | 0.86x | 83.86 | 11.72 MiB | 13.83 MiB | avoid MTP/pure K3V3 for speed |
| E4B | 45.58 | 55.50 | 1.22x | 55.40 | 32.81 MiB | 38.72 MiB | pure K3/V3 b3 candidate |
| 26B-A4B | 48.58 | 49.02 | 1.01x | 45.16 | 89.84 MiB | 106.02 MiB | pure K3/V3 b3 candidate, rerun PPL |
| 31B | 9.60 | 21.26 | 2.21x | 15.41 | 359.38 MiB | 424.06 MiB | strongest pure K3/V3 b4 candidate |
Gemma4 Long-token / No-CUDA-Graphs Gate
| Model | Recommended long-token mode | Speed | Accept | KV | Note |
|---|---|---|---|---|---|
| 31B | pure K3/V3 MTP b4 + no graphs | 23.79 t/s | 65.97% | 359.38 MiB | same speed as K3/V4, lower KV |
| 26B-A4B | pure K3/V3 MTP b3 + no graphs | 62.14 t/s | 68.03% | 89.84 MiB | same/slightly better speed than K3/V4, lower KV |
| E4B | K3/V4 MTP b3 for speed, pure K3/V3 b3 for memory | 63.82 K3/V4 / 61.89 pure | ~51% | 38.72 / 32.81 MiB | small trade-off |
| E2B | no MTP priority | - | - | - | previous gate showed MTP not useful |
Gemma4 26B Artifact Baseline Probe
| Artifact | tg512 graph-on | tg512 no-graphs | Readout |
|---|---|---|---|
| current symlink / Bartowski Q4 | 42.96 | 43.84 | current MTP target remains better in chat MTP |
| official-dir Q4 | 55.68 | 47.17 | target-only faster, but did not transfer to chat MTP |
| uncensored Q4 | 53.85 | 47.94 | target-only strong |
| MXFP4_MOE | 46.10 | 45.74 | quality smoke passed, speed not default |
| IQ3_XXS | 54.79 | 47.83 | target-only strong, not current MTP default |
| NVFP4 ggufbench | 31.43 | 30.06 | quality fixed but slow |
| NVFP4 catlilface | 29.38 | 27.09 | external NVFP4 smoke passed, not fast |
| Heretic Q4 | fail | fail | failed to load |
Gemma4 26B Official Target Chat MTP Compare
| Target | pure K3/V3 base | pure K3/V3 MTP | speedup | accept | K3/V4 MTP |
|---|---|---|---|---|---|
| current / Bartowski | 43.67 | 47.34 | 1.08x | 68.03% | 42.29 |
| official-dir | 46.53 | 39.72 | 0.85x | 67.97% | 41.18 |
Gemma4 31B Artifact Baseline Probe
| Artifact | tg512 graph-on | tg512 no-graphs | Readout |
|---|---|---|---|
| official Q4 | 8.04 | 7.73 | Q4 ceiling confirmed |
| base-path Q4 | 8.24 | 7.69 | same class as official |
| abliterated Q4 | 8.82 | 8.61 | ~7-10% faster due to smaller file |
| local NVFP4 | 6.30 | 6.12 | quality fixed but slower |
| DFlash v1/v2/v3 | fail | fail | draft/special files, not target replacements here |
Gemma4 FP4 / Quality Gates
| Case | Result | Decision |
|---|---|---|
| 26B NVFP4 fuller PPL | NVFP4 2.8483 vs Q4 2.8107, 16 chunks | PPL safe, not speed default |
| 31B NVFP4 fuller PPL | NVFP4 1.1847 vs Q4 1.1327, 16 chunks | PPL safe, not speed default |
| 26B external NVFP4 smoke | Q4 15.4698 vs external NVFP4 14.8021 | download not corrupt; quality smoke passed |
| 26B MXFP4 fuller PPL | MXFP4 44.5990 vs Q4 67.4635 | dequant/quality smoke passed |
| 26B MXFP4 TPS smoke | MXFP4 K3/V4 MTP b3 62.67 t/s, accept 65.37%; Q4 K3/V4 b3 68.17 t/s, accept 73.28% | MXFP4 slower than Q4 default |
Qwen3.6 Target-only / DFlash First Data
| Case | pp / tg / speed | accept | n_accept / n_drafted | Note |
|---|---|---|---|---|
| NVFP4-Q4 target-only | pp128 610.03 / tg64 11.88 | - | - | ASR-active llama-bench |
| Q4_K_M MTP-preserved target-only | pp128 651.12 / tg64 10.80 | - | - | ASR-active llama-bench |
| NVFP4-Q4 + DFlash default p-min | 65 tokens in 3.263s = 19.919 t/s | 28.125% | 45 / 160 | rough 1.68x vs target tg64 |
| NVFP4-Q4 + DFlash p-min 0.0 | 65 tokens in 3.052s = 21.299 t/s | 30.263% | 46 / 152 | rough 1.79x vs target tg64 |
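The derived columns in this table follow from three raw quantities per run: tokens decoded, wall time, and the accepted/drafted counts. A minimal sketch of that arithmetic, using the two DFlash rows and the NVFP4-Q4 target-only tg64 figure as the reference; the helper names and tuple layout are illustrative.

```python
TARGET_TG64_TPS = 11.88  # NVFP4-Q4 target-only tg64 from the first row


def decode_tps(tokens, seconds):
    """Decode throughput in tokens per second."""
    return tokens / seconds


def accept_rate(n_accept, n_drafted):
    """Fraction of drafted tokens that the target model accepted."""
    return n_accept / n_drafted


# (tokens, seconds, n_accept, n_drafted) copied from the table rows.
RUNS = {
    "default p-min": (65, 3.263, 45, 160),
    "p-min 0.0":     (65, 3.052, 46, 152),
}

for name, (tokens, seconds, n_accept, n_drafted) in RUNS.items():
    tps = decode_tps(tokens, seconds)
    print(
        f"{name}: {tps:.2f} t/s, "
        f"accept {accept_rate(n_accept, n_drafted):.3%}, "
        f"rough {tps / TARGET_TG64_TPS:.2f}x vs target tg64"
    )
```

The speedup here compares a 65-token server decode against a 64-token llama-bench run, so it is a rough ratio rather than a like-for-like measurement, as the table notes.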
Qwen3.6 DFlash Acceptance A/B
| Case | Target | Draft | decoded t/s | accept | n_accept / n_drafted |
|---|---|---|---|---|---|
| nv_d8_p0 | NVFP4-Q4_K_M | 8 | 9.355 | 19.444% | 42 / 216 |
| ud_d8_p0 | UD-Q4_K_XL | 8 | 23.525 | 41.250% | 99 / 240 |
| ud_d4_p0 | UD-Q4_K_XL | 4 | 14.939 | 39.815% | 43 / 108 |
Qwen3.6 DFlash Follow-up Sweep
| Case | decoded t/s | accept | n_accept / n_drafted | Note |
|---|---|---|---|---|
| UD raw prompt d4 | 25.082 | 65.278% | 94 / 144 | highest accept |
| UD raw prompt d8 | 26.101 | 43.534% | 101 / 232 | good short-output TPS |
| UD raw prompt d12 | 28.930 | 31.790% | 103 / 324 | highest short-output TPS |
| UD raw prompt d16 | 22.245 | 25.432% | 103 / 405 | too much draft overhead |
| UD no-think d8 | 11.360 | 20.192% | 21 / 104 | prompt format likely hurts |
| UD no-think d12 | 12.841 | 13.462% | 21 / 156 | prompt format likely hurts |
| UD d8 n256 | 16.955 | 28.323% | 179 / 632 | longer output degraded |
| UD d12 n256 | 16.522 | 19.846% | 181 / 912 | longer output degraded |
Qwen3.6 Native MTP Sweep
| Case | tokens | t/s | accept | n_accept / n_generated |
|---|---|---|---|---|
| baseline server | 128 | 11.66 | - | - |
| MTP d3 server | 128 | 19.25 | 80.2% | 89 / 111 |
| baseline server | 256 | 11.63 | - | - |
| MTP d2 server | 256 | 19.09 | 92.7% | 165 / 178 |
| MTP d3 server | 256 | 18.16 | 76.3% | 177 / 232 |
| MTP d4 server | 256 | 21.13 | 77.4% | 192 / 248 |
| MTP d3 + ngram-mod | 256 | 17.77 | 76.3% | ngram generated 0 tokens |
| baseline server avg x3 | 512 | 11.54 | - | - |
| MTP d4 server avg x3 | 512 | 19.30 | 72.9-75.4% | ~380 / ~510 |
Risks and Next Steps
PPL will not be rerun from here on unless the artifact or kernel changes.
Immediate next step
Run a small TPS / accept / output-coherence matrix for the candidates: 31B pure K3/V3 b4, E4B pure/K3V4 b3, 26B K3/V4 b3.
Clean repeat
Once ASR is idle or paused with approval, rerun the clean TPS gate and update absolute TPS.
FP4
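Gemma4 NVFP4/MXFP4 already have quality smoke results but are not the default for now; any further work here means Blackwell-native FP4 / converter alignment.
Deployment strategy
Keep and back up the legacy fallback first, then bring up the RC service; do not replace production directly.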