DGX GB10 / llama.cpp / updated 2026-05-14 09:41 +08

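Gemma4 + Qwen3.6 MTP Inference Optimization Report

The current mainline goal is for the Gemma4 family on llama.cpp to reach both low PPL and higher real decode TPS. The priority stack is Q4_K_M, Gemma4 assistant/MTP, ngram-mod, TBQ KV, and long-context stability; FP4 gets promoted only after it passes PPL, coherence, and speed smoke.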

Master goal: NOT DONE

Evidence already exists for PPL, KV calculations, long no-graphs stability, and 26B canary coherence.

31B main candidate: pure K3/V3 + MTP b4

PPL 30.2595; ASR-active long median 23.79-25.10 t/s; 128K KV 5.62 GiB.

26B lane: K3/V4 or pure K3/V3

K3/V4 has higher accept; pure K3/V3 saves KV, and its live canary coherence passes 3/3.

Qwen3.6 control group: native MTP available

Qwen3.6 is the positive control lane; TBQ3 KV has not finished its gate yet, which does not mean it is unusable.

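TPS Evaluation

TPS was not dropped; the previous Traditional Chinese report buried it in the candidate table, and this version gives it a separate section. Note: the clean TPS final is still affected by ASR / resident services; the table below holds only the ranking evidence already accepted by master/candidate readiness.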

Model / case	Decode TPS	Accept	KV 8K	Assessment
31B pure K3/V3 + MTP b4, no-graphs	23.79-25.10 t/s	~65.97%	359.38 MiB	Current 31B main candidate; uses less KV than K3/V4.
31B K3/V4 + MTP b4, no-graphs	23.76-23.77 t/s	~66.89%	424.06 MiB	Speed close to pure, but larger KV.
26B pure K3/V3 + MTP b3, no-graphs	61.39-62.14 t/s	~68.03%	89.84 MiB	Saves KV; live canary coherence has passed.
26B K3/V4 + MTP b3, no-graphs	61.62-61.72 t/s	~71.04%	106.02 MiB	Higher accept; the balanced lane.
E4B pure K3/V3 + MTP b3, no-graphs	61.89-61.91 t/s	~51.39%	32.81 MiB	Saves KV; PPL is also safe.
E4B K3/V4 + MTP b3, no-graphs	63.12-63.82 t/s	~51.46%	38.72 MiB	About 3% faster, but larger KV.
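
The speed-versus-KV trade-off in the E4B rows can be checked with a few lines of arithmetic. The sketch below is illustrative only: the numbers are copied from the table above, and comparing low end with low end and high end with high end is an assumption, since the report does not say the runs were paired.

```python
def relative_gain(candidate: float, baseline: float) -> float:
    """Return how much larger candidate is than baseline, in percent."""
    return (candidate / baseline - 1.0) * 100.0


# (low, high) decode TPS and KV at 8K context in MiB, copied from the table.
e4b_pure = {"tps": (61.89, 61.91), "kv_mib": 32.81}
e4b_k3v4 = {"tps": (63.12, 63.82), "kv_mib": 38.72}

# Assumption: compare low-to-low and high-to-high ends of the reported ranges.
speedup_low = relative_gain(e4b_k3v4["tps"][0], e4b_pure["tps"][0])
speedup_high = relative_gain(e4b_k3v4["tps"][1], e4b_pure["tps"][1])
kv_cost = relative_gain(e4b_k3v4["kv_mib"], e4b_pure["kv_mib"])

print(f"K3/V4 decode gain over pure K3/V3: {speedup_low:.1f}% to {speedup_high:.1f}%")
print(f"K3/V4 extra KV over pure K3/V3:    {kv_cost:.1f}%")
```

On these figures K3/V4 buys roughly 2% to 3% more decode TPS for roughly 18% more KV, which is why both E4B variants stay on the candidate list.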

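Withdrawn A/B table	Reason	Next step
Previous version's "latest MTP A/B probe"	Those seed-prime / GPU_ARGMAX numbers came from a mid-course diagnostic branch, not the current promoted candidate TPS accepted by master/candidate readiness.	Withdrawn from the official TPS table; may return only after a retest on the current binary, the current service config, and the same prompt set.
F16 KV diagnostic rows	F16 KV serves only as a control, not the production target Wayne specified.	Kept in the work log; excluded from the production TPS summary.
TBQ mode14/mode15 seed-prime rows	Possibly valuable, but they have not yet passed the promoted coherence + clean repeat gate.	At the next safe-headroom window, run one promoted candidate gate at a time, then update this table.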

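Reference lane	TPS	Purpose	Final target?
31B ngram-mod only reference	28.6 t/s	Upper-bound reference for ngram-heavy prompts, with low overhead.	Not the final target; the user goal remains MTP + ngram + TBQ.
26B live :8092 tiny UTF-8 canary smoke	median 77.20 t/s	Shows only that live canary output quality and draft timing exist.	Not clean TPS, because ASR/resident load is still present.

The clean TPS final is still incomplete: it requires ASR to be idle/paused, or an explicit decision to accept a noisy run, and multiple DGX tests must not run at once. The official TPS table currently uses only data accepted by the master checklist and the candidate readiness table.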

Readiness Audit

A read-only audit confirmed that the promoted artifacts and helper scripts are all present; the current blocker is runtime/load state, not missing files.

Item	Current reading	Assessment
Model artifacts	31B target 19G + assistant 338M; 26B target 16G + assistant 311M; E4B target 5.1G + assistant 76M.	Promoted artifacts are complete.
26B target path	Symlink points to /mnt/storage/models/bartowski-gemma4-26B/google_gemma-4-26B-A4B-it-Q4_K_M.gguf.	OK; the audit script now follows symlinks.
Clear ports	:8080, :8087, :8088, :8089, :8093.	Fallback and some candidate ports are available.
Blocked ports/load	:8090 python PID 2317; :8092 Gemma4 canary PID 3161112; ASR PID 2378446 at about 344% CPU.	No clean TPS or 31B RC smoke until these are cleared or approved.

Local summary: ops/dgx/gemma4-readiness-audit-20260514.md. Remote helper: /tmp/gemma4_readiness_audit.sh.

Master Goal Checkpoint

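After each sub-goal is completed, re-check the master goal. Unless every gate has passed or Wayne has explicitly waived it, completion cannot be claimed.

Area	Status	Next step
PPL	Done for promoted Q4/TBQ candidates	Do not rerun unless the artifact, kernel, or template changes.
Output quality	Partial	Run UTF-8 gates on the 31B, 26B, and E4B promoted candidates.
TPS / accept	Clean final blocked	Rerun the clean gate only after ASR is idle or explicitly paused.
KV memory	Current candidates tabulated	See the candidate readiness table; update only when settings change.
Deploy safety	Partial	Backup done, dry-run ready; fallback restore needs explicit approval.
RC service	Partial	Service files verified, but not yet started/smoked.

Runtime state: the lock file is an empty stale file; ASR is still hot; the :8092 26B canary is active; :8090 is held by python. No new benchmark was started for this report update.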

Guard Fixes

The M2.7/Locke static review caught two non-invasive safety issues; both have been fixed and verified on the DGX.

Artifact	Fix	Verification
run_gemma4_promoted_coherence_gate.py	A candidate pass now requires all of: text pass, HTTP 200, draft timing present, CUDA error 0, GGML_ABORT 0.	Remote python3 -m py_compile rc=0; dry-run rc=75 because the ASR/resident Gemma blocker is still present.
restore_gemma4_legacy_8080_dryrun.sh	The default source now prefers the backed-up legacy unit, which guards against later drift of the buun source path. SRC= can still be passed explicitly to override it.	Remote bash -n rc=0; dry-run rc=0, source resolved to the backup unit.
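
The tightened pass rule for the coherence gate amounts to a conjunction of five conditions. The following is a minimal sketch of that rule, not the contents of run_gemma4_promoted_coherence_gate.py; the `GateResult` record and its field names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class GateResult:
    """Hypothetical summary of one candidate run (field names are assumptions)."""
    text_pass: bool         # output text matched the expected content
    http_status: int        # status code returned by the server
    has_draft_timing: bool  # response carried draft/speculative timing data
    cuda_errors: int        # count of CUDA error lines seen in the log
    ggml_aborts: int        # count of GGML_ABORT lines seen in the log


def candidate_passes(result: GateResult) -> bool:
    """A candidate passes only when every one of the five conditions holds."""
    return (
        result.text_pass
        and result.http_status == 200
        and result.has_draft_timing
        and result.cuda_errors == 0
        and result.ggml_aborts == 0
    )


clean = GateResult(True, 200, True, 0, 0)
no_draft = GateResult(True, 200, False, 0, 0)
print(candidate_passes(clean), candidate_passes(no_draft))
```

Requiring all five together means a run with correct text but no draft timing, or with a logged CUDA error, is reported as a failure instead of a silent pass.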

Candidate Final Readiness

Source of truth: ops/dgx/gemma4-candidate-final-readiness-20260514.md. KV figures are estimates for a single sequence.

Candidate	PPL status	Coherence	ASR-active TPS / accept	KV 128K	Blocker
31B Q4_K_M + pure K3/V3 + MTP/ngram b4	Pass 30.2595	Promoted service not yet run	23.79-25.10 t/s, ~65.97%	5.62 GiB	:8090 occupied; needs coherence + clean TPS
26B-A4B Q4_K_M + K3/V4 + MTP/ngram b3	Pass 25.7098	K3/V4 promoted service not yet run	61.62-61.72 t/s, ~71.04%	1.66 GiB	Needs K3/V4 coherence + clean TPS
26B-A4B Q4_K_M + pure K3/V3 + MTP/ngram b3	Pass but higher 38.2839	Live :8092 canary 3/3 pass, refreshed at 08:41	61.39-62.14 t/s, ~68.03%; the tiny canary median of 77.20 t/s is a noisy smoke	1.40 GiB	Needs clean TPS; decide pure vs K3/V4 final
E4B Q4_K_M + pure K3/V3 + MTP/ngram b3	Pass 24.6580	Promoted service not yet run	61.89-61.91 t/s, ~51.39%	0.51 GiB	Needs coherence + clean TPS
E4B Q4_K_M + K3/V4 + MTP/ngram b3	Pass 25.8709	Promoted service not yet run	63.12-63.82 t/s, ~51.46%	0.61 GiB	Needs coherence + clean TPS

KV Savings Controls

FP16 / INT8 are only savings-math controls, not the current production default.

Model	FP16 128K	INT8 128K	K3/V4 128K	pure K3/V3 128K	Current assessment
E2B	0.98 GiB	0.49 GiB	0.22 GiB	0.18 GiB	Low priority; not worth pushing first on the MTP mainline.
E4B	2.73 GiB	1.37 GiB	0.61 GiB	0.51 GiB	K3/V4 is slightly faster; pure K3/V3 saves memory.
26B-A4B	7.49 GiB	3.74 GiB	1.66 GiB	1.40 GiB	K3/V4 has higher accept; pure canary coherence has passed.
31B	29.95 GiB	14.97 GiB	6.63 GiB	5.62 GiB	pure K3/V3 is the current leading promoted lane.
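
The 128K figures here are consistent with the KV 8K column of the TPS table scaled linearly by context length (128K / 8K = 16). The sketch below reproduces that arithmetic and the fraction of FP16 each mode uses. Linear scaling is an assumption inferred from the reported numbers, not a statement about how llama.cpp allocates the cache, and the function names are illustrative.

```python
def scale_kv_to_gib(kv_mib_at_base: float, target_ctx_k: int = 128,
                    base_ctx_k: int = 8) -> float:
    """Scale a KV size measured at base_ctx_k tokens (in MiB) to target_ctx_k, in GiB.

    Assumes KV size grows linearly with context length.
    """
    return kv_mib_at_base * (target_ctx_k / base_ctx_k) / 1024.0


def fraction_of_fp16(kv_gib: float, fp16_gib: float) -> float:
    """Return the quantized KV size as a fraction of the FP16 control."""
    return kv_gib / fp16_gib


# KV at 8K context in MiB, copied from the TPS table.
kv_8k_mib = {
    "31B pure K3/V3": 359.38,
    "31B K3/V4": 424.06,
    "26B pure K3/V3": 89.84,
    "26B K3/V4": 106.02,
}

for name, mib in kv_8k_mib.items():
    print(f"{name}: {scale_kv_to_gib(mib):.2f} GiB at 128K")

# 31B rows from the KV savings table, as a share of the FP16 control.
print(f"31B pure K3/V3 vs FP16: {fraction_of_fp16(5.62, 29.95):.1%}")
print(f"31B K3/V4 vs FP16:      {fraction_of_fp16(6.63, 29.95):.1%}")
```

For 31B this puts pure K3/V3 at a little under a fifth of the FP16 footprint and K3/V4 at a little over a fifth.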

UTF-8 Coherence Evidence

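The live 26B canary on :8092 passed the UTF-8-safe gate again at 08:41. The earlier mojibake failure has been confirmed as a harness encoding problem, not a conclusion about model quality.

Prompt	Pass	t/s	Accept	Readout
zh_marker	True	77.20	77.78%	Traditional Chinese marker is correct, with no mojibake.
embedded_facts	True	61.08	56.96%	Organizes Gemma4 + llama.cpp + MTP/ngram + TBQ3 according to the facts embedded in the prompt.
repeat_line	True	143.60	90.00%	All six Traditional Chinese lines are output in full.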
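
A harness-side encoding fault can make correct model output look broken. The sketch below shows one common way this happens (UTF-8 bytes decoded with a single-byte codec) and what a UTF-8-safe comparison looks like. The report does not state which codec the earlier harness used, so `latin-1` is purely an assumption, and the function names and marker string are illustrative.

```python
def make_mojibake(text: str, wrong_codec: str = "latin-1") -> str:
    """Simulate a harness that decodes UTF-8 bytes with the wrong codec."""
    return text.encode("utf-8").decode(wrong_codec)


def repair_mojibake(garbled: str, wrong_codec: str = "latin-1") -> str:
    """Undo the simulated fault by reversing the wrong decode step."""
    return garbled.encode(wrong_codec).decode("utf-8")


def contains_marker(raw_body: bytes, marker: str) -> bool:
    """UTF-8-safe check: decode the raw response bytes strictly as UTF-8 first."""
    try:
        return marker in raw_body.decode("utf-8")
    except UnicodeDecodeError:
        return False


marker = "繁體中文標記"            # illustrative marker, not the gate's real prompt
body = marker.encode("utf-8")      # what a server would send on the wire

garbled = make_mojibake(marker)
print(marker in garbled)                   # the broken harness misses the marker
print(contains_marker(body, marker))       # the UTF-8-safe check finds it
print(repair_mojibake(garbled) == marker)  # the underlying bytes were intact
```

The point of the sketch is that the bytes are intact throughout; only the decode step decides whether the marker is found.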

RC Service Readiness

The following is service-file readiness only. Nothing has been started, restarted, or deployed.

Candidate	Service file	Port	Policy status	Blocker
31B pure K3/V3 MTP b4	llama-gemma4-31b-fullt3-mtp-rc.service	8090	no-graphs + pure mode env set	:8090 is currently occupied
26B K3/V4 MTP b3	llama-gemma4-26b-k3v4-mtp-rc.service	8088	no-graphs set	Needs fallback restore + coherence gate
E4B pure K3/V3 MTP b3	llama-gemma4-e4b-fullt3-mtp-rc.service	8089	no-graphs + pure mode env set	Needs fallback restore + coherence gate
E4B K3/V4 MTP b3	llama-gemma4-e4b-k3v4-mtp-rc.service	8093	no-graphs set	Needs fallback restore + coherence gate

Remote systemd-analyze verify passed for these copied templates.

Legacy Fallback Backup

For deployment safety, a read-only backup and a dry-run were done first; runtime state was not touched.

Item	Status	Path / note
Read-only backup	Done	/home/waynehsu/backups/gemma4-legacy-state-20260514_081224
Local copy	Done	ops/dgx/tmp/gemma4-legacy-state-20260514_081224
Restore helper	Dry-run passed	ops/dgx/restore_gemma4_legacy_8080_dryrun.sh
:8080 fallback	Not yet started	Requires explicit approval before --execute.
Source unit	Resolved	Uses the backed-up llama-gemma4-26b.service; the file at the buun source path is currently missing.

Execution Runbook

Exact-command runbook: ops/dgx/gemma4-execution-commands-20260514.md.

1. Check state

Before any runtime step, read the lock, ports, ASR CPU, and GPU residency.

2. One coherence gate

When headroom is safe, run only one promoted candidate at a time, and take the lock.

3. Restore fallback

The :8080 fallback restore/smoke executes only after explicit approval.

4. RC smoke

Once the fallback is safe, start/smoke the RC services one at a time.
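
The port part of the "Check state" step can be approximated with a plain TCP connect probe. This is a minimal sketch under stated assumptions, not the logic of /tmp/gemma4_readiness_audit.sh: it only detects listeners reachable on the loopback address, it does not read the lock, ASR CPU, or GPU residency, and the function names are hypothetical.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1", timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise.
        return probe.connect_ex((host, port)) == 0


def blocked_ports(ports, host: str = "127.0.0.1"):
    """Return the subset of ports that already have a listener."""
    return [p for p in ports if port_in_use(p, host)]


if __name__ == "__main__":
    # Candidate and fallback ports named in this report.
    candidate_ports = [8080, 8087, 8088, 8089, 8090, 8092, 8093]
    busy = blocked_ports(candidate_ports)
    print("blocked:", busy if busy else "none")
```

A probe like this tells you a port is taken but not by whom; mapping a port to a PID, as the audit table does, needs a tool that reads the socket table.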

Side Lanes

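Side frameworks can serve as controls or later research, but the Gemma4 mainline still prioritizes MTP/ngram/TBQ.

Lane	Readout	Decision
Qwen3.6 native MTP	Usable as a positive MTP control; Unsloth notes show a practical uplift, but prompt processing cost may rise.	Keep as a control; it does not replace the Gemma4 mainline.
Qwen3.6 + TBQ3 KV	The local gate is not yet complete. TBQ3 is runtime KV cache quantization, not GGUF weight quantization; in theory it can be tried on the TurboQuant fork with `-ctk turbo3 -ctv turbo3` or K3/V4.	Cannot be reported as verified yet; it needs PPL + UTF-8/coherence + TPS/accept before entering the official table.
DFlash / DDTree	DFlash has forks/PRs; DDTree still looks fork-only in the local scan.	Do not switch the mainline before the MTP/TBQ gates are complete.
NVFP4 / MXFP4	The 26B external NVFP4 PPL smoke is already sane; its speed does not yet justify making it the default.	Promote only if loader, PPL, coherence, and speed all beat the Q4/TBQ lane.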