jmydurant
|
8836990bde
|
[TRTLLM-3602][feat] support nvfp4 model and fp8 kv cache for MLA chunked prefill (Blackwell) (#5475)
Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>
|
2025-06-26 22:18:08 +08:00 |
|
jmydurant
|
578dbc8d9a
|
feat: chunked prefill for MLA (Blackwell) (#4651)
Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>
|
2025-06-26 09:01:00 +08:00 |
|
zhhuang-nv
|
a891013e3c
|
[feat] Optimize KV Cache Reuse for MLA (#4869)
Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
|
2025-06-13 11:03:05 +08:00 |
|
zhhuang-nv
|
8452775db8
|
[TRTLLM-5070][feat] Support FP8 KV Cache Reuse for MLA (#4535)
* optimize kv cache reuse workflow for MLA
write kv cache first and only call up-projection GEMM once
relax contiguous requirements of k/v for setting paged kv cache
return two contiguous tensors when loading MLA KV Cache
Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
* support fp8 kv cache for MLA kv cache reuse
Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
* resolve comments
Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
---------
Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
|
2025-05-23 19:47:50 +08:00 |
|
zhhuang-nv
|
97bc680cd8
|
feat: support kv cache reuse for MLA (#3571)
* support kv cache reuse for MLA
load compressed_kv and k_pe and do up-projection
use 192/128 head size MLA context kernel
support Blackwell and Hopper now
Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
* add CI test
Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
* fix: set k_pe head_num to 1 for kernel 2 and kernel 2V2
Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>
* resolve comments
Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
* use GPTJ style RoPE for MLA
Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
* fix rebase error and some docs
Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
* fix kv_lens
Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
* tiny fix
Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
* fix torch compile
Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
* fix: use normal device memory instead of pinned memory for unit test
Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>
* fix L0 tests
Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
* fix torch compile after rebase
Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
* resolve comments
Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
* resolve comments again
Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
---------
Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>
Signed-off-by: zhhuang-nv <145532724+zhhuang-nv@users.noreply.github.com>
Co-authored-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>
|
2025-05-15 15:22:21 +08:00 |
|