Commit Graph

681 Commits

Author SHA1 Message Date
Suyog Gupta
f94af0fb86
[AutoDeploy] Make all ranks agree on kv-cache size (#4007)
* make all ranks agree on kv-cache size

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

* lint

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

* lint

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

* lint

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

* lint

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

* minor cleanups

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

* use all_gather_object wrapper

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

---------

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
2025-05-02 04:07:28 +08:00
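
A minimal sketch of the idea behind the commit above, assuming a torch.distributed process group is already initialized; the function name and the "number of KV blocks" unit are illustrative, not the repository's actual API:

```python
import torch.distributed as dist

def agree_on_kv_cache_size(local_num_blocks: int) -> int:
    """Make every rank adopt the same KV-cache size.

    Each rank proposes the capacity it can afford locally; the ranks
    exchange proposals and all adopt the minimum, since that is the
    only size every rank can actually hold.
    """
    gathered = [None] * dist.get_world_size()
    # all_gather_object exchanges arbitrary picklable Python objects,
    # matching the "all_gather_object wrapper" mentioned in the commit.
    dist.all_gather_object(gathered, local_num_blocks)
    return min(gathered)
```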
Erin
83f37614ef
feat: Support Top-K logprobs and prompt_logprobs in LLMAPI (#3388)
* support return logprob in llmapi

Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>

update and add test

Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>

stability test

Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>

* revert removal of old flag

Signed-off-by: Erin Ho <erinh@nvidia.com>
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>

---------

Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
Signed-off-by: Erin Ho <erinh@nvidia.com>
2025-05-01 12:47:14 -04:00
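
A hypothetical usage sketch of the feature this PR describes; the exact parameter and field names (`logprobs`, `prompt_logprobs`, `output.outputs[0].logprobs`) are assumptions based on the PR title, not verified against the final API:

```python
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
# Request top-2 logprobs per generated token and top-1 prompt logprobs.
params = SamplingParams(max_tokens=8, logprobs=2, prompt_logprobs=1)
for output in llm.generate(["The capital of France is"], params):
    print(output.outputs[0].text)
    print(output.outputs[0].logprobs)  # per-token top-K log probabilities
```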
xinhe-nv
009d5e9fa3
test: [CI] Add failed cases into waives.txt (#3943)
* update waive list

Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>

* waive test_llm_commandr_v01_single_gpu_summary for GH200

Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>

---------

Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
2025-05-01 23:43:11 +08:00
bhsueh_NV
129bf19980
model: support Qwen3 (#4010)
* add qwen3 dense model pytorch backend support, initial commit

fix the incorrect-results issue

add qwen3 moe model pytorch backend support

reformat the code

* perf - use flash_infer rmsnorm for qwen3

* feat - support qwen3 moe rmsnorm

* Put the computation of Q and K norm (in attn) into a single CUDA stream, and get a 5%-8% throughput improvement on Qwen3 4B and Qwen3-MoE 30B-A3B.

* Put the computation of Q and K norm (in attn) into a single CUDA stream, and get a 5%-8% throughput improvement on Qwen3 4B and Qwen3-MoE 30B-A3B. -- Follow-up: include the modifications missed in the previous commit.

* fix bugs of running qwen3 public models and fp8 models

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

* fix bugs due to rebase

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

* fix bugs caught by pre-commit

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

* fix bug of attention

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

---------

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
Co-authored-by: Keddy Jin <jin.gq@aliyun.com>
Co-authored-by: Jiying Dong <87510204+dongjiyingdjy@users.noreply.github.com>
Co-authored-by: shao <shao@nvidia.com>
2025-05-01 23:12:41 +08:00
nv-guomingz
dc344b6a4f
fix: https://nvbugs/5246733 (#3989)
Signed-off-by: nv-guomingz <37257613+nv-guomingz@users.noreply.github.com>
Co-authored-by: nv-guomingz <37257613+nv-guomingz@users.noreply.github.com>
2025-05-01 22:52:31 +08:00
YueWeng
b1621e8d4e
feat: add relaxed acceptance for DS (#3865)
* add relaxed acceptance for DS R1

Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>

* clean and update docs

Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>

* fix

Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>

* Modified based on review

Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>

* fix mtp manager issue

Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>

---------

Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>
Co-authored-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-05-01 21:50:36 +08:00
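
A sketch of what a "relaxed acceptance" rule for speculative decoding can look like: rather than accepting a draft token only when it matches the target model's greedy choice, accept it when it lands in the target's top-k candidate set within a logprob margin of the best candidate. The `top_k`/`delta` knobs are illustrative; the PR's actual parameter names may differ:

```python
import torch

def relaxed_accept(draft_token: int, target_logits: torch.Tensor,
                   top_k: int = 3, delta: float = 0.5) -> bool:
    logprobs = torch.log_softmax(target_logits, dim=-1)
    top_vals, top_ids = torch.topk(logprobs, top_k)
    in_top_k = bool((top_ids == draft_token).any())
    # Accept only if also within `delta` of the best candidate's logprob.
    close_enough = bool(logprobs[draft_token] >= top_vals[0] - delta)
    return in_top_k and close_enough
```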
hlu1
1294ecb12f
Add attention workspace memory check (#3970)
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>
2025-04-30 23:51:09 -07:00
milesial
6ded5f984b
Llama4 processor fixes (#3994)
* fix: Propagate sampling params

Signed-off-by: Alexandre Milesi <30204471+milesial@users.noreply.github.com>

* fix: type hints

Signed-off-by: Alexandre Milesi <30204471+milesial@users.noreply.github.com>

---------

Signed-off-by: Alexandre Milesi <30204471+milesial@users.noreply.github.com>
Co-authored-by: Alexandre Milesi <30204471+milesial@users.noreply.github.com>
Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>
2025-05-01 12:45:53 +08:00
Kate Cheng
7dbe618683
feat: Add multimodal embedding field in LlmRequest (#3855)
* Add a new param to LlmRequest and Request to natively support mm

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* update comment

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Update tests to match the new LlmRequest constructor parameters

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Modify unit test and mm_embedding's dict name in llama4

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Fix based on comments

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Fix comment

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Fix LlmRequest initialization in kvCacheManagerTest

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Clean up code for prompt_tuning_config

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Clean up prompt_tuning_config in GenerationRequest

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

---------

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>
2025-05-01 12:23:30 +08:00
Frank
1e317c98c6
[feat]: Allow for a settable end-of-sequence/padding token in max throughput benchmark. (#3776)
* Move world options to a different group for clarity.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

* Add eos_id option.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

---------

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
2025-05-01 09:42:46 +08:00
Yukun He
9cc5922a0b
Clean up allreduce op in Deepseek V3 model. (#3829)
* Replace deepseek_allreduce op with the new unified allreduce op and moe_allreduce op.
* Minor revision of moe_allreduce op argument names.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-05-01 07:56:36 +08:00
Dom Brown
b40f351b7a
[TRTLLM-4460] test: Use Llama 3.2 1B for Llama C++ tests (#3206)
* Squash of dev commits

Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>

* Add timer + waive test with suspected GptSession bug

Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>

* Respond to reviewer comments

Signed-off-by: domb <3886319+DomBrown@users.noreply.github.com>

---------

Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
Signed-off-by: domb <3886319+DomBrown@users.noreply.github.com>
2025-05-01 05:31:08 +08:00
Erin
941e82faa6
waive test_tinyllama_guided_decoding (#3997)
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
2025-04-30 12:46:32 -07:00
tburt-nv
7053d0ad5a
infra: add conan (#3744)
This MR integrates Conan into the build system, so that it can be used to fetch dependencies in future changes.

Also installs all requirements-dev.txt packages inside a virtualenv instead of into the system Python, since some of Conan's dependencies may conflict with system packages. virtualenv is used instead of venv because the Triton server backend container has only virtualenv installed. This also lets developers cache the requirements-dev.txt packages between container launches.


Signed-off-by: Tyler Burt <195370667+tburt-nv@users.noreply.github.com>
2025-04-30 11:53:14 -07:00
Mike Iovine
8c2c969fcb
[fix] Pad requests to maximum draft length in spec decode (#3957)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-04-30 11:02:18 -04:00
nv-guomingz
dd959de0fd
chore: update internal_cutlass_kernels. (#3973)
Signed-off-by: nv-guomingz <37257613+nv-guomingz@users.noreply.github.com>
Co-authored-by: nv-guomingz <37257613+nv-guomingz@users.noreply.github.com>
2025-04-30 22:13:17 +08:00
Julien Debache
83670571dd
feat: Mistral-Large-2 support in the PyTorch workflow
- Added a modelling file for models configured by a `MistralConfiguration` object, as it is slightly different from the Llama one
2025-04-30 20:12:39 +08:00
Ming Wei
ed887940d4
infra: open source XQA kernels (#3762)
Replace libtensorrt_llm_nvrtc_wrapper.so with its source code, which
consists of two parts:

1. NVRTC glue code
2. XQA kernel code

During the TensorRT-LLM build, XQA kernel code is embedded as C++ arrays via
gen_cpp_header.py and passed to NVRTC for JIT compilation.

Signed-off-by: Ming Wei <2345434+ming-wei@users.noreply.github.com>
2025-04-30 18:05:15 +08:00
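
A minimal illustration of the embedding step described above: turning a CUDA source file into a C++ header whose string literal can be handed to NVRTC at runtime. This is a sketch of the idea only, not the repository's actual gen_cpp_header.py:

```python
from pathlib import Path

def embed_as_cpp_header(cu_path: str, header_path: str, symbol: str) -> None:
    src = Path(cu_path).read_text()
    # Emit each source line as an escaped C string fragment; adjacent
    # literals concatenate into one constant in the generated header.
    fragments = []
    for line in src.splitlines():
        escaped = line.replace("\\", "\\\\").replace('"', '\\"')
        fragments.append(f'"{escaped}\\n"')
    Path(header_path).write_text(
        "// Auto-generated; do not edit.\n"
        f"constexpr char const* {symbol} =\n" + "\n".join(fragments) + ";\n")

# Example (hypothetical file names):
# embed_as_cpp_header("xqa_kernel.cu", "xqa_kernel_src.h", "kXqaKernelSrc")
```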
Chuang Zhu
1ada3c9800
unwaive disagg tests (#3925)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-04-30 16:44:00 +08:00
Bo Li
a80d2373a3
fix: [https://nvbugspro.nvidia.com/bug/5243482] If FlashMLA is used, the existence of FMHA-based MLA kernels should not be checked. (#3862)
* Add mIsGenerationMLA to differentiate ctx and gen MLA in AttentionOp.
For generation MLA, if FlashMLA is used, do not check the existence of the FMHA-based MLA kernel.

Signed-off-by: Bo Li <bobboli0202@gmail.com>

* Run pre-commit.

Signed-off-by: Bo Li <bobboli0202@gmail.com>

* Fix compile error.

Signed-off-by: Bo Li <bobboli0202@gmail.com>

---------

Signed-off-by: Bo Li <bobboli0202@gmail.com>
2025-04-30 14:27:38 +08:00
tburt-nv
afb7d3adce
remove release branch codeowners (#3954)
Signed-off-by: Tyler Burt <195370667+tburt-nv@users.noreply.github.com>
2025-04-30 11:59:42 +08:00
djns99
cc989ea49f
perf: Optimise MOE prologue to use fused setup function (#3790)
Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>
2025-04-30 11:44:48 +08:00
Zhanrui Sun
86e7474a9b
chore: bump version to 0.20.0rc2 (#3949)
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-04-30 11:44:43 +08:00
yuxianq
f568cbb671
chore: Remove duplicated get_sm_version. (#3935)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-04-30 11:43:53 +08:00
xinhe-nv
a31afcf3a9
update waive list (#3890)
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
2025-04-30 11:07:48 +08:00
QI JUN
c6fea946e1
chore: update multi-gpu trigger file list (#3971)
* update multi-gpu trigger file list

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* update

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

---------

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-04-30 09:15:26 +08:00
Pamela Peng
f98a80f9d9
sync internal cutlass kernel changes (#3968)
Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>
2025-04-30 08:57:28 +08:00
QI JUN
99929e724b
ci: skip pipeline parallelism test of pytorch flow (#3947)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-04-30 01:00:16 +08:00
Pamela Peng
c8649ce3aa
skip blackwell tests for sm120 (#3815)
Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>
2025-04-29 09:53:35 -07:00
Fanrong Li
e6b482ef47
fix: change the seq_lens sync copy to an async one (#3786)
---------

Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-04-29 23:56:49 +08:00
tomeras91
35010e8073
Support NemotronH FP8 Quantization
(1) match quantization exclude-module names to TRT-LLM names
(2) no special weight loading is needed for quantization scale weights (#3891)

Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
2025-04-29 18:51:43 +03:00
xiweny
68a19a33d4
TRTLLM-4624 feat: Add nvfp4 gemm and moe support for SM120 (#3770)
* upgrade cutlass to 3.9

Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>

update latest internal_cutlass_kernels; revert cutlass version update; fix fp4 gemm for sm100

Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>

* update internal cutlass kernels

Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>

* fix file

Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>

* remove unnecessary change

Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>

* update hash

Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>

---------

Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
Co-authored-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>
2025-04-29 11:19:11 -04:00
yuxianq
0f8ec693b2
fix: get head_dim from model’s config. (#3916)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-04-29 23:04:29 +08:00
HuiGao-NV
8e6eead6a5
refactor: (part1) Add contraints doc for fusedMoe module. (#3882)
* Add doc string for FusedMoe module
* Address comments.

Signed-off-by: Hui Gao <huig@nvidia.com>
2025-04-29 22:23:02 +08:00
Junhong Liu
06e76020d7
feat: parallel q_b_proj and concat (#3917)
* add parallel_q_b_proj_and_concat

Signed-off-by: junliu <65336694+hello-11@users.noreply.github.com>

* code cleanup

Signed-off-by: junliu <65336694+hello-11@users.noreply.github.com>

* one gemm/concat and then split the latent_cache and pass them separately to context/gen

Signed-off-by: junliu <65336694+hello-11@users.noreply.github.com>

---------

Signed-off-by: junliu <65336694+hello-11@users.noreply.github.com>
2025-04-29 22:07:05 +08:00
Dom Brown
8709fe8b53
chore: bump version to 0.19.0 (#3598) (#3841)
test: add test cases for 0.19 release (#3608)

* fix test name

* add quickstart test for nemotron-ultra

* add rcca multi-node test case for deepseek-v3

* add rcca info

---------

squash (#3642)

fix: nvbugs/5187237: fix deterministic mode crash (#3448)

* nvbugs/5187237 nvbugs/5112075: fix deterministic mode error

* remove waive

* Revert "remove waive"

This reverts commit 0bf5486d19906d692bfb7a6262333c296b0087ac.

* revert ar fusion

---------

update fp8 doc (#3647)

tests: change qa perf test to trtllm-bench (#3619)

fix: FP8 quantized lm_head (NvBug 5214229) (#3567)

infra: Add PR approval protection for the release branch (#3634)

fix: nvbugs/5231298: pytorch allreduce issue (#3673)

Fix: nvbugs/5222698 variable not defined (#3630)

* Fix: nvbugs/5222698 variable not defined

* Tidy code

---------

test: sync waives.txt from main branch by disabling test_perf/gpt_350m-cppmanager case (#3685)

test: restore fp8 kv cache testing for L0 (#3671)

doc: Update DeepSeek perf docs (#3693)

* Update DeepSeek perf docs

* update

* Apply suggestions from code review

---------

tests: waive test_llm_multi_node (#3664)

fix: update test_user_buffers_mm_add_prologue atol (#3711)

Fix: cherry-pick hmac encryption from main branch (#3635)

* security fix cherry-pick changes from main

* fix hmac in remote mpi session (#3649)

---------

Un-waive DS-V3-Lite tests. (#3621)

fix: FP8 kv accuracy (#3675)

* fix FP8 kv accuracy

* update doc

---------

Fix script options for engines. (#3622)

unwaive multi-node test (#3721)

chore: Split more tests out of gpt tests (#3524) (#3674)

doc: add torch examples link into torch backend documentation (#3749)

test: Get Eagle tests working (#3593) (#3722)

Waive L0 test (#3756)

waive failed case in perf test, change default max_batch_size to 512 and write config.json to output log (#3656)

Update ds v3 parameters in stress test. (#3676)

waive gemma on L20 (#3766)

https://nvbugs/5141291: Fix convert.py script for Qwen model. (#3758)

Include Qwen2VLDecoderLayer in the smooth_qwen2_model function.

fix: PP4 fixes and cleanup (#3688)

remove benchmark test list (#3643)

skip disagg deepseek test if sm != 90 (#3720)

test: skip failed cases on B200 (#3710)

* add skip condition to tests

* fix error

---------

test: [nvbug: 5234494] skip_pre_ada for fp8 cases (#3718)

* skip_pre_ada for fp8 cases

* update

* update after rebase

---------

add known issue to deepseek doc. (#3800)

Fix ModelOpt Mixtral AWQ OOM (#3714) (#3761)

Waive L0 tests (#3826)

fix: Reduce memory usage in fused moe op associated with AutoTuning and fix moe fallback issue. (#3793)

* Reduce memory usage in the fused moe op associated with AutoTuning.
* Replace the pre-defined bucket size strategy with a generating function based on tune_max_num_tokens (a sketch of this idea follows this entry).
* Add free_memory logic for the workspace in the min_latency_mode fused moe path.

* Fix fused_moe fallback issue. (#3652)

min_latency_mode is only set to False during the warmup phase, so when it becomes True during inference all tactics fall back to the default one, causing a perf regression.

---------

[doc] Better document for Draft-Target-Model (DTM) speculative decoding (#3797)

Fix pre-commit

Fix again

Address some review comments for the MR

Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-04-29 16:57:22 +08:00
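
Referenced from the fused-moe item in the squashed entry above: a sketch of deriving autotuning bucket sizes from tune_max_num_tokens instead of a fixed list, so the tuner only profiles shapes the deployment can actually see. The function name and the power-of-two policy are assumptions for illustration:

```python
def generate_buckets(tune_max_num_tokens: int) -> list[int]:
    """Power-of-two bucket sizes capped at tune_max_num_tokens."""
    buckets, size = [], 1
    while size < tune_max_num_tokens:
        buckets.append(size)
        size *= 2
    buckets.append(tune_max_num_tokens)  # always include the cap itself
    return buckets

print(generate_buckets(4096))  # [1, 2, 4, ..., 2048, 4096]
```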
zhhuang-nv
94e6167879
optimize cudaMemGetInfo for TllmGenFmhaRunner (#3907)
Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
2025-04-29 14:17:07 +08:00
bhsueh_NV
2e230b73ec
change log level of some text from info to debug (#3930)
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-04-29 13:38:34 +08:00
yuxianq
adfa04745e
fix: revert https://github.com/NVIDIA/TensorRT-LLM/pull/3858 (#3928)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-04-29 11:26:13 +08:00
bhsueh_NV
0610d0ff84
add num_scheduled_requests into print_log (#3914)
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-04-29 11:22:22 +08:00
Frank
cf15efa15e
[TRTLLM-4883][fix]: Update output speed calculation. (#3923)
* Update gen tps calculation.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

* Add back output speed for comparison.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

* Fix issue with f-string.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

* Fix some spacing.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

* Replace output speed with per-request gen-phase throughput.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

* Add gen TPS breakdown.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

* Update some tagging.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

---------

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
2025-04-29 11:04:12 +08:00
QI JUN
c381380ecc
increase H100 CI nodes for PyTorch only pipelines (#3927)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-04-29 10:58:43 +08:00
Perkz Zheng
35c5e4f1c5
feat: add CGA reduction fmha kernels on Blackwell. (#3763)
* update cubins

Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>

* add trtllm-gen kernels for eagle3 and also kernels with cga-reduction

Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>

* address the comments

Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>

---------

Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
2025-04-29 10:43:54 +08:00
hlu1
d2f312b8e4
Fix fp8 kvcache (#3877)
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>
Co-authored-by: Hao Lu <14827759+hlu1@users.noreply.github.com>
2025-04-29 10:31:10 +08:00
WeiHaocheng
8a994d879f
feat: fix errors in scaffolding README (#3899)
Signed-off-by: fredw (generated by with_the_same_user script) <20514172+WeiHaocheng@users.noreply.github.com>
2025-04-29 10:15:06 +08:00
qixiang-99
f370dd0e32
refactor(test): remove random context sequence lengths and set seed for reproducibility in attention tests (#3919)
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
2025-04-29 10:08:04 +08:00
yuxianq
b91da764de
chore: remove DummyKvCacheManager. (#3896)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-04-29 09:59:37 +08:00
Jinyang Yuan
dafc28fb85
fix: Fix FMHA-based MLA in the generation phase and add MLA unit test (#3863)
2025-04-29 09:09:43 +08:00
Erin
0577ea0155
waive test_attention_no_cache (#3921)
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
2025-04-28 13:57:01 -07:00
Mike Iovine
e534bf09cc
[fix] Fix flashinfer + speculation issues (#3686)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-04-28 14:34:22 -04:00