* add cases for rtx_pro_6000 and update test filter
Signed-off-by: Ruodi <200874449+ruodil@users.noreply.github.com>
* amend a typo in model llama_v3.1_405b_instruct fp4 and add more cases for rtx pro 6000 and waive_list
Signed-off-by: Ruodi <200874449+ruodil@users.noreply.github.com>
---------
Signed-off-by: Ruodi <200874449+ruodil@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
* Add llama4 disagg accuracy tests
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
* Make it async and add GSM8K benchmark
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
---------
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
* add ll-nm-nano tests that map to nim requirements
Signed-off-by: Venky <23023424+venkywonka@users.noreply.github.com>
* prune some pytorch cases (fp8)
Signed-off-by: Venky <23023424+venkywonka@users.noreply.github.com>
* removing pyt backend test changes
- When validating the pytorch tests with the isl/osl/conc/quant settings (that is done for cpp backend too), seeing hangs that need further debugging.
- Therefore don't want to block this PR, hence removing them.
- Seeing
Signed-off-by: Venky <23023424+venkywonka@users.noreply.github.com>
---------
Signed-off-by: Venky <23023424+venkywonka@users.noreply.github.com>
* add test list for rtx5090 and rtx_pro_6000
Signed-off-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com>
* add 2gpu llama70b test cases
Signed-off-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com>
* remove duplicate and invalid test cases
Signed-off-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com>
* add 2gpus test cases
Signed-off-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com>
---------
Signed-off-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com>
Reorganize TestDeepSeekR1::test_nvfp4_8gpus
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com>
Co-authored-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com>
* chore: Remove GptSession/V1 from TRT workflow
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* chore: Remove stateful decoders
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* chore: Remove GptSession buffers
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* chore: Remove GptSession utils
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* chore: Remove GptSession kernels
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* chore: Remove V1 GPT models from tests
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* chore: Remove gptSessionBenchmark from scripts and docs
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* chore: Remove gptSession IO classes
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* chore: Remove GptSession from test lists
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* chore: Remove GptSession from docs
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* chore: Remove useless encoder test
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* chore: Remove mActualBatchSize from DecoderState
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* chore: Remove static batching from ExecutorTest
- Updated `validateContextLogits` and `validateGenerationLogits` functions to remove the `batchingType` parameter.
- Adjusted related test functions to reflect the changes in parameter lists.
- Cleaned up the instantiation of test cases to eliminate unnecessary batchingType references.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
---------
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
fix for perf test script issue
Signed-off-by: Ruodi <200874449+ruodil@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
* test: add llama_v3.1_8b_fp8 model, llama_v3.1_405b model and llama_nemotron_49b model in perf test, and modify original llama models dtype from float16 to bfloat16 according to README.md
Signed-off-by: Ruodi <200874449+ruodil@users.noreply.github.com>
* add llama_3.2_1B model and fix for lora script issue
Signed-off-by: Ruodi <200874449+ruodil@users.noreply.github.com>
---------
Signed-off-by: Ruodi <200874449+ruodil@users.noreply.github.com>
* Add fp8 kv cache tests to DSV3-Lite integration tests.
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
* Refactor. Make fp8kv parallel to attention_dp, overlap_scheduler and cuda_graph.
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
* Update gsm8k.
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
* Update CI list.
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
* Update TestDeepSeekR1.
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
* Fix test list.
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
* Need quant_config besides pytorch_config.
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
* Update waive list (bug 5239087).
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
* Update waive list.
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
* Correct test name.
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
* Update waive list.
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
---------
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
Signed-off-by: Bo Li <bobboli0202@gmail.com>
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
Co-authored-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* tests: skip writing prepare_dataset output to logs
Signed-off-by: Ruodi <200874449+ruodil@users.noreply.github.com>
* test: add llama_v3.1_8b_fp8 model, llama_v3.1_405b model and llama_nemotron_49b model in perf test, and modify original llama models dtype from float16 to bfloat16 according to README.md
Signed-off-by: Ruodi <200874449+ruodil@users.noreply.github.com>
---------
Signed-off-by: Ruodi <200874449+ruodil@users.noreply.github.com>
Signed-off-by: Larry <197874197+LarryXFly@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
* Remove the WAR for test items incompleted
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
* Complete test item manually
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
* Fix another test definition file
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
* Complete test name
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
* Fix some other test names
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
* Fix another test name after rebase
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
* Update name for waived case name, too
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
* Fix name for multi-gpu tests
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
* Fix test name after rebase
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
* Fix another test name
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
* Fix typo
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
* Fix test name after rebase
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
* Fix other qa tests
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
* Fix tests name after rebase
Signed-off-by: qqiao <qqiao@nvidia.com>
* Fix name after rebase
Signed-off-by: qqiao <qqiao@nvidia.com>
* Correct test names in waive.txt
Signed-off-by: qqiao <qqiao@nvidia.com>
* Add new test_durations file
Signed-off-by: qqiao <qqiao@nvidia.com>
* Fix names after rebase
Signed-off-by: qqiao <qqiao@nvidia.com>
* Update test duration to latest
Signed-off-by: qqiao <qqiao@nvidia.com>
---------
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
Signed-off-by: qqiao <qqiao@nvidia.com>
test: add test cases for 0.19 release (#3608)
* fix test name
* add quickstart test for nemotron-ultra
* add rcca multi-node test case for deepseek-v3
* add rcca info
---------
squash (#3642)
fix: nvbugs/5187237: fix deterministic mode crash (#3448)
* nvbugs/5187237 nvbugs/5112075: fix deterministic mode error
* remove waive
* Revert "remove waive"
This reverts commit 0bf5486d19906d692bfb7a6262333c296b0087ac.
* revert ar fusion
---------
update fp8 doc (#3647)
tests: change qa perf test to trtllm-bench (#3619)
fix: FP8 quantized lm_head (NvBug 5214229) (#3567)
infra: Add PR approval protection for the release branch (#3634)
fix: nvbugs/5231298: pytorch allreduce issue (#3673)
Fix: nvbugs/5222698 variable not defined (#3630)
* Fix: nvbugs/5222698 variable not defined
* Tidy code
---------
test:sync waives.txt from main branch by disabling test_perf/gpt_350m-cppmanager case (#3685)
test:restore fp8 kv cache testing for L0 (#3671)
doc: Update DeepSeek perf docs (#3693)
* Update DeepSeek perf docs
* update
* Apply suggestions from code review
---------
tests: waive test_llm_multi_node (#3664)
fix: update test_user_buffers_mm_add_prologue atol (#3711)
Fix: cherry-pick hmac encryption from main branch (#3635)
* security fix cherry-pick changes from main
* fix hmac in remote mpi session (#3649)
---------
Un-waive DS-V3-Lite tests. (#3621)
fix: FP8 kv accuracy (#3675)
* fix FP8 kv accuracy
* update doc
---------
Fix script options for engines. (#3622)
unwaive multi-node test (#3721)
chore : Split more tests out of gpt tests (#3524) (#3674)
doc:add torch examples link into torch backend documentation (#3749)
test: Get Eagle tests working (#3593) (#3722)
Waive L0 test (#3756)
waive failed case in perf test, change default max_batch_size to 512 and write config.json to output log (#3656)
Update ds v3 parameters in stress test. (#3676)
waive gemma on L20 (#3766)
https://nvbugs/5141291: Fix convert.py script for Qwen model. (#3758)
Include Qwen2VLDecoderLayer in the smooth_qwen2_model function.
fix: PP4 fixes and cleanup (#3688)
remove benchmark test list (#3643)
skip disagg deepseek test if sm!=90 (#3720)
test: skip failed cases on B200 (#3710)
* add skip condition to tests
* fix error
---------
test: [nvbug: 5234494] skip_pre_ada for fp8 cases (#3718)
* skip_pre_ada for fp8 cases
* update
* update after rebase
---------
add know issue to deepseek doc. (#3800)
Fix ModelOpt Mixtral AWQ OOM (#3714) (#3761)
Waive L0 tests (#3826)
fix: Reduce memory usage in fused moe op associated with AutoTuning and fix moe fallback issue. (#3793)
* Reduce memory usage in fused moe op associated with AutoTuning.
* Replace pre-defined bucket size strategy with a generating function based on the tune_max_num_tokens.
* Add free_memory logic of workspace in min_latency_mode fused moe path.
* Fix fused_moe fallback issue. (#3652)
min_latency_mode is only set to False during warmup phase. Thus when it becomes true during inference, all tactics fall back to the default one and thus cause perf regression.
---------
[doc] Better document for Draft-Target-Model (DTM) speculative decoding (#3797)
Fix pre-commit
Fix again
Address some review comments for the MI
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
Signed-off-by: Ruodi <200874449+ruodil@users.noreply.github.com>
Signed-off-by: Larry <197874197+LarryXFly@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
* add llama3.2 ptp test case
Signed-off-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com>
* update test list
Signed-off-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com>
---------
Signed-off-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com>
* implement variable window attention by breaking the block manager into window block managers per window size
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* revert isCyclic to be true if the min attention window is reached, not per window size
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* add explanatory comment to mCyclicThreshold
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* load correct gemma config
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* don't shadow inputLength in addSequence - it should remain the function scope input length between window size loop iterations
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* fix KVCacheManagerVariableWindowAttentionWithReuseTest for multiple window block managers
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* if TYPE_CHECKING
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* set temp_attention_window_inputs to None explicitly
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* set temp_attention_window_inputs to None explicitly
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* pass dtype as well
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* test_gemma variable sliding window attention
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* allot a fraction of primary/secondaryBlocks to different window size heaps, depending on the window size's total contribution to the kvcache size (i.e., including all layers)
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* remove || mEnableBlockReuse which erroneously triggers beamsearch code for cyclic variable attention window code
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* turn off request delaying for MaxUtil
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* make comments better
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* windowSizesTotalSum using std::accumulate
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* fix error handling of forwardAsync - forwardAsync catch-all catch cleanup code that runs terminateRequest can also fail and must be caught
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* fix comments
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* remove assert that kills disagg tests, since it isn't necessary
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* fix corrupted expression: 'isNewTask && (peftCacheManager ?' -> '(isNewTask && peftCacheManager) ?' which caused boolean algebra. Main is correct
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* add Gemma3 to SUPPORTED_HF_ARCHITECTURES
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* support Gemma3
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* finally fix test_gemma - always spread at least {} into generate_summary_cmd, never None
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* finally fix test_gemma - always spread at least {} into generate_summary_cmd, never None
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* fix kvfactor field for deepseek
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* fix comment
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* fix gemma-3 entries in testlist to include vswa
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* only quantize gemma2 VSWA
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
remove misleading comment
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
fix test_gemma
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* fix test_gemma
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* fix test_gemma
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* in sendRequestInfo, fromOldAllocatedBlockIds->fromOldAllocatedBlockIds, like in main
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* fix: disable KV cache reuse if using attention sink (#3021)
* fix: disable KV cache reuse if using attention sink
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* fix: disable KV cache reuse if sink bubble
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* add comment
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
---------
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
---------
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
Co-authored-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>