* Add a new param to LlmRequest and Request to natively support mm
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* update comment
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* Update tests to match the new LlmRequest constructor parameters
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* Modify unitTest and modify mm_embedding's dict name in llama4
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* Fix based on comments
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* Fix comment
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* Fix LlmRequest initialization in kvCacheManagerTest
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* Clean up code for prompt_tuning_config
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* Clean up prompt_tuning_config in GenerationRequest
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
---------
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>
test: add test cases for 0.19 release (#3608)
* fix test name
* add quickstart test for nemotron-ultra
* add rcca multi-node test case for deepseek-v3
* add rcca info
---------
squash (#3642)
fix: nvbugs/5187237: fix deterministic mode crash (#3448)
* nvbugs/5187237 nvbugs/5112075: fix deterministic mode error
* remove waive
* Revert "remove waive"
This reverts commit 0bf5486d19906d692bfb7a6262333c296b0087ac.
* revert ar fusion
---------
update fp8 doc (#3647)
tests: change qa perf test to trtllm-bench (#3619)
fix: FP8 quantized lm_head (NvBug 5214229) (#3567)
infra: Add PR approval protection for the release branch (#3634)
fix: nvbugs/5231298: pytorch allreduce issue (#3673)
Fix: nvbugs/5222698 variable not defined (#3630)
* Fix: nvbugs/5222698 variable not defined
* Tidy code
---------
test:sync waives.txt from main branch by disabling test_perf/gpt_350m-cppmanager case (#3685)
test:restore fp8 kv cache testing for L0 (#3671)
doc: Update DeepSeek perf docs (#3693)
* Update DeepSeek perf docs
* update
* Apply suggestions from code review
---------
tests: waive test_llm_multi_node (#3664)
fix: update test_user_buffers_mm_add_prologue atol (#3711)
Fix: cherry-pick hmac encryption from main branch (#3635)
* security fix cherry-pick changes from main
* fix hmac in remote mpi session (#3649)
---------
Un-waive DS-V3-Lite tests. (#3621)
fix: FP8 kv accuracy (#3675)
* fix FP8 kv accuracy
* update doc
---------
Fix script options for engines. (#3622)
unwaive multi-node test (#3721)
chore : Split more tests out of gpt tests (#3524) (#3674)
doc:add torch examples link into torch backend documentation (#3749)
test: Get Eagle tests working (#3593) (#3722)
Waive L0 test (#3756)
waive failed case in perf test, change default max_batch_size to 512 and write config.json to output log (#3656)
Update ds v3 parameters in stress test. (#3676)
waive gemma on L20 (#3766)
https://nvbugs/5141291: Fix convert.py script for Qwen model. (#3758)
Include Qwen2VLDecoderLayer in the smooth_qwen2_model function.
fix: PP4 fixes and cleanup (#3688)
remove benchmark test list (#3643)
skip disagg deepseek test if sm!=90 (#3720)
test: skip failed cases on B200 (#3710)
* add skip condition to tests
* fix error
---------
test: [nvbug: 5234494] skip_pre_ada for fp8 cases (#3718)
* skip_pre_ada for fp8 cases
* update
* update after rebase
---------
add known issue to deepseek doc. (#3800)
Fix ModelOpt Mixtral AWQ OOM (#3714) (#3761)
Waive L0 tests (#3826)
fix: Reduce memory usage in fused moe op associated with AutoTuning and fix moe fallback issue. (#3793)
* Reduce memory usage in fused moe op associated with AutoTuning.
* Replace pre-defined bucket size strategy with a generating function based on the tune_max_num_tokens.
* Add free_memory logic of workspace in min_latency_mode fused moe path.
* Fix fused_moe fallback issue. (#3652)
min_latency_mode is set to False only during the warmup phase, so when it becomes True during inference, all tactics fall back to the default one, causing a perf regression.
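A minimal sketch of what a bucket-generating function driven by tune_max_num_tokens could look like. The commit does not show the actual function, so the power-of-two spacing and the name generate_token_buckets are assumptions made only for illustration.

```python
def generate_token_buckets(tune_max_num_tokens: int) -> list[int]:
    """Return bucket sizes 1, 2, 4, ... capped at tune_max_num_tokens.

    Hypothetical helper: power-of-two spacing is an assumption, not the
    strategy confirmed by the commit.
    """
    if tune_max_num_tokens < 1:
        raise ValueError("tune_max_num_tokens must be positive")
    buckets = []
    size = 1
    # Collect every power of two strictly below the cap.
    while size < tune_max_num_tokens:
        buckets.append(size)
        size *= 2
    # Always include the cap itself so the largest shape is tuned too.
    buckets.append(tune_max_num_tokens)
    return buckets
```

Deriving the list from the cap, rather than hard-coding it, means no bucket larger than tune_max_num_tokens is ever allocated, which is one plausible way such a change reduces autotuning memory.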
---------
[doc] Better document for Draft-Target-Model (DTM) speculative decoding (#3797)
Fix pre-commit
Fix again
Address some review comments for the MI
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
* Use updateDecoderBuffers in python decoder.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Fix synchronize in trtllm decoder.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Enable by default.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Use guided_decoder to setup seqslots and free them.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Always use decode_async and update_requests.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Update decoder buffers.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Fix speculative decoding tests.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Send new_tensors_host instead of assuming dict.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Make default False in enable_trtllm_decoder.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Partially fix mtp, partially fix py_executor.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Update request states before sending disagg ctx cache.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Fix disagg test for torch decoder.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Make isend_tensor_list and recv_tensor_list for sending the tensors_host.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Formatting.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Fix rebase.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Add disagg serving case to guided decoder.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Get overlap scheduling to work.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Update cutlass to main.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Update after rebasing.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Formatting.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Update to use decode async and update requests.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Properly pass information to update_requests
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Formatting.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Make disaggregated serving a step closer to working.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Fix rebase.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Fix rebase and format.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Copy new device tokens in a more Pythonic way.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Restore MTP add dummy reqs.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Add ordereddict import to py_executor.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Formatting.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Added seq slot manager. Add test.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Use transmission for single tensor except when list of tensors is received.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Add TRTLLMDecoder allocation to estimate max kv cache tokens.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Add stream synchronization
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Formatting.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Make memory calculation of decoder adapt to the chosen decoder. Recognize decoder option passed in executorconfig. Make overlap scheduler test run on TinyLlama.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Format
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Add decoder creation to estimate max kv.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Formatting.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Update submodule UCXX in line with main.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
---------
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* add passing E2E LoRA flow
Signed-off-by: Shahar Mor <smor@nvidia.com>
* add experimental feature
Signed-off-by: Shahar Mor <smor@nvidia.com>
* fix llm_args definition
Signed-off-by: Shahar Mor <smor@nvidia.com>
* manually decreased size of max loras to address OOM
Signed-off-by: Shahar Mor <smor@nvidia.com>
---------
Signed-off-by: Shahar Mor <smor@nvidia.com>
* Fix hang bug when KV cache is low
Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>
* Review comments
Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>
* Fix attentiondp typo
Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>
* Add CI test for this case
Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>
* fix: Fix the insertion order for responder futures
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
* fix: Fix disagg CPP
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
---------
Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
* implement variable window attention by breaking the block manager into window block managers per window size
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* revert isCyclic to be true if the min attention window is reached, not per window size
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* add explanatory comment to mCyclicThreshold
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* load correct gemma config
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* don't shadow inputLength in addSequence - it should remain the function scope input length between window size loop iterations
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* fix KVCacheManagerVariableWindowAttentionWithReuseTest for multiple window block managers
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* if TYPE_CHECKING
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* set temp_attention_window_inputs to None explicitly
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* set temp_attention_window_inputs to None explicitly
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* pass dtype as well
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* test_gemma variable sliding window attention
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* allot a fraction of primary/secondaryBlocks to different window size heaps, depending on the window size's total contribution to the kvcache size (i.e., including all layers)
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* remove || mEnableBlockReuse which erroneously triggers beamsearch code for cyclic variable attention window code
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* turn off request delaying for MaxUtil
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* make comments better
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* windowSizesTotalSum using std::accumulate
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* fix error handling of forwardAsync - the cleanup code in forwardAsync's catch-all handler, which runs terminateRequest, can also fail and must be caught
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* fix comments
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* remove assert that kills disagg tests, since it isn't necessary
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* fix corrupted expression: 'isNewTask && (peftCacheManager ?' -> '(isNewTask && peftCacheManager) ?'; the corruption changed the boolean logic. Main is correct
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* add Gemma3 to SUPPORTED_HF_ARCHITECTURES
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* support Gemma3
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* finally fix test_gemma - always spread at least {} into generate_summary_cmd, never None
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* finally fix test_gemma - always spread at least {} into generate_summary_cmd, never None
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* fix kvfactor field for deepseek
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* fix comment
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* fix gemma-3 entries in testlist to include vswa
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* only quantize gemma2 VSWA
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
remove misleading comment
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
fix test_gemma
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* fix test_gemma
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* fix test_gemma
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* in sendRequestInfo, fromOldAllocatedBlockIds->fromOldAllocatedBlockIds, like in main
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* fix: disable KV cache reuse if using attention sink (#3021)
* fix: disable KV cache reuse if using attention sink
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* fix: disable KV cache reuse if sink bubble
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* add comment
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
---------
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
---------
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
Co-authored-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
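The allotment of primary/secondary blocks across window sizes described above can be sketched roughly as follows. This assumes a window size's contribution to the KV cache is its window size multiplied by the number of layers using it. The function name split_blocks_by_window and the rounding policy are hypothetical, not taken from the commit.

```python
def split_blocks_by_window(total_blocks: int,
                           layers_per_window: dict[int, int]) -> dict[int, int]:
    """Split total_blocks across window sizes by their share of the KV cache.

    layers_per_window maps window size -> number of layers using that window.
    The weight (window size * layer count) is an assumed proxy for the
    window size's total contribution to the KV cache size.
    """
    weights = {w: w * n for w, n in layers_per_window.items()}
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("at least one window size must have layers")
    # Integer share for each window size, rounded down.
    shares = {w: total_blocks * wt // total_weight for w, wt in weights.items()}
    # Hand blocks lost to rounding to the largest window size.
    shares[max(shares)] += total_blocks - sum(shares.values())
    return shares
```

For example, split_blocks_by_window(10, {1: 1, 2: 1}) gives window size 2 twice the weight of window size 1, and the block left over from rounding also goes to window size 2.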
* added files for nemotron-h
Signed-off-by: Luis Vega <lvega@nvidia.com>
* use try/except to import RMSNorm
Signed-off-by: Luis Vega <lvega@nvidia.com>
---------
Signed-off-by: Luis Vega <lvega@nvidia.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
* chore: unify pp_layers helpers
Fix assumptions about equal number of layers per PP rank
in prepare_attention_inputs
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
* fix: Fixing issue with first gen token being returned twice with streaming
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
* Fixing not_expectring_strings in test
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
---------
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
* fix: add kv memory size per token of draft model to calculate max number
of tokens of kv cache
Signed-off-by: Hui Gao
* Fix code to get model_config of draft model
Signed-off-by: Hui Gao
---------
Signed-off-by: Hui Gao
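As an illustration of the fix above, here is a minimal sketch of how the draft model's per-token KV memory could enter the max-token estimate. The function and parameter names are hypothetical; the only idea taken from the commit is that the draft model's per-token KV size is added to the target model's.

```python
def max_kv_cache_tokens(free_bytes: int,
                        target_bytes_per_token: int,
                        draft_bytes_per_token: int = 0) -> int:
    """Estimate how many tokens fit in the KV cache memory budget.

    With speculative decoding, each token needs KV storage in both the
    target and the draft model, so the two per-token costs are summed.
    """
    per_token = target_bytes_per_token + draft_bytes_per_token
    if per_token <= 0:
        raise ValueError("per-token KV size must be positive")
    return free_bytes // per_token
```

Leaving out the draft model's term overestimates the token budget, since the same memory pool has to hold both caches.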
* refactor: remove cumLogProbs and logProbs from DecoderBuffers
- Eliminated cumLogProbs and logProbs from DecoderBuffers, streamlining the buffer management.
- Updated related code in decoderBuffers.cpp and bindings.cpp to reflect these changes, ensuring that only host pointers are used for log probabilities.
These modifications enhance code clarity and maintainability by reducing redundancy in buffer management.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: streamline sequence length handling in GptDecoderBatched and StatefulGptDecoderBatched
- Updated GptDecoderBatched to directly use output.sequenceLengths for lengths assignment, removing unnecessary reshaping.
- Adjusted StatefulGptDecoderBatched to ensure sequence lengths are correctly shaped based on actual batch size and max beam width.
These changes enhance clarity and maintainability in the decoding process.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: integrate DecoderState for sequence length management in decoding process
- Updated DecoderBuffers to remove direct handling of sequence lengths, now utilizing DecoderState for this purpose.
- Adjusted MakeDecodingBatchInputOutput to accept DecoderState, enhancing clarity in the decoding input/output management.
- Refactored GptDecoderBatched and StatefulGptDecoderBatched to streamline sequence length handling, ensuring consistency across the decoding workflow.
refactor: update SlotDecoderBuffers to manage sequence lengths directly
- Introduced sequenceLengths and sequenceLengthsHost to SlotDecoderBuffers for better management of sequence lengths.
- Refactored asyncSend and recv methods to utilize the new sequenceLengths member, enhancing clarity and reducing redundancy.
- Updated TrtGptModelInflightBatching to align with the new structure, ensuring consistent handling of sequence lengths across the decoding process.
These changes improve maintainability and streamline the decoding workflow.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Delegate to asyncSend method in SlotDecoderBuffers
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
---------
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* feat: Add NVFP4 UB pattern optimization pass in torch compile
* Add an additional flag for UB fp4 pattern to avoid inverting the scale
* Add NVFP4 related UB patterns
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
* Update atol, since some points fail on B200 umbriel.
Signed-off-by: liji-nv <59594262+liji-nv@users.noreply.github.com>
---------
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
Signed-off-by: liji-nv <59594262+liji-nv@users.noreply.github.com>