update waive list
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
* tests: skip writing prepare_dataset output to logs
Signed-off-by: Ruodi <200874449+ruodil@users.noreply.github.com>
* test: add llama_v3.1_8b_fp8, llama_v3.1_405b, and llama_nemotron_49b models to the perf test, and change the original llama models' dtype from float16 to bfloat16 per README.md
Signed-off-by: Ruodi <200874449+ruodil@users.noreply.github.com>
---------
Signed-off-by: Ruodi <200874449+ruodil@users.noreply.github.com>
Signed-off-by: Larry <197874197+LarryXFly@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
* disable overlap in encoder
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* feat: invokeGatherBatch
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* feat: overlap same batch
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* chore: add enableTrtOverlap to ExecutorConfig
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* disable overlap for beam search and spec decode
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* skip overlap tests with beam search or speculative decoding
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* moveFinishedContextRequestsToGeneration and skip unfinished requests in updateRequests
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* enable overlap in GptChunkedLongContextTests
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* feat: Enable overlap in gptManagerBenchmark
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* feat: Improve early exit
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Use OptionalRef for newOutputTokens tensor
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* feat: Add overlap scheduling support to TRTLLMDecoder
- Updated TRTLLMDecoder to accept an `enable_overlap_scheduler` parameter.
- Modified the decoder's internal logic to utilize the overlap scheduling feature.
- Adjusted the sequence lengths handling to ensure compatibility with the new scheduling approach.
- Enhanced unit tests to include cases for the overlap scheduler with the TRTLLMDecoder.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* fix: allNewTokens in PP
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
---------
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* Remove stdout pipe for genai-perf and make stress time a public parameter.
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
* Update llmRequest based on comment.
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
* launch process function refactor.
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
---------
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
* Fix AllReduce kernel hang issue when both tp and pp are enabled.
Allocate one workspace for each pp rank to avoid a potential race.
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
* update waive list
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
---------
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
* Remove the WAR for incomplete test items
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
* Complete test item manually
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
* Fix another test definition file
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
* Complete test name
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
* Fix some other test names
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
* Fix another test name after rebase
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
* Update the waived case name, too
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
* Fix name for multi-gpu tests
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
* Fix test name after rebase
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
* Fix another test name
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
* Fix typo
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
* Fix test name after rebase
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
* Fix other qa tests
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
* Fix tests name after rebase
Signed-off-by: qqiao <qqiao@nvidia.com>
* Fix name after rebase
Signed-off-by: qqiao <qqiao@nvidia.com>
* Correct test names in waive.txt
Signed-off-by: qqiao <qqiao@nvidia.com>
* Add new test_durations file
Signed-off-by: qqiao <qqiao@nvidia.com>
* Fix names after rebase
Signed-off-by: qqiao <qqiao@nvidia.com>
* Update test duration to latest
Signed-off-by: qqiao <qqiao@nvidia.com>
---------
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
Signed-off-by: qqiao <qqiao@nvidia.com>
* refactor: Move ModelSpec from tests to core library
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Move ModelSpec from runtime to a separate dir
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Use new bindings path and clean up
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* chore: Updated licenses
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* chore: Remove script_dir from path
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
---------
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* feat: Add rename_weights_with_regex function for dynamic weight key renaming
Introduced a new utility function to rename weight keys in a dictionary based on regex pattern matching. This allows for flexible mapping of keys from Hugging Face naming conventions to TRT-LLM naming conventions, enhancing model compatibility and usability.
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
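A minimal sketch of what such a regex-based renamer can look like (the actual rename_weights_with_regex signature in TRT-LLM may differ; the pattern map below is hypothetical):
```python
import re

# Map HF-style key patterns to TRT-LLM-style replacements; patterns are examples.
def rename_weights_with_regex(pattern_map: dict, weights: dict) -> dict:
    renamed = {}
    for key, value in weights.items():
        new_key = key
        for pattern, repl in pattern_map.items():
            if re.search(pattern, key):
                new_key = re.sub(pattern, repl, key)
                break  # first matching pattern wins
        renamed[new_key] = value
    return renamed

# Example: strip a hypothetical HF "vision_tower.vision_model." prefix.
# renamed = rename_weights_with_regex({r"^vision_tower\.vision_model\.": ""}, hf_weights)
```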
* feat: Implement SiglipVisionModel and related components
Added the SiglipVisionModel along with its associated classes, including SiglipAttention, SiglipEncoderLayer, and SiglipEncoder.
Additionally, a new test suite for the SiglipVisionModel has been created to ensure compatibility with Hugging Face outputs.
Currently, SiglipVisionModel supports batch sizes larger than one. Input and output shapes also match HF for compatibility.
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
* feat: Add CLIPVisionModel and associated components
Introduced the CLIPVisionModel along with its related classes, including CLIPAttention, CLIPEncoderLayer, CLIPEncoder, and CLIPVisionTransformer. This implementation aligns with Hugging Face's CLIP architecture, ensuring compatibility in input and output shapes.
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
* feat: Enhance CLIPVisionModel with attention metadata preparation and unit tests
Updated the CLIPVisionModel to include a method for preparing attention metadata, simplifying the model's usage. Additionally, added a comprehensive unit test suite for the CLIPVisionModel, ensuring compatibility with Hugging Face outputs and validating model performance across various scenarios.
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
* feat: Refactor SiglipVisionModel with attention metadata preparation and update unit tests
Enhanced the SiglipVisionModel by adding a method to prepare attention metadata, streamlining its usage. Updated unit tests to validate model performance and compatibility with Hugging Face outputs, including adjustments to the configuration and test scenarios.
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
* refactor: Remove unused rotary_emb parameter from CLIP and Siglip attention classes
Eliminated the rotary_emb parameter from the CLIPAttention and SiglipAttention classes to streamline the code. Updated unit tests to reflect changes in the model configurations, including clarifications in the default configurations sourced from Hugging Face.
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
* feat: Integrate CLIPVisionModel into LlavaNextInputProcessor and enhance weight loading
Added CLIPVisionModel to the LlavaNextInputProcessor for improved vision processing. Updated the model loading mechanism to ensure compatibility with the new vision model and added attention metadata preparation. Removed debug print statements from weight renaming function for cleaner code.
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
* refactor: Remove unused max_position_embeddings from CLIPAttention and update Siglip classes to use CLIP components
Removed the unused max_position_embeddings variable from the CLIPAttention class. Updated the Siglip classes to utilize CLIP components, specifically replacing SiglipEncoder and SiglipAttention with their CLIP counterparts, streamlining the codebase and enhancing consistency across models.
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
* refactor: Consolidate weight loading logic into a shared implementation
Refactored the weight loading process across CLIP and Siglip models by using a new utility function, _load_weights_impl, to streamline the loading mechanism. This change enhances code maintainability and reduces redundancy in weight handling, ensuring consistent behavior across different model architectures.
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
* refactor: Simplify output handling in CLIP and Siglip models by removing output_hidden_states parameter
Removed the output_hidden_states parameter from the CLIPEncoder and SiglipVisionTransformer classes, streamlining the output handling process. Updated the corresponding unit tests to reflect these changes and ensure compatibility with the new output structure.
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
* feat: Enhance LlavaNextInputProcessor with dynamic model loading and memory optimization
Updated the LlavaNextInputProcessor to support dynamic model loading from local paths or Hugging Face, improving memory efficiency by partially loading the model components. Integrated the LlavaNextMultiModalProjector and adjusted weight loading to ensure compatibility with the new architecture.
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
---------
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>
* feat: Add group_rms_norm kernel to normalize multiple inputs in a single operator.
Previously, the RMSNorm implementation only supported a single input tensor. With group_rms_norm, multiple tensors can be normalized together:
```python
input_a, input_b, ... = group_rms_norm([input_a, input_b, ...])
```
All input tensors must share the same batch dimension. The kernel partitions work by dynamically assigning warp groups proportional to the last dimension of each input, improving launch efficiency and reducing overhead.
This MR provides two implementations:
- GroupRMSNormKernel: optimized for small-to-medium batch sizes
- GroupRMSNormKernelLargeBatch: contains additional optimizations for large batch sizes
Both kernels are currently exposed as custom PyTorch ops. A future MR will implement heuristic-based kernel selection and expose a unified interface.
Signed-off-by: Simeng Liu <simengl@nvidia.com>
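For reference, the per-tensor semantics the fused kernel implements can be sketched in plain PyTorch (an unfused reference, assuming per-input weight tensors and an eps parameter; not the optimized kernel):
```python
import torch

def group_rms_norm_ref(inputs, weights, eps: float = 1e-6):
    # Each input shares the batch dimension but may have a different last dim;
    # normalize each one independently by its RMS over the last dimension.
    outputs = []
    for x, w in zip(inputs, weights):
        variance = x.float().pow(2).mean(dim=-1, keepdim=True)
        outputs.append((x.float() * torch.rsqrt(variance + eps)).to(x.dtype) * w)
    return outputs
```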
* Resolve comments and fix typo with IS_FLASHINFER_AVAILABLE
Signed-off-by: Simeng Liu <simengl@nvidia.com>
---------
Signed-off-by: Simeng Liu <simengl@nvidia.com>
When the input size is larger than the max workspace size, we fall back to NCCL plus the corresponding pre/post functions to ensure AllReduce functionality.
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
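Sketched as control flow (names and signatures here are illustrative, not the actual plugin code):
```python
def allreduce_with_fallback(tensor, workspace_bytes, fused_allreduce, nccl_allreduce,
                            pre_fn=None, post_fn=None):
    # If the input exceeds the preallocated workspace, fall back to NCCL and run
    # the same pre/post processing the fused path would have fused in.
    if tensor.numel() * tensor.element_size() > workspace_bytes:
        if pre_fn is not None:
            tensor = pre_fn(tensor)
        out = nccl_allreduce(tensor)
        return post_fn(out) if post_fn is not None else out
    return fused_allreduce(tensor)
```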
* support lp in pytorch backend
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
* fix tp
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
---------
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
* support return logprob in llmapi
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
* update and add test
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
* stability test
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
* revert removal of old flag
Signed-off-by: Erin Ho <erinh@nvidia.com>
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
---------
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
Signed-off-by: Erin Ho <erinh@nvidia.com>
* Add a new param to LlmRequest and Request to natively support mm
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* update comment
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* Update tests to match the new LlmRequest constructor parameters
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* Modify unit test and mm_embedding's dict name in llama4
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* Fix based on comments
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* Fix comment
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* Fix LlmRequest initialization in kvCacheManagerTest
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* Clean up code for prompt_tuning_config
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* Clean up prompt_tuning_config in GenerationRequest
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
---------
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>
* Replace deepseek_allreduce op with the new unified allreduce op and moe_allreduce op.
* Minor revision of moe_allreduce op argument names.
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
* Squash of dev commits
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
* Add timer + waive test with suspected GptSession bug
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
* Respond to reviewer comments
Signed-off-by: domb <3886319+DomBrown@users.noreply.github.com>
---------
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
Signed-off-by: domb <3886319+DomBrown@users.noreply.github.com>
test: add test cases for 0.19 release (#3608)
* fix test name
* add quickstart test for nemotron-ultra
* add rcca multi-node test case for deepseek-v3
* add rcca info
---------
squash (#3642)
fix: nvbugs/5187237: fix deterministic mode crash (#3448)
* nvbugs/5187237 nvbugs/5112075: fix deterministic mode error
* remove waive
* Revert "remove waive"
This reverts commit 0bf5486d19906d692bfb7a6262333c296b0087ac.
* revert ar fusion
---------
update fp8 doc (#3647)
tests: change qa perf test to trtllm-bench (#3619)
fix: FP8 quantized lm_head (NvBug 5214229) (#3567)
infra: Add PR approval protection for the release branch (#3634)
fix: nvbugs/5231298: pytorch allreduce issue (#3673)
Fix: nvbugs/5222698 variable not defined (#3630)
* Fix: nvbugs/5222698 variable not defined
* Tidy code
---------
test: sync waives.txt from main branch by disabling test_perf/gpt_350m-cppmanager case (#3685)
test: restore fp8 kv cache testing for L0 (#3671)
doc: Update DeepSeek perf docs (#3693)
* Update DeepSeek perf docs
* update
* Apply suggestions from code review
---------
tests: waive test_llm_multi_node (#3664)
fix: update test_user_buffers_mm_add_prologue atol (#3711)
Fix: cherry-pick hmac encryption from main branch (#3635)
* security fix cherry-pick changes from main
* fix hmac in remote mpi session (#3649)
---------
Un-waive DS-V3-Lite tests. (#3621)
fix: FP8 kv accuracy (#3675)
* fix FP8 kv accuracy
* update doc
---------
Fix script options for engines. (#3622)
unwaive multi-node test (#3721)
chore : Split more tests out of gpt tests (#3524) (#3674)
doc: add torch examples link into torch backend documentation (#3749)
test: Get Eagle tests working (#3593) (#3722)
Waive L0 test (#3756)
waive failed case in perf test, change default max_batch_size to 512 and write config.json to output log (#3656)
Update ds v3 parameters in stress test. (#3676)
waive gemma on L20 (#3766)
https://nvbugs/5141291: Fix convert.py script for Qwen model. (#3758)
Include Qwen2VLDecoderLayer in the smooth_qwen2_model function.
fix: PP4 fixes and cleanup (#3688)
remove benchmark test list (#3643)
skip disagg deepseek test if sm!=90 (#3720)
test: skip failed cases on B200 (#3710)
* add skip condition to tests
* fix error
---------
test: [nvbug: 5234494] skip_pre_ada for fp8 cases (#3718)
* skip_pre_ada for fp8 cases
* update
* update after rebase
---------
add known issue to deepseek doc. (#3800)
Fix ModelOpt Mixtral AWQ OOM (#3714) (#3761)
Waive L0 tests (#3826)
fix: Reduce memory usage in fused moe op associated with AutoTuning and fix moe fallback issue. (#3793)
* Reduce memory usage in fused moe op associated with AutoTuning.
* Replace pre-defined bucket size strategy with a generating function based on the tune_max_num_tokens.
* Add free_memory logic of workspace in min_latency_mode fused moe path.
* Fix fused_moe fallback issue. (#3652)
min_latency_mode is only set to False during the warmup phase. Thus, when it becomes True during inference, all tactics fall back to the default one, causing a perf regression.
---------
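A plausible shape for such a generating function (a sketch only; the actual bucketing policy in the fused moe op may differ):
```python
def generate_tuning_buckets(tune_max_num_tokens: int) -> list:
    # Power-of-two buckets up to the tuning cap, with the cap itself included,
    # instead of a fixed pre-defined bucket list.
    buckets, n = [], 1
    while n < tune_max_num_tokens:
        buckets.append(n)
        n *= 2
    buckets.append(tune_max_num_tokens)
    return buckets

# generate_tuning_buckets(8192) -> [1, 2, 4, ..., 4096, 8192]
```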
[doc] Better document for Draft-Target-Model (DTM) speculative decoding (#3797)
Fix pre-commit
Fix again
Address some review comments for the MI
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
* infra: install Triton in the base image
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
* install Triton from the base image
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
* update base image
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
* Address review comments
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
* update base image
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
* waive test
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
---------
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
* refactor: Fix headsize 72 attention error for TRTLLM attn backend in PyTorch workflow
- Remove the head size pre-check logic in AttentionOp because head size 72 can be supported with fmha kernels.
- Added support for head size 72 in unfused attention kernels (QKVPreprocessing).
- Enhanced unit tests by introducing a scenario generation function for better test coverage of attention configurations (including head size 72).
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
* update: Waive head_dim=72 test cases and enhance test representation
- Added a waiver for head_dim=72 cases on post-SM100 GPUs in the test suite to address known issues.
- Introduced a custom __repr__ method in the Scenario class for pytest substring matching.
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
---------
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
* add MNNVL memory mapping support
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
* add more MPI environment for trtllm-llmapi-launch
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
* add MoE communication and prepare kernels
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
* add MNNVL AlltoAll support for DeepSeekV3
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
* add output dump for throughput benchmark
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
* support dynamic kernel launch grid
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
* address review comments
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
* address review comments #2
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
---------
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
* reorganize some unit tests of PyTorch
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
* fix ci
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
---------
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
* Use updateDecoderBuffers in python decoder.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Fix synchronize in trtllm decoder.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Enable by default.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Use guided_decoder to setup seqslots and free them.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Use always decode_async and update_requests.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Update decoder buffers.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Fix speculative decoding tests.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Send new_tensors_host instead of assuming dict.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Make default False in enable_trtllm_decoder.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Partially fix mtp, partially fix py_executor.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Update request states before sending disagg ctx cache.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Fix disagg test for torch decoder.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Make isend_tensor_list and recv_tensor_list for sending the tensors_host.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Formatting.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Fix rebase.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Add disagg serving case to guided decoder.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Get overlap scheduling to work.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Update cutlass to main.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Update after rebasing.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Formatting.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Update to use decode async and update requests.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Properly pass information to update_requests
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Formatting.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Make disaggregated serving a step closer to working.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Fix rebase.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Fix rebase and format.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Copy new device tokens more pythonic.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Restore MTP add dummy reqs.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Add ordereddict import to py_executor.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Formatting.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Added seq slot manager. Add test.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Use single-tensor transmission except when a list of tensors is received.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Add TRTLLMDecoder allocation to estimate max kv cache tokens.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Add stream synchronization
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Formatting.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Make the decoder memory calculation adapt to the chosen decoder. Recognize the decoder option passed in ExecutorConfig. Make the overlap scheduler test run on TinyLlama.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Format
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Add decoder creation to estimate max kv.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Formatting.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Update submodule UCXX inline with main.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
---------
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* fix: nvbugs/5234029 fix Qwen2.5-VL image test case by adding more answer candidate
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
* remove qwen2.5_vl from waive list
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
---------
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
* add passing E2E LoRA flow
Signed-off-by: Shahar Mor <smor@nvidia.com>
* add experimental feature
Signed-off-by: Shahar Mor <smor@nvidia.com>
* fix llma_args definition
Signed-off-by: Shahar Mor <smor@nvidia.com>
* manually decreased the max LoRA size to address OOM
Signed-off-by: Shahar Mor <smor@nvidia.com>
---------
Signed-off-by: Shahar Mor <smor@nvidia.com>
* generalizing cudagraph to multiple dynamic inputs
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
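Conceptually, generalizing to multiple dynamic inputs means capturing one graph per shape bucket and refreshing every captured input buffer before replay. A minimal PyTorch sketch (illustrative; the actual implementation differs):
```python
import torch

graphs, static_in, static_out = {}, {}, {}

def capture(fn, key, example_inputs):
    # Capture one CUDA graph per shape bucket; keep the captured buffers alive.
    static_in[key] = [x.clone() for x in example_inputs]
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        static_out[key] = fn(*static_in[key])
    graphs[key] = g

def replay(key, inputs):
    # Copy *all* dynamic inputs into their captured buffers, then replay.
    for dst, src in zip(static_in[key], inputs):
        dst.copy_(src)
    graphs[key].replay()
    return static_out[key]
```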
* fix for failing test
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
---------
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
* Add test stages for sm120
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
* Update chip name and config name
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
* Split tests to gb202 and gb203
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
* Don't flash driver for rtx-5090
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
* Skip the failed cases
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
* Change the test stage names
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
* Reduce 5080 jobs and add back gpu list which doesn't support dynamic driver flashing
Signed-off-by: qqiao <qqiao@nvidia.com>
* Skip failed case on gb202
Signed-off-by: qqiao <qqiao@nvidia.com>
* Fix condition for dynamic driver flashing
Signed-off-by: qqiao <qqiao@nvidia.com>
---------
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
Signed-off-by: qqiao <qqiao@nvidia.com>
* Rewrite unit test for unified allreduce op. Removing the legacy unit test.
* Revise formats, fusion_op bindings. Put all tensors as optional inputs.
* Move the MoeAllreduceOp to a separate custom op.
* Move all the fusion patterns to the new version of the AllReduce fusion kernel. Remove the AllReduce strategy config. Revise the AllReduce strategies and fusion pattern definitions.
* Add more TODOs, fixing minor bugs, and remove legacy code.
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
Signed-off-by: Ruodi <200874449+ruodil@users.noreply.github.com>
Signed-off-by: Larry <197874197+LarryXFly@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
* Fix hang bug when KV cache is low
Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>
* Review comments
Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>
* Fix attentiondp typo
Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>
* Add CI test for this case
Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>
* fix: Fix the insertion order for responder futures
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
* fix: Fix disagg CPP
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
---------
Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
* add llama3.2 ptp test case
Signed-off-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com>
* update test list
Signed-off-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com>
---------
Signed-off-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com>
* Waive L0 tests
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
* the test is fixed in PR 3711
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
---------
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
* feat: Integrate GPUDirect Storage (GDS) into Executor API
Squash of several dev commits
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
* Add step to generate new duration file
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
* Install python in earlier step
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
* Clone repo and add debug info
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
* Remove debug info and only generate duration for post-merge
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
* Test for the new duration file
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
* Update the duration file format
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
* Move generate_duration.py to the scripts folder and add a try-catch to avoid any breakage
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
---------
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
* feat: adding multimodal (only image for now) support in trtllm-bench
Signed-off-by: Rakib Hasan <rhasan@nvidia.com>
* fix: add in load_dataset() calls to maintain the v2.19.2 behavior
Signed-off-by: Rakib Hasan <rhasan@nvidia.com>
* re-adding prompt_token_ids and using that for prompt_len
Signed-off-by: Rakib Hasan <rhasan@nvidia.com>
* updating the datasets version in examples as well
Signed-off-by: Rakib Hasan <rhasan@nvidia.com>
* api changes are not needed
Signed-off-by: Rakib Hasan <rhasan@nvidia.com>
* moving datasets requirement and removing a missed api change
Signed-off-by: Rakib Hasan <rhasan@nvidia.com>
* addressing review comments
Signed-off-by: Rakib Hasan <rhasan@nvidia.com>
* refactoring the quickstart example
Signed-off-by: Rakib Hasan <rhasan@nvidia.com>
---------
Signed-off-by: Rakib Hasan <rhasan@nvidia.com>
* implement variable window attention by breaking the block manager into window block managers per window size
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* revert isCyclic to be true if the min attention window is reached, not per window size
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* add explanatory comment to mCyclicThreshold
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* load correct gemma config
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* don't shadow inputLength in addSequence - it should remain the function scope input length between window size loop iterations
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* fix KVCacheManagerVariableWindowAttentionWithReuseTest for multiple window block managers
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* if TYPE_CHECKING
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* set temp_attention_window_inputs to None explicitly
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* set temp_attention_window_inputs to None explicitly
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* pass dtype as well
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* test_gemma variable sliding window attention
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* allot a fraction of primary/secondaryBlocks to different window size heaps, depending on the window size's total contribution to the kvcache size (i.e., including all layers)
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
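A simplified sketch of that proportional split (assuming we know, per window size, how many layers use it; the real implementation works on the block managers directly):
```python
def split_primary_blocks(total_blocks: int, layers_per_window: dict) -> dict:
    # Weight each window size by its total KV-cache contribution across layers,
    # i.e. window_size * number_of_layers_using_it.
    contribution = {w: w * n for w, n in layers_per_window.items()}
    total = sum(contribution.values())
    return {w: max(1, total_blocks * c // total) for w, c in contribution.items()}

# Example: 1000 blocks, 10 layers with window 4096 and 2 layers with window 32768.
# split_primary_blocks(1000, {4096: 10, 32768: 2})
```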
* remove || mEnableBlockReuse, which erroneously triggered the beam search code path for cyclic variable attention windows
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* turn off request delaying for MaxUtil
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* make comments better
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* windowSizesTotalSum using std::accumulate
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* fix error handling of forwardAsync: the cleanup code in the catch-all handler that runs terminateRequest can also fail and must itself be caught
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* fix comments
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* remove assert that kills disagg tests, since it isn't necessary
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* fix corrupted expression: 'isNewTask && (peftCacheManager ?' -> '(isNewTask && peftCacheManager) ?', which had broken the boolean logic. Main is correct
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* add Gemma3 to SUPPORTED_HF_ARCHITECTURES
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* support Gemma3
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* finally fix test_gemma - always spread at least {} into generate_summary_cmd, never None
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* finally fix test_gemma - always spread at least {} into generate_summary_cmd, never None
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* fix kvfactor field for deepseek
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* fix comment
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* fix gemma-3 entries in testlist to include vswa
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* only quantize gemma2 VSWA
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* remove misleading comment
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* fix test_gemma
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* fix test_gemma
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* fix test_gemma
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* in sendRequestInfo, fromOldAllocatedBlockIds->fromOldAllocatedBlockIds, like in main
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* fix: disable KV cache reuse if using attention sink (#3021)
* fix: disable KV cache reuse if using attention sink
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* fix: disable KV cache reuse if sink bubble
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* add comment
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
---------
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
---------
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
Co-authored-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* added files for nemotron-h
Signed-off-by: Luis Vega <lvega@nvidia.com>
* use try/except to import RMSNorm
Signed-off-by: Luis Vega <lvega@nvidia.com>
---------
Signed-off-by: Luis Vega <lvega@nvidia.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
* [Infra][TRTLLM-4063] - Branch out for the TRT-LLM v0.18.0 release
Signed-off-by: Zhanrui Sun <zhanruis@nvidia.com>
(cherry picked from commit de90312020e51c22ba5e75b3502c7ee90c059265)
* [Infra][TRTLLM-3652] - Update dependencies to TRT 10.9 / CUDA 12.8.1 / DLFW 25.03 (Internal)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
(cherry picked from commit 58db1340ef7db22f1910f878d220a92be5b830d1)
* [None][Doc] - Update docs for v0.18.0
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
(cherry picked from commit d23e75bc95619ce3b116213d55319272888e0c88)
* [Infra] - Fix or WAR issues in the package sanity check stages
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
(cherry picked from commit e874e2b127515c52ba10c8df1cc2631627f74ffe)
* [https://nvbugs/5173454] [https://nvbugs/5173432] [https://nvbugs/5175863] fix chatglm tokenizer and tmp model path
Signed-off-by: Yuki Huang <yukih@nvidia.com>
(cherry picked from commit 731811d4e182d70a66193d646152cb71dfafe83a)
* cherry-pick 'test: Update cluster and multi node test lists and trtllm-bench' test to fix perf drop issue
Signed-off-by: Ruodi Lu <ruodil@nvidia.com>
(cherry picked from commit 5214616283fbc15ae98871a1d84c78d8e1f2e6e8)
* Revert "Merge branch 'user/yukih/fix_5173454_5173432' into 'release/0.18'"
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
(cherry picked from commit 8d34831cb2b81ee2dfa8021b68e7158b33789a5f)
* [Infra] Restrict setuptools version to avoid sasb pip install issue
Signed-off-by: Emma Qiao <qqiao@nvidia.com>
(cherry picked from commit 1e60ad29e0dafec0e295bedb5d89b716a02a707c)
* [https://nvbugs/5173454] [https://nvbugs/5173432] [https://nvbugs/5175863] fix chatglm tokenizer and tmp model path
Signed-off-by: Yuki Huang <yukih@nvidia.com>
(cherry picked from commit 3ed8164e5bfea1d5aa2039b5408439fd6cf59dac)
* WAR for bug 5173448
Signed-off-by: Thor Johnsen <tjohnsen@nvidia.com>
(cherry picked from commit b6528b2ba15322b6c6a4c81a8b74c04d4973de4f)
* [Infra][TRTLLM-3652] - Update dependencies to CUDA 12.8.1 / DLFW 25.03
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
(cherry picked from commit 6560983d132d9d257ee15849664eb055e94adaa9)
* [Docs] - Doc changes for v0.18.0
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
(cherry picked from commit 26769b61218a947c8f9d070f73b63d576fcc20c4)
* [Doc] - Doc change for v0.18.0
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
(cherry picked from commit 4b3b5ed6bfbc2300e3775fe75456083faad7b235)
* [Infra] update version to 0.18.1
Signed-off-by: Zhanrui Sun <zhanruis@nvidia.com>
(cherry picked from commit 59e8326c75639275837d34de8e140358737a3365)
* Add back nemotron file.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Fix recurrentgemma reqs.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Adding WAR for bug 5173448.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Formatting.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Remove duplicated file.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Update examples/prompt_lookup/requirements.txt
Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
Signed-off-by: Daniel Cámpora <961215+dcampora@users.noreply.github.com>
* Remove glm-4-9b from model dir in chatglm test.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Remove indent change.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Apply suggestions from code review
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
Signed-off-by: Daniel Cámpora <961215+dcampora@users.noreply.github.com>
* Apply suggestions from code review
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
Signed-off-by: Daniel Cámpora <961215+dcampora@users.noreply.github.com>
* Revert changes on l0_test.groovy.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Update dev images
Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
* Remove duplicated import.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Fix custom op
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
* Fix flashinfer & vanilla backend
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
* Skip problematic case.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Skip problematic test_moe_w4a8_1_14336_4096_8_bfloat16_True_False case.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
---------
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
Signed-off-by: Daniel Cámpora <961215+dcampora@users.noreply.github.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
Co-authored-by: Zhanrui Sun <zhanruis@nvidia.com>
Co-authored-by: Yiqing Yan <yiqingy@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
Co-authored-by: Yuki Huang <yukih@nvidia.com>
Co-authored-by: Ruodi Lu <ruodil@nvidia.com>
Co-authored-by: Emma Qiao <qqiao@nvidia.com>
Co-authored-by: Thor Johnsen <tjohnsen@nvidia.com>
Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
Co-authored-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
Co-authored-by: Tao Li @ NVIDIA <tali@nvidia.com>
Because it duplicates test_fp4_linear. Also, the cpp profiler has already been unified with the new AutoTuner.
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
* add dgx_h200 tests
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
* test
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
* fix pre-commit
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
* fix
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
* fix
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
* change bsl branch
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
* fix
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
* change multi gpu related file list
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
---------
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
* fix: Fixing issue with first gen token being returned twice with streaming
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
* Fixing not_expectring_strings in test
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
---------
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
* Add numNodes to ParallelConfig
If not provided, attempt to find the number of nodes by counting the ranks whose local rank is 0.
Update the device IDs check accordingly.
Signed-off-by: Aurelien Chartier <achartier@nvidia.com>
* Add ParallelConfig pickle test
Signed-off-by: Aurelien Chartier <achartier@nvidia.com>
---------
Signed-off-by: Aurelien Chartier <achartier@nvidia.com>
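A minimal sketch of that fallback, assuming an MPI launch and mpi4py (the actual implementation lives in the runtime and may differ):
```python
from mpi4py import MPI

# Count one rank per node: split the world communicator by shared-memory domain,
# then sum an indicator for each domain's local rank 0.
comm = MPI.COMM_WORLD
local_comm = comm.Split_type(MPI.COMM_TYPE_SHARED)
num_nodes = comm.allreduce(1 if local_comm.Get_rank() == 0 else 0, op=MPI.SUM)
```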
* refactor: Update ExecutorConfig to use AdditionalModelOutput type
- Changed function signatures and member variables across multiple files to replace std::optional<std::vector<std::string>> with std::optional<std::vector<executor::AdditionalModelOutput>> to include gatherContext flag for each additional output.
- Updated related serialization and deserialization methods to accommodate the new type.
- Adjusted tests to reflect the changes in the output handling structure.
This refactor enhances the flexibility and maintainability of the output configuration in the executor and batch manager components.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Remove equality operator from TrtGptModelOptionalParams
- Deleted the operator== implementation from TrtGptModelOptionalParams to simplify the class.
- Updated the pybind11 bindings to remove the exposure of the equality operator to Python.
This change streamlines the class definition and reduces unnecessary complexity in the bindings.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Enhance copyAdditionalOutputs to utilize AdditionalModelOutput
- Updated the copyAdditionalOutputs function to accept a vector of AdditionalModelOutput, allowing for the inclusion of the gatherContext flag.
- Adjusted the logic to handle context and non-context outputs separately, improving the output handling mechanism.
- Modified related unit tests to incorporate the new gatherContext parameter, ensuring comprehensive testing of the updated functionality.
This refactor improves the flexibility and clarity of output management in the batch processing workflow.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Introduce findOutputTensor utility function for output tensor retrieval
- Added a new utility function, findOutputTensor, to encapsulate the logic for finding output tensors and checking their validity.
- Refactored copyAdditionalOutputs to utilize findOutputTensor, reducing code duplication and improving clarity.
- Enhanced error checking for additional context and generation output tensors.
This change streamlines the output tensor retrieval process, enhancing maintainability and readability in the batch processing workflow.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Check final indices of additional output tensors and update tests
- Added checks to verify the final indices of additional output tensors for context and generation outputs.
- Updated unit tests to verify the changes.
- Add lastTokenIds input tensor to test engines.
- Logits output depends on gatherContextLogits parameter.
- Removed gatherContextOutputs parameter from the validate method in LlmRequest.
- Context outputs do not depend on computeContextLogits parameter.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* fixup! refactor: Check final indices of additional output tensors and update tests
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* fixup! refactor: Update ExecutorConfig to use AdditionalModelOutput type
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* fixup! refactor: Remove equality operator from TrtGptModelOptionalParams
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* docs: Update executor.md
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* chore: Clean up includes
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
---------
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
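For orientation, a Python-side sketch of the new descriptor's shape (the real type is the C++ executor::AdditionalModelOutput exposed via bindings; this dataclass is only illustrative):
```python
from dataclasses import dataclass

# Illustrative mirror of executor::AdditionalModelOutput: an output name plus a
# gatherContext flag, replacing the previous plain list of name strings.
@dataclass
class AdditionalModelOutput:
    name: str
    gather_context: bool = False  # also gather this output for context tokens

additional_outputs = [AdditionalModelOutput("logits", gather_context=True)]
```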
* add test to map flashinfer rope op with triton custom rope ops and pytorch rope in fused_mha
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* add rope matcher and unit tests
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* capture cos and sin from graph
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* revert fuse_mha op change
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* minor update to address comment and remove redundant unit test
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* move view and transpose into graph nodes and update unit test to test custom op directly
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* move view into custom op, update bfs with bound, update custom op return type to be half precision
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* custom op update to support 3D input
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* handle bnsd and bsnd format, update tests, handle 3D cos/sin input to the custom op
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* add llama4 rope test, update custom op with is_neox flag
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* add llama4 style rope to matcher and update unit test
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* separate into two transformations
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* fix when num_head != num_kv_head; add support for cached position_ids and cos_sin_cache in graph; update unit tests
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* minor update, cache locally and propagate meta info of qk nodes
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* minor: fix cos_sin_cache not float
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* minor: move cache into matcher
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
---------
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* feat: Add NVFP4 UB pattern optimization pass in torch compile
* Add an additional flag for the UB fp4 pattern to avoid inverting the scale
* Add NVFP4 related UB patterns
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
* Update atol; some points fail for B200 umbriel.
Signed-off-by: liji-nv <59594262+liji-nv@users.noreply.github.com>
---------
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
Signed-off-by: liji-nv <59594262+liji-nv@users.noreply.github.com>
* feat: trtllm-gen fp4 GEMM
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
* Clean up
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
* Remove incorrect header
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
* Reviewer comment
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
---------
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
* Rename nvsmall to nemotron NAS
* Revert nvsmall to nemotron_nas rename in paths in tests that access llm_models_root/nvsmall/tests
* Add NemotronNAS to pytorch supported models table
Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>
* Update some descriptions that are out of date
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
* Apply suggestions from code review
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
---------
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
* test: Add single gpu disaggregated tests
Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>
* Add deepseek with overlap tests
Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>
* Use updated prompt
Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>
* Move test to disaggregated folder
Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>
---------
Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>
* Instead of allocating UserBuffers at the beginning of runtime, UB buffers
are now managed with a global allocator. The allocator dynamically
assigns a free UB buffer or allocates a new buffer for a torch tensor. This makes
userbuffers easier to use.
* In the common use case, the UserBuffers are allocated correctly during the
warm-up stage. There is no dynamic allocation during inference.
* The UB fusion pattern is rewritten using the new UB Allocator. It contains the
following passes:
1. Fuse quant with allreduce, replace it with the UB impl, and insert a
copy_to_userbuffers. Currently the normal allreduce still does not
support FP8 quant, so this needs to be done in the UB pass.
2. Convert all supported allreduces to UB and insert copy_to_userbuffers.
3. Fuse the op before the allreduce with the copy_to_userbuffers, so the op directly
writes to the userbuffer.
4. Remove the userbuffers finalize if the output is connected to another UB
allreduce.
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
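As a rough illustration of the "reuse a free buffer or allocate a new one" policy described above (a minimal sketch; the real allocator hands out registered UserBuffers for torch tensors, and all names here are hypothetical):
```python
class UBAllocatorSketch:
    def __init__(self):
        self.free_buffers = []

    def get(self, nbytes: int) -> bytearray:
        # Reuse the first free buffer that is large enough.
        for i, buf in enumerate(self.free_buffers):
            if len(buf) >= nbytes:
                return self.free_buffers.pop(i)
        # Otherwise allocate a new buffer (in the common case this happens
        # during warm-up, so inference sees no dynamic allocation).
        return bytearray(nbytes)

    def release(self, buf: bytearray) -> None:
        self.free_buffers.append(buf)
```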
* fix: Fix p-tuning test bug
* A change in the vocab_size calculation for T5Tokenizer,
introduced in transformers version 4.34, caused the addition of incorrect vtokens for ptuning:
instead of adding tokens outside the vocabulary, tokens inside the vocabulary were added.
Signed-off-by: Amir Klein <203507526+amirkl94@users.noreply.github.com>
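The intended behavior can be shown in a two-line sketch of the convention (not the test fix itself): virtual prompt-tuning token ids must start at vocab_size so they never collide with real tokens.
```python
import torch

def make_virtual_token_ids(vocab_size: int, num_vtokens: int) -> torch.Tensor:
    # Correct: ids outside the vocabulary. The bug produced ids < vocab_size,
    # i.e. real tokens, after transformers 4.34 changed T5Tokenizer's vocab_size.
    return torch.arange(vocab_size, vocab_size + num_vtokens)
```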
* Several optimizations and fixings on the Autotuner.
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
* Apply the new Python side Autotuner on current linear for nvFP4 data type.
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
* Apply the new Python side Autotuner on MoE op
* Remove routers from cache key to improve inference perf
* Prevent unnecessary code profiling. Use the do_preparation keyword to select which part should be executed before evaluating any tactic.
* Remove the try-catch inside the moe profiling process.
* Move the default tactic -1 to 0 transform into the cpp runner.
* Revise relevant tests.
* Predefined the bucketizing strategy for fused_moe
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
* Add specific_profile support for AutoTuner to bypass the standard cache search process for perf optimization
* Add specific_profile for moe
* Add specific profile for linear
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
* Fixing and revising according to reviewer's suggestions.
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
* Use lru_cache for inference perf optimization.
* Revert gen_custom_cache_key feature
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
* Replace runner with runner id to achieve a serializable cache.
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
* Code clean up and minor fixings.
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
* Move all tunable runners and custom ops into torch_custom_ops.
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
* Treat min_latency_mode as an independent dynamic tensor. Modify get_valid_tactics to suit it.
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
---------
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
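A rough sketch of the serializable-cache idea (runner objects replaced by ids in the key, lru_cache on the hot lookup path; names are illustrative, not the AutoTuner API):
```python
from functools import lru_cache

# Picklable tuning cache: runner *ids* instead of runner objects in the key.
tuning_cache = {}  # (runner_id, op_name, shape_bucket) -> tactic index

@lru_cache(maxsize=None)
def best_tactic(runner_id: int, op_name: str, shape_bucket: tuple) -> int:
    # Assumes tuning_cache is fully populated during warmup, so memoizing the
    # lookup is safe on the inference hot path; 0 is the default tactic.
    return tuning_cache.get((runner_id, op_name, shape_bucket), 0)
```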
* feat: Add option to run disaggregated serving without ctx servers, to benchmark gen only
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
* Fixing comment in sanity check
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
---------
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
* init trtllm attn no cache
Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>
* fix: fix the seq_len issue and attn metadata prepare for qwen reward model test
fix: fix minor bugs after rebase
Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>
* refactor: remove unnecessary debug logs and clean up commented code
refactor: update max_seq_len documentation and remove max_seq_len from the decoder model constructor in PyTorchModelEngine
Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>
* refactor: update calculate_ref_result function to accept tensor inputs and mask type, enhance test_attention_no_cache to support FULL and CAUSAL masks
Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>
* refactor: remove unused BERT attention metadata conversion method and add type assertion for no cache attention in PyTorchModelEngine
Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>
* refactor: remove use_kv_cache parameter from attention function and related classes, update documentation for KV cache handling
Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>
* refactor: implement setAttentionMaskType method for better mask type handling and remove unused conversion function
Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>
* refactor: streamline KV cache handling by replacing direct member access with useKVCache method and simplify token per block assignment
Remove debug code.
Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>
* refactor: Resolve comments for Python code
Simplify no cache attention metadata preparation and streamline related attributes in TrtllmAttentionMetadata
Removed the private method for converting to no cache attention metadata and integrated its logic into the prepare method. Updated the test for BERT sequence classification to reflect these changes and ensure proper handling of attention metadata.
Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>
* docs: Add is_dummy_attention field to attention metadata for simulation operations
Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>
* refactor: add KVCacheParams to attention backend interface and import relevant metadata classes
Updated the attention backend interface to include KVCacheParams and imported TrtllmAttentionMetadata and VanillaAttentionMetadata in model_engine.py for enhanced functionality.
Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>
* fix: fix rebase format issue
Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>
* fix: extend attention mask type handling in MHARunnerFixedParams
Added support for additional attention mask types (BIDIRECTIONAL, BIDIRECTIONALGLM, BLOCKSPARSE) in the MHARunnerFixedParams structure to fix the mapping issue between ContextAttentionMaskType and AttentionMaskType
Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>
* fix: enhance attention mask type handling in TllmGenFmhaRunnerParams
Updated the setAttentionMaskType method to include a switch-case structure for better handling of attention mask types, ensuring proper mapping and error handling for invalid types.
Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>
---------
Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>
This test can cause nondeterministic failures on CI with unexpected kernel profiling results. A longer delay or a cache clear does not solve the issue, so loosen the test checks to avoid these false alarms.
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
* Update fp8 sf layout for blackwell and enable fp8 gemm e2e
* Add test case when m needs to be padded
* Better comment
Signed-off-by: Chang Liu <liuc@nvidia.com>
* Add TODO for fp8 quant kernel
Signed-off-by: Chang Liu <liuc@nvidia.com>
* Enable DCO check
Signed-off-by: Chang Liu <liuc@nvidia.com>
* Fix lint
---------
Signed-off-by: Chang Liu <liuc@nvidia.com>
* Waive for https://nvbugspro.nvidia.com/bug/5189673
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
* Update waive
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
---------
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
* add random image test for llama-3.2-11b-vision
Signed-off-by: Ivy Zhang <yanzh@nvidia.com>
* rename case
Signed-off-by: Ivy Zhang <yanzh@nvidia.com>
---------
Signed-off-by: Ivy Zhang <yanzh@nvidia.com>
Co-authored-by: Larry <larryx@nvidia.com>
CI passed: https://nv/trt-llm-cicd/job/helpers/job/PR_Github/522/