* feat: Reduce branch overhead in groupRMSNorm kernels
* Fix race condition with sm < 90 and avoid all threads in one warp writing to the same shared memory.
Signed-off-by: Simeng Liu <simengl@nvidia.com>
---------
Signed-off-by: Simeng Liu <simengl@nvidia.com>
* Properly get decoding mode according to same logic as cpp.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Cross reference getDecodingMode implementations in pytorch - cpp.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Better bindings for DecodingMode.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Revert to version in main.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Fix.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Revert configuration.py.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
---------
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* disable overlap in encoder
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* feat: invokeGatherBatch
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* feat: overlap same batch
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* chore: add enableTrtOverlap to ExecutorConfig
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* disable overlap for beam search and spec decode
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* skip overlap tests with beam search or speculative decoding
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* moveFinishedContextRequestsToGeneration and skip unfinished requests in updateRequests
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* enable overlap in GptChunkedLongContextTests
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* feat: Enable overlap in gptManagerBenchmark
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* feat: Improve early exit
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Use OptionalRef for newOutputTokens tensor
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* feat: Add overlap scheduling support to TRTLLMDecoder
- Updated TRTLLMDecoder to accept an `enable_overlap_scheduler` parameter.
- Modified the decoder's internal logic to utilize the overlap scheduling feature.
- Adjusted the sequence lengths handling to ensure compatibility with the new scheduling approach.
- Enhanced unit tests to include cases for the overlap scheduler with the TRTLLMDecoder.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* fix: allNewTokens in PP
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
---------
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* Remove stdout pipe for genai-perf and make stress time as public parameter.
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
* Update llmRequest based on comment.
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
* launch process function refactor.
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
---------
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
* feat: Implement synchronous request termination in batch manager
- Added `terminateRequestSync` method to `TrtEncoderModel` and `TrtGptModelInflightBatching` for handling request termination in the next `forwardSync` call.
- Updated existing request termination logic to utilize the new synchronous method, ensuring generated tokens are cleared appropriately.
- Enhanced logging for clarity in token management during request processing.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* fixup! feat: Implement synchronous request termination in batch manager
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* fix: MockedModelCancelRequest
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* fixup! feat: Implement synchronous request termination in batch manager
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* fix: terminate with timeout
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* fixup! feat: Implement synchronous request termination in batch manager
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* docs: Update doc string for allottedTimeMs
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
---------
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Move executor recv functions into classes
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Enhance MPI logging and error handling
- Updated MPI logging to include destination and tag information for better traceability during send and receive operations.
- Added error checking for MPI_Wait and MPI_Cancel calls to ensure proper handling of multi-device requests.
- Improved code structure for clarity and maintainability.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Introduce MpiTag enumeration and update MPI function signatures
- Added a new header file `mpiTags.h` to define an enumeration for MPI tags, improving code readability and maintainability.
- Updated function signatures in `mpiUtils.h` and `mpiUtils.cpp` to use the new `MpiTag` type instead of raw integers for tags.
- Refactored various MPI calls across the codebase to utilize the new `MpiTag` enumeration, enhancing type safety and clarity.
- Removed redundant MPI tag constants from several classes, streamlining the code.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* fixup! refactor: Introduce MpiTag enumeration and update MPI function signatures
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Rename tags for consistency
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
---------
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Move ModelSpec from tests to core library
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Move ModelSpec from runtime to separatedir
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Use new bindings path and clean up
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* chore: Updated licenses
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* chore: Remove script_dir from path
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
---------
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* Move all casters to customCasters.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Use customCasters in all bindings.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Added customCasters to userbuffers.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
---------
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* feat: Add group_rms_norm kernel to normalize multiple inputs in a single operator.
Previously, the RMSNorm implementation only supported a single input tensor. With group_rms_norm, multiple tensors can be normalized together:
```python
input_a, input_b, ... = group_rms_norm([input_a, input_b, ...])
```
All input tensors must share the same batch dimension. The kernel partitions work by dynamically assigning warp groups proportional to the last dimension of each input, improving launch efficiency and reducing overhead.
This MR provides two implementations:
GroupRMSNormKernel: Optimized for small-to-medium batch sizes
GroupRMSNormKernelLargeBatch: Contains additional optimizations for large batch sizes
Both kernels are currently exposed as custom PyTorch ops. A future MR will implement heuristic-based kernel selection and expose a unified interface.
Signed-off-by: Simeng Liu <simengl@nvidia.com>
* Resolve comments and fix typo with IS_FLASHINFER_AVAILABLE
Signed-off-by: Simeng Liu <simengl@nvidia.com>
---------
Signed-off-by: Simeng Liu <simengl@nvidia.com>
* support lp in pytorch backend
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
* fix tp
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
---------
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
* support return logprob in llmapi
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
update and add test
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
stability test
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
* revert removal of old flag
Signed-off-by: Erin Ho <erinh@nvidia.com>
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
---------
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
Signed-off-by: Erin Ho <erinh@nvidia.com>
* Add a new param to LlmRequest and Request to natively support mm
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* update comment
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* Update tests to match the new LlmRequest constructor parameters
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* Modify unitTest and modify mm_embeding's dict name in llama4
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* Fix based on comments
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* Fix comment
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* Fix LlmRequest initialization in kvCacheManagerTest
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* Clean up code for promt_tuning_config
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* Clean up prompt_tuning_config in GenerationRequest
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
---------
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>
* Replace deepseek_allreduce op with the new unified allreduce op and moe_allreduce op.
* Minor revision of moe_allreduce op argument names.
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
* Squash of dev commits
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
* Add timer + waive test with suspected GptSession bug
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
* Respond to reviewer comments
Signed-off-by: domb <3886319+DomBrown@users.noreply.github.com>
---------
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
Signed-off-by: domb <3886319+DomBrown@users.noreply.github.com>
This MR integrates Conan into the build system, so that it can be used to fetch dependencies in future changes.
Also installs all requirements-dev.txt inside a virtualenv instead of the system, since some of Conan's dependencies may conflict with the system packages. Virtualenv is used instead of venv because the triton server backend container has only virtualenv installed. This also allows developers to cache the requirements-dev.txt packages between container launches.
Signed-off-by: Tyler Burt <195370667+tburt-nv@users.noreply.github.com>
Replace libtensorrt_llm_nvrtc_wrapper.so with its source code, which
consists of two parts:
1. NVRTC glue code
2. XQA kernel code
During TensorRT-LLM build, XQA kernel code is embedded as C++ arries via
gen_cpp_header.py and passed to NVRTC for JIT compilation.
Signed-off-by: Ming Wei <2345434+ming-wei@users.noreply.github.com>
* Add mIsGenerationMLA to differentiate ctx and gen MLA in AttentionOp.
For Generation MLA, if FlashMLA is used, do not check the existence of FMHA based MLA kernel.
Signed-off-by: Bo Li <bobboli0202@gmail.com>
* Run pre-commit.
Signed-off-by: Bo Li <bobboli0202@gmail.com>
* Fix compile error.
Signed-off-by: Bo Li <bobboli0202@gmail.com>
---------
Signed-off-by: Bo Li <bobboli0202@gmail.com>
test: add test cases for 0.19 release (#3608)
* fix test name
* add quickstart test for nemotron-ultra
* add rcca multi-node test case for deepseek-v3
* add rcca info
---------
squash (#3642)
fix: nvbugs/5187237: fix deterministic mode crash (#3448)
* nvbugs/5187237 nvbugs/5112075: fix deterministic mode error
* remove waive
* Revert "remove waive"
This reverts commit 0bf5486d19906d692bfb7a6262333c296b0087ac.
* revert ar fusion
---------
update fp8 doc (#3647)
tests: change qa perf test to trtllm-bench (#3619)
fix: FP8 quantized lm_head (NvBug 5214229) (#3567)
infra: Add PR approval protection for the release branch (#3634)
fix: nvbugs/5231298: pytorch allreduce issue (#3673)
Fix: nvbugs/5222698 variable not defined (#3630)
* Fix: nvbugs/5222698 variable not defined
* Tidy code
---------
test:sync waives.txt from main branch by disabling test_perf/gpt_350m-cppmanager case (#3685)
test:restore fp8 kv cache testing for L0 (#3671)
doc: Update DeepSeek perf docs (#3693)
* Update DeepSeek perf docs
* update
* Apply suggestions from code review
---------
tests: waive test_llm_multi_node (#3664)
fix: update test_user_buffers_mm_add_prologue atol (#3711)
Fix: cherry-pick hmac encryption from main branch (#3635)
* security fix cherry-pick changes from main
* fix hmac in remote mpi session (#3649)
---------
Un-waive DS-V3-Lite tests. (#3621)
fix: FP8 kv accuracy (#3675)
* fix FP8 kv accuracy
* update doc
---------
Fix script options for engines. (#3622)
unwaive multi-node test (#3721)
chore : Split more tests out of gpt tests (#3524) (#3674)
doc:add torch examples link into torch backend documentation (#3749)
test: Get Eagle tests working (#3593) (#3722)
Waive L0 test (#3756)
waive failed case in perf test, change default max_batch_size to 512 and write config.json to output log (#3656)
Update ds v3 parameters in stress test. (#3676)
waive gemma on L20 (#3766)
https://nvbugs/5141291: Fix convert.py script for Qwen model. (#3758)
Include Qwen2VLDecoderLayer in the smooth_qwen2_model function.
fix: PP4 fixes and cleanup (#3688)
remove benchmark test list (#3643)
skip disagg deepseek test if sm!=90 (#3720)
test: skip failed cases on B200 (#3710)
* add skip condition to tests
* fix error
---------
test: [nvbug: 5234494] skip_pre_ada for fp8 cases (#3718)
* skip_pre_ada for fp8 cases
* update
* update after rebase
---------
add know issue to deepseek doc. (#3800)
Fix ModelOpt Mixtral AWQ OOM (#3714) (#3761)
Waive L0 tests (#3826)
fix: Reduce memory usage in fused moe op associated with AutoTuning and fix moe fallback issue. (#3793)
* Reduce memory usage in fused moe op associated with AutoTuning.
* Replace pre-defined bucket size strategy with a generating function based on the tune_max_num_tokens.
* Add free_memory logic of workspace in min_latency_mode fused moe path.
* Fix fused_moe fallback issue. (#3652)
min_latency_mode is only set to False during warmup phase. Thus when it becomes true during inference, all tactics fall back to the default one and thus cause perf regression.
---------
[doc] Better document for Draft-Target-Model (DTM) speculative decoding (#3797)
Fix pre-commit
Fix again
Address some review comments for the MI
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
* refactor: Fix headsize 72 attention error for TRTLLM attn backend in PyTorch workflow
- Remove the head size pre-check logic in AttentionOp because head size 72 can be supported with fmha kernels.
- Added support for head size 72 in unfused attention kernels(QKVPreprocessing).
- Enhanced unit tests by introducing a scenario generation function for better test coverage of attention configurations(include head size 72).
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
* update: Waive head_dim=72 test cases and enhance test representation
- Added a waiver for head_dim=72 cases on post sm100 in the test suite to address known issues.
- Introduced a custom __repr__ method in the Scenario class for pytest substring match.
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
---------
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
* add MNNVL memory mapping support
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
* add more MPI environment for trtllm-llmapi-launch
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
* add MoE communication and prepare kernels
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
* add MNNVL AlltoAll support for DeepSeekV3
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
* add output dump for throughput benchmark
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
* support dynamic kernel launch grid
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
* address review comments
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
* address review comments #2
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
---------
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
- Removed kernel under test call, as it was not needed
- Removed kernel itself
- Removed kernel tests
- Removed other unused kernels and their tests
- Some static analysis clean up
* Use updateDecoderBuffers in python decoder.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Fix synchronize in trtllm decoder.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Enable by default.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Use guided_decoder to setup seqslots and free them.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Use always decode_async and update_requests.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Update decoder buffers.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Fix speculative decoding tests.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Send new_tensors_host instead of assuming dict.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Make default False in enable_trtllm_decoder.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Partially fix mtp, partially fix py_executor.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Update request states before sending disagg ctx cache.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Fix disagg test for torch decoder.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Make isend_tensor_list and recv_tensor_list for sending the tensors_host.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Formatting.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Fix rebase.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Add disagg serving case to guided decoder.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Get overlap scheduling to work.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Update cutlass to main.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Update after rebasing.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Formatting.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Update to use decode async and update requests.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Properly pass information to update_requests
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Formatting.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Make disaggregated serving a step closer to working.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Fix rebase.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Fix rebase and format.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Copy new device tokens more pythonic.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Restore MTP add dummy reqs.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Add ordereddict import to py_executor.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Formatting.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Added seq slot manager. Add test.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Use transmission for single tensor except when list of tensors is received.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Add TRTLLMDecoder allocation to estimate max kv cache tokens.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Add stream synchronization
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Formatting.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Make memory calculation of decoder adapt to the chosen decoder. Recognize decoder option passed in executorconfig. Make overlap scheduler test run on TinyLlama.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Format
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Add decoder creation to estimate max kv.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Formatting.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Update submodule UCXX inline with main.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
---------
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Rewrite unit test for unified allreduce op. Removing the legacy unit test.
* Revise formats, fusion_op bindings. Put all tensors as optional inputs.
* Move the MoeAllreduceOp to a separate custom op.
* Move all the fusion patterns to the new version of the AllReduce fusion kernel. Remove the AllReduce strategy config. Revise the AllReduce strategies and fusion pattern definitions.
* Add more TODOs, fixing minor bugs, and remove legacy code.
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>