Signed-off-by: qqiao <qqiao@nvidia.com>
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
Signed-off-by: Emma Qiao <qqiao@nvidia.com>
Co-authored-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
Co-authored-by: Erin Ho <14718778+hchings@users.noreply.github.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
- Adds a new Python custom op (fp8_block_scale_moe_runner) and a FP8BlockScaleMoERunner class for autotuning.
- Updates C++ MoE and batched GEMM kernels to accept a configIndex for workspace sizing and execution.
- Extends the unit test to run both autotuned and non-autotuned code paths.
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
Signed-off-by: Yao Yao <lowsfer@users.noreply.github.com>
Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
Co-authored-by: Yao Yao <lowsfer@users.noreply.github.com>
Co-authored-by: peaceh-nv <103117813+peaceh-nv@users.noreply.github.com>
Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
Signed-off-by: Omer Ullman Argov <118735753+omera-nv@users.noreply.github.com>
Co-authored-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
Co-authored-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
- Moved sorting related logic to a dedicated function for better clarity and maintainability.
- Enhanced sorting logic to separate finished context requests from ongoing ones before sorting by Lora task ID.
- Updated function documentation to reflect the sorting behavior and its purpose.
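
A minimal sketch of the sorting behavior described above, not the actual C++ implementation. The `Request` type, its field names, the order of the two groups (finished context requests first), and the placement of requests without a LoRA task ID are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Request:
    """Hypothetical stand-in for a scheduled request."""
    request_id: int
    lora_task_id: Optional[int]
    context_finished: bool


def sort_requests(requests: List[Request]) -> List[Request]:
    """Split finished context requests from ongoing ones, then sort each group by LoRA task ID."""

    def by_lora(req: Request):
        # Assumption: requests without a LoRA task ID go last within their group.
        return (req.lora_task_id is None, req.lora_task_id or 0)

    finished = [r for r in requests if r.context_finished]
    ongoing = [r for r in requests if not r.context_finished]
    # sorted() is stable, so requests with equal task IDs keep their relative order.
    return sorted(finished, key=by_lora) + sorted(ongoing, key=by_lora)
```

Keeping the partition separate from the per-group sort means requests sharing a LoRA task stay adjacent within each group.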
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
- Only allocate additional outputs on last pipeline parallel rank in trtGptModelInflightBatching and executorImpl.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: CreateNewDecoderRequests
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Consolidate request generation in CreateNewDecoderRequests
- Removed the GenerateRequestOptions class and integrated its functionality into CreateNewDecoderRequests.
- Updated the constructor of CreateNewDecoderRequests to accept parameters for speculative decoding and normalization options.
- Modified the operator() method to handle request generation directly, improving code organization and reducing redundancy.
- Cleaned up associated includes and references throughout the codebase.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Simplify request handling in CreateNewDecoderRequests
- Removed the generateRequestOptions method and integrated its logic directly into the operator() method.
- Updated the request generation process to improve clarity and reduce redundancy.
- Adjusted the return type to streamline the handling of batch slots, decoder requests, and sampling configurations.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Enhance createDecoderRequests method in CreateNewDecoderRequests
- Updated the createDecoderRequests method to include additional parameters for decoder state and CUDA streams, improving flexibility in request handling.
- Removed redundant request generation logic from the operator() method, streamlining the process.
- Adjusted the newRequest method to utilize the updated decoder request structure, enhancing clarity and maintainability.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Use MedusaBuffers instead of RuntimeBuffers in CreateNewDecoderRequests
- Updated references from RuntimeBuffers to MedusaBuffers across the CreateNewDecoderRequests class and its methods, enhancing clarity in buffer management.
- Adjusted method signatures and internal logic to accommodate the new MedusaBuffers type, ensuring compatibility with existing functionality.
- Cleaned up unnecessary includes and improved code organization for better maintainability.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Update CreateNewDecoderRequests to use DecoderState and CudaStream parameters
- Modified method signatures in CreateNewDecoderRequests to replace GptDecoderBatched with runtime::decoder::DecoderState and added a separate CudaStream for the decoder.
- Adjusted the implementation of the operator() method to accommodate the new parameters, enhancing flexibility in request handling.
- Updated associated bindings in the pybind11 interface to reflect the changes in method signatures, ensuring consistency across the codebase.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Update TRTLLMSampler to use refactored create_new_decoder_requests
- Updated the sampler.py to reflect changes in the request handling logic, replacing generate_request_options with create_new_decoder_requests for improved clarity and consistency.
- Updated bindings and method signatures for decoder stream handling.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Update gptDecoderBatchedTest to use CreateNewDecoderRequests::newRequest
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
---------
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* MoE TRTLLM backend for Qwen3
Signed-off-by: Anthony Chang <anchengc@nvidia.com>
* add extra moe_backend to test
Signed-off-by: Anthony Chang <anchengc@nvidia.com>
* address comments
Signed-off-by: Anthony Chang <anchengc@nvidia.com>
* conditionally compile kernels on newer archs
Signed-off-by: Anthony Chang <anchengc@nvidia.com>
* missing positional arg
Signed-off-by: Anthony Chang <anchengc@nvidia.com>
* Update the routing kernels
Signed-off-by: Christina Zhang <christinaz@nvidia.com>
* Revise usage of TLLM_LOG_ERROR
Signed-off-by: Christina Zhang <christinaz@nvidia.com>
* Add unit test for Qwen3 moe (trtllm_gen backend)
Signed-off-by: Christina Zhang <christinaz@nvidia.com>
* improve weight processing speed of moe_backend=TRTLLM; roughly 2x
Signed-off-by: Anthony Chang <anchengc@nvidia.com>
* tidy and minor fix
Signed-off-by: Anthony Chang <anchengc@nvidia.com>
* temporarily disable accuracy test that has known issue
Signed-off-by: Anthony Chang <anchengc@nvidia.com>
---------
Signed-off-by: Anthony Chang <anchengc@nvidia.com>
Signed-off-by: Christina Zhang <christinaz@nvidia.com>
Co-authored-by: Christina Zhang <christinaz@nvidia.com>
* Add Julien's original kernel.
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
* Get rid of UpdateKVCache functionality.
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
* Add kernels.
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
* Add torch OP.
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
* Update cmake.
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
* Torch OP must use double as argument dtype.
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
* Add unittest.
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
* Add unittest.
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
* Fix misaligned access when head_dim=64.
In this case, numElemsPerThread=2 and numVecPerThread=0, but the store code incorrectly performs a vectorized store, so some threads (e.g., lane 1) issue stores to addresses that are not 64-bit aligned.
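
The arithmetic behind this fix can be sketched as follows. The warp size of 32, the 2-byte element type, and an aligned base pointer are assumptions; `store_plan` is a hypothetical helper, not a function from the kernel.

```python
WARP_SIZE = 32   # assumption: one warp of 32 lanes covers one head
ELEM_BYTES = 2   # assumption: 16-bit elements (e.g. fp16 / bf16)
VEC_BYTES = 8    # a 64-bit vectorized store


def store_plan(head_dim: int, lane: int):
    """Return (numElemsPerThread, numVecPerThread, byte offset, 64-bit aligned?) for one lane."""
    num_elems_per_thread = head_dim // WARP_SIZE
    elems_per_vec = VEC_BYTES // ELEM_BYTES
    num_vec_per_thread = num_elems_per_thread // elems_per_vec
    # Offset of this lane's first element, relative to an aligned base pointer.
    byte_offset = lane * num_elems_per_thread * ELEM_BYTES
    aligned = byte_offset % VEC_BYTES == 0
    return num_elems_per_thread, num_vec_per_thread, byte_offset, aligned


def can_use_vector_store(head_dim: int) -> bool:
    """A vectorized store is only safe when each thread owns at least one full vector."""
    return store_plan(head_dim, 0)[1] > 0
```

Under these assumptions, lane 1 with head_dim=64 starts at byte offset 4, which is why a 64-bit store from that lane is misaligned.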
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
* Remove unroll (compiler can do that).
Cleanup code.
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
* Add switch for interleave.
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
* Refactor vectorized load/store.
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
* Implement is_neox. Result not correct yet.
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
* Fix is_neox=True.
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
* Add q_weight and k_weight.
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
---------
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
* chore: Improve formatting of DisaggExecutorTest
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Typed InstanceRole param in DisaggExecutorTest
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Skip DisaggExecutorTest based on device count
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
---------
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* first commit of cpp moe loadbalance code
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
* add python bindings for moe load balance
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
* add python wrapper, ut and bug fixes
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
* add binding for layerId and update binding test
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
* add host tensor sharing and ut
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
---------
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
This PR adds a customized allreduce to TensorRT-LLM. The new allreduce is used for communication on PCIe-based GPUs via low-precision quantization, which can accelerate the PCIe allreduce process.
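
As an illustration of the idea only: a low-precision allreduce sends a quantized payload plus a scale instead of full-precision values. The sketch below simulates this in plain Python. Symmetric int8 quantization with one scale per rank is an assumption, and none of these function names come from the PR.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-buffer quantization to the range [-127, 127]."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_int8(quantized: List[int], scale: float) -> List[float]:
    return [q * scale for q in quantized]


def quantized_allreduce_sum(per_rank: List[List[float]]) -> List[float]:
    """Simulate a sum-allreduce in which each rank transmits int8 data plus one scale."""
    payloads = [quantize_int8(buf) for buf in per_rank]  # what would cross the PCIe link
    total = [0.0] * len(per_rank[0])
    for quantized, scale in payloads:
        for i, value in enumerate(dequantize_int8(quantized, scale)):
            total[i] += value
    return total
```

The payload shrinks to one byte per element, at the cost of a rounding error of at most half a scale step per rank and element.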
Signed-off-by: Hui Kang <hkang@nvidia.com>
Co-authored-by: Hui Kang <hkang@nvidia.com>
* Fix padded vocab size for Llama
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
* Refactor multi GPU llama executor tests, and reuse the built model engines
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
* Fix test list typo
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
* WIP
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
* Further WIP
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
* WIP
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
* Update test lists and readme
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
* Try parametrize for asymmetric
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
* Parametrize + skip unsupported combinations
Signed-off-by: domb <3886319+DomBrown@users.noreply.github.com>
* Update test list
Signed-off-by: domb <3886319+DomBrown@users.noreply.github.com>
* Reduce environment duplicated code
Signed-off-by: domb <3886319+DomBrown@users.noreply.github.com>
---------
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
Signed-off-by: domb <3886319+DomBrown@users.noreply.github.com>
* refactor: Copy sequence lengths once in decoder setup
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Update DecoderInputBuffers to remove duplicated buffers
- Renamed and reorganized buffer variables in decoderBuffers.h and decoderBuffers.cpp for better readability.
- Adjusted references in generateRequestOptions.cpp to align with the new buffer structure.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Move getEmbeddingBias to anonymous namespace
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Filter context requests
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: GenerateRequestOptions using more fine-grained functions
- Added a new method `createDecoderRequests` to encapsulate the logic for creating decoder requests from finished context requests.
- Updated the `operator()` method to utilize the new method, improving code clarity and maintainability.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Update TRTLLMDecoder
- Updated the `generate_request_options` call.
- Updated the `make_decoding_batch_input_output` call.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Remove const where we modify input buffers
- Changed `DecoderInputBuffers` parameters from const references to non-const references in multiple functions to allow modifications.
- Updated related function calls to ensure compatibility with the new parameter types.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* fixup! refactor: Copy sequence lengths once in decoder setup
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
---------
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
- Adds BatchedGemm cubins and the respective call interface from TensorRT-LLM Generator.
- Refactors TRT-LLM Gen MoE runner to call the BMM interface.
- The accuracy is verified for DeepSeek R1 FP4
Signed-off-by: Nikita Korobov <nkorobov@nvidia.com>
* Downgrade the gcc toolset version from 13 to 11
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
* Update rocky8 images
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
---------
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>
Co-authored-by: Hao Lu <14827759+hlu1@users.noreply.github.com>
* chore: Remove GptSession/V1 from TRT workflow
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* chore: Remove stateful decoders
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* chore: Remove GptSession buffers
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* chore: Remove GptSession utils
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* chore: Remove GptSession kernels
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* chore: Remove V1 GPT models from tests
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* chore: Remove gptSessionBenchmark from scripts and docs
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* chore: Remove gptSession IO classes
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* chore: Remove GptSession from test lists
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* chore: Remove GptSession from docs
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* chore: Remove useless encoder test
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* chore: Remove mActualBatchSize from DecoderState
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* chore: Remove static batching from ExecutorTest
- Updated `validateContextLogits` and `validateGenerationLogits` functions to remove the `batchingType` parameter.
- Adjusted related test functions to reflect the changes in parameter lists.
- Cleaned up the instantiation of test cases to eliminate unnecessary batchingType references.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
---------
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
Support DeepSeek-R1 W4A8 on Hopper
Co-authored-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
Co-authored-by: Jiang Shao <91270701+StudyingShao@users.noreply.github.com>
Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
* feat: Add heuristic for GroupRMSNorm kernel selection.
Implements a logistic regression model to dynamically select between:
- GroupRMSNormBaseKernel: Allocates warps proportional to sum of dimensions
(better SM occupancy in most cases)
- GroupRMSNormLargeBatch: Allocates warps proportional to max dimension
(better block scheduling in large batch scenarios)
Selection heuristic considers batch size, allocated warps, and scheduling
efficiency on the current GPU architecture. Models for Compute Capability
9.x and 10.x are trained based on nsys kernel runtime data.
The default kernel selection is the base kernel.
The python operator group_rms_norm will use the heuristic by default.
Users can also select the base or large-batch kernel explicitly.
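
A rough sketch of how a logistic-regression selector of this kind could look. The feature set and every coefficient below are invented for illustration; the real models were fitted to nsys runtime data per compute capability and are not reproduced here.

```python
import math
from typing import Sequence

# Hypothetical coefficients, NOT the trained values.
WEIGHTS = {"bias": -4.0, "log2_batch": 0.5, "warp_ratio": 1.0}


def large_batch_probability(batch_size: int, dims: Sequence[int]) -> float:
    """Logistic model over two illustrative features."""
    warps_base = sum(dims)    # base kernel: warps proportional to the sum of dimensions
    warps_large = max(dims)   # large-batch kernel: warps proportional to the max dimension
    warp_ratio = warps_base / warps_large
    z = (WEIGHTS["bias"]
         + WEIGHTS["log2_batch"] * math.log2(batch_size)
         + WEIGHTS["warp_ratio"] * warp_ratio)
    return 1.0 / (1.0 + math.exp(-z))


def select_kernel(batch_size: int, dims: Sequence[int]) -> str:
    """Default to the base kernel unless the model favors the large-batch kernel."""
    if large_batch_probability(batch_size, dims) > 0.5:
        return "GroupRMSNormLargeBatch"
    return "GroupRMSNormBaseKernel"
```

With these made-up weights, a batch of 1 stays on the base kernel while a batch of 4096 switches to the large-batch kernel.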
Signed-off-by: Simeng Liu <simengl@nvidia.com>
* Address the comments.
Signed-off-by: Simeng Liu <simengl@nvidia.com>
---------
Signed-off-by: Simeng Liu <simengl@nvidia.com>
* Refactor: Restructure C++ tests for better modularisation of non-shared code
Start cleanup of pytest code for C++ tests
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
Clean up names and remove references to test_cpp.py
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
WIP
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
Move multi-GPU code
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
Update doc and try un-waiving
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
* Update multi GPU file check
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
* Address minor multi-GPU setup bug
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
---------
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
* Fallback to NCCL for various patterns when input size is large.
Move the previous implementation to cpp side.
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
* Revising.
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
---------
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
* feat: Reduce branch overhead in groupRMSNorm kernels
* Fix race condition with sm < 90 and avoid all threads in one warp writing to the same shared memory.
Signed-off-by: Simeng Liu <simengl@nvidia.com>
---------
Signed-off-by: Simeng Liu <simengl@nvidia.com>
* Properly get decoding mode according to same logic as cpp.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Cross reference getDecodingMode implementations in pytorch - cpp.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Better bindings for DecodingMode.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Revert to version in main.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Fix.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Revert configuration.py.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
---------
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* disable overlap in encoder
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* feat: invokeGatherBatch
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* feat: overlap same batch
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* chore: add enableTrtOverlap to ExecutorConfig
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* disable overlap for beam search and spec decode
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* skip overlap tests with beam search or speculative decoding
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* moveFinishedContextRequestsToGeneration and skip unfinished requests in updateRequests
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* enable overlap in GptChunkedLongContextTests
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* feat: Enable overlap in gptManagerBenchmark
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* feat: Improve early exit
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Use OptionalRef for newOutputTokens tensor
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* feat: Add overlap scheduling support to TRTLLMDecoder
- Updated TRTLLMDecoder to accept an `enable_overlap_scheduler` parameter.
- Modified the decoder's internal logic to utilize the overlap scheduling feature.
- Adjusted the sequence lengths handling to ensure compatibility with the new scheduling approach.
- Enhanced unit tests to include cases for the overlap scheduler with the TRTLLMDecoder.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* fix: allNewTokens in PP
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
---------
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* Remove stdout pipe for genai-perf and make stress time a public parameter.
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
* Update llmRequest based on comment.
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
* launch process function refactor.
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
---------
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
* feat: Implement synchronous request termination in batch manager
- Added `terminateRequestSync` method to `TrtEncoderModel` and `TrtGptModelInflightBatching` for handling request termination in the next `forwardSync` call.
- Updated existing request termination logic to utilize the new synchronous method, ensuring generated tokens are cleared appropriately.
- Enhanced logging for clarity in token management during request processing.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* fixup! feat: Implement synchronous request termination in batch manager
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* fix: MockedModelCancelRequest
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* fixup! feat: Implement synchronous request termination in batch manager
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* fix: terminate with timeout
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* fixup! feat: Implement synchronous request termination in batch manager
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* docs: Update doc string for allottedTimeMs
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
---------
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Move executor recv functions into classes
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Enhance MPI logging and error handling
- Updated MPI logging to include destination and tag information for better traceability during send and receive operations.
- Added error checking for MPI_Wait and MPI_Cancel calls to ensure proper handling of multi-device requests.
- Improved code structure for clarity and maintainability.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Introduce MpiTag enumeration and update MPI function signatures
- Added a new header file `mpiTags.h` to define an enumeration for MPI tags, improving code readability and maintainability.
- Updated function signatures in `mpiUtils.h` and `mpiUtils.cpp` to use the new `MpiTag` type instead of raw integers for tags.
- Refactored various MPI calls across the codebase to utilize the new `MpiTag` enumeration, enhancing type safety and clarity.
- Removed redundant MPI tag constants from several classes, streamlining the code.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* fixup! refactor: Introduce MpiTag enumeration and update MPI function signatures
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Rename tags for consistency
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
---------
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Move ModelSpec from tests to core library
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Move ModelSpec from runtime to separate dir
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Use new bindings path and clean up
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* chore: Updated licenses
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* chore: Remove script_dir from path
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
---------
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* Move all casters to customCasters.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Use customCasters in all bindings.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Added customCasters to userbuffers.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
---------
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* feat: Add group_rms_norm kernel to normalize multiple inputs in a single operator.
Previously, the RMSNorm implementation only supported a single input tensor. With group_rms_norm, multiple tensors can be normalized together:
```python
input_a, input_b, ... = group_rms_norm([input_a, input_b, ...])
```
All input tensors must share the same batch dimension. The kernel partitions work by dynamically assigning warp groups proportional to the last dimension of each input, improving launch efficiency and reducing overhead.
This MR provides two implementations:
GroupRMSNormKernel: Optimized for small-to-medium batch sizes
GroupRMSNormKernelLargeBatch: Contains additional optimizations for large batch sizes
Both kernels are currently exposed as custom PyTorch ops. A future MR will implement heuristic-based kernel selection and expose a unified interface.
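
For reference, the intended semantics can be sketched in plain Python. This is a slow reference, not the CUDA kernel. It assumes RMSNorm without a learnable weight, an `eps` of 1e-6, and inputs given as nested lists; `group_rms_norm_reference` and `split_warps` are hypothetical names.

```python
import math
from typing import List

Matrix = List[List[float]]  # [batch, last_dim]


def rms_norm_row(row: List[float], eps: float = 1e-6) -> List[float]:
    """Scale a row so that its root mean square is approximately 1."""
    mean_sq = sum(x * x for x in row) / len(row)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [x * inv_rms for x in row]


def group_rms_norm_reference(inputs: List[Matrix], eps: float = 1e-6) -> List[Matrix]:
    """Normalize several inputs that share a batch dimension but may differ in last dimension."""
    batch = len(inputs[0])
    if any(len(t) != batch for t in inputs):
        raise ValueError("all inputs must share the same batch dimension")
    return [[rms_norm_row(row, eps) for row in t] for t in inputs]


def split_warps(last_dims: List[int], total_warps: int) -> List[int]:
    """Illustrative warp split proportional to each input's last dimension (at least one each)."""
    total = sum(last_dims)
    return [max(1, total_warps * d // total) for d in last_dims]
```

Each input is normalized independently over its own last dimension; the grouping only changes how the work is launched, not the result.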
Signed-off-by: Simeng Liu <simengl@nvidia.com>
* Resolve comments and fix typo with IS_FLASHINFER_AVAILABLE
Signed-off-by: Simeng Liu <simengl@nvidia.com>
---------
Signed-off-by: Simeng Liu <simengl@nvidia.com>
* support lp in pytorch backend
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
* fix tp
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
---------
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
* support return logprob in llmapi
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
update and add test
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
stability test
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
* revert removal of old flag
Signed-off-by: Erin Ho <erinh@nvidia.com>
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
---------
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
Signed-off-by: Erin Ho <erinh@nvidia.com>
* Add a new param to LlmRequest and Request to natively support mm
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* update comment
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* Update tests to match the new LlmRequest constructor parameters
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* Modify unitTest and modify mm_embedding's dict name in llama4
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* Fix based on comments
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* Fix comment
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* Fix LlmRequest initialization in kvCacheManagerTest
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* Clean up code for prompt_tuning_config
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* Clean up prompt_tuning_config in GenerationRequest
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
---------
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>
* Replace deepseek_allreduce op with the new unified allreduce op and moe_allreduce op.
* Minor revision of moe_allreduce op argument names.
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
* Squash of dev commits
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
* Add timer + waive test with suspected GptSession bug
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
* Respond to reviewer comments
Signed-off-by: domb <3886319+DomBrown@users.noreply.github.com>
---------
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
Signed-off-by: domb <3886319+DomBrown@users.noreply.github.com>
This MR integrates Conan into the build system, so that it can be used to fetch dependencies in future changes.
Also installs all requirements-dev.txt inside a virtualenv instead of the system, since some of Conan's dependencies may conflict with the system packages. Virtualenv is used instead of venv because the triton server backend container has only virtualenv installed. This also allows developers to cache the requirements-dev.txt packages between container launches.
Signed-off-by: Tyler Burt <195370667+tburt-nv@users.noreply.github.com>
Replace libtensorrt_llm_nvrtc_wrapper.so with its source code, which
consists of two parts:
1. NVRTC glue code
2. XQA kernel code
During TensorRT-LLM build, XQA kernel code is embedded as C++ arrays via
gen_cpp_header.py and passed to NVRTC for JIT compilation.
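
To illustrate the embedding step, here is a minimal sketch of turning a source file's text into a C++ byte array. It is not the actual gen_cpp_header.py; the function name, the emitted declaration, and the formatting are assumptions.

```python
def source_to_cpp_array(var_name: str, source: str, per_line: int = 12) -> str:
    """Render source text as a NUL-terminated C++ byte array inside a header."""
    data = source.encode("utf-8") + b"\x00"  # NUL terminator so it can be used as a C string
    body_lines = []
    for start in range(0, len(data), per_line):
        chunk = data[start:start + per_line]
        body_lines.append("    " + ", ".join(f"0x{byte:02x}" for byte in chunk) + ",")
    return (
        "#pragma once\n"
        f"static const unsigned char {var_name}[] = {{\n"
        + "\n".join(body_lines)
        + "\n};\n"
    )
```

A header generated this way can be compiled into the library, so the kernel source is available to NVRTC at run time without shipping separate files.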
Signed-off-by: Ming Wei <2345434+ming-wei@users.noreply.github.com>
* Add mIsGenerationMLA to differentiate ctx and gen MLA in AttentionOp.
For Generation MLA, if FlashMLA is used, do not check the existence of FMHA based MLA kernel.
Signed-off-by: Bo Li <bobboli0202@gmail.com>
* Run pre-commit.
Signed-off-by: Bo Li <bobboli0202@gmail.com>
* Fix compile error.
Signed-off-by: Bo Li <bobboli0202@gmail.com>
---------
Signed-off-by: Bo Li <bobboli0202@gmail.com>
test: add test cases for 0.19 release (#3608)
* fix test name
* add quickstart test for nemotron-ultra
* add rcca multi-node test case for deepseek-v3
* add rcca info
---------
squash (#3642)
fix: nvbugs/5187237: fix deterministic mode crash (#3448)
* nvbugs/5187237 nvbugs/5112075: fix deterministic mode error
* remove waive
* Revert "remove waive"
This reverts commit 0bf5486d19906d692bfb7a6262333c296b0087ac.
* revert ar fusion
---------
update fp8 doc (#3647)
tests: change qa perf test to trtllm-bench (#3619)
fix: FP8 quantized lm_head (NvBug 5214229) (#3567)
infra: Add PR approval protection for the release branch (#3634)
fix: nvbugs/5231298: pytorch allreduce issue (#3673)
Fix: nvbugs/5222698 variable not defined (#3630)
* Fix: nvbugs/5222698 variable not defined
* Tidy code
---------
test:sync waives.txt from main branch by disabling test_perf/gpt_350m-cppmanager case (#3685)
test:restore fp8 kv cache testing for L0 (#3671)
doc: Update DeepSeek perf docs (#3693)
* Update DeepSeek perf docs
* update
* Apply suggestions from code review
---------
tests: waive test_llm_multi_node (#3664)
fix: update test_user_buffers_mm_add_prologue atol (#3711)
Fix: cherry-pick hmac encryption from main branch (#3635)
* security fix cherry-pick changes from main
* fix hmac in remote mpi session (#3649)
---------
Un-waive DS-V3-Lite tests. (#3621)
fix: FP8 kv accuracy (#3675)
* fix FP8 kv accuracy
* update doc
---------
Fix script options for engines. (#3622)
unwaive multi-node test (#3721)
chore : Split more tests out of gpt tests (#3524) (#3674)
doc:add torch examples link into torch backend documentation (#3749)
test: Get Eagle tests working (#3593) (#3722)
Waive L0 test (#3756)
waive failed case in perf test, change default max_batch_size to 512 and write config.json to output log (#3656)
Update ds v3 parameters in stress test. (#3676)
waive gemma on L20 (#3766)
https://nvbugs/5141291: Fix convert.py script for Qwen model. (#3758)
Include Qwen2VLDecoderLayer in the smooth_qwen2_model function.
fix: PP4 fixes and cleanup (#3688)
remove benchmark test list (#3643)
skip disagg deepseek test if sm!=90 (#3720)
test: skip failed cases on B200 (#3710)
* add skip condition to tests
* fix error
---------
test: [nvbug: 5234494] skip_pre_ada for fp8 cases (#3718)
* skip_pre_ada for fp8 cases
* update
* update after rebase
---------
add known issue to deepseek doc. (#3800)
Fix ModelOpt Mixtral AWQ OOM (#3714) (#3761)
Waive L0 tests (#3826)
fix: Reduce memory usage in fused moe op associated with AutoTuning and fix moe fallback issue. (#3793)
* Reduce memory usage in fused moe op associated with AutoTuning.
* Replace pre-defined bucket size strategy with a generating function based on the tune_max_num_tokens.
* Add free_memory logic of workspace in min_latency_mode fused moe path.
* Fix fused_moe fallback issue. (#3652)
min_latency_mode is set to False only during the warmup phase. When it becomes True during inference, all tactics fall back to the default one, which causes a perf regression.
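
One way a bucket-generating function based on tune_max_num_tokens could look. The powers-of-two progression and the function name are assumptions; the commit does not state the actual formula.

```python
from typing import List


def generate_token_buckets(tune_max_num_tokens: int) -> List[int]:
    """Generate tuning buckets up to the limit instead of using a fixed, pre-defined list."""
    if tune_max_num_tokens < 1:
        raise ValueError("tune_max_num_tokens must be positive")
    buckets = []
    size = 1
    while size < tune_max_num_tokens:
        buckets.append(size)
        size *= 2  # assumption: buckets grow by powers of two
    buckets.append(tune_max_num_tokens)  # always include the limit itself
    return buckets
```

Deriving the buckets from the limit avoids profiling sizes that can never occur, which is consistent with the memory reduction described above.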
---------
[doc] Better document for Draft-Target-Model (DTM) speculative decoding (#3797)
Fix pre-commit
Fix again
Address some review comments for the MI
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>