Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
Co-authored-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
- Moved sorting related logic to a dedicated function for better clarity and maintainability.
- Enhanced sorting logic to separate finished context requests from ongoing ones before sorting by Lora task ID.
- Updated function documentation to reflect the sorting behavior and its purpose.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
- Only allocate additional outputs on last pipeline parallel rank in trtGptModelInflightBatching and executorImpl.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: CreateNewDecoderRequests
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Consolidate request generation in CreateNewDecoderRequests
- Removed the GenerateRequestOptions class and integrated its functionality into CreateNewDecoderRequests.
- Updated the constructor of CreateNewDecoderRequests to accept parameters for speculative decoding and normalization options.
- Modified the operator() method to handle request generation directly, improving code organization and reducing redundancy.
- Cleaned up associated includes and references throughout the codebase.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Simplify request handling in CreateNewDecoderRequests
- Removed the generateRequestOptions method and integrated its logic directly into the operator() method.
- Updated the request generation process to improve clarity and reduce redundancy.
- Adjusted the return type to streamline the handling of batch slots, decoder requests, and sampling configurations.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Enhance createDecoderRequests method in CreateNewDecoderRequests
- Updated the createDecoderRequests method to include additional parameters for decoder state and CUDA streams, improving flexibility in request handling.
- Removed redundant request generation logic from the operator() method, streamlining the process.
- Adjusted the newRequest method to utilize the updated decoder request structure, enhancing clarity and maintainability.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Use MedusaBuffers instead of RuntimeBuffers in CreateNewDecoderRequests
- Updated references from RuntimeBuffers to MedusaBuffers across the CreateNewDecoderRequests class and its methods, enhancing clarity in buffer management.
- Adjusted method signatures and internal logic to accommodate the new MedusaBuffers type, ensuring compatibility with existing functionality.
- Cleaned up unnecessary includes and improved code organization for better maintainability.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Update CreateNewDecoderRequests to use DecoderState and CudaStream parameters
- Modified method signatures in CreateNewDecoderRequests to replace GptDecoderBatched with runtime::decoder::DecoderState and added a separate CudaStream for the decoder.
- Adjusted the implementation of the operator() method to accommodate the new parameters, enhancing flexibility in request handling.
- Updated associated bindings in the pybind11 interface to reflect the changes in method signatures, ensuring consistency across the codebase.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Update TRTLLMSampler to use refactored create_new_decoder_requests
- Updated the sampler.py to reflect changes in the request handling logic, replacing generate_request_options with create_new_decoder_requests for improved clarity and consistency.
- Updated bindings and method signatures for decoder stream handling.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Update gptDecoderBatchedTest to use CreateNewDecoderRequests::newRequest
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
---------
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* MoE TRTLLM backend for Qwen3
Signed-off-by: Anthony Chang <anchengc@nvidia.com>
* add extra moe_backend to test
Signed-off-by: Anthony Chang <anchengc@nvidia.com>
* address comments
Signed-off-by: Anthony Chang <anchengc@nvidia.com>
* conditionally compile kernels on newer archs
Signed-off-by: Anthony Chang <anchengc@nvidia.com>
* missing positional arg
Signed-off-by: Anthony Chang <anchengc@nvidia.com>
* Update the routing kernels
Signed-off-by: Christina Zhang <christinaz@nvidia.com>
* Revise usage of TLLM_LOG_ERROR
Signed-off-by: Christina Zhang <christinaz@nvidia.com>
* Add unit test for Qwen3 moe (trtllm_gen backend)
Signed-off-by: Christina Zhang <christinaz@nvidia.com>
* improve weight processing speed of moe_backend=TRTLLM; roughly 2x
Signed-off-by: Anthony Chang <anchengc@nvidia.com>
* tidy and minor fix
Signed-off-by: Anthony Chang <anchengc@nvidia.com>
* temporarily disable accuracy test that has known issue
Signed-off-by: Anthony Chang <anchengc@nvidia.com>
---------
Signed-off-by: Anthony Chang <anchengc@nvidia.com>
Signed-off-by: Christina Zhang <christinaz@nvidia.com>
Co-authored-by: Christina Zhang <christinaz@nvidia.com>
* Add Julien's origina kernel.
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
* Get rid of UpdateKVCache functionality.
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
* Add kernels.
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
* Add torch OP.
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
* Update cmake.
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
* Torch OP must use double as argument dtype.
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
* Add unittest.
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
* Add unittest.
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
* Fix misaligned access when head_dim=64.
In this case, numElemsPerThread=2, numVecPerThread=0. But the store code incorrectly perform vectorized store, some threads (e.g., lane1) issue store to address that is not aligned to 64 bit.
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
* Remove unroll (compiler can do that).
Cleanup code.
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
* Add switch for interleave.
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
* Refactor vectorized load/store.
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
* Implement is_neox. Result not correct yet.
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
* Fix is_neox=True.
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
* Add q_weight and k_weight.
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
---------
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
* chore: Improve formatting of DisaggExecutorTest
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Typed InstanceRole param in DisaggExecutorTest
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Skip DisaggExecutorTest based on device count
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
---------
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>