Commit Graph

1044 Commits

Author SHA1 Message Date
Yanchao Lu
20c15fc04f
Fix invalid testcase name (#4626)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-05-24 00:40:00 +08:00
Yao Yao
ef763b0ddc
fix: rename some terms (#4534)
Signed-off-by: Yao Yao <lowsfer@users.noreply.github.com>
2025-05-23 23:23:49 +08:00
Robin Kobus
7b2818a47b
refactor: CreateNewDecoderRequests (#4452)
* refactor: CreateNewDecoderRequests

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Consolidate request generation in CreateNewDecoderRequests

- Removed the GenerateRequestOptions class and integrated its functionality into CreateNewDecoderRequests.
- Updated the constructor of CreateNewDecoderRequests to accept parameters for speculative decoding and normalization options.
- Modified the operator() method to handle request generation directly, improving code organization and reducing redundancy.
- Cleaned up associated includes and references throughout the codebase.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Simplify request handling in CreateNewDecoderRequests

- Removed the generateRequestOptions method and integrated its logic directly into the operator() method.
- Updated the request generation process to improve clarity and reduce redundancy.
- Adjusted the return type to streamline the handling of batch slots, decoder requests, and sampling configurations.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Enhance createDecoderRequests method in CreateNewDecoderRequests

- Updated the createDecoderRequests method to include additional parameters for decoder state and CUDA streams, improving flexibility in request handling.
- Removed redundant request generation logic from the operator() method, streamlining the process.
- Adjusted the newRequest method to utilize the updated decoder request structure, enhancing clarity and maintainability.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Use MedusaBuffers instead of RuntimeBuffers in CreateNewDecoderRequests

- Updated references from RuntimeBuffers to MedusaBuffers across the CreateNewDecoderRequests class and its methods, enhancing clarity in buffer management.
- Adjusted method signatures and internal logic to accommodate the new MedusaBuffers type, ensuring compatibility with existing functionality.
- Cleaned up unnecessary includes and improved code organization for better maintainability.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Update CreateNewDecoderRequests to use DecoderState and CudaStream parameters

- Modified method signatures in CreateNewDecoderRequests to replace GptDecoderBatched with runtime::decoder::DecoderState and added a separate CudaStream for the decoder.
- Adjusted the implementation of the operator() method to accommodate the new parameters, enhancing flexibility in request handling.
- Updated associated bindings in the pybind11 interface to reflect the changes in method signatures, ensuring consistency across the codebase.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Update TRTLLMSampler to use refactored create_new_decoder_requests

- Updated the sampler.py to reflect changes in the request handling logic, replacing generate_request_options with create_new_decoder_requests for improved clarity and consistency.
- Updated bindings and method signatures for decoder stream handling.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Update gptDecoderBatchedTest to use CreateNewDecoderRequests::newRequest

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

---------

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-05-23 22:54:37 +08:00
dominicshanshan
ca3eaf4070
[nvbug/5028235][fix]pytest bindings tokens logtis comparison. (#4424)
* fix bug 5028235.

Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>

* fix bug 5028235 and update comments.

Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>

* Update tests/unittest/bindings/test_executor_bindings.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: dominicshanshan <30051912+dominicshanshan@users.noreply.github.com>

* Remove redundant code.

Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>

* Update based on review comments.

Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>

---------

Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
Signed-off-by: dominicshanshan <30051912+dominicshanshan@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-05-23 20:41:00 +08:00
juney-nvidia
7b2bb67491
Update CODEOWNERS for PyTorch backend - runtime component (#4620)
Signed-off-by: Jun Yang <143764042+juney-nvidia@users.noreply.github.com>
2025-05-23 20:40:44 +08:00
Robin Kobus
15a59e57f6
[nvbugs/5301492] ci: waive test_workers_kv_cache_aware_router (#4617)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-05-23 20:14:28 +08:00
zhhuang-nv
8452775db8
[TRTLLM-5070][feat] Support FP8 KV Cache Reuse for MLA (#4535)
* optimize kv cache reuse workflow for MLA

write kv cache first and only call up-projection GEMM once
relax contiguous requirements of k/v for setting paged kv cache
return two contiguous tensors when loading MLA KV Cache

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* support fp8 kv cache for MLA kv cache reuse

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* resolve comments

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

---------

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
2025-05-23 19:47:50 +08:00
Anthony Chang
bbea2647b1
Qwen3 supports TRTLLM FP4 MoE backend (#4530)
* MoE TRTLLM backend for Qwen3

Signed-off-by: Anthony Chang <anchengc@nvidia.com>

* add extra moe_backend to test

Signed-off-by: Anthony Chang <anchengc@nvidia.com>

* address comments

Signed-off-by: Anthony Chang <anchengc@nvidia.com>

* conditionally compile kernels on newer archs

Signed-off-by: Anthony Chang <anchengc@nvidia.com>

* missing positional arg

Signed-off-by: Anthony Chang <anchengc@nvidia.com>

* Update the routing kernels

Signed-off-by: Christina Zhang <christinaz@nvidia.com>

* Revise usage of TLLM_LOG_ERROR

Signed-off-by: Christina Zhang <christinaz@nvidia.com>

* Add unit test for Qwen3 moe (trtllm_gen backend)

Signed-off-by: Christina Zhang <christinaz@nvidia.com>

* improve weight processing speed of moe_backend=TRTLLM; roughly 2x

Signed-off-by: Anthony Chang <anchengc@nvidia.com>

* tidy and minor fix

Signed-off-by: Anthony Chang <anchengc@nvidia.com>

* temporarily disable accuracy test that has known issue

Signed-off-by: Anthony Chang <anchengc@nvidia.com>

---------

Signed-off-by: Anthony Chang <anchengc@nvidia.com>
Signed-off-by: Christina Zhang <christinaz@nvidia.com>
Co-authored-by: Christina Zhang <christinaz@nvidia.com>
2025-05-23 18:31:08 +08:00
juney-nvidia
419151f358
Update the GH main page to expose tech blogs (#4610)
* Update the main page to expose the tech blogs

Signed-off-by: Jun Yang <143764042+juney-nvidia@users.noreply.github.com>

* fix formating issue

Signed-off-by: Jun Yang <143764042+juney-nvidia@users.noreply.github.com>

* fixing format issue

Signed-off-by: Jun Yang <143764042+juney-nvidia@users.noreply.github.com>

---------

Signed-off-by: Jun Yang <143764042+juney-nvidia@users.noreply.github.com>
2025-05-23 17:03:56 +08:00
Yiqing Yan
3ca05330f9
Waive L0 test (#4609)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-05-23 15:54:11 +08:00
Bo Li
9ae705af1b
perf: Add fused q_norm/k_norm/RoPE for Qwen3. (#4482)
* Add Julien's origina kernel.

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>

* Get rid of UpdateKVCache functionality.

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>

* Add kernels.

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>

* Add torch OP.

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>

* Update cmake.

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>

* Torch OP must use double as argument dtype.

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>

* Add unittest.

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>

* Add unittest.

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>

* Fix misaligned access when head_dim=64.
In this case, numElemsPerThread=2, numVecPerThread=0. But the store code incorrectly perform vectorized store, some threads (e.g., lane1) issue store to address that is not aligned to 64 bit.

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>

* Remove unroll (compiler can do that).
Cleanup code.

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>

* Add switch for interleave.

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>

* Refactor vectorized load/store.

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>

* Implement is_neox. Result not correct yet.

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>

* Fix is_neox=True.

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>

* Add q_weight and k_weight.

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>

---------

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2025-05-23 15:31:04 +08:00
bhsueh_NV
6527c055cf
chore: fix bug of llama lora test (#4566)
* fix bug of llama lora test

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

* Update test_llm.py

fix bug detected by pre-commit

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

---------

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-05-23 14:06:40 +08:00
Fanrong Li
862bde99b6
draft[doc]: add mtp tech blog (#4580)
* add mtp tech blog.

Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>

* update figure size.

Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>

* update the figure caption style and add some code/pr links.

Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>

* fix figure captions.

Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>

* fix figure size and perf data.

Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>

* fix.

Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>

* fix.

Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>

* fix.

Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>

* fix.

Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>

* fix based on comments

Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>

* fix figure links.

Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>

---------

Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>
Co-authored-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>
2025-05-23 13:54:21 +08:00
bhsueh_NV
d69c662215
[Fix][Qwen3] fix bug of qwen3 fp4 workflow with EP (#4575)
* fix bug of qwen3 fp4 workflow with EP

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

* fix bug of qwen3_moe with ep

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

---------

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-05-23 13:34:05 +08:00
coldwaterq
1cf0e672e7
fix: [nvbugs/5066257] serialization improvments (#3869)
* added a restricted pcikler and depickler in a sepparate serialization function.

Signed-off-by: coldwaterq@users.noreply.github.com <coldwaterq@users.noreply.github.com>

* updated IPC to remove approved classes, removed the serialization function because it didn't work for all objects that made debugging harder, added tests.

Signed-off-by: coldwaterq@users.noreply.github.com <coldwaterq@users.noreply.github.com>

* removed LLM arg and moved class registration to a serialization module function. Also added missing classes to approved list.

Signed-off-by: coldwaterq <coldwaterq@users.noreply.github.com>

* cleaned up a couple files to reduce conflicts with main.

Signed-off-by: coldwaterq <coldwaterq@users.noreply.github.com>

* fix unit tests

Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>

* reorder BASE_ZMQ_CLASSES list alphabetically

Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>

* fix tests and move LogitsProcessor registration to base class

Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>

* revert changes to import log of tensorrt_llm._torch.models

Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>

* added comments to explain why BASE_ZMQ_CLASSES has to be passed into spawned child processes

Signed-off-by: coldwaterq <coldwaterq@users.noreply.github.com>

* fix tests and move LogitsProcessor registration to base class

Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>

* additional comments for multiprocess approved list sync

Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>

* add dataclass from tests

Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>

---------

Signed-off-by: coldwaterq@users.noreply.github.com <coldwaterq@users.noreply.github.com>
Signed-off-by: coldwaterq <coldwaterq@users.noreply.github.com>
Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
Co-authored-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
2025-05-23 13:06:29 +08:00
djns99
87f734b563
[https://nvbugs/5297775] fix: Correct memory guard for large MOE tests to account for TP space (#4553)
fix: Correct memory guard for large MOE tests to account for TP space

Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>
2025-05-23 14:57:49 +12:00
Yuxian Qiu
38241b2346
fix: Fix moe_ep_groups/moe_cluster_groups in Mapping. (#4555)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-05-23 10:41:49 +08:00
CarstyYou
ef280e687e
[feat] support fp8 blockscale gemm on sm89 (#4481)
* [feat] integrate ada blockwise gemm

Signed-off-by: CarstyYou <xiy@nvidia.com>

* [fix] align scale M

Signed-off-by: CarstyYou <xiy@nvidia.com>

* [feat] swizzle mma output

Signed-off-by: CarstyYou <xiy@nvidia.com>

* [test] add ut for sm89

Signed-off-by: CarstyYou <xiy@nvidia.com>

* [delete] remove useless comments

Signed-off-by: CarstyYou <xiy@nvidia.com>

* [chore] codestyle

Signed-off-by: CarstyYou <xiy@nvidia.com>

* [fix] fix review comments

Signed-off-by: CarstyYou <xiy@nvidia.com>

* [chore] fix license

Signed-off-by: CarstyYou <xiy@nvidia.com>

* [chore] fix license

Signed-off-by: CarstyYou <xiy@nvidia.com>

---------

Signed-off-by: CarstyYou <xiy@nvidia.com>
Co-authored-by: bhsueh_NV <11360707+byshiue@users.noreply.github.com>
2025-05-23 10:39:10 +08:00
Enwei Zhu
d7443b6068
[https://nvbugspro.nvidia.com/bug/5181262] [test] Unwaive Mistral Nemo test (#4515)
unwaive

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-05-23 10:14:00 +08:00
nv-guomingz
e3a534d0ee
chore: guardword clean for header file. (#4540)
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-05-23 10:08:14 +08:00
pcastonguay
d7d455e7ea
[feat][TRTLLM-5018] Dis serving python runtime trt backend (#4243)
* feat: Enabling dis serving with TRT backend with Python runtime

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

* Fixing formatting

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

* Fixing disagg mtp test

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

---------

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2025-05-22 22:01:06 -04:00
Kunyao Wu
60a6c20174
Scaffoldingllm supports MCP (#4410)
* support mcp

# Conflicts:
#	tensorrt_llm/scaffolding/worker.py

Signed-off-by: wu1du2 <wu1du2@gmail.com>

* move all into contrib/mcp

# Conflicts:
#	examples/scaffolding/contrib/mcp/mcptest.py
#	tensorrt_llm/scaffolding/__init__.py
#	tensorrt_llm/scaffolding/contrib/__init__.py
#	tensorrt_llm/scaffolding/contrib/mcp/__init__.py
#	tensorrt_llm/scaffolding/contrib/mcp/mcp_controller.py
#	tensorrt_llm/scaffolding/task.py
#	tensorrt_llm/scaffolding/worker.py

Signed-off-by: wu1du2 <wu1du2@gmail.com>

* support sandbox, websearch

# Conflicts:
#	examples/scaffolding/contrib/mcp/mcptest.py
#	examples/scaffolding/contrib/mcp/weather/weather.py
#	tensorrt_llm/scaffolding/contrib/mcp/mcp_controller.py
#	tensorrt_llm/scaffolding/contrib/mcp/mcp_utils.py
#	tensorrt_llm/scaffolding/contrib/mcp/mcp_worker.py
#	tensorrt_llm/scaffolding/worker.py

Signed-off-by: wu1du2 <wu1du2@gmail.com>

* remove pics

Signed-off-by: wu1du2 <wu1du2@gmail.com>

* pre-commit fix

# Conflicts:
#	tensorrt_llm/scaffolding/contrib/mcp/__init__.py
#	tensorrt_llm/scaffolding/contrib/mcp/mcp_utils.py
#	tensorrt_llm/scaffolding/contrib/mcp/mcp_worker.py

Signed-off-by: wu1du2 <wu1du2@gmail.com>

* fix spell

Signed-off-by: wu1du2 <wu1du2@gmail.com>

* rebase

Signed-off-by: wu1du2 <wu1du2@gmail.com>

---------

Signed-off-by: wu1du2 <wu1du2@gmail.com>
2025-05-23 01:54:49 +00:00
dongxuy04
338744fba6
fix[nvbug-5295425]: [TRTLLM-5385] fix race condition in MoeLoadBalancer (#4573)
fix moe possible race cond and add bypass worker thread for no updates

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-05-23 09:24:23 +08:00
QI JUN
1e55d616da
Chore: clean up _gather_dp_requests_num method of PyExecutor (#4571)
clean up _gather_dp_requests_num method of PyExecutor

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-05-23 08:37:39 +08:00
nv-guomingz
3549b68c1c
chroe:clean useless flag (#4567)
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-05-23 07:05:15 +08:00
Mike Iovine
9c0de251db
[feat] Integrate Hopper chunked attention kernels (#4330)
* Integrate chunked attention kernels

Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>

* Fix cache key

Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>

* Fix lint

Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>

---------

Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-05-22 17:10:57 -04:00
Mike Iovine
14fc48ada7
[nvbug/5285881][fix] Fix chunked prefill + overlap scheduler (#4402)
[fix] Fix chunked prefill + overlap scheduler

Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-05-23 04:38:22 +08:00
Venky
c713eb5799
test(perf): Add Llama-3_1-Nemotron-Ultra-253B-v1 perf tests (cpp) (#4446)
ultra

Signed-off-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com>
2025-05-22 13:07:33 -07:00
Robin Kobus
e5c90883a9
fix: Move cv2 import to load_video function (#4541)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-05-22 17:56:07 +02:00
Chuang Zhu
558eaecf16
fix sequence data race (#4565)
stash for debug broken promise

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-05-22 23:13:48 +08:00
QI JUN
1e5d526db4
Chore: clean up _merge_dummy_request method of PyExecutor (#4438)
* clean up _merge_dummy_request method of PyExecutor

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix ci

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* clean

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* update comment

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

---------

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-05-22 18:19:07 +08:00
xinhe-nv
22c01d5b21
test: [CI] Add failed cases into waives.txt (#4549)
* update waive list

Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>

* fix test issues

Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>

---------

Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
2025-05-22 17:18:53 +08:00
ruodil
1a45890dae
test: waive hanging cases for perf test (#4562)
waive hanging cases

Signed-off-by: Ruodi <200874449+ruodil@users.noreply.github.com>
2025-05-22 15:50:05 +08:00
Chuang Zhu
3410508020
cache_transceiver_config (#4556)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-05-22 13:59:51 +08:00
Iman Tabrizian
e741d2b8d0
Add tritonrelease container (#4455)
* Add tritonrelease container

Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>

* Review comments

Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>

* Update docker/Makefile

Co-authored-by: Martin Marciniszyn Mehringer <11665257+MartinMarciniszyn@users.noreply.github.com>
Signed-off-by: Iman Tabrizian <10105175+Tabrizian@users.noreply.github.com>

---------

Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
Signed-off-by: Iman Tabrizian <10105175+Tabrizian@users.noreply.github.com>
Co-authored-by: Martin Marciniszyn Mehringer <11665257+MartinMarciniszyn@users.noreply.github.com>
2025-05-21 23:47:50 -04:00
Kaiyu Xie
2898d268f9
feat: add health_generate route to openai serving (Cherry-pick https://github.com/NVIDIA/TensorRT-LLM/pull/3856) (#4349)
Cherry-pick https://github.com/NVIDIA/TensorRT-LLM/pull/3856

Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Co-authored-by: Dhruv Singal <dhruvsingalabc@gmail.com>
2025-05-22 11:46:06 +08:00
HuiGao-NV
bc9f1dbede
fix[nvbug-5228840]: Remove test cases of feature not supported anymore (#3972)
* Remove waived cases
* Remove test cases of not supported feature

Signed-off-by: Hui Gao <huig@nvidia.com>
2025-05-22 11:18:58 +08:00
Aurelien Chartier
f491244c84
feat: add dataset support for benchmark_core_model with LLMAPI (#4457)
* feat: add dataset support for benchmark_core_model with LLMAPI

Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
2025-05-21 19:18:43 -07:00
Kaiyu Xie
099cd3ce07
chore: Add all_reduce.py benchmark script to test (#4537)
Add all_reduce.py script to test

Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-05-22 10:13:27 +08:00
Michal Guzek
9033dd987d
[TRTLLM-4932] Add CLI accuracy tests for Phi-4-mini-instruct (#4415)
Add phi-4-mini CLI acc test

Signed-off-by: moraxu <mguzek@nvidia.com>
2025-05-22 09:56:48 +08:00
Yan Chunwei
4798d088d9
chore: Partition LlmArgs into TorchLlmArgs and TrtLlmArgs (#3823)
* partition LlmArgs

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

* update backend

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

---------

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-05-22 09:40:56 +08:00
Chuang Zhu
44cfd757b2
Agent interface impl for NIXL (#4125)
* agentConnection

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

recv

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

agentState

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

NIXL interfaces

Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>

update cmakelists

Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>

nixl improve

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

remove cppzmq

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

fix

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

transferAgent remove register

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

work for cache Test

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

reduce sleep time

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

fix test

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

intergarte

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

nixl env

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

fix rebase error

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

cpp test

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

stash for send metaData

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

loadRemoteMD after fetchRemoteMD

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

workaround for mixed gen and context

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

test_env

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

avoid port conflict in test

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

* format

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

* use std::string

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

* typo

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

* fix transferAgentTest

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

---------

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-05-22 09:09:41 +08:00
Aurelien Chartier
1681e9fd1e
chore: remove extra PYTHONPATH (#4453)
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
2025-05-21 17:38:01 -07:00
Nikita Korobov
e1b42be3d1
fix: TRT-LLM Gen dtype declaration (#4503)
Signed-off-by: Nikita Korobov <nkorobov@nvidia.com>
2025-05-21 23:56:37 +02:00
Dom Brown
1cffa99792
test: Split test_simple into mpi_utils and cache transceiver tests for DGX (#4451)
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
2025-05-22 04:26:21 +08:00
Zongfei Jing
dbaddb3a29
Adding two-shot allreduce kernel and mnnvl multicasting buffer (#4216)
* Adding two-shot allreduce kernel and mnnvl multicasting buffergit gffe

Signed-off-by: Shiyu Li <shili@nvidia.com>

Adding comments

Signed-off-by: Shiyu Li <shili@nvidia.com>

Add unittest of the twoshot kernel.

Signed-off-by: Shiyu Li <shili@nvidia.com>

Update dispatch logic

Signed-off-by: Shiyu Li <shili@nvidia.com>

Use cpu barrier instead of GPU at init

Signed-off-by: Shiyu Li <shili@nvidia.com>

Merge dispatch logic fix

Signed-off-by: Shiyu Li <shili@nvidia.com>

Update the kernel to use GPU-managed buffer

Signed-off-by: Shiyu Li <shili@nvidia.com>

* Refine

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Clean code

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Fix compile error

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Fix issue

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Clean up

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Simplify AllReduce interface

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Rename

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Fix warning

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Tidy code

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Rename

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Fix compile error

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Refine

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Skip ut for no_fusion

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Refine

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

---------

Signed-off-by: Shiyu Li <shili@nvidia.com>
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
Co-authored-by: Shiyu Li <shili@nvidia.com>
2025-05-22 03:42:36 +08:00
Venky
0a8461d54c
test(perf): Pt.2 Add Llama-3_3-Nemotron-Super-49B-v1 integration-perf-tests (cpp) (#4499)
add low concurrency perf tests

Signed-off-by: Venky <23023424+venkywonka@users.noreply.github.com>
2025-05-21 10:46:48 -07:00
Kevin Chen
b80b78f87c
Add pytorch backend team (#4405)
* Add pytorch backend team

Signed-off-by: Kevin Chen 

* Update .github/CODEOWNERS

Co-authored-by: Yanchao Lu 
Signed-off-by: juney-nvidia <143764042+juney-nvidia@users.noreply.github.com>

---------

Signed-off-by: Kevin Chen 
Signed-off-by: juney-nvidia <143764042+juney-nvidia@users.noreply.github.com>
Co-authored-by: juney-nvidia <143764042+juney-nvidia@users.noreply.github.com>
Co-authored-by: Yanchao Lu
2025-05-21 21:10:35 +08:00
nv-guomingz
3b12e460e7
chore: clean ucx and nixl mirror. (#4531)
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-05-21 19:45:20 +08:00
Robin Kobus
cd0c826417
refactor: DisaggExecutorTest (#4398)
* chore: Improve formatting of DisaggExecutorTest

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Typed InstanceRole param in DisaggExecutorTest

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Skip DisaggExecutorTest based on device count

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

---------

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-05-21 18:01:45 +08:00