Commit Graph

557 Commits

Author SHA1 Message Date
Ivy Zhang
47d2f16bb8
waive gemma on L20 (#3767)
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
2025-04-22 17:52:49 +08:00
ruodil
9223000765
waive failed case in perf test, change default max_batch_size to 512 and write config.json to output log (#3657)
Signed-off-by: Ruodi <200874449+ruodil@users.noreply.github.com>
Signed-off-by: Larry <197874197+LarryXFly@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
2025-04-22 14:51:45 +08:00
Enwei Zhu
353699a3b3
fix: fnmatch usage in modeling_utils.py (#3754)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-04-22 13:13:53 +08:00
Robin Kobus
8340657ae4
refactor: Introduce DecoderOutputBuffers per batch (#3506)
* refactor: Restructure DecoderBuffers and DecoderStepAsyncSend

- Move communication logic from `DecoderBuffers` to `DecoderStepAsyncSend`.
- Updated `DecoderStepAsyncSend` constructor to utilize the `DecoderBuffers`, enhancing clarity and reducing parameter complexity.
- Refactored related methods to align with the new class structure, improving maintainability and readability of the code.

These changes streamline the handling of decoding buffers and improve the overall architecture of the batch manager.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Restructure SlotDecoderBuffers and DecoderSlotAsyncSend

- Move communication logic from `SlotDecoderBuffers` to `DecoderSlotAsyncSend`.
- Updated `DecoderSlotAsyncSend` constructor to utilize the `SlotDecoderBuffers`, enhancing clarity and reducing parameter complexity.
- Refactored related methods to align with the new class structure, improving maintainability and readability of the code.

These changes enhance the structure and readability of the batch manager's decoding process.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* chore: Log DecodingMode

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Introduce DecoderOutputBuffers and update related classes

- Moved buffers from `DecoderBuffers` to `DecoderOutputBuffers` to better reflect its purpose.
- Updated the `DecoderStepAsyncSend` class to utilize `DecoderOutputBuffers`, enhancing clarity in the communication logic.
- Refactored the constructor and methods in `DecoderBuffers` to accommodate the new structure, improving maintainability.
- Added Python bindings for `DecoderOutputBuffers` to ensure compatibility with existing interfaces.

These changes streamline the handling of output buffers in the decoding process, improving the overall architecture of the batch manager.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Update MPI communicator handling

- Changed the `commSession` parameter type from `std::shared_ptr<mpi::MpiComm>` to `mpi::MpiComm` in `DecoderStepAsyncSend` and `DecoderSlotAsyncSend` classes for improved clarity and reduced complexity.
- Updated related methods and constructors to reflect the new parameter type, enhancing maintainability.
- Refactored the `TrtGptModelInflightBatching` class to accommodate these changes, ensuring consistent usage of `MpiComm`.

These modifications streamline the communication logic in the decoding process, improving the overall architecture of the batch manager.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Replace shared_ptr with unique_ptr for buffer management

- Updated the `TrtGptModelInflightBatching` class to use `std::unique_ptr` instead of `std::shared_ptr` for various buffer types, including `AllReduceBuffers`, `RuntimeBuffers`, `DecoderBuffers`, and `SlotDecoderBuffers`.
- This change enhances memory management and ownership semantics, reducing overhead and improving performance.

These modifications contribute to a cleaner and more efficient architecture in the batch manager.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

---------

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-04-22 12:25:53 +08:00
xinhe-nv
ba216341f4
update waive list (#3683)
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
2025-04-22 11:09:41 +08:00
Yi Zhang
98966cb45e
test: Unwaive Llama 3.1 with torch compile test (#3475)
* Fix log info

Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>

* Revert "test: Waive torch compile tests (#3471)"

This reverts commit 410f56357e.

Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>

* Update test_llm_api_pytorch.py

Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>

---------

Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
2025-04-22 10:41:56 +08:00
Kaiyu Xie
a32389b4cd
fix: Remove unnecessary max call (#3574)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-04-22 10:33:50 +08:00
rakib-hasan
74c13ea84f
datasets API change : datasets.load_metric => evaluate.load (#3741)
Signed-off-by: Rakib Hasan <rhasan@nvidia.com>
2025-04-22 08:23:48 +08:00
Enwei Zhu
3fa19ffa4e
test [TRTLLM-4477,TRTLLM-4481]: Accuracy test improvement (Part 3.5): Support GSM8K and GPQA (#3483)
* add gsm8k

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* fix gsm8k

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* add gpqa

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* conditional import lm_eval

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* gpqa in lm_eval

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* system prompt

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* shuffle

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* update AA prompt and regex

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* revert AA prompt and regex

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* integration to tests

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* fix

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* add DS-R1

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* fix and clean

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* fix

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* update tests

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* update

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* clean up

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* free_gpu_memory_fraction=0.8

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* fix

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

---------

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-04-22 07:38:16 +08:00
bhsueh_NV
0c07d4dc21
Fix/executor bugs (#3681)
* fix bugs of py executor

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

* fix bugs of py executor

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

* revert changes about mpi_barrier()

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

---------

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-04-22 07:23:27 +08:00
Kaiyu Xie
943f3ff8f6
Revert "Report number of context tokens in one iteration (#3691)" (#3740)
This reverts commit e0446a4dc0.

Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-04-22 01:21:43 +08:00
Yan Chunwei
231b39015c
unwaive multi_node test (#3715)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-04-21 21:26:07 +08:00
Barry Kang
d87b009d8d
Fix ModelOpt Mixtral AWQ OOM (#3714)
Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
2025-04-21 19:14:14 +08:00
Zheng Duan
ae48abefc1
bind block key and hasher (#3712) 2025-04-21 18:50:57 +08:00
Iman Tabrizian
af04b6f6aa
bug: Fix hang bug when context server doesn't have enough capacity for KV Cache (#3095)
* Fix hang bug when KV cache is low

Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>

* Review comments

Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>

* Fix attentiondp typo

Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>

* Add CI test for this case

Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>

* fix: Fix the insertion order for responder futures

Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>

* fix: Fix disagg CPP

Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>

---------

Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-04-21 15:16:55 +08:00
Stanley Sun
852dd0c1be
test: add llama3.2 ptp test case (#3363)
* add llama3.2 ptp test case

Signed-off-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com>

* update test list

Signed-off-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com>

---------

Signed-off-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com>
2025-04-21 15:15:45 +08:00
Jinyang Yuan
bc2b01d1dd
chore: update FMHA cubin files (#3680)
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
2025-04-21 15:04:04 +08:00
Zhenhuan Chen
2672f13d77
test: fix cublas_scaled_mm with aligned workspace size (#3600)
Signed-off-by: Zhenhuan Chen <chenzhh3671@gmail.com>
2025-04-21 14:51:42 +08:00
katec846
eeb605abd6
feat: Offloading Multimodal embedding table to CPU in Chunked Prefill Mode (#3380)
* Feat: Offload ptable to cpu if enable_chunk_context

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Feat: offload ptable to cpu for chunk context mode

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Fix and add comment

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Update Readme for multimodal and add a new param mm_embedding_offloading

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* fix: Correct prompt table offloading condition in PromptTuningBuffers

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Clean up the code

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Add commits to explain copy from cpu <-> gpu using pinned memory

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Fix namings based on comments

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Fix format based on precommit

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Modify --mm_embedding_offloading flag

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

---------

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>
2025-04-21 14:31:01 +08:00
yuxianq
faef37782a
fix: Remove ParallelConfig. (#3678)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-04-21 14:14:08 +08:00
HuiGao-NV
e0446a4dc0
Report number of context tokens in one iteration (#3691)
Report number of context tokens in one iteration
2025-04-21 13:45:28 +08:00
yuxianq
591f3d2be8
fix: Support TLLM_OVERRIDE_LAYER_NUM for llama4. (#3679)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-04-21 12:28:56 +08:00
liji-nv
a51f7559a3
fix: update test_user_buffers_mm_add_prologue atol (#3711) (#3713)
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-04-21 11:24:20 +08:00
Yiqing Yan
6f7f262779
Waive L0 tests (#3709)
* Waive L0 tests

Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>

* the test is fixed in PR 3711

Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>

---------

Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-04-21 11:24:00 +08:00
hlu1
31624b079a
feat: [Deepseek] Add trtllm-gen MOE FP4 MOE backend (#3387)
* Add TRT-LLM Gen MOE to Deepseek

fix fused moe rebase bug.

Fix atol in test_fp4_gemm_quantize.py

fix fused moe rebase bug.

Fix FusedMoe.

Disable 2nd routing kernel preexit

Bump routing reduction to fp32

Disable PDL for fc1

[DEBUG] Lift token limit to 16k

[Bugfix] Token limit to 16k + fp32 routing + tanh

Make fp8 tileN 8

Fix FP8 MoE + Remove redundent temp output for FP4

[FP8-only] Avoid wasting CTAs for activation kernel

fix: unblock FP8 weightloading with trtllm-gen

Remove max_token limit for trtllm-gen path

perf: avoid type-conversion and fill_ from aten

Minor fix

Signed-off-by: Hao Lu <haolu@nvidia.com>

* Fix rebase issues

Signed-off-by: Hao Lu <haolu@nvidia.com>

* Fix compile issue

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* CI clean

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

---------

Signed-off-by: Hao Lu <haolu@nvidia.com>
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
Co-authored-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
2025-04-21 10:01:33 +08:00
Emma Qiao
48db263d9a
infra: Add test list name check (#3097)
* Add steps to check test names

Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>

* Correct test-db command

Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>

* Switch to use a trt-llm image

Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>

* Update go path

Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>

* Correct go path

Signed-off-by: qqiao <qqiao@nvidia.com>
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>

* Move the test list check to test ci

Signed-off-by: qqiao <qqiao@nvidia.com>
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>

* Correct file path

Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>

* Fix path again

Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>

* Fix get path

Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>

* Fix typo

Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>

* Skip test list check for ARM

Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>

* Fix expression

Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>

* Change back unrelated file

Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>

* Correct qa test names

Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>

* Remove a stage

Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>

* Update jenkins/L0_Test.groovy

Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
Signed-off-by: Emma Qiao <qqiao@nvidia.com>
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>

* Move some steps to a python script

Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>

* Fix script path

Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>

* Split commands and debug

Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>

* Fix typo

Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>

* Fix typo

Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>

* Also correct case name in waives list

Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>

* Move check script to another folder

Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>

* Update qa list after rebase

Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>

* Fix rebase

Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>

* Remove the perf tests under QA

Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>

* Some tests already fixed after rebase to TOT

Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>

---------

Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
Signed-off-by: qqiao <qqiao@nvidia.com>
Signed-off-by: Emma Qiao <qqiao@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-04-20 23:02:16 +08:00
Naveassaf
f7c2eb4fa2
Update Nemotron Super and Ultra in Supported Models and add an example (#3632)
* Update Nemotron Super and Ultra in Supported Models and add an example

Signed-off-by: Nave Assaf <nassaf@nvidia.com>

* Update README link to match new examples structure

Signed-off-by: Nave Assaf <nassaf@nvidia.com>

---------

Signed-off-by: Nave Assaf <nassaf@nvidia.com>
2025-04-20 21:14:33 +08:00
hlu1
17eba98445
Refactor Deepseek tp_size calculation (#3695)
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>
2025-04-19 23:55:19 -07:00
QI JUN
d51ae53940
move the reset models into examples/models/core directory (#3555)
* move rest models to examples/models/core directory

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* update multimodal readme

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix example path

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix ci

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix ci

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix cpp test

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix tensorrt test

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix ci

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix ci

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix ci

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix ci

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix ci

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix ci

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix ci

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix ci

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix ci

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix ci

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix ci

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix ci

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix ci

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix ci

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix ci

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix ci

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix ci

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix ci

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix ci

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix ci

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

---------

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-04-19 20:48:59 -07:00
brb-nv
c35d2a7532
test: Get Eagle tests working (#3593)
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-04-20 00:50:57 +08:00
nv-guomingz
e70961f541
test:update waives.txt for nvbug 5219532 (#3672)
Signed-off-by: nv-guomingz <37257613+nv-guomingz@users.noreply.github.com>
Co-authored-by: nv-guomingz <37257613+nv-guomingz@users.noreply.github.com>
2025-04-19 18:57:39 +08:00
yuxianq
5346f53250
feat: Introduce feature properties for attention backend. (#3659)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-04-19 12:37:27 +08:00
Iman Tabrizian
61ee983488
fix: Fix disaggregated load balance test (#3689)
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-04-19 10:40:40 +08:00
hlu1
c861b6cf17
Clean up modeling_deepseek.py (#3640)
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>
2025-04-18 17:54:33 -07:00
Iman Tabrizian
a2f190f306
chore: Waive disaggregated load balance (#3687)
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-04-18 16:04:33 -07:00
Yechan Kim
5460d18b10
feat: trtllm-serve multimodal support (#3590)
* feat: trtllm-serve multimodal support

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

* remove disable argument

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

* remove disable

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

* add and separate tests and move the doc

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

* remove block_resue arg from serve.py

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

---------

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>
2025-04-19 05:01:28 +08:00
mayani-nv
ce8329646f
Update run.py for draft_target_model (#3615)
This change makes the draft target model works without mismatch in the vocab size

Signed-off-by: mayani-nv <67936769+mayani-nv@users.noreply.github.com>
Co-authored-by: rakib-hasan <rhasan@nvidia.com>
2025-04-19 01:01:50 +08:00
pcastonguay
ae5671644a
feat: Disaggregated router class (#3584)
* Add draft scheduler class

Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>

* Refactor the design

Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>

* feat: Introduce router class for disaggregated server

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

* Add unit tests for router class

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

* Adding tests for disagg_utils

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

* Fixing missing import

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

* Fixing disagg integration tests

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

* Addressing MR review comments

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

---------

Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
Co-authored-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
2025-04-19 00:34:12 +08:00
QI JUN
b9fce42717
enable test_ptp_quickstart_advanced_mixed_precision (#3667)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-04-18 05:06:24 -07:00
Zheng Duan
bce7ea8c38
test: add kv cache event tests for disagg workers (#3602) 2025-04-18 18:30:19 +08:00
Yan Chunwei
2a09826ec4
fix hmac in remote mpi session (#3649)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
Co-authored-by: Tao Li @ NVIDIA <tali@nvidia.com>
2025-04-18 17:47:51 +08:00
HuiGao-NV
d3608d6818
Remove dummy forward path (#3669)
Remove dummy forward path
2025-04-18 16:17:50 +08:00
Dom Brown
dbd9a83b0d
feat: Integrate GPUDirect Storage (GDS) into Executor API (#3582)
* feat: Integrate GPUDirect Storage (GDS) into Executor API

Squash of several dev commits

Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
2025-04-18 15:59:21 +08:00
Zheyu Fu
90a28b917f
feat: Add Dynasor-CoT in scaffolding examples. (#3501)
Signed-off-by: Zheyu Fu <zheyufu2@gmail.com>
Co-authored-by: Junda Chen <32371474+GindaChen@users.noreply.github.com>
Co-authored-by: Yichao Fu <57950249+fuyichao2000@users.noreply.github.com>
Co-authored-by: Andy Dai <zhongdongmin@nvidia.com>
2025-04-18 07:48:01 +00:00
Erin
4fedf0be5c
unwaive test for nvbug_5150466 (#3552)
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
2025-04-18 15:15:58 +08:00
Yuan Tong
0b0e6d8a0a
refactor: Clean up CMakeLists.txt (#3479)
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
2025-04-18 14:39:29 +08:00
Emma Qiao
2f48985b9c
infra: Add step to generate new duration file (#3298)
* Add step to generate new duration file

Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>

* Install python in earlier step

Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>

* Clone repo and add debug info

Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>

* Remove debug info and only generate duration for post-merge

Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>

* Test for the new duration file

Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>

* Update the duration file format

Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>

* Move generate_duration.py to scripts folder and add try-catch avoiding any broken

Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>

---------

Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
2025-04-18 12:56:31 +08:00
peaceh-nv
88cff61fa1
chore : Split more tests out of gpt tests (#3524)
Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>
2025-04-18 12:04:57 +08:00
dongfengy
b71a0f76b4
test: Add llama 4 to ci (#3520)
* Add llama 4 to ci

Signed-off-by: Dongfeng Yu <dongfengy@nvidia.com>

* Only test trtllm

Signed-off-by: Dongfeng Yu <dongfengy@nvidia.com>

* Disable marverick

Signed-off-by: Dongfeng Yu <dongfengy@nvidia.com>

---------

Signed-off-by: Dongfeng Yu <dongfengy@nvidia.com>
2025-04-18 11:25:52 +08:00
Iman Tabrizian
fc88d67675
chore: Refactor test_disaggregated.py (#3154)
* Refactor test_disaggregated.py

Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>

* Address review comments

Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>

* Remove waived tests

Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>

* fix: Fix streaming endpoint chat completions

Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>

---------

Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-04-18 11:04:06 +08:00