TensorRT-LLMs

mirror of https://github.com/NVIDIA/TensorRT-LLM.git synced 2026-01-14 06:27:45 +08:00

Author	SHA1	Message	Date
Fridah-nv	a5f32f46fd	fix: [AutoDeploy] Update README.md (#3072 ) * update support matrix and add toggle list Signed-off-by: fridah <201670829+Fridah-nv@users.noreply.github.com> * Update README.md Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com> * Update README.md Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com> --------- Signed-off-by: fridah <201670829+Fridah-nv@users.noreply.github.com> Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>	2025-04-01 16:16:36 -07:00
Chang Liu	1d3a5d38af	fix: Update FP8 sf layout for Blackwell and relax blockwise GEMM assertions (#3144 ) * Update fp8 sf layout for blackwell and enable fp8 gemm e2e * Add test case when m needs to be padded * Better comment Signed-off-by: Chang Liu <liuc@nvidia.com> * Add TODO for fp8 quant kernel Signed-off-by: Chang Liu <liuc@nvidia.com> * Enable DCO check Signed-off-by: Chang Liu <liuc@nvidia.com> * Fix lint --------- Signed-off-by: Chang Liu <liuc@nvidia.com>	2025-04-01 13:08:29 -07:00
Robin Kobus	d880f4a7c6	chore: Cursor ignore cubin in headers (#3202 ) Add `*cubin.h` to ignore-file. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-04-01 23:42:19 +08:00
Enwei Zhu	b2f69db507	test: Accuracy test improvement (Part 3.1): Extend accuracy test suite with LLM API and initial implementation of `trtllm-eval` (#3167 ) * add eval_llmapi Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> tmp commit port to CLI tool Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> move Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> setup llmapi Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> fix spec_dec_algo Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> _update_from_hf_quant_config Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> fix Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> migrate test_pytorch.py Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> fix fp8 block scales Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> fix fp8 rowwise Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> adj alpha Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> move test_pytorch.py cases Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> move Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> rename test_accuracy.py to test_cli.py Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> clean Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> * fix cnn_dailymail Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> * renaming to cli flow Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> * rename MMLU Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> * rename Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> * add error Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> * fix Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> --------- Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-04-01 22:20:29 +08:00
amirkl94	bf02b9144f	feature: Add LoRA support for gemma (#3068 ) Signed-off-by: Amir Klein <203507526+amirkl94@users.noreply.github.com>	2025-04-01 19:15:55 +08:00
Robin Kobus	d7386d14a8	refactor: Simplify disableLookahead and improve numDecodingEngineTokens handling (#3103 ) * refactor: Simplify disableLookahead method Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * Update DecoderBuffers comments Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Move numDecodingEngineTokens to DecoderState This commit introduces new methods in the DecoderState class to manage the number of tokens for each request in a batch. The following changes were made: - Added `getNumDecodingEngineTokens()` to retrieve the number of tokens for all requests. - Added `getNumDecodingEngineTokens(SizeType32 batchIdx)` to get the token count for a specific request. - Added `setNumDecodingEngineTokens(SizeType32 batchIdx, SizeType32 numTokens)` to set the token count for a specific request. - Updated the setup method to initialize the token count vector based on the maximum batch size. - Refactored the `CreateNewDecoderRequests` class to utilize the new token management methods, improving clarity and maintainability. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Improve shape variables in DecoderState Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> --------- Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-04-01 18:47:31 +08:00
WeiHaocheng	ff35af77ea	feat: refactor scaffolding worker and support openai api worker (#3166 ) Signed-off-by: Fred Wei <20514172+WeiHaocheng@users.noreply.github.com> Signed-off-by: fredw <20514172+WeiHaocheng@users.noreply.github.com>	2025-04-01 18:31:52 +08:00
bhsueh_NV	d34202273b	fix bug of glm-4-9b ci (#3184 ) bug nvbug_5196515 Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>	2025-04-01 16:58:42 +08:00
Yiteng Niu	c725f1043f	update user list (#3193 ) Signed-off-by: Yiteng Niu <6831097+niukuo@users.noreply.github.com>	2025-04-01 16:41:15 +08:00
Jinyang Yuan	992d513bc6	feat: Optionally split MoE inputs into chunks to reduce GPU memory usage (#3104 ) Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com> Co-authored-by: raccoonliukai <raccoonliu@tencent.com>	2025-04-01 16:07:02 +08:00
brb-nv	727d78e785	Support prequantized fp8 ckpt for nemotron-mini-4b-instruct (#3046 ) Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>	2025-04-01 14:52:09 +08:00
Yan Chunwei	7575dd00e7	add slurm script examples for llm-api (#3135 ) Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>	2025-04-01 14:31:57 +08:00
Yuan Tong	2994527110	chore: cutlass cleanup (#3165 ) Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>	2025-04-01 13:57:38 +08:00
dongjiyingdjy	22ff81b047	fix: fix illegal memory access when mtp >= 2 (#3006 ) * fix - fix illegal memory access when mtp > 2 --------- Signed-off-by: Jiying Dong <87510204+dongjiyingdjy@users.noreply.github.com> Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com> Co-authored-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>	2025-04-01 13:36:45 +08:00
QI JUN	75495730bc	Revert "refactor: Replace DecoderFinishedEvent with CudaEvent in decoder clas…" (#3183 ) This reverts commit `3ee4332fb1`. Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>	2025-04-01 12:49:27 +08:00
Shunkangz	dda7354d1a	Refactor return of first gen token in PD (#2986 ) Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>	2025-04-01 12:28:27 +08:00
brb-nv	1901bfcf76	test: Add Eagle tests with untrained heads (#2991 ) Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>	2025-04-01 11:41:59 +08:00
jiahanc	c4ee14e43a	fix: Reverse cuda graph size order (#3116 ) Signed-off-by: jiahanc <jiahanc@nvidia.com>	2025-04-01 11:28:36 +08:00
Erin	68bcd0ac07	doc: update README (#3162 ) Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>	2025-04-01 10:37:06 +08:00
Aurelien Chartier	14e194433c	chore: cleanup py_executor code (#3132 ) * chore: cleanup py_executor code * Add common loop cleanup function * Remove checks for attention DP if nothing to queue * Remove extra return statements * Remove extra variables * Remove commented debug print Signed-off-by: Aurelien Chartier <achartier@nvidia.com> * rename cleanup function Signed-off-by: Aurelien Chartier <achartier@nvidia.com> --------- Signed-off-by: Aurelien Chartier <achartier@nvidia.com>	2025-04-01 09:27:04 +08:00
Anurag Mukkara	435cd2983d	perf: Optimisations for PP + attention DP (#3134 ) * Minor tp_rank fix Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com> * Delete unused function Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com> * PP broadcast for ADP new requests Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com> * Sync request finish point for intermediate and last pp ranks Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com> * Use local PP layers only for KV cache estimation Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com> --------- Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>	2025-04-01 08:59:16 +08:00
Frank	8bb3eea285	perf: Readd iteration logging for trtllm-bench. (#3039 ) Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>	2025-04-01 08:13:09 +08:00
Iman Tabrizian	e8731ba3b7	fix: disable cuda graph and MTP for overlap tests (#3155 ) Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>	2025-03-31 11:35:35 -07:00
WeiHaocheng	f665f83256	feat: improve scaffolding shutdown process (#3084 )	2025-03-31 20:39:20 +08:00
Zhanrui Sun	36ac5e78ed	chore: bump version to 0.19.0.dev2025040100 (#3152 ) Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>	2025-03-31 16:36:06 +08:00
Quanfeng Li	839aad4d6e	fix: Add missing parameter for WeightOnlyQuantRowLinear module (#2768 ) Signed-off-by: Quanfeng Li <liquanfeng7@foxmail.com>	2025-03-31 16:20:30 +08:00
QI JUN	9560fcd5ec	Chore: waive tests and fix multi-GPU tests (#3157 ) * waive tests Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * update Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * clean up Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> --------- Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>	2025-03-31 16:05:45 +08:00
bhsueh_NV	322ac565fc	chore: clean some ci of qa test (#3083 ) * move some models to examples/models/contrib Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> * update the document Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> * remove arctic, blip2, cogvlm, dbrx from qa test list Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> * remove tests of dit, mmdit and stdit from qa test Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> * remove grok, jais, sdxl, skywork, smaug from qa test list Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> * re-organize the glm examples Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> * fix issues after running pre-commit Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> * fix some typo in glm_4_9b readme Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> * fix bug Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> --------- Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>	2025-03-31 14:30:41 +08:00
Zhanrui Sun	1e1116ccfc	infra: Switch to urm.nvidia.com as a WAR for urm-rn.nvidia.com connection issue Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>	2025-03-31 13:05:29 +08:00
xinhe-nv	86f3b59f81	update waive list (#3094 ) Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com> Co-authored-by: larryx <larryx@nvidia.com>	2025-03-31 11:42:45 +08:00
liji-nv	e0d0dde058	None - Add one-shot version for UB AR NORM FP16/BF16 (#2995 ) Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>	2025-03-31 11:16:03 +08:00
Yan Chunwei	794f61c997	fix: fix single-node cannot quit issue on slurm (#3140 ) Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>	2025-03-31 10:15:27 +08:00
musvaage	88e1c90fd0	doc: use alert formatting (#3153 ) Signed-off-by: musvaage <musvaage@users.noreply.github.com> Co-authored-by: musvaage <musvaage@users.noreply.github.com>	2025-03-31 07:30:52 +08:00
Yiteng Niu	3aae124a00	infra: update concurrency control (#3120 ) * update concurrency control Signed-off-by: Yiteng Niu <6831097+niukuo@users.noreply.github.com> * Update .github/workflows/blossom-ci.yml Co-authored-by: tburt-nv <195370667+tburt-nv@users.noreply.github.com> Signed-off-by: Yiteng Niu <6831097+niukuo@users.noreply.github.com> --------- Signed-off-by: Yiteng Niu <6831097+niukuo@users.noreply.github.com> Co-authored-by: tburt-nv <195370667+tburt-nv@users.noreply.github.com>	2025-03-30 23:28:50 +08:00
Mike Iovine	5416966ddb	Add initial EAGLE-3 implementation (#3035 ) Signed-off-by: Mike Iovine <miovine@nvidia.com>	2025-03-29 22:31:24 +08:00
William Tambellini	9c484b24e6	fix #3109 : early exit cmake if find_library() does not find any lib (#3113 ) Early exit if find_library() does not find any lib. As of today, the find_library_create_target() cmake macro blindly continues even if the lib is not found, adding LIB_PATH-NOTFOUND to the target and making the build fail later anyway for non-obvious reasons. This change exits early with a proper error message if the lib is not found. Fix github issue #3109 Signed-off-by: William Tambellini <wtambellini@sdl.com>	2025-03-29 19:59:03 +08:00
Erin	c75d7cd684	move BuildConfig functional args to llmargs (#3036 ) Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>	2025-03-29 02:20:18 +08:00
Robin Kobus	3ee4332fb1	refactor: Replace DecoderFinishedEvent with CudaEvent in decoder classes (#3078 ) - Updated the `forwardAsync` method in `GptDecoderBatched` and `iGptDecoderBatched` to return `CudaEvent` instead of `DecoderFinishedEventPtr`, simplifying event handling. - Removed the `DecoderFinishedEvent` class and its associated usage across various files, streamlining the codebase. - Adjusted related methods and Python bindings to accommodate the new event structure, ensuring compatibility and maintaining functionality. These changes enhance the clarity and efficiency of the decoding process in the batch manager. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-03-28 14:50:52 +08:00
Robin Kobus	45134d7095	refactor: Improve decoder finalize function (#3077 ) * refactor: Update gatherTree function to accept CUDA stream parameter This commit modifies the gatherTree function signature to include a runtime::CudaStream parameter, enhancing flexibility in stream management. Additionally, it removes unnecessary buffer manager parameters and stream handling from the function, streamlining the code. The finalize method in GptDecoderBatched is also updated to reflect these changes, improving clarity and maintainability in the decoding process. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Update GptDecoderBatched finalize This commit refactors the GptDecoderBatched class to improve method signatures and reduce code complexity: - Modified finalize method to accept DecoderState as a parameter - Updated method signatures to work with the new DecoderState approach - Improved code organization and readability The changes continue the ongoing refactoring to centralize decoder state management and simplify the decoder implementation. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> --------- Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-03-28 14:33:59 +08:00
BatshevaBlack	3e37531c6a	feat: Add BW measurement (#3070 )	2025-03-28 10:53:00 +08:00
Aurelien Chartier	3de82c41cd	Pytorch PP + attention DP support (#3044 ) Signed-off-by: Aurelien Chartier <achartier@nvidia.com>	2025-03-28 00:11:19 +08:00
Fanrong Li	ec03159e60	fix: Waive twoshot to fix acc issue (#3066 ) * waive twoshot to fix acc issue Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com> --------- Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>	2025-03-27 21:38:52 +08:00
Fanrong Li	644a01cbbe	test: Add gpqa tests for DeepSeek models (#3063 ) * Add gpqa accuracy test script * Add gpqa accuracy tests * Update DeepSeek-v3 doc * Update qa test list --------- Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>	2025-03-27 19:47:06 +08:00
Yan Chunwei	87ab794aa2	fix: fix hang in mgmn with trtllm-llmapi-launch command (#3119 ) * init Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com> * restore Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com> --------- Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>	2025-03-27 18:45:43 +08:00
Fanrong Li	0976360204	add support for MTP+cuda_graph_padding. (#3096 ) Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>	2025-03-27 16:06:14 +08:00
xiweny	6979afa6f2	test: reorganize tests folder hierarchy (#2996 ) 1. move TRT path tests to 'trt' folder 2. optimize some import usage	2025-03-27 12:07:53 +08:00
Yan Chunwei	82edd90350	fix gpus_per_node in trtllm-bench when world_size < device_count (#3007 ) Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>	2025-03-27 09:31:40 +08:00
Dom Brown	60d4dacc47	Port multi GPU changes to GitHub (#3027 ) Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>	2025-03-27 05:55:03 +08:00
Suyog Gupta	047f2b234d	perf: [AutoDeploy] Enable AutoDeploy as a backend in trtllm-bench (#3041 ) * Enable AutoDeploy as a backend in trtllm-bench Signed-off-by: Suyog Gupta <suyogg@nvidia.com> * update how caches are resized Signed-off-by: Suyog Gupta <suyogg@nvidia.com> * fix: files permission from 100755 to 100644 Signed-off-by: Suyog Gupta <suyogg@nvidia.com> * some comments Signed-off-by: Suyog Gupta <suyogg@nvidia.com> * lint Signed-off-by: Suyog Gupta <suyogg@nvidia.com> * lint Signed-off-by: Suyog Gupta <suyogg@nvidia.com> * lint Signed-off-by: Suyog Gupta <suyogg@nvidia.com> * lint Signed-off-by: Suyog Gupta <suyogg@nvidia.com> * Fix function name Signed-off-by: Suyog Gupta <suyogg@nvidia.com> * refactor Signed-off-by: Suyog Gupta <suyogg@nvidia.com> * Remove spurious change Signed-off-by: Suyog Gupta <suyogg@nvidia.com> * Add cursor generated doc strings Signed-off-by: Suyog Gupta <suyogg@nvidia.com> * re-enable ad test Signed-off-by: Suyog Gupta <suyogg@nvidia.com> * some perf cleanup Signed-off-by: Suyog Gupta <suyogg@nvidia.com> * debug ci Signed-off-by: Suyog Gupta <suyogg@nvidia.com> * ensure that overlap scheduler is enabled Signed-off-by: Suyog Gupta <suyogg@nvidia.com> * Reorder the tests Signed-off-by: Suyog Gupta <suyogg@nvidia.com> --------- Signed-off-by: Suyog Gupta <suyogg@nvidia.com> Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>	2025-03-26 14:33:14 -07:00
wili	3e035f2219	v1.2 (#3082 ) Signed-off-by: wili <wili@nvidia.com>	2025-03-26 23:31:29 +08:00

241 Commits