TensorRT-LLMs

mirror of https://github.com/NVIDIA/TensorRT-LLM.git synced 2026-01-14 06:27:45 +08:00

Author	SHA1	Message	Date
Fanrong Li	862bde99b6	draft[doc]: add mtp tech blog (#4580 ) * add mtp tech blog. Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com> * update figure size. Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com> * update the figure caption style and add some code/pr links. Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com> * fix figure captions. Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com> * fix figure size and perf data. Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com> * fix. Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com> * fix. Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com> * fix. Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com> * fix. Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com> * fix based on comments Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com> * fix figure links. Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com> --------- Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com> Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com> Co-authored-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>	2025-05-23 13:54:21 +08:00
Shi Xiaowei	a98e7ea26b	fix: replace the image links in the blog (#4489 ) Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>	2025-05-20 22:39:25 +08:00
juney-nvidia	ddf01f6266	refine doc (#4422 )	2025-05-19 06:06:22 +08:00
juney-nvidia	58e2d6ffa7	Refine doc (#4421 )	2025-05-19 06:03:05 +08:00
juney-nvidia	ac610b394a	Refine doc (#4420 )	2025-05-19 05:05:24 +08:00
Kefeng-Duan	f5b6d453aa	doc： DS r1 min latency blog (#4386 ) * add best perf practice on DSR1 Signed-off-by: Jun Yang <143764042+juney-nvidia@users.noreply.github.com> * add ds-r1 min latency tech blog Signed-off-by: Jun Yang <143764042+juney-nvidia@users.noreply.github.com> * rm redundant doc Signed-off-by: Jun Yang <143764042+juney-nvidia@users.noreply.github.com> * refine table content Signed-off-by: Jun Yang <143764042+juney-nvidia@users.noreply.github.com> * refine table content Signed-off-by: Jun Yang <143764042+juney-nvidia@users.noreply.github.com> * relative path for images Signed-off-by: Jun Yang <143764042+juney-nvidia@users.noreply.github.com> * refine precommit Signed-off-by: Jun Yang <143764042+juney-nvidia@users.noreply.github.com> * pr4280 is merged Signed-off-by: Jun Yang <143764042+juney-nvidia@users.noreply.github.com> --------- Signed-off-by: Jun Yang <143764042+juney-nvidia@users.noreply.github.com>	2025-05-16 20:20:28 +08:00
Daniel Cámpora	df19430629	chore: Mass Integration 0.19 (#4255 ) * fix: Fix/fused moe 0.19 (#3799) * fix bug of stream init Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> * fix bug Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> --------- Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> * fix: Add pre-download of checkpoint before benchmark. (#3772) * Add pre-download of checkpoint before benchmark. Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com> * Add missing remote code flag. Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com> * Move from_pretrained to throughput benchmark. Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com> * Move download and use snapshot_download. Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com> * Removed trusted flag. Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com> * Fix benchmark command in iteration log test. Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com> --------- Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com> * [https://nvbugspro.nvidia.com/bug/5241495][fix] CUDA Graph padding with overlap scheduler (#3839) * fix Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> * fuse Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> * fix Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> * fix Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> --------- Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> * TRTLLM-4875 feat: Add version switcher to doc (#3871) Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com> * waive a test (#3897) Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com> * docs:fix https://nvbugs/5244616 by removing new invalid links. (#3939) Signed-off-by: nv-guomingz <37257613+nv-guomingz@users.noreply.github.com> Co-authored-by: nv-guomingz <37257613+nv-guomingz@users.noreply.github.com> * fix: remote mpi session abort (#3884) * fix remote mpi session Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com> * fix Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com> --------- Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com> * skip fp8 gemm for pre-hopper (#3931) Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com> * [https://nvbugspro.nvidia.com/bug/5247148][fix] Attention DP with overlap scheduler (#3975) * fix Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> * update multigpu list Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> * fix namings Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> --------- Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> * Doc: Fix H200 DeepSeek R1 perf doc (#4006) * fix doc Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com> * update perf number Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com> --------- Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com> * Fix the perf regression caused by insufficient cache warmup. (#4042) Force tuning up to 8192 sequence length for NVFP4 linear op. Also, make this runtime-selectable with UB enabled. Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com> * doc: Update 0.19.0 release notes (#3976) Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com> * Optimize the AutoTuner cache access code to reduce host code overhead. (#4060) The NVFP4 Linear op is very sensitive to the host overhead. This PR introduces customizable `find_nearest_profile` and `get_cache_key_specifc`, which allow users to override the default method for generating the cache key. Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com> * Update switcher (#4098) Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com> * doc: update release notes (#4108) Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com> * docs:update 0.19 doc. (#4120) Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com> * docs:add torch flow supported model list. (#4129) Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com> * doc: Release V0.19 Perf Overview Update (#4166) Signed-off-by: zpatel <22306219+zbpatel@users.noreply.github.com> * Fix readme of autodeploy. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Update tensorrt_llm/_torch/pyexecutor/llm_request.py Co-authored-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> Signed-off-by: Daniel Cámpora <961215+dcampora@users.noreply.github.com> * Revert mgmn worker node. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Change to disable_overlap_scheduler. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> --------- Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com> Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com> Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com> Signed-off-by: nv-guomingz <37257613+nv-guomingz@users.noreply.github.com> Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com> Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com> Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com> Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com> Signed-off-by: zpatel <22306219+zbpatel@users.noreply.github.com> Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> Signed-off-by: Daniel Cámpora <961215+dcampora@users.noreply.github.com> Co-authored-by: bhsueh_NV <11360707+byshiue@users.noreply.github.com> Co-authored-by: Frank <3429989+FrankD412@users.noreply.github.com> Co-authored-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com> Co-authored-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com> Co-authored-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com> Co-authored-by: nv-guomingz <37257613+nv-guomingz@users.noreply.github.com> Co-authored-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com> Co-authored-by: jiahanc <173873397+jiahanc@users.noreply.github.com> Co-authored-by: Yukun He <23156053+hyukn@users.noreply.github.com> Co-authored-by: Zac Patel <22306219+zbpatel@users.noreply.github.com>	2025-05-16 10:53:25 +02:00
Kaiyu Xie	b4e5df0ee0	Breaking change: perf: Enable scheduling overlap by default (#4174 ) Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>	2025-05-15 14:27:36 +08:00
Dom Brown	8709fe8b53	chore: bump version to 0.19.0 (#3598 ) (#3841 ) test: add test cases for 0.19 release (#3608) * fix test name * add quickstart test for nemotron-ultra * add rcca multi-node test case for deepseek-v3 * add rcca info --------- squash (#3642) fix: nvbugs/5187237: fix deterministic mode crash (#3448) * nvbugs/5187237 nvbugs/5112075: fix deterministic mode error * remove waive * Revert "remove waive" This reverts commit 0bf5486d19906d692bfb7a6262333c296b0087ac. * revert ar fusion --------- update fp8 doc (#3647) tests: change qa perf test to trtllm-bench (#3619) fix: FP8 quantized lm_head (NvBug 5214229) (#3567) infra: Add PR approval protection for the release branch (#3634) fix: nvbugs/5231298: pytorch allreduce issue (#3673) Fix: nvbugs/5222698 variable not defined (#3630) * Fix: nvbugs/5222698 variable not defined * Tidy code --------- test:sync waives.txt from main branch by disabling test_perf/gpt_350m-cppmanager case (#3685) test:restore fp8 kv cache testing for L0 (#3671) doc: Update DeepSeek perf docs (#3693) * Update DeepSeek perf docs * update * Apply suggestions from code review --------- tests: waive test_llm_multi_node (#3664) fix: update test_user_buffers_mm_add_prologue atol (#3711) Fix: cherry-pick hmac encryption from main branch (#3635) * security fix cherry-pick changes from main * fix hmac in remote mpi session (#3649) --------- Un-waive DS-V3-Lite tests. (#3621) fix: FP8 kv accuracy (#3675) * fix FP8 kv accuracy * update doc --------- Fix script options for engines. (#3622) unwaive multi-node test (#3721) chore : Split more tests out of gpt tests (#3524) (#3674) doc:add torch examples link into torch backend documentation (#3749) test: Get Eagle tests working (#3593) (#3722) Waive L0 test (#3756) waive failed case in perf test, change default max_batch_size to 512 and write config.json to output log (#3656) Update ds v3 parameters in stress test. (#3676) waive gemma on L20 (#3766) https://nvbugs/5141291: Fix convert.py script for Qwen model. (#3758) Include Qwen2VLDecoderLayer in the smooth_qwen2_model function. fix: PP4 fixes and cleanup (#3688) remove benchmark test list (#3643) skip disagg deepseek test if sm!=90 (#3720) test: skip failed cases on B200 (#3710) * add skip condition to tests * fix error --------- test: [nvbug: 5234494] skip_pre_ada for fp8 cases (#3718) * skip_pre_ada for fp8 cases * update * update after rebase --------- add know issue to deepseek doc. (#3800) Fix ModelOpt Mixtral AWQ OOM (#3714) (#3761) Waive L0 tests (#3826) fix: Reduce memory usage in fused moe op associated with AutoTuning and fix moe fallback issue. (#3793) * Reduce memory usage in fused moe op associated with AutoTuning. * Replace pre-defined bucket size strategy with a generating function based on the tune_max_num_tokens. * Add free_memory logic of workspace in min_latency_mode fused moe path. * Fix fused_moe fallback issue. (#3652) min_latency_mode is only set to False during warmup phase. Thus when it becomes true during inference, all tactics fall back to the default one and thus cause perf regression. --------- [doc] Better document for Draft-Target-Model (DTM) speculative decoding (#3797) Fix pre-commit Fix again Address some review comments for the MI Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>	2025-04-29 16:57:22 +08:00
QI JUN	d0d19e81ca	chore: fix some invalid paths of contrib models (#3818 ) Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>	2025-04-24 05:36:16 +08:00
Kaiyu Xie	dfbcb543ce	doc: fix path after examples migration (#3814 ) Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>	2025-04-24 02:36:45 +08:00
Zongfei Jing	7eee9a9d28	doc: Update doc for Deepseek min latency (#3717 ) * Tidy code Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> * Update doc for min latency deepseek Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> * Throw exception for RouterKernel when not running on sm90+ Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> --------- Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>	2025-04-22 23:07:59 +08:00
Kefeng-Duan	67949f7c39	Update README and add benchmarking blog for DeepSeek-R1 (#3232 ) - Added a new entry in the README for the published benchmarking best practices for DeepSeek-R1. - Introduced a new blog post detailing performance benchmarking configurations and procedures for DeepSeek-R1 in TensorRT-LLM, including installation, dataset preparation, and benchmarking steps for both B200 and H200 GPUs. Signed-off-by: taoli <litaotju@users.noreply.github.com> Co-authored-by: taoli <litaotju@users.noreply.github.com>	2025-04-10 17:00:49 +08:00
Kaiyu Xie	2ea17cdad2	Update TensorRT-LLM (#2792 ) * Update TensorRT-LLM --------- Co-authored-by: jlee <jungmoolee@clika.io>	2025-02-18 21:27:39 +08:00
Kaiyu Xie	c629546ce4	Update TensorRT-LLM (#2436 )	2024-11-12 15:27:49 +08:00
Kaiyu Xie	8681b3a4c0	open source 4dbf696ae9b74a26829d120b67ab8443d70c8e58 (#2297 ) * Update TensorRT-LLM --------- Co-authored-by: Bhuvanesh Sridharan <bhuvanesh.sridharan@sprinklr.com> Co-authored-by: Qingquan Song <ustcsqq@gmail.com>	2024-10-08 12:19:19 +02:00
Kaiyu Xie	bca9a33b02	Update TensorRT-LLM (#2008 ) * Update TensorRT-LLM --------- Co-authored-by: Timur Abishev <abishev.timur@gmail.com> Co-authored-by: MahmoudAshraf97 <hassouna97.ma@gmail.com> Co-authored-by: Saeyoon Oh <saeyoon.oh@furiosa.ai> Co-authored-by: hattizai <hattizai@gmail.com>	2024-07-23 23:05:09 +08:00
Kaiyu Xie	66ef1df492	Update TensorRT-LLM (#1492 ) * Update TensorRT-LLM --------- Co-authored-by: Loki <lokravi@amazon.com>	2024-04-24 14:44:22 +08:00
石晓伟	850b6fa1e7	Update TensorRT-LLM (#1358 ) Co-authored-by: Kaiyu <26294424+kaiyux@users.noreply.github.com>	2024-03-26 20:47:14 +08:00
Kaiyu Xie	0ab9d17a59	Update TensorRT-LLM (#1055 ) * Update TensorRT-LLM --------- Co-authored-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>	2024-02-06 18:38:07 +08:00
juney-nvidia	a40dbae30d	Doc update 20240130 (#1009 ) * doc updates	2024-01-31 03:40:22 +08:00
Kaiyu Xie	deaae40bd7	Update TensorRT-LLM (#787 ) * Update TensorRT-LLM --------- Co-authored-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>	2024-01-02 17:54:32 +08:00
Kaiyu Xie	f7eca56161	Update TensorRT-LLM (#613 ) * Update TensorRT-LLM --------- Co-authored-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com> Co-authored-by: zhang-ge-hao <842720660@qq.com>	2023-12-08 17:49:24 +08:00
石晓伟	e093e48459	Update latest news (#549 )	2023-12-04 22:04:00 +08:00
Kaiyu Xie	71f60f6df0	Update TensorRT-LLM (#524 )	2023-12-01 22:27:51 +08:00
Kaiyu Xie	6755a3f077	Update TensorRT-LLM (#422 ) * Update TensorRT-LLM --------- Co-authored-by: Tltin <TltinDeng01@gmail.com> Co-authored-by: zhaohb <zhaohbcloud@126.com> Co-authored-by: Bradley Heilbrun <brad@repl.it> Co-authored-by: nqbao11 <nqbao11.01@gmail.com> Co-authored-by: Nikhil Varghese <nikhil@bot-it.ai>	2023-11-18 00:05:54 +08:00
Kaiyu Xie	ab7b4614b8	Update latest news (#378 ) * Update latest news * Minor modification	2023-11-14 17:15:36 +08:00
石晓伟	ec769d63f9	Add Latest News section (#365 )	2023-11-13 20:56:22 +08:00
石晓伟	24cf8de078	Add Latest News section (#362 ) Co-authored-by: Shi Xiaowei <xiaoweis@nvidia.com>	2023-11-13 15:17:23 +08:00
石晓伟	cd6bbab0b3	Add Latest News section (#315 )	2023-11-08 15:04:33 +08:00

30 Commits