Enqueue context-only requests to context executors.

Parameters:
- requests – A vector of context-only requests.
- selectContextId – The index of the context executor to use. If std::nullopt, the executor with the smallest number of inflight requests is used.
- batch – If true, enqueue the requests in the same context executor. If false, try to use a different executor for each request.

Returns:
A vector of global request ids corresponding to the order of the requests in requests. The returned ids may differ from the request ids in each executor.

Enqueue generation-only requests to generation executors.

Parameters:
- requests – A vector of generation-only requests.
- globalRequestIds – A vector of global request ids corresponding to the order of the requests; these must be the ids returned by the enqueueContext function.
- selectGenIdx – The index of the generation executor to use. If std::nullopt, the executor with the smallest number of inflight requests is used.
- batch – If true, enqueue the requests in the same generation executor. If false, try to use a different executor for each request.

Await responses from the context executors.

Parameters:
- timeout – The maximum time to wait for new responses.
- contextIdx – The index of the context executor to use. If std::nullopt, return ready responses from all context executors. If hasContextAwaitThreads is true, this parameter must be std::nullopt.

Returns:
A vector of responses with corresponding global request ids.

Await responses from the generation executors.

Parameters:
- timeout – The maximum time to wait for new responses.
- genIdx – The index of the generation executor to use. If std::nullopt, return ready responses from all generation executors. If hasGenAwaitThreads is true, this parameter must be std::nullopt.

Returns:
A vector of responses with corresponding global request ids.
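A minimal C++ sketch of this call sequence is shown below. It is illustrative only: the Orchestrator and Request template parameters stand in for the real disaggregated-executor types, and the await-method names (awaitContextResponses, awaitGenResponses) are inferred from the hasContextAwaitThreads/hasGenAwaitThreads parameters rather than taken from the actual header.

```cpp
#include <chrono>
#include <optional>
#include <vector>

// Sketch only: method names other than enqueueContext are assumptions based on
// the parameter descriptions above.
template <typename Orchestrator, typename Request>
void runDisaggregatedFlow(Orchestrator& orchestrator,
                          std::vector<Request> contextRequests,
                          std::vector<Request> generationRequests)
{
    using namespace std::chrono_literals;

    // Batch all context-only requests and let the orchestrator pick the
    // least-loaded context executor (selectContextId == std::nullopt).
    auto globalIds = orchestrator.enqueueContext(contextRequests,
                                                 /*selectContextId=*/std::nullopt,
                                                 /*batch=*/true);

    // contextIdx must stay std::nullopt when context await threads are enabled.
    auto ctxResponses = orchestrator.awaitContextResponses(/*timeout=*/100ms,
                                                           /*contextIdx=*/std::nullopt);

    // Reuse the same global ids so generation can be matched to the context phase.
    orchestrator.enqueueGeneration(generationRequests, globalIds,
                                   /*selectGenIdx=*/std::nullopt,
                                   /*batch=*/true);

    auto genResponses = orchestrator.awaitGenResponses(/*timeout=*/100ms,
                                                       /*genIdx=*/std::nullopt);
    (void)ctxResponses;
    (void)genResponses;
}
```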
The tokens computed during the gatherTree step, shaped [BS, BM, MSL]. Necessary for “Streaming + Beam Search” mode since beam search kernels store ungathered tokens in ids.
Creates a new cuda stream on the current device. The stream will be destroyed in the destructor.

Parameters:
- cudaStream – [in] The cuda stream to use for all operations on GPU (allocation, de-allocation, copying, etc.).
- flags – Flags for stream creation. See cudaStreamCreateWithFlags for a list of valid flags that can be passed.
- priority – Priority of the stream. Lower numbers represent higher priorities. See cudaDeviceGetStreamPriorityRange for more information about the meaningful stream priorities that can be passed.
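A short, hedged example of the flags/priority parameters follows; it assumes the runtime wrapper class tensorrt_llm::runtime::CudaStream and its header path, which may differ from the actual API.

```cpp
#include <cuda_runtime_api.h>

#include "tensorrt_llm/runtime/cudaStream.h"  // assumed header for the CudaStream wrapper

int main()
{
    // Lower numbers mean higher priority; query the meaningful range first.
    int leastPriority = 0;
    int greatestPriority = 0;
    cudaDeviceGetStreamPriorityRange(&leastPriority, &greatestPriority);

    // Create a non-blocking stream at the highest available priority.
    tensorrt_llm::runtime::CudaStream stream(cudaStreamNonBlocking, greatestPriority);
    return 0;
}
```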
Release this CUDAVirtualMemoryChunk. Shall be called only when status() == MATERIALIZED, or when materialize() throws. Will be called automatically by the destructor if necessary.

Calls configurator.teardown() for each configurator whose setup() succeeded in materialize(), in reverse order, and then creator.release().

Never stops early upon exception. The last thrown exception is propagated, and the others are logged.
CUDAVirtualMemoryChunk::Creator is the interface to obtain a CUmemGenericAllocationHandle, either by creating one locally, or importing one from remote.
If any CUDAVirtualMemoryChunk throws an exception during release, it will be removed from the manager. Call retrieveBadHandles to retrieve the handles of all CUDAVirtualMemoryChunk instances that were removed due to an exception.

If any CUDAVirtualMemoryChunk throws an exception during materialize or release, it will be removed from the manager. Successfully rolled-back CUDAVirtualMemoryChunk instances are not removed. Call retrieveBadHandles to retrieve the handles of all CUDAVirtualMemoryChunk instances that were removed due to an exception.

Retrieve the handles of all CUDAVirtualMemoryChunk instances that were removed due to an exception, and reset the list. The returned list may not include all removed handles if an OOM occurred. This method is only for diagnostic purposes and should not be called concurrently with other methods.
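As a diagnostic pattern, the sketch below drains the bad-handle list after a failure; the manager type is left generic because only the retrieveBadHandles method is documented here.

```cpp
#include <cstdio>

// Diagnostic sketch: retrieveBadHandles() returns the handles of chunks dropped
// after a failed materialize()/release() and resets the internal list, so call it
// once and keep the result.
template <typename VirtualMemoryManager>
void logBadVirtualMemoryHandles(VirtualMemoryManager& manager)
{
    auto badHandles = manager.retrieveBadHandles();
    if (!badHandles.empty())
    {
        std::fprintf(stderr, "%zu virtual memory chunks were removed due to exceptions\n",
                     badHandles.size());
    }
}
```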
Retrieves a T typed pointer to the underlying data of the tensor pointed to by the tensor pointer contained in the optionalBufferPtr, or nullptr if the optional doesn't have a value. This overload has to be declared to avoid ambiguity when an implicit conversion to IBuffer is involved.

Retrieves a T const typed pointer to the underlying data of the tensor pointed to by the tensor pointer contained in the optionalBufferPtr, or nullptr if the optional doesn't have a value. This overload has to be declared to avoid ambiguity when an implicit conversion to IBuffer is involved.
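The described behavior amounts to a small helper like the following sketch; dataOrNull is a hypothetical name used only for illustration, not the library function.

```cpp
#include <optional>

// Illustrative sketch: returns the typed data pointer of the buffer held by the
// optional pointer, or nullptr when the optional is empty. BufferPtr is any
// (smart) pointer type exposing data().
template <typename T, typename BufferPtr>
T* dataOrNull(std::optional<BufferPtr> const& optionalBufferPtr)
{
    return optionalBufferPtr.has_value()
        ? static_cast<T*>((*optionalBufferPtr)->data())
        : nullptr;
}
```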
Returns the tensor's n-th dimension. If n is negative, returns the (nbDims - n)-th dimension. TODO: replace with constexpr parameter when moving to C++20.

The output of the model forward computation, a probability distribution over the vocabulary, shaped [batchSize][numGenTokens, beamWidth, vocabSizePadded], on GPU.
Retrieves a T typed pointer to the underlying data of the buffer pointed to by the buffer pointer contained in the optionalBufferPtr, or nullptr if the optional doesn’t have a value.
Retrieves a T const typed pointer to the underlying data of the buffer pointed to by the buffer pointer contained in the optionalBufferPtr, or nullptr if the optional doesn’t have a value.
+    @model_validator(mode="after")
+    def validate_build_config_with_runtime_params(self):
+        # Note: max_batch_size and max_num_tokens in LlmArgs are for runtime,
+        # which will be passed to the C++ Executor API, overwriting the values
+        # from a built engine. In order to set build configuration, it is
+        # recommended to use build_config instead.
+        assert isinstance(
+            self.build_config, BuildConfig
+        ), f"build_config is not initialized: {self.build_config}"
+
+        if self.max_batch_size is not None:
+            if self.max_batch_size > self.build_config.max_batch_size:
+                self.max_batch_size = self.build_config.max_batch_size
+                logger.warning(
+                    f"max_batch_size [{self.max_batch_size}] is overridden by build_config.max_batch_size [{self.build_config.max_batch_size}] in build_config"
+                )
+        if self.max_num_tokens is not None:
+            if self.max_num_tokens > self.build_config.max_num_tokens:
+                self.max_num_tokens = self.build_config.max_num_tokens
+                logger.warning(
+                    f"max_num_tokens [{self.max_num_tokens}] is overridden by build_config.max_num_tokens [{self.build_config.max_num_tokens}] in build_config"
+                )
+        if self.max_seq_len is not None:
+            if self.max_seq_len != self.build_config.max_seq_len:
+                logger.warning(
+                    f"max_seq_len [{self.max_seq_len}] is overridden by build_config.max_seq_len [{self.build_config.max_seq_len}] in build_config"
+                )
+        if self.max_beam_width is not None:
+            if self.max_beam_width != self.build_config.max_beam_width:
+                logger.warning(
+                    f"max_beam_width [{self.max_beam_width}] is overridden by build_config.max_beam_width [{self.build_config.max_beam_width}] in build_config"
+                )
+        if self.max_input_len is not None:
+            if self.max_input_len != self.build_config.max_input_len:
+                logger.warning(
+                    f"max_input_len [{self.max_input_len}] is overridden by build_config.max_input_len [{self.build_config.max_input_len}] in build_config"
+                )
+
+        return self
+    @model_validator(mode="after")
+    def validate_speculative_config(self):
+        if self.speculative_config:
+            if not self.speculative_config.supports_backend(self.backend):
+                raise ValueError(
+                    f"Speculation type {self.speculative_config.decoding_type} does not "
+                    f"support backend {self.backend}")
+
+            # Below, we only need to set speculative_decoding_mode/decoding_config for speculation
+            # on the TRT backend.
+            if isinstance(self.speculative_config, LookaheadDecodingConfig):
+                max_draft_len = self.speculative_config.calculate_speculative_resource(
+                )[2]
+                assert max_draft_len > 0
+                self.build_config.speculative_decoding_mode = SpeculativeDecodingMode.LOOKAHEAD_DECODING
+                self.build_config.max_draft_len = max(
+                    self.build_config.max_draft_len, max_draft_len)
+                self.decoding_config = DecodingConfig(
+                    decoding_mode=DecodingMode.Lookahead(),
+                    lookahead_decoding_config=PybindMirror.maybe_to_pybind(
+                        self.speculative_config))
+
+            elif isinstance(self.speculative_config, MedusaDecodingConfig):
+                assert self.speculative_config.max_draft_len > 0
+                self.build_config.speculative_decoding_mode = SpeculativeDecodingMode.MEDUSA
+                self.build_config.max_draft_len = self.speculative_config.max_draft_len
+                self.decoding_config = DecodingConfig(
+                    decoding_mode=DecodingMode.Medusa(),
+                    medusa_choices=self.speculative_config.medusa_choices)
+
+            elif isinstance(self.speculative_config, EagleDecodingConfig):
+                assert self.speculative_config.max_draft_len > 0
+                assert self.speculative_config.speculative_model_dir is not None, "Path to EAGLE3 weights must be specified."
+                self.build_config.max_draft_len = self.speculative_config.max_draft_len
+                self.build_config.speculative_decoding_mode = SpeculativeDecodingMode.EAGLE
+                eagle_config = _EagleConfig(
+                    self.speculative_config.eagle_choices,
+                    self.speculative_config.greedy_sampling,
+                    self.speculative_config.posterior_threshold,
+                    self.speculative_config.use_dynamic_tree,
+                    self.speculative_config.dynamic_tree_max_topK)
+                self.decoding_config = DecodingConfig(
+                    decoding_mode=DecodingMode.Eagle(),
+                    eagle_config=eagle_config)
+            else:
+                raise ValueError(
+                    f"Unrecognized speculative config type {type(self.speculative_config)}"
+                )
+
+        else:
+            self.decoding_config = None
+
+        self._speculative_model = getattr(self.speculative_config,
+                                          "speculative_model_dir", None)
+        speculative_model_obj = _ModelWrapper(
+            self._speculative_model
+        ) if self._speculative_model is not None else None
+        if self._speculative_model and speculative_model_obj.is_local_model:
+            self._speculative_model_format = _ModelFormatKind.HF
+
+        return self
+
+
+    def _load_config_from_engine(self, engine_dir: Path):
+        engine_config = EngineConfig.from_json_file(engine_dir / "config.json")
+        self._pretrained_config = engine_config.pretrained_config
+        self.build_config = engine_config.build_config
+
+        # load and check parallel_config
+        mapping = self._pretrained_config.mapping
+        if self.parallel_config.tp_size not in (1, mapping.tp_size):
+            raise ValueError(
+                f"tp_size {self.parallel_config.tp_size} is not consistent with the engine's tp_size {mapping.tp_size}"
+            )
+        if self.parallel_config.pp_size not in (1, mapping.pp_size):
+            raise ValueError(
+                f"pp_size {self.parallel_config.pp_size} is not consistent with the engine's pp_size {mapping.pp_size}"
+            )
+        if self.parallel_config.cp_size not in (1, mapping.cp_size):
+            raise ValueError(
+                f"cp_size {self.parallel_config.cp_size} is not consistent with the engine's cp_size {mapping.cp_size}"
+            )
+        self._parallel_config = _ParallelConfig(
+            tp_size=mapping.tp_size,
+            pp_size=mapping.pp_size,
+            cp_size=mapping.cp_size,
+            gpus_per_node=mapping.gpus_per_node,
+            moe_cluster_size=mapping.moe_cluster_size,
+            moe_tp_size=mapping.moe_tp_size,
+            moe_ep_size=mapping.moe_ep_size)
+
+    def _load_config_from_ckpt(self, ckpt_dir: Path):
+        pretrained_config = PretrainedConfig.from_json_file(ckpt_dir /
+                                                            "config.json")
+        tp_size = pretrained_config.mapping.tp_size
+        pp_size = pretrained_config.mapping.pp_size
+        cp_size = pretrained_config.mapping.cp_size
+        moe_cluster_size = pretrained_config.mapping.moe_cluster_size
+        moe_tp_size = pretrained_config.mapping.moe_tp_size
+        moe_ep_size = pretrained_config.mapping.moe_ep_size
+        gpus_per_node = pretrained_config.mapping.gpus_per_node
+        # load parallel_config
+        if self.parallel_config.tp_size != 1 and self.parallel_config.tp_size != tp_size:
+            raise ValueError(
+                f"tp_size {self.parallel_config.tp_size} is not consistent with the checkpoint's tp_size {tp_size}"
+            )
+        if self.parallel_config.pp_size != 1 and self.parallel_config.pp_size != pp_size:
+            raise ValueError(
+                f"pp_size {self.parallel_config.pp_size} is not consistent with the checkpoint's pp_size {pp_size}"
+            )
+        if self.parallel_config.cp_size != 1 and self.parallel_config.cp_size != cp_size:
+            raise ValueError(
+                f"cp_size {self.parallel_config.cp_size} is not consistent with the checkpoint's cp_size {cp_size}"
+            )
+        self._parallel_config = _ParallelConfig(
+            tp_size=tp_size,
+            pp_size=pp_size,
+            cp_size=cp_size,
+            gpus_per_node=gpus_per_node,
+            moe_cluster_size=moe_cluster_size,
+            moe_tp_size=moe_tp_size,
+            moe_ep_size=moe_ep_size)
+
+
+    @model_validator(mode="after")
+    def validate_model_format_misc(self):
+        '''
+        Load the model format, and do the following:
+
+        1. Load the build_config if got an engine.
+        2. Load the parallel_config if got a checkpoint.
+        '''
+        model_obj = _ModelWrapper(self.model)
+
+        if model_obj.is_local_model and self.backend not in [
+                'pytorch', '_autodeploy'
+        ]:
+            # Load parallel_config from the engine.
+            model_format = get_model_format(
+                self.model, trust_remote_code=self.trust_remote_code)
+
+            if model_format is _ModelFormatKind.TLLM_ENGINE:
+                if self.build_config is not None:
+                    logger.warning(
+                        "The build_config is ignored for model format of TLLM_ENGINE."
+                    )
+                self._load_config_from_engine(model_obj.model_dir)
+                runtime_defaults = self._pretrained_config.runtime_defaults
+                if runtime_defaults:
+                    self.kv_cache_config.fill_empty_fields_from_runtime_defaults(
+                        runtime_defaults)
+
+            # Load parallel_config from the checkpoint.
+            elif model_format is _ModelFormatKind.TLLM_CKPT:
+                # We need to create a temporary instance to call _load_config_from_ckpt
+                self._load_config_from_ckpt(model_obj.model_dir)
+        else:
+            model_format = _ModelFormatKind.HF
+
+        # Store the model format in the values
+        self._model_format = model_format
+        return self
class TorchLlmArgs(BaseLlmArgs):
-    # Just a dummy BuildConfig to allow code reuse with the TrtLlmArgs
-    build_config: Optional[BuildConfig] = Field(
-        default=None,
-        description="Build config.",
-        exclude_from_json=True,
-        status="deprecated",
-    )
-
    # PyTorch backend specific configurations
    garbage_collection_gen0_threshold: int = Field(default=20000,
@@ -3360,6 +3351,11 @@
        description="MoE config.", status="beta")
+    nvfp4_gemm_config: Nvfp4GemmConfig = Field(
+        default_factory=Nvfp4GemmConfig,
+        description="NVFP4 GEMM backend config.",
+        status="beta")
+
    attn_backend: str = Field(default='TRTLLM', description="Attention backend to use.", status="beta")
@@ -3512,8 +3508,12 @@
    # Private Vars
    _quant_config: Optional[QuantConfig] = PrivateAttr(default=None)
-    _disable_flash_infer_sampling: bool = PrivateAttr(default=True)
-    """Unless this is set to False, FlashInfer.sampling is not used, even if available."""
+    disable_flashinfer_sampling: bool = Field(
+        default=False,
+        description=
+        "Disable the use of FlashInfer.sampling. This option is likely to be removed in the future.",
+        status="prototype",
+    )

    @property
    def quant_config(self) -> QuantConfig:
@@ -3564,6 +3564,73 @@
    def extra_resource_managers(self, value: Dict[str, object]) -> None:
        self._extra_resource_managers = value
+
+    @model_validator(mode="after")
+    def validate_speculative_config(self):
+        if self.speculative_config:
+            if not self.speculative_config.supports_backend(self.backend):
+                raise ValueError(
+                    f"Speculation type {self.speculative_config.decoding_type} does not "
+                    f"support backend {self.backend}")
+
+            if isinstance(self.speculative_config, EagleDecodingConfig):
+                assert self.speculative_config.max_draft_len > 0
+                assert self.speculative_config.speculative_model_dir is not None, "Path to EAGLE3 weights must be specified."
+            elif isinstance(self.speculative_config, NGramDecodingConfig):
+                assert self.speculative_config.max_draft_len > 0 and self.speculative_config.max_matching_ngram_size > 0
+            elif isinstance(self.speculative_config, DraftTargetDecodingConfig):
+                assert self.speculative_config.max_draft_len > 0
+                assert self.speculative_config.speculative_model_dir is not None, "Path to draft model must be specified."
+            elif isinstance(self.speculative_config, MTPDecodingConfig):
+                assert self.speculative_config.num_nextn_predict_layers > 0
+                self.speculative_config.max_draft_len = self.speculative_config.num_nextn_predict_layers
+            elif isinstance(self.speculative_config,
+                            UserProvidedDecodingConfig):
+                pass
+            elif isinstance(self.speculative_config, AutoDecodingConfig):
+                pass
+            elif isinstance(self.speculative_config,
+                            SaveHiddenStatesDecodingConfig):
+                assert self.backend in ['pytorch']
+                logger.warning(
+                    "SaveHiddenStatesDecodingConfig is active, setting max_batch_size to 1, disabling overlap scheduler, and setting cuda_graph_config to None"
+                )
+                self.max_batch_size = 1
+                self.disable_overlap_scheduler = True
+                self.cuda_graph_config = None
+                self.speculative_config.max_draft_len = 1
+            else:
+                raise ValueError(
+                    f"Unrecognized speculative config type {type(self.speculative_config)}"
+                )
+
+        else:
+            self.decoding_config = None
+
+        self._speculative_model = getattr(self.speculative_config,
+                                          "speculative_model_dir", None)
+        speculative_model_obj = _ModelWrapper(
+            self._speculative_model
+        ) if self._speculative_model is not None else None
+        if self._speculative_model and speculative_model_obj.is_local_model:
+            self._speculative_model_format = _ModelFormatKind.HF
+
+        return self
+
+
+
+
    @model_validator(mode="after")
@@ -3807,6 +3874,15 @@
        llm_args_dict: Dict, extra_llm_api_options: Optional[str] = None) -> Dict:
+    # Deep merge kv_cache_config to prevent partial YAML kv_cache_config from replacing the complete kv_cache_config
+    if 'kv_cache_config' in llm_args and 'kv_cache_config' in llm_args_dict:
+        # Convert KvCacheConfig object to dict if necessary
+        base_kv_config = llm_args['kv_cache_config']
+        if isinstance(base_kv_config, KvCacheConfig):
+            base_kv_config = base_kv_config.model_dump(exclude_unset=True)
+        llm_args_dict['kv_cache_config'] = base_kv_config | llm_args_dict[
+            'kv_cache_config']
+
    field_mapping = {
        "quant_config": QuantConfig,
        "calib_config": CalibConfig,
@@ -3816,8 +3892,10 @@
        "speculative_config": DecodingBaseConfig,
        "lora_config": LoraConfig,
        "moe_config": MoeConfig,
+        "nvfp4_gemm_config": Nvfp4GemmConfig,
        "attention_dp_config": AttentionDpConfig,
        "sparse_attention_config": BaseSparseAttentionConfig,
+        "kv_cache_config": KvCacheConfig,
    }
    for field_name, field_type in field_mapping.items():
        if field_name in llm_args_dict:
@@ -3833,8 +3911,7 @@
    llm_args = llm_args | llm_args_dict
-    # For trtllm-bench or trtllm-serve, build_config may be passed for the PyTorch
-    # backend, overwriting the knobs there since build_config always has the highest priority
+    # build_config only works for TensorRT backend, it will be ignored in PyTorch backend
    if "build_config" in llm_args:
        # Ensure build_config is a BuildConfig object, not a dict
        if isinstance(llm_args["build_config"], dict):
@@ -4010,9 +4087,9 @@
diff --git a/latest/_modules/tensorrt_llm/llmapi/mm_encoder.html b/latest/_modules/tensorrt_llm/llmapi/mm_encoder.html
index 12f947cfff..75c2a35120 100644
--- a/latest/_modules/tensorrt_llm/llmapi/mm_encoder.html
+++ b/latest/_modules/tensorrt_llm/llmapi/mm_encoder.html
@@ -60,7 +60,7 @@
@@ -73,7 +73,7 @@
-
+
@@ -369,7 +369,9 @@
@@ -916,6 +921,25 @@
        strs = [self.stop] if isinstance(self.stop, str) else self.stop
        self._stop_word_ids = [_encode(tokenizer, s, add_special_tokens) for s in strs]
+        # add generation_config to stop word list, only in qwen3-next now
+        if (
+            hf_model_config is not None
+            and hf_model_config.model_type == "qwen3_next"
+            and generation_config is not None
+            and isinstance(generation_config.eos_token_id, List)
+            and all(isinstance(i, int) for i in generation_config.eos_token_id)
+        ):
+            if self._stop_word_ids:
+                all_stop_tokens_id = set(i for sublist in self._stop_word_ids for i in sublist)
+                from_generation_stop_tokens = [
+                    i for i in generation_config.eos_token_id if i not in all_stop_tokens_id
+                ]
+
+                if from_generation_stop_tokens:
+                    self._stop_word_ids.append(from_generation_stop_tokens)
+            else:
+                self._stop_word_ids = [generation_config.eos_token_id]
+
        return self

    def _get_bad_words(self) -> List[List[int]]:
@@ -1168,9 +1192,9 @@
diff --git a/latest/_sources/_cpp_gen/executor.rst.txt b/latest/_sources/_cpp_gen/executor.rst.txt
index d3ca9cd473..39b9a6f5a4 100644
--- a/latest/_sources/_cpp_gen/executor.rst.txt
+++ b/latest/_sources/_cpp_gen/executor.rst.txt
@@ -4,6 +4,24 @@ Executor
.. Here are files in the cpp/include/executor
.. We manually add subsection to enable detailed description in the future
.. It is also doable to automatically generate this file and list all the modules in the conf.py
+transferAgent.h
+_______________
+
+.. doxygenfile:: transferAgent.h
+ :project: TensorRT-LLM
+
+types.h
+_______
+
+.. doxygenfile:: types.h
+ :project: TensorRT-LLM
+
+cacheCommunicator.h
+___________________
+
+.. doxygenfile:: cacheCommunicator.h
+ :project: TensorRT-LLM
+
disaggServerUtil.h
__________________
@@ -16,24 +34,6 @@ ________
.. doxygenfile:: tensor.h
:project: TensorRT-LLM
-transferAgent.h
-_______________
-
-.. doxygenfile:: transferAgent.h
- :project: TensorRT-LLM
-
-serialization.h
-_______________
-
-.. doxygenfile:: serialization.h
- :project: TensorRT-LLM
-
-types.h
-_______
-
-.. doxygenfile:: types.h
- :project: TensorRT-LLM
-
executor.h
__________
@@ -46,9 +46,9 @@ ______________________
.. doxygenfile:: dataTransceiverState.h
:project: TensorRT-LLM
-cacheCommunicator.h
-___________________
+serialization.h
+_______________
-.. doxygenfile:: cacheCommunicator.h
+.. doxygenfile:: serialization.h
:project: TensorRT-LLM
diff --git a/latest/_sources/_cpp_gen/runtime.rst.txt b/latest/_sources/_cpp_gen/runtime.rst.txt
index 536188f7ce..b8dd953966 100644
--- a/latest/_sources/_cpp_gen/runtime.rst.txt
+++ b/latest/_sources/_cpp_gen/runtime.rst.txt
@@ -4,148 +4,22 @@ Runtime
.. Here are files in the cpp/include/runtime
.. We manually add subsection to enable detailed description in the future
.. It is also doable to automatically generate this file and list all the modules in the conf.py
-lookaheadBuffers.h
-__________________
-
-.. doxygenfile:: lookaheadBuffers.h
- :project: TensorRT-LLM
-
-lookaheadModule.h
-_________________
-
-.. doxygenfile:: lookaheadModule.h
- :project: TensorRT-LLM
-
-iBuffer.h
-_________
-
-.. doxygenfile:: iBuffer.h
- :project: TensorRT-LLM
-
-modelConfig.h
-_____________
-
-.. doxygenfile:: modelConfig.h
- :project: TensorRT-LLM
-
-decodingOutput.h
-________________
-
-.. doxygenfile:: decodingOutput.h
- :project: TensorRT-LLM
-
-promptTuningParams.h
-____________________
-
-.. doxygenfile:: promptTuningParams.h
- :project: TensorRT-LLM
-
-bufferManager.h
-_______________
-
-.. doxygenfile:: bufferManager.h
- :project: TensorRT-LLM
-
-gptJsonConfig.h
-_______________
-
-.. doxygenfile:: gptJsonConfig.h
- :project: TensorRT-LLM
-
-runtimeDefaults.h
-_________________
-
-.. doxygenfile:: runtimeDefaults.h
- :project: TensorRT-LLM
-
-loraCache.h
-___________
-
-.. doxygenfile:: loraCache.h
- :project: TensorRT-LLM
-
-rawEngine.h
-___________
-
-.. doxygenfile:: rawEngine.h
- :project: TensorRT-LLM
-
-gptDecoder.h
-____________
-
-.. doxygenfile:: gptDecoder.h
- :project: TensorRT-LLM
-
-eagleBuffers.h
-______________
-
-.. doxygenfile:: eagleBuffers.h
- :project: TensorRT-LLM
-
-medusaModule.h
-______________
-
-.. doxygenfile:: medusaModule.h
- :project: TensorRT-LLM
-
-virtualMemory.h
-_______________
-
-.. doxygenfile:: virtualMemory.h
- :project: TensorRT-LLM
-
-explicitDraftTokensBuffers.h
-____________________________
-
-.. doxygenfile:: explicitDraftTokensBuffers.h
- :project: TensorRT-LLM
-
iTensor.h
_________
.. doxygenfile:: iTensor.h
:project: TensorRT-LLM
-common.h
-________
-
-.. doxygenfile:: common.h
- :project: TensorRT-LLM
-
-loraCachePageManagerConfig.h
-____________________________
-
-.. doxygenfile:: loraCachePageManagerConfig.h
- :project: TensorRT-LLM
-
-worldConfig.h
-_____________
-
-.. doxygenfile:: worldConfig.h
- :project: TensorRT-LLM
-
-loraModule.h
-____________
-
-.. doxygenfile:: loraModule.h
- :project: TensorRT-LLM
-
-speculativeDecodingMode.h
-_________________________
-
-.. doxygenfile:: speculativeDecodingMode.h
- :project: TensorRT-LLM
-
cudaEvent.h
___________
.. doxygenfile:: cudaEvent.h
:project: TensorRT-LLM
-decodingInput.h
+virtualMemory.h
_______________
-.. doxygenfile:: decodingInput.h
+.. doxygenfile:: virtualMemory.h
:project: TensorRT-LLM
speculativeDecodingModule.h
@@ -154,40 +28,10 @@ ___________________________
.. doxygenfile:: speculativeDecodingModule.h
:project: TensorRT-LLM
-iGptDecoderBatched.h
-____________________
+common.h
+________
-.. doxygenfile:: iGptDecoderBatched.h
- :project: TensorRT-LLM
-
-eagleModule.h
-_____________
-
-.. doxygenfile:: eagleModule.h
- :project: TensorRT-LLM
-
-tllmLogger.h
-____________
-
-.. doxygenfile:: tllmLogger.h
- :project: TensorRT-LLM
-
-gptDecoderBatched.h
-___________________
-
-.. doxygenfile:: gptDecoderBatched.h
- :project: TensorRT-LLM
-
-cudaStream.h
-____________
-
-.. doxygenfile:: cudaStream.h
- :project: TensorRT-LLM
-
-ipcNvlsMemory.h
-_______________
-
-.. doxygenfile:: ipcNvlsMemory.h
+.. doxygenfile:: common.h
:project: TensorRT-LLM
samplingConfig.h
@@ -196,16 +40,136 @@ ________________
.. doxygenfile:: samplingConfig.h
:project: TensorRT-LLM
+tllmLogger.h
+____________
+
+.. doxygenfile:: tllmLogger.h
+ :project: TensorRT-LLM
+
+lookaheadModule.h
+_________________
+
+.. doxygenfile:: lookaheadModule.h
+ :project: TensorRT-LLM
+
+modelConfig.h
+_____________
+
+.. doxygenfile:: modelConfig.h
+ :project: TensorRT-LLM
+
+iGptDecoderBatched.h
+____________________
+
+.. doxygenfile:: iGptDecoderBatched.h
+ :project: TensorRT-LLM
+
+cudaStream.h
+____________
+
+.. doxygenfile:: cudaStream.h
+ :project: TensorRT-LLM
+
+loraCache.h
+___________
+
+.. doxygenfile:: loraCache.h
+ :project: TensorRT-LLM
+
+medusaModule.h
+______________
+
+.. doxygenfile:: medusaModule.h
+ :project: TensorRT-LLM
+
decoderState.h
______________
.. doxygenfile:: decoderState.h
:project: TensorRT-LLM
-ipcUtils.h
-__________
+lookaheadBuffers.h
+__________________
-.. doxygenfile:: ipcUtils.h
+.. doxygenfile:: lookaheadBuffers.h
+ :project: TensorRT-LLM
+
+eagleModule.h
+_____________
+
+.. doxygenfile:: eagleModule.h
+ :project: TensorRT-LLM
+
+runtimeDefaults.h
+_________________
+
+.. doxygenfile:: runtimeDefaults.h
+ :project: TensorRT-LLM
+
+decodingOutput.h
+________________
+
+.. doxygenfile:: decodingOutput.h
+ :project: TensorRT-LLM
+
+decodingInput.h
+_______________
+
+.. doxygenfile:: decodingInput.h
+ :project: TensorRT-LLM
+
+worldConfig.h
+_____________
+
+.. doxygenfile:: worldConfig.h
+ :project: TensorRT-LLM
+
+gptDecoderBatched.h
+___________________
+
+.. doxygenfile:: gptDecoderBatched.h
+ :project: TensorRT-LLM
+
+explicitDraftTokensBuffers.h
+____________________________
+
+.. doxygenfile:: explicitDraftTokensBuffers.h
+ :project: TensorRT-LLM
+
+bufferManager.h
+_______________
+
+.. doxygenfile:: bufferManager.h
+ :project: TensorRT-LLM
+
+loraModule.h
+____________
+
+.. doxygenfile:: loraModule.h
+ :project: TensorRT-LLM
+
+eagleBuffers.h
+______________
+
+.. doxygenfile:: eagleBuffers.h
+ :project: TensorRT-LLM
+
+speculativeDecodingMode.h
+_________________________
+
+.. doxygenfile:: speculativeDecodingMode.h
+ :project: TensorRT-LLM
+
+promptTuningParams.h
+____________________
+
+.. doxygenfile:: promptTuningParams.h
+ :project: TensorRT-LLM
+
+gptDecoder.h
+____________
+
+.. doxygenfile:: gptDecoder.h
:project: TensorRT-LLM
memoryCounters.h
@@ -214,3 +178,39 @@ ________________
.. doxygenfile:: memoryCounters.h
:project: TensorRT-LLM
+ipcNvlsMemory.h
+_______________
+
+.. doxygenfile:: ipcNvlsMemory.h
+ :project: TensorRT-LLM
+
+rawEngine.h
+___________
+
+.. doxygenfile:: rawEngine.h
+ :project: TensorRT-LLM
+
+ipcUtils.h
+__________
+
+.. doxygenfile:: ipcUtils.h
+ :project: TensorRT-LLM
+
+iBuffer.h
+_________
+
+.. doxygenfile:: iBuffer.h
+ :project: TensorRT-LLM
+
+gptJsonConfig.h
+_______________
+
+.. doxygenfile:: gptJsonConfig.h
+ :project: TensorRT-LLM
+
+loraCachePageManagerConfig.h
+____________________________
+
+.. doxygenfile:: loraCachePageManagerConfig.h
+ :project: TensorRT-LLM
+
diff --git a/latest/_sources/blogs/H100vsA100.md.txt b/latest/_sources/blogs/H100vsA100.md.txt
index 06edd81620..9359863b54 100644
--- a/latest/_sources/blogs/H100vsA100.md.txt
+++ b/latest/_sources/blogs/H100vsA100.md.txt
@@ -28,7 +28,7 @@ TensorRT LLM evaluated on both Hopper and Ampere shows **H100 FP8 is up to 4.6x
FP8 H100, FP16 A100, SXM 80GB GPUs, TP1, ISL/OSL's provided, TensorRT LLM v0.5.0., TensorRT 9.1
-The full data behind these charts & tables and including larger models with higher TP values can be found in TensorRT LLM's [Performance Documentation](https://nvidia.github.io/TensorRT-LLM/latest/performance/perf-overview.html)
+The full data behind these charts & tables and including larger models with higher TP values can be found in TensorRT LLM's [Performance Documentation](https://nvidia.github.io/TensorRT-LLM/0.21.0/performance/perf-overview.html)
Stay tuned for a highlight on Llama coming soon!
diff --git a/latest/_sources/blogs/H200launch.md.txt b/latest/_sources/blogs/H200launch.md.txt
index 6fd0737c33..3946399036 100644
--- a/latest/_sources/blogs/H200launch.md.txt
+++ b/latest/_sources/blogs/H200launch.md.txt
@@ -21,7 +21,7 @@ TensorRT LLM evaluation of the [new H200 GPU](https://nvidianews.nvidia.com/news
*(1) Largest batch supported on given TP configuration by power of 2.**(2) TP = Tensor Parallelism*
-Additional Performance data is available on the [NVIDIA Data Center Deep Learning Product Performance](https://developer.nvidia.com/deep-learning-performance-training-inference/ai-inference) page, & soon in [TensorRT LLM's Performance Documentation](https://nvidia.github.io/TensorRT-LLM/latest/performance/perf-overview.html).
+Additional Performance data is available on the [NVIDIA Data Center Deep Learning Product Performance](https://developer.nvidia.com/deep-learning-performance-training-inference/ai-inference) page, & soon in [TensorRT LLM's Performance Documentation](https://nvidia.github.io/TensorRT-LLM/0.21.0/performance/perf-overview.html).
### H200 vs H100
diff --git a/latest/_sources/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.md.txt b/latest/_sources/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.md.txt
index f0d7647d00..fef8dcc93a 100644
--- a/latest/_sources/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.md.txt
+++ b/latest/_sources/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.md.txt
@@ -124,7 +124,7 @@ In the Dynamo workflow, requests are initially processed by pre- and post-proces
Dynamo also includes built-in support for Kubernetes deployment, monitoring, and metrics collection. The development team is actively working on enabling dynamic instance scaling, further enhancing its suitability for production environments.
-For more information on how to use Dynamo with TensorRT LLM, please refer to [this documentation](https://docs.nvidia.com/dynamo/latest/examples/trtllm.html).
+For more information on how to use Dynamo with TensorRT LLM, please refer to [this documentation](https://docs.nvidia.com/dynamo/latest/backends/trtllm/README.html).
### Triton Inference Server
diff --git a/latest/_sources/commands/trtllm-serve/run-benchmark-with-trtllm-serve.md.txt b/latest/_sources/commands/trtllm-serve/run-benchmark-with-trtllm-serve.md.txt
index a8f3313bee..81c296fe9d 100644
--- a/latest/_sources/commands/trtllm-serve/run-benchmark-with-trtllm-serve.md.txt
+++ b/latest/_sources/commands/trtllm-serve/run-benchmark-with-trtllm-serve.md.txt
@@ -25,7 +25,7 @@ TensorRT LLM distributes the pre-built container on [NGC Catalog](https://catalo
You can launch the container using the following command:
```bash
-docker run --rm -it --ipc host -p 8000:8000 --gpus all --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc4
+docker run --rm -it --ipc host -p 8000:8000 --gpus all --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc5
```
@@ -161,34 +161,36 @@ P99 E2EL (ms): 1643.44
For a single request, ITLs are the time intervals between tokens, while TPOT is the average of those intervals:
-```math
-\text{TPOT (1\ request)} = \text{Avg(ITL)} = \frac{\text{E2E\ latency} - \text{TTFT}}{\text{\#Output\ Tokens} - 1}
-```
+$$
+\text{TPOT (1 request)} = \text{Avg(ITL)} = \frac{\text{E2E latency} - \text{TTFT}}{\text{Num Output Tokens} - 1}
+$$
Across different requests, **average TPOT** is the mean of each request's TPOT (all requests weighted equally), while **average ITL** is token-weighted (all tokens weighted equally):
-```math
+$$
\text{Avg TPOT (N requests)} = \frac{\text{TPOT}_1 + \text{TPOT}_2 + \cdots + \text{TPOT}_N}{N}
-```
+$$
-```math
-\text{Avg ITL (N requests)} = \frac{\text{Sum of all ITLs across requests}}{\text{\#Output Tokens across requests}}
-```
+$$
+\text{Avg ITL (N requests)} = \frac{\text{Sum of all ITLs across requests}}{\text{Num Output Tokens across requests}}
+$$
#### End-to-End (E2E) Latency
* The typical total time from when a request is submitted until the final token of the response is received.
#### Total Token Throughput
* The combined rate at which the system processes both input (prompt) tokens and output (generated) tokens.
-```math
-\text{Total\ TPS} = \frac{\text{\#Input\ Tokens}+\text{\#Output\ Tokens}}{T_{last} - T_{first}}
-```
+
+$$
+\text{Total TPS} = \frac{\text{Num Input Tokens}+\text{Num Output Tokens}}{T_{last} - T_{first}}
+$$
#### Tokens Per Second (TPS) or Output Token Throughput
* how many output tokens the system generates each second.
-```math
-\text{TPS} = \frac{\text{\#Output\ Tokens}}{T_{last} - T_{first}}
-```
+
+$$
+\text{TPS} = \frac{\text{Num Output Tokens}}{T_{last} - T_{first}}
+$$
### Request Time Breakdown
diff --git a/latest/_sources/commands/trtllm-serve/trtllm-serve.rst.txt b/latest/_sources/commands/trtllm-serve/trtllm-serve.rst.txt
index 8b7d25e735..25ed2bc394 100644
--- a/latest/_sources/commands/trtllm-serve/trtllm-serve.rst.txt
+++ b/latest/_sources/commands/trtllm-serve/trtllm-serve.rst.txt
@@ -41,13 +41,13 @@ Chat API
You can query Chat API with any http clients, a typical example is OpenAI Python client:
-.. literalinclude:: ../../../examples/serve/openai_chat_client.py
+.. literalinclude:: ../../../../examples/serve/openai_chat_client.py
:language: python
:linenos:
Another example uses ``curl``:
-.. literalinclude:: ../../../examples/serve/curl_chat_client.sh
+.. literalinclude:: ../../../../examples/serve/curl_chat_client.sh
:language: bash
:linenos:
@@ -56,13 +56,13 @@ Completions API
You can query Completions API with any http clients, a typical example is OpenAI Python client:
-.. literalinclude:: ../../../examples/serve/openai_completion_client.py
+.. literalinclude:: ../../../../examples/serve/openai_completion_client.py
:language: python
:linenos:
Another example uses ``curl``:
-.. literalinclude:: ../../../examples/serve/curl_completion_client.sh
+.. literalinclude:: ../../../../examples/serve/curl_completion_client.sh
:language: bash
:linenos:
@@ -97,13 +97,13 @@ Multimodal Chat API
You can query Completions API with any http clients, a typical example is OpenAI Python client:
-.. literalinclude:: ../../../examples/serve/openai_completion_client_for_multimodal.py
+.. literalinclude:: ../../../../examples/serve/openai_completion_client_for_multimodal.py
:language: python
:linenos:
Another example uses ``curl``:
-.. literalinclude:: ../../../examples/serve/curl_chat_client_for_multimodal.sh
+.. literalinclude:: ../../../../examples/serve/curl_chat_client_for_multimodal.sh
:language: bash
:linenos:
@@ -254,7 +254,23 @@ Example output:
}
]
+Configuring with YAML Files
+----------------------------
+You can configure various options of ``trtllm-serve`` using YAML files by setting the ``--extra_llm_api_options`` option to the path of a YAML file; the arguments in the file override the corresponding command-line arguments.
+
+The YAML file provides the configuration for `tensorrt_llm.llmapi.LlmArgs `_. The class has multiple levels of hierarchy; to configure a top-level argument such as ``max_batch_size``, the YAML file should look like:
+
+.. code-block:: yaml
+
+ max_batch_size: 8
+
+To configure nested arguments such as ``moe_config.backend``, the YAML file should look like:
+
+.. code-block:: yaml
+
+ moe_config:
+ backend: CUTLASS
Syntax
------
diff --git a/latest/_sources/deployment-guide/deployment-guide-for-deepseek-r1-on-trtllm.md.txt b/latest/_sources/deployment-guide/deployment-guide-for-deepseek-r1-on-trtllm.md.txt
index 8782681ef5..55deeb94fe 100644
--- a/latest/_sources/deployment-guide/deployment-guide-for-deepseek-r1-on-trtllm.md.txt
+++ b/latest/_sources/deployment-guide/deployment-guide-for-deepseek-r1-on-trtllm.md.txt
@@ -47,7 +47,7 @@ docker run --rm -it \
-p 8000:8000 \
-v ~/.cache:/root/.cache:rw \
--name tensorrt_llm \
-nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc4 \
+nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc5 \
/bin/bash
```
@@ -250,7 +250,7 @@ Here is an example response, showing that the TensorRT LLM server returns “New
### Troubleshooting Tips
* If you encounter CUDA out-of-memory errors, try reducing `max_batch_size` or `max_seq_len`.
- * For running input/output sequence lengths of 8K/1K on H200, there is a known CUDA Out-Of-Memory issue caused by the PyTorch CUDA Caching Allocator fragmenting memory. As a workaround, you can set the environment variable `PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:8192`. For more details, please refer to the [PyTorch documentation on optimizing memory usage](https://docs.pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf).
+ * For running input/output sequence lengths of 8K/1K on H200, there is a known CUDA Out-Of-Memory issue caused by the PyTorch CUDA Caching Allocator fragmenting memory. As a workaround, you can set the environment variable `PYTORCH_ALLOC_CONF=max_split_size_mb:8192`. For more details, please refer to the [PyTorch documentation on optimizing memory usage](https://docs.pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf).
* Ensure your model checkpoints are compatible with the expected format.
* For performance issues, check GPU utilization with nvidia-smi while the server is running.
* If the container fails to start, verify that the NVIDIA Container Toolkit is properly installed.
@@ -399,31 +399,33 @@ P99 E2EL (ms): [result]
For a single request, ITLs are the time intervals between tokens, while TPOT is the average of those intervals:
-```math
-\text{TPOT (1\ request)} = \text{Avg(ITL)} = \frac{\text{E2E\ latency} - \text{TTFT}}{\text{\#Output\ Tokens} - 1}
-```
+$$
+\text{TPOT (1 request)} = \text{Avg(ITL)} = \frac{\text{E2E latency} - \text{TTFT}}{\text{Num Output Tokens} - 1}
+$$
Across different requests, **average TPOT** is the mean of each request's TPOT (all requests weighted equally), while **average ITL** is token-weighted (all tokens weighted equally):
-```math
+$$
\text{Avg TPOT (N requests)} = \frac{\text{TPOT}_1 + \text{TPOT}_2 + \cdots + \text{TPOT}_N}{N}
-```
+$$
-```math
-\text{Avg ITL (N requests)} = \frac{\text{Sum of all ITLs across requests}}{\text{\#Output Tokens across requests}}
-```
+$$
+\text{Avg ITL (N requests)} = \frac{\text{Sum of all ITLs across requests}}{\text{Num Output Tokens across requests}}
+$$
#### End-to-End (E2E) Latency
* The typical total time from when a request is submitted until the final token of the response is received.
#### Total Token Throughput
* The combined rate at which the system processes both input (prompt) tokens and output (generated) tokens.
-```math
-\text{Total\ TPS} = \frac{\text{\#Input\ Tokens}+\text{\#Output\ Tokens}}{T_{last} - T_{first}}
-```
+
+$$
+\text{Total TPS} = \frac{\text{Num Input Tokens}+\text{Num Output Tokens}}{T_{last} - T_{first}}
+$$
#### Tokens Per Second (TPS) or Output Token Throughput
* how many output tokens the system generates each second.
-```math
-\text{TPS} = \frac{\text{\#Output\ Tokens}}{T_{last} - T_{first}}
-```
+
+$$
+\text{TPS} = \frac{\text{Num Output Tokens}}{T_{last} - T_{first}}
+$$
diff --git a/latest/_sources/deployment-guide/deployment-guide-for-gpt-oss-on-trtllm.md.txt b/latest/_sources/deployment-guide/deployment-guide-for-gpt-oss-on-trtllm.md.txt
index 2a8a80b41b..ae34c5b3ce 100644
--- a/latest/_sources/deployment-guide/deployment-guide-for-gpt-oss-on-trtllm.md.txt
+++ b/latest/_sources/deployment-guide/deployment-guide-for-gpt-oss-on-trtllm.md.txt
@@ -43,7 +43,7 @@ docker run --rm -it \
-p 8000:8000 \
-v ~/.cache:/root/.cache:rw \
--name tensorrt_llm \
-nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc4 \
+nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc5 \
/bin/bash
```
@@ -212,8 +212,6 @@ curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/jso
Here is an example response, showing that the TensorRT LLM server reasons and answers the questions.
-TODO: Use Chat Compeletions API / Responses API as the example after the PR is merged.
-
```json
{"id":"chatcmpl-c5bf51b5cab94e10ba5da5266d12ee59","object":"chat.completion","created":1755815898,"model":"openai/gpt-oss-120b","choices":[{"index":0,"message":{"role":"assistant","content":"analysisThe user asks: \"Where is New York?\" Likely they want location info. Provide answer: New York State in northeastern US, New York City on the east coast, coordinates, etc. Provide context.assistantfinal**New York** can refer to two related places in the United States:\n\n| What it is | Where it is | Approx. coordinates | How to picture it |\n|------------|------------|--------------------|-------------------|\n| **New York State** | The northeastern corner of the United States, bordered by **Vermont, Massachusetts, Connecticut, New Jersey, Pennsylvania, and the Canadian provinces of Ontario and Quebec**. | 42.7° N, 75.5° W (roughly the state’s geographic centre) | A roughly rectangular state that stretches from the Atlantic Ocean in the southeast to the Adirondack Mountains and the Great Lakes region in the north. |\n| **New York City (NYC)** | The largest city in the state, located on the **southern tip of the state** where the **Hudson River meets the Atlantic Ocean**. It occupies five boroughs: Manhattan, Brooklyn, Queens, The Bronx, and Staten Island. | 40.7128° N, 74.0060° W | A dense, world‑famous metropolis that sits on a series of islands (Manhattan, Staten Island, parts of the Bronx) and the mainland (Brooklyn and Queens). |\n\n### Quick geographic context\n- **On a map of the United States:** New York State is in the **Northeast** region, just east of the Great Lakes and north of Pennsylvania. \n- **From Washington, D.C.:** Travel roughly **225 mi (360 km) northeast**. \n- **From Boston, MA:** Travel about **215 mi (350 km) southwest**. \n- **From Toronto, Canada:** Travel about **500 mi (800 km) southeast**.\n\n### Travel tips\n- **By air:** Major airports include **John F. Kennedy International (JFK)**, **LaGuardia (LGA)**, and **Newark Liberty International (EWR)** (the latter is actually in New Jersey but serves the NYC metro area). \n- **By train:** Amtrak’s **Northeast Corridor** runs from **Boston → New York City → Washington, D.C.** \n- **By car:** Interstates **I‑87** (north‑south) and **I‑90** (east‑west) are the primary highways crossing the state.\n\n### Fun fact\n- The name “**New York**” was given by the English in 1664, honoring the Duke of York (later King James II). The city’s original Dutch name was **“New Amsterdam.”**\n\nIf you need more specific directions (e.g., how to get to a particular neighborhood, landmark, or the state capital **Albany**), just let me know!","reasoning_content":null,"tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null,"mm_embedding_handle":null,"disaggregated_params":null,"avg_decoded_tokens_per_iter":1.0}],"usage":{"prompt_tokens":72,"total_tokens":705,"completion_tokens":633},"prompt_token_ids":null}
```
@@ -349,31 +347,33 @@ P99 E2EL (ms): [result]
For a single request, ITLs are the time intervals between tokens, while TPOT is the average of those intervals:
-```math
-\text{TPOT (1\ request)} = \text{Avg(ITL)} = \frac{\text{E2E\ latency} - \text{TTFT}}{\text{\#Output\ Tokens} - 1}
-```
+$$
+\text{TPOT (1 request)} = \text{Avg(ITL)} = \frac{\text{E2E latency} - \text{TTFT}}{\text{Num Output Tokens} - 1}
+$$
Across different requests, **average TPOT** is the mean of each request's TPOT (all requests weighted equally), while **average ITL** is token-weighted (all tokens weighted equally):
-```math
+$$
\text{Avg TPOT (N requests)} = \frac{\text{TPOT}_1 + \text{TPOT}_2 + \cdots + \text{TPOT}_N}{N}
-```
+$$
-```math
-\text{Avg ITL (N requests)} = \frac{\text{Sum of all ITLs across requests}}{\text{\#Output Tokens across requests}}
-```
+$$
+\text{Avg ITL (N requests)} = \frac{\text{Sum of all ITLs across requests}}{\text{Num Output Tokens across requests}}
+$$
#### End-to-End (E2E) Latency
* The typical total time from when a request is submitted until the final token of the response is received.
#### Total Token Throughput
* The combined rate at which the system processes both input (prompt) tokens and output (generated) tokens.
-```math
-\text{Total\ TPS} = \frac{\text{\#Input\ Tokens}+\text{\#Output\ Tokens}}{T_{last} - T_{first}}
-```
+
+$$
+\text{Total TPS} = \frac{\text{Num Input Tokens}+\text{Num Output Tokens}}{T_{last} - T_{first}}
+$$
#### Tokens Per Second (TPS) or Output Token Throughput
* how many output tokens the system generates each second.
-```math
-\text{TPS} = \frac{\text{\#Output\ Tokens}}{T_{last} - T_{first}}
-```
+
+$$
+\text{TPS} = \frac{\text{Num Output Tokens}}{T_{last} - T_{first}}
+$$
diff --git a/latest/_sources/deployment-guide/deployment-guide-for-kimi-k2-thinking-on-trtllm.md.txt b/latest/_sources/deployment-guide/deployment-guide-for-kimi-k2-thinking-on-trtllm.md.txt
new file mode 100644
index 0000000000..d8ec17daff
--- /dev/null
+++ b/latest/_sources/deployment-guide/deployment-guide-for-kimi-k2-thinking-on-trtllm.md.txt
@@ -0,0 +1,308 @@
+# Deployment Guide for Kimi K2 Thinking on TensorRT LLM - Blackwell
+
+## Introduction
+
+This is a quickstart guide for running the Kimi K2 Thinking model on TensorRT LLM. It focuses on a working setup with recommended defaults.
+
+## Prerequisites
+
+* GPU: NVIDIA Blackwell Architecture
+* OS: Linux
+* Drivers: CUDA Driver 575 or Later
+* Docker with NVIDIA Container Toolkit installed
+* Python3 and python3-pip (Optional, for accuracy evaluation only)
+
+## Models
+
+* NVFP4 model: [Kimi-K2-Thinking-NVFP4](https://huggingface.co/nvidia/Kimi-K2-Thinking-NVFP4)
+
+
+## Deploy Kimi K2 Thinking on DGX B200 through Docker
+
+### Prepare Docker image
+
+Build and run the docker container. See the [Docker guide](../../../docker/README.md) for details.
+```bash
+cd TensorRT-LLM
+
+make -C docker release_build IMAGE_TAG=kimi-k2-thinking-local
+
+make -C docker release_run IMAGE_NAME=tensorrt_llm IMAGE_TAG=kimi-k2-thinking-local LOCAL_USER=1
+```
+
+### Launch the TensorRT LLM Server
+
+Prepare an `EXTRA_OPTIONS_YAML_FILE` that specifies LLM API arguments when deploying the model. An example YAML file is as follows:
+
+```yaml
+max_batch_size: 128
+max_num_tokens: 8448
+max_seq_len: 8212
+tensor_parallel_size: 8
+moe_expert_parallel_size: 8
+enable_attention_dp: true
+pipeline_parallel_size: 1
+print_iter_log: true
+kv_cache_config:
+ free_gpu_memory_fraction: 0.75
+ dtype: fp8
+cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 8448
+trust_remote_code: true
+```
+
+This YAML file specifies configurations that deploy the model with 8-way expert parallelism for the MoE part and 8-way attention data parallelism. It also enables `trust_remote_code`, so that it works with the Kimi K2 Thinking customized [tokenizer](https://huggingface.co/nvidia/Kimi-K2-Thinking-NVFP4/blob/main/tokenization_kimi.py).
+
+
+With the `EXTRA_OPTIONS_YAML_FILE`, use the following example command to launch the TensorRT LLM server with the Kimi-K2-Thinking-NVFP4 model from within the container.
+
+```bash
+trtllm-serve nvidia/Kimi-K2-Thinking-NVFP4 \
+ --host 0.0.0.0 --port 8000 \
+ --extra_llm_api_options ${EXTRA_OPTIONS_YAML_FILE}
+```
+
+TensorRT LLM will load weights and select the best kernels during startup. The server is successfully launched when the following log is shown:
+
+```log
+INFO: Started server process [xxxxx]
+INFO: Waiting for application startup.
+INFO: Application startup complete.
+INFO: Uvicorn running on http://localhost:8000 (Press CTRL+C to quit)
+```
+
+You can query the health/readiness of the server using:
+
+```shell
+curl -s -o /dev/null -w "Status: %{http_code}\n" "http://localhost:8000/health"
+```
+
+When the `Status: 200` code is returned, the server is ready for queries.
+
+## Deploy Kimi K2 Thinking on GB200 NVL72 through SLURM with wide EP and disaggregated serving
+
+TensorRT LLM provides a set of SLURM scripts that can be easily configured through YAML files and automatically launch SLURM jobs on GB200 NVL72 clusters for deployment, benchmarking, and accuracy testing purposes. The scripts are located at `examples/disaggregated/slurm/benchmark`. Refer to [this page](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/wide_ep/slurm_scripts) for more details and example wide EP config files.
+
+For Kimi K2 Thinking, an example configuration for SLURM arguments and the scripts is as follows:
+
+```yaml
+# SLURM Configuration
+slurm:
+ script_file: "disaggr_torch.slurm"
+ partition: ""
+ account: ""
+ job_time: "02:00:00"
+ job_name: ""
+ extra_args: "" # Cluster specific arguments, e.g. "--gres=gpu:4 --exclude=node1,node2"
+ numa_bind: true # Only enable for GB200 NVL72
+
+# Benchmark Mode
+benchmark:
+ mode: "e2e" # Options: e2e, gen_only
+ use_nv_sa_benchmark: false # Whether to use NVIDIA SA benchmark script
+ multi_round: 8 # Number of benchmark rounds
+ benchmark_ratio: 0.8 # Benchmark ratio
+ streaming: true # Enable streaming mode
+ concurrency_list: "16"
+ input_length: 1024 # Input sequence length
+ output_length: 1024 # Output sequence length
+ dataset_file: ""
+
+# Hardware Configuration
+hardware:
+ gpus_per_node: 4 # Modify this with your hardware configuration
+ num_ctx_servers: 4 # Number of context servers
+ num_gen_servers: 1 # Number of generation servers
+
+# Environment Configuration
+environment:
+ container_mount: "" # Format: path1:path1,path2:path2
+ container_image: ""
+ model_path: ""
+ trtllm_repo: ""
+ build_wheel: false # Don't build the wheel when launching multiple jobs
+ trtllm_wheel_path: "" # Path to pre-built TensorRT-LLM wheel. If provided, install from this wheel instead
+ work_dir: ""
+ worker_env_var: "TLLM_LOG_LEVEL=INFO TRTLLM_SERVER_DISABLE_GC=1 TRTLLM_WORKER_DISABLE_GC=1 TRTLLM_ENABLE_PDL=1 ENROOT_ALLOW_DEV=yes"
+ server_env_var: "TRTLLM_SERVER_DISABLE_GC=1"
+
+# Worker Configuration
+worker_config:
+ gen:
+ tensor_parallel_size: 32
+ moe_expert_parallel_size: 32
+ enable_attention_dp: true
+ enable_lm_head_tp_in_adp: true
+ pipeline_parallel_size: 1
+ max_batch_size: 128
+ max_num_tokens: 128
+ max_seq_len: 9236
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ - 32
+ - 64
+ - 128
+ - 256
+ - 512
+ - 768
+ - 1024
+ - 2048
+ print_iter_log: true
+ kv_cache_config:
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ dtype: fp8
+ moe_config:
+ backend: WIDEEP
+ use_low_precision_moe_combine: true
+ load_balancer:
+ num_slots: 416
+ layer_updates_per_iter: 1
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 8448
+ stream_interval: 20
+ num_postprocess_workers: 4
+ trust_remote_code: true
+ ctx:
+ max_batch_size: 1
+ max_num_tokens: 8448
+ max_seq_len: 8212
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ enable_attention_dp: true
+ pipeline_parallel_size: 1
+ print_iter_log: true
+ cuda_graph_config: null
+ disable_overlap_scheduler: true
+ kv_cache_config:
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.75
+ dtype: fp8
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 8448
+ trust_remote_code: true
+```
+
+It includes SLURM-specific configurations, benchmark and hardware details, and environment settings. The `worker_config` field includes detailed settings for context and generation servers when deploying a disaggregated server, with each specified as a list of LLM API arguments.
+
+To launch SLURM jobs with the YAML config file, execute the following command:
+```shell
+cd /examples/disaggregated/slurm/benchmark
+python3 submit.py -c config.yaml
+```
+
+## Query the OpenAI-compatible API Endpoint
+
+After the TensorRT LLM server is set up and shows `Application startup complete`, you can send requests to the server.
+
+```shell
+curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
+ "model": "nvidia/Kimi-K2-Thinking-NVFP4",
+ "messages": [
+ {
+ "role": "user",
+ "content": "Where is New York?"
+ }
+ ],
+ "max_tokens": 128,
+ "top_p": 1.0
+}' -w "\n"
+```
+
+Example response:
+
+```json
+{
+ "id": "chatcmpl-5907ed752eb44d11a12893b19f79f8ca",
+ "object": "chat.completion",
+ "created": 1764866686,
+ "model": "nvidia/Kimi-K2-Thinking-NVFP4",
+ "choices": [
+ {
+ "index": 0,
+ "message": {
+ "role": "assistant",
+ "content": " The user is asking a very simple question: \"Where is New York?\" This could be interpreted in a few ways:\n\n1. Where is New York State located?\n2. Where is New York City located?\n3. Where is New York located in relation to something else?\n\nGiven the ambiguity, I should provide a comprehensive answer that covers the main interpretations. I should be clear and direct.\n\nLet me structure my answer:\n- First, clarify that \"New York\" can refer to either New York State or New York City\n- For New York State: It's located in the northeastern United States, bordered by New Jersey, Pennsylvania, Connecticut",
+ "reasoning_content": "",
+ "reasoning": null,
+ "tool_calls": []
+ },
+ "logprobs": null,
+ "finish_reason": "length",
+ "stop_reason": null,
+ "mm_embedding_handle": null,
+ "disaggregated_params": null,
+ "avg_decoded_tokens_per_iter": 1.0
+ }
+ ],
+ "usage": {
+ "prompt_tokens": 12,
+ "total_tokens": 140,
+ "completion_tokens": 128,
+ "prompt_tokens_details": {
+ "cached_tokens": 0
+ }
+ },
+ "prompt_token_ids": null
+}
+```
+
+## Benchmark
+
+To benchmark the performance of your TensorRT LLM server, you can leverage the built-in `benchmark_serving.py` script. To do this, first create a wrapper `bench.sh` script.
+
+```shell
+cat <<'EOF' > bench.sh
+#!/usr/bin/env bash
+set -euo pipefail
+
+concurrency_list="1 2 4 8 16 32 64 128 256"
+multi_round=5
+isl=1024
+osl=1024
+result_dir=/tmp/kimi_k2_thinking_output
+
+for concurrency in ${concurrency_list}; do
+ num_prompts=$((concurrency * multi_round))
+ python -m tensorrt_llm.serve.scripts.benchmark_serving \
+ --model nvidia/Kimi-K2-Thinking-NVFP4 \
+ --backend openai \
+ --dataset-name "random" \
+ --random-input-len ${isl} \
+ --random-output-len ${osl} \
+ --random-prefix-len 0 \
+ --random-ids \
+ --num-prompts ${num_prompts} \
+ --max-concurrency ${concurrency} \
+ --ignore-eos \
+ --tokenize-on-client \
+ --percentile-metrics "ttft,tpot,itl,e2el"
+done
+EOF
+chmod +x bench.sh
+```
+
+If you want to save the results to a file, add the following options:
+
+```shell
+--save-result \
+--result-dir "${result_dir}" \
+--result-filename "concurrency_${concurrency}.json"
+```
+
+For more benchmarking options, see [benchmark_serving.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/serve/scripts/benchmark_serving.py).
+
+Run `bench.sh` to begin a serving benchmark.
+
+```shell
+./bench.sh
+```
diff --git a/latest/_sources/deployment-guide/deployment-guide-for-llama3.3-70b-on-trtllm.md.txt b/latest/_sources/deployment-guide/deployment-guide-for-llama3.3-70b-on-trtllm.md.txt
index 07fd29fb3e..d227b2f440 100644
--- a/latest/_sources/deployment-guide/deployment-guide-for-llama3.3-70b-on-trtllm.md.txt
+++ b/latest/_sources/deployment-guide/deployment-guide-for-llama3.3-70b-on-trtllm.md.txt
@@ -39,7 +39,7 @@ docker run --rm -it \
-p 8000:8000 \
-v ~/.cache:/root/.cache:rw \
--name tensorrt_llm \
-nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc4 \
+nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc5 \
/bin/bash
```
@@ -354,31 +354,33 @@ P99 E2EL (ms): [result]
For a single request, ITLs are the time intervals between tokens, while TPOT is the average of those intervals:
-```math
-\text{TPOT (1\ request)} = \text{Avg(ITL)} = \frac{\text{E2E\ latency} - \text{TTFT}}{\text{\#Output\ Tokens} - 1}
-```
+$$
+\text{TPOT (1 request)} = \text{Avg(ITL)} = \frac{\text{E2E latency} - \text{TTFT}}{\text{Num Output Tokens} - 1}
+$$
Across different requests, **average TPOT** is the mean of each request's TPOT (all requests weighted equally), while **average ITL** is token-weighted (all tokens weighted equally):
-```math
+$$
\text{Avg TPOT (N requests)} = \frac{\text{TPOT}_1 + \text{TPOT}_2 + \cdots + \text{TPOT}_N}{N}
-```
+$$
-```math
-\text{Avg ITL (N requests)} = \frac{\text{Sum of all ITLs across requests}}{\text{\#Output Tokens across requests}}
-```
+$$
+\text{Avg ITL (N requests)} = \frac{\text{Sum of all ITLs across requests}}{\text{Num Output Tokens across requests}}
+$$
#### End-to-End (E2E) Latency
* The typical total time from when a request is submitted until the final token of the response is received.
#### Total Token Throughput
* The combined rate at which the system processes both input (prompt) tokens and output (generated) tokens.
-```math
-\text{Total\ TPS} = \frac{\text{\#Input\ Tokens}+\text{\#Output\ Tokens}}{T_{last} - T_{first}}
-```
+
+$$
+\text{Total TPS} = \frac{\text{Num Input Tokens}+\text{Num Output Tokens}}{T_{last} - T_{first}}
+$$
#### Tokens Per Second (TPS) or Output Token Throughput
* how many output tokens the system generates each second.
-```math
-\text{TPS} = \frac{\text{\#Output\ Tokens}}{T_{last} - T_{first}}
-```
+
+$$
+\text{TPS} = \frac{\text{Num Output Tokens}}{T_{last} - T_{first}}
+$$
diff --git a/latest/_sources/deployment-guide/deployment-guide-for-llama4-scout-on-trtllm.md.txt b/latest/_sources/deployment-guide/deployment-guide-for-llama4-scout-on-trtllm.md.txt
index 090f9d9b13..509a5cf00f 100644
--- a/latest/_sources/deployment-guide/deployment-guide-for-llama4-scout-on-trtllm.md.txt
+++ b/latest/_sources/deployment-guide/deployment-guide-for-llama4-scout-on-trtllm.md.txt
@@ -38,7 +38,7 @@ docker run --rm -it \
-p 8000:8000 \
-v ~/.cache:/root/.cache:rw \
--name tensorrt_llm \
-nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc4 \
+nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc5 \
/bin/bash
```
@@ -346,31 +346,33 @@ P99 E2EL (ms): [result]
For a single request, ITLs are the time intervals between tokens, while TPOT is the average of those intervals:
-```math
-\text{TPOT (1\ request)} = \text{Avg(ITL)} = \frac{\text{E2E\ latency} - \text{TTFT}}{\text{\#Output\ Tokens} - 1}
-```
+$$
+\text{TPOT (1 request)} = \text{Avg(ITL)} = \frac{\text{E2E latency} - \text{TTFT}}{\text{Num Output Tokens} - 1}
+$$
Across different requests, **average TPOT** is the mean of each request's TPOT (all requests weighted equally), while **average ITL** is token-weighted (all tokens weighted equally):
-```math
+$$
\text{Avg TPOT (N requests)} = \frac{\text{TPOT}_1 + \text{TPOT}_2 + \cdots + \text{TPOT}_N}{N}
-```
+$$
-```math
-\text{Avg ITL (N requests)} = \frac{\text{Sum of all ITLs across requests}}{\text{\#Output Tokens across requests}}
-```
+$$
+\text{Avg ITL (N requests)} = \frac{\text{Sum of all ITLs across requests}}{\text{Num Output Tokens across requests}}
+$$
#### End-to-End (E2E) Latency
* The typical total time from when a request is submitted until the final token of the response is received.
#### Total Token Throughput
* The combined rate at which the system processes both input (prompt) tokens and output (generated) tokens.
-```math
-\text{Total\ TPS} = \frac{\text{\#Input\ Tokens}+\text{\#Output\ Tokens}}{T_{last} - T_{first}}
-```
+
+$$
+\text{Total TPS} = \frac{\text{Num Input Tokens}+\text{Num Output Tokens}}{T_{last} - T_{first}}
+$$
#### Tokens Per Second (TPS) or Output Token Throughput
* how many output tokens the system generates each second.
-```math
-\text{TPS} = \frac{\text{\#Output\ Tokens}}{T_{last} - T_{first}}
-```
+
+$$
+\text{TPS} = \frac{\text{Num Output Tokens}}{T_{last} - T_{first}}
+$$
diff --git a/latest/_sources/deployment-guide/deployment-guide-for-qwen3-on-trtllm.md.txt b/latest/_sources/deployment-guide/deployment-guide-for-qwen3-on-trtllm.md.txt
new file mode 100644
index 0000000000..190740ebd8
--- /dev/null
+++ b/latest/_sources/deployment-guide/deployment-guide-for-qwen3-on-trtllm.md.txt
@@ -0,0 +1,256 @@
+# Deployment Guide for Qwen3 on TensorRT LLM - Blackwell & Hopper Hardware
+
+## Introduction
+
+This is a functional quick-start guide for running the Qwen3 model on TensorRT LLM. It focuses on a working setup with recommended defaults. Additional performance optimizations and support will be rolled out in future updates.
+
+## Prerequisites
+
+* GPU: NVIDIA Blackwell or Hopper Architecture
+* OS: Linux
+* Drivers: CUDA Driver 575 or Later
+* Docker with NVIDIA Container Toolkit installed
+* Python3 and python3-pip (Optional, for accuracy evaluation only)
+
+## Models
+
+* [Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B)
+* [Qwen3-235B-A22B](https://huggingface.co/Qwen/Qwen3-235B-A22B)
+* [Qwen3-235B-A22B-FP8](https://huggingface.co/Qwen/Qwen3-235B-A22B-FP8)
+* [Qwen3-30B-A3B-NVFP4](https://huggingface.co/nvidia/Qwen3-30B-A3B-NVFP4)
+* [Qwen3-235B-A22B-NVFP4](https://huggingface.co/nvidia/Qwen3-235B-A22B-NVFP4)
+
+## Deployment Steps
+
+### Run Docker Container
+
+Build and run the docker container. See the [Docker guide](../../../docker/README.md) for details.
+
+```shell
+cd TensorRT-LLM
+
+make -C docker release_build IMAGE_TAG=qwen3-local
+
+make -C docker release_run IMAGE_NAME=tensorrt_llm IMAGE_TAG=qwen3-local LOCAL_USER=1
+```
+
+### Recommended Performance Settings
+
+We maintain YAML configuration files with recommended performance settings in the [`examples/configs`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/configs) directory. These config files are present in the TensorRT LLM container at the path `/app/tensorrt_llm/examples/configs`. You can use these out-of-the-box, or adjust them to your specific use case.
+
+```shell
+TRTLLM_DIR=/app/tensorrt_llm # change as needed to match your environment
+EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/qwen3.yaml
+```
+
+Note: if you don't have access to the source code locally, you can manually create the YAML config file using the code in the dropdown below.
+
+````{admonition} Show code
+:class: dropdown
+
+```{literalinclude} ../../../examples/configs/qwen3.yaml
+---
+language: shell
+prepend: |
+ EXTRA_LLM_API_FILE=/tmp/config.yml
+
+ cat << EOF > ${EXTRA_LLM_API_FILE}
+append: EOF
+---
+```
+````
+
+
+### Launch the TensorRT LLM Server
+
+Below is an example command to launch the TensorRT LLM server with the Qwen3 model from within the container.
+
+```shell
+trtllm-serve Qwen/Qwen3-30B-A3B --host 0.0.0.0 --port 8000 --extra_llm_api_options ${EXTRA_LLM_API_FILE}
+```
+
+After the server is set up, the client can now send prompt requests to the server and receive results.
+
+### LLM API Options (YAML Configuration)
+
+
+
+These options provide control over TensorRT LLM's behavior and are set within the YAML file passed to the `trtllm-serve` command via the `--extra_llm_api_options` argument.
+
+#### `tensor_parallel_size`
+
+* **Description:** Sets the **tensor-parallel size**. This should typically match the number of GPUs you intend to use for a single model instance.
+
+#### `moe_expert_parallel_size`
+
+* **Description:** Sets the **expert-parallel size** for Mixture-of-Experts (MoE) models. Like `tensor_parallel_size`, this should generally match the number of GPUs you're using. This setting has no effect on non-MoE models.
+
+#### `kv_cache_free_gpu_memory_fraction`
+
+* **Description:** A value between `0.0` and `1.0` that specifies the fraction of free GPU memory to reserve for the KV cache after the model is loaded. Since memory usage can fluctuate, this buffer helps prevent out-of-memory (OOM) errors.
+* **Recommendation:** If you experience OOM errors, try reducing this value to `0.7` or lower.
+
+
+#### `max_batch_size`
+
+* **Description:** The maximum number of user requests that can be grouped into a single batch for processing. The actual max batch size that can be achieved depends on total sequence length (input + output).
+
+#### `max_num_tokens`
+
+* **Description:** The maximum total number of tokens (across all requests) allowed inside a single scheduled batch.
+
+#### `max_seq_len`
+
+* **Description:** The maximum possible sequence length for a single request, including both input and generated output tokens. This guide does not set it explicitly; it is inferred from the model config.
+
+#### `trust_remote_code`
+* **Description:** Allows TensorRT LLM to download models and tokenizers from Hugging Face. This flag is passed directly to the Hugging Face API.
+
+#### `cuda_graph_config`
+
+* **Description**: A section for configuring CUDA graphs to optimize performance.
+
+* **Options**:
+
+ * `enable_padding`: If `true`, input batches are padded to the nearest `cuda_graph_batch_size`. This can significantly improve performance.
+
+ **Default**: `false`
+
+ * `batch_sizes`: List of batch sizes for which CUDA graphs will be pre-captured.
+
+ **Recommendation**: Set this to cover the range of batch sizes you expect in production.
+
+#### `moe_config`
+
+* **Description**: Configuration for Mixture-of-Experts (MoE) models.
+
+* **Options**:
+
+ * `backend`: The backend to use for MoE operations.
+
+ **Default**: `CUTLASS`
+
+See the [`TorchLlmArgs` class](https://nvidia.github.io/TensorRT-LLM/llm-api/reference.html#tensorrt_llm.llmapi.TorchLlmArgs) for the full list of options which can be used in the `extra_llm_api_options`.
+
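+The same options map onto LLM API arguments and can also be set programmatically. The sketch below mirrors the YAML keys described above; the `KvCacheConfig`, `CudaGraphConfig`, and `MoeConfig` imports are assumptions to verify against the `TorchLlmArgs` reference linked above, and the values are illustrative.
+
+```python
+from tensorrt_llm import LLM
+# Assumed import location for these config classes; verify against your TensorRT LLM version.
+from tensorrt_llm.llmapi import CudaGraphConfig, KvCacheConfig, MoeConfig
+
+llm = LLM(
+    model="Qwen/Qwen3-30B-A3B",
+    tensor_parallel_size=4,
+    moe_expert_parallel_size=4,
+    max_batch_size=128,    # illustrative value
+    max_num_tokens=8192,   # illustrative value
+    trust_remote_code=True,
+    kv_cache_config=KvCacheConfig(free_gpu_memory_fraction=0.85),
+    cuda_graph_config=CudaGraphConfig(enable_padding=True, batch_sizes=[1, 2, 4, 8]),
+    moe_config=MoeConfig(backend="CUTLASS"),
+)
+```
+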
+## Testing API Endpoint
+
+### Basic Test
+
+Start a new terminal on the host to test the TensorRT LLM server you just launched.
+
+You can query the health/readiness of the server using:
+
+```shell
+curl -s -o /dev/null -w "Status: %{http_code}\n" "http://localhost:8000/health"
+```
+
+When the `Status: 200` code is returned, the server is ready for queries. Note that the very first query may take longer due to initialization and compilation.
+
+After the TensorRT LLM server is set up and shows `Application startup complete`, you can send requests to the server.
+
+```shell
+curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
+ "model": "Qwen/Qwen3-30B-A3B",
+ "messages": [
+ {
+ "role": "user",
+ "content": "What is the capital of France?"
+ }
+ ],
+ "max_tokens": 512,
+ "temperature": 0.7,
+ "top_p": 0.95
+}' -w "\n"
+```
+
+Here is an example response:
+
+```json
+{
+ "id": "chatcmpl-abc123def456",
+ "object": "chat.completion",
+ "created": 1759022940,
+ "model": "Qwen/Qwen3-30B-A3B",
+ "choices": [
+ {
+ "index": 0,
+ "message": {
+ "role": "assistant",
+ "content": "The capital of France is Paris. Paris is not only the capital but also the largest city in France, known for its rich history, culture, art, and iconic landmarks such as the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral."
+ },
+ "logprobs": null,
+ "finish_reason": "stop"
+ }
+ ],
+ "usage": {
+ "prompt_tokens": 15,
+ "completion_tokens": 58,
+ "total_tokens": 73
+ }
+}
+```
+
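+You can also query the endpoint programmatically with the OpenAI Python client. The snippet below is a sketch that assumes `pip install openai` and the server launched above; the API key is an arbitrary placeholder.
+
+```python
+from openai import OpenAI
+
+client = OpenAI(base_url="http://localhost:8000/v1", api_key="tensorrt_llm")
+
+response = client.chat.completions.create(
+    model="Qwen/Qwen3-30B-A3B",
+    messages=[{"role": "user", "content": "What is the capital of France?"}],
+    max_tokens=512,
+    temperature=0.7,
+    top_p=0.95,
+)
+print(response.choices[0].message.content)
+```
+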
+### Troubleshooting Tips
+
+* If you encounter CUDA out-of-memory errors, try reducing `max_batch_size`, `max_num_tokens`, or `kv_cache_free_gpu_memory_fraction`.
+* Ensure your model checkpoints are compatible with the expected format.
+* For performance issues, check GPU utilization with `nvidia-smi` while the server is running.
+* If the container fails to start, verify that the NVIDIA Container Toolkit is properly installed.
+* For connection issues, make sure the server port (`8000` in this guide) is not being used by another application.
+* For MoE models (Qwen3-30B-A3B, Qwen3-235B-A22B), ensure `moe_expert_parallel_size` is properly configured.
+
+## Benchmarking Performance
+
+To benchmark the performance of your TensorRT LLM server, you can leverage the built-in `benchmark_serving.py` script. To do this, first create a wrapper `bench.sh` script.
+
+```shell
+cat <<'EOF' > bench.sh
+#!/usr/bin/env bash
+set -euo pipefail
+
+# Adjust the model name based on which Qwen3 model you're benchmarking
+MODEL_NAME="Qwen/Qwen3-30B-A3B"
+
+concurrency_list="1 2 4 8 16 32 64 128"
+multi_round=5
+isl=1024
+osl=1024
+result_dir=/tmp/qwen3_output
+
+for concurrency in ${concurrency_list}; do
+ num_prompts=$((concurrency * multi_round))
+ python -m tensorrt_llm.serve.scripts.benchmark_serving \
+ --model ${MODEL_NAME} \
+ --backend openai \
+ --dataset-name "random" \
+ --random-input-len ${isl} \
+ --random-output-len ${osl} \
+ --random-prefix-len 0 \
+ --random-ids \
+ --num-prompts ${num_prompts} \
+ --max-concurrency ${concurrency} \
+ --ignore-eos \
+ --tokenize-on-client \
+ --percentile-metrics "ttft,tpot,itl,e2el"
+done
+EOF
+chmod +x bench.sh
+```
+
+To achieve maximum throughput with attention DP enabled, sweep concurrency up to `concurrency = max_batch_size * num_gpus` (for example, with a `max_batch_size` of 128 on 4 GPUs, sweep up to a concurrency of 512).
+
+If you want to save the results to a file, add the following options:
+
+```shell
+--save-result \
+--result-dir "${result_dir}" \
+--result-filename "concurrency_${concurrency}.json"
+```
+
+For more benchmarking options, see [benchmark_serving.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/serve/scripts/benchmark_serving.py).
+
+Run `bench.sh` to begin a serving benchmark. This will take a long time if you run all the concurrencies mentioned in the above `bench.sh` script.
+
+```shell
+./bench.sh
+```
diff --git a/latest/_sources/deployment-guide/index.rst.txt b/latest/_sources/deployment-guide/index.rst.txt
index a5a085d6e2..ed7fd9c536 100644
--- a/latest/_sources/deployment-guide/index.rst.txt
+++ b/latest/_sources/deployment-guide/index.rst.txt
@@ -91,4 +91,6 @@ The deployment guides below provide more detailed instructions for serving speci
deployment-guide-for-llama3.3-70b-on-trtllm.md
deployment-guide-for-llama4-scout-on-trtllm.md
deployment-guide-for-gpt-oss-on-trtllm.md
+ deployment-guide-for-qwen3-on-trtllm.md
deployment-guide-for-qwen3-next-on-trtllm.md
+ deployment-guide-for-kimi-k2-thinking-on-trtllm.md
diff --git a/latest/_sources/examples/curl_chat_client.rst.txt b/latest/_sources/examples/curl_chat_client.rst.txt
index 69e2fbc308..d3709ccd9c 100644
--- a/latest/_sources/examples/curl_chat_client.rst.txt
+++ b/latest/_sources/examples/curl_chat_client.rst.txt
@@ -2,7 +2,7 @@ Curl Chat Client
================
Refer to the `trtllm-serve documentation `_ for starting a server.
-Source https://github.com/NVIDIA/TensorRT-LLM/blob/a761585d9c15b4c1249aaf65a8f90764efa83a3c/examples/serve/curl_chat_client.sh.
+Source https://github.com/NVIDIA/TensorRT-LLM/blob/e4c707845ff58fcc0b1d87afb4dd0e64885c780a/examples/serve/curl_chat_client.sh.
.. literalinclude:: ../../../examples/serve/curl_chat_client.sh
:lines: 1-11
diff --git a/latest/_sources/examples/curl_chat_client_for_multimodal.rst.txt b/latest/_sources/examples/curl_chat_client_for_multimodal.rst.txt
index 0d27f990b9..73760884c2 100644
--- a/latest/_sources/examples/curl_chat_client_for_multimodal.rst.txt
+++ b/latest/_sources/examples/curl_chat_client_for_multimodal.rst.txt
@@ -2,7 +2,7 @@ Curl Chat Client For Multimodal
===============================
Refer to the `trtllm-serve documentation `_ for starting a server.
-Source https://github.com/NVIDIA/TensorRT-LLM/blob/a761585d9c15b4c1249aaf65a8f90764efa83a3c/examples/serve/curl_chat_client_for_multimodal.sh.
+Source https://github.com/NVIDIA/TensorRT-LLM/blob/e4c707845ff58fcc0b1d87afb4dd0e64885c780a/examples/serve/curl_chat_client_for_multimodal.sh.
.. literalinclude:: ../../../examples/serve/curl_chat_client_for_multimodal.sh
:lines: 1-88
diff --git a/latest/_sources/examples/curl_completion_client.rst.txt b/latest/_sources/examples/curl_completion_client.rst.txt
index ab346513d1..c2f4e9a14e 100644
--- a/latest/_sources/examples/curl_completion_client.rst.txt
+++ b/latest/_sources/examples/curl_completion_client.rst.txt
@@ -2,7 +2,7 @@ Curl Completion Client
======================
Refer to the `trtllm-serve documentation `_ for starting a server.
-Source https://github.com/NVIDIA/TensorRT-LLM/blob/a761585d9c15b4c1249aaf65a8f90764efa83a3c/examples/serve/curl_completion_client.sh.
+Source https://github.com/NVIDIA/TensorRT-LLM/blob/e4c707845ff58fcc0b1d87afb4dd0e64885c780a/examples/serve/curl_completion_client.sh.
.. literalinclude:: ../../../examples/serve/curl_completion_client.sh
:lines: 1-10
diff --git a/latest/_sources/examples/deepseek_r1_reasoning_parser.rst.txt b/latest/_sources/examples/deepseek_r1_reasoning_parser.rst.txt
index 126dac768c..4e0a039fe1 100644
--- a/latest/_sources/examples/deepseek_r1_reasoning_parser.rst.txt
+++ b/latest/_sources/examples/deepseek_r1_reasoning_parser.rst.txt
@@ -2,9 +2,9 @@ Deepseek R1 Reasoning Parser
============================
Refer to the `trtllm-serve documentation `_ for starting a server.
-Source https://github.com/NVIDIA/TensorRT-LLM/blob/a761585d9c15b4c1249aaf65a8f90764efa83a3c/examples/serve/deepseek_r1_reasoning_parser.sh.
+Source https://github.com/NVIDIA/TensorRT-LLM/blob/e4c707845ff58fcc0b1d87afb4dd0e64885c780a/examples/serve/deepseek_r1_reasoning_parser.sh.
.. literalinclude:: ../../../examples/serve/deepseek_r1_reasoning_parser.sh
- :lines: 1-10
+ :lines: 1-23
:language: bash
:linenos:
diff --git a/latest/_sources/examples/genai_perf_client.rst.txt b/latest/_sources/examples/genai_perf_client.rst.txt
index 88a709f897..4f222352aa 100644
--- a/latest/_sources/examples/genai_perf_client.rst.txt
+++ b/latest/_sources/examples/genai_perf_client.rst.txt
@@ -2,7 +2,7 @@ Genai Perf Client
=================
Refer to the `trtllm-serve documentation `_ for starting a server.
-Source https://github.com/NVIDIA/TensorRT-LLM/blob/a761585d9c15b4c1249aaf65a8f90764efa83a3c/examples/serve/genai_perf_client.sh.
+Source https://github.com/NVIDIA/TensorRT-LLM/blob/e4c707845ff58fcc0b1d87afb4dd0e64885c780a/examples/serve/genai_perf_client.sh.
.. literalinclude:: ../../../examples/serve/genai_perf_client.sh
:lines: 1-16
diff --git a/latest/_sources/examples/genai_perf_client_for_multimodal.rst.txt b/latest/_sources/examples/genai_perf_client_for_multimodal.rst.txt
index adec2529d1..6ae821dace 100644
--- a/latest/_sources/examples/genai_perf_client_for_multimodal.rst.txt
+++ b/latest/_sources/examples/genai_perf_client_for_multimodal.rst.txt
@@ -2,7 +2,7 @@ Genai Perf Client For Multimodal
================================
Refer to the `trtllm-serve documentation `_ for starting a server.
-Source https://github.com/NVIDIA/TensorRT-LLM/blob/a761585d9c15b4c1249aaf65a8f90764efa83a3c/examples/serve/genai_perf_client_for_multimodal.sh.
+Source https://github.com/NVIDIA/TensorRT-LLM/blob/e4c707845ff58fcc0b1d87afb4dd0e64885c780a/examples/serve/genai_perf_client_for_multimodal.sh.
.. literalinclude:: ../../../examples/serve/genai_perf_client_for_multimodal.sh
:lines: 1-19
diff --git a/latest/_sources/examples/llm_guided_decoding.rst.txt b/latest/_sources/examples/llm_guided_decoding.rst.txt
index 5df1749dfa..c7a50512da 100644
--- a/latest/_sources/examples/llm_guided_decoding.rst.txt
+++ b/latest/_sources/examples/llm_guided_decoding.rst.txt
@@ -1,6 +1,6 @@
Generate text with guided decoding
==================================
-Source https://github.com/NVIDIA/TensorRT-LLM/blob/a761585d9c15b4c1249aaf65a8f90764efa83a3c/examples/llm-api/llm_guided_decoding.py.
+Source https://github.com/NVIDIA/TensorRT-LLM/blob/e4c707845ff58fcc0b1d87afb4dd0e64885c780a/examples/llm-api/llm_guided_decoding.py.
.. literalinclude:: ../../../examples/llm-api/llm_guided_decoding.py
:lines: 4-47
diff --git a/latest/_sources/examples/llm_inference.rst.txt b/latest/_sources/examples/llm_inference.rst.txt
index 06286e6cc1..be80e456eb 100644
--- a/latest/_sources/examples/llm_inference.rst.txt
+++ b/latest/_sources/examples/llm_inference.rst.txt
@@ -1,6 +1,6 @@
Generate text
=============
-Source https://github.com/NVIDIA/TensorRT-LLM/blob/a761585d9c15b4c1249aaf65a8f90764efa83a3c/examples/llm-api/llm_inference.py.
+Source https://github.com/NVIDIA/TensorRT-LLM/blob/e4c707845ff58fcc0b1d87afb4dd0e64885c780a/examples/llm-api/llm_inference.py.
.. literalinclude:: ../../../examples/llm-api/llm_inference.py
:lines: 4-35
diff --git a/latest/_sources/examples/llm_inference_async.rst.txt b/latest/_sources/examples/llm_inference_async.rst.txt
index e6568843d7..f7ff40a646 100644
--- a/latest/_sources/examples/llm_inference_async.rst.txt
+++ b/latest/_sources/examples/llm_inference_async.rst.txt
@@ -1,6 +1,6 @@
Generate text asynchronously
============================
-Source https://github.com/NVIDIA/TensorRT-LLM/blob/a761585d9c15b4c1249aaf65a8f90764efa83a3c/examples/llm-api/llm_inference_async.py.
+Source https://github.com/NVIDIA/TensorRT-LLM/blob/e4c707845ff58fcc0b1d87afb4dd0e64885c780a/examples/llm-api/llm_inference_async.py.
.. literalinclude:: ../../../examples/llm-api/llm_inference_async.py
:lines: 4-43
diff --git a/latest/_sources/examples/llm_inference_async_streaming.rst.txt b/latest/_sources/examples/llm_inference_async_streaming.rst.txt
index e03865efe9..0736586f2f 100644
--- a/latest/_sources/examples/llm_inference_async_streaming.rst.txt
+++ b/latest/_sources/examples/llm_inference_async_streaming.rst.txt
@@ -1,6 +1,6 @@
Generate text in streaming
==========================
-Source https://github.com/NVIDIA/TensorRT-LLM/blob/a761585d9c15b4c1249aaf65a8f90764efa83a3c/examples/llm-api/llm_inference_async_streaming.py.
+Source https://github.com/NVIDIA/TensorRT-LLM/blob/e4c707845ff58fcc0b1d87afb4dd0e64885c780a/examples/llm-api/llm_inference_async_streaming.py.
.. literalinclude:: ../../../examples/llm-api/llm_inference_async_streaming.py
:lines: 4-64
diff --git a/latest/_sources/examples/llm_inference_distributed.rst.txt b/latest/_sources/examples/llm_inference_distributed.rst.txt
index 3066b886a0..a04aa99313 100644
--- a/latest/_sources/examples/llm_inference_distributed.rst.txt
+++ b/latest/_sources/examples/llm_inference_distributed.rst.txt
@@ -1,6 +1,6 @@
Distributed LLM Generation
==========================
-Source https://github.com/NVIDIA/TensorRT-LLM/blob/a761585d9c15b4c1249aaf65a8f90764efa83a3c/examples/llm-api/llm_inference_distributed.py.
+Source https://github.com/NVIDIA/TensorRT-LLM/blob/e4c707845ff58fcc0b1d87afb4dd0e64885c780a/examples/llm-api/llm_inference_distributed.py.
.. literalinclude:: ../../../examples/llm-api/llm_inference_distributed.py
:lines: 4-44
diff --git a/latest/_sources/examples/llm_kv_cache_connector.rst.txt b/latest/_sources/examples/llm_kv_cache_connector.rst.txt
index 7440314240..0a150c4a36 100644
--- a/latest/_sources/examples/llm_kv_cache_connector.rst.txt
+++ b/latest/_sources/examples/llm_kv_cache_connector.rst.txt
@@ -1,8 +1,8 @@
KV Cache Connector
==================
-Source https://github.com/NVIDIA/TensorRT-LLM/blob/a761585d9c15b4c1249aaf65a8f90764efa83a3c/examples/llm-api/llm_kv_cache_connector.py.
+Source https://github.com/NVIDIA/TensorRT-LLM/blob/e4c707845ff58fcc0b1d87afb4dd0e64885c780a/examples/llm-api/llm_kv_cache_connector.py.
.. literalinclude:: ../../../examples/llm-api/llm_kv_cache_connector.py
- :lines: 4-247
+ :lines: 4-326
:language: python
:linenos:
diff --git a/latest/_sources/examples/llm_kv_cache_offloading.rst.txt b/latest/_sources/examples/llm_kv_cache_offloading.rst.txt
index bcac6c002c..a64445a962 100644
--- a/latest/_sources/examples/llm_kv_cache_offloading.rst.txt
+++ b/latest/_sources/examples/llm_kv_cache_offloading.rst.txt
@@ -1,6 +1,6 @@
KV Cache Offloading
===================
-Source https://github.com/NVIDIA/TensorRT-LLM/blob/a761585d9c15b4c1249aaf65a8f90764efa83a3c/examples/llm-api/llm_kv_cache_offloading.py.
+Source https://github.com/NVIDIA/TensorRT-LLM/blob/e4c707845ff58fcc0b1d87afb4dd0e64885c780a/examples/llm-api/llm_kv_cache_offloading.py.
.. literalinclude:: ../../../examples/llm-api/llm_kv_cache_offloading.py
:lines: 4-134
diff --git a/latest/_sources/examples/llm_logits_processor.rst.txt b/latest/_sources/examples/llm_logits_processor.rst.txt
index 21211a4a23..b739b44ca9 100644
--- a/latest/_sources/examples/llm_logits_processor.rst.txt
+++ b/latest/_sources/examples/llm_logits_processor.rst.txt
@@ -1,6 +1,6 @@
Control generated text using logits processor
=============================================
-Source https://github.com/NVIDIA/TensorRT-LLM/blob/a761585d9c15b4c1249aaf65a8f90764efa83a3c/examples/llm-api/llm_logits_processor.py.
+Source https://github.com/NVIDIA/TensorRT-LLM/blob/e4c707845ff58fcc0b1d87afb4dd0e64885c780a/examples/llm-api/llm_logits_processor.py.
.. literalinclude:: ../../../examples/llm-api/llm_logits_processor.py
:lines: 4-128
diff --git a/latest/_sources/examples/llm_mgmn_llm_distributed.rst.txt b/latest/_sources/examples/llm_mgmn_llm_distributed.rst.txt
index 0122c5fdab..0a84a19a28 100644
--- a/latest/_sources/examples/llm_mgmn_llm_distributed.rst.txt
+++ b/latest/_sources/examples/llm_mgmn_llm_distributed.rst.txt
@@ -1,8 +1,8 @@
Run LLM-API with pytorch backend on Slurm
=========================================
-Source https://github.com/NVIDIA/TensorRT-LLM/blob/a761585d9c15b4c1249aaf65a8f90764efa83a3c/examples/llm-api/llm_mgmn_llm_distributed.sh.
+Source https://github.com/NVIDIA/TensorRT-LLM/blob/e4c707845ff58fcc0b1d87afb4dd0e64885c780a/examples/llm-api/llm_mgmn_llm_distributed.sh.
.. literalinclude:: ../../../examples/llm-api/llm_mgmn_llm_distributed.sh
- :lines: 1-10,14-55
+ :lines: 1-48,52-94
:language: bash
:linenos:
diff --git a/latest/_sources/examples/llm_mgmn_trtllm_bench.rst.txt b/latest/_sources/examples/llm_mgmn_trtllm_bench.rst.txt
index 66c7eb17be..ddfa9f47ca 100644
--- a/latest/_sources/examples/llm_mgmn_trtllm_bench.rst.txt
+++ b/latest/_sources/examples/llm_mgmn_trtllm_bench.rst.txt
@@ -1,8 +1,8 @@
Run trtllm-bench with pytorch backend on Slurm
==============================================
-Source https://github.com/NVIDIA/TensorRT-LLM/blob/a761585d9c15b4c1249aaf65a8f90764efa83a3c/examples/llm-api/llm_mgmn_trtllm_bench.sh.
+Source https://github.com/NVIDIA/TensorRT-LLM/blob/e4c707845ff58fcc0b1d87afb4dd0e64885c780a/examples/llm-api/llm_mgmn_trtllm_bench.sh.
.. literalinclude:: ../../../examples/llm-api/llm_mgmn_trtllm_bench.sh
- :lines: 1-10,14-95
+ :lines: 1-46,50-131
:language: bash
:linenos:
diff --git a/latest/_sources/examples/llm_mgmn_trtllm_serve.rst.txt b/latest/_sources/examples/llm_mgmn_trtllm_serve.rst.txt
index a0dbdc4e7a..18e6c10c8c 100644
--- a/latest/_sources/examples/llm_mgmn_trtllm_serve.rst.txt
+++ b/latest/_sources/examples/llm_mgmn_trtllm_serve.rst.txt
@@ -1,8 +1,8 @@
Run trtllm-serve with pytorch backend on Slurm
==============================================
-Source https://github.com/NVIDIA/TensorRT-LLM/blob/a761585d9c15b4c1249aaf65a8f90764efa83a3c/examples/llm-api/llm_mgmn_trtllm_serve.sh.
+Source https://github.com/NVIDIA/TensorRT-LLM/blob/e4c707845ff58fcc0b1d87afb4dd0e64885c780a/examples/llm-api/llm_mgmn_trtllm_serve.sh.
.. literalinclude:: ../../../examples/llm-api/llm_mgmn_trtllm_serve.sh
- :lines: 1-10,14-56
+ :lines: 1-46,50-92
:language: bash
:linenos:
diff --git a/latest/_sources/examples/llm_multilora.rst.txt b/latest/_sources/examples/llm_multilora.rst.txt
index 4a6b355a75..b0f9fdf5ec 100644
--- a/latest/_sources/examples/llm_multilora.rst.txt
+++ b/latest/_sources/examples/llm_multilora.rst.txt
@@ -1,6 +1,6 @@
Generate text with multiple LoRA adapters
=========================================
-Source https://github.com/NVIDIA/TensorRT-LLM/blob/a761585d9c15b4c1249aaf65a8f90764efa83a3c/examples/llm-api/llm_multilora.py.
+Source https://github.com/NVIDIA/TensorRT-LLM/blob/e4c707845ff58fcc0b1d87afb4dd0e64885c780a/examples/llm-api/llm_multilora.py.
.. literalinclude:: ../../../examples/llm-api/llm_multilora.py
:lines: 4-89
diff --git a/latest/_sources/examples/llm_runtime.rst.txt b/latest/_sources/examples/llm_runtime.rst.txt
index 8780627a51..c7405bcbe5 100644
--- a/latest/_sources/examples/llm_runtime.rst.txt
+++ b/latest/_sources/examples/llm_runtime.rst.txt
@@ -1,8 +1,8 @@
Runtime Configuration Examples
==============================
-Source https://github.com/NVIDIA/TensorRT-LLM/blob/a761585d9c15b4c1249aaf65a8f90764efa83a3c/examples/llm-api/llm_runtime.py.
+Source https://github.com/NVIDIA/TensorRT-LLM/blob/e4c707845ff58fcc0b1d87afb4dd0e64885c780a/examples/llm-api/llm_runtime.py.
.. literalinclude:: ../../../examples/llm-api/llm_runtime.py
- :lines: 4-96
+ :lines: 4-144
:language: python
:linenos:
diff --git a/latest/_sources/examples/llm_sampling.rst.txt b/latest/_sources/examples/llm_sampling.rst.txt
index e45fa3aa5b..bc4c60a7ce 100644
--- a/latest/_sources/examples/llm_sampling.rst.txt
+++ b/latest/_sources/examples/llm_sampling.rst.txt
@@ -1,8 +1,8 @@
Sampling Techniques Showcase
============================
-Source https://github.com/NVIDIA/TensorRT-LLM/blob/a761585d9c15b4c1249aaf65a8f90764efa83a3c/examples/llm-api/llm_sampling.py.
+Source https://github.com/NVIDIA/TensorRT-LLM/blob/e4c707845ff58fcc0b1d87afb4dd0e64885c780a/examples/llm-api/llm_sampling.py.
.. literalinclude:: ../../../examples/llm-api/llm_sampling.py
- :lines: 4-229
+ :lines: 4-248
:language: python
:linenos:
diff --git a/latest/_sources/examples/llm_sparse_attention.rst.txt b/latest/_sources/examples/llm_sparse_attention.rst.txt
index 140b5bb971..1c398bb1f0 100644
--- a/latest/_sources/examples/llm_sparse_attention.rst.txt
+++ b/latest/_sources/examples/llm_sparse_attention.rst.txt
@@ -1,8 +1,8 @@
Sparse Attention
================
-Source https://github.com/NVIDIA/TensorRT-LLM/blob/a761585d9c15b4c1249aaf65a8f90764efa83a3c/examples/llm-api/llm_sparse_attention.py.
+Source https://github.com/NVIDIA/TensorRT-LLM/blob/e4c707845ff58fcc0b1d87afb4dd0e64885c780a/examples/llm-api/llm_sparse_attention.py.
.. literalinclude:: ../../../examples/llm-api/llm_sparse_attention.py
- :lines: 4-209
+ :lines: 4-229
:language: python
:linenos:
diff --git a/latest/_sources/examples/llm_speculative_decoding.rst.txt b/latest/_sources/examples/llm_speculative_decoding.rst.txt
index b813ec1c2d..689d6af530 100644
--- a/latest/_sources/examples/llm_speculative_decoding.rst.txt
+++ b/latest/_sources/examples/llm_speculative_decoding.rst.txt
@@ -1,6 +1,6 @@
Speculative Decoding
====================
-Source https://github.com/NVIDIA/TensorRT-LLM/blob/a761585d9c15b4c1249aaf65a8f90764efa83a3c/examples/llm-api/llm_speculative_decoding.py.
+Source https://github.com/NVIDIA/TensorRT-LLM/blob/e4c707845ff58fcc0b1d87afb4dd0e64885c780a/examples/llm-api/llm_speculative_decoding.py.
.. literalinclude:: ../../../examples/llm-api/llm_speculative_decoding.py
:lines: 4-95
diff --git a/latest/_sources/examples/openai_chat_client.rst.txt b/latest/_sources/examples/openai_chat_client.rst.txt
index 0ca4755e82..29cf974ab0 100644
--- a/latest/_sources/examples/openai_chat_client.rst.txt
+++ b/latest/_sources/examples/openai_chat_client.rst.txt
@@ -2,7 +2,7 @@ OpenAI Chat Client
==================
Refer to the `trtllm-serve documentation `_ for starting a server.
-Source https://github.com/NVIDIA/TensorRT-LLM/blob/a761585d9c15b4c1249aaf65a8f90764efa83a3c/examples/serve/openai_chat_client.py.
+Source https://github.com/NVIDIA/TensorRT-LLM/blob/e4c707845ff58fcc0b1d87afb4dd0e64885c780a/examples/serve/openai_chat_client.py.
.. literalinclude:: ../../../examples/serve/openai_chat_client.py
:lines: 2-21
diff --git a/latest/_sources/examples/openai_chat_client_for_multimodal.rst.txt b/latest/_sources/examples/openai_chat_client_for_multimodal.rst.txt
index af141494bc..b3fb0a07bc 100644
--- a/latest/_sources/examples/openai_chat_client_for_multimodal.rst.txt
+++ b/latest/_sources/examples/openai_chat_client_for_multimodal.rst.txt
@@ -2,7 +2,7 @@ OpenAI Chat Client for Multimodal
=================================
Refer to the `trtllm-serve documentation `_ for starting a server.
-Source https://github.com/NVIDIA/TensorRT-LLM/blob/a761585d9c15b4c1249aaf65a8f90764efa83a3c/examples/serve/openai_chat_client_for_multimodal.py.
+Source https://github.com/NVIDIA/TensorRT-LLM/blob/e4c707845ff58fcc0b1d87afb4dd0e64885c780a/examples/serve/openai_chat_client_for_multimodal.py.
.. literalinclude:: ../../../examples/serve/openai_chat_client_for_multimodal.py
:lines: 2-129
diff --git a/latest/_sources/examples/openai_completion_client.rst.txt b/latest/_sources/examples/openai_completion_client.rst.txt
index 4a5d96ac94..7b60afc04d 100644
--- a/latest/_sources/examples/openai_completion_client.rst.txt
+++ b/latest/_sources/examples/openai_completion_client.rst.txt
@@ -2,7 +2,7 @@ OpenAI Completion Client
========================
Refer to the `trtllm-serve documentation `_ for starting a server.
-Source https://github.com/NVIDIA/TensorRT-LLM/blob/a761585d9c15b4c1249aaf65a8f90764efa83a3c/examples/serve/openai_completion_client.py.
+Source https://github.com/NVIDIA/TensorRT-LLM/blob/e4c707845ff58fcc0b1d87afb4dd0e64885c780a/examples/serve/openai_completion_client.py.
.. literalinclude:: ../../../examples/serve/openai_completion_client.py
:lines: 2-15
diff --git a/latest/_sources/examples/openai_completion_client_for_lora.rst.txt b/latest/_sources/examples/openai_completion_client_for_lora.rst.txt
index 0439ec1f47..4eabf04fea 100644
--- a/latest/_sources/examples/openai_completion_client_for_lora.rst.txt
+++ b/latest/_sources/examples/openai_completion_client_for_lora.rst.txt
@@ -2,7 +2,7 @@ Openai Completion Client For Lora
=================================
Refer to the `trtllm-serve documentation `_ for starting a server.
-Source https://github.com/NVIDIA/TensorRT-LLM/blob/a761585d9c15b4c1249aaf65a8f90764efa83a3c/examples/serve/openai_completion_client_for_lora.py.
+Source https://github.com/NVIDIA/TensorRT-LLM/blob/e4c707845ff58fcc0b1d87afb4dd0e64885c780a/examples/serve/openai_completion_client_for_lora.py.
.. literalinclude:: ../../../examples/serve/openai_completion_client_for_lora.py
:lines: 1-30
diff --git a/latest/_sources/examples/openai_completion_client_json_schema.rst.txt b/latest/_sources/examples/openai_completion_client_json_schema.rst.txt
index 7d17f88423..8ed397f1cd 100644
--- a/latest/_sources/examples/openai_completion_client_json_schema.rst.txt
+++ b/latest/_sources/examples/openai_completion_client_json_schema.rst.txt
@@ -2,7 +2,7 @@ OpenAI Completion Client with JSON Schema
=========================================
Refer to the `trtllm-serve documentation `_ for starting a server.
-Source https://github.com/NVIDIA/TensorRT-LLM/blob/a761585d9c15b4c1249aaf65a8f90764efa83a3c/examples/serve/openai_completion_client_json_schema.py.
+Source https://github.com/NVIDIA/TensorRT-LLM/blob/e4c707845ff58fcc0b1d87afb4dd0e64885c780a/examples/serve/openai_completion_client_json_schema.py.
.. literalinclude:: ../../../examples/serve/openai_completion_client_json_schema.py
:lines: 2-52
diff --git a/latest/_sources/features/auto_deploy/support_matrix.md.txt b/latest/_sources/features/auto_deploy/support_matrix.md.txt
index a41090932f..26c07b308b 100644
--- a/latest/_sources/features/auto_deploy/support_matrix.md.txt
+++ b/latest/_sources/features/auto_deploy/support_matrix.md.txt
@@ -83,6 +83,7 @@ In addition, the following models have been officially validated using the defau
- nvidia/Llama-3_1-Nemotron-Ultra-253B-v1-FP8
- nvidia/Llama-3_3-Nemotron-Super-49B-v1
- nvidia/Mistral-NeMo-Minitron-8B-Base
+- nvidia/Nemotron-Flash-3B-Instruct
- perplexity-ai/r1-1776-distill-llama-70b
diff --git a/latest/_sources/features/checkpoint-loading.md.txt b/latest/_sources/features/checkpoint-loading.md.txt
index 4a37ef7623..41699b4900 100644
--- a/latest/_sources/features/checkpoint-loading.md.txt
+++ b/latest/_sources/features/checkpoint-loading.md.txt
@@ -31,7 +31,7 @@ The `BaseCheckpointLoader` is the central base interface for all checkpoint load
**Key Methods:**
- `load_config(checkpoint_dir, **kwargs)`: Loads and returns a `ModelConfig` object
-- `load_weights(checkpoint_dir, **kwargs)`: Loads and returns a dictionary of weights
+- `load_weights(checkpoint_dir, mapping, **kwargs)`: Loads and returns a dictionary of weights
- `get_initialized_weight_mapper(model, config)`: Returns a runtime initialized weight mapper for the model
- `cleanup()`: Releases resources and cleans up internal state
@@ -63,7 +63,7 @@ Handles the loading of model weights from storage:
from tensorrt_llm._torch.models.checkpoints.base_weight_loader import BaseWeightLoader
class CustomWeightLoader(BaseWeightLoader):
- def load_weights(self, checkpoint_dir: str) -> dict[str, Any]:
+ def load_weights(self, checkpoint_dir: str, mapping: Mapping) -> dict[str, Any]:
# Load weights from your custom format
# Return a dictionary mapping parameter names to tensors
return weights_dict
@@ -186,11 +186,12 @@ from tensorrt_llm._torch.models.modeling_utils import register_checkpoint_weight
@register_checkpoint_weight_loader("CUSTOM_FORMAT")
class CustomWeightLoader(BaseWeightLoader):
- def load_weights(self, checkpoint_dir: str, **kwargs) -> dict[str, Any]:
+ def load_weights(self, checkpoint_dir: str, mapping: Mapping, **kwargs) -> dict[str, Any]:
"""
Load weights from your custom format.
Args:
checkpoint_dir: Directory containing checkpoint files
+ mapping: A mapping object containing the distributed configuration.
**kwargs: Additional loading parameters
Returns:
Dictionary mapping parameter names to tensors
diff --git a/latest/_sources/features/disagg-serving.md.txt b/latest/_sources/features/disagg-serving.md.txt
index cbeea3cc50..ce52b9a3d5 100644
--- a/latest/_sources/features/disagg-serving.md.txt
+++ b/latest/_sources/features/disagg-serving.md.txt
@@ -94,7 +94,7 @@ In the Dynamo workflow, requests are initially processed by pre- and post-proces
Dynamo also includes built-in support for Kubernetes deployment, monitoring, and metrics collection. The development team is actively working on enabling dynamic instance scaling, further enhancing its suitability for production environments.
-For more information on how to use Dynamo with TensorRT-LLM, please refer to [this documentation](https://docs.nvidia.com/dynamo/latest/examples/trtllm.html).
+For more information on how to use Dynamo with TensorRT-LLM, please refer to [this documentation](https://docs.nvidia.com/dynamo/latest/backends/trtllm/README.html).
### trtllm-serve
diff --git a/latest/_sources/features/guided-decoding.md.txt b/latest/_sources/features/guided-decoding.md.txt
new file mode 100644
index 0000000000..110efc8e51
--- /dev/null
+++ b/latest/_sources/features/guided-decoding.md.txt
@@ -0,0 +1,583 @@
+# Guided Decoding
+
+Guided decoding (also known as constrained decoding or structured generation) guarantees that LLM outputs conform to a user-specified grammar (e.g., a JSON schema, a [regular expression](https://en.wikipedia.org/wiki/Regular_expression), or an [EBNF](https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form) grammar).
+
+TensorRT LLM supports two grammar backends:
+* [XGrammar](https://github.com/mlc-ai/xgrammar/blob/v0.1.21/python/xgrammar/matcher.py#L341-L350): Supports JSON schema, regular expression, EBNF and [structural tag](https://xgrammar.mlc.ai/docs/tutorials/structural_tag.html).
+* [LLGuidance](https://github.com/guidance-ai/llguidance/blob/v1.1.1/python/llguidance/_lib.pyi#L363-L366): Supports JSON schema, regular expression, EBNF.
+
+
+## Online API: `trtllm-serve`
+
+If you are using `trtllm-serve`, enable guided decoding by specifying `guided_decoding_backend` with `xgrammar` or `llguidance` in the YAML configuration file, and pass it to `--extra_llm_api_options`. For example,
+
+```bash
+cat > extra_llm_api_options.yaml <<EOF
+guided_decoding_backend: xgrammar
+EOF
+
+trtllm-serve nvidia/Llama-3.1-8B-Instruct-FP8 --extra_llm_api_options extra_llm_api_options.yaml
+```
+
+Once the server is up, send chat completion requests whose `response_format` encodes the desired constraint. The example below performs function calling constrained by a structural tag; it assumes an OpenAI client pointed at the server (adjust the base URL and API key to your deployment) and reuses the tool definitions and system prompt from the offline structural tag example later on this page.
+
+```python
+from openai import OpenAI
+
+client = OpenAI(base_url="http://localhost:8000/v1", api_key="tensorrt_llm")
+
+# tool_get_current_weather, tool_get_current_date and system_prompt are identical to the
+# definitions shown in the offline structural tag example later on this page.
+user_prompt = "You are in New York. Please get the current date and time, and the weather."
+
+messages = [
+ {
+ "role": "system",
+ "content": system_prompt,
+ },
+ {
+ "role": "user",
+ "content": user_prompt,
+ },
+]
+
+chat_completion = client.chat.completions.create(
+ model="nvidia/Llama-3.1-8B-Instruct-FP8",
+ messages=messages,
+ max_completion_tokens=256,
+ response_format={
+ "type": "structural_tag",
+ "format": {
+ "type": "triggered_tags",
+            "triggers": ["<function="],
+            "tags": [
+                {
+                    "begin": "<function=get_current_weather>",
+                    "content": {
+                        "type": "json_schema",
+                        "json_schema": tool_get_current_weather["function"]["parameters"]
+                    },
+                    "end": "</function>",
+                },
+                {
+                    "begin": "<function=get_current_date>",
+                    "content": {
+                        "type": "json_schema",
+                        "json_schema": tool_get_current_date["function"]["parameters"]
+                    },
+                    "end": "</function>",
+ },
+ ],
+ },
+ },
+)
+
+message = chat_completion.choices[0].message
+print(message.content)
+```
+
+The output would look like:
+```txt
+<function=get_current_date>{"timezone": "America/New_York"}</function>
+<function=get_current_weather>{"city": "New York", "state": "NY", "unit": "fahrenheit"}</function>
+```
+
+
+## Offline API: LLM API
+
+If you are using LLM API, enable guided decoding by specifying `guided_decoding_backend` with `xgrammar` or `llguidance` when creating the LLM instance. For example,
+
+```python
+from tensorrt_llm import LLM
+
+llm = LLM("nvidia/Llama-3.1-8B-Instruct-FP8", guided_decoding_backend="xgrammar")
+```
+
+### JSON Schema
+
+Create a `GuidedDecodingParams` with the `json` field set to a JSON schema, use it to create `SamplingParams`, and then pass it to `llm.generate` or `llm.generate_async`. Alternatively, the JSON schema can be generated with [pydantic](https://docs.pydantic.dev/latest/), as sketched after the example output below.
+
+```python
+from tensorrt_llm import LLM
+from tensorrt_llm.sampling_params import SamplingParams, GuidedDecodingParams
+
+if __name__ == "__main__":
+ llm = LLM("nvidia/Llama-3.1-8B-Instruct-FP8", guided_decoding_backend="xgrammar")
+
+ json_schema = {
+ "type": "object",
+ "properties": {
+ "name": {
+ "type": "string",
+ "pattern": "^[\\w]+$"
+ },
+ "population": {
+ "type": "integer"
+ },
+ },
+ "required": ["name", "population"],
+ }
+ messages = [
+ {
+ "role": "system",
+ "content": "You are a helpful assistant.",
+ },
+ {
+ "role": "user",
+ "content": "Give me the information of the capital of France in the JSON format.",
+ },
+ ]
+ prompt = llm.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+
+ output = llm.generate(
+ prompt,
+ sampling_params=SamplingParams(max_tokens=256, guided_decoding=GuidedDecodingParams(json=json_schema)),
+ )
+ print(output.outputs[0].text)
+```
+
+The output would look like:
+```txt
+{
+ "name": "Paris",
+ "population": 2145206
+}
+```
+
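+As an alternative to writing the schema by hand, the same JSON schema can be produced with pydantic. This is a sketch assuming pydantic v2; the `CityInfo` model name is illustrative.
+
+```python
+from pydantic import BaseModel, Field
+
+
+class CityInfo(BaseModel):
+    name: str = Field(pattern=r"^[\w]+$")
+    population: int
+
+
+json_schema = CityInfo.model_json_schema()
+# Pass json_schema to GuidedDecodingParams(json=json_schema) exactly as in the example above.
+```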
+
+### Regular expression
+
+Create a `GuidedDecodingParams` with the `regex` field set to a regular expression, use it to create `SamplingParams`, and then pass it to `llm.generate` or `llm.generate_async`.
+
+```python
+from tensorrt_llm import LLM
+from tensorrt_llm.sampling_params import SamplingParams, GuidedDecodingParams
+
+if __name__ == "__main__":
+ llm = LLM("nvidia/Llama-3.1-8B-Instruct-FP8", guided_decoding_backend="xgrammar")
+
+ messages = [
+ {
+ "role": "system",
+ "content": "You are a helpful assistant.",
+ },
+ {
+ "role": "user",
+ "content": "What is the capital of France?",
+ },
+ ]
+ prompt = llm.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+
+ output = llm.generate(
+ prompt,
+ sampling_params=SamplingParams(max_tokens=256, guided_decoding=GuidedDecodingParams(regex="(Paris|London)")),
+ )
+ print(output.outputs[0].text)
+```
+
+The output would look like:
+```txt
+Paris
+```
+
+### EBNF grammar
+
+Create a `GuidedDecodingParams` with the `grammar` field set to an EBNF grammar, use it to create `SamplingParams`, and then pass it to `llm.generate` or `llm.generate_async`.
+
+```python
+from tensorrt_llm import LLM
+from tensorrt_llm.sampling_params import SamplingParams, GuidedDecodingParams
+
+if __name__ == "__main__":
+ llm = LLM("nvidia/Llama-3.1-8B-Instruct-FP8", guided_decoding_backend="xgrammar")
+
+ ebnf_grammar = """root ::= description
+city ::= "London" | "Paris" | "Berlin" | "Rome"
+description ::= city " is " status
+status ::= "the capital of " country
+country ::= "England" | "France" | "Germany" | "Italy"
+"""
+ messages = [
+ {
+ "role": "system",
+ "content": "You are a helpful geography bot."
+ },
+ {
+ "role": "user",
+ "content": "Give me the information of the capital of France.",
+ },
+ ]
+ prompt = llm.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+
+ output = llm.generate(
+ prompt,
+ sampling_params=SamplingParams(max_tokens=256, guided_decoding=GuidedDecodingParams(grammar=ebnf_grammar)),
+ )
+ print(output.outputs[0].text)
+```
+
+The output would look like:
+```txt
+Paris is the capital of France
+```
+
+### Structural tag
+
+Create a `GuidedDecodingParams` with the `structural_tag` field set to a structural tag string, use it to create `SamplingParams`, and then pass it to `llm.generate` or `llm.generate_async`.
+
+Structural tags are supported by the `xgrammar` backend only. They are a powerful and flexible way to express constraints on LLM outputs. See [structural tag usage](https://xgrammar.mlc.ai/docs/tutorials/structural_tag.html) for a comprehensive tutorial. Below is an example of function calling with a customized function-call format for `Llama-3.1-8B-Instruct`.
+
+```python
+import json
+from tensorrt_llm import LLM
+from tensorrt_llm.sampling_params import SamplingParams, GuidedDecodingParams
+
+if __name__ == "__main__":
+ llm = LLM("nvidia/Llama-3.1-8B-Instruct-FP8", guided_decoding_backend="xgrammar")
+
+ tool_get_current_weather = {
+ "type": "function",
+ "function": {
+ "name": "get_current_weather",
+ "description": "Get the current weather in a given location",
+ "parameters": {
+ "type": "object",
+ "properties": {
+ "city": {
+ "type": "string",
+ "description": "The city to find the weather for, e.g. 'San Francisco'",
+ },
+ "state": {
+ "type": "string",
+ "description": "the two-letter abbreviation for the state that the city is in, e.g. 'CA' which would mean 'California'",
+ },
+ "unit": {
+ "type": "string",
+ "description": "The unit to fetch the temperature in",
+ "enum": ["celsius", "fahrenheit"],
+ },
+ },
+ "required": ["city", "state", "unit"],
+ },
+ },
+ }
+
+ tool_get_current_date = {
+ "type": "function",
+ "function": {
+ "name": "get_current_date",
+ "description": "Get the current date and time for a given timezone",
+ "parameters": {
+ "type": "object",
+ "properties": {
+ "timezone": {
+ "type": "string",
+ "description": "The timezone to fetch the current date and time for, e.g. 'America/New_York'",
+ }
+ },
+ "required": ["timezone"],
+ },
+ },
+ }
+
+ system_prompt = f"""# Tool Instructions
+- Always execute python code in messages that you share.
+- When looking for real time information use relevant functions if available else fallback to brave_search
+You have access to the following functions:
+Use the function 'get_current_weather' to: Get the current weather in a given location
+{tool_get_current_weather["function"]}
+Use the function 'get_current_date' to: Get the current date and time for a given timezone
+{tool_get_current_date["function"]}
+If a you choose to call a function ONLY reply in the following format:
+<{{start_tag}}={{function_name}}>{{parameters}}{{end_tag}}
+where
+start_tag => `<function`
+parameters => a JSON dict with the function argument name as key and function argument value as value.
+end_tag => `</function>`
+Here is an example,
+<function=example_function_name>{{"example_name": "example_value"}}</function>
+Reminder:
+- Function calls MUST follow the specified format
+- Required parameters MUST be specified
+- Only call one function at a time
+- Put the entire function call reply on one line
+- Always add your sources when using search results to answer the user query
+You are a helpful assistant."""
+ user_prompt = "You are in New York. Please get the current date and time, and the weather."
+ structural_tag = {
+ "type": "structural_tag",
+ "format": {
+ "type": "triggered_tags",
+            "triggers": ["<function="],
+            "tags": [
+                {
+                    "begin": "<function=get_current_weather>",
+                    "content": {
+                        "type": "json_schema",
+                        "json_schema": tool_get_current_weather["function"]["parameters"]
+                    },
+                    "end": "</function>",
+                },
+                {
+                    "begin": "<function=get_current_date>",
+                    "content": {
+                        "type": "json_schema",
+                        "json_schema": tool_get_current_date["function"]["parameters"]
+                    },
+                    "end": "</function>",
+ },
+ ],
+ },
+ }
+
+ messages = [
+ {
+ "role": "system",
+ "content": system_prompt,
+ },
+ {
+ "role": "user",
+ "content": user_prompt,
+ },
+ ]
+ prompt = llm.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+
+ output = llm.generate(
+ prompt,
+ sampling_params=SamplingParams(max_tokens=256, guided_decoding=GuidedDecodingParams(structural_tag=json.dumps(structural_tag))),
+ )
+ print(output.outputs[0].text)
+```
+
+The output would look like:
+```txt
+<function=get_current_date>{"timezone": "America/New_York"}</function>
+<function=get_current_weather>{"city": "New York", "state": "NY", "unit": "fahrenheit"}</function>
+```
diff --git a/latest/_sources/features/helix.md.txt b/latest/_sources/features/helix.md.txt
new file mode 100644
index 0000000000..c09bfc2537
--- /dev/null
+++ b/latest/_sources/features/helix.md.txt
@@ -0,0 +1,82 @@
+# Helix Parallelism
+
+Helix is a context parallelism (CP) technique for the decode/generation phase of LLM inference. Unlike traditional attention-FFN disaggregation (AFD) techniques, which spatially separate attention and FFN blocks onto different GPUs, Helix temporally separates them by reconfiguring the same GPUs.
+
+For all details, see the original paper:
+[Helix Parallelism: Rethinking Sharding Strategies for
+Interactive Multi-Million-Token LLM Decoding](https://arxiv.org/pdf/2507.07120)
+
+## How Helix Works
+
+In Helix parallelism:
+
+- **KV cache distribution**: The KV cache is partitioned across CP ranks during generation, with each rank responsible for a portion of the cached context
+- **Attention computation**: Each rank computes partial attention over its local KV cache shard
+- **Attention postprocessing**: Partial results are combined and corrected across ranks to produce the final attention output (a sketch of this combination step follows this list)
+- **FFN layers**: CP ranks are repurposed as tensor parallelism (TP) ranks for FFN/MoE layers, maximizing GPU utilization
+
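+Concretely, suppose each CP rank holds a disjoint shard of the KV cache and produces a partial attention output `o_i` for a query, together with the log-sum-exp `l_i` of its local attention scores. A standard way to combine the shards is the online-softmax correction shown below; this is a sketch of the math only, and the exact kernel used by Helix may differ.
+
+$$
+o = \frac{\sum_i e^{l_i - l_{\max}}\, o_i}{\sum_i e^{l_i - l_{\max}}},
+\qquad l_{\max} = \max_i l_i
+$$
+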
+## When to Use Helix
+
+Helix parallelism provides performance benefits when **all** of the following conditions apply:
+
+1. **Disaggregated serving**: Helix is designed for generation servers in a disaggregated (prefill/decode split) deployment architecture
+2. **Long input sequences**: Performance gains typically appear with input sequence lengths **>64K tokens** or more
+3. **Low batch sizes**: Optimal for latency-sensitive workloads with high tokens/second/user requirements
+
+On a typical latency vs. throughput Pareto curve, Helix targets operating points toward the right side (low latency, high per-user throughput).
+
+## Supported Models
+
+Helix parallelism currently supports models using **Multi-head Latent Attention (MLA)** on Blackwell GPU architecture:
+
+- DeepSeek-V3 / DeepSeek-V3-Lite
+
+## Configuration
+
+### Configuration Parameters
+
+Set the following parameters for the generation servers in disaggregated mode. An example can be seen in the end-to-end accuracy test mentioned below.
+
+| Parameter | Description | Required |
+|-----------|-------------|----------|
+| `context_parallel_size` | Number of GPUs for context parallelism (≥2 for Helix) | Yes |
+| `cp_config.cp_type` | Must be `"HELIX"` or `CpType.HELIX` | Yes |
+| `cp_config.tokens_per_block` | Tokens per KV cache block | Yes |
+| `kv_cache_config.tokens_per_block` | Must match `cp_config.tokens_per_block` | Yes |
+
+### JSON Configuration (for YAML/JSON configs)
+
+```json
+{
+ "context_parallel_size": 2,
+ "cp_config": {
+ "cp_type": "HELIX",
+ "tokens_per_block": 32
+ },
+ "kv_cache_config": {
+ "tokens_per_block": 32
+ }
+}
+```
+
+## Testing Helix with TensorRT-LLM
+
+### Unit Test: MLA Module Correctness
+
+The simplest correctness test validates the [MLA attention module](../../../tensorrt_llm/_torch/modules/attention.py) with Helix enabled:
+
+```bash
+# Run the MLA Helix unit test
+pytest tests/unittest/_torch/modules/test_mla_helix.py -v
+```
+
+This test verifies that attention outputs match between single-GPU and Helix-parallelized execution.
+
+### End-to-End Accuracy Test
+
+For end-to-end validation, the accuracy benchmark evaluates DeepSeek-V3-Lite in disaggregated mode on MMLU and GSM8K benchmarks:
+
+Test location: `tests/integration/defs/accuracy/test_disaggregated_serving.py`
+Test name: `TestDeepSeekV3Lite::test_auto_dtype_with_helix`
+
+This test demonstrates proper disaggregated server configuration with Helix.
diff --git a/latest/_sources/features/kv-cache-connector.md.txt b/latest/_sources/features/kv-cache-connector.md.txt
new file mode 100644
index 0000000000..743c9282d6
--- /dev/null
+++ b/latest/_sources/features/kv-cache-connector.md.txt
@@ -0,0 +1,113 @@
+# KV Cache Connector
+
+The KV Cache Connector is a flexible interface in TensorRT-LLM that enables remote or external access to the Key-Value (KV) cache. It allows developers to implement custom logic for loading, saving, and managing KV cache blocks, extending the capabilities of the standard KV cache manager.
+
+This document explains the KV Cache Connector architecture, common use cases, and provides a detailed walkthrough of the included example.
+
+## Use Cases
+
+The KV Cache Connector is designed to support a variety of advanced serving scenarios:
+
+1. **KV Cache Offloading**: Move KV cache blocks from GPU memory to cheaper/larger storage (CPU RAM, NVMe SSD, or network storage) when they are not immediately needed, and reload them when required.
+2. **Custom Disaggregated Serving**: Separate the prefill (context processing) and decode (token generation) phases onto different instances or machines. The connector can be used to transmit the KV cache generated during prefill to the decode instances.
+3. **KV Cache Sharing / P2P Transfer**: Share KV cache states between different model instances or across peer-to-peer connections.
+
+## Architecture
+
+The connector architecture is split into two main components:
+
+* **Scheduler (Leader)**: Responsible for orchestration. It decides *what* needs to be loaded or saved and builds metadata instructions. It runs only on the leader rank (rank 0).
+* **Worker**: Responsible for execution. It receives metadata from the scheduler and performs the actual data transfers (loading/saving) on the KV cache tensors. It runs on all ranks.
+
+### API Reference
+
+To implement a custom connector, you must subclass `KvCacheConnectorScheduler` and `KvCacheConnectorWorker`; a minimal no-op skeleton is sketched after the interface descriptions below.
+
+#### 1. Scheduler (Leader) Interface (`KvCacheConnectorScheduler`)
+
+These methods run on the leader process and drive the connector's behavior.
+
+* **`build_connector_meta(self, scheduler_output: SchedulerOutput) -> object`**
+ * **Description**: The core orchestration method. Called during the scheduling phase. It examines the current requests and decides which blocks need to be loaded from or saved to the external store.
+ * **Arguments**: `scheduler_output` contains information about new requests, blocks allocated, and current request states.
+ * **Returns**: An arbitrary metadata object (picklable) that describes the tasks for the workers. This object is broadcasted to all workers.
+
+* **`get_num_new_matched_tokens(self, request: LlmRequest, num_computed_tokens: int) -> tuple[int, bool]`**
+ * **Description**: Called when a new request arrives. It checks to see if any KV cache can be loaded from an external KV store.
+ * **Returns**: A tuple `(num_tokens, is_async)`. `num_tokens` is the number of tokens found in the external cache. `is_async` indicates if the loading will happen asynchronously (background) or requires blocking.
+
+* **`request_finished(self, request: LlmRequest, cache_block_ids: list[int]) -> bool`**
+ * **Description**: Called when a request completes generation.
+ * **Returns**: A boolean indicating if an asynchronous save operation is underway. If `True`, the system waits for the operation to complete before releasing the KV cache blocks.
+
+* **`update_state_after_alloc(self, request: LlmRequest, block_ids: list[int])`**
+ * **Description**: A callback to update internal state after KV cache blocks have been allocated for the prefill.
+
+#### 2. Worker Interface (`KvCacheConnectorWorker`)
+
+These methods run on all workers (GPU processes) and interact with the actual GPU data.
+
+* **`register_kv_caches(self, kv_cache_tensor: torch.Tensor)`**
+ * **Description**: Called at initialization. Provides the worker with the GPU KV cache tensors.
+ * **Arguments**: `kv_cache_tensor` is the underlying storage tensor for the KV cache.
+
+* **`start_load_kv(self, stream: torch.cuda.Stream)`**
+ * **Description**: Initiates the loading of KV blocks from the external source into the GPU memory.
+ * **Arguments**: `stream` is the CUDA stream on which the forward pass is executed.
+
+* **`wait_for_layer_load(self, layer_idx: int, stream: torch.cuda.Stream)`**
+ * **Description**: A synchronization point. Ensures that the KV cache for a specific layer is fully loaded before the model attempts to perform the forward pass on that layer.
+
+* **`save_kv_layer(self, layer_idx: int, stream: torch.cuda.Stream)`**
+ * **Description**: Triggers the saving of a specific layer's KV cache.
+
+* **`wait_for_save(self, stream: torch.cuda.Stream)`**
+ * **Description**: A synchronization point to ensure all save operations are enqueued or completed.
+
+* **`get_finished(self, finished_gen_req_ids, started_loading_req_ids) -> tuple[list[int], list[int]]`**
+ * **Description**: Polled by the runtime to check the status of asynchronous operations.
+ * **Returns**: Two lists of request IDs: those that have finished saving, and those that have finished loading.
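+
+The following is a minimal no-op sketch of the two interfaces described above, intended only to illustrate the method shapes. The class names `NoOpConnectorScheduler` and `NoOpConnectorWorker` are illustrative, and the import path follows the example shipped with TensorRT-LLM:
+
+```python
+import torch
+
+from tensorrt_llm._torch.pyexecutor.kv_cache_connector import (
+    KvCacheConnectorScheduler, KvCacheConnectorWorker, SchedulerOutput)
+
+
+class NoOpConnectorScheduler(KvCacheConnectorScheduler):
+    """Leader-side orchestration: decides what to load or save."""
+
+    def build_connector_meta(self, scheduler_output: SchedulerOutput) -> object:
+        # Nothing to load or save, so send empty metadata to the workers.
+        return None
+
+    def get_num_new_matched_tokens(self, request, num_computed_tokens):
+        # No externally cached tokens found; no asynchronous loading needed.
+        return 0, False
+
+    def request_finished(self, request, cache_block_ids):
+        # No asynchronous save in flight; blocks can be released immediately.
+        return False
+
+    def update_state_after_alloc(self, request, block_ids):
+        pass
+
+
+class NoOpConnectorWorker(KvCacheConnectorWorker):
+    """Worker-side execution: moves data in and out of the KV cache tensors."""
+
+    def register_kv_caches(self, kv_cache_tensor: torch.Tensor):
+        # Keep a handle to the GPU KV cache storage for later transfers.
+        self.kv_cache_tensor = kv_cache_tensor
+
+    def start_load_kv(self, stream: torch.cuda.Stream):
+        pass
+
+    def wait_for_layer_load(self, layer_idx: int, stream: torch.cuda.Stream):
+        pass
+
+    def save_kv_layer(self, layer_idx: int, stream: torch.cuda.Stream):
+        pass
+
+    def wait_for_save(self, stream: torch.cuda.Stream):
+        pass
+
+    def get_finished(self, finished_gen_req_ids, started_loading_req_ids):
+        # No asynchronous operations, so nothing ever finishes late.
+        return [], []
+```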
+
+## Example Implementation
+
+The file `examples/llm-api/llm_kv_cache_connector.py` provides a reference implementation of a **Persistent KV Cache**.
+
+### Overview
+
+This example implements a file-system based KV cache:
+1. **Save**: When a request finishes or needs to be swapped out, its KV blocks are saved to disk as `.pt` files.
+2. **Load**: When a new request arrives with the same prompt prefix, the connector identifies the cached files and loads them back into GPU memory, skipping re-computation.
+
+### Implementation Details
+
+* **Metadata**: The example defines a `PersistentKvCacheConnectorMetadata` dataclass containing lists of `(file_path, block_id)` tuples for both loading and saving. This simple structure allows the Scheduler to tell the Worker exactly which file corresponds to which GPU block index.
+
+* **Hashing Strategy**: The `PersistentKvCacheConnectorLeader` hashes the token sequence of a block to generate a unique filename (e.g., `hash_value.pt`). This acts as the lookup key.
+
+* **Worker Logic**:
+ * `start_load_kv`: Iterates through the load list provided in the metadata, loads the `.pt` file to CPU, and copies it to the specific `block_id` in the GPU tensor.
+ * `wait_for_save`: Performs the reverse. It copies data from the GPU `block_id` to CPU and saves it to disk using `torch.save`.
+
+### Limitations & Patterns
+
+This example illustrates the API mechanics but has several limitations that make it unsuitable for high-performance production use without modification:
+
+1. **Blocking I/O**: The example uses `torch.load` and `torch.save` synchronously. In a real implementation, these should be offloaded to a background thread or asynchronous I/O handler to avoid stalling the GPU.
+2. **Simplified Block Matching**: The `get_num_new_matched_tokens` implementation in the example only matches full blocks. It does not handle partial cache hits.
+3. **FileSystem Latency**: Storing one file per block can create high filesystem overhead.
+
+### Usage
+
+To run the example:
+
+```bash
+python examples/llm-api/llm_kv_cache_connector.py
+```
+
+The script demonstrates:
+
+1. Generating text for a prompt (First run).
+2. Destroying the LLM instance.
+3. Creating a new LLM instance with the same connector config.
+4. Generating text for the same prompt (Second run).
+5. Asserting that the outputs match, proving the state was correctly restored from the disk cache.
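+
+For reference, the example wires the connector into the LLM API through `KvCacheConnectorConfig`. The snippet below is a sketch based on the example script; the module name and model name are placeholders:
+
+```python
+from tensorrt_llm import LLM
+from tensorrt_llm.llmapi.llm_args import KvCacheConnectorConfig
+
+# Point the config at the module and class names of your connector implementation.
+kv_connector_config = KvCacheConnectorConfig(
+    connector_module="llm_kv_cache_connector",
+    connector_scheduler_class="PersistentKvCacheConnectorLeader",
+    connector_worker_class="PersistentKvCacheConnectorWorker",
+)
+
+llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
+          backend="pytorch",
+          cuda_graph_config=None,  # the shipped example also disables CUDA graphs
+          kv_connector_config=kv_connector_config)
+```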
diff --git a/latest/_sources/features/parallel-strategy.md.txt b/latest/_sources/features/parallel-strategy.md.txt
index b528c639d7..64b2b051be 100644
--- a/latest/_sources/features/parallel-strategy.md.txt
+++ b/latest/_sources/features/parallel-strategy.md.txt
@@ -80,6 +80,8 @@ enable_attention_dp: true
EOF
```
+Then pass the YAML file to `trtllm-serve` or `trtllm-bench` via `--extra_llm_api_options parallel_config.yaml`.
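+
+For example, with a placeholder model name:
+
+```bash
+trtllm-serve <model> --extra_llm_api_options parallel_config.yaml
+```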
+
### FFN Module
#### Dense Models
diff --git a/latest/_sources/features/sampling.md.txt b/latest/_sources/features/sampling.md.txt
index e0a44c67d3..bac0cf355e 100644
--- a/latest/_sources/features/sampling.md.txt
+++ b/latest/_sources/features/sampling.md.txt
@@ -1,5 +1,5 @@
# Sampling
-The PyTorch backend supports most of the sampling features that are supported on the C++ backend, such as temperature, top-k and top-p sampling, beam search, stop words, bad words, penalty, context and generation logits, log probability, guided decoding and logits processors
+The PyTorch backend supports most of the sampling features that are supported on the C++ backend, such as temperature, top-k and top-p sampling, beam search, stop words, bad words, penalty, context and generation logits, log probability, and logits processors.
## General usage
@@ -60,42 +60,6 @@ llm.generate(["Hello, my name is",
"Hello, my name is"], sampling_params)
```
-## Guided decoding
-
-Guided decoding controls the generation outputs to conform to pre-defined structured formats, ensuring outputs follow specific schemas or patterns.
-
-The PyTorch backend supports guided decoding with the XGrammar and Low-level Guidance (llguidance) backends and the following formats:
-- JSON schema
-- JSON object
-- Regular expressions
-- Extended Backus-Naur form (EBNF) grammar
-- Structural tags
-
-To enable guided decoding, you must:
-
-1. Set the `guided_decoding_backend` parameter to `'xgrammar'` or `'llguidance'` in the `LLM` class
-2. Create a [`GuidedDecodingParams`](source:tensorrt_llm/sampling_params.py#L14) object with the desired format specification
- * Note: Depending on the type of format, a different parameter needs to be chosen to construct the object (`json`, `regex`, `grammar`, `structural_tag`).
-3. Pass the `GuidedDecodingParams` object to the `guided_decoding` parameter of the `SamplingParams` object
-
-The following example demonstrates guided decoding with a JSON schema:
-
-```python
-from tensorrt_llm import LLM, SamplingParams
-from tensorrt_llm.llmapi import GuidedDecodingParams
-
-llm = LLM(model='nvidia/Llama-3.1-8B-Instruct-FP8',
- guided_decoding_backend='xgrammar')
-structure = '{"title": "Example JSON", "type": "object", "properties": {...}}'
-guided_decoding_params = GuidedDecodingParams(json=structure)
-sampling_params = SamplingParams(
- guided_decoding=guided_decoding_params,
- )
-llm.generate("Generate a JSON response", sampling_params)
-```
-
-You can find a more detailed example on guided decoding [here](source:examples/llm-api/llm_guided_decoding.py).
-
## Logits processor
Logits processors allow you to modify the logits produced by the network before sampling, enabling custom generation behavior and constraints.
diff --git a/latest/_sources/index.rst.txt b/latest/_sources/index.rst.txt
index 58ef3c76df..49c5e1546c 100644
--- a/latest/_sources/index.rst.txt
+++ b/latest/_sources/index.rst.txt
@@ -71,11 +71,15 @@ Welcome to TensorRT LLM's Documentation!
features/quantization.md
features/sampling.md
features/additional-outputs.md
+ features/guided-decoding.md
features/speculative-decoding.md
features/checkpoint-loading.md
features/auto_deploy/auto-deploy.md
features/ray-orchestrator.md
features/torch_compile_and_piecewise_cuda_graph.md
+ features/helix.md
+ features/kv-cache-connector.md
+
.. toctree::
:maxdepth: 2
diff --git a/latest/_sources/installation/linux.md.txt b/latest/_sources/installation/linux.md.txt
index 2aae24e6af..a9704f9cad 100644
--- a/latest/_sources/installation/linux.md.txt
+++ b/latest/_sources/installation/linux.md.txt
@@ -9,8 +9,11 @@
Before the pre-built Python wheel can be installed via `pip`, a few
prerequisites must be put into place:
- Install CUDA Toolkit following the [CUDA Installation Guide for Linux](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/) and
- make sure `CUDA_HOME` environment variable is properly set.
+ Install CUDA Toolkit 13.0 following the [CUDA Installation Guide for Linux](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/)
+ and make sure the `CUDA_HOME` environment variable is properly set.
+
+ The `cuda-compat-13-0` package may be required depending on your system's NVIDIA GPU
+ driver version. For additional information, refer to the [CUDA Forward Compatibility](https://docs.nvidia.com/deploy/cuda-compatibility/forward-compatibility.html) documentation.
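+
+ For a typical toolkit installation in the default location, this can look as follows (the path is an assumption; adjust it to your system):
+
+ ```bash
+ export CUDA_HOME=/usr/local/cuda
+ ```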
```bash
# By default, PyTorch CUDA 12.8 package is installed. Install PyTorch CUDA 13.0 package to align with the CUDA version used for building TensorRT LLM wheels.
diff --git a/latest/_sources/legacy/advanced/kv-cache-reuse.md.txt b/latest/_sources/legacy/advanced/kv-cache-reuse.md.txt
index ee2ccf2581..5f3a5d73cf 100644
--- a/latest/_sources/legacy/advanced/kv-cache-reuse.md.txt
+++ b/latest/_sources/legacy/advanced/kv-cache-reuse.md.txt
@@ -64,7 +64,7 @@ There are a few pitfalls that can prevent kv cache reuse when that seems possibl
Kv cache state for system prompts will remain reusable until memory is needed for launching a new request or propagating an existing one. When this happens, reusable blocks are evicted based on LRU. System prompts that are frequently used have a better chance of remaining reusable, but there is no guarantee since launching new requests take priority over possible reuse. Running with a larger batch size, or larger output sequence lengths for example will reduce the probability of kv cache blocks being reused, since it increases memory needs.
-KV cache state is stored in blocks, each block holds multiple tokens. Only full blocks can be shared by multiple requests, thus the block size matters. The block size is a trade-off, larger block size may improve efficiency of compute kernels, but it reduces the likelihood of kv cache state reuse. The block defaults to 128 tokens, this can be changed when the model is built with the trtllm-build command, for example
+KV cache state is stored in blocks, and each block holds multiple tokens. Only full blocks can be shared by multiple requests, so the block size matters. Partially matched blocks can also be reused, but that creates a new copy of the block for each sequence. The block size is a trade-off: a larger block size may improve the efficiency of compute kernels, but it reduces the likelihood of KV cache state reuse. The block size defaults to 128 tokens; this can be changed when the model is built with the trtllm-build command, for example
```trtllm-build --tokens_per_block 32 ...```
diff --git a/latest/_sources/legacy/reference/multimodal-feature-support-matrix.md.txt b/latest/_sources/legacy/reference/multimodal-feature-support-matrix.md.txt
index d0cf237268..b6d99e24ca 100644
--- a/latest/_sources/legacy/reference/multimodal-feature-support-matrix.md.txt
+++ b/latest/_sources/legacy/reference/multimodal-feature-support-matrix.md.txt
@@ -7,7 +7,7 @@
| VILA | Yes | No | No | No |
| LLaVA-NeXT | Yes | Yes | Yes | Yes |
| Llama 4 | Yes | Yes | No | No |
-| Mistral-Small-3.1 | Yes | Yes | No | No |
-| Phi-4-multimodal | Yes | Yes | No | No |
+| Mistral-Small-3.1 | Yes | Yes | Yes | Yes |
+| Phi-4-multimodal | Yes | Yes | Yes | Yes |
| Qwen2-VL | Yes | Yes | Yes | Yes |
| Qwen2.5-VL | Yes | Yes | Yes | Yes |
diff --git a/latest/_sources/llm-api/reference.rst.txt b/latest/_sources/llm-api/reference.rst.txt
index 5512353146..76a2c9f0e2 100644
--- a/latest/_sources/llm-api/reference.rst.txt
+++ b/latest/_sources/llm-api/reference.rst.txt
@@ -288,7 +288,7 @@ API Reference
:special-members: __init__
:member-order: groupwise
:inherited-members:
- :exclude-members: model_json_schema,parse_raw,update_forward_refs,model_validate,model_fields_set,model_construct,model_rebuild,schema_json,parse_file,model_extra,model_config,model_fields,dict,model_parametrized_name,model_validate_strings,from_orm,copy,model_dump,construct,model_post_init,model_copy,validate,json,model_computed_fields,model_validate_json,model_dump_json,parse_obj,schema
+ :exclude-members: model_parametrized_name,update_forward_refs,model_rebuild,parse_raw,from_orm,model_validate_strings,model_computed_fields,validate,model_post_init,model_copy,dict,schema,parse_obj,json,model_validate_json,copy,model_config,model_dump_json,model_fields,schema_json,construct,model_extra,model_json_schema,model_validate,model_dump,parse_file,model_fields_set,model_construct
.. autoclass:: tensorrt_llm.llmapi.TrtLlmArgs
:members:
@@ -297,7 +297,7 @@ API Reference
:special-members: __init__
:member-order: groupwise
:inherited-members:
- :exclude-members: model_json_schema,parse_raw,update_forward_refs,model_validate,model_fields_set,model_construct,model_rebuild,schema_json,parse_file,model_extra,model_config,model_fields,dict,model_parametrized_name,model_validate_strings,from_orm,copy,model_dump,construct,model_post_init,model_copy,validate,json,model_computed_fields,model_validate_json,model_dump_json,parse_obj,schema
+ :exclude-members: model_parametrized_name,update_forward_refs,model_rebuild,parse_raw,from_orm,model_validate_strings,model_computed_fields,validate,model_post_init,model_copy,dict,schema,parse_obj,json,model_validate_json,copy,model_config,model_dump_json,model_fields,schema_json,construct,model_extra,model_json_schema,model_validate,model_dump,parse_file,model_fields_set,model_construct
.. autoclass:: tensorrt_llm.llmapi.AutoDecodingConfig
:members:
diff --git a/latest/_sources/models/supported-models.md.txt b/latest/_sources/models/supported-models.md.txt
index 749cfcc21d..c6b6194b5d 100644
--- a/latest/_sources/models/supported-models.md.txt
+++ b/latest/_sources/models/supported-models.md.txt
@@ -50,13 +50,13 @@ Note: Support for other models may vary. Features marked "N/A" are not applicabl
| `Gemma3ForConditionalGeneration` | Yes | Yes | N/A | Yes | Yes | N/A | Yes | No | L + I |
| `HCXVisionForCausalLM` | Yes | Yes | No | Yes | Yes | Yes | Yes | No | L + I |
| `LlavaLlamaModel (VILA)` | Yes | Yes | No | Yes | Yes | No | Yes | No | L + I + V |
-| `LlavaNextForConditionalGeneration` | Yes | Yes | No | Yes | Yes | No | Yes | No | L + I |
+| `LlavaNextForConditionalGeneration` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | L + I |
| `Llama4ForConditionalGeneration` | Yes | Yes | No | Yes | Yes | No | Yes | No | L + I |
| `Mistral3ForConditionalGeneration` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | L + I |
-| `NemotronH_Nano_VL_V2` | Yes | Yes | Yes | Yes | Yes | No | Yes | No | L + I + V |
+| `NemotronH_Nano_VL_V2` | Yes | Yes | Yes | Yes | Yes | N/A | Yes | No | L + I + V |
| `Phi4MMForCausalLM` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | L + I + A |
-| `Qwen2VLForConditionalGeneration` | Yes | Yes | No | Yes | Yes | Yes | Yes | No | L + I + V |
-| `Qwen2_5_VLForConditionalGeneration` | Yes | Yes | No | Yes | Yes | Yes | Yes | No | L + I + V |
+| `Qwen2VLForConditionalGeneration` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | L + I + V |
+| `Qwen2_5_VLForConditionalGeneration` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | L + I + V |
Note:
- L: Language
diff --git a/latest/_sources/overview.md.txt b/latest/_sources/overview.md.txt
index fe44002b16..0df4f72539 100644
--- a/latest/_sources/overview.md.txt
+++ b/latest/_sources/overview.md.txt
@@ -23,7 +23,7 @@ TensorRT LLM delivers breakthrough performance on the latest NVIDIA GPUs:
### 🎯 **Comprehensive Model Support**
-TensorRT LLM supports the latest and most popular LLM architectures:
+TensorRT LLM supports the latest and most popular LLM [architectures](https://nvidia.github.io/TensorRT-LLM/models/supported-models.html):
- **Language Models**: GPT-OSS, Deepseek-R1/V3, Llama 3/4, Qwen2/3, Gemma 3, Phi 4...
- **Multi-modal Models**: LLaVA-NeXT, Qwen2-VL, VILA, Llama 3.2 Vision...
@@ -38,18 +38,18 @@ TensorRT LLM strives to support the most popular models on **Day 0**.
TensorRT LLM strives to support the most popular models on **Day 0**.
### 🚀 **Advanced Optimization & Production Features**
-- **In-Flight Batching & Paged Attention**: {ref}`inflight-batching` eliminates wait times by dynamically managing request execution, processing context and generation phases together for maximum GPU utilization and reduced latency.
-- **Multi-GPU Multi-Node Inference**: Seamless distributed inference with tensor, pipeline, and expert parallelism across multiple GPUs and nodes through the Model Definition API.
-- **Advanced Quantization**:
+- **[In-Flight Batching & Paged Attention](./features/paged-attention-ifb-scheduler.md)**: In-flight batching eliminates wait times by dynamically managing request execution, processing context and generation phases together for maximum GPU utilization and reduced latency.
+- **[Multi-GPU Multi-Node Inference](./features/parallel-strategy.md)**: Seamless distributed inference with tensor, pipeline, and expert parallelism across multiple GPUs and nodes through the Model Definition API.
+- **[Advanced Quantization](./features/quantization.md)**:
- **FP4 Quantization**: Native support on NVIDIA B200 GPUs with optimized FP4 kernels
- **FP8 Quantization**: Automatic conversion on NVIDIA H100 GPUs leveraging Hopper architecture
-- **Speculative Decoding**: Multiple algorithms including EAGLE, MTP and NGram
-- **KV Cache Management**: Paged KV cache with intelligent block reuse and memory optimization
-- **Chunked Prefill**: Efficient handling of long sequences by splitting context into manageable chunks
-- **LoRA Support**: Multi-adapter support with HuggingFace and NeMo formats, efficient fine-tuning and adaptation
-- **Checkpoint Loading**: Flexible model loading from various formats (HuggingFace, NeMo, custom)
-- **Guided Decoding**: Advanced sampling with stop words, bad words, and custom constraints
-- **Disaggregated Serving (Beta)**: Separate context and generation phases across different GPUs for optimal resource utilization
+- **[Speculative Decoding](./features/speculative-decoding.md)**: Multiple algorithms including EAGLE, MTP and NGram
+- **[KV Cache Management](./features/kvcache.md)**: Paged KV cache with intelligent block reuse and memory optimization
+- **[Chunked Prefill](./features/paged-attention-ifb-scheduler.md)**: Efficient handling of long sequences by splitting context into manageable chunks
+- **[LoRA Support](./features/lora.md)**: Multi-adapter support with HuggingFace and NeMo formats, efficient fine-tuning and adaptation
+- **[Checkpoint Loading](./features/checkpoint-loading.md)**: Flexible model loading from various formats (HuggingFace, NeMo, custom)
+- **[Guided Decoding](./features/guided-decoding.md)**: Advanced sampling with stop words, bad words, and custom constraints
+- **[Disaggregated Serving (Beta)](./features/disagg-serving.md)**: Separate context and generation phases across different GPUs for optimal resource utilization
### 🔧 **Latest GPU Architecture Support**
diff --git a/latest/_sources/quick-start-guide.md.txt b/latest/_sources/quick-start-guide.md.txt
index 5ef481f5f0..088f70b3ea 100644
--- a/latest/_sources/quick-start-guide.md.txt
+++ b/latest/_sources/quick-start-guide.md.txt
@@ -10,7 +10,7 @@ This is the starting point to try out TensorRT LLM. Specifically, this Quick Sta
The [TensorRT LLM container](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags) maintained by NVIDIA contains all of the required dependencies pre-installed. You can start the container on a machine with NVIDIA GPUs via:
```bash
-docker run --rm -it --ipc host --gpus all --ulimit memlock=-1 --ulimit stack=67108864 -p 8000:8000 nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc4
+docker run --rm -it --ipc host --gpus all --ulimit memlock=-1 --ulimit stack=67108864 -p 8000:8000 nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc5
```
@@ -24,6 +24,15 @@ To start the server, you can run a command like the following example inside a D
trtllm-serve "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
```
+You may also deploy pre-quantized models to improve performance.
+Ensure your GPU supports FP8 quantization before running the following:
+
+```bash
+trtllm-serve "nvidia/Qwen3-8B-FP8"
+```
+
+For more options, browse the full [collection of generative models](https://huggingface.co/collections/nvidia/inference-optimized-checkpoints-with-model-optimizer) that have been quantized and optimized for inference with the TensorRT Model Optimizer.
+
```{note}
If you are running `trtllm-serve` inside a Docker container, you have two options for sending API requests:
1. Expose a port (e.g., 8000) to allow external access to the server from outside the container.
diff --git a/latest/_sources/torch/auto_deploy/advanced/expert_configurations.md.txt b/latest/_sources/torch/auto_deploy/advanced/expert_configurations.md.txt
index afc55d24f8..4df92f0cf7 100644
--- a/latest/_sources/torch/auto_deploy/advanced/expert_configurations.md.txt
+++ b/latest/_sources/torch/auto_deploy/advanced/expert_configurations.md.txt
@@ -153,6 +153,85 @@ python build_and_run_ad.py \
--args.world-size=8 # CLI override beats both YAML configs
```
+## Sharding configuration
+
+The `detect_sharding` transform automatically detects and applies sharding strategies to the model. It supports multiple sharding sources and dimensions, allowing flexible configuration for different model architectures and parallelism strategies.
+
+### Configuration Parameters
+
+The `detect_sharding` transform accepts the following configuration parameters:
+
+#### `simple_shard_only` (bool, default: `false`)
+
+When set to `true`, forces simple sharding (row-wise sharding with all-gather) for all linear layers, bypassing more sophisticated column/row sharding strategies. This is useful when you want a uniform sharding approach across all layers or when debugging sharding issues.
+
+#### `sharding_source` (list, default: `['manual', 'factory', 'heuristic']`)
+
+Specifies the priority order of sharding sources. The order matters: if multiple sources try to apply sharding to the same layer, only the first one in the list will be applied. The available sources are:
+
+- **`'manual'`**: Uses manually provided sharding configuration via `manual_config` parameter
+- **`'factory'`**: Uses factory-provided sharding configuration (e.g., from HuggingFace model configs)
+- **`'heuristic'`**: Uses automatic heuristic-based sharding detection based on layer patterns
+
+Example: If both `manual` and `heuristic` try to apply sharding to layer L, only the `manual` transformation will be applied since it appears first in the list.
+
+#### `support_partial_config` (bool, default: `true`)
+
+When `true`, allows partial sharding configurations where not all layers need to be specified in the manual or factory config. Layers not explicitly configured will be handled by heuristic sharding or left unsharded. When `false`, the configuration must specify all layers or it will be invalidated and skipped.
+
+#### `sharding_dims` (list, default: `['tp', 'ep', 'bmm']`)
+
+Specifies which sharding dimensions to apply during heuristic sharding. The available dimensions are:
+
+- **`'tp'`**: Tensor parallelism - applies column/row sharding for standard transformer layers
+- **`'ep'`**: Expert parallelism - shards experts across ranks for Mixture-of-Experts (MoE) models
+- **`'bmm'`**: Batch matrix multiplication sharding - shards batch matrix multiplication operations
+- **`'ssm'`**: State space model sharding - applies specialized sharding for Mamba/SSM layers
+
+You can enable multiple dimensions simultaneously. For example, `['tp', 'ep']` will apply both tensor parallelism and expert parallelism.
+
+#### `requires_shape_prop` (bool, default: `true`)
+
+Whether shape propagation is required before applying this transform. Shape propagation enables the transform to make informed decisions about sharding strategies based on tensor dimensions.
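+
+For illustration, these parameters can be combined in the transform configuration as follows (a sketch; the values shown are the documented defaults, except `sharding_dims`, which is narrowed to `['tp', 'ep']`):
+
+```yaml
+args:
+  transforms:
+    detect_sharding:
+      simple_shard_only: false
+      sharding_source: ['manual', 'factory', 'heuristic']
+      support_partial_config: true
+      sharding_dims: ['tp', 'ep']
+      requires_shape_prop: true
+```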
+
+### Manual TP Sharding Configuration
+
+For advanced users, you can provide a manual sharding configuration. An example of such a setting:
+
+```yaml
+args:
+ transforms:
+ detect_sharding:
+ manual_config:
+ head_dim: 128
+ tp_plan:
+ # mamba SSM layers
+ in_proj: mamba
+ out_proj: rowwise
+ # attention layers
+ q_proj: colwise
+ k_proj: colwise
+ v_proj: colwise
+ o_proj: rowwise
+ # NOTE: for performance reason, consider not sharding the following
+ # layers at all. Commenting out the following layers will replicate
+ # them across ranks.
+ # MLP and shared experts in MoE layers
+ gate_proj: colwise
+ up_proj: colwise
+ down_proj: rowwise
+ # MoLE: latent projections: simple shard
+ fc1_latent_proj: gather
+ fc2_latent_proj: gather
+```
+
+The `tp_plan` dictionary maps layer names (using module paths with wildcard `*` support) to sharding strategies:
+
+- **`colwise`**: Column-wise sharding (splits the weight matrix along columns)
+- **`rowwise`**: Row-wise sharding (splits the weight matrix along rows)
+- **`mamba`**: Specialized sharding for Mamba SSM layers
+- **`gather`**: Simple shard with row-wise sharding and all-gather operation
+
## Built-in Default Configuration
Both `AutoDeployConfig` and `LlmArgs` classes automatically load a built-in `default.yaml` configuration file that provides defaults for the AutoDeploy inference optimizer pipeline. This file is specified in the `_get_config_dict()` function in `tensorrt_llm._torch.auto_deploy.llm_args` and defines default transform configurations for graph optimization stages.
diff --git a/latest/blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.html b/latest/blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.html
index 5270e791e4..25eacbb65f 100644
--- a/latest/blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.html
+++ b/latest/blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.html
@@ -61,7 +61,7 @@
@@ -76,7 +76,7 @@
-
+
@@ -376,7 +376,9 @@
H200’s HBM3e larger capacity & faster memory enables up to 1.9x performance on LLMs compared to H100. Max throughput improves due to its dependence on memory capacity and bandwidth, benefitting from the new HBM3e. First token latency is compute bound for most ISLs, meaning H200 retains similar time to first token as H100.
@@ -900,9 +905,9 @@ However, since Q is in BF16 format, FMHA will also be performed in BF16 format,
diff --git a/latest/blogs/tech_blog/blog1_Pushing_Latency_Boundaries_Optimizing_DeepSeek-R1_Performance_on_NVIDIA_B200_GPUs.html b/latest/blogs/tech_blog/blog1_Pushing_Latency_Boundaries_Optimizing_DeepSeek-R1_Performance_on_NVIDIA_B200_GPUs.html
index 88733cf96b..82fe6ddea3 100644
--- a/latest/blogs/tech_blog/blog1_Pushing_Latency_Boundaries_Optimizing_DeepSeek-R1_Performance_on_NVIDIA_B200_GPUs.html
+++ b/latest/blogs/tech_blog/blog1_Pushing_Latency_Boundaries_Optimizing_DeepSeek-R1_Performance_on_NVIDIA_B200_GPUs.html
@@ -61,7 +61,7 @@
@@ -76,7 +76,7 @@
-
+
@@ -376,7 +376,9 @@
@@ -1460,9 +1465,9 @@ Based on our current performance analysis, when you plan to apply large-scale EP
diff --git a/latest/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.html b/latest/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.html
index ee6d3aba26..46c999ff17 100644
--- a/latest/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.html
+++ b/latest/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.html
@@ -63,7 +63,7 @@
@@ -78,7 +78,7 @@
-
+
@@ -378,7 +378,9 @@
@@ -642,7 +647,7 @@ trtllm-serve disaggregated -c ... Figure 4. Dynamo integration with disaggregated service
In the Dynamo workflow, requests are initially processed by pre- and post-processing workers, which then query a smart router to determine the optimal decode worker to route the requests to. Depending on the availability of KV cache blocks, the decoder worker may bypass the prefill stage or forward the request to the prefill worker. Once the prefill worker is done processing the prompt, the KV cache blocks can be sent from the prefill worker to the decoder worker, using the metadata referred to as ctx_params in the figure above.
Dynamo also includes built-in support for Kubernetes deployment, monitoring, and metrics collection. The development team is actively working on enabling dynamic instance scaling, further enhancing its suitability for production environments.
-
For more information on how to use Dynamo with TensorRT LLM, please refer to this documentation.
+
For more information on how to use Dynamo with TensorRT LLM, please refer to this documentation.
A JSON string specifying chat template arguments, used to enable features like thinking mode. Examples: ‘{“enable_thinking”: true}’ for Qwen3, or ‘{“thinking”: true}’ for DeepSeek-V3.2.
Across different requests, average TPOT is the mean of each request’s TPOT (all requests weighted equally), while average ITL is token-weighted (all tokens weighted equally):
You can configure various options of trtllm-serve using YAML files by setting the --extra_llm_api_options option to the path of a YAML file; the arguments in the file override the corresponding command-line arguments.

The YAML file configures tensorrt_llm.llmapi.LlmArgs, which has multiple levels of hierarchy. A top-level argument such as max_batch_size is set directly with a line like `max_batch_size: 8`, while a nested argument such as moe_config.backend is set under its parent key:
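
A sketch of such a YAML file (the backend value is illustrative; see the MoE configuration documentation for valid options):

```yaml
max_batch_size: 8
moe_config:
  backend: WIDEEP
```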
If you encounter CUDA out-of-memory errors, try reducing max_batch_size or max_seq_len.
-
For running input/output sequence lengths of 8K/1K on H200, there is a known CUDA Out-Of-Memory issue caused by the PyTorch CUDA Caching Allocator fragmenting memory. As a workaround, you can set the environment variable PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:8192. For more details, please refer to the PyTorch documentation on optimizing memory usage.
+
For running input/output sequence lengths of 8K/1K on H200, there is a known CUDA Out-Of-Memory issue caused by the PyTorch CUDA Caching Allocator fragmenting memory. As a workaround, you can set the environment variable PYTORCH_ALLOC_CONF=max_split_size_mb:8192. For more details, please refer to the PyTorch documentation on optimizing memory usage.
Ensure your model checkpoints are compatible with the expected format.
@@ -959,16 +966,19 @@ chmod +x bench.sh
Both TPOT and ITL ignore TTFT.
For a single request, ITLs are the time intervals between tokens, while TPOT is the average of those intervals:
Across different requests, average TPOT is the mean of each request’s TPOT (all requests weighted equally), while average ITL is token-weighted (all tokens weighted equally):
Here is an example response, showing that the TensorRT LLM server reasons and answers the questions.
-
TODO: Use Chat Compeletions API / Responses API as the example after the PR is merged.
{"id":"chatcmpl-c5bf51b5cab94e10ba5da5266d12ee59","object":"chat.completion","created":1755815898,"model":"openai/gpt-oss-120b","choices":[{"index":0,"message":{"role":"assistant","content":"analysisThe user asks: \"Where is New York?\" Likely they want location info. Provide answer: New York State in northeastern US, New York City on the east coast, coordinates, etc. Provide context.assistantfinal**New York** can refer to two related places in the United States:\n\n| What it is | Where it is | Approx. coordinates | How to picture it |\n|------------|------------|--------------------|-------------------|\n| **New York State** | The northeastern corner of the United States, bordered by **Vermont, Massachusetts, Connecticut, New Jersey, Pennsylvania, and the Canadian provinces of Ontario and Quebec**. | 42.7° N, 75.5° W (roughly the state’s geographic centre) | A roughly rectangular state that stretches from the Atlantic Ocean in the southeast to the Adirondack Mountains and the Great Lakes region in the north. |\n| **New York City (NYC)** | The largest city in the state, located on the **southern tip of the state** where the **Hudson River meets the Atlantic Ocean**. It occupies five boroughs: Manhattan, Brooklyn, Queens, The Bronx, and Staten Island. | 40.7128° N, 74.0060° W | A dense, world‑famous metropolis that sits on a series of islands (Manhattan, Staten Island, parts of the Bronx) and the mainland (Brooklyn and Queens). |\n\n### Quick geographic context\n- **On a map of the United States:** New York State is in the **Northeast** region, just east of the Great Lakes and north of Pennsylvania. \n- **From Washington, D.C.:** Travel roughly **225 mi (360 km) northeast**. \n- **From Boston, MA:** Travel about **215 mi (350 km) southwest**. \n- **From Toronto, Canada:** Travel about **500 mi (800 km) southeast**.\n\n### Travel tips\n- **By air:** Major airports include **John F. Kennedy International (JFK)**, **LaGuardia (LGA)**, and **Newark Liberty International (EWR)** (the latter is actually in New Jersey but serves the NYC metro area). \n- **By train:** Amtrak’s **Northeast Corridor** runs from **Boston → New York City → Washington, D.C.** \n- **By car:** Interstates **I‑87** (north‑south) and **I‑90** (east‑west) are the primary highways crossing the state.\n\n### Fun fact\n- The name “**New York**” was given by the English in 1664, honoring the Duke of York (later King James II). The city’s original Dutch name was **“New Amsterdam.”**\n\nIf you need more specific directions (e.g., how to get to a particular neighborhood, landmark, or the state capital **Albany**), just let me know!","reasoning_content":null,"tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null,"mm_embedding_handle":null,"disaggregated_params":null,"avg_decoded_tokens_per_iter":1.0}],"usage":{"prompt_tokens":72,"total_tokens":705,"completion_tokens":633},"prompt_token_ids":null}
@@ -922,16 +928,19 @@ chmod +x bench.sh
Both TPOT and ITL ignore TTFT.
For a single request, ITLs are the time intervals between tokens, while TPOT is the average of those intervals:
Across different requests, average TPOT is the mean of each request’s TPOT (all requests weighted equally), while average ITL is token-weighted (all tokens weighted equally):
This YAML file specifies configurations that deploy the model with 8-way expert parallelism for the MoE part and 8-way attention data parallelism. It also enables trust_remote_code, so that it works with the Kimi K2 Thinking customized tokenizer.
+
With the EXTRA_OPTIONS_YAML_FILE, use the following example command to launch the TensorRT LLM server with the Kimi-K2-Thinking-NVFP4 model from within the container.
TensorRT LLM will load weights and select the best kernels during startup. The server is successfully launched when the following log is shown:
+
INFO: Started server process [xxxxx]
+INFO: Waiting for application startup.
+INFO: Application startup complete.
+INFO: Uvicorn running on http://localhost:8000 (Press CTRL+C to quit)
+
+
+
You can query the health/readiness of the server using:
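
```bash
# A typical readiness check; assumes the default port 8000 and the /health endpoint exposed by trtllm-serve.
curl -s -o /dev/null -w "Status: %{http_code}\n" http://localhost:8000/health
```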
When the Status:200 code is returned, the server is ready for queries.
+
+
+
+
Deploy Kimi K2 Thinking on GB200 NVL72 through SLURM with wide EP and disaggregated serving
+
TensorRT LLM provides a set of SLURM scripts that can be easily configured through YAML files and automatically launch SLURM jobs on GB200 NVL72 clusters for deployment, benchmarking, and accuracy testing purposes. The scripts are located at examples/disaggregated/slurm/benchmark. Refer to this page for more details and example wide EP config files.
+
For Kimi K2 Thinking, an example configuration for SLURM arguments and the scripts is as follows:
+
# SLURM Configuration
+slurm:
+  script_file: "disaggr_torch.slurm"
+  partition: "<partition>"
+  account: "<account>"
+  job_time: "02:00:00"
+  job_name: "<job_name>"
+  extra_args: ""  # Cluster specific arguments, e.g. "--gres=gpu:4 --exclude=node1,node2"
+  numa_bind: true  # Only enable for GB200 NVL72
+
+# Benchmark Mode
+benchmark:
+  mode: "e2e"  # Options: e2e, gen_only
+  use_nv_sa_benchmark: false  # Whether to use NVIDIA SA benchmark script
+  multi_round: 8  # Number of benchmark rounds
+  benchmark_ratio: 0.8  # Benchmark ratio
+  streaming: true  # Enable streaming mode
+  concurrency_list: "16"
+  input_length: 1024  # Input sequence length
+  output_length: 1024  # Output sequence length
+  dataset_file: "<dataset_file>"
+
+# Hardware Configuration
+hardware:
+  gpus_per_node: 4  # Modify this with your hardware configuration
+  num_ctx_servers: 4  # Number of context servers
+  num_gen_servers: 1  # Number of generation servers
+
+# Environment Configuration
+environment:
+  container_mount: "<container_mount>"  # Format: path1:path1,path2:path2
+  container_image: "<container_image>"
+  model_path: "<model_path>"
+  trtllm_repo: "<trtllm_repo>"
+  build_wheel: false  # Don't build the wheel when launching multiple jobs
+  trtllm_wheel_path: ""  # Path to pre-built TensorRT-LLM wheel. If provided, install from this wheel instead
+  work_dir: "<full_path_to_work_dir>"
+  worker_env_var: "TLLM_LOG_LEVEL=INFO TRTLLM_SERVER_DISABLE_GC=1 TRTLLM_WORKER_DISABLE_GC=1 TRTLLM_ENABLE_PDL=1 ENROOT_ALLOW_DEV=yes"
+  server_env_var: "TRTLLM_SERVER_DISABLE_GC=1"
+
+# Worker Configuration
+worker_config:
+  gen:
+    tensor_parallel_size: 32
+    moe_expert_parallel_size: 32
+    enable_attention_dp: true
+    enable_lm_head_tp_in_adp: true
+    pipeline_parallel_size: 1
+    max_batch_size: 128
+    max_num_tokens: 128
+    max_seq_len: 9236
+    cuda_graph_config:
+      enable_padding: true
+      batch_sizes:
+        - 1
+        - 2
+        - 4
+        - 8
+        - 16
+        - 32
+        - 64
+        - 128
+        - 256
+        - 512
+        - 768
+        - 1024
+        - 2048
+    print_iter_log: true
+    kv_cache_config:
+      enable_block_reuse: false
+      free_gpu_memory_fraction: 0.6
+      dtype: fp8
+    moe_config:
+      backend: WIDEEP
+      use_low_precision_moe_combine: true
+      load_balancer:
+        num_slots: 416
+        layer_updates_per_iter: 1
+    cache_transceiver_config:
+      backend: UCX
+      max_tokens_in_buffer: 8448
+    stream_interval: 20
+    num_postprocess_workers: 4
+    trust_remote_code: true
+  ctx:
+    max_batch_size: 1
+    max_num_tokens: 8448
+    max_seq_len: 8212
+    tensor_parallel_size: 4
+    moe_expert_parallel_size: 4
+    enable_attention_dp: true
+    pipeline_parallel_size: 1
+    print_iter_log: true
+    cuda_graph_config: null
+    disable_overlap_scheduler: true
+    kv_cache_config:
+      enable_block_reuse: false
+      free_gpu_memory_fraction: 0.75
+      dtype: fp8
+    cache_transceiver_config:
+      backend: UCX
+      max_tokens_in_buffer: 8448
+    trust_remote_code: true
+
+
+
It includes SLURM-specific configurations, benchmark and hardware details, and environment settings. The worker_config field includes detailed settings for context and generation servers when deploying a disaggregated server, with each specified as a list of LLM API arguments.
+
To launch SLURM jobs with the YAML config file, execute the following command:
{
+"id":"chatcmpl-5907ed752eb44d11a12893b19f79f8ca",
+"object":"chat.completion",
+"created":1764866686,
+"model":"nvidia/Kimi-K2-Thinking-NVFP4",
+"choices":[
+{
+"index":0,
+"message":{
+"role":"assistant",
+"content":"<think> The user is asking a very simple question: \"Where is New York?\" This could be interpreted in a few ways:\n\n1. Where is New York State located?\n2. Where is New York City located?\n3. Where is New York located in relation to something else?\n\nGiven the ambiguity, I should provide a comprehensive answer that covers the main interpretations. I should be clear and direct.\n\nLet me structure my answer:\n- First, clarify that \"New York\" can refer to either New York State or New York City\n- For New York State: It's located in the northeastern United States, bordered by New Jersey, Pennsylvania, Connecticut",
+"reasoning_content":"",
+"reasoning":null,
+"tool_calls":[]
+},
+"logprobs":null,
+"finish_reason":"length",
+"stop_reason":null,
+"mm_embedding_handle":null,
+"disaggregated_params":null,
+"avg_decoded_tokens_per_iter":1.0
+}
+],
+"usage":{
+"prompt_tokens":12,
+"total_tokens":140,
+"completion_tokens":128,
+"prompt_tokens_details":{
+"cached_tokens":0
+}
+},
+"prompt_token_ids":null
+}
+
To benchmark the performance of your TensorRT LLM server, you can leverage the built-in benchmark_serving.py script. To do this, first create a wrapper bench.sh script.
This is a functional quick-start guide for running the Qwen3 model on TensorRT LLM. It focuses on a working setup with recommended defaults. Additional performance optimizations and support will be rolled out in future updates.
We maintain YAML configuration files with recommended performance settings in the examples/configs directory. These config files are present in the TensorRT LLM container at the path /app/tensorrt_llm/examples/configs. You can use these out-of-the-box, or adjust them to your specific use case.
+
TRTLLM_DIR=/app/tensorrt_llm  # change as needed to match your environment
+EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/qwen3.yaml
+
+
+
Note: if you don’t have access to the source code locally, you can manually create the YAML config file using the code in the dropdown below.
These options provide control over TensorRT LLM’s behavior and are set within the YAML file passed to the trtllm-serve command via the --extra_llm_api_options argument.
Description: Sets the expert-parallel size for Mixture-of-Experts (MoE) models. Like tensor_parallel_size, this should generally match the number of GPUs you’re using. This setting has no effect on non-MoE models.
Description: A value between 0.0 and 1.0 that specifies the fraction of free GPU memory to reserve for the KV cache after the model is loaded. Since memory usage can fluctuate, this buffer helps prevent out-of-memory (OOM) errors.
+
Recommendation: If you experience OOM errors, try reducing this value to 0.7 or lower.
Description: The maximum number of user requests that can be grouped into a single batch for processing. The actual max batch size that can be achieved depends on total sequence length (input + output).
Description: The maximum possible sequence length for a single request, including both input and generated output tokens. We won’t specifically set it. It will be inferred from model config.
When the Status:200 code is returned, the server is ready for queries. Note that the very first query may take longer due to initialization and compilation.
+
After the TensorRT LLM server is set up and shows Application startup complete, you can send requests to the server.
+
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
+ "model": "Qwen/Qwen3-30B-A3B",
+ "messages": [
+ {
+ "role": "user",
+ "content": "What is the capital of France?"
+ }
+ ],
+ "max_tokens": 512,
+ "temperature": 0.7,
+ "top_p": 0.95
+}' -w "\n"
+
+
+
Here is an example response:
+
{
+"id":"chatcmpl-abc123def456",
+"object":"chat.completion",
+"created":1759022940,
+"model":"Qwen/Qwen3-30B-A3B",
+"choices":[
+{
+"index":0,
+"message":{
+"role":"assistant",
+"content":"The capital of France is Paris. Paris is not only the capital but also the largest city in France, known for its rich history, culture, art, and iconic landmarks such as the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral."
+},
+"logprobs":null,
+"finish_reason":"stop"
+}
+],
+"usage":{
+"prompt_tokens":15,
+"completion_tokens":58,
+"total_tokens":73
+}
+}
+
To benchmark the performance of your TensorRT LLM server, you can leverage the built-in benchmark_serving.py script. To do this, first create a wrapper bench.sh script.
To run the benchmark with the generated data set, simply use the trtllm-bench throughput subcommand. The benchmarker will
run an offline maximum throughput scenario such that all requests are queued in rapid succession. You simply need to provide
-a model name (HuggingFace reference or path to a local model), a generated dataset, and a file containing any desired extra options to the LLM APIs (details in tensorrt_llm/llmapi/llm_args.py:LlmArgs).
1
- 2importos
- 3importsys
- 4fromdataclassesimportdataclass,field
- 5frompathlibimportPath
- 6fromtempfileimportTemporaryDirectory
- 7
- 8importclick
- 9importtorch
- 10
- 11fromtensorrt_llmimportLLM,SamplingParams,logger
- 12fromtensorrt_llm._torch.pyexecutor.kv_cache_connectorimport(
- 13KvCacheConnectorScheduler,KvCacheConnectorWorker,SchedulerOutput)
- 14fromtensorrt_llm.bindings.internal.batch_managerimportLlmRequest
- 15fromtensorrt_llm.llmapi.llm_argsimportKvCacheConnectorConfig,TorchLlmArgs
- 16
- 17# This is a simple example of the use of the KV cache connector.
- 18# It persists KV cache contents into a folder, and can load them back on subsequent runs.
- 19# See tensorrt_llm/_torch/pyexecutor/connector.py for details about the KV cache connector interface.
- 20# NOTE: This example connector implementation is NOT suitable for production use.
- 21
- 22CONNECTOR_CACHE_FOLDER_KEY="CONNECTOR_CACHE_FOLDER"
- 23
- 24
- 25@dataclass
- 26classPersistentKvCacheConnectorMetadata:
- 27load:list[tuple[str,int]]=field(default_factory=list)
- 28save:list[tuple[str,int]]=field(default_factory=list)
- 29
- 30
- 31classPersistentKvCacheConnectorWorker(KvCacheConnectorWorker):
- 32
- 33def__init__(self,llm_args:TorchLlmArgs):
- 34super().__init__(llm_args)
+
1'''
+ 2This script demonstrates the KV cache connector feature in TensorRT-LLM, which enables
+ 3custom persistence and reuse of KV cache blocks across different LLM instances.
+ 4
+ 5**Scenario:**
+ 6The script implements a persistent KV cache connector that saves computed KV cache blocks
+ 7to disk and loads them back in subsequent runs, eliminating redundant computation for
+ 8recurring prompts.
+ 9
+ 10**What is a KV Cache Connector?**
+ 11
+ 12A KV cache connector is a customizable interface that allows you to:
+ 131. **Save KV Cache:** Persist computed KV cache blocks to an external storage
+ 14 (disk, database, distributed cache, etc.)
+ 152. **Load KV Cache:** Retrieve previously computed cache blocks instead of recomputing them
+ 163. **Share Cache Across Instances:** Reuse cache blocks across different LLM instances
+ 17 or sessions, unlike regular block reuse which is limited to a single instance
+ 18
+ 19**How It Works:**
+ 20
+ 21This example implements a `PersistentKvCacheConnector` with two key components:
+ 22
+ 23* **PersistentKvCacheConnectorLeader (Scheduler):**
+ 24 - Hashes token sequences to create unique identifiers for each cache block
+ 25 - Checks if cached blocks exist on disk for incoming requests
+ 26 - Schedules load operations for cache hits
+ 27 - Schedules save operations for newly computed blocks
+ 28
+ 29* **PersistentKvCacheConnectorWorker:**
+ 30 - Executes the actual load/save operations between GPU and disk
+ 31 - Loads cached blocks from disk files into GPU memory
+ 32 - Saves newly computed blocks from GPU to disk files
+ 33
+ 34**Demonstration:** 35
- 36self.kv_cache_tensor=None
+ 36The script processes the same prompt twice using two separate LLM instances: 37
- 38defregister_kv_caches(self,kv_cache_tensor:torch.Tensor):
- 39assertself.kv_cache_tensorisNone,"KV cache tensor already registered"
- 40self.kv_cache_tensor=kv_cache_tensor
- 41
- 42defstart_load_kv(self,stream:torch.cuda.Stream):
- 43# Do all loads synchronously, and blockwise.
- 44forpath,block_idinself._metadata.load:
- 45cpu_tensor=torch.load(path,map_location="cpu")
- 46
- 47# Copy into the device block.
- 48self.kv_cache_tensor[block_id].copy_(cpu_tensor,non_blocking=False)
+ 381. **First Run (Instance 1):**
+ 39 - The LLM computes the KV cache for the input prompt
+ 40 - The connector saves the computed cache blocks to disk (as .pt files)
+ 41 - The generation completes and the LLM instance is destroyed
+ 42
+ 432. **Second Run (Instance 2):**
+ 44 - A new LLM instance is created with the same connector configuration
+ 45 - When processing the same prompt, the connector finds matching cache blocks on disk
+ 46 - The cache is loaded from disk instead of being recomputed
+ 47 - **Expected Outcome:** Faster prefill as cache blocks are loaded rather than computed
+ 48 - Both outputs should be identical, demonstrating deterministic cache reuse 49
- 50defwait_for_layer_load(self,layer_idx:int,stream:torch.cuda.Stream):
- 51pass
- 52
- 53defsave_kv_layer(self,layer_idx:int,stream:torch.cuda.Stream):
- 54pass
- 55
- 56defwait_for_save(self,stream:torch.cuda.Stream):
- 57
- 58# Make sure the forward pass is complete before beginning our save.
- 59stream.synchronize()
- 60
- 61forpath,block_idinself._metadata.save:
- 62cpu_tensor=self.kv_cache_tensor[block_id].cpu()
- 63
- 64# Don't write anything if this specific block already exists.
- 65ifPath(path).exists():
- 66continue
+ 50**Key Benefits:**
+ 51
+ 52- **Cross-Instance Cache Sharing:** Share computed caches across multiple LLM instances
+ 53- **Persistent Storage:** Cache survives beyond the lifetime of a single LLM instance
+ 54- **Custom Storage Backends:** Implement any storage mechanism (shown here: disk files)
+ 55- **Reduced Computation:** Eliminate redundant KV cache computation for repeated prompts
+ 56
+ 57**How to Run:**
+ 58
+ 59```bash
+ 60python llm_kv_cache_connector.py <model_path>
+ 61```
+ 62
+ 63Example:
+ 64```bash
+ 65python llm_kv_cache_connector.py meta-llama/Llama-3.1-8B-Instruct
+ 66``` 67
- 68# Do a blocking save to the file. This way, we only return once all saves are complete.
- 69torch.save(cpu_tensor,path)
- 70
- 71defget_finished(
- 72self,finished_gen_req_ids:list[int],
- 73started_loading_req_ids:list[int])->tuple[list[int],list[int]]:
- 74
- 75return[],[]
- 76
- 77
- 78classPersistentKvCacheConnectorLeader(KvCacheConnectorScheduler):
+ 68**Implementation Notes:**
+ 69
+ 70- This example uses content-based hashing to identify cache blocks
+ 71- Cache files are stored in a temporary directory (cleaned up after the demo)
+ 72- The implementation is simplified and not optimized for production use
+ 73- Does not support chunked prefill in this example
+ 74- See `tensorrt_llm/_torch/pyexecutor/kv_cache_connector.py` for the full connector interface
+ 75
+ 76**NOTE:** This example connector implementation is designed for demonstration purposes
+ 77and is NOT suitable for production use without additional optimizations and error handling.
+ 78''' 79
- 80def__init__(self,llm_args:TorchLlmArgs):
- 81super().__init__(llm_args)
- 82
- 83self.block_size=self._llm_args.kv_cache_config.tokens_per_block
- 84self.pending_loads={}
+ 80importos
+ 81importsys
+ 82fromdataclassesimportdataclass,field
+ 83frompathlibimportPath
+ 84fromtempfileimportTemporaryDirectory 85
- 86self.cache_folder=os.environ.get(CONNECTOR_CACHE_FOLDER_KEY,
- 87"./connector_cache")
+ 86importclick
+ 87importtorch 88
- 89os.makedirs(self.cache_folder,exist_ok=True)
- 90
- 91defbuild_connector_meta(self,scheduler_output:SchedulerOutput):
- 92# NOTE: This is a simplified implementation, and does not work with chunked prefill.
- 93
- 94metadata=PersistentKvCacheConnectorMetadata()
- 95
- 96forreqinscheduler_output.new_requests:
- 97# If we don't have any pending loads for this request, we can skip it.
- 98ifreq.request_idnotinself.pending_loads:
- 99continue
-100
-101num_computed_blocks=req.computed_position//self.block_size
-102block_ids=req.new_block_ids
+ 89fromtensorrt_llmimportLLM,SamplingParams,logger
+ 90fromtensorrt_llm._torch.pyexecutor.kv_cache_connectorimport(
+ 91KvCacheConnectorScheduler,KvCacheConnectorWorker,SchedulerOutput)
+ 92fromtensorrt_llm.bindings.internal.batch_managerimportLlmRequest
+ 93fromtensorrt_llm.llmapi.llm_argsimportKvCacheConnectorConfig,TorchLlmArgs
+ 94
+ 95CONNECTOR_CACHE_FOLDER_KEY="CONNECTOR_CACHE_FOLDER"
+ 96
+ 97
+ 98@dataclass
+ 99classPersistentKvCacheConnectorMetadata:
+100load:list[tuple[str,int]]=field(default_factory=list)
+101save:list[tuple[str,int]]=field(default_factory=list)
+102103
-104pending_load=self.pending_loads[req.request_id]
+104classPersistentKvCacheConnectorWorker(KvCacheConnectorWorker):105
-106forfile_path,block_posinzip(
-107pending_load,range(num_computed_blocks,len(block_ids))):
-108metadata.load.append((file_path,block_ids[block_pos]))
-109
-110# Break up the remainder of the token sequence into chunks.
    def __init__(self, llm_args: TorchLlmArgs):
        super().__init__(llm_args)

        self.kv_cache_tensor = None

    def register_kv_caches(self, kv_cache_tensor: torch.Tensor):
        assert self.kv_cache_tensor is None, "KV cache tensor already registered"
        self.kv_cache_tensor = kv_cache_tensor

    def start_load_kv(self, stream: torch.cuda.Stream):
        # Do all loads synchronously, and blockwise.
        for path, block_id in self._metadata.load:
            cpu_tensor = torch.load(path, map_location="cpu")

            # Copy into the device block.
            self.kv_cache_tensor[block_id].copy_(cpu_tensor, non_blocking=False)

    def wait_for_layer_load(self, layer_idx: int, stream: torch.cuda.Stream):
        pass

    def save_kv_layer(self, layer_idx: int, stream: torch.cuda.Stream):
        pass

    def wait_for_save(self, stream: torch.cuda.Stream):

        # Make sure the forward pass is complete before beginning our save.
        stream.synchronize()

        for path, block_id in self._metadata.save:
            cpu_tensor = self.kv_cache_tensor[block_id].cpu()

            # Don't write anything if this specific block already exists.
            if Path(path).exists():
                continue

            # Do a blocking save to the file. This way, we only return once all saves are complete.
            torch.save(cpu_tensor, path)

    def get_finished(
            self, finished_gen_req_ids: list[int],
            started_loading_req_ids: list[int]) -> tuple[list[int], list[int]]:

        return [], []


class PersistentKvCacheConnectorLeader(KvCacheConnectorScheduler):

    def __init__(self, llm_args: TorchLlmArgs):
        super().__init__(llm_args)

        self.block_size = self._llm_args.kv_cache_config.tokens_per_block
        self.pending_loads = {}

        self.cache_folder = os.environ.get(CONNECTOR_CACHE_FOLDER_KEY,
                                           "./connector_cache")

        os.makedirs(self.cache_folder, exist_ok=True)

    def build_connector_meta(self, scheduler_output: SchedulerOutput):
        # NOTE: This is a simplified implementation, and does not work with chunked prefill.

        metadata = PersistentKvCacheConnectorMetadata()

        for req in scheduler_output.new_requests:
            # If we don't have any pending loads for this request, we can skip it.
            if req.request_id not in self.pending_loads:
                continue

            num_computed_blocks = req.computed_position // self.block_size
            block_ids = req.new_block_ids

            pending_load = self.pending_loads[req.request_id]

            for file_path, block_pos in zip(
                    pending_load, range(num_computed_blocks, len(block_ids))):
                metadata.load.append((file_path, block_ids[block_pos]))

            # Break up the remainder of the token sequence into chunks.
            chunks = self._chunk_tokens(req.new_tokens)

            # For each chunk that isn't already on device, and isn't in our connector cache, we need to save it.
            for block_pos in range(num_computed_blocks + len(pending_load),
                                   len(block_ids)):
                if len(chunks[block_pos]) == self.block_size:
                    hashed_tokens = self._hash_tokens(chunks[block_pos])

                    file_path = self._file_path(hashed_tokens)

                    metadata.save.append((file_path, block_ids[block_pos]))

        self.pending_loads = {}

        return metadata

    def _hash_tokens(self, tokens: list[int]) -> int:
        return abs(hash(tuple(tokens)))

    def _file_path(self, hash_value: int) -> Path:
        return Path(self.cache_folder) / f"{hash_value}.pt"

    def _chunk_tokens(self, tokens: list[int]) -> list[list[int]]:
        return [
            tokens[i:i + self.block_size]
            for i in range(0, len(tokens), self.block_size)
        ]

    def get_num_new_matched_tokens(
            self, request: LlmRequest,
            num_computed_tokens: int) -> tuple[int, bool]:
        self.pending_loads[request.request_id] = []

        # Don't bother with sequences with partial matches.
        if (num_computed_tokens % self.block_size) != 0:
            return 0, False

        computed_blocks = num_computed_tokens // self.block_size

        # Get all the tokens that don't have a cache hit on device.
        remaining_tokens = request.get_tokens(0)[computed_blocks *
                                                 self.block_size:]

        remaining_chunks = self._chunk_tokens(remaining_tokens)

        # For each chunk, check if it exists in our cache.
        for chunk in remaining_chunks:
            # Only do full blocks.
            if len(chunk) == self.block_size:
                hashed_tokens = self._hash_tokens(chunk)

                file_path = self._file_path(hashed_tokens)

                # If we get a cache hit, we want to load it into device.
                # Otherwise, we can stop looking.
                if file_path.exists():
                    self.pending_loads[request.request_id].append(file_path)
                else:
                    break

        logger.info(
            f"KV CONNECTOR: Matched {len(self.pending_loads[request.request_id])} blocks for request {request.request_id}"
        )

        return len(
            self.pending_loads[request.request_id]) * self.block_size, False

    def request_finished(self, request: LlmRequest,
                         cache_block_ids: list[int]) -> bool:
        # We don't do any asynchronous saving, so always return False
        return False

    def update_state_after_alloc(self, request: LlmRequest,
                                 block_ids: list[int]):
        pass


@click.command()
@click.argument("model", type=str)
def main(model: str):
    sys.path.append(os.path.join(
        os.path.dirname(__file__),
        "..",
    ))

    this_module = __file__[__file__.rfind("/") + 1:__file__.rfind(".py")]

    # --- KV Cache Connector Config ---
    kv_connector_config = KvCacheConnectorConfig(
        connector_module=this_module,
        connector_scheduler_class="PersistentKvCacheConnectorLeader",
        connector_worker_class="PersistentKvCacheConnectorWorker",
    )

    connector_cache_dir = TemporaryDirectory()
    os.environ[CONNECTOR_CACHE_FOLDER_KEY] = connector_cache_dir.name

    # Create LLM instance with KV Cache Connector
    llm = LLM(model=model,
              backend="pytorch",
              cuda_graph_config=None,
              kv_connector_config=kv_connector_config)

    test_text = (
        "Nvidia Corporation is an American technology company headquartered in Santa Clara, California."
        "Founded in 1993 by Jensen Huang, Chris Malachowsky, and Curtis Priem, it develops graphics processing units (GPUs), "
        "system on a chips (SoCs), and application programming interfaces (APIs) for data science, high-performance computing, "
        "and mobile and automotive applications. Tell me about the company.")

    sampling_params = SamplingParams(max_tokens=32)

    # Generate text with the first LLM instance and save the kv cache blocks by the connector.
    output = llm.generate([test_text], sampling_params)
    text0 = output[0].outputs[0].text

    print("First output: ", text0)
    print("Loading new LLM instance...")

    del llm

    # Create a new LLM instance with the same connector configuration
    llm = LLM(model=model,
              backend="pytorch",
              cuda_graph_config=None,
              kv_connector_config=kv_connector_config)

    # Generate text with the second LLM instance and it should reuse the kv cache blocks from the connector.
    output = llm.generate([test_text], sampling_params)
    text1 = output[0].outputs[0].text

    print("Second output (using connector cache): ", text1)

    # Verify that the two outputs are identical
    assert text0 == text1

    connector_cache_dir.cleanup()


if __name__ == "__main__":
    main()
#!/bin/bash
#SBATCH -A <account>            # parameter
#SBATCH -p <partition>          # parameter
#SBATCH -e logs/llmapi-distributed.err
#SBATCH -J llmapi-distributed-task

##############################################################################
# OVERVIEW:
# This script demonstrates running a custom LLM API Python script on SLURM
# with distributed inference support. It executes quickstart_advanced.py with
# tensor parallelism across multiple GPUs/nodes.
#
# WHAT TO MODIFY:
# 1. SLURM Parameters (lines 2-9):
#    - Replace <account> with your SLURM account name
#    - Replace <partition> with your SLURM partition name
#    - Adjust -N (number of nodes) based on your TP size
#    - Adjust --ntasks-per-node (GPUs per node) to match your setup
#
# 2. Environment Variables (set before running sbatch):
#    - CONTAINER_IMAGE: Docker image with TensorRT-LLM installed
#    - MOUNT_DIR: Host directory to mount in container
#    - MOUNT_DEST: Container mount destination path
#    - WORKDIR: Working directory inside container
#    - SOURCE_ROOT: Path to TensorRT-LLM source code
#    - PROLOGUE: Commands to run before main task (e.g., module loads)
#    - LOCAL_MODEL: Path to your pre-downloaded model directory
#
# 3. Script Configuration (lines 39, 51-54):
#    - Line 39: Change $script to point to your own Python script
#    - Line 52: Modify --model_dir to use your model path
#    - Line 53: Customize --prompt with your test prompt
#    - Line 54: Adjust --tp_size to match your node/GPU setup
#
# EXAMPLE USAGE:
#   export CONTAINER_IMAGE="nvcr.io/nvidia/tensorrt_llm:latest"
#   export LOCAL_MODEL="/path/to/llama-model"
#   sbatch llm_mgmn_llm_distributed.sh
#
# NOTE: This is a template - you can replace quickstart_advanced.py with any
# LLM API Python script. The trtllm-llmapi-launch wrapper handles the
# distributed execution setup automatically.
##############################################################################


# NOTE, this feature is experimental and may not work on all systems.
# The trtllm-llmapi-launch is a script that launches the LLM-API code on
# Slurm-like systems, and can support multi-node and multi-GPU setups.

# IMPORTANT: Total MPI processes (nodes × ntasks-per-node) must equal tp_size.
# e.g. For tensor_parallel_size=16, you may use 2 nodes with 8 gpus for
# each, or 4 nodes with 4 gpus for each or other combinations.

# This docker image should have tensorrt_llm installed, or you need to install
# it in the task.

# The following variables are expected to be set in the environment:
# You can set them via --export in the srun/sbatch command.
# CONTAINER_IMAGE: the docker image to use, you'd better install tensorrt_llm in it, or install it in the task.
# MOUNT_DIR: the directory to mount in the container
# MOUNT_DEST: the destination directory in the container
# WORKDIR: the working directory in the container
# SOURCE_ROOT: the path to the TensorRT LLM source
# PROLOGUE: the prologue to run before the script
# LOCAL_MODEL: the local model directory to use, NOTE: downloading from HF is
#   not supported in Slurm mode, you need to download the model and put it in
#   the LOCAL_MODEL directory.

# Adjust the paths to run
export script=$SOURCE_ROOT/examples/llm-api/quickstart_advanced.py

# Just launch the PyTorch example with trtllm-llmapi-launch command.
srun -l \
    --container-image=${CONTAINER_IMAGE} \
    --container-mounts=${MOUNT_DIR}:${MOUNT_DEST} \
    --container-workdir=${WORKDIR} \
    --export=ALL \
    --mpi=pmix \
    bash -c "
    $PROLOGUE
    export PATH=$PATH:~/.local/bin
    trtllm-llmapi-launch python3 $script \
        --model_dir $LOCAL_MODEL \
        --prompt 'Hello, how are you?' \
        --tp_size 2 \
        --max_batch_size 256
    "
#!/bin/bash
#SBATCH -A <account>
#SBATCH -p <partition>
#SBATCH -t 01:00:00
#SBATCH -N 2
#SBATCH --ntasks-per-node=8
#SBATCH -o logs/trtllm-bench.out
#SBATCH -e logs/trtllm-bench.err
#SBATCH -J trtllm-bench

##############################################################################
# OVERVIEW:
# This script runs trtllm-bench throughput benchmarking on SLURM with multi-node,
# multi-GPU setup. It prepares a synthetic dataset and then benchmarks the model
# using the PyTorch backend with tensor parallelism.
#
# WHAT TO MODIFY:
# 1. SLURM Parameters (lines 2-9):
#    - Replace <account> with your SLURM account name
#    - Replace <partition> with your SLURM partition name
#    - Adjust -N (number of nodes) based on your TP size
#    - Adjust --ntasks-per-node (GPUs per node) to match your setup
#
# 2. Environment Variables (set before running sbatch):
#    - CONTAINER_IMAGE: Docker image with TensorRT-LLM installed
#    - MOUNT_DIR: Host directory to mount in container
#    - MOUNT_DEST: Container mount destination path
#    - WORKDIR: Working directory inside container
#    - SOURCE_ROOT: Path to TensorRT-LLM source code
#    - PROLOGUE: Commands to run before main task (e.g., module loads)
#    - LOCAL_MODEL: Path to your pre-downloaded model directory
#    - MODEL_NAME: Name of the model to benchmark
#    - EXTRA_ARGS: (Optional) Additional benchmark arguments
#
# 3. Model Configuration (lines 87-94):
#    - --tp 16: Adjust tensor parallelism size to match your node/GPU setup
#    - --num-requests (line 56): Change number of benchmark requests
#    - --input-mean/output-mean (lines 57-58): Adjust token lengths
#
# EXAMPLE USAGE:
#   export CONTAINER_IMAGE="nvcr.io/nvidia/tensorrt_llm:latest"
#   export LOCAL_MODEL="/path/to/llama-model"
#   export MODEL_NAME="meta-llama/Llama-2-7b-hf"
#   sbatch llm_mgmn_trtllm_bench.sh
##############################################################################


# NOTE, this feature is experimental and may not work on all systems.
# The trtllm-llmapi-launch is a script that launches the LLM-API code on
# Slurm-like systems, and can support multi-node and multi-GPU setups.

# IMPORTANT: Total MPI processes (nodes × ntasks-per-node) must equal tensor_parallel_size.
# e.g. For tensor_parallel_size=16, you may use 2 nodes with 8 gpus for
# each, or 4 nodes with 4 gpus for each or other combinations.

# This docker image should have tensorrt_llm installed, or you need to install
# it in the task.

# The following variables are expected to be set in the environment:
# You can set them via --export in the srun/sbatch command.
# CONTAINER_IMAGE: the docker image to use, you'd better install tensorrt_llm in it, or install it in the task.
# MOUNT_DIR: the directory to mount in the container
# MOUNT_DEST: the destination directory in the container
# WORKDIR: the working directory in the container
# SOURCE_ROOT: the path to the TensorRT LLM source
# PROLOGUE: the prologue to run before the script
# LOCAL_MODEL: the local model directory to use, NOTE: downloading from HF is
#   not supported in Slurm mode, you need to download the model and put it in
#   the LOCAL_MODEL directory.

export prepare_dataset="$SOURCE_ROOT/benchmarks/cpp/prepare_dataset.py"
export data_path="$WORKDIR/token-norm-dist.txt"

echo "Preparing dataset..."
srun -l \
    -N 1 \
    -n 1 \
    --container-image=${CONTAINER_IMAGE} \
    --container-name="prepare-name" \
    --container-mounts=${MOUNT_DIR}:${MOUNT_DEST} \
    --container-workdir=${WORKDIR} \
    --export=ALL \
    --mpi=pmix \
    bash -c "
    $PROLOGUE
    python3 $prepare_dataset \
        --tokenizer=$LOCAL_MODEL \
        --stdout token-norm-dist \
        --num-requests=100 \
        --input-mean=128 \
        --output-mean=128 \
        --input-stdev=0 \
        --output-stdev=0 > $data_path
    "

echo "Running benchmark..."
# Just launch trtllm-bench job with trtllm-llmapi-launch command.

srun -l \
    --container-image=${CONTAINER_IMAGE} \
    --container-mounts=${MOUNT_DIR}:${MOUNT_DEST} \
    --container-workdir=${WORKDIR} \
    --export=ALL,PYTHONPATH=${SOURCE_ROOT} \
    --mpi=pmix \
    bash -c "
    set -ex
    $PROLOGUE
    export PATH=$PATH:~/.local/bin

    # This is optional
    cat > /tmp/pytorch_extra_args.txt << EOF
cuda_graph_config: null
print_iter_log: true
enable_attention_dp: false
EOF

    # launch the benchmark
    trtllm-llmapi-launch \
        trtllm-bench \
        --model $MODEL_NAME \
        --model_path $LOCAL_MODEL \
        throughput \
        --dataset $data_path \
        --backend pytorch \
        --tp 16 \
        --extra_llm_api_options /tmp/pytorch_extra_args.txt \
        $EXTRA_ARGS
    "
#!/bin/bash
#SBATCH -A <account>
#SBATCH -p <partition>
#SBATCH -e logs/trtllm-serve.err
#SBATCH -J trtllm-serve

##############################################################################
# OVERVIEW:
# This script launches trtllm-serve (OpenAI-compatible API server) on SLURM
# with multi-node, multi-GPU support. The server will start on all allocated
# nodes and serve the model with tensor parallelism.
#
# WHAT TO MODIFY:
# 1. SLURM Parameters (lines 2-9):
#    - Replace <account> with your SLURM account name
#    - Replace <partition> with your SLURM partition name
#    - Adjust -N (number of nodes) based on your TP size
#    - Adjust --ntasks-per-node (GPUs per node) to match your setup
#
# 2. Environment Variables (set before running sbatch):
#    - CONTAINER_IMAGE: Docker image with TensorRT-LLM installed
#    - MOUNT_DIR: Host directory to mount in container
#    - MOUNT_DEST: Container mount destination path
#    - WORKDIR: Working directory inside container
#    - SOURCE_ROOT: Path to TensorRT-LLM source code
#    - PROLOGUE: Commands to run before main task (e.g., module loads)
#    - LOCAL_MODEL: Path to your pre-downloaded model directory
#    - ADDITIONAL_OPTIONS: (Optional) Extra trtllm-serve options
#
# 3. Server Configuration (lines 51-55):
#    - --tp_size 16: Adjust tensor parallelism to match your node/GPU setup
#    - --host 0.0.0.0: Server bind address (0.0.0.0 allows external access)
#
# EXAMPLE USAGE:
#   export CONTAINER_IMAGE="nvcr.io/nvidia/tensorrt_llm:latest"
#   export LOCAL_MODEL="/path/to/llama-model"
#   sbatch llm_mgmn_trtllm_serve.sh
#
# NOTE: After the server starts, you can send requests to it using curl or
# the OpenAI Python client. Check the output logs for the server address.
##############################################################################


# NOTE, this feature is experimental and may not work on all systems.
# The trtllm-llmapi-launch is a script that launches the LLM-API code on
# Slurm-like systems, and can support multi-node and multi-GPU setups.

# IMPORTANT: Total MPI processes (nodes × ntasks-per-node) must equal tp_size.
# e.g. For tensor_parallel_size=16, you may use 2 nodes with 8 gpus for
# each, or 4 nodes with 4 gpus for each or other combinations.

# This docker image should have tensorrt_llm installed, or you need to install
# it in the task.

# The following variables are expected to be set in the environment:
# You can set them via --export in the srun/sbatch command.
# CONTAINER_IMAGE: the docker image to use, you'd better install tensorrt_llm in it, or install it in the task.
# MOUNT_DIR: the directory to mount in the container
# MOUNT_DEST: the destination directory in the container
# WORKDIR: the working directory in the container
# SOURCE_ROOT: the path to the TensorRT LLM source
# PROLOGUE: the prologue to run before the script
# LOCAL_MODEL: the local model directory to use, NOTE: downloading from HF is
#   not supported in Slurm mode, you need to download the model and put it in
#   the LOCAL_MODEL directory.

echo "Starting trtllm-serve..."
# Just launch trtllm-serve job with trtllm-llmapi-launch command.
srun -l \
    --container-image=${CONTAINER_IMAGE} \
    --container-mounts=${MOUNT_DIR}:${MOUNT_DEST} \
    --container-workdir=${WORKDIR} \
    --export=ALL,PYTHONPATH=${SOURCE_ROOT} \
    --mpi=pmix \
    bash -c "
    set -ex
    $PROLOGUE
    export PATH=$PATH:~/.local/bin

    trtllm-llmapi-launch \
        trtllm-serve $LOCAL_MODEL \
        --tp_size 16 \
        --host 0.0.0.0 \
        ${ADDITIONAL_OPTIONS}
    "
@@ -852,7 +857,7 @@ reach that point).
the different requests by a cache manager during processing. That cache manager
keeps track of the sequences, allocates new blocks from a pool and recycles those
blocks when required. See the implementation of
KVCacheManager.
Figure 7. Dynamo integration with disaggregated service
In the Dynamo workflow, requests are initially processed by pre- and post-processing workers, which then query a smart router to determine the optimal decode worker to route the requests to. Depending on the availability of KV cache blocks, the decoder worker may bypass the prefill stage or forward the request to the prefill worker. Once the prefill worker is done processing the prompt, the KV cache blocks can be sent from the prefill worker to the decoder worker, using the metadata referred to as ctx_params in the figure above.
Dynamo also includes built-in support for Kubernetes deployment, monitoring, and metrics collection. The development team is actively working on enabling dynamic instance scaling, further enhancing its suitability for production environments.
For more information on how to use Dynamo with TensorRT-LLM, please refer to this documentation.
Guided decoding (or interchangeably constrained decoding, structured generation) guarantees that the LLM outputs are amenable to a user-specified grammar (e.g., JSON schema, regular expression or EBNF grammar).
If you are using trtllm-serve, enable guided decoding by specifying guided_decoding_backend with xgrammar or llguidance in the YAML configuration file, and pass it to --extra_llm_api_options. For example,
Define a JSON schema and pass it to response_format when creating the OpenAI chat completion request. Alternatively, the JSON schema can be created using pydantic.
+
from openai import OpenAI
+
+client=OpenAI(
+ base_url="http://localhost:8000/v1",
+ api_key="tensorrt_llm",
+)
+
+json_schema={
+ "type":"object",
+ "properties":{
+ "name":{
+ "type":"string",
+ "pattern":"^[\\w]+$"
+ },
+ "population":{
+ "type":"integer"
+ },
+ },
+ "required":["name","population"],
+}
+messages=[
+ {
+ "role":"system",
+ "content":"You are a helpful assistant.",
+ },
+ {
+ "role":"user",
+ "content":"Give me the information of the capital of France in the JSON format.",
+ },
+]
+chat_completion=client.chat.completions.create(
+ model="nvidia/Llama-3.1-8B-Instruct-FP8",
+ messages=messages,
+ max_completion_tokens=256,
+ response_format={
+ "type":"json",
+ "schema":json_schema
+ },
+)
+
+message=chat_completion.choices[0].message
+print(message.content)
+
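As noted above, the JSON schema can also be generated from a pydantic model instead of being written by hand. A minimal sketch using pydantic v2 and a hypothetical CityInfo model that mirrors the schema above:

from pydantic import BaseModel, Field


class CityInfo(BaseModel):
    # Field names mirror the hand-written schema above.
    name: str = Field(pattern=r"^[\w]+$")
    population: int


# model_json_schema() produces a JSON schema dict that can be passed to
# response_format (or to GuidedDecodingParams(json=...) in the LLM API).
json_schema = CityInfo.model_json_schema()
print(json_schema)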
Define a structural tag and pass it to response_format when creating the OpenAI chat completion request.
+
The structural tag is supported by the xgrammar backend only. It is a powerful and flexible tool to represent LLM output constraints. Please see structural tag usage for a comprehensive tutorial. Below is an example of function calling with a customized function-call format for Llama-3.1-8B-Instruct.
+
from openai import OpenAI
+
+client=OpenAI(
+ base_url="http://localhost:8000/v1",
+ api_key="tensorrt_llm",
+)
+
+tool_get_current_weather={
+ "type":"function",
+ "function":{
+ "name":"get_current_weather",
+ "description":"Get the current weather in a given location",
+ "parameters":{
+ "type":"object",
+ "properties":{
+ "city":{
+ "type":"string",
+ "description":"The city to find the weather for, e.g. 'San Francisco'",
+ },
+ "state":{
+ "type":"string",
+ "description":"the two-letter abbreviation for the state that the city is in, e.g. 'CA' which would mean 'California'",
+ },
+ "unit":{
+ "type":"string",
+ "description":"The unit to fetch the temperature in",
+ "enum":["celsius","fahrenheit"],
+ },
+ },
+ "required":["city","state","unit"],
+ },
+ },
+}
+
+tool_get_current_date={
+ "type":"function",
+ "function":{
+ "name":"get_current_date",
+ "description":"Get the current date and time for a given timezone",
+ "parameters":{
+ "type":"object",
+ "properties":{
+ "timezone":{
+ "type":"string",
+ "description":"The timezone to fetch the current date and time for, e.g. 'America/New_York'",
+ }
+ },
+ "required":["timezone"],
+ },
+ },
+}
+
+system_prompt=f"""# Tool Instructions
+- Always execute python code in messages that you share.
+- When looking for real time information use relevant functions if available else fallback to brave_search
+You have access to the following functions:
+Use the function 'get_current_weather' to: Get the current weather in a given location
+{tool_get_current_weather["function"]}
+Use the function 'get_current_date' to: Get the current date and time for a given timezone
+{tool_get_current_date["function"]}
+If a you choose to call a function ONLY reply in the following format:
+<{{start_tag}}={{function_name}}>{{parameters}}{{end_tag}}
+where
+start_tag => `<function`
+parameters => a JSON dict with the function argument name as key and function argument value as value.
+end_tag => `</function>`
+Here is an example,
+<function=example_function_name>{{"example_name": "example_value"}}</function>
+Reminder:
+- Function calls MUST follow the specified format
+- Required parameters MUST be specified
+- Only call one function at a time
+- Put the entire function call reply on one line
+- Always add your sources when using search results to answer the user query
+You are a helpful assistant."""
+user_prompt="You are in New York. Please get the current date and time, and the weather."
+
+messages=[
+ {
+ "role":"system",
+ "content":system_prompt,
+ },
+ {
+ "role":"user",
+ "content":user_prompt,
+ },
+]
+
+chat_completion=client.chat.completions.create(
+ model="nvidia/Llama-3.1-8B-Instruct-FP8",
+ messages=messages,
+ max_completion_tokens=256,
+ response_format={
+ "type":"structural_tag",
+ "format":{
+ "type":"triggered_tags",
+ "triggers":["<function="],
+ "tags":[
+ {
+ "begin":"<function=get_current_weather>",
+ "content":{
+ "type":"json_schema",
+ "json_schema":tool_get_current_weather["function"]["parameters"]
+ },
+ "end":"</function>",
+ },
+ {
+ "begin":"<function=get_current_date>",
+ "content":{
+ "type":"json_schema",
+ "json_schema":tool_get_current_date["function"]["parameters"]
+ },
+ "end":"</function>",
+ },
+ ],
+ },
+ },
+)
+
+message=chat_completion.choices[0].message
+print(message.content)
+
+
+
The output would look like:
+
<function=get_current_date>{"timezone": "America/New_York"}</function>
+<function=get_current_weather>{"city": "New York", "state": "NY", "unit": "fahrenheit"}</function>
+
If you are using LLM API, enable guided decoding by specifying guided_decoding_backend with xgrammar or llguidance when creating the LLM instance. For example,
Create a GuidedDecodingParams with the json field specified with a JSON schema, use it to create SamplingParams, and then pass to llm.generate or llm.generate_async. Alternatively, the JSON schema can be created using pydantic.
+
from tensorrt_llm import LLM
+from tensorrt_llm.sampling_params import SamplingParams, GuidedDecodingParams
+
+if __name__ == "__main__":
+ llm=LLM("nvidia/Llama-3.1-8B-Instruct-FP8",guided_decoding_backend="xgrammar")
+
+ json_schema={
+ "type":"object",
+ "properties":{
+ "name":{
+ "type":"string",
+ "pattern":"^[\\w]+$"
+ },
+ "population":{
+ "type":"integer"
+ },
+ },
+ "required":["name","population"],
+ }
+ messages=[
+ {
+ "role":"system",
+ "content":"You are a helpful assistant.",
+ },
+ {
+ "role":"user",
+ "content":"Give me the information of the capital of France in the JSON format.",
+ },
+ ]
+ prompt=llm.tokenizer.apply_chat_template(messages,tokenize=False,add_generation_prompt=True)
+
+ output=llm.generate(
+ prompt,
+ sampling_params=SamplingParams(max_tokens=256,guided_decoding=GuidedDecodingParams(json=json_schema)),
+ )
+ print(output.outputs[0].text)
+
Create a GuidedDecodingParams with the regex field specified with a regular expression, use it to create SamplingParams, and then pass to llm.generate or llm.generate_async.
+
from tensorrt_llm import LLM
+from tensorrt_llm.sampling_params import SamplingParams, GuidedDecodingParams
+
+if __name__ == "__main__":
+ llm=LLM("nvidia/Llama-3.1-8B-Instruct-FP8",guided_decoding_backend="xgrammar")
+
+ messages=[
+ {
+ "role":"system",
+ "content":"You are a helpful assistant.",
+ },
+ {
+ "role":"user",
+ "content":"What is the capital of France?",
+ },
+ ]
+ prompt=llm.tokenizer.apply_chat_template(messages,tokenize=False,add_generation_prompt=True)
+
+ output=llm.generate(
+ prompt,
+ sampling_params=SamplingParams(max_tokens=256,guided_decoding=GuidedDecodingParams(regex="(Paris|London)")),
+ )
+ print(output.outputs[0].text)
+
Create a GuidedDecodingParams with the grammar field specified with an EBNF grammar, use it to create SamplingParams, and then pass to llm.generate or llm.generate_async.
+
from tensorrt_llm import LLM
+from tensorrt_llm.sampling_params import SamplingParams, GuidedDecodingParams
+
+if __name__ == "__main__":
+ llm=LLM("nvidia/Llama-3.1-8B-Instruct-FP8",guided_decoding_backend="xgrammar")
+
+ ebnf_grammar="""root ::= description
+city ::= "London" | "Paris" | "Berlin" | "Rome"
+description ::= city " is " status
+status ::= "the capital of " country
+country ::= "England" | "France" | "Germany" | "Italy"
+"""
+ messages=[
+ {
+ "role":"system",
+ "content":"You are a helpful geography bot."
+ },
+ {
+ "role":"user",
+ "content":"Give me the information of the capital of France.",
+ },
+ ]
+ prompt=llm.tokenizer.apply_chat_template(messages,tokenize=False,add_generation_prompt=True)
+
+ output=llm.generate(
+ prompt,
+ sampling_params=SamplingParams(max_tokens=256,guided_decoding=GuidedDecodingParams(grammar=ebnf_grammar)),
+ )
+ print(output.outputs[0].text)
+
Create a GuidedDecodingParams with the structural_tag field specified with a structural tag string, use it to create SamplingParams, and then pass to llm.generate or llm.generate_async.
+
The structural tag is supported by the xgrammar backend only. It is a powerful and flexible tool to represent LLM output constraints. Please see structural tag usage for a comprehensive tutorial. Below is an example of function calling with a customized function-call format for Llama-3.1-8B-Instruct.
+
import json
+from tensorrt_llm import LLM
+from tensorrt_llm.sampling_params import SamplingParams, GuidedDecodingParams
+
+if __name__ == "__main__":
+ llm=LLM("nvidia/Llama-3.1-8B-Instruct-FP8",guided_decoding_backend="xgrammar")
+
+ tool_get_current_weather={
+ "type":"function",
+ "function":{
+ "name":"get_current_weather",
+ "description":"Get the current weather in a given location",
+ "parameters":{
+ "type":"object",
+ "properties":{
+ "city":{
+ "type":"string",
+ "description":"The city to find the weather for, e.g. 'San Francisco'",
+ },
+ "state":{
+ "type":"string",
+ "description":"the two-letter abbreviation for the state that the city is in, e.g. 'CA' which would mean 'California'",
+ },
+ "unit":{
+ "type":"string",
+ "description":"The unit to fetch the temperature in",
+ "enum":["celsius","fahrenheit"],
+ },
+ },
+ "required":["city","state","unit"],
+ },
+ },
+ }
+
+ tool_get_current_date={
+ "type":"function",
+ "function":{
+ "name":"get_current_date",
+ "description":"Get the current date and time for a given timezone",
+ "parameters":{
+ "type":"object",
+ "properties":{
+ "timezone":{
+ "type":"string",
+ "description":"The timezone to fetch the current date and time for, e.g. 'America/New_York'",
+ }
+ },
+ "required":["timezone"],
+ },
+ },
+ }
+
+ system_prompt=f"""# Tool Instructions
+- Always execute python code in messages that you share.
+- When looking for real time information use relevant functions if available else fallback to brave_search
+You have access to the following functions:
+Use the function 'get_current_weather' to: Get the current weather in a given location
+{tool_get_current_weather["function"]}
+Use the function 'get_current_date' to: Get the current date and time for a given timezone
+{tool_get_current_date["function"]}
+If a you choose to call a function ONLY reply in the following format:
+<{{start_tag}}={{function_name}}>{{parameters}}{{end_tag}}
+where
+start_tag => `<function`
+parameters => a JSON dict with the function argument name as key and function argument value as value.
+end_tag => `</function>`
+Here is an example,
+<function=example_function_name>{{"example_name": "example_value"}}</function>
+Reminder:
+- Function calls MUST follow the specified format
+- Required parameters MUST be specified
+- Only call one function at a time
+- Put the entire function call reply on one line
+- Always add your sources when using search results to answer the user query
+You are a helpful assistant."""
+ user_prompt="You are in New York. Please get the current date and time, and the weather."
+ structural_tag={
+ "type":"structural_tag",
+ "format":{
+ "type":"triggered_tags",
+ "triggers":["<function="],
+ "tags":[
+ {
+ "begin":"<function=get_current_weather>",
+ "content":{
+ "type":"json_schema",
+ "json_schema":tool_get_current_weather["function"]["parameters"]
+ },
+ "end":"</function>",
+ },
+ {
+ "begin":"<function=get_current_date>",
+ "content":{
+ "type":"json_schema",
+ "json_schema":tool_get_current_date["function"]["parameters"]
+ },
+ "end":"</function>",
+ },
+ ],
+ },
+ }
+
+ messages=[
+ {
+ "role":"system",
+ "content":system_prompt,
+ },
+ {
+ "role":"user",
+ "content":user_prompt,
+ },
+ ]
+ prompt=llm.tokenizer.apply_chat_template(messages,tokenize=False,add_generation_prompt=True)
+
+ output=llm.generate(
+ prompt,
+ sampling_params=SamplingParams(max_tokens=256,guided_decoding=GuidedDecodingParams(structural_tag=json.dumps(structural_tag))),
+ )
+ print(output.outputs[0].text)
+
+
+
The output would look like:
+
<function=get_current_date>{"timezone": "America/New_York"}</function>
+<function=get_current_weather>{"city": "New York", "state": "NY", "unit": "fahrenheit"}</function>
+
Helix is a context parallelism (CP) technique for the decode/generation phase of LLM inference. Unlike traditional attention-FFN disaggregation (AFD) techniques, which spatially separate attention and FFN blocks onto different GPUs, Helix temporally separates them by reconfiguring the same GPUs.
The KV Cache Connector is a flexible interface in TensorRT-LLM that enables remote or external access to the Key-Value (KV) cache. It allows developers to implement custom logic for loading, saving, and managing KV cache blocks, extending the capabilities of the standard KV cache manager.
+
This document explains the KV Cache Connector architecture and common use cases, and provides a detailed walkthrough of the included example.
The KV Cache Connector is designed to support a variety of advanced serving scenarios:
+
+
KV Cache Offloading: Move KV cache blocks from GPU memory to cheaper/larger storage (CPU RAM, NVMe SSD, or network storage) when they are not immediately needed, and reload them when required.
+
Custom Disaggregated Serving: Separate the prefill (context processing) and decode (token generation) phases onto different instances or machines. The connector can be used to transmit the KV cache generated during prefill to the decode instances.
+
KV Cache Sharing / P2P Transfer: Share KV cache states between different model instances or across peer-to-peer connections.
The connector architecture is split into two main components:
+
+
Scheduler (Leader): Responsible for orchestration. It decides what needs to be loaded or saved and builds metadata instructions. It runs only on the leader rank (rank 0).
+
Worker: Responsible for execution. It receives metadata from the scheduler and performs the actual data transfers (loading/saving) on the KV cache tensors. It runs on all ranks.
Description: The core orchestration method (build_connector_meta), called during the scheduling phase. It examines the current requests and decides which blocks need to be loaded from or saved to the external store.
+
Arguments: scheduler_output contains information about new requests, blocks allocated, and current request states.
+
Returns: An arbitrary metadata object (picklable) that describes the tasks for the workers. This object is broadcast to all workers.
Description: Called when a new request arrives (get_num_new_matched_tokens). It checks whether any KV cache can be loaded from an external KV store.
+
Returns: A tuple (num_tokens, is_async). num_tokens is the number of tokens found in the external cache. is_async indicates whether the loading happens asynchronously (in the background) or requires blocking.
Description: Called when a request completes generation (request_finished).
+
Returns: A boolean indicating if an asynchronous save operation is underway. If True, the system waits for the operation to complete before releasing the KV cache blocks.
Description: A synchronization point (wait_for_layer_load). Ensures that the KV cache for a specific layer is fully loaded before the model attempts to perform the forward pass on that layer.
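Putting these methods together, a do-nothing connector that satisfies the interface might look like the sketch below. Class and method names follow the example walked through in this document; the worker base class is assumed to be named KvCacheConnectorWorker (mirroring KvCacheConnectorScheduler), and the imports are the same as in the full example, so they are not repeated here.

# NOTE: sketch only. Import KvCacheConnectorScheduler, KvCacheConnectorWorker,
# TorchLlmArgs, SchedulerOutput and LlmRequest from TensorRT-LLM exactly as the
# full example above does (its import block is not repeated here).
import torch


class NoOpConnectorLeader(KvCacheConnectorScheduler):
    """Scheduler side: decides what to load/save. Runs on the leader rank only."""

    def __init__(self, llm_args: TorchLlmArgs):
        super().__init__(llm_args)

    def build_connector_meta(self, scheduler_output: SchedulerOutput):
        # No external loads or saves: return an empty (picklable) metadata object.
        return None

    def get_num_new_matched_tokens(self, request: LlmRequest,
                                   num_computed_tokens: int) -> tuple[int, bool]:
        # Nothing found in the external store, and nothing loaded asynchronously.
        return 0, False

    def update_state_after_alloc(self, request: LlmRequest, block_ids: list[int]):
        pass

    def request_finished(self, request: LlmRequest,
                         cache_block_ids: list[int]) -> bool:
        # No asynchronous save in flight.
        return False


class NoOpConnectorWorker(KvCacheConnectorWorker):
    """Worker side: performs the actual transfers. Runs on every rank."""

    def __init__(self, llm_args: TorchLlmArgs):
        super().__init__(llm_args)

    def register_kv_caches(self, kv_cache_tensor: torch.Tensor):
        pass

    def start_load_kv(self, stream: torch.cuda.Stream):
        pass

    def wait_for_layer_load(self, layer_idx: int, stream: torch.cuda.Stream):
        pass

    def save_kv_layer(self, layer_idx: int, stream: torch.cuda.Stream):
        pass

    def wait_for_save(self, stream: torch.cuda.Stream):
        pass

    def get_finished(self, finished_gen_req_ids: list[int],
                     started_loading_req_ids: list[int]) -> tuple[list[int], list[int]]:
        return [], []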
This example implements a file-system based KV cache.
1. Save: When a request finishes or needs to be swapped out, its KV blocks are saved to disk as .pt files.
2. Load: When a new request arrives with the same prompt prefix, the connector identifies the cached files and loads them back into GPU memory, skipping re-computation.
Metadata: The example defines a PersistentKvCacheConnectorMetadata dataclass containing lists of (file_path, block_id) tuples for both loading and saving. This simple structure allows the Scheduler to tell the Worker exactly which file corresponds to which GPU block index.
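A minimal sketch of such a metadata object (the field layout is inferred from the description above; the example's actual definition may differ slightly):

from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class PersistentKvCacheConnectorMetadata:
    # (file_path, block_id) pairs: which cached file to read into, or write from,
    # which device block index.
    load: list[tuple[Path, int]] = field(default_factory=list)
    save: list[tuple[Path, int]] = field(default_factory=list)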
+
Hashing Strategy: The PersistentKvCacheConnectorLeader hashes the token sequence of a block to generate a unique filename (e.g., {hash_value}.pt). This acts as the lookup key.
+
Worker Logic:
+
+
start_load_kv: Iterates through the load list provided in the metadata, loads the .pt file to CPU, and copies it to the specific block_id in the GPU tensor.
+
wait_for_save: Performs the reverse. It copies data from the GPU block_id to CPU and saves it to disk using torch.save.
This example illustrates the API mechanics but has several limitations that make it unsuitable for high-performance production use without modification:
+
+
Blocking I/O: The example uses torch.load and torch.save synchronously. In a real implementation, these should be offloaded to a background thread or asynchronous I/O handler to avoid stalling the GPU (a sketch of this approach follows this list).
+
Simplified Block Matching: The get_num_new_matched_tokens implementation in the example only matches full blocks. It does not handle partial cache hits.
+
FileSystem Latency: Storing one file per block can create high filesystem overhead.
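For the blocking I/O point in particular, one possible direction is to stage each block to host memory right after the forward pass and hand the slow file writes to a thread pool. The sketch below is illustrative only: attribute names follow the example worker, while the futures bookkeeping (and its wiring into get_finished()/request_finished()) is hypothetical.

import concurrent.futures
from pathlib import Path

import torch


class BackgroundSaveWorkerSketch:
    """Hypothetical variation of the example worker with non-blocking file writes."""

    def __init__(self):
        self.kv_cache_tensor = None   # set via register_kv_caches(), as in the example
        self._metadata = None         # set when the scheduler's metadata arrives
        self._pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)
        self._inflight_saves: list[concurrent.futures.Future] = []

    @staticmethod
    def _write_block(path, cpu_tensor):
        # Runs in a background thread; only the (slow) disk write happens here.
        if not Path(path).exists():
            torch.save(cpu_tensor, path)

    def wait_for_save(self, stream: torch.cuda.Stream):
        # Wait for the forward pass, stage each block to host memory on the main
        # thread, then let the pool persist it to disk in the background.
        stream.synchronize()
        for path, block_id in self._metadata.save:
            cpu_tensor = self.kv_cache_tensor[block_id].cpu()
            self._inflight_saves.append(
                self._pool.submit(self._write_block, path, cpu_tensor))

    def saves_done(self) -> bool:
        # A real connector would report completed request ids from get_finished()
        # instead of exposing a simple polling helper like this one.
        self._inflight_saves = [f for f in self._inflight_saves if not f.done()]
        return not self._inflight_saves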
The following examples demonstrate how to use TensorRT LLM’s multimodal support in various scenarios, including quick run examples, serving endpoints, and performance benchmarking.
You can then send OpenAI-compatible requests, such as via curl or API clients, to the server endpoint. See the curl chat client for multimodal script as an example.
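For instance, a minimal OpenAI-client request against a multimodal model served by trtllm-serve might look like the following sketch (the endpoint, model name, and image URL are placeholders; the accepted content parts depend on the model being served):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="tensorrt_llm")

response = client.chat.completions.create(
    model="Qwen2.5-VL-7B-Instruct",  # placeholder model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": "https://example.com/image.png"}},
        ],
    }],
    max_completion_tokens=64,
)
print(response.choices[0].message.content)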
@@ -602,9 +607,9 @@
different types of KV caches: contiguous and paged. The paged KV cache decomposes the KV cache into blocks that are distributed to the different requests by a cache manager during processing. That cache manager keeps track of the sequences, allocates new blocks from a pool, and recycles those blocks when required. See the simplified implementation of tensorrt_llm.runtime.KVCacheManager. A more efficient C++ implementation is included in the Batch Manager.
@@ -793,9 +798,9 @@ A more efficient C++ implementation is included in the
diff --git a/latest/features/parallel-strategy.html b/latest/features/parallel-strategy.html
index 51089555e1..f6b0172fb4 100644
--- a/latest/features/parallel-strategy.html
+++ b/latest/features/parallel-strategy.html
@@ -61,7 +61,7 @@
@@ -76,7 +76,7 @@
-
+
@@ -376,7 +376,9 @@
The PyTorch backend supports most of the sampling features that are supported on the C++ backend, such as temperature, top-k and top-p sampling, beam search, stop words, bad words, penalty, context and generation logits, log probability, guided decoding, and logits processors.
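For example, a minimal sketch of configuring a few of these options through the LLM API (parameter names follow tensorrt_llm.SamplingParams; the model name is a placeholder):

from tensorrt_llm import LLM
from tensorrt_llm.sampling_params import SamplingParams

if __name__ == "__main__":
    llm = LLM("nvidia/Llama-3.1-8B-Instruct-FP8")  # placeholder model

    # Temperature plus top-k and top-p sampling with a small token budget; the other
    # features listed above (beam search, penalties, logits processors, ...) are
    # configured through additional SamplingParams fields.
    sampling_params = SamplingParams(
        max_tokens=64,
        temperature=0.8,
        top_k=40,
        top_p=0.95,
    )

    output = llm.generate("What is the capital of France?", sampling_params=sampling_params)
    print(output.outputs[0].text)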
@@ -521,7 +526,7 @@ on NGC. This is likely the simplest way to obtain TensorRT LLM. Please refer to
Container image tags
In the example shell commands, x.y.z corresponds to the TensorRT-LLM container
version to use. If omitted, IMAGE_TAG will default to tensorrt_llm.__version__
-(e.g., this documentation was generated from the 1.2.0rc4 source tree).
+(e.g., this documentation was generated from the 1.2.0rc5 source tree).
If this does not work, e.g., because a container for the version you are
currently working with has not been released yet, you can try using a
container published for a previous
@@ -658,9 +663,9 @@ for all related options.
Install CUDA Toolkit 13.0 following the CUDA Installation Guide for Linux and make sure the CUDA_HOME environment variable is properly set.
The cuda-compat-13-0 package may be required depending on your system's NVIDIA GPU driver version. For additional information, refer to the CUDA Forward Compatibility.
# By default, PyTorch CUDA 12.8 package is installed. Install PyTorch CUDA 13.0 package to align with the CUDA version used for building TensorRT LLM wheels.
pip3 install torch==2.9.0 torchvision --index-url https://download.pytorch.org/whl/cu130
@@ -733,9 +740,9 @@ pip3install--upgrade
@@ -702,9 +707,9 @@ This feature is currently in prototype, and the related API is subjected to chan
diff --git a/latest/legacy/advanced/executor.html b/latest/legacy/advanced/executor.html
index ae12628df8..17e37c3ece 100644
--- a/latest/legacy/advanced/executor.html
+++ b/latest/legacy/advanced/executor.html
@@ -61,7 +61,7 @@
@@ -74,7 +74,7 @@
-
+
@@ -374,7 +374,9 @@
TensorRT-LLM includes a high-level C++ API called the Executor API which allows you to execute requests
asynchronously, with in-flight batching, and without the need to define callbacks.
A software component (referred to as “the client” in the text that follows) can interact
with the executor using the API defined in the executor.h file.
For details about the API, refer to the _cpp_gen/executor.rst.
The following sections provide an overview of the main classes defined in the Executor API.
@@ -585,7 +590,7 @@ This allows the runtime to reconfigure itself for a new beam width when no reque
stop_token_ids=[tokenizer.eos_token_id]
Refer to tensorrt_llm/llmapi/tokenizer.py for more details. You may dump these materials to disk and reload them into the C++ runtime for use.
Each request can be optionally specified with a GuidedDecodingParams, which defines the desired structured format. Currently, it supports four types:
GuidedDecodingParams::GuideType::kJSON: The generated text is amenable to JSON format;
@@ -634,12 +639,12 @@ This allows the runtime to reconfigure itself for a new beam width when no reque
Python bindings for the Executor API are also available to use the Executor API from Python. The Python bindings are defined in bindings.cpp and, once built, are available in the package tensorrt_llm.bindings.executor. Running help('tensorrt_llm.bindings.executor') in a Python interpreter will provide an overview of the available classes.
In addition, three Python examples are provided to demonstrate how to use the Python bindings to the Executor API for single and multi-GPU models. They can be found in examples/bindings.
In-flight Batching with the Triton Inference Server
@@ -685,9 +690,9 @@ reach that point).
the different requests by a cache manager during processing. That cache manager keeps track of the sequences, allocates new blocks from a pool and recycles those blocks when required. See the simplified implementation of tensorrt_llm.runtime.KVCacheManager. A more efficient C++ implementation is included in the Batch Manager.
@@ -977,9 +982,9 @@ is computed as:
diff --git a/latest/legacy/advanced/gpt-runtime.html b/latest/legacy/advanced/gpt-runtime.html
index 953cf6d029..19e4dc089e 100644
--- a/latest/legacy/advanced/gpt-runtime.html
+++ b/latest/legacy/advanced/gpt-runtime.html
@@ -63,7 +63,7 @@
@@ -76,7 +76,7 @@
-
+
@@ -376,7 +376,9 @@
@@ -745,9 +750,9 @@ An “event” is any significant change in the lifecycle or state of a KV cache
diff --git a/latest/legacy/advanced/kv-cache-reuse.html b/latest/legacy/advanced/kv-cache-reuse.html
index ca3f9eae58..6399eb592b 100644
--- a/latest/legacy/advanced/kv-cache-reuse.html
+++ b/latest/legacy/advanced/kv-cache-reuse.html
@@ -61,7 +61,7 @@
@@ -74,7 +74,7 @@
-
+
@@ -374,7 +374,9 @@
There are a few pitfalls that can prevent KV cache reuse when it seems possible. KV cache state only becomes reusable after the request that computed it terminates. If you have a shared system prompt, the first request computes the KV cache state for the system prompt and the second request reuses it, but only if the second request launches after the first request has completed. If you run with a large batch size, it is likely that many requests sharing a common system prompt will be launched before the first request has terminated. No reuse occurs until one of those requests terminates; subsequently scheduled requests can then reuse its blocks.
KV cache state for system prompts remains reusable until memory is needed to launch a new request or to propagate an existing one. When this happens, reusable blocks are evicted based on LRU. System prompts that are used frequently have a better chance of remaining reusable, but there is no guarantee, since launching new requests takes priority over possible reuse. Running with a larger batch size or larger output sequence lengths, for example, reduces the probability of KV cache blocks being reused, since it increases memory needs.
KV cache state is stored in blocks, and each block holds multiple tokens. Only full blocks can be shared by multiple requests, so the block size matters. Partially matched blocks can also be reused, but that creates a new copy of the block for each sequence. The block size is a trade-off: a larger block size may improve the efficiency of compute kernels, but it reduces the likelihood of KV cache state reuse. The block size defaults to 128 tokens; this can be changed when the model is built with the trtllm-build command, for example
trtllm-build --tokens_per_block 32 ...
will create a model where one KV cache block can hold 32 tokens. Note that tokens_per_block must be a power of 2.
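For the PyTorch backend / LLM API flow used elsewhere in this documentation, the block size and reuse behavior are controlled through the KV cache configuration rather than trtllm-build. A minimal sketch, assuming KvCacheConfig exposes tokens_per_block and enable_block_reuse and using a placeholder model name:

from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

if __name__ == "__main__":
    # 32-token blocks, with block reuse enabled so shared prefixes (for example a
    # common system prompt) can be served from previously computed KV cache blocks.
    kv_cache_config = KvCacheConfig(tokens_per_block=32, enable_block_reuse=True)

    llm = LLM("nvidia/Llama-3.1-8B-Instruct-FP8",  # placeholder model
              kv_cache_config=kv_cache_config)

    print(llm.generate("Hello, how are you?").outputs[0].text)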
@@ -718,9 +723,9 @@ Assume vocabulary size is 100, which means normal text token ids are in range [0
diff --git a/latest/legacy/advanced/lora.html b/latest/legacy/advanced/lora.html
index 8d819bbe78..16e9698922 100644
--- a/latest/legacy/advanced/lora.html
+++ b/latest/legacy/advanced/lora.html
@@ -61,7 +61,7 @@
@@ -74,7 +74,7 @@
-
+
@@ -374,7 +374,9 @@
@@ -832,9 +837,9 @@ However, similar to any new model, you can follow the same approach to define yo
diff --git a/latest/legacy/advanced/weight-streaming.html b/latest/legacy/advanced/weight-streaming.html
index 3d52504b14..f7ac5361cf 100644
--- a/latest/legacy/advanced/weight-streaming.html
+++ b/latest/legacy/advanced/weight-streaming.html
@@ -61,7 +61,7 @@
@@ -74,7 +74,7 @@
-
+
@@ -374,7 +374,9 @@
@@ -521,7 +526,7 @@ to create graph representations of deep neural networks in TensorRT. To become
familiar with the core concepts of the TensorRT API, refer to the
Core Concepts
section of the TensorRT documentation before proceeding further.
-
In TensorRT-LLM, the tensorrt_llm.Builder class
contains a
tensorrt.Builder
object. That instance is used in the tensorrt_llm.Builder.create_network
@@ -529,7 +534,7 @@ method to create an instance of the
tensorrt.INetworkDefinition
class. The INetworkDefinition object can then be populated using the free
functions defined in the
-tensorrt_llm.functional.
A simple example of such a free function is tensorrt_llm.activation that inserts a
tensorrt.IActivationLayer
node in the graph of the model:
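The snippet itself is not reproduced in this diff; roughly, such a free function wraps the corresponding INetworkDefinition call and returns a TensorRT-LLM tensor, along the lines of this sketch (helper names such as default_trtnet and _create_tensor follow tensorrt_llm.functional and may differ between versions):

import tensorrt as trt

from tensorrt_llm.functional import Tensor, _create_tensor, default_trtnet


def activation(input: Tensor, act_type: trt.ActivationType) -> Tensor:
    # Insert a tensorrt.IActivationLayer node into the network currently being
    # built and wrap its output back into a TensorRT-LLM Tensor.
    layer = default_trtnet().add_activation(input.trt_tensor, act_type)
    return _create_tensor(layer.get_output(0), layer)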
@@ -664,14 +669,14 @@ limitation, TensorRT offers a powerful mechanism known as plugins. The plugins are nodes inserted in the network graph definition that map to user-defined GPU kernels. TensorRT-LLM uses a number of such plugins. They can be found in the cpp/tensorrt_llm/plugins directory. Plugins are written in C++ and follow a well-defined interface described in the Extending TensorRT with Custom Layers section of the TensorRT Developer Guide. When executed within a TensorRT engine, plugins trigger the execution of their encapsulated GPU kernels. A fairly simple example of plugins is the QuantizeTensorPlugin that triggers a CUDA kernel in the QuantizeTensorPlugin::enqueue member function:
// In cpp/tensorrt_llm/plugins/quantizeTensorPlugin/quantizeTensorPlugin.cpp:
@@ -715,7 +720,7 @@ using TensorRT plugins that wrap communication primitives from the NCCL library, as well as a plugin that optimizes the All-Reduce primitive in the presence of All-to-all connections between GPUs (through NVSwitch in DGX systems).
@@ -917,9 +922,9 @@ This can be enabled via the LLM-API as such
diff --git a/latest/legacy/performance/performance-tuning-guide/useful-runtime-flags.html b/latest/legacy/performance/performance-tuning-guide/useful-runtime-flags.html
index 23a6c1a13f..9a68cbabda 100644
--- a/latest/legacy/performance/performance-tuning-guide/useful-runtime-flags.html
+++ b/latest/legacy/performance/performance-tuning-guide/useful-runtime-flags.html
@@ -2589,9 +2594,9 @@ the number of tokens used for each task, should be equal to prompt_embedding_tab
diff --git a/latest/legacy/python-api/tensorrt_llm.models.html b/latest/legacy/python-api/tensorrt_llm.models.html
index 652fbb14eb..e5f1f8bbb2 100644
--- a/latest/legacy/python-api/tensorrt_llm.models.html
+++ b/latest/legacy/python-api/tensorrt_llm.models.html
The TensorRT-LLM C++ runtime uses a stream-ordered memory allocator to allocate and free buffers; see BufferManager::initMemoryPool, which uses the default memory pool managed by the CUDA driver. When a TrtGptModel object is destroyed, memory is returned to the memory pool and can be reused by the next instance of a TrtGptModel object. Memory will be released from the pool if it is required for other memory allocations.
+
However, nvidia-smi may still show high memory occupation after memory is returned to the CUDA driver's memory pool. This should not be a concern and is intended behavior. The amount of reserved and free memory in the pool can be inspected with BufferManager::memoryPoolReserved() and BufferManager::memoryPoolFree(), respectively.
@@ -577,7 +582,7 @@ maintaining the accuracy of the network (on downstream tasks).
weights of the model. TensorRT-LLM includes scripts to prepare the model to run using the SmoothQuant method. Examples of how to enable SmoothQuant for GPT, GPT-J and LLaMA can be found in the examples/quantization folder of that release.
@@ -586,8 +591,8 @@ a model and dequantizing those weights on-the-fly in linear layers (Matmuls).
The activations are encoded using floating-point values (FP16 or BF16).
To use INT4/INT8 Weight-Only methods, the user must determine the scaling factors to use to quantize and dequantize the weights of the model.
+
This release includes examples for GPT and LLaMA.
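In recent releases, the same families of methods can also be requested through the LLM API rather than via standalone scripts. The sketch below is a hedged illustration, assuming QuantConfig is exposed by tensorrt_llm.llmapi and QuantAlgo by tensorrt_llm.quantization; the checkpoint name is a placeholder.
from tensorrt_llm.llmapi import LLM, QuantConfig
from tensorrt_llm.quantization import QuantAlgo

# Request INT4 weight-only quantization (FP16 activations); the scaling factors for
# the weights are computed while the checkpoint is converted and the engine is built.
quant_config = QuantConfig(quant_algo=QuantAlgo.W4A16)

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder checkpoint
    quant_config=quant_config,
)
print(llm.generate(["Hello, world"])[0].outputs[0].text)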
@@ -691,9 +696,9 @@ This feature is currently in beta, and the related API is subject to change in
diff --git a/latest/llm-api/index.html b/latest/llm-api/index.html
index 25232c5d7f..ee02f6e4fb 100644
--- a/latest/llm-api/index.html
+++ b/latest/llm-api/index.html
otlp_traces_endpoint (Optional[str]) – prototype Target URL to which OpenTelemetry traces will be sent. Defaults to None.
return_perf_metrics (bool) – prototype Return perf metrics. Defaults to False.
orchestrator_type (Optional[Literal['rpc', 'ray']]) – prototype The orchestrator type to use. Defaults to None, which uses MPI.
+
env_overrides (Optional[Dict[str, str]]) – prototype [EXPERIMENTAL] Environment variable overrides. NOTE: import-time-cached env vars in the code won’t update unless the code fetches them from os.environ on demand. Defaults to None.
garbage_collection_gen0_threshold (int) – beta Threshold for Python garbage collection of generation 0 objects. Lower values trigger more frequent garbage collection. Defaults to 20000.
cuda_graph_config (Optional[tensorrt_llm.llmapi.llm_args.CudaGraphConfig]) – beta CUDA graph config. If true, use CUDA graphs for decoding. CUDA graphs are only created for the batch sizes in cuda_graph_config.batch_sizes, and are enabled for batches that consist of decoding requests only (the reason is that it's hard to capture a single graph with prefill requests since the input shapes are a function of the sequence lengths). Note that each CUDA graph can use up to 200 MB of extra memory. Defaults to None. (See the sketch after this parameter list.)
attn_backend (str) – beta Attention backend to use. Defaults to TRTLLM.
sampler_type (Union[str, tensorrt_llm.llmapi.llm_args.SamplerType]) – beta The type of sampler to use. Options are TRTLLMSampler, TorchSampler or auto. Defaults to auto, which will use TorchSampler unless BeamSearch is requested.
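A minimal sketch of how a few of the parameters listed above fit together when constructing an LLM; class and argument names follow the list above, while the checkpoint name and batch sizes are placeholder values.
from tensorrt_llm.llmapi import LLM
from tensorrt_llm.llmapi.llm_args import CudaGraphConfig

llm = LLM(
    model="Qwen/Qwen2.5-0.5B-Instruct",                        # placeholder checkpoint
    cuda_graph_config=CudaGraphConfig(batch_sizes=[1, 2, 4]),  # graphs captured only for these decode batch sizes
    attn_backend="TRTLLM",                                     # default attention backend
    sampler_type="auto",                                       # TorchSampler unless beam search is requested
    garbage_collection_gen0_threshold=20000,                   # default gen-0 GC threshold
)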
@@ -627,6 +634,7 @@ If checkpoint_format and checkpoint_loader are both provided, checkpoint_loader
mm_encoder_only (bool) – prototype Only load/execute the vision encoder part of the full model. Defaults to False.
ray_worker_extension_cls (Optional[str]) – prototype The full worker extension class name including module path. Allows users to extend the functions of the RayGPUWorker class. Defaults to None.
enable_sleep (bool) – prototype Enable LLM sleep feature. Sleep feature requires extra setup that may slow down model loading. Only enable it if you intend to use this feature. Defaults to False.
+
disable_flashinfer_sampling (bool) – prototype Disable the use of FlashInfer.sampling. This option is likely to be removed in the future. Defaults to False.
@@ -5352,7 +5360,7 @@ a subset of the possible backends.
@@ -17476,6 +17516,12 @@ If checkpoint_format and checkpoint_loader are both provided, checkpoint_loader
beta CUDA graph config. If true, use CUDA graphs for decoding. CUDA graphs are only created for the batch sizes in cuda_graph_config.batch_sizes, and are enabled for batches that consist of decoding requests only (the reason is that it's hard to capture a single graph with prefill requests since the input shapes are a function of the sequence lengths). Note that each CUDA graph can use up to 200 MB of extra memory.
@@ -17548,6 +17594,12 @@ If checkpoint_format and checkpoint_loader are both provided, checkpoint_loader
prototype Enable LLM sleep feature. Sleep feature requires extra setup that may slow down model loading. Only enable it if you intend to use this feature.
prototype [EXPERIMENTAL] Environment variable overrides. NOTE: import-time-cached env vars in the code won’t update unless the code fetches them from os.environ on demand.