Models
- class tensorrt_llm.models.BaichuanForCausalLM(num_layers, num_heads, hidden_size, vocab_size, hidden_act, max_position_embeddings, position_embedding_type, dtype, mlp_hidden_size=None, mapping=<tensorrt_llm.mapping.Mapping object>)[source]
Bases: BaichuanModel, GenerationMixin
- forward(input_ids: Tensor, position_ids=None, use_cache=False, last_token_ids=None, attention_mask=None, kv_cache_params=None, attention_params=None)[source]
- prepare_inputs(max_batch_size, max_input_len, max_new_tokens, use_cache, max_beam_width, max_num_tokens: int | None = None)[source]
@brief: Prepare input tensors for the model; the given sizes are used to determine the ranges of the dimensions when using TRT dynamic shapes.
@return: a list of values that can be fed into self.forward()
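The decoder-only classes in this module all follow the same build pattern: construct the model inside a network scope, call prepare_inputs() to declare input tensors with dynamic ranges, then call the model on those inputs to trace the graph. The following is a minimal sketch assuming the Builder/net_guard flow used by the TensorRT-LLM example build scripts; every hyperparameter and size limit is a placeholder, not a value taken from this reference.

```python
import tensorrt_llm
from tensorrt_llm._utils import str_dtype_to_trt
from tensorrt_llm.functional import PositionEmbeddingType
from tensorrt_llm.models import BaichuanForCausalLM
from tensorrt_llm.network import net_guard

builder = tensorrt_llm.Builder()
builder_config = builder.create_builder_config(name='baichuan', precision='float16')
network = builder.create_network()

with net_guard(network):
    # Placeholder hyperparameters -- use the values from your checkpoint's config.
    model = BaichuanForCausalLM(
        num_layers=32, num_heads=32, hidden_size=4096, vocab_size=64000,
        hidden_act='silu', max_position_embeddings=4096,
        position_embedding_type=PositionEmbeddingType.rope_gpt_neox,
        dtype=str_dtype_to_trt('float16'))
    network.set_named_parameters(model.named_parameters())

    # Declare input tensors whose optimization ranges follow the max_* limits.
    inputs = model.prepare_inputs(max_batch_size=8, max_input_len=512,
                                  max_new_tokens=128, use_cache=True,
                                  max_beam_width=1)
    # Calling the model traces forward() and populates the TensorRT network.
    model(*inputs)

engine = builder.build_engine(network, builder_config)
```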
- class tensorrt_llm.models.BertForQuestionAnswering(num_layers, num_heads, hidden_size, vocab_size, hidden_act, max_position_embeddings, type_vocab_size, num_labels=2, mapping=<tensorrt_llm.mapping.Mapping object>, dtype=None)[source]
Bases: Module
- class tensorrt_llm.models.BertModel(num_layers, num_heads, hidden_size, vocab_size, hidden_act, max_position_embeddings, type_vocab_size, mapping=<tensorrt_llm.mapping.Mapping object>, dtype=None)[source]
Bases: Module
- class tensorrt_llm.models.BloomForCausalLM(num_layers, num_heads, hidden_size, vocab_size, max_position_embeddings, hidden_act='gelu', dtype=None, mapping=<tensorrt_llm.mapping.Mapping object>, mlp_hidden_size=None, bias=True, quant_mode=QuantMode.None, multi_query_mode=False, use_parallel_embedding=False, embedding_sharding_dim=0, share_embedding_table=False)[source]
Bases: BloomModel, GenerationMixin
- forward(input_ids: Tensor, position_ids=None, use_cache=False, last_token_ids=None, attention_mask=None, kv_cache_params=None, attention_params=None)[source]
- prepare_inputs(max_batch_size, max_input_len, max_new_tokens, use_cache, max_beam_width: int = 1)[source]
@brief: Prepare input tensors for the model; the given sizes are used to determine the ranges of the dimensions when using TRT dynamic shapes.
@return: a list of values that can be fed into self.forward()
- class tensorrt_llm.models.BloomModel(num_layers, num_heads, hidden_size, vocab_size, hidden_act, max_position_embeddings, dtype=None, mapping=<tensorrt_llm.mapping.Mapping object>, mlp_hidden_size=None, bias=True, quant_mode=QuantMode.None, multi_query_mode=False, use_parallel_embedding=False, embedding_sharding_dim=0)[source]
Bases: Module
- class tensorrt_llm.models.ChatGLM2HeadModel(hidden_size, num_attention_heads, kv_channels=128, multi_query_group_num=2, apply_query_key_layer_scaling=False, attention_mask_type=AttentionMaskType.causal, qkv_bias=True, linear_bias=False, use_int8_kv_cache=False, mapping=<tensorrt_llm.mapping.Mapping object>, ffn_hiden_size=13696, num_layers=28, eps=1e-05, act_func='swiglu', dtype=<DataType.HALF: 1>, quant_mode=QuantMode.None, max_seq_length=32768, vocab_size=65024, use_cache=True, kv_cache_block_pointers=None)[source]
Bases: ChatGLM2Model, GenerationMixin
- forward(input_ids=None, position_ids=None, last_token_ids=None, kv_cache_params=None, attention_params=None)[source]
- prepare_inputs(max_batch_size, max_input_len, max_new_tokens, use_cache, max_beam_width: int = 1)[source]
@brief: Prepare input tensors for the model; the given sizes are used to determine the ranges of the dimensions when using TRT dynamic shapes.
@return: a list of values that can be fed into self.forward()
- class tensorrt_llm.models.ChatGLM2Model(hidden_size, num_attention_heads, kv_channels=128, multi_query_group_num=2, apply_query_key_layer_scaling=False, attention_mask_type=AttentionMaskType.causal, qkv_bias=True, linear_bias=False, use_int8_kv_cache=False, mapping=<tensorrt_llm.mapping.Mapping object>, ffn_hiden_size=13696, num_layers=28, eps=1e-05, act_func='swiglu', dtype=<DataType.HALF: 1>, quant_mode=QuantMode.None, max_seq_length=32768, vocab_size=65024)[source]
Bases: Module
- class tensorrt_llm.models.ChatGLM6BHeadModel(num_layers, num_heads, hidden_size, vocab_size, hidden_act, max_position_embeddings, dtype, mapping=<tensorrt_llm.mapping.Mapping object>, apply_query_key_layer_scaling=False, inter_size=None, bias=True, quant_mode=QuantMode.None)[source]
Bases: ChatGLM6BModel
- forward(input_ids=None, position_ids=None, use_cache=False, last_token_ids=None, kv_cache_params=None, attention_params=None)[source]
- prepare_inputs(max_batch_size, max_input_len, max_new_tokens, use_cache, max_beam_width: int = 1)[source]
@brief: Prepare input tensors for the model; the given sizes are used to determine the ranges of the dimensions when using TRT dynamic shapes.
@return: a list of values that can be fed into self.forward()
- class tensorrt_llm.models.ChatGLM6BModel(num_layers, num_heads, hidden_size, vocab_size, hidden_act, max_position_embeddings, dtype=None, mapping=<tensorrt_llm.mapping.Mapping object>, apply_query_key_layer_scaling=False, inter_size=None, bias=True, quant_mode=QuantMode.None)[source]
Bases: Module
- class tensorrt_llm.models.DecoderModel(num_layers, num_heads, hidden_size, ffn_hidden_size, encoder_num_heads, encoder_hidden_size, vocab_size, dtype, logits_dtype='float32', num_kv_heads=None, max_position_embeddings=None, has_position_embedding=False, relative_attention=False, max_distance=None, num_buckets=None, type_vocab_size=None, has_embedding_layernorm=False, has_embedding_scale=False, q_scaling=1.0, has_attention_qkvo_bias=False, has_mlp_bias=False, has_model_final_layernorm=False, layernorm_eps=1e-05, layernorm_position=LayerNormPositionType.pre_layernorm, layernorm_type=LayerNormType.LayerNorm, hidden_act='relu', has_lm_head_bias=False, tp_group=None, tp_size=1, residual_scaling=1.0)[source]
Bases: Module
- forward(decoder_input_ids: Tensor, encoder_output: Tensor, position_ids=None, token_type_ids=None, use_cache=False, attention_mask=None, last_token_ids=None, kv_cache_params=None, attention_params=None)[source]
- prepare_inputs(num_layers, max_batch_size, max_beam_width, max_input_len, max_new_tokens, max_encoder_input_len)[source]
@brief: Prepare input tensors for the model; the given sizes are used to determine the ranges of the dimensions when using TRT dynamic shapes.
@return: a list of values that can be fed into self.forward()
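Unlike the decoder-only classes, DecoderModel.prepare_inputs also takes the layer count and a bound on the encoder-side sequence length, and its forward() consumes the encoder output. Below is a hedged sketch of just that call, assuming (as for the other classes) that prepare_inputs returns values in the order forward() expects; the helper name and all sizes are illustrative.

```python
from tensorrt_llm.models import DecoderModel

def trace_decoder(decoder: DecoderModel, num_layers: int) -> None:
    """Declare ranged inputs for an already-constructed DecoderModel and trace it.

    Call inside the same net_guard(network) scope that was used to build the model.
    """
    inputs = decoder.prepare_inputs(
        num_layers=num_layers,       # must match the decoder depth
        max_batch_size=8,
        max_beam_width=1,
        max_input_len=1,             # decoder-side prompt length
        max_new_tokens=200,
        max_encoder_input_len=512)   # bound on the sequence behind encoder_output
    # Tracing the call records forward(decoder_input_ids, encoder_output, ...).
    decoder(*inputs)
```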
- class tensorrt_llm.models.EncoderModel(num_layers, num_heads, hidden_size, ffn_hidden_size, vocab_size, dtype, num_kv_heads=None, max_position_embeddings=None, has_position_embedding=False, relative_attention=False, max_distance=None, num_buckets=None, type_vocab_size=None, has_embedding_layernorm=False, has_embedding_scale=False, q_scaling=1.0, has_attention_qkvo_bias=False, has_mlp_bias=False, has_model_final_layernorm=False, layernorm_eps=1e-05, layernorm_position=LayerNormPositionType.pre_layernorm, layernorm_type=LayerNormType.LayerNorm, hidden_act='relu', tp_group=None, tp_size=1, residual_scaling=1.0)[source]
Bases: Module
- class tensorrt_llm.models.FalconForCausalLM(num_layers: int, num_heads: int, hidden_size: int, vocab_size: int, max_position_embeddings: int, hidden_act: str = 'gelu', dtype: str | ~tensorrt.tensorrt.DataType | None = None, num_kv_heads: int | None = None, mlp_hidden_size: int | None = None, bias: bool = True, quant_mode: ~tensorrt_llm.quantization.mode.QuantMode = QuantMode.None, use_alibi: bool = True, parallel_attention: bool = False, new_decoder_architecture: bool = False, logits_dtype: str | ~tensorrt.tensorrt.DataType = 'float32', mapping=<tensorrt_llm.mapping.Mapping object>)[source]
Bases: FalconModel, GenerationMixin
- forward(input_ids: Tensor, position_ids=None, use_cache=False, last_token_ids=None, attention_mask=None, kv_cache_params=None, attention_params=None, hidden_states=None, all_reduce_workspace=None)[source]
- prepare_inputs(max_batch_size: int, max_input_len: int, max_new_tokens: int, use_cache: bool, max_beam_width: int = 1, max_num_tokens: int | None = None)[source]
@brief: Prepare input tensors for the model; the given sizes are used to determine the ranges of the dimensions when using TRT dynamic shapes.
@return: a list of values that can be fed into self.forward()
- class tensorrt_llm.models.FalconModel(num_layers: int, num_heads: int, hidden_size: int, vocab_size: int, hidden_act: int, max_position_embeddings: int, dtype: str | ~tensorrt.tensorrt.DataType | None = None, mapping: ~tensorrt_llm.mapping.Mapping = <tensorrt_llm.mapping.Mapping object>, num_kv_heads: int | None = None, mlp_hidden_size: int | None = None, bias: bool = True, quant_mode: ~tensorrt_llm.quantization.mode.QuantMode = QuantMode.None, use_alibi: bool = True, parallel_attention: bool = False, new_decoder_architecture: bool = False)[source]
Bases: Module
- class tensorrt_llm.models.GPTJForCausalLM(num_layers, num_heads, hidden_size, vocab_size, hidden_act, max_position_embeddings, rotary_dim, dtype, logits_dtype='float32', mapping=<tensorrt_llm.mapping.Mapping object>, quant_mode=QuantMode.None)[source]
Bases: GPTJModel
- forward(input_ids: Tensor, position_ids=None, use_cache=False, last_token_ids=None, kv_cache_params=None, attention_params=None)[source]
- prepare_inputs(max_batch_size, max_input_len, max_new_tokens, use_cache, max_beam_width, max_num_tokens: int | None = None, enable_two_optimization_profiles: bool = False)[source]
@brief: Prepare input tensors for the model; the given sizes are used to determine the ranges of the dimensions when using TRT dynamic shapes.
@return: a list of values that can be fed into self.forward()
- class tensorrt_llm.models.GPTJModel(num_layers, num_heads, hidden_size, vocab_size, hidden_act, max_position_embeddings, rotary_dim, dtype=None, mapping=<tensorrt_llm.mapping.Mapping object>, quant_mode=QuantMode.None)[source]
Bases: Module
- class tensorrt_llm.models.GPTLMHeadModel(num_layers, num_heads, hidden_size, vocab_size, hidden_act, max_position_embeddings, dtype, logits_dtype='float32', mapping=<tensorrt_llm.mapping.Mapping object>, apply_query_key_layer_scaling=False, position_embedding_type=PositionEmbeddingType.learned_absolute, rotary_embedding_percentage=1.0, inter_size=None, bias=True, quant_mode=QuantMode.None, multi_query_mode=False, use_prompt_tuning=False, use_parallel_embedding=False, embedding_sharding_dim=0, share_embedding_table=False)[source]
Bases: GPTModel, GenerationMixin
- forward(input_ids: Tensor, position_ids=None, use_cache=False, last_token_ids=None, attention_mask=None, kv_cache_params=None, attention_params=None, prompt_embedding_table=None, prompt_tasks=None, prompt_vocab_size=None, workspace=None)[source]
- prepare_inputs(max_batch_size, max_input_len, max_new_tokens, use_cache, max_beam_width: int = 1, max_num_tokens: int | None = None, prompt_embedding_table_size: int = 128, gather_all_token_logits: bool = False)[source]
@brief: Prepare input tensors for the model; the given sizes are used to determine the ranges of the dimensions when using TRT dynamic shapes.
@return: a list of values that can be fed into self.forward()
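Compared with most of the other decoder-only classes, GPTLMHeadModel.prepare_inputs adds prompt_embedding_table_size (relevant when the model was constructed with use_prompt_tuning=True) and gather_all_token_logits. A hedged sketch with illustrative sizes; the helper name and the comment about gather_all_token_logits exposing context-token logits are assumptions, not statements from this reference.

```python
from tensorrt_llm.models import GPTLMHeadModel

def declare_gpt_inputs(gpt: GPTLMHeadModel):
    """Declare ranged inputs for a GPTLMHeadModel built with use_prompt_tuning=True.

    Sizes are illustrative; call inside the net_guard scope that built `gpt`.
    """
    inputs = gpt.prepare_inputs(
        max_batch_size=4,
        max_input_len=1024,
        max_new_tokens=256,
        use_cache=True,
        max_beam_width=1,
        prompt_embedding_table_size=256,  # rows reserved for virtual-token (p-tuning) embeddings
        gather_all_token_logits=False)    # assumed: True also exposes logits for context tokens
    gpt(*inputs)  # traces forward(), including the prompt-tuning inputs
    return inputs
```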
- class tensorrt_llm.models.GPTModel(num_layers, num_heads, hidden_size, vocab_size, hidden_act, max_position_embeddings, dtype=None, mapping=<tensorrt_llm.mapping.Mapping object>, apply_query_key_layer_scaling=False, position_embedding_type=PositionEmbeddingType.learned_absolute, rotary_embedding_percentage=1.0, inter_size=None, bias=True, quant_mode=QuantMode.None, multi_query_mode=False, use_prompt_tuning=False, use_parallel_embedding=False, embedding_sharding_dim=0)[source]
Bases: Module
- class tensorrt_llm.models.GPTNeoXForCausalLM(num_layers, num_heads, hidden_size, vocab_size, hidden_act, max_position_embeddings, rotary_dim, dtype, position_embedding_type=PositionEmbeddingType.rope_gpt_neox, mapping=<tensorrt_llm.mapping.Mapping object>, apply_query_key_layer_scaling=False, use_parallel_embedding=False, embedding_sharding_dim=0)[source]
Bases: GPTNeoXModel, GenerationMixin
- forward(input_ids: Tensor, position_ids=None, use_cache=False, last_token_ids=None, kv_cache_params=None, attention_params=None)[source]
- prepare_inputs(max_batch_size, max_input_len, max_new_tokens, use_cache, max_beam_width)[source]
@brief: Prepare input tensors for the model; the given sizes are used to determine the ranges of the dimensions when using TRT dynamic shapes.
@return: a list of values that can be fed into self.forward()
- class tensorrt_llm.models.GPTNeoXModel(num_layers, num_heads, hidden_size, vocab_size, hidden_act, max_position_embeddings, rotary_dim, dtype=None, position_embedding_type=PositionEmbeddingType.rope_gpt_neox, mapping=<tensorrt_llm.mapping.Mapping object>, apply_query_key_layer_scaling=False, use_parallel_embedding=False, embedding_sharding_dim=0)[source]
Bases: Module
- class tensorrt_llm.models.LLaMAForCausalLM(num_layers, num_heads, num_kv_heads, hidden_size, vocab_size, hidden_act, max_position_embeddings, dtype, logits_dtype='float32', mlp_hidden_size=None, position_embedding_type=PositionEmbeddingType.rope_gpt_neox, rotary_base=10000.0, rotary_scaling=None, mapping=<tensorrt_llm.mapping.Mapping object>, quant_mode=QuantMode.None, use_parallel_embedding=False, embedding_sharding_dim=0, rms_norm_eps=1e-06)[source]
Bases: LLaMAModel, GenerationMixin
- forward(input_ids, position_ids=None, use_cache=False, last_token_ids=None, attention_mask=None, kv_cache_params=None, attention_params=None, hidden_states=None, all_reduce_workspace=None)[source]
- prepare_inputs(max_batch_size, max_input_len, max_new_tokens, use_cache, max_beam_width, max_num_tokens: int | None = None)[source]
@brief: Prepare input tensors for the model; the given sizes are used to determine the ranges of the dimensions when using TRT dynamic shapes.
@return: a list of values that can be fed into self.forward()
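Every constructor above takes a mapping argument, which defaults to a single-GPU tensorrt_llm.mapping.Mapping. For multi-GPU builds each rank constructs its own shard with a Mapping describing its position. The sketch below assumes the Mapping keywords world_size, rank, and tp_size from this release; the model sizes are 7B-style placeholders.

```python
from tensorrt_llm._utils import str_dtype_to_trt
from tensorrt_llm.mapping import Mapping
from tensorrt_llm.models import LLaMAForCausalLM

def build_llama_shard(rank: int, tp_size: int = 2) -> LLaMAForCausalLM:
    """Construct the LLaMA shard owned by `rank` in a tp_size-way tensor-parallel build."""
    mapping = Mapping(world_size=tp_size, rank=rank, tp_size=tp_size)
    return LLaMAForCausalLM(
        num_layers=32, num_heads=32, num_kv_heads=32,   # placeholder sizes
        hidden_size=4096, vocab_size=32000, hidden_act='silu',
        max_position_embeddings=2048,
        dtype=str_dtype_to_trt('float16'),
        mapping=mapping)   # shards attention heads and MLP weights across the tp_size ranks
```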
- class tensorrt_llm.models.LLaMAModel(num_layers, num_heads, num_kv_heads, hidden_size, vocab_size, hidden_act, max_position_embeddings, dtype, mlp_hidden_size=None, position_embedding_type=PositionEmbeddingType.rope_gpt_neox, rotary_base=10000.0, rotary_scaling=None, mapping=<tensorrt_llm.mapping.Mapping object>, quant_mode=QuantMode.None, use_parallel_embedding=False, embedding_sharding_dim=0, rms_norm_eps=1e-06)[source]
Bases: Module
- class tensorrt_llm.models.OPTLMHeadModel(num_layers, num_heads, hidden_size, vocab_size, hidden_act, max_position_embeddings, dtype, mapping=<tensorrt_llm.mapping.Mapping object>, pre_norm=True, do_layer_norm_before=True, use_prompt_tuning=False, use_parallel_embedding=False, embedding_sharding_dim=0, share_embedding_table=False)[source]
Bases: OPTModel, GenerationMixin
- forward(input_ids: Tensor, position_ids=None, use_cache=False, last_token_ids=None, attention_mask=None, kv_cache_params=None, attention_params=None, prompt_embedding_table=None, prompt_tasks=None, prompt_vocab_size=None)[source]
- prepare_inputs(max_batch_size, max_input_len, max_new_tokens, use_cache, max_beam_width, prompt_embedding_table_size=32)[source]
@brief: Prepare input tensors for the model; the given sizes are used to determine the ranges of the dimensions when using TRT dynamic shapes.
@return: a list of values that can be fed into self.forward()
- class tensorrt_llm.models.OPTModel(num_layers, num_heads, hidden_size, vocab_size, hidden_act, max_position_embeddings, dtype=None, mapping=<tensorrt_llm.mapping.Mapping object>, pre_norm=True, do_layer_norm_before=True, use_prompt_tuning=False, use_parallel_embedding=False, embedding_sharding_dim=0)[source]
Bases: Module
- tensorrt_llm.models.fp8_quantize(model, quant_mode: QuantMode, quant_scales: dict | None = None)[source]
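fp8_quantize swaps a constructed model's layers for FP8-aware variants according to quant_mode, optionally seeding them from quant_scales. A hedged sketch follows; the QuantMode.from_description flags and the assumption that fp8_quantize returns the modified model are taken from the FP8 examples of this release rather than from this reference.

```python
from tensorrt_llm.models import fp8_quantize
from tensorrt_llm.quantization.mode import QuantMode

def quantize_for_fp8(model, quant_scales=None):
    """Return `model` with FP8 quantization enabled for linear layers and the KV cache.

    quant_scales may carry precomputed per-layer scaling factors; None keeps
    placeholders to be filled in when calibrated weights are loaded.
    """
    # Flag names assumed from the FP8 examples of this release.
    quant_mode = QuantMode.from_description(use_fp8_qdq=True,
                                            use_fp8_kv_cache=True)
    return fp8_quantize(model, quant_mode, quant_scales=quant_scales)
```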