Layers
Activation
Attention
- class tensorrt_llm.layers.attention.Attention(hidden_size, num_attention_heads, num_kv_heads=None, max_position_embeddings=1024, num_layers=1, apply_query_key_layer_scaling=False, attention_mask_type=AttentionMaskType.padding, bias=True, dtype=None, position_embedding_type=PositionEmbeddingType.learned_absolute, rotary_embedding_base=10000.0, rotary_embedding_scaling=None, use_int8_kv_cache=False, rotary_embedding_percentage=1.0, tp_group=None, tp_size=1, tp_rank=0, multi_block_mode=False, quant_mode: QuantMode = QuantMode.None, q_scaling=1.0, cross_attention=False, relative_attention=False, max_distance=0, num_buckets=0, instance_id: int = 0)[source]
Bases:
Module
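A minimal construction sketch; the hidden size and head counts below are illustrative rather than taken from the source, and all other arguments keep the documented defaults.

```python
from tensorrt_llm.layers.attention import Attention

# Illustrative decoder-style attention block: 4096-wide hidden state,
# 32 query heads, 32 KV heads (standard multi-head attention).
# Remaining arguments keep the defaults shown in the signature above.
attn = Attention(
    hidden_size=4096,
    num_attention_heads=32,
    num_kv_heads=32,
    max_position_embeddings=2048,
    bias=True,
)
```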
- class tensorrt_llm.layers.attention.AttentionParams(sequence_length: Tensor | None = None, context_lengths: Tensor | None = None, host_context_lengths: Tensor | None = None, max_context_length: int | None = None, host_request_types: Tensor | None = None, encoder_input_lengths: Tensor | None = None, encoder_max_input_length: Tensor | None = None)[source]
Bases:
object
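AttentionParams is a plain holder for runtime tensors. A minimal sketch, setting only the scalar field; the Tensor-valued fields default to None and would normally be populated from network inputs.

```python
from tensorrt_llm.layers.attention import AttentionParams

# Only max_context_length is set here; sequence_length, context_lengths and
# the other Tensor fields are left at their default of None in this sketch.
attn_params = AttentionParams(max_context_length=1024)
```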
- class tensorrt_llm.layers.attention.BertAttention(hidden_size, num_attention_heads, num_kv_heads=None, max_position_embeddings=1024, num_layers=1, q_scaling=1.0, apply_query_key_layer_scaling=False, bias=True, dtype=None, tp_group=None, tp_size=1, tp_rank=0, relative_attention=False, max_distance=0, num_buckets=0)[source]
Bases:
Module
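A construction sketch for the encoder-style (bidirectional) attention; the BERT-base-like sizes are illustrative.

```python
from tensorrt_llm.layers.attention import BertAttention

# BERT-base-like shape on a single GPU (tp_size=1), no relative attention.
bert_attn = BertAttention(
    hidden_size=768,
    num_attention_heads=12,
    max_position_embeddings=512,
)
```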
- class tensorrt_llm.layers.attention.KeyValueCacheParams(past_key_value: List[Tensor] | None = None, host_past_key_value_lengths: Tensor | None = None, kv_cache_block_pointers: List[Tensor] | None = None, cache_indirection: Tensor | None = None, past_key_value_length: Tensor | None = None)[source]
Bases:
object
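KeyValueCacheParams likewise bundles KV-cache related tensors. A minimal sketch using the defaults; in a real network definition the Tensor fields would be wired to network inputs.

```python
from tensorrt_llm.layers.attention import KeyValueCacheParams

# All fields default to None; past_key_value, cache_indirection, etc. would
# normally be Tensors registered as network inputs.
kv_cache_params = KeyValueCacheParams()
```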
Cast
Conv
- class tensorrt_llm.layers.conv.Conv2d(in_channels: int, out_channels: int, kernel_size: Tuple[int, int], stride: Tuple[int, int] = (1, 1), padding: Tuple[int, int] = (0, 0), dilation: Tuple[int, int] = (1, 1), groups: int = 1, bias: bool = True, padding_mode: str = 'zeros', dtype=None)[source]
Bases:
Module
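A construction sketch; the channel counts and kernel shape are illustrative.

```python
from tensorrt_llm.layers.conv import Conv2d

# Illustrative 3x3 convolution: 3 input channels -> 64 output channels,
# stride 1, one pixel of zero padding on each side.
conv = Conv2d(
    in_channels=3,
    out_channels=64,
    kernel_size=(3, 3),
    stride=(1, 1),
    padding=(1, 1),
)
```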
- class tensorrt_llm.layers.conv.ConvTranspose2d(in_channels: int, out_channels: int, kernel_size: Tuple[int, int], stride: Tuple[int, int] = (1, 1), padding: Tuple[int, int] = (0, 0), output_padding: Tuple[int, int] = (0, 0), dilation: Tuple[int, int] = (1, 1), groups: int = 1, bias: bool = True, padding_mode: str = 'zeros', dtype=None)[source]
Bases:
Module
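A construction sketch for the transposed convolution; the values are illustrative.

```python
from tensorrt_llm.layers.conv import ConvTranspose2d

# Illustrative 2x spatial upsampling: 64 -> 32 channels, 2x2 kernel, stride 2.
deconv = ConvTranspose2d(
    in_channels=64,
    out_channels=32,
    kernel_size=(2, 2),
    stride=(2, 2),
)
```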
Embedding
- class tensorrt_llm.layers.embedding.Embedding(num_embeddings, embedding_dim, dtype=None, tp_size=1, tp_group=None, sharding_dim=0, tp_rank=None)[source]
Bases:
Module
The embedding layer takes input indices (x) and the embedding lookup table (weight) as input and outputs the corresponding embeddings according to the input indices. The size of weight is [num_embeddings, embedding_dim].
Four parameters (tp_size, tp_group, sharding_dim, tp_rank) are involved in tensor parallelism. Tensor parallelism is enabled only when tp_size > 1 and tp_group is not None (see the sketch below).
- When sharding_dim == 0, the weight is sharded along the vocabulary dimension; tp_rank must be set in this case.
- When sharding_dim == 1, the weight is sharded along the hidden dimension.
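A sketch of both sharding modes, assuming 2-way tensor parallelism and that tp_group is given as the list of ranks in the group (e.g. [0, 1]); the vocabulary and hidden sizes are illustrative.

```python
from tensorrt_llm.layers.embedding import Embedding

# sharding_dim == 0: the 32000-entry vocabulary is split across 2 ranks;
# tp_rank is required so each rank knows which vocabulary slice it owns.
vocab_parallel = Embedding(
    num_embeddings=32000,
    embedding_dim=4096,
    tp_size=2,
    tp_group=[0, 1],   # assumption: ranks in the tensor-parallel group
    sharding_dim=0,
    tp_rank=0,
)

# sharding_dim == 1: the 4096-wide hidden dimension is split across 2 ranks.
hidden_parallel = Embedding(
    num_embeddings=32000,
    embedding_dim=4096,
    tp_size=2,
    tp_group=[0, 1],
    sharding_dim=1,
)
```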
- class tensorrt_llm.layers.embedding.PromptTuningEmbedding(num_embeddings, embedding_dim, vocab_size=None, dtype=None, tp_size=1, tp_group=None, sharding_dim=0, tp_rank=0)[source]
Bases:
Embedding
Passes all tokens through both the normal and prompt embedding tables, then combines the results based on whether the token was “normal” or “prompt/virtual”.
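A construction sketch; the sizes are illustrative, and the assumption that token ids at or above vocab_size select the prompt table rather than the normal one is inferred from the description above.

```python
from tensorrt_llm.layers.embedding import PromptTuningEmbedding

# vocab_size marks the boundary between "normal" and "prompt/virtual" ids
# (assumption based on the description above); sizes are illustrative.
pt_embedding = PromptTuningEmbedding(
    num_embeddings=32000,
    embedding_dim=4096,
    vocab_size=32000,
)
```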
Linear
MLP
Normalization
- class tensorrt_llm.layers.normalization.GroupNorm(num_groups, num_channels, eps=1e-05, affine=True, dtype=None)[source]
Bases:
Module
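A construction sketch with illustrative channel and group counts.

```python
from tensorrt_llm.layers.normalization import GroupNorm

# Normalize 64 channels in 32 groups, with learnable affine parameters (default).
norm = GroupNorm(num_groups=32, num_channels=64)
```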