TensorRT-LLMs/python-api/tensorrt_llm.runtime.html
2023-12-04 18:59:41 +08:00

<!DOCTYPE html>
<html class="writer-html5" lang="en" data-content_root="../">
<head>
<meta charset="utf-8" /><meta name="viewport" content="width=device-width, initial-scale=1" />
<title>Runtime &mdash; tensorrt_llm documentation</title>
<link rel="stylesheet" type="text/css" href="../_static/pygments.css?v=80d5e7a1" />
<link rel="stylesheet" type="text/css" href="../_static/css/theme.css?v=19f00094" />
<!--[if lt IE 9]>
<script src="../_static/js/html5shiv.min.js"></script>
<![endif]-->
<script src="../_static/jquery.js?v=5d32c60e"></script>
<script src="../_static/_sphinx_javascript_frameworks_compat.js?v=2cd50e6c"></script>
<script src="../_static/documentation_options.js?v=5929fcd5"></script>
<script src="../_static/doctools.js?v=888ff710"></script>
<script src="../_static/sphinx_highlight.js?v=dc90522c"></script>
<script src="../_static/js/theme.js"></script>
<link rel="index" title="Index" href="../genindex.html" />
<link rel="search" title="Search" href="../search.html" />
<link rel="next" title="Runtime" href="../_cpp_gen/runtime.html" />
<link rel="prev" title="Quantization" href="tensorrt_llm.quantization.html" />
</head>
<body class="wy-body-for-nav">
<div class="wy-grid-for-nav">
<nav data-toggle="wy-nav-shift" class="wy-nav-side">
<div class="wy-side-scroll">
<div class="wy-side-nav-search" >
<a href="../index.html" class="icon icon-home">
tensorrt_llm
</a>
<div role="search">
<form id="rtd-search-form" class="wy-form" action="../search.html" method="get">
<input type="text" name="q" placeholder="Search docs" aria-label="Search docs" />
<input type="hidden" name="check_keywords" value="yes" />
<input type="hidden" name="area" value="default" />
</form>
</div>
</div><div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="Navigation menu">
<p class="caption" role="heading"><span class="caption-text">Contents:</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../architecture.html">TensorRT-LLM Architecture</a></li>
<li class="toctree-l1"><a class="reference internal" href="../gpt_runtime.html">C++ GPT Runtime</a></li>
<li class="toctree-l1"><a class="reference internal" href="../batch_manager.html">The Batch Manager in TensorRT-LLM</a></li>
<li class="toctree-l1"><a class="reference internal" href="../gpt_attention.html">Multi-head, Multi-query and Group-query Attention</a></li>
<li class="toctree-l1"><a class="reference internal" href="../precision.html">Numerical Precision</a></li>
<li class="toctree-l1"><a class="reference internal" href="../installation.html">Build TensorRT-LLM</a></li>
<li class="toctree-l1"><a class="reference internal" href="../performance.html">Performance of TensorRT-LLM</a></li>
<li class="toctree-l1"><a class="reference internal" href="../2023-05-19-how-to-debug.html">How to debug</a></li>
<li class="toctree-l1"><a class="reference internal" href="../2023-05-17-how-to-add-a-new-model.html">How to add a new model</a></li>
<li class="toctree-l1"><a class="reference internal" href="../graph-rewriting.html">Graph Rewriting Module</a></li>
<li class="toctree-l1"><a class="reference internal" href="../memory.html">Memory Usage of TensorRT-LLM</a></li>
</ul>
<p class="caption" role="heading"><span class="caption-text">Python API</span></p>
<ul class="current">
<li class="toctree-l1"><a class="reference internal" href="tensorrt_llm.layers.html">Layers</a></li>
<li class="toctree-l1"><a class="reference internal" href="tensorrt_llm.functional.html">Functionals</a></li>
<li class="toctree-l1"><a class="reference internal" href="tensorrt_llm.models.html">Models</a></li>
<li class="toctree-l1"><a class="reference internal" href="tensorrt_llm.plugin.html">Plugin</a></li>
<li class="toctree-l1"><a class="reference internal" href="tensorrt_llm.quantization.html">Quantization</a></li>
<li class="toctree-l1 current"><a class="current reference internal" href="#">Runtime</a><ul>
<li class="toctree-l2"><a class="reference internal" href="#tensorrt_llm.runtime.ChatGLMGenerationSession"><code class="docutils literal notranslate"><span class="pre">ChatGLMGenerationSession</span></code></a></li>
<li class="toctree-l2"><a class="reference internal" href="#tensorrt_llm.runtime.GenerationSequence"><code class="docutils literal notranslate"><span class="pre">GenerationSequence</span></code></a><ul>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.GenerationSequence.get_batch_idx"><code class="docutils literal notranslate"><span class="pre">GenerationSequence.get_batch_idx()</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.GenerationSequence.get_seq_idx"><code class="docutils literal notranslate"><span class="pre">GenerationSequence.get_seq_idx()</span></code></a></li>
</ul>
</li>
<li class="toctree-l2"><a class="reference internal" href="#tensorrt_llm.runtime.GenerationSession"><code class="docutils literal notranslate"><span class="pre">GenerationSession</span></code></a><ul>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.GenerationSession.batch_size"><code class="docutils literal notranslate"><span class="pre">GenerationSession.batch_size</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.GenerationSession.buffer_allocated"><code class="docutils literal notranslate"><span class="pre">GenerationSession.buffer_allocated</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.GenerationSession.cross_attention"><code class="docutils literal notranslate"><span class="pre">GenerationSession.cross_attention</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.GenerationSession.cuda_graph_mode"><code class="docutils literal notranslate"><span class="pre">GenerationSession.cuda_graph_mode</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.GenerationSession.cuda_stream_guard"><code class="docutils literal notranslate"><span class="pre">GenerationSession.cuda_stream_guard()</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.GenerationSession.debug_mode"><code class="docutils literal notranslate"><span class="pre">GenerationSession.debug_mode</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.GenerationSession.debug_tensors_to_save"><code class="docutils literal notranslate"><span class="pre">GenerationSession.debug_tensors_to_save</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.GenerationSession.decode"><code class="docutils literal notranslate"><span class="pre">GenerationSession.decode()</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.GenerationSession.decode_batch"><code class="docutils literal notranslate"><span class="pre">GenerationSession.decode_batch()</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.GenerationSession.decode_regular"><code class="docutils literal notranslate"><span class="pre">GenerationSession.decode_regular()</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.GenerationSession.decode_stream"><code class="docutils literal notranslate"><span class="pre">GenerationSession.decode_stream()</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.GenerationSession.device"><code class="docutils literal notranslate"><span class="pre">GenerationSession.device</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.GenerationSession.dtype"><code class="docutils literal notranslate"><span class="pre">GenerationSession.dtype</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.GenerationSession.finalize_decoder"><code class="docutils literal notranslate"><span class="pre">GenerationSession.finalize_decoder()</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.GenerationSession.first_layer"><code class="docutils literal notranslate"><span class="pre">GenerationSession.first_layer</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.GenerationSession.gather_all_token_logits"><code class="docutils literal notranslate"><span class="pre">GenerationSession.gather_all_token_logits</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.GenerationSession.handle_per_step"><code class="docutils literal notranslate"><span class="pre">GenerationSession.handle_per_step()</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.GenerationSession.has_position_embedding"><code class="docutils literal notranslate"><span class="pre">GenerationSession.has_position_embedding</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.GenerationSession.has_token_type_embedding"><code class="docutils literal notranslate"><span class="pre">GenerationSession.has_token_type_embedding</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.GenerationSession.head_size"><code class="docutils literal notranslate"><span class="pre">GenerationSession.head_size</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.GenerationSession.hidden_size"><code class="docutils literal notranslate"><span class="pre">GenerationSession.hidden_size</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.GenerationSession.last_layer"><code class="docutils literal notranslate"><span class="pre">GenerationSession.last_layer</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.GenerationSession.mapping"><code class="docutils literal notranslate"><span class="pre">GenerationSession.mapping</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.GenerationSession.num_heads"><code class="docutils literal notranslate"><span class="pre">GenerationSession.num_heads</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.GenerationSession.num_heads_kv"><code class="docutils literal notranslate"><span class="pre">GenerationSession.num_heads_kv</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.GenerationSession.num_layers"><code class="docutils literal notranslate"><span class="pre">GenerationSession.num_layers</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.GenerationSession.paged_kv_cache"><code class="docutils literal notranslate"><span class="pre">GenerationSession.paged_kv_cache</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.GenerationSession.pp_communicate_final_output_ids"><code class="docutils literal notranslate"><span class="pre">GenerationSession.pp_communicate_final_output_ids()</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.GenerationSession.pp_communicate_new_tokens"><code class="docutils literal notranslate"><span class="pre">GenerationSession.pp_communicate_new_tokens()</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.GenerationSession.quant_mode"><code class="docutils literal notranslate"><span class="pre">GenerationSession.quant_mode</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.GenerationSession.remove_input_padding"><code class="docutils literal notranslate"><span class="pre">GenerationSession.remove_input_padding</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.GenerationSession.runtime"><code class="docutils literal notranslate"><span class="pre">GenerationSession.runtime</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.GenerationSession.setup"><code class="docutils literal notranslate"><span class="pre">GenerationSession.setup()</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.GenerationSession.tokens_per_block"><code class="docutils literal notranslate"><span class="pre">GenerationSession.tokens_per_block</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.GenerationSession.use_custom_all_reduce"><code class="docutils literal notranslate"><span class="pre">GenerationSession.use_custom_all_reduce</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.GenerationSession.use_gpt_attention_plugin"><code class="docutils literal notranslate"><span class="pre">GenerationSession.use_gpt_attention_plugin</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.GenerationSession.use_lora_plugin"><code class="docutils literal notranslate"><span class="pre">GenerationSession.use_lora_plugin</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.GenerationSession.vocab_size"><code class="docutils literal notranslate"><span class="pre">GenerationSession.vocab_size</span></code></a></li>
</ul>
</li>
<li class="toctree-l2"><a class="reference internal" href="#tensorrt_llm.runtime.KVCacheManager"><code class="docutils literal notranslate"><span class="pre">KVCacheManager</span></code></a><ul>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.KVCacheManager.add_sequence"><code class="docutils literal notranslate"><span class="pre">KVCacheManager.add_sequence()</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.KVCacheManager.get_pointer_arrays"><code class="docutils literal notranslate"><span class="pre">KVCacheManager.get_pointer_arrays()</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.KVCacheManager.step"><code class="docutils literal notranslate"><span class="pre">KVCacheManager.step()</span></code></a></li>
</ul>
</li>
<li class="toctree-l2"><a class="reference internal" href="#tensorrt_llm.runtime.ModelConfig"><code class="docutils literal notranslate"><span class="pre">ModelConfig</span></code></a><ul>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.ModelConfig.cross_attention"><code class="docutils literal notranslate"><span class="pre">ModelConfig.cross_attention</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.ModelConfig.dtype"><code class="docutils literal notranslate"><span class="pre">ModelConfig.dtype</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.ModelConfig.gather_all_token_logits"><code class="docutils literal notranslate"><span class="pre">ModelConfig.gather_all_token_logits</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.ModelConfig.gpt_attention_plugin"><code class="docutils literal notranslate"><span class="pre">ModelConfig.gpt_attention_plugin</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.ModelConfig.has_position_embedding"><code class="docutils literal notranslate"><span class="pre">ModelConfig.has_position_embedding</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.ModelConfig.has_token_type_embedding"><code class="docutils literal notranslate"><span class="pre">ModelConfig.has_token_type_embedding</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.ModelConfig.head_size"><code class="docutils literal notranslate"><span class="pre">ModelConfig.head_size</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.ModelConfig.hidden_size"><code class="docutils literal notranslate"><span class="pre">ModelConfig.hidden_size</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.ModelConfig.lora_plugin"><code class="docutils literal notranslate"><span class="pre">ModelConfig.lora_plugin</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.ModelConfig.max_prompt_embedding_table_size"><code class="docutils literal notranslate"><span class="pre">ModelConfig.max_prompt_embedding_table_size</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.ModelConfig.model_name"><code class="docutils literal notranslate"><span class="pre">ModelConfig.model_name</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.ModelConfig.num_heads"><code class="docutils literal notranslate"><span class="pre">ModelConfig.num_heads</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.ModelConfig.num_kv_heads"><code class="docutils literal notranslate"><span class="pre">ModelConfig.num_kv_heads</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.ModelConfig.num_layers"><code class="docutils literal notranslate"><span class="pre">ModelConfig.num_layers</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.ModelConfig.paged_kv_cache"><code class="docutils literal notranslate"><span class="pre">ModelConfig.paged_kv_cache</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.ModelConfig.quant_mode"><code class="docutils literal notranslate"><span class="pre">ModelConfig.quant_mode</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.ModelConfig.remove_input_padding"><code class="docutils literal notranslate"><span class="pre">ModelConfig.remove_input_padding</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.ModelConfig.tokens_per_block"><code class="docutils literal notranslate"><span class="pre">ModelConfig.tokens_per_block</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.ModelConfig.use_custom_all_reduce"><code class="docutils literal notranslate"><span class="pre">ModelConfig.use_custom_all_reduce</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.ModelConfig.vocab_size"><code class="docutils literal notranslate"><span class="pre">ModelConfig.vocab_size</span></code></a></li>
</ul>
</li>
<li class="toctree-l2"><a class="reference internal" href="#tensorrt_llm.runtime.ModelRunner"><code class="docutils literal notranslate"><span class="pre">ModelRunner</span></code></a><ul>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.ModelRunner.from_dir"><code class="docutils literal notranslate"><span class="pre">ModelRunner.from_dir()</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.ModelRunner.generate"><code class="docutils literal notranslate"><span class="pre">ModelRunner.generate()</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.ModelRunner.remove_input_padding"><code class="docutils literal notranslate"><span class="pre">ModelRunner.remove_input_padding</span></code></a></li>
</ul>
</li>
<li class="toctree-l2"><a class="reference internal" href="#tensorrt_llm.runtime.Session"><code class="docutils literal notranslate"><span class="pre">Session</span></code></a><ul>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.Session.context"><code class="docutils literal notranslate"><span class="pre">Session.context</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.Session.engine"><code class="docutils literal notranslate"><span class="pre">Session.engine</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.Session.from_engine"><code class="docutils literal notranslate"><span class="pre">Session.from_engine()</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.Session.from_serialized_engine"><code class="docutils literal notranslate"><span class="pre">Session.from_serialized_engine()</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.Session.infer_shapes"><code class="docutils literal notranslate"><span class="pre">Session.infer_shapes()</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.Session.run"><code class="docutils literal notranslate"><span class="pre">Session.run()</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.Session.runtime"><code class="docutils literal notranslate"><span class="pre">Session.runtime</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.Session.set_shapes"><code class="docutils literal notranslate"><span class="pre">Session.set_shapes()</span></code></a></li>
</ul>
</li>
<li class="toctree-l2"><a class="reference internal" href="#tensorrt_llm.runtime.TensorInfo"><code class="docutils literal notranslate"><span class="pre">TensorInfo</span></code></a><ul>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.TensorInfo.dtype"><code class="docutils literal notranslate"><span class="pre">TensorInfo.dtype</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.TensorInfo.name"><code class="docutils literal notranslate"><span class="pre">TensorInfo.name</span></code></a></li>
<li class="toctree-l3"><a class="reference internal" href="#tensorrt_llm.runtime.TensorInfo.shape"><code class="docutils literal notranslate"><span class="pre">TensorInfo.shape</span></code></a></li>
</ul>
</li>
<li class="toctree-l2"><a class="reference internal" href="#tensorrt_llm.runtime.to_word_list_format"><code class="docutils literal notranslate"><span class="pre">to_word_list_format()</span></code></a></li>
</ul>
</li>
</ul>
<p class="caption" role="heading"><span class="caption-text">C++ API</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../_cpp_gen/runtime.html">Runtime</a></li>
</ul>
<p class="caption" role="heading"><span class="caption-text">Blogs</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../blogs/H100vsA100.html">H100 has 4.6x A100 Performance in TensorRT-LLM, achieving 10,000 tok/s at 100ms to first token</a></li>
<li class="toctree-l1"><a class="reference internal" href="../blogs/H200launch.html">H200 achieves nearly 12,000 tokens/sec on Llama2-13B with TensorRT-LLM</a></li>
</ul>
</div>
</div>
</nav>
<section data-toggle="wy-nav-shift" class="wy-nav-content-wrap"><nav class="wy-nav-top" aria-label="Mobile navigation menu" >
<i data-toggle="wy-nav-top" class="fa fa-bars"></i>
<a href="../index.html">tensorrt_llm</a>
</nav>
<div class="wy-nav-content">
<div class="rst-content">
<div role="navigation" aria-label="Page navigation">
<ul class="wy-breadcrumbs">
<li><a href="../index.html" class="icon icon-home" aria-label="Home"></a></li>
<li class="breadcrumb-item active">Runtime</li>
<li class="wy-breadcrumbs-aside">
<a href="../_sources/python-api/tensorrt_llm.runtime.rst.txt" rel="nofollow"> View page source</a>
</li>
</ul>
<hr/>
</div>
<div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
<div itemprop="articleBody">
<section id="module-tensorrt_llm">
<span id="runtime"></span><h1>Runtime<a class="headerlink" href="#module-tensorrt_llm" title="Link to this heading"></a></h1>
<dl class="py class" id="module-tensorrt_llm.runtime">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.ChatGLMGenerationSession">
<em class="property"><span class="pre">class</span><span class="w"> </span></em><span class="sig-prename descclassname"><span class="pre">tensorrt_llm.runtime.</span></span><span class="sig-name descname"><span class="pre">ChatGLMGenerationSession</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">model_config</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><a class="reference internal" href="#tensorrt_llm.runtime.ModelConfig" title="tensorrt_llm.runtime.generation.ModelConfig"><span class="pre">ModelConfig</span></a></span></em>, <em class="sig-param"><span class="n"><span class="pre">engine_buffer</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">mapping</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">Mapping</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">debug_mode</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">False</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">debug_tensors_to_save</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">None</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">cuda_graph_mode</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">False</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">stream</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">Stream</span><span class="w"> </span><span class="p"><span class="pre">|</span></span><span class="w"> </span><span class="pre">None</span></span><span class="w"> </span><span class="o"><span class="pre">=</span></span><span class="w"> </span><span 
class="default_value"><span class="pre">None</span></span></em><span class="sig-paren">)</span><a class="reference internal" href="../_modules/tensorrt_llm/runtime/generation.html#ChatGLMGenerationSession"><span class="viewcode-link"><span class="pre">[source]</span></span></a><a class="headerlink" href="#tensorrt_llm.runtime.ChatGLMGenerationSession" title="Link to this definition"></a></dt>
<dd><p>Bases: <a class="reference internal" href="#tensorrt_llm.runtime.GenerationSession" title="tensorrt_llm.runtime.generation.GenerationSession"><code class="xref py py-class docutils literal notranslate"><span class="pre">GenerationSession</span></code></a></p>
</dd></dl>
<dl class="py class">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.GenerationSequence">
<em class="property"><span class="pre">class</span><span class="w"> </span></em><span class="sig-prename descclassname"><span class="pre">tensorrt_llm.runtime.</span></span><span class="sig-name descname"><span class="pre">GenerationSequence</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">seq_idx</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">batch_idx</span></span></em><span class="sig-paren">)</span><a class="reference internal" href="../_modules/tensorrt_llm/runtime/kv_cache_manager.html#GenerationSequence"><span class="viewcode-link"><span class="pre">[source]</span></span></a><a class="headerlink" href="#tensorrt_llm.runtime.GenerationSequence" title="Link to this definition"></a></dt>
<dd><p>Bases: <code class="xref py py-class docutils literal notranslate"><span class="pre">object</span></code></p>
<dl class="py method">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.GenerationSequence.get_batch_idx">
<span class="sig-name descname"><span class="pre">get_batch_idx</span></span><span class="sig-paren">(</span><span class="sig-paren">)</span> <span class="sig-return"><span class="sig-return-icon">&#x2192;</span> <span class="sig-return-typehint"><span class="pre">int</span></span></span><a class="reference internal" href="../_modules/tensorrt_llm/runtime/kv_cache_manager.html#GenerationSequence.get_batch_idx"><span class="viewcode-link"><span class="pre">[source]</span></span></a><a class="headerlink" href="#tensorrt_llm.runtime.GenerationSequence.get_batch_idx" title="Link to this definition"></a></dt>
<dd><p>Returns the index of the sequence within the batch.</p>
</dd></dl>
<dl class="py method">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.GenerationSequence.get_seq_idx">
<span class="sig-name descname"><span class="pre">get_seq_idx</span></span><span class="sig-paren">(</span><span class="sig-paren">)</span> <span class="sig-return"><span class="sig-return-icon">&#x2192;</span> <span class="sig-return-typehint"><span class="pre">int</span></span></span><a class="reference internal" href="../_modules/tensorrt_llm/runtime/kv_cache_manager.html#GenerationSequence.get_seq_idx"><span class="viewcode-link"><span class="pre">[source]</span></span></a><a class="headerlink" href="#tensorrt_llm.runtime.GenerationSequence.get_seq_idx" title="Link to this definition"></a></dt>
<dd><p>Returns the sequence index.</p>
</dd></dl>
</dd></dl>
<dl class="py class">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.GenerationSession">
<em class="property"><span class="pre">class</span><span class="w"> </span></em><span class="sig-prename descclassname"><span class="pre">tensorrt_llm.runtime.</span></span><span class="sig-name descname"><span class="pre">GenerationSession</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">model_config</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><a class="reference internal" href="#tensorrt_llm.runtime.ModelConfig" title="tensorrt_llm.runtime.generation.ModelConfig"><span class="pre">ModelConfig</span></a></span></em>, <em class="sig-param"><span class="n"><span class="pre">engine_buffer</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">mapping</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">Mapping</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">debug_mode</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">False</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">debug_tensors_to_save</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">None</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">cuda_graph_mode</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">False</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">stream</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">Stream</span><span class="w"> </span><span class="p"><span class="pre">|</span></span><span class="w"> </span><span class="pre">None</span></span><span class="w"> </span><span class="o"><span class="pre">=</span></span><span class="w"> </span><span 
class="default_value"><span class="pre">None</span></span></em><span class="sig-paren">)</span><a class="reference internal" href="../_modules/tensorrt_llm/runtime/generation.html#GenerationSession"><span class="viewcode-link"><span class="pre">[source]</span></span></a><a class="headerlink" href="#tensorrt_llm.runtime.GenerationSession" title="Link to this definition"></a></dt>
<dd><p>Bases: <code class="xref py py-class docutils literal notranslate"><span class="pre">object</span></code></p>
<dl class="py attribute">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.GenerationSession.batch_size">
<span class="sig-name descname"><span class="pre">batch_size</span></span><em class="property"><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="pre">int</span></em><a class="headerlink" href="#tensorrt_llm.runtime.GenerationSession.batch_size" title="Link to this definition"></a></dt>
<dd></dd></dl>
<dl class="py attribute">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.GenerationSession.buffer_allocated">
<span class="sig-name descname"><span class="pre">buffer_allocated</span></span><em class="property"><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="pre">bool</span></em><a class="headerlink" href="#tensorrt_llm.runtime.GenerationSession.buffer_allocated" title="Link to this definition"></a></dt>
<dd></dd></dl>
<dl class="py property">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.GenerationSession.cross_attention">
<em class="property"><span class="pre">property</span><span class="w"> </span></em><span class="sig-name descname"><span class="pre">cross_attention</span></span><a class="headerlink" href="#tensorrt_llm.runtime.GenerationSession.cross_attention" title="Link to this definition"></a></dt>
<dd></dd></dl>
<dl class="py attribute">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.GenerationSession.cuda_graph_mode">
<span class="sig-name descname"><span class="pre">cuda_graph_mode</span></span><em class="property"><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="pre">bool</span></em><a class="headerlink" href="#tensorrt_llm.runtime.GenerationSession.cuda_graph_mode" title="Link to this definition"></a></dt>
<dd></dd></dl>
<dl class="py method">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.GenerationSession.cuda_stream_guard">
<span class="sig-name descname"><span class="pre">cuda_stream_guard</span></span><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="../_modules/tensorrt_llm/runtime/generation.html#GenerationSession.cuda_stream_guard"><span class="viewcode-link"><span class="pre">[source]</span></span></a><a class="headerlink" href="#tensorrt_llm.runtime.GenerationSession.cuda_stream_guard" title="Link to this definition"></a></dt>
<dd><p>Synchronizes the external stream, then makes the stream bound to the session the current stream. The previous current stream is restored on exit.</p>
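<p>The guard pattern described above can be sketched in plain Python. This is an illustration only, not the library's implementation: <code class="docutils literal notranslate"><span class="pre">FakeStream</span></code>, <code class="docutils literal notranslate"><span class="pre">stream_guard</span></code> and the module-level <code class="docutils literal notranslate"><span class="pre">_current</span></code> variable are hypothetical stand-ins, so the sketch runs without a GPU.</p>
<div class="highlight-python notranslate"><div class="highlight"><pre>
from contextlib import contextmanager


# Hypothetical stand-ins for illustration only; none of these names come
# from tensorrt_llm or torch.
class FakeStream:
    def __init__(self, name):
        self.name = name
        self.synchronized = False

    def synchronize(self):
        # A real stream would block until its queued work has finished.
        self.synchronized = True


# Tracks which stream is "current", playing the role of the framework state.
_current = FakeStream("external")


@contextmanager
def stream_guard(session_stream):
    """Sync the external stream, switch to the session stream, restore on exit."""
    global _current
    external = _current
    if external is not session_stream:
        external.synchronize()
    _current = session_stream
    try:
        yield session_stream
    finally:
        # Restore the previous stream even if the body raised.
        _current = external


session_stream = FakeStream("session")
external_stream = _current
with stream_guard(session_stream):
    inside = _current.name
after = _current.name
</pre></div></div>
<p>Assuming the method is used as a context manager, as "Reset on exit" suggests, a call on a real session would take the form <code class="docutils literal notranslate"><span class="pre">with</span> <span class="pre">session.cuda_stream_guard():</span></code>, with the work to run on the session's stream placed in the body.</p>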
</dd></dl>
<dl class="py attribute">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.GenerationSession.debug_mode">
<span class="sig-name descname"><span class="pre">debug_mode</span></span><em class="property"><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="pre">bool</span></em><a class="headerlink" href="#tensorrt_llm.runtime.GenerationSession.debug_mode" title="Link to this definition"></a></dt>
<dd></dd></dl>
<dl class="py attribute">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.GenerationSession.debug_tensors_to_save">
<span class="sig-name descname"><span class="pre">debug_tensors_to_save</span></span><em class="property"><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="pre">None</span></em><a class="headerlink" href="#tensorrt_llm.runtime.GenerationSession.debug_tensors_to_save" title="Link to this definition"></a></dt>
<dd></dd></dl>
<dl class="py method">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.GenerationSession.decode">
<span class="sig-name descname"><span class="pre">decode</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">input_ids</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">Tensor</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">context_lengths</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">Tensor</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">sampling_config</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">SamplingConfig</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">prompt_embedding_table</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">Tensor</span></span><span class="w"> </span><span class="o"><span class="pre">=</span></span><span class="w"> </span><span class="default_value"><span class="pre">None</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">tasks</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">Tensor</span></span><span class="w"> </span><span class="o"><span class="pre">=</span></span><span class="w"> </span><span class="default_value"><span class="pre">None</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">prompt_vocab_size</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">Tensor</span></span><span class="w"> </span><span class="o"><span class="pre">=</span></span><span class="w"> </span><span class="default_value"><span class="pre">None</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">stop_words_list</span></span><span 
class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">None</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">bad_words_list</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">None</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">no_repeat_ngram_size</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">None</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">streaming</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">bool</span></span><span class="w"> </span><span class="o"><span class="pre">=</span></span><span class="w"> </span><span class="default_value"><span class="pre">False</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">output_sequence_lengths</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">bool</span></span><span class="w"> </span><span class="o"><span class="pre">=</span></span><span class="w"> </span><span class="default_value"><span class="pre">False</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">return_dict</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">bool</span></span><span class="w"> </span><span class="o"><span class="pre">=</span></span><span class="w"> </span><span class="default_value"><span class="pre">False</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">encoder_output</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">Tensor</span></span><span class="w"> </span><span class="o"><span class="pre">=</span></span><span class="w"> </span><span 
class="default_value"><span class="pre">None</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">encoder_input_lengths</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">Tensor</span></span><span class="w"> </span><span class="o"><span class="pre">=</span></span><span class="w"> </span><span class="default_value"><span class="pre">None</span></span></em><span class="sig-paren">)</span><a class="reference internal" href="../_modules/tensorrt_llm/runtime/generation.html#GenerationSession.decode"><span class="viewcode-link"><span class="pre">[source]</span></span></a><a class="headerlink" href="#tensorrt_llm.runtime.GenerationSession.decode" title="Link to this definition"></a></dt>
<dd></dd></dl>
<dl class="py method">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.GenerationSession.decode_batch">
<span class="sig-name descname"><span class="pre">decode_batch</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">input_ids</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">Sequence</span><span class="p"><span class="pre">[</span></span><span class="pre">Tensor</span><span class="p"><span class="pre">]</span></span></span></em>, <em class="sig-param"><span class="n"><span class="pre">sampling_config</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">SamplingConfig</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">streaming</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">bool</span></span><span class="w"> </span><span class="o"><span class="pre">=</span></span><span class="w"> </span><span class="default_value"><span class="pre">False</span></span></em>, <em class="sig-param"><span class="o"><span class="pre">**</span></span><span class="n"><span class="pre">kwargs</span></span></em><span class="sig-paren">)</span><a class="reference internal" href="../_modules/tensorrt_llm/runtime/generation.html#GenerationSession.decode_batch"><span class="viewcode-link"><span class="pre">[source]</span></span></a><a class="headerlink" href="#tensorrt_llm.runtime.GenerationSession.decode_batch" title="Link to this definition"></a></dt>
<dd></dd></dl>
<dl class="py method">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.GenerationSession.decode_regular">
<span class="sig-name descname"><span class="pre">decode_regular</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">batch_size</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">int</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">scfg</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">SamplingConfig</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">sequence_lengths</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">Tensor</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">context_lengths</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">Tensor</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">host_context_lengths</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">max_context_length</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">int</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">beam_width</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">int</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">cache_indirections</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">list</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">input_ids</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">Tensor</span></span></em>, <em class="sig-param"><span class="n"><span 
class="pre">hidden_states</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">Tensor</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">prompt_embedding_table</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">Tensor</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">tasks</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">Tensor</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">prompt_vocab_size</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">Tensor</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">ite</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">int</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">sequence_limit_lengths</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">Tensor</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">stop_words_list</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">bad_words_list</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">no_repeat_ngram_size</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">output_sequence_lengths</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">bool</span></span><span class="w"> </span><span class="o"><span class="pre">=</span></span><span class="w"> </span><span class="default_value"><span class="pre">False</span></span></em>, <em class="sig-param"><span class="n"><span 
class="pre">return_dict</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">bool</span></span><span class="w"> </span><span class="o"><span class="pre">=</span></span><span class="w"> </span><span class="default_value"><span class="pre">False</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">encoder_output</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">Tensor</span><span class="w"> </span><span class="p"><span class="pre">|</span></span><span class="w"> </span><span class="pre">None</span></span><span class="w"> </span><span class="o"><span class="pre">=</span></span><span class="w"> </span><span class="default_value"><span class="pre">None</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">encoder_input_lengths</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">Tensor</span><span class="w"> </span><span class="p"><span class="pre">|</span></span><span class="w"> </span><span class="pre">None</span></span><span class="w"> </span><span class="o"><span class="pre">=</span></span><span class="w"> </span><span class="default_value"><span class="pre">None</span></span></em><span class="sig-paren">)</span><a class="reference internal" href="../_modules/tensorrt_llm/runtime/generation.html#GenerationSession.decode_regular"><span class="viewcode-link"><span class="pre">[source]</span></span></a><a class="headerlink" href="#tensorrt_llm.runtime.GenerationSession.decode_regular" title="Link to this definition"></a></dt>
<dd></dd></dl>
<dl class="py method">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.GenerationSession.decode_stream">
<span class="sig-name descname"><span class="pre">decode_stream</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">batch_size</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">int</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">scfg</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">SamplingConfig</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">sequence_lengths</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">Tensor</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">context_lengths</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">Tensor</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">host_context_lengths</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">max_context_length</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">int</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">beam_width</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">int</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">cache_indirections</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">list</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">input_ids</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">Tensor</span></span></em>, <em class="sig-param"><span class="n"><span 
class="pre">hidden_states</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">Tensor</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">prompt_embedding_table</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">Tensor</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">tasks</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">Tensor</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">prompt_vocab_size</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">Tensor</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">ite</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">int</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">sequence_limit_lengths</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">Tensor</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">stop_words_list</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">bad_words_list</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">no_repeat_ngram_size</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">output_sequence_lengths</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">bool</span></span><span class="w"> </span><span class="o"><span class="pre">=</span></span><span class="w"> </span><span class="default_value"><span class="pre">False</span></span></em>, <em class="sig-param"><span class="n"><span 
class="pre">return_dict</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">bool</span></span><span class="w"> </span><span class="o"><span class="pre">=</span></span><span class="w"> </span><span class="default_value"><span class="pre">False</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">encoder_output</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">Tensor</span><span class="w"> </span><span class="p"><span class="pre">|</span></span><span class="w"> </span><span class="pre">None</span></span><span class="w"> </span><span class="o"><span class="pre">=</span></span><span class="w"> </span><span class="default_value"><span class="pre">None</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">encoder_input_lengths</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">Tensor</span><span class="w"> </span><span class="p"><span class="pre">|</span></span><span class="w"> </span><span class="pre">None</span></span><span class="w"> </span><span class="o"><span class="pre">=</span></span><span class="w"> </span><span class="default_value"><span class="pre">None</span></span></em><span class="sig-paren">)</span><a class="reference internal" href="../_modules/tensorrt_llm/runtime/generation.html#GenerationSession.decode_stream"><span class="viewcode-link"><span class="pre">[source]</span></span></a><a class="headerlink" href="#tensorrt_llm.runtime.GenerationSession.decode_stream" title="Link to this definition"></a></dt>
<dd></dd></dl>
<dl class="py attribute">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.GenerationSession.device">
<span class="sig-name descname"><span class="pre">device</span></span><em class="property"><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="pre">device</span></em><a class="headerlink" href="#tensorrt_llm.runtime.GenerationSession.device" title="Link to this definition"></a></dt>
<dd></dd></dl>
<dl class="py property">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.GenerationSession.dtype">
<em class="property"><span class="pre">property</span><span class="w"> </span></em><span class="sig-name descname"><span class="pre">dtype</span></span><a class="headerlink" href="#tensorrt_llm.runtime.GenerationSession.dtype" title="Link to this definition"></a></dt>
<dd></dd></dl>
<dl class="py method">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.GenerationSession.finalize_decoder">
<span class="sig-name descname"><span class="pre">finalize_decoder</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">context_lengths</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">batch_size</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">beam_width</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">scfg</span></span></em><span class="sig-paren">)</span><a class="reference internal" href="../_modules/tensorrt_llm/runtime/generation.html#GenerationSession.finalize_decoder"><span class="viewcode-link"><span class="pre">[source]</span></span></a><a class="headerlink" href="#tensorrt_llm.runtime.GenerationSession.finalize_decoder" title="Link to this definition"></a></dt>
<dd></dd></dl>
<dl class="py property">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.GenerationSession.first_layer">
<em class="property"><span class="pre">property</span><span class="w"> </span></em><span class="sig-name descname"><span class="pre">first_layer</span></span><a class="headerlink" href="#tensorrt_llm.runtime.GenerationSession.first_layer" title="Link to this definition"></a></dt>
<dd></dd></dl>
<dl class="py property">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.GenerationSession.gather_all_token_logits">
<em class="property"><span class="pre">property</span><span class="w"> </span></em><span class="sig-name descname"><span class="pre">gather_all_token_logits</span></span><a class="headerlink" href="#tensorrt_llm.runtime.GenerationSession.gather_all_token_logits" title="Link to this definition"></a></dt>
<dd></dd></dl>
<dl class="py method">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.GenerationSession.handle_per_step">
<span class="sig-name descname"><span class="pre">handle_per_step</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">cache_indirections</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">list</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">step</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">int</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">batch_size</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">int</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">max_context_length</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">int</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">beam_width</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">int</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">input_ids</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">Tensor</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">hidden_states</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">Tensor</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">scfg</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">SamplingConfig</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">kv_cache_block_pointers</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span 
class="n"><span class="pre">list</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">prompt_embedding_table</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">Tensor</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">tasks</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">Tensor</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">context_lengths</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">Tensor</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">host_context_lengths</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">attention_mask</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">Tensor</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">prompt_vocab_size</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">Tensor</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">ite</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">int</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">sequence_limit_lengths</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">Tensor</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">sequence_lengths</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">Tensor</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">next_step_buffer</span></span><span 
class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">dict</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">stop_words_list</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">bad_words_list</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">no_repeat_ngram_size</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">encoder_output</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">Tensor</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">encoder_input_lengths</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">Tensor</span></span></em><span class="sig-paren">)</span><a class="reference internal" href="../_modules/tensorrt_llm/runtime/generation.html#GenerationSession.handle_per_step"><span class="viewcode-link"><span class="pre">[source]</span></span></a><a class="headerlink" href="#tensorrt_llm.runtime.GenerationSession.handle_per_step" title="Link to this definition"></a></dt>
<dd></dd></dl>
<dl class="py property">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.GenerationSession.has_position_embedding">
<em class="property"><span class="pre">property</span><span class="w"> </span></em><span class="sig-name descname"><span class="pre">has_position_embedding</span></span><a class="headerlink" href="#tensorrt_llm.runtime.GenerationSession.has_position_embedding" title="Link to this definition"></a></dt>
<dd></dd></dl>
<dl class="py property">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.GenerationSession.has_token_type_embedding">
<em class="property"><span class="pre">property</span><span class="w"> </span></em><span class="sig-name descname"><span class="pre">has_token_type_embedding</span></span><a class="headerlink" href="#tensorrt_llm.runtime.GenerationSession.has_token_type_embedding" title="Link to this definition"></a></dt>
<dd></dd></dl>
<dl class="py property">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.GenerationSession.head_size">
<em class="property"><span class="pre">property</span><span class="w"> </span></em><span class="sig-name descname"><span class="pre">head_size</span></span><a class="headerlink" href="#tensorrt_llm.runtime.GenerationSession.head_size" title="Link to this definition"></a></dt>
<dd></dd></dl>
<dl class="py property">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.GenerationSession.hidden_size">
<em class="property"><span class="pre">property</span><span class="w"> </span></em><span class="sig-name descname"><span class="pre">hidden_size</span></span><a class="headerlink" href="#tensorrt_llm.runtime.GenerationSession.hidden_size" title="Link to this definition"></a></dt>
<dd></dd></dl>
<dl class="py property">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.GenerationSession.last_layer">
<em class="property"><span class="pre">property</span><span class="w"> </span></em><span class="sig-name descname"><span class="pre">last_layer</span></span><a class="headerlink" href="#tensorrt_llm.runtime.GenerationSession.last_layer" title="Link to this definition"></a></dt>
<dd></dd></dl>
<dl class="py attribute">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.GenerationSession.mapping">
<span class="sig-name descname"><span class="pre">mapping</span></span><em class="property"><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="pre">Mapping</span></em><a class="headerlink" href="#tensorrt_llm.runtime.GenerationSession.mapping" title="Link to this definition"></a></dt>
<dd></dd></dl>
<dl class="py property">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.GenerationSession.num_heads">
<em class="property"><span class="pre">property</span><span class="w"> </span></em><span class="sig-name descname"><span class="pre">num_heads</span></span><a class="headerlink" href="#tensorrt_llm.runtime.GenerationSession.num_heads" title="Link to this definition"></a></dt>
<dd></dd></dl>
<dl class="py property">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.GenerationSession.num_heads_kv">
<em class="property"><span class="pre">property</span><span class="w"> </span></em><span class="sig-name descname"><span class="pre">num_heads_kv</span></span><a class="headerlink" href="#tensorrt_llm.runtime.GenerationSession.num_heads_kv" title="Link to this definition"></a></dt>
<dd></dd></dl>
<dl class="py property">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.GenerationSession.num_layers">
<em class="property"><span class="pre">property</span><span class="w"> </span></em><span class="sig-name descname"><span class="pre">num_layers</span></span><a class="headerlink" href="#tensorrt_llm.runtime.GenerationSession.num_layers" title="Link to this definition"></a></dt>
<dd></dd></dl>
<dl class="py property">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.GenerationSession.paged_kv_cache">
<em class="property"><span class="pre">property</span><span class="w"> </span></em><span class="sig-name descname"><span class="pre">paged_kv_cache</span></span><a class="headerlink" href="#tensorrt_llm.runtime.GenerationSession.paged_kv_cache" title="Link to this definition"></a></dt>
<dd></dd></dl>
<dl class="py method">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.GenerationSession.pp_communicate_final_output_ids">
<span class="sig-name descname"><span class="pre">pp_communicate_final_output_ids</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">final_output_ids</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">batch_size</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">beam_width</span></span></em><span class="sig-paren">)</span><a class="reference internal" href="../_modules/tensorrt_llm/runtime/generation.html#GenerationSession.pp_communicate_final_output_ids"><span class="viewcode-link"><span class="pre">[source]</span></span></a><a class="headerlink" href="#tensorrt_llm.runtime.GenerationSession.pp_communicate_final_output_ids" title="Link to this definition"></a></dt>
<dd></dd></dl>
<dl class="py method">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.GenerationSession.pp_communicate_new_tokens">
<span class="sig-name descname"><span class="pre">pp_communicate_new_tokens</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">should_stop</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">cache_indir</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">sequence_length</span></span></em><span class="sig-paren">)</span><a class="reference internal" href="../_modules/tensorrt_llm/runtime/generation.html#GenerationSession.pp_communicate_new_tokens"><span class="viewcode-link"><span class="pre">[source]</span></span></a><a class="headerlink" href="#tensorrt_llm.runtime.GenerationSession.pp_communicate_new_tokens" title="Link to this definition"></a></dt>
<dd></dd></dl>
<dl class="py property">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.GenerationSession.quant_mode">
<em class="property"><span class="pre">property</span><span class="w"> </span></em><span class="sig-name descname"><span class="pre">quant_mode</span></span><a class="headerlink" href="#tensorrt_llm.runtime.GenerationSession.quant_mode" title="Link to this definition"></a></dt>
<dd></dd></dl>
<dl class="py property">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.GenerationSession.remove_input_padding">
<em class="property"><span class="pre">property</span><span class="w"> </span></em><span class="sig-name descname"><span class="pre">remove_input_padding</span></span><a class="headerlink" href="#tensorrt_llm.runtime.GenerationSession.remove_input_padding" title="Link to this definition"></a></dt>
<dd></dd></dl>
<dl class="py attribute">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.GenerationSession.runtime">
<span class="sig-name descname"><span class="pre">runtime</span></span><em class="property"><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="pre">_Runtime</span></em><a class="headerlink" href="#tensorrt_llm.runtime.GenerationSession.runtime" title="Link to this definition"></a></dt>
<dd></dd></dl>
<dl class="py method">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.GenerationSession.setup">
<span class="sig-name descname"><span class="pre">setup</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">batch_size</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">int</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">max_context_length</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">int</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">max_new_tokens</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">int</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">beam_width</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">int</span></span><span class="w"> </span><span class="o"><span class="pre">=</span></span><span class="w"> </span><span class="default_value"><span class="pre">1</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">max_kv_cache_length</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">int</span><span class="w"> </span><span class="p"><span class="pre">|</span></span><span class="w"> </span><span class="pre">None</span></span><span class="w"> </span><span class="o"><span class="pre">=</span></span><span class="w"> </span><span class="default_value"><span class="pre">None</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">encoder_max_input_length</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">int</span><span class="w"> </span><span class="p"><span class="pre">|</span></span><span class="w"> </span><span class="pre">None</span></span><span class="w"> </span><span 
class="o"><span class="pre">=</span></span><span class="w"> </span><span class="default_value"><span class="pre">None</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">lora_manager</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">LoraManager</span><span class="w"> </span><span class="p"><span class="pre">|</span></span><span class="w"> </span><span class="pre">None</span></span><span class="w"> </span><span class="o"><span class="pre">=</span></span><span class="w"> </span><span class="default_value"><span class="pre">None</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">lora_uids</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">List</span><span class="p"><span class="pre">[</span></span><span class="pre">str</span><span class="p"><span class="pre">]</span></span><span class="w"> </span><span class="p"><span class="pre">|</span></span><span class="w"> </span><span class="pre">None</span></span><span class="w"> </span><span class="o"><span class="pre">=</span></span><span class="w"> </span><span class="default_value"><span class="pre">None</span></span></em><span class="sig-paren">)</span><a class="reference internal" href="../_modules/tensorrt_llm/runtime/generation.html#GenerationSession.setup"><span class="viewcode-link"><span class="pre">[source]</span></span></a><a class="headerlink" href="#tensorrt_llm.runtime.GenerationSession.setup" title="Link to this definition"></a></dt>
<dd></dd></dl>
<dl class="py property">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.GenerationSession.tokens_per_block">
<em class="property"><span class="pre">property</span><span class="w"> </span></em><span class="sig-name descname"><span class="pre">tokens_per_block</span></span><a class="headerlink" href="#tensorrt_llm.runtime.GenerationSession.tokens_per_block" title="Link to this definition"></a></dt>
<dd></dd></dl>
<dl class="py property">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.GenerationSession.use_custom_all_reduce">
<em class="property"><span class="pre">property</span><span class="w"> </span></em><span class="sig-name descname"><span class="pre">use_custom_all_reduce</span></span><a class="headerlink" href="#tensorrt_llm.runtime.GenerationSession.use_custom_all_reduce" title="Link to this definition"></a></dt>
<dd></dd></dl>
<dl class="py property">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.GenerationSession.use_gpt_attention_plugin">
<em class="property"><span class="pre">property</span><span class="w"> </span></em><span class="sig-name descname"><span class="pre">use_gpt_attention_plugin</span></span><a class="headerlink" href="#tensorrt_llm.runtime.GenerationSession.use_gpt_attention_plugin" title="Link to this definition"></a></dt>
<dd></dd></dl>
<dl class="py property">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.GenerationSession.use_lora_plugin">
<em class="property"><span class="pre">property</span><span class="w"> </span></em><span class="sig-name descname"><span class="pre">use_lora_plugin</span></span><a class="headerlink" href="#tensorrt_llm.runtime.GenerationSession.use_lora_plugin" title="Link to this definition"></a></dt>
<dd></dd></dl>
<dl class="py property">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.GenerationSession.vocab_size">
<em class="property"><span class="pre">property</span><span class="w"> </span></em><span class="sig-name descname"><span class="pre">vocab_size</span></span><a class="headerlink" href="#tensorrt_llm.runtime.GenerationSession.vocab_size" title="Link to this definition"></a></dt>
<dd></dd></dl>
</dd></dl>
<dl class="py class">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.KVCacheManager">
<em class="property"><span class="pre">class</span><span class="w"> </span></em><span class="sig-prename descclassname"><span class="pre">tensorrt_llm.runtime.</span></span><span class="sig-name descname"><span class="pre">KVCacheManager</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">memory_pools</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">List</span><span class="p"><span class="pre">[</span></span><span class="pre">Tensor</span><span class="p"><span class="pre">]</span></span></span></em>, <em class="sig-param"><span class="n"><span class="pre">blocks</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">int</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">tokens_per_block</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">int</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">max_blocks_per_seq</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">int</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">max_kv_cache_len</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">int</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">beam_width</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">int</span></span><span class="w"> </span><span class="o"><span class="pre">=</span></span><span class="w"> </span><span class="default_value"><span class="pre">1</span></span></em><span class="sig-paren">)</span><a class="reference internal" href="../_modules/tensorrt_llm/runtime/kv_cache_manager.html#KVCacheManager"><span 
class="viewcode-link"><span class="pre">[source]</span></span></a><a class="headerlink" href="#tensorrt_llm.runtime.KVCacheManager" title="Link to this definition"></a></dt>
<dd><p>Bases: <code class="xref py py-class docutils literal notranslate"><span class="pre">object</span></code></p>
<dl class="py method">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.KVCacheManager.add_sequence">
<span class="sig-name descname"><span class="pre">add_sequence</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">sequence</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><a class="reference internal" href="#tensorrt_llm.runtime.GenerationSequence" title="tensorrt_llm.runtime.kv_cache_manager.GenerationSequence"><span class="pre">GenerationSequence</span></a></span></em>, <em class="sig-param"><span class="n"><span class="pre">context_len</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">int</span></span></em><span class="sig-paren">)</span><a class="reference internal" href="../_modules/tensorrt_llm/runtime/kv_cache_manager.html#KVCacheManager.add_sequence"><span class="viewcode-link"><span class="pre">[source]</span></span></a><a class="headerlink" href="#tensorrt_llm.runtime.KVCacheManager.add_sequence" title="Link to this definition"></a></dt>
<dd><p>Add a sequence to the manager and allocate the minimum number of blocks needed to hold its context.</p>
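<p>As a rough illustration of what "minimum number of blocks" could mean, the sketch below assumes that every block holds exactly <code class="docutils literal notranslate"><span class="pre">tokens_per_block</span></code> tokens, so the count is a ceiling division. The helper name <code class="docutils literal notranslate"><span class="pre">blocks_for_context</span></code> is hypothetical and is not part of <code class="docutils literal notranslate"><span class="pre">tensorrt_llm</span></code>.</p>
<pre>
```python
def blocks_for_context(context_len: int, tokens_per_block: int):
    """Smallest number of fixed-size blocks that can hold context_len tokens.

    Hypothetical helper for illustration only; it assumes each block
    stores exactly tokens_per_block tokens.
    """
    # Ceiling division using integers only (no floating point rounding).
    return -(-context_len // tokens_per_block)


# With 64 tokens per block, a 100-token context spills into a second block.
print(blocks_for_context(100, 64))
```
</pre>
<p>Under this assumption a context that exactly fills its blocks (for example 128 tokens with 64 tokens per block) needs no extra block until the next token is generated.</p>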
</dd></dl>
<dl class="py method">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.KVCacheManager.get_pointer_arrays">
<span class="sig-name descname"><span class="pre">get_pointer_arrays</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">beam_width</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">int</span></span></em><span class="sig-paren">)</span> <span class="sig-return"><span class="sig-return-icon">&#x2192;</span> <span class="sig-return-typehint"><span class="pre">List</span><span class="p"><span class="pre">[</span></span><span class="pre">Tensor</span><span class="p"><span class="pre">]</span></span></span></span><a class="reference internal" href="../_modules/tensorrt_llm/runtime/kv_cache_manager.html#KVCacheManager.get_pointer_arrays"><span class="viewcode-link"><span class="pre">[source]</span></span></a><a class="headerlink" href="#tensorrt_llm.runtime.KVCacheManager.get_pointer_arrays" title="Link to this definition"></a></dt>
<dd><p>Return the arrays of pointers for all memory pools, copied to the GPU.</p>
</dd></dl>
<dl class="py method">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.KVCacheManager.step">
<span class="sig-name descname"><span class="pre">step</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">finished</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">List</span><span class="p"><span class="pre">[</span></span><span class="pre">bool</span><span class="p"><span class="pre">]</span></span></span></em><span class="sig-paren">)</span><a class="reference internal" href="../_modules/tensorrt_llm/runtime/kv_cache_manager.html#KVCacheManager.step"><span class="viewcode-link"><span class="pre">[source]</span></span></a><a class="headerlink" href="#tensorrt_llm.runtime.KVCacheManager.step" title="Link to this definition"></a></dt>
<dd><p>Iterate to the next generation step.
Add new blocks where needed and clear finished sequences.</p>
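<p>The behaviour described above can be pictured with a small toy model. This is a sketch under stated assumptions, not the library's implementation: it assumes each unfinished sequence grows by one token per step, and the names <code class="docutils literal notranslate"><span class="pre">ToySequence</span></code> and <code class="docutils literal notranslate"><span class="pre">ToyBlockManager</span></code> are hypothetical.</p>
<pre>
```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToySequence:
    """Hypothetical stand-in for a tracked sequence (not the library class)."""
    num_tokens: int
    blocks: List[int] = field(default_factory=list)


class ToyBlockManager:
    """Toy paged allocator; assumes one new token per sequence per step."""

    def __init__(self, num_blocks: int, tokens_per_block: int):
        self.free_blocks = list(range(num_blocks))
        self.tokens_per_block = tokens_per_block
        self.sequences: List[ToySequence] = []

    def add_sequence(self, context_len: int):
        # Ceiling division: fewest blocks that can hold the context.
        needed = -(-context_len // self.tokens_per_block)
        blocks = [self.free_blocks.pop() for _ in range(needed)]
        seq = ToySequence(context_len, blocks)
        self.sequences.append(seq)
        return seq

    def step(self, finished: List[bool]):
        survivors = []
        for seq, done in zip(self.sequences, finished):
            if done:
                # Return the finished sequence's blocks to the free pool.
                self.free_blocks.extend(seq.blocks)
                seq.blocks = []
                continue
            # A fresh block is needed only when every held block is full.
            if seq.num_tokens == len(seq.blocks) * self.tokens_per_block:
                seq.blocks.append(self.free_blocks.pop())
            seq.num_tokens += 1
            survivors.append(seq)
        self.sequences = survivors
```
</pre>
<p>In this toy model, a sequence whose context exactly fills its blocks receives one more block on the first step, while a finished sequence hands all of its blocks back before the remaining sequences are served.</p>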
</dd></dl>
</dd></dl>
<dl class="py class">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.ModelConfig">
<em class="property"><span class="pre">class</span><span class="w"> </span></em><span class="sig-prename descclassname"><span class="pre">tensorrt_llm.runtime.</span></span><span class="sig-name descname"><span class="pre">ModelConfig</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">vocab_size:</span> <span class="pre">int</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">num_layers:</span> <span class="pre">int</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">num_heads:</span> <span class="pre">int</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">num_kv_heads:</span> <span class="pre">int</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">hidden_size:</span> <span class="pre">int</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">gpt_attention_plugin:</span> <span class="pre">bool</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">remove_input_padding:</span> <span class="pre">bool</span> <span class="pre">=</span> <span class="pre">False</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">model_name:</span> <span class="pre">str</span> <span class="pre">=</span> <span class="pre">''</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">paged_kv_cache:</span> <span class="pre">bool</span> <span class="pre">=</span> <span class="pre">False</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">cross_attention:</span> <span class="pre">bool</span> <span class="pre">=</span> <span class="pre">False</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">head_size:</span> <span class="pre">int</span> <span class="pre">=</span> <span class="pre">None</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">has_position_embedding:</span> <span 
class="pre">bool</span> <span class="pre">=</span> <span class="pre">True</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">has_token_type_embedding:</span> <span class="pre">bool</span> <span class="pre">=</span> <span class="pre">False</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">tokens_per_block:</span> <span class="pre">int</span> <span class="pre">=</span> <span class="pre">64</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">max_prompt_embedding_table_size:</span> <span class="pre">int</span> <span class="pre">=</span> <span class="pre">0</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">quant_mode:</span> <span class="pre">tensorrt_llm.quantization.mode.QuantMode</span> <span class="pre">=</span> <span class="pre">&lt;QuantMode.0:</span> <span class="pre">0&gt;</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">gather_all_token_logits:</span> <span class="pre">bool</span> <span class="pre">=</span> <span class="pre">False</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">dtype:</span> <span class="pre">str</span> <span class="pre">=</span> <span class="pre">''</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">use_custom_all_reduce:</span> <span class="pre">bool</span> <span class="pre">=</span> <span class="pre">False</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">lora_plugin:</span> <span class="pre">bool</span> <span class="pre">=</span> <span class="pre">False</span></span></em><span class="sig-paren">)</span><a class="reference internal" href="../_modules/tensorrt_llm/runtime/generation.html#ModelConfig"><span class="viewcode-link"><span class="pre">[source]</span></span></a><a class="headerlink" href="#tensorrt_llm.runtime.ModelConfig" title="Link to this definition"></a></dt>
<dd><p>Bases: <code class="xref py py-class docutils literal notranslate"><span class="pre">object</span></code></p>
<dl class="py attribute">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.ModelConfig.cross_attention">
<span class="sig-name descname"><span class="pre">cross_attention</span></span><em class="property"><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="pre">bool</span></em><em class="property"><span class="w"> </span><span class="p"><span class="pre">=</span></span><span class="w"> </span><span class="pre">False</span></em><a class="headerlink" href="#tensorrt_llm.runtime.ModelConfig.cross_attention" title="Link to this definition"></a></dt>
<dd></dd></dl>
<dl class="py attribute">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.ModelConfig.dtype">
<span class="sig-name descname"><span class="pre">dtype</span></span><em class="property"><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="pre">str</span></em><em class="property"><span class="w"> </span><span class="p"><span class="pre">=</span></span><span class="w"> </span><span class="pre">''</span></em><a class="headerlink" href="#tensorrt_llm.runtime.ModelConfig.dtype" title="Link to this definition"></a></dt>
<dd></dd></dl>
<dl class="py attribute">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.ModelConfig.gather_all_token_logits">
<span class="sig-name descname"><span class="pre">gather_all_token_logits</span></span><em class="property"><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="pre">bool</span></em><em class="property"><span class="w"> </span><span class="p"><span class="pre">=</span></span><span class="w"> </span><span class="pre">False</span></em><a class="headerlink" href="#tensorrt_llm.runtime.ModelConfig.gather_all_token_logits" title="Link to this definition"></a></dt>
<dd></dd></dl>
<dl class="py attribute">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.ModelConfig.gpt_attention_plugin">
<span class="sig-name descname"><span class="pre">gpt_attention_plugin</span></span><em class="property"><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="pre">bool</span></em><a class="headerlink" href="#tensorrt_llm.runtime.ModelConfig.gpt_attention_plugin" title="Link to this definition"></a></dt>
<dd></dd></dl>
<dl class="py attribute">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.ModelConfig.has_position_embedding">
<span class="sig-name descname"><span class="pre">has_position_embedding</span></span><em class="property"><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="pre">bool</span></em><em class="property"><span class="w"> </span><span class="p"><span class="pre">=</span></span><span class="w"> </span><span class="pre">True</span></em><a class="headerlink" href="#tensorrt_llm.runtime.ModelConfig.has_position_embedding" title="Link to this definition"></a></dt>
<dd></dd></dl>
<dl class="py attribute">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.ModelConfig.has_token_type_embedding">
<span class="sig-name descname"><span class="pre">has_token_type_embedding</span></span><em class="property"><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="pre">bool</span></em><em class="property"><span class="w"> </span><span class="p"><span class="pre">=</span></span><span class="w"> </span><span class="pre">False</span></em><a class="headerlink" href="#tensorrt_llm.runtime.ModelConfig.has_token_type_embedding" title="Link to this definition"></a></dt>
<dd></dd></dl>
<dl class="py attribute">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.ModelConfig.head_size">
<span class="sig-name descname"><span class="pre">head_size</span></span><em class="property"><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="pre">int</span></em><em class="property"><span class="w"> </span><span class="p"><span class="pre">=</span></span><span class="w"> </span><span class="pre">None</span></em><a class="headerlink" href="#tensorrt_llm.runtime.ModelConfig.head_size" title="Link to this definition"></a></dt>
<dd></dd></dl>
<dl class="py attribute">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.ModelConfig.hidden_size">
<span class="sig-name descname"><span class="pre">hidden_size</span></span><em class="property"><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="pre">int</span></em><a class="headerlink" href="#tensorrt_llm.runtime.ModelConfig.hidden_size" title="Link to this definition"></a></dt>
<dd></dd></dl>
<dl class="py attribute">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.ModelConfig.lora_plugin">
<span class="sig-name descname"><span class="pre">lora_plugin</span></span><em class="property"><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="pre">bool</span></em><em class="property"><span class="w"> </span><span class="p"><span class="pre">=</span></span><span class="w"> </span><span class="pre">False</span></em><a class="headerlink" href="#tensorrt_llm.runtime.ModelConfig.lora_plugin" title="Link to this definition"></a></dt>
<dd></dd></dl>
<dl class="py attribute">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.ModelConfig.max_prompt_embedding_table_size">
<span class="sig-name descname"><span class="pre">max_prompt_embedding_table_size</span></span><em class="property"><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="pre">int</span></em><em class="property"><span class="w"> </span><span class="p"><span class="pre">=</span></span><span class="w"> </span><span class="pre">0</span></em><a class="headerlink" href="#tensorrt_llm.runtime.ModelConfig.max_prompt_embedding_table_size" title="Link to this definition"></a></dt>
<dd></dd></dl>
<dl class="py attribute">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.ModelConfig.model_name">
<span class="sig-name descname"><span class="pre">model_name</span></span><em class="property"><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="pre">str</span></em><em class="property"><span class="w"> </span><span class="p"><span class="pre">=</span></span><span class="w"> </span><span class="pre">''</span></em><a class="headerlink" href="#tensorrt_llm.runtime.ModelConfig.model_name" title="Link to this definition"></a></dt>
<dd></dd></dl>
<dl class="py attribute">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.ModelConfig.num_heads">
<span class="sig-name descname"><span class="pre">num_heads</span></span><em class="property"><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="pre">int</span></em><a class="headerlink" href="#tensorrt_llm.runtime.ModelConfig.num_heads" title="Link to this definition"></a></dt>
<dd></dd></dl>
<dl class="py attribute">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.ModelConfig.num_kv_heads">
<span class="sig-name descname"><span class="pre">num_kv_heads</span></span><em class="property"><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="pre">int</span></em><a class="headerlink" href="#tensorrt_llm.runtime.ModelConfig.num_kv_heads" title="Link to this definition"></a></dt>
<dd></dd></dl>
<dl class="py attribute">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.ModelConfig.num_layers">
<span class="sig-name descname"><span class="pre">num_layers</span></span><em class="property"><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="pre">int</span></em><a class="headerlink" href="#tensorrt_llm.runtime.ModelConfig.num_layers" title="Link to this definition"></a></dt>
<dd></dd></dl>
<dl class="py attribute">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.ModelConfig.paged_kv_cache">
<span class="sig-name descname"><span class="pre">paged_kv_cache</span></span><em class="property"><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="pre">bool</span></em><em class="property"><span class="w"> </span><span class="p"><span class="pre">=</span></span><span class="w"> </span><span class="pre">False</span></em><a class="headerlink" href="#tensorrt_llm.runtime.ModelConfig.paged_kv_cache" title="Link to this definition"></a></dt>
<dd></dd></dl>
<dl class="py attribute">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.ModelConfig.quant_mode">
<span class="sig-name descname"><span class="pre">quant_mode</span></span><em class="property"><span class="p"><span class="pre">:</span></span><span class="w"> </span><a class="reference internal" href="tensorrt_llm.quantization.html#tensorrt_llm.quantization.QuantMode" title="tensorrt_llm.quantization.mode.QuantMode"><span class="pre">QuantMode</span></a></em><em class="property"><span class="w"> </span><span class="p"><span class="pre">=</span></span><span class="w"> </span><span class="pre">0</span></em><a class="headerlink" href="#tensorrt_llm.runtime.ModelConfig.quant_mode" title="Link to this definition"></a></dt>
<dd></dd></dl>
<dl class="py attribute">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.ModelConfig.remove_input_padding">
<span class="sig-name descname"><span class="pre">remove_input_padding</span></span><em class="property"><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="pre">bool</span></em><em class="property"><span class="w"> </span><span class="p"><span class="pre">=</span></span><span class="w"> </span><span class="pre">False</span></em><a class="headerlink" href="#tensorrt_llm.runtime.ModelConfig.remove_input_padding" title="Link to this definition"></a></dt>
<dd></dd></dl>
<dl class="py attribute">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.ModelConfig.tokens_per_block">
<span class="sig-name descname"><span class="pre">tokens_per_block</span></span><em class="property"><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="pre">int</span></em><em class="property"><span class="w"> </span><span class="p"><span class="pre">=</span></span><span class="w"> </span><span class="pre">64</span></em><a class="headerlink" href="#tensorrt_llm.runtime.ModelConfig.tokens_per_block" title="Link to this definition"></a></dt>
<dd></dd></dl>
<dl class="py attribute">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.ModelConfig.use_custom_all_reduce">
<span class="sig-name descname"><span class="pre">use_custom_all_reduce</span></span><em class="property"><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="pre">bool</span></em><em class="property"><span class="w"> </span><span class="p"><span class="pre">=</span></span><span class="w"> </span><span class="pre">False</span></em><a class="headerlink" href="#tensorrt_llm.runtime.ModelConfig.use_custom_all_reduce" title="Link to this definition"></a></dt>
<dd></dd></dl>
<dl class="py attribute">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.ModelConfig.vocab_size">
<span class="sig-name descname"><span class="pre">vocab_size</span></span><em class="property"><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="pre">int</span></em><a class="headerlink" href="#tensorrt_llm.runtime.ModelConfig.vocab_size" title="Link to this definition"></a></dt>
<dd></dd></dl>
</dd></dl>
<dl class="py class">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.ModelRunner">
<em class="property"><span class="pre">class</span><span class="w"> </span></em><span class="sig-prename descclassname"><span class="pre">tensorrt_llm.runtime.</span></span><span class="sig-name descname"><span class="pre">ModelRunner</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">session</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><a class="reference internal" href="#tensorrt_llm.runtime.GenerationSession" title="tensorrt_llm.runtime.generation.GenerationSession"><span class="pre">GenerationSession</span></a></span></em>, <em class="sig-param"><span class="n"><span class="pre">max_batch_size</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">int</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">max_input_len</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">int</span></span></em><span class="sig-paren">)</span><a class="reference internal" href="../_modules/tensorrt_llm/runtime/model_runner.html#ModelRunner"><span class="viewcode-link"><span class="pre">[source]</span></span></a><a class="headerlink" href="#tensorrt_llm.runtime.ModelRunner" title="Link to this definition"></a></dt>
<dd><p>Bases: <code class="xref py py-class docutils literal notranslate"><span class="pre">object</span></code></p>
<p>An interface class that wraps GenerationSession and provides generation methods.</p>
<dl class="py method">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.ModelRunner.from_dir">
<em class="property"><span class="pre">classmethod</span><span class="w"> </span></em><span class="sig-name descname"><span class="pre">from_dir</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">engine_dir</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">str</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">rank</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">int</span></span><span class="w"> </span><span class="o"><span class="pre">=</span></span><span class="w"> </span><span class="default_value"><span class="pre">0</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">debug_mode</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">bool</span></span><span class="w"> </span><span class="o"><span class="pre">=</span></span><span class="w"> </span><span class="default_value"><span class="pre">False</span></span></em><span class="sig-paren">)</span> <span class="sig-return"><span class="sig-return-icon">&#x2192;</span> <span class="sig-return-typehint"><a class="reference internal" href="#tensorrt_llm.runtime.ModelRunner" title="tensorrt_llm.runtime.model_runner.ModelRunner"><span class="pre">ModelRunner</span></a></span></span><a class="reference internal" href="../_modules/tensorrt_llm/runtime/model_runner.html#ModelRunner.from_dir"><span class="viewcode-link"><span class="pre">[source]</span></span></a><a class="headerlink" href="#tensorrt_llm.runtime.ModelRunner.from_dir" title="Link to this definition"></a></dt>
<dd><p>Create a ModelRunner instance from an engine directory.</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters<span class="colon">:</span></dt>
<dd class="field-odd"><ul class="simple">
<li><p><strong>engine_dir</strong> (<em>str</em>) &#8211; The directory that contains the serialized engine files and config files.</p></li>
<li><p><strong>rank</strong> (<em>int</em>) &#8211; The runtime rank id.</p></li>
<li><p><strong>debug_mode</strong> (<em>bool</em>) &#8211; Whether to enable debug mode.</p></li>
</ul>
</dd>
<dt class="field-even">Returns<span class="colon">:</span></dt>
<dd class="field-even"><p>An instance of ModelRunner.</p>
</dd>
<dt class="field-odd">Return type<span class="colon">:</span></dt>
<dd class="field-odd"><p><a class="reference internal" href="#tensorrt_llm.runtime.ModelRunner" title="tensorrt_llm.runtime.ModelRunner">ModelRunner</a></p>
</dd>
</dl>
</dd></dl>
<dl class="py method">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.ModelRunner.generate">
<span class="sig-name descname"><span class="pre">generate</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">batch_input_ids</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">List</span><span class="p"><span class="pre">[</span></span><span class="pre">Tensor</span><span class="p"><span class="pre">]</span></span></span></em>, <em class="sig-param"><span class="n"><span class="pre">sampling_config</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">SamplingConfig</span><span class="w"> </span><span class="p"><span class="pre">|</span></span><span class="w"> </span><span class="pre">None</span></span><span class="w"> </span><span class="o"><span class="pre">=</span></span><span class="w"> </span><span class="default_value"><span class="pre">None</span></span></em>, <em class="sig-param"><span class="o"><span class="pre">**</span></span><span class="n"><span class="pre">kwargs</span></span></em><span class="sig-paren">)</span> <span class="sig-return"><span class="sig-return-icon">&#x2192;</span> <span class="sig-return-typehint"><span class="pre">Tensor</span><span class="w"> </span><span class="p"><span class="pre">|</span></span><span class="w"> </span><span class="pre">dict</span></span></span><a class="reference internal" href="../_modules/tensorrt_llm/runtime/model_runner.html#ModelRunner.generate"><span class="viewcode-link"><span class="pre">[source]</span></span></a><a class="headerlink" href="#tensorrt_llm.runtime.ModelRunner.generate" title="Link to this definition"></a></dt>
<dd><p>Generates sequences of token ids.
The generation-controlling parameters are set in sampling_config; a default configuration is used if none is passed.
You can override any of sampling_config's attributes by passing the corresponding keyword arguments.</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters<span class="colon">:</span></dt>
<dd class="field-odd"><ul class="simple">
<li><p><strong>batch_input_ids</strong> (<em>List</em><em>[</em><em>torch.Tensor</em><em>]</em>) &#8211; A list of input id tensors. Each tensor is of shape (sequence_length, ).</p></li>
<li><p><strong>sampling_config</strong> (<em>Optional</em><em>[</em><em>SamplingConfig</em><em>]</em>) &#8211; The sampling configuration used as the base parametrization for the generation call.
Any <code class="docutils literal notranslate"><span class="pre">**kwargs</span></code> that match sampling_config's attributes override them.
If sampling_config is not provided, a default is used.</p></li>
<li><p><strong>kwargs</strong> (<em>Dict</em><em>[</em><em>str</em><em>, </em><em>Any</em><em>]</em>) &#8211; Ad hoc parametrization of sampling_config.
Any <code class="docutils literal notranslate"><span class="pre">**kwargs</span></code> that match sampling_config's attributes override them.</p></li>
</ul>
</dd>
<dt class="field-even">Returns<span class="colon">:</span></dt>
<dd class="field-even"><p>If return_dict=False, the method returns generated output_ids.
If return_dict=True, the method returns a dict of output_ids,
sequence_lengths (if sampling_config.output_sequence_lengths=True),
context_logits and generation_logits (if self.session.gather_all_token_logits=True).</p>
</dd>
<dt class="field-odd">Return type<span class="colon">:</span></dt>
<dd class="field-odd"><p>torch.Tensor or dict</p>
</dd>
</dl>
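<p>The override rule described above can be illustrated with a minimal, self-contained sketch. It does not call tensorrt_llm: <code class="docutils literal notranslate"><span class="pre">DemoSamplingConfig</span></code> and <code class="docutils literal notranslate"><span class="pre">resolve_sampling_config</span></code> are hypothetical stand-ins, the field names and defaults are illustrative, and ignoring non-matching keyword arguments is an assumption of this sketch rather than documented library behaviour.</p>

```python
import copy
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class DemoSamplingConfig:
    """Hypothetical stand-in for a sampling configuration (not the real class)."""
    end_id: int = 2
    pad_id: int = 0
    max_new_tokens: int = 32
    temperature: float = 1.0


def resolve_sampling_config(sampling_config: Optional[DemoSamplingConfig] = None,
                            **kwargs: Any) -> DemoSamplingConfig:
    """Return the configuration a generate-style call would end up using."""
    # Use a default configuration when none is passed; otherwise copy the
    # caller's object so the overrides do not mutate it.
    if sampling_config is None:
        resolved = DemoSamplingConfig()
    else:
        resolved = copy.deepcopy(sampling_config)
    # Keyword arguments that match an attribute override that attribute.
    # Assumption of this sketch: non-matching keyword arguments are skipped.
    for name, value in kwargs.items():
        if hasattr(resolved, name):
            setattr(resolved, name, value)
    return resolved


base = DemoSamplingConfig(max_new_tokens=64)
resolved = resolve_sampling_config(base, temperature=0.7, return_dict=True)
print(resolved)
```

<p>Copying the base configuration before applying overrides keeps one shared configuration reusable across calls that each tweak a different attribute.</p>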
</dd></dl>
<dl class="py property">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.ModelRunner.remove_input_padding">
<em class="property"><span class="pre">property</span><span class="w"> </span></em><span class="sig-name descname"><span class="pre">remove_input_padding</span></span><em class="property"><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="pre">bool</span></em><a class="headerlink" href="#tensorrt_llm.runtime.ModelRunner.remove_input_padding" title="Link to this definition"></a></dt>
<dd></dd></dl>
</dd></dl>
<dl class="py class">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.Session">
<em class="property"><span class="pre">class</span><span class="w"> </span></em><span class="sig-prename descclassname"><span class="pre">tensorrt_llm.runtime.</span></span><span class="sig-name descname"><span class="pre">Session</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="o"><span class="pre">**</span></span><span class="n"><span class="pre">kwargs</span></span></em><span class="sig-paren">)</span><a class="reference internal" href="../_modules/tensorrt_llm/runtime/session.html#Session"><span class="viewcode-link"><span class="pre">[source]</span></span></a><a class="headerlink" href="#tensorrt_llm.runtime.Session" title="Link to this definition"></a></dt>
<dd><p>Bases: <code class="xref py py-class docutils literal notranslate"><span class="pre">object</span></code></p>
<p>Session is a managed TensorRT runtime.</p>
<dl class="py property">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.Session.context">
<em class="property"><span class="pre">property</span><span class="w"> </span></em><span class="sig-name descname"><span class="pre">context</span></span><em class="property"><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="pre">IExecutionContext</span></em><a class="headerlink" href="#tensorrt_llm.runtime.Session.context" title="Link to this definition"></a></dt>
<dd><p>Get the default TensorRT execution context.
Use self.engine.create_execution_context() to create a new context if needed.</p>
<dl class="field-list simple">
<dt class="field-odd">Returns<span class="colon">:</span></dt>
<dd class="field-odd"><p>A TensorRT execution context object.</p>
</dd>
</dl>
</dd></dl>
<dl class="py property">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.Session.engine">
<em class="property"><span class="pre">property</span><span class="w"> </span></em><span class="sig-name descname"><span class="pre">engine</span></span><em class="property"><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="pre">ICudaEngine</span></em><a class="headerlink" href="#tensorrt_llm.runtime.Session.engine" title="Link to this definition"></a></dt>
<dd></dd></dl>
<dl class="py method">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.Session.from_engine">
<em class="property"><span class="pre">static</span><span class="w"> </span></em><span class="sig-name descname"><span class="pre">from_engine</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">engine</span></span></em><span class="sig-paren">)</span> <span class="sig-return"><span class="sig-return-icon">&#x2192;</span> <span class="sig-return-typehint"><a class="reference internal" href="#tensorrt_llm.runtime.Session" title="tensorrt_llm.runtime.session.Session"><span class="pre">Session</span></a></span></span><a class="reference internal" href="../_modules/tensorrt_llm/runtime/session.html#Session.from_engine"><span class="viewcode-link"><span class="pre">[source]</span></span></a><a class="headerlink" href="#tensorrt_llm.runtime.Session.from_engine" title="Link to this definition"></a></dt>
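<dd><p>Create a session from an existing ICudaEngine.</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters<span class="colon">:</span></dt>
<dd class="field-odd"><p><strong>engine</strong> &#8211; An ICudaEngine.</p>
</dd>
<dt class="field-even">Returns<span class="colon">:</span></dt>
<dd class="field-even"><p>A Session object.</p>
</dd>
</dl>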
</dd></dl>
<dl class="py method">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.Session.from_serialized_engine">
<em class="property"><span class="pre">static</span><span class="w"> </span></em><span class="sig-name descname"><span class="pre">from_serialized_engine</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">engine</span></span></em><span class="sig-paren">)</span> <span class="sig-return"><span class="sig-return-icon">&#x2192;</span> <span class="sig-return-typehint"><a class="reference internal" href="#tensorrt_llm.runtime.Session" title="tensorrt_llm.runtime.session.Session"><span class="pre">Session</span></a></span></span><a class="reference internal" href="../_modules/tensorrt_llm/runtime/session.html#Session.from_serialized_engine"><span class="viewcode-link"><span class="pre">[source]</span></span></a><a class="headerlink" href="#tensorrt_llm.runtime.Session.from_serialized_engine" title="Link to this definition"></a></dt>
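<dd><p>Create a session from a serialized engine.</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters<span class="colon">:</span></dt>
<dd class="field-odd"><p><strong>engine</strong> &#8211; A serialized engine.</p>
</dd>
<dt class="field-even">Returns<span class="colon">:</span></dt>
<dd class="field-even"><p>A Session object.</p>
</dd>
</dl>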
</dd></dl>
<dl class="py method">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.Session.infer_shapes">
<span class="sig-name descname"><span class="pre">infer_shapes</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">inputs</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">List</span><span class="p"><span class="pre">[</span></span><a class="reference internal" href="#tensorrt_llm.runtime.TensorInfo" title="tensorrt_llm.runtime.session.TensorInfo"><span class="pre">TensorInfo</span></a><span class="p"><span class="pre">]</span></span></span></em>, <em class="sig-param"><span class="n"><span class="pre">context</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">IExecutionContext</span><span class="w"> </span><span class="p"><span class="pre">|</span></span><span class="w"> </span><span class="pre">None</span></span><span class="w"> </span><span class="o"><span class="pre">=</span></span><span class="w"> </span><span class="default_value"><span class="pre">None</span></span></em><span class="sig-paren">)</span> <span class="sig-return"><span class="sig-return-icon">&#x2192;</span> <span class="sig-return-typehint"><span class="pre">List</span><span class="p"><span class="pre">[</span></span><a class="reference internal" href="#tensorrt_llm.runtime.TensorInfo" title="tensorrt_llm.runtime.session.TensorInfo"><span class="pre">TensorInfo</span></a><span class="p"><span class="pre">]</span></span></span></span><a class="reference internal" href="../_modules/tensorrt_llm/runtime/session.html#Session.infer_shapes"><span class="viewcode-link"><span class="pre">[source]</span></span></a><a class="headerlink" href="#tensorrt_llm.runtime.Session.infer_shapes" title="Link to this definition"></a></dt>
<dd><p>Set the input shapes on the given context and infer the output shapes from them.
Call this function before run() whenever the input shapes change,
or call context.set_input_shape manually on every dynamically shaped input tensor.</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters<span class="colon">:</span></dt>
<dd class="field-odd"><ul class="simple">
<li><p><strong>inputs</strong> &#8211; A list of TensorInfo objects, each representing an input tensor.</p></li>
<li><p><strong>context</strong> &#8211; The TensorRT execution context. If None, the default context is used.</p></li>
</ul>
</dd>
<dt class="field-even">Returns<span class="colon">:</span></dt>
<dd class="field-even"><p>A list of TensorInfo objects, each representing an output tensor, or None on failure.</p>
</dd>
</dl>
</dd></dl>
<dl class="py method">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.Session.run">
<span class="sig-name descname"><span class="pre">run</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">inputs</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">Dict</span><span class="p"><span class="pre">[</span></span><span class="pre">str</span><span class="p"><span class="pre">,</span></span><span class="w"> </span><span class="pre">Any</span><span class="p"><span class="pre">]</span></span></span></em>, <em class="sig-param"><span class="n"><span class="pre">outputs</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">Dict</span><span class="p"><span class="pre">[</span></span><span class="pre">str</span><span class="p"><span class="pre">,</span></span><span class="w"> </span><span class="pre">Any</span><span class="p"><span class="pre">]</span></span></span></em>, <em class="sig-param"><span class="n"><span class="pre">stream</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">context</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">None</span></span></em><span class="sig-paren">)</span> <span class="sig-return"><span class="sig-return-icon">&#x2192;</span> <span class="sig-return-typehint"><span class="pre">bool</span></span></span><a class="reference internal" href="../_modules/tensorrt_llm/runtime/session.html#Session.run"><span class="viewcode-link"><span class="pre">[source]</span></span></a><a class="headerlink" href="#tensorrt_llm.runtime.Session.run" title="Link to this definition"></a></dt>
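<dd><p>Run the TensorRT engine with the given inputs and outputs.</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters<span class="colon">:</span></dt>
<dd class="field-odd"><ul class="simple">
<li><p><strong>inputs</strong> &#8211; A dict of input tensors. Each key is a tensor name and each value is a tensor pointer or a torch tensor.</p></li>
<li><p><strong>outputs</strong> &#8211; A dict of output tensors. Each key is a tensor name and each value is a tensor pointer or a torch tensor.</p></li>
<li><p><strong>stream</strong> &#8211; The CUDA stream on which to enqueue the TensorRT engine.</p></li>
<li><p><strong>context</strong> &#8211; The TensorRT execution context. If None, the default context is used.</p></li>
</ul>
</dd>
<dt class="field-even">Returns<span class="colon">:</span></dt>
<dd class="field-even"><p>True if the enqueue succeeded. The enqueue is asynchronous, so returning True does not mean that execution has finished.</p>
</dd>
</dl>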
</dd></dl>
<dl class="py property">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.Session.runtime">
<em class="property"><span class="pre">property</span><span class="w"> </span></em><span class="sig-name descname"><span class="pre">runtime</span></span><em class="property"><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="pre">Runtime</span></em><a class="headerlink" href="#tensorrt_llm.runtime.Session.runtime" title="Link to this definition"></a></dt>
<dd></dd></dl>
<dl class="py method">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.Session.set_shapes">
<span class="sig-name descname"><span class="pre">set_shapes</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">tensor_dict</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">Dict</span><span class="p"><span class="pre">[</span></span><span class="pre">str</span><span class="p"><span class="pre">,</span></span><span class="w"> </span><span class="pre">Tensor</span><span class="p"><span class="pre">]</span></span></span></em>, <em class="sig-param"><span class="n"><span class="pre">context</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">IExecutionContext</span><span class="w"> </span><span class="p"><span class="pre">|</span></span><span class="w"> </span><span class="pre">None</span></span><span class="w"> </span><span class="o"><span class="pre">=</span></span><span class="w"> </span><span class="default_value"><span class="pre">None</span></span></em><span class="sig-paren">)</span><a class="reference internal" href="../_modules/tensorrt_llm/runtime/session.html#Session.set_shapes"><span class="viewcode-link"><span class="pre">[source]</span></span></a><a class="headerlink" href="#tensorrt_llm.runtime.Session.set_shapes" title="Link to this definition"></a></dt>
<dd></dd></dl>
</dd></dl>
<dl class="py class">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.TensorInfo">
<em class="property"><span class="pre">class</span><span class="w"> </span></em><span class="sig-prename descclassname"><span class="pre">tensorrt_llm.runtime.</span></span><span class="sig-name descname"><span class="pre">TensorInfo</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">name</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="s"><span class="pre">'str'</span></span></span></em>, <em class="sig-param"><span class="n"><span class="pre">dtype</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="s"><span class="pre">'trt.DataType'</span></span></span></em>, <em class="sig-param"><span class="n"><span class="pre">shape</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="s"><span class="pre">'tuple'</span></span></span></em><span class="sig-paren">)</span><a class="reference internal" href="../_modules/tensorrt_llm/runtime/session.html#TensorInfo"><span class="viewcode-link"><span class="pre">[source]</span></span></a><a class="headerlink" href="#tensorrt_llm.runtime.TensorInfo" title="Link to this definition"></a></dt>
<dd><p>Bases: <code class="xref py py-class docutils literal notranslate"><span class="pre">object</span></code></p>
<dl class="py attribute">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.TensorInfo.dtype">
<span class="sig-name descname"><span class="pre">dtype</span></span><em class="property"><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="pre">DataType</span></em><a class="headerlink" href="#tensorrt_llm.runtime.TensorInfo.dtype" title="Link to this definition"></a></dt>
<dd></dd></dl>
<dl class="py attribute">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.TensorInfo.name">
<span class="sig-name descname"><span class="pre">name</span></span><em class="property"><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="pre">str</span></em><a class="headerlink" href="#tensorrt_llm.runtime.TensorInfo.name" title="Link to this definition"></a></dt>
<dd></dd></dl>
<dl class="py attribute">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.TensorInfo.shape">
<span class="sig-name descname"><span class="pre">shape</span></span><em class="property"><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="pre">tuple</span></em><a class="headerlink" href="#tensorrt_llm.runtime.TensorInfo.shape" title="Link to this definition"></a></dt>
<dd></dd></dl>
</dd></dl>
<dl class="py function">
<dt class="sig sig-object py" id="tensorrt_llm.runtime.to_word_list_format">
<span class="sig-prename descclassname"><span class="pre">tensorrt_llm.runtime.</span></span><span class="sig-name descname"><span class="pre">to_word_list_format</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">word_dict</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">List</span><span class="p"><span class="pre">[</span></span><span class="pre">List</span><span class="p"><span class="pre">[</span></span><span class="pre">str</span><span class="p"><span class="pre">]</span></span><span class="p"><span class="pre">]</span></span></span></em>, <em class="sig-param"><span class="n"><span class="pre">tokenizer</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">None</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">add_special_tokens</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">False</span></span></em><span class="sig-paren">)</span><a class="reference internal" href="../_modules/tensorrt_llm/runtime/generation.html#to_word_list_format"><span class="viewcode-link"><span class="pre">[source]</span></span></a><a class="headerlink" href="#tensorrt_llm.runtime.to_word_list_format" title="Link to this definition"></a></dt>
<dd><dl class="simple">
<dt>Format of word_dict</dt><dd><ul class="simple">
<li><p>len(word_dict) must equal batch_size.</p></li>
<li><p>word_dict[i] holds the words for batch item i.</p></li>
<li><p>len(word_dict[i]) must be 1, that is, it contains exactly one string.</p></li>
<li><p>The string can contain several phrases separated by “,”.</p></li>
</ul>
<p>For example, if word_dict[2] holds “ I am happy, I am sad”, this function returns
the ids for the two phrases “ I am happy” and “ I am sad”.</p>
</dd>
</dl>
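<p>The input format above can be illustrated with a short sketch of the comma-splitting step only. It assumes CSV-style splitting via Python's standard csv module and omits tokenization entirely; <code class="docutils literal notranslate"><span class="pre">split_word_dict</span></code> is a hypothetical helper, not part of tensorrt_llm.</p>

```python
import csv
from typing import List


def split_word_dict(word_dict: List[List[str]]) -> List[List[str]]:
    """Hypothetical helper: split each batch entry's single string on commas."""
    phrases_per_batch = []
    for entry in word_dict:
        # Each batch entry must hold exactly one comma-separated string.
        if len(entry) != 1:
            raise ValueError("each word_dict[i] must contain exactly one string")
        # csv.reader keeps leading spaces, so " I am sad" stays intact.
        phrases_per_batch.append(next(csv.reader(entry)))
    return phrases_per_batch


batch = [[" I am happy, I am sad"], ["stop"]]
phrases = split_word_dict(batch)
print(phrases)
```

<p>In the real function, each resulting phrase would then be passed to the tokenizer to obtain its ids; that step depends on the tokenizer supplied by the caller and is left out here.</p>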
</dd></dl>
</section>
</div>
</div>
<footer><div class="rst-footer-buttons" role="navigation" aria-label="Footer">
<a href="tensorrt_llm.quantization.html" class="btn btn-neutral float-left" title="Quantization" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left" aria-hidden="true"></span> Previous</a>
<a href="../_cpp_gen/runtime.html" class="btn btn-neutral float-right" title="Runtime" accesskey="n" rel="next">Next <span class="fa fa-arrow-circle-right" aria-hidden="true"></span></a>
</div>
<hr/>
<div role="contentinfo">
<p>&#169; Copyright 2023, NVIDIA.</p>
</div>
Built with <a href="https://www.sphinx-doc.org/">Sphinx</a> using a
<a href="https://github.com/readthedocs/sphinx_rtd_theme">theme</a>
provided by <a href="https://readthedocs.org">Read the Docs</a>.
</footer>
</div>
</div>
</section>
</div>
<script>
jQuery(function () {
SphinxRtdTheme.Navigation.enable(true);
});
</script>
</body>
</html>