TensorRT-LLMs/performance.html
2023-10-19 12:25:48 +00:00


<!DOCTYPE html>
<html class="writer-html5" lang="en" >
<head>
<meta charset="utf-8" /><meta name="generator" content="Docutils 0.18.1: http://docutils.sourceforge.net/" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>Performance of TensorRT-LLM &mdash; tensorrt_llm documentation</title>
<link rel="stylesheet" href="_static/pygments.css" type="text/css" />
<link rel="stylesheet" href="_static/css/theme.css" type="text/css" />
<!--[if lt IE 9]>
<script src="_static/js/html5shiv.min.js"></script>
<![endif]-->
<script src="_static/jquery.js?v=5d32c60e"></script>
<script src="_static/_sphinx_javascript_frameworks_compat.js?v=2cd50e6c"></script>
<script data-url_root="./" id="documentation_options" src="_static/documentation_options.js?v=b3ba4146"></script>
<script src="_static/doctools.js?v=888ff710"></script>
<script src="_static/sphinx_highlight.js?v=4825356b"></script>
<script src="_static/js/theme.js"></script>
<link rel="index" title="Index" href="genindex.html" />
<link rel="search" title="Search" href="search.html" />
<link rel="next" title="Build From Sources" href="installation.html" />
<link rel="prev" title="Numerical Precision" href="precision.html" />
</head>
<body class="wy-body-for-nav">
<div class="wy-grid-for-nav">
<nav data-toggle="wy-nav-shift" class="wy-nav-side">
<div class="wy-side-scroll">
<div class="wy-side-nav-search" >
<a href="index.html" class="icon icon-home">
tensorrt_llm
</a>
<div role="search">
<form id="rtd-search-form" class="wy-form" action="search.html" method="get">
<input type="text" name="q" placeholder="Search docs" aria-label="Search docs" />
<input type="hidden" name="check_keywords" value="yes" />
<input type="hidden" name="area" value="default" />
</form>
</div>
</div><div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="Navigation menu">
<p class="caption" role="heading"><span class="caption-text">Contents:</span></p>
<ul class="current">
<li class="toctree-l1"><a class="reference internal" href="architecture.html">TensorRT-LLM Architecture</a></li>
<li class="toctree-l1"><a class="reference internal" href="gpt_runtime.html">C++ GPT Runtime</a></li>
<li class="toctree-l1"><a class="reference internal" href="batch_manager.html">The Batch Manager in TensorRT-LLM</a></li>
<li class="toctree-l1"><a class="reference internal" href="gpt_attention.html">Multi-head, Multi-query and Group-query Attention</a></li>
<li class="toctree-l1"><a class="reference internal" href="precision.html">Numerical Precision</a></li>
<li class="toctree-l1 current"><a class="current reference internal" href="#">Performance of TensorRT-LLM</a><ul>
<li class="toctree-l2"><a class="reference internal" href="#methodology">Methodology</a></li>
<li class="toctree-l2"><a class="reference internal" href="#high-throughput">High Throughput</a><ul>
<li class="toctree-l3"><a class="reference internal" href="#h100-gpus-fp8">H100 GPUs (FP8)</a></li>
<li class="toctree-l3"><a class="reference internal" href="#l40s-gpus-fp8">L40S GPUs (FP8)</a></li>
<li class="toctree-l3"><a class="reference internal" href="#a100-gpus-fp16">A100 GPUs (FP16)</a></li>
</ul>
</li>
<li class="toctree-l2"><a class="reference internal" href="#low-latency">Low Latency</a><ul>
<li class="toctree-l3"><a class="reference internal" href="#id1">H100 GPUs (FP8)</a></li>
<li class="toctree-l3"><a class="reference internal" href="#id2">L40S GPUs (FP8)</a></li>
<li class="toctree-l3"><a class="reference internal" href="#id3">A100 GPUs (FP16)</a></li>
</ul>
</li>
<li class="toctree-l2"><a class="reference internal" href="#known-issues">Known Issues</a><ul>
<li class="toctree-l3"><a class="reference internal" href="#fused-matmul-gated-silu-llama">Fused Matmul + Gated-SiLU (LLaMA)</a></li>
</ul>
</li>
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="installation.html">Build From Sources</a></li>
<li class="toctree-l1"><a class="reference internal" href="2023-05-19-how-to-debug.html">How to debug</a></li>
<li class="toctree-l1"><a class="reference internal" href="2023-05-17-how-to-add-a-new-model.html">How to add a new model</a></li>
<li class="toctree-l1"><a class="reference internal" href="graph-rewriting.html">Graph Rewriting Module</a></li>
</ul>
<p class="caption" role="heading"><span class="caption-text">Python API</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="python-api/tensorrt_llm.layers.html">Layers</a></li>
<li class="toctree-l1"><a class="reference internal" href="python-api/tensorrt_llm.functional.html">Functionals</a></li>
<li class="toctree-l1"><a class="reference internal" href="python-api/tensorrt_llm.models.html">Models</a></li>
<li class="toctree-l1"><a class="reference internal" href="python-api/tensorrt_llm.plugin.html">Plugin</a></li>
<li class="toctree-l1"><a class="reference internal" href="python-api/tensorrt_llm.quantization.html">Quantization</a></li>
<li class="toctree-l1"><a class="reference internal" href="python-api/tensorrt_llm.runtime.html">Runtime</a></li>
</ul>
<p class="caption" role="heading"><span class="caption-text">C++ API</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="_cpp_gen/runtime.html">Runtime</a></li>
</ul>
</div>
</div>
</nav>
<section data-toggle="wy-nav-shift" class="wy-nav-content-wrap"><nav class="wy-nav-top" aria-label="Mobile navigation menu" >
<i data-toggle="wy-nav-top" class="fa fa-bars"></i>
<a href="index.html">tensorrt_llm</a>
</nav>
<div class="wy-nav-content">
<div class="rst-content">
<div role="navigation" aria-label="Page navigation">
<ul class="wy-breadcrumbs">
<li><a href="index.html" class="icon icon-home" aria-label="Home"></a></li>
<li class="breadcrumb-item active">Performance of TensorRT-LLM</li>
<li class="wy-breadcrumbs-aside">
<a href="_sources/performance.md.txt" rel="nofollow"> View page source</a>
</li>
</ul>
<hr/>
</div>
<div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
<div itemprop="articleBody">
<section id="performance-of-tensorrt-llm">
<h1>Performance of TensorRT-LLM<a class="headerlink" href="#performance-of-tensorrt-llm" title="Permalink to this heading"></a></h1>
<p>This document summarizes performance measurements of TensorRT-LLM on H100
(Hopper), L40S (Ada) and A100 (Ampere) GPUs for a few key models.</p>
<p>The data in the following tables is provided as a reference point to help users
validate observed performance. It should not be taken as the peak performance
that TensorRT-LLM can deliver.</p>
<section id="methodology">
<h2>Methodology<a class="headerlink" href="#methodology" title="Permalink to this heading"></a></h2>
<p>The performance numbers below were collected using the methodology
described in the benchmarks <a class="reference external" href="https://github.com/NVIDIA/TensorRT-LLM/tree/release/0.5.0/benchmarks/">folder</a>.</p>
</section>
<section id="high-throughput">
<h2>High Throughput<a class="headerlink" href="#high-throughput" title="Permalink to this heading"></a></h2>
<p>The tables below provide reference data at large batch sizes, representing
high-throughput tasks.</p>
<section id="h100-gpus-fp8">
<h3>H100 GPUs (FP8)<a class="headerlink" href="#h100-gpus-fp8" title="Permalink to this heading"></a></h3>
<table class="docutils align-default">
<thead>
<tr class="row-odd"><th class="head text-left"><p>Model</p></th>
<th class="head text-left"><p>Batch Size</p></th>
<th class="head text-left"><p>TP (1)</p></th>
<th class="head text-left"><p>Input Length</p></th>
<th class="head text-left"><p>Output Length</p></th>
<th class="head text-right"><p>Throughput (out tok/s)</p></th>
</tr>
</thead>
<tbody>
<tr class="row-even"><td class="text-left"><p>GPT-J 6B</p></td>
<td class="text-left"><p>64</p></td>
<td class="text-left"><p>1</p></td>
<td class="text-left"><p>128</p></td>
<td class="text-left"><p>128</p></td>
<td class="text-right"><p>10,907</p></td>
</tr>
<tr class="row-odd"><td class="text-left"><p>GPT-J 6B</p></td>
<td class="text-left"><p>64</p></td>
<td class="text-left"><p>1</p></td>
<td class="text-left"><p>128</p></td>
<td class="text-left"><p>2048</p></td>
<td class="text-right"><p>6,179</p></td>
</tr>
<tr class="row-even"><td class="text-left"><p>GPT-J 6B</p></td>
<td class="text-left"><p>64</p></td>
<td class="text-left"><p>1</p></td>
<td class="text-left"><p>2048</p></td>
<td class="text-left"><p>128</p></td>
<td class="text-right"><p>2,229</p></td>
</tr>
<tr class="row-odd"><td class="text-left"><p>GPT-J 6B</p></td>
<td class="text-left"><p>64</p></td>
<td class="text-left"><p>1</p></td>
<td class="text-left"><p>2048</p></td>
<td class="text-left"><p>2048</p></td>
<td class="text-right"><p>2,980</p></td>
</tr>
<tr class="row-even"><td class="text-left"><p></p></td>
<td class="text-left"><p></p></td>
<td class="text-left"><p></p></td>
<td class="text-left"><p></p></td>
<td class="text-left"><p></p></td>
<td class="text-right"><p></p></td>
</tr>
<tr class="row-odd"><td class="text-left"><p>LLaMA 7B</p></td>
<td class="text-left"><p>64</p></td>
<td class="text-left"><p>1</p></td>
<td class="text-left"><p>128</p></td>
<td class="text-left"><p>128</p></td>
<td class="text-right"><p>9,193</p></td>
</tr>
<tr class="row-even"><td class="text-left"><p>LLaMA 7B</p></td>
<td class="text-left"><p>64</p></td>
<td class="text-left"><p>1</p></td>
<td class="text-left"><p>128</p></td>
<td class="text-left"><p>2048</p></td>
<td class="text-right"><p>5,367</p></td>
</tr>
<tr class="row-odd"><td class="text-left"><p>LLaMA 7B</p></td>
<td class="text-left"><p>64</p></td>
<td class="text-left"><p>1</p></td>
<td class="text-left"><p>2048</p></td>
<td class="text-left"><p>128</p></td>
<td class="text-right"><p>2,058</p></td>
</tr>
<tr class="row-even"><td class="text-left"><p>LLaMA 7B</p></td>
<td class="text-left"><p>32</p></td>
<td class="text-left"><p>1</p></td>
<td class="text-left"><p>2048</p></td>
<td class="text-left"><p>2048</p></td>
<td class="text-right"><p>2,230</p></td>
</tr>
<tr class="row-odd"><td class="text-left"><p></p></td>
<td class="text-left"><p></p></td>
<td class="text-left"><p></p></td>
<td class="text-left"><p></p></td>
<td class="text-left"><p></p></td>
<td class="text-right"><p></p></td>
</tr>
<tr class="row-even"><td class="text-left"><p>LLaMA 70B</p></td>
<td class="text-left"><p>64</p></td>
<td class="text-left"><p>4</p></td>
<td class="text-left"><p>128</p></td>
<td class="text-left"><p>128</p></td>
<td class="text-right"><p>3,317</p></td>
</tr>
<tr class="row-odd"><td class="text-left"><p>LLaMA 70B</p></td>
<td class="text-left"><p>64</p></td>
<td class="text-left"><p>4</p></td>
<td class="text-left"><p>128</p></td>
<td class="text-left"><p>2048</p></td>
<td class="text-right"><p>2,616</p></td>
</tr>
<tr class="row-even"><td class="text-left"><p>LLaMA 70B</p></td>
<td class="text-left"><p>64</p></td>
<td class="text-left"><p>4</p></td>
<td class="text-left"><p>2048</p></td>
<td class="text-left"><p>128</p></td>
<td class="text-right"><p>843</p></td>
</tr>
<tr class="row-odd"><td class="text-left"><p>LLaMA 70B</p></td>
<td class="text-left"><p>64</p></td>
<td class="text-left"><p>4</p></td>
<td class="text-left"><p>2048</p></td>
<td class="text-left"><p>2048</p></td>
<td class="text-right"><p>1,583</p></td>
</tr>
<tr class="row-even"><td class="text-left"><p></p></td>
<td class="text-left"><p></p></td>
<td class="text-left"><p></p></td>
<td class="text-left"><p></p></td>
<td class="text-left"><p></p></td>
<td class="text-right"><p></p></td>
</tr>
<tr class="row-odd"><td class="text-left"><p>Falcon 180B</p></td>
<td class="text-left"><p>96</p></td>
<td class="text-left"><p>8</p></td>
<td class="text-left"><p>128</p></td>
<td class="text-left"><p>128</p></td>
<td class="text-right"><p>2,686</p></td>
</tr>
<tr class="row-even"><td class="text-left"><p>Falcon 180B</p></td>
<td class="text-left"><p>96</p></td>
<td class="text-left"><p>8</p></td>
<td class="text-left"><p>128</p></td>
<td class="text-left"><p>2048</p></td>
<td class="text-right"><p>2,073</p></td>
</tr>
<tr class="row-odd"><td class="text-left"><p>Falcon 180B</p></td>
<td class="text-left"><p>64</p></td>
<td class="text-left"><p>8</p></td>
<td class="text-left"><p>2048</p></td>
<td class="text-left"><p>128</p></td>
<td class="text-right"><p>465</p></td>
</tr>
</tbody>
</table>
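<p>As a back-of-envelope sanity check, a throughput figure can be converted into an
approximate batch completion time: the batch produces <em>batch size × output length</em>
tokens, so dividing by the output token rate gives the time one batch takes. This is
an estimate only; it ignores scheduling and prefill overheads and is not an official
metric. A minimal sketch, using the GPT-J 6B row from the table above:</p>

```python
def batch_time_s(batch_size, output_len, throughput_tok_s):
    """Approximate time to complete one batch, ignoring overheads."""
    return batch_size * output_len / throughput_tok_s

# GPT-J 6B on H100 FP8: batch 64, output length 128, 10,907 out tok/s
t = batch_time_s(64, 128, 10907)
print(round(t, 2))  # ~0.75 s per batch
```

<p>The same arithmetic applies to any row in the tables on this page.</p>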
</section>
<section id="l40s-gpus-fp8">
<h3>L40S GPUs (FP8)<a class="headerlink" href="#l40s-gpus-fp8" title="Permalink to this heading"></a></h3>
<table class="docutils align-default">
<thead>
<tr class="row-odd"><th class="head text-left"><p>Model</p></th>
<th class="head text-left"><p>Batch Size</p></th>
<th class="head text-left"><p>TP (1)</p></th>
<th class="head text-left"><p>Input Length</p></th>
<th class="head text-left"><p>Output Length</p></th>
<th class="head text-right"><p>Throughput (out tok/s)</p></th>
</tr>
</thead>
<tbody>
<tr class="row-even"><td class="text-left"><p>GPT-J 6B</p></td>
<td class="text-left"><p>64</p></td>
<td class="text-left"><p>1</p></td>
<td class="text-left"><p>128</p></td>
<td class="text-left"><p>128</p></td>
<td class="text-right"><p>3,630</p></td>
</tr>
<tr class="row-odd"><td class="text-left"><p>GPT-J 6B</p></td>
<td class="text-left"><p>64</p></td>
<td class="text-left"><p>1</p></td>
<td class="text-left"><p>128</p></td>
<td class="text-left"><p>2048</p></td>
<td class="text-right"><p>1,859</p></td>
</tr>
<tr class="row-even"><td class="text-left"><p>GPT-J 6B</p></td>
<td class="text-left"><p>32</p></td>
<td class="text-left"><p>1</p></td>
<td class="text-left"><p>2048</p></td>
<td class="text-left"><p>128</p></td>
<td class="text-right"><p>616</p></td>
</tr>
<tr class="row-odd"><td class="text-left"><p>GPT-J 6B</p></td>
<td class="text-left"><p>32</p></td>
<td class="text-left"><p>1</p></td>
<td class="text-left"><p>2048</p></td>
<td class="text-left"><p>2048</p></td>
<td class="text-right"><p>757</p></td>
</tr>
<tr class="row-even"><td class="text-left"><p></p></td>
<td class="text-left"><p></p></td>
<td class="text-left"><p></p></td>
<td class="text-left"><p></p></td>
<td class="text-left"><p></p></td>
<td class="text-right"><p></p></td>
</tr>
<tr class="row-odd"><td class="text-left"><p>LLaMA 7B</p></td>
<td class="text-left"><p>64</p></td>
<td class="text-left"><p>1</p></td>
<td class="text-left"><p>128</p></td>
<td class="text-left"><p>128</p></td>
<td class="text-right"><p>3,240</p></td>
</tr>
<tr class="row-even"><td class="text-left"><p>LLaMA 7B</p></td>
<td class="text-left"><p>64</p></td>
<td class="text-left"><p>1</p></td>
<td class="text-left"><p>128</p></td>
<td class="text-left"><p>2048</p></td>
<td class="text-right"><p>1,622</p></td>
</tr>
<tr class="row-odd"><td class="text-left"><p>LLaMA 7B</p></td>
<td class="text-left"><p>32</p></td>
<td class="text-left"><p>1</p></td>
<td class="text-left"><p>2048</p></td>
<td class="text-left"><p>128</p></td>
<td class="text-right"><p>581</p></td>
</tr>
<tr class="row-even"><td class="text-left"><p>LLaMA 7B</p></td>
<td class="text-left"><p>16</p></td>
<td class="text-left"><p>1</p></td>
<td class="text-left"><p>2048</p></td>
<td class="text-left"><p>2048</p></td>
<td class="text-right"><p>531</p></td>
</tr>
</tbody>
</table>
</section>
<section id="a100-gpus-fp16">
<h3>A100 GPUs (FP16)<a class="headerlink" href="#a100-gpus-fp16" title="Permalink to this heading"></a></h3>
<table class="docutils align-default">
<thead>
<tr class="row-odd"><th class="head text-left"><p>Model</p></th>
<th class="head text-left"><p>Batch Size</p></th>
<th class="head text-left"><p>TP (1)</p></th>
<th class="head text-left"><p>Input Length</p></th>
<th class="head text-left"><p>Output Length</p></th>
<th class="head text-right"><p>Throughput (out tok/s)</p></th>
</tr>
</thead>
<tbody>
<tr class="row-even"><td class="text-left"><p>GPT-J 6B</p></td>
<td class="text-left"><p>64</p></td>
<td class="text-left"><p>1</p></td>
<td class="text-left"><p>128</p></td>
<td class="text-left"><p>128</p></td>
<td class="text-right"><p>3,679</p></td>
</tr>
<tr class="row-odd"><td class="text-left"><p>GPT-J 6B</p></td>
<td class="text-left"><p>32</p></td>
<td class="text-left"><p>1</p></td>
<td class="text-left"><p>128</p></td>
<td class="text-left"><p>2048</p></td>
<td class="text-right"><p>1,558</p></td>
</tr>
<tr class="row-even"><td class="text-left"><p>GPT-J 6B</p></td>
<td class="text-left"><p>32</p></td>
<td class="text-left"><p>1</p></td>
<td class="text-left"><p>2048</p></td>
<td class="text-left"><p>128</p></td>
<td class="text-right"><p>526</p></td>
</tr>
<tr class="row-odd"><td class="text-left"><p>GPT-J 6B</p></td>
<td class="text-left"><p>16</p></td>
<td class="text-left"><p>1</p></td>
<td class="text-left"><p>2048</p></td>
<td class="text-left"><p>2048</p></td>
<td class="text-right"><p>650</p></td>
</tr>
<tr class="row-even"><td class="text-left"><p></p></td>
<td class="text-left"><p></p></td>
<td class="text-left"><p></p></td>
<td class="text-left"><p></p></td>
<td class="text-left"><p></p></td>
<td class="text-right"><p></p></td>
</tr>
<tr class="row-odd"><td class="text-left"><p>LLaMA 7B</p></td>
<td class="text-left"><p>64</p></td>
<td class="text-left"><p>1</p></td>
<td class="text-left"><p>128</p></td>
<td class="text-left"><p>128</p></td>
<td class="text-right"><p>3,486</p></td>
</tr>
<tr class="row-even"><td class="text-left"><p>LLaMA 7B</p></td>
<td class="text-left"><p>32</p></td>
<td class="text-left"><p>1</p></td>
<td class="text-left"><p>128</p></td>
<td class="text-left"><p>2048</p></td>
<td class="text-right"><p>1,459</p></td>
</tr>
<tr class="row-odd"><td class="text-left"><p>LLaMA 7B</p></td>
<td class="text-left"><p>32</p></td>
<td class="text-left"><p>1</p></td>
<td class="text-left"><p>2048</p></td>
<td class="text-left"><p>128</p></td>
<td class="text-right"><p>529</p></td>
</tr>
<tr class="row-even"><td class="text-left"><p>LLaMA 7B</p></td>
<td class="text-left"><p>16</p></td>
<td class="text-left"><p>1</p></td>
<td class="text-left"><p>2048</p></td>
<td class="text-left"><p>2048</p></td>
<td class="text-right"><p>592</p></td>
</tr>
<tr class="row-odd"><td class="text-left"><p></p></td>
<td class="text-left"><p></p></td>
<td class="text-left"><p></p></td>
<td class="text-left"><p></p></td>
<td class="text-left"><p></p></td>
<td class="text-right"><p></p></td>
</tr>
<tr class="row-even"><td class="text-left"><p>LLaMA 70B</p></td>
<td class="text-left"><p>64</p></td>
<td class="text-left"><p>4</p></td>
<td class="text-left"><p>128</p></td>
<td class="text-left"><p>128</p></td>
<td class="text-right"><p>1,237</p></td>
</tr>
<tr class="row-odd"><td class="text-left"><p>LLaMA 70B</p></td>
<td class="text-left"><p>64</p></td>
<td class="text-left"><p>4</p></td>
<td class="text-left"><p>128</p></td>
<td class="text-left"><p>2048</p></td>
<td class="text-right"><p>1,181</p></td>
</tr>
<tr class="row-even"><td class="text-left"><p>LLaMA 70B</p></td>
<td class="text-left"><p>64</p></td>
<td class="text-left"><p>4</p></td>
<td class="text-left"><p>2048</p></td>
<td class="text-left"><p>128</p></td>
<td class="text-right"><p>272</p></td>
</tr>
<tr class="row-odd"><td class="text-left"><p>LLaMA 70B</p></td>
<td class="text-left"><p>64</p></td>
<td class="text-left"><p>4</p></td>
<td class="text-left"><p>2048</p></td>
<td class="text-left"><p>2048</p></td>
<td class="text-right"><p>738</p></td>
</tr>
<tr class="row-even"><td class="text-left"><p></p></td>
<td class="text-left"><p></p></td>
<td class="text-left"><p></p></td>
<td class="text-left"><p></p></td>
<td class="text-left"><p></p></td>
<td class="text-right"><p></p></td>
</tr>
<tr class="row-odd"><td class="text-left"><p>Falcon 180B</p></td>
<td class="text-left"><p>64</p></td>
<td class="text-left"><p>8</p></td>
<td class="text-left"><p>128</p></td>
<td class="text-left"><p>128</p></td>
<td class="text-right"><p>929</p></td>
</tr>
<tr class="row-even"><td class="text-left"><p>Falcon 180B</p></td>
<td class="text-left"><p>64</p></td>
<td class="text-left"><p>8</p></td>
<td class="text-left"><p>128</p></td>
<td class="text-left"><p>2048</p></td>
<td class="text-right"><p>923</p></td>
</tr>
<tr class="row-odd"><td class="text-left"><p>Falcon 180B</p></td>
<td class="text-left"><p>64</p></td>
<td class="text-left"><p>8</p></td>
<td class="text-left"><p>2048</p></td>
<td class="text-left"><p>128</p></td>
<td class="text-right"><p>202</p></td>
</tr>
</tbody>
</table>
<p>(1) TP stands for Tensor Parallelism.</p>
</section>
</section>
<section id="low-latency">
<h2>Low Latency<a class="headerlink" href="#low-latency" title="Permalink to this heading"></a></h2>
<p>The tables below provide reference data at batch size 1 for first-token
latency, representing the latency perceived by end users in online streaming
tasks.</p>
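<p>First-token latency is dominated by the prefill of the input prompt, so dividing the
input length by the latency gives a rough prefill token rate. This is an informal
estimate derived from the table values, not a reported metric. A minimal sketch, using
the GPT-J 6B, 2048-token-input row from the H100 table below:</p>

```python
def prefill_tok_s(input_len, first_token_ms):
    """Rough prefill rate implied by a first-token latency measurement."""
    return input_len / (first_token_ms / 1000.0)

# GPT-J 6B on H100 FP8: 2048 input tokens, 29 ms to first token
rate = prefill_tok_s(2048, 29)
print(round(rate))  # roughly 7e4 prefill tokens/second
```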
<section id="id1">
<h3>H100 GPUs (FP8)<a class="headerlink" href="#id1" title="Permalink to this heading"></a></h3>
<table class="docutils align-default">
<thead>
<tr class="row-odd"><th class="head text-left"><p>Model</p></th>
<th class="head text-left"><p>Batch Size</p></th>
<th class="head text-left"><p>TP (1)</p></th>
<th class="head text-left"><p>Input Length</p></th>
<th class="head text-right"><p>1st Token Latency (ms)</p></th>
</tr>
</thead>
<tbody>
<tr class="row-even"><td class="text-left"><p>GPT-J 6B</p></td>
<td class="text-left"><p>1</p></td>
<td class="text-left"><p>1</p></td>
<td class="text-left"><p>128</p></td>
<td class="text-right"><p>7</p></td>
</tr>
<tr class="row-odd"><td class="text-left"><p>GPT-J 6B</p></td>
<td class="text-left"><p>1</p></td>
<td class="text-left"><p>1</p></td>
<td class="text-left"><p>2048</p></td>
<td class="text-right"><p>29</p></td>
</tr>
<tr class="row-even"><td class="text-left"><p></p></td>
<td class="text-left"><p></p></td>
<td class="text-left"><p></p></td>
<td class="text-left"><p></p></td>
<td class="text-right"><p></p></td>
</tr>
<tr class="row-odd"><td class="text-left"><p>LLaMA 7B</p></td>
<td class="text-left"><p>1</p></td>
<td class="text-left"><p>1</p></td>
<td class="text-left"><p>128</p></td>
<td class="text-right"><p>7</p></td>
</tr>
<tr class="row-even"><td class="text-left"><p>LLaMA 7B</p></td>
<td class="text-left"><p>1</p></td>
<td class="text-left"><p>1</p></td>
<td class="text-left"><p>2048</p></td>
<td class="text-right"><p>36</p></td>
</tr>
<tr class="row-odd"><td class="text-left"><p></p></td>
<td class="text-left"><p></p></td>
<td class="text-left"><p></p></td>
<td class="text-left"><p></p></td>
<td class="text-right"><p></p></td>
</tr>
<tr class="row-even"><td class="text-left"><p>LLaMA 70B</p></td>
<td class="text-left"><p>1</p></td>
<td class="text-left"><p>4</p></td>
<td class="text-left"><p>128</p></td>
<td class="text-right"><p>26</p></td>
</tr>
<tr class="row-odd"><td class="text-left"><p>LLaMA 70B</p></td>
<td class="text-left"><p>1</p></td>
<td class="text-left"><p>4</p></td>
<td class="text-left"><p>2048</p></td>
<td class="text-right"><p>109</p></td>
</tr>
<tr class="row-even"><td class="text-left"><p></p></td>
<td class="text-left"><p></p></td>
<td class="text-left"><p></p></td>
<td class="text-left"><p></p></td>
<td class="text-right"><p></p></td>
</tr>
<tr class="row-odd"><td class="text-left"><p>Falcon 180B</p></td>
<td class="text-left"><p>1</p></td>
<td class="text-left"><p>8</p></td>
<td class="text-left"><p>128</p></td>
<td class="text-right"><p>27</p></td>
</tr>
<tr class="row-even"><td class="text-left"><p>Falcon 180B</p></td>
<td class="text-left"><p>1</p></td>
<td class="text-left"><p>8</p></td>
<td class="text-left"><p>2048</p></td>
<td class="text-right"><p>205</p></td>
</tr>
</tbody>
</table>
</section>
<section id="id2">
<h3>L40S GPUs (FP8)<a class="headerlink" href="#id2" title="Permalink to this heading"></a></h3>
<table class="docutils align-default">
<thead>
<tr class="row-odd"><th class="head text-left"><p>Model</p></th>
<th class="head text-left"><p>Batch Size</p></th>
<th class="head text-left"><p>TP (1)</p></th>
<th class="head text-left"><p>Input Length</p></th>
<th class="head text-right"><p>1st Token Latency (ms)</p></th>
</tr>
</thead>
<tbody>
<tr class="row-even"><td class="text-left"><p>GPT-J 6B</p></td>
<td class="text-left"><p>1</p></td>
<td class="text-left"><p>1</p></td>
<td class="text-left"><p>128</p></td>
<td class="text-right"><p>12</p></td>
</tr>
<tr class="row-odd"><td class="text-left"><p>GPT-J 6B</p></td>
<td class="text-left"><p>1</p></td>
<td class="text-left"><p>1</p></td>
<td class="text-left"><p>2048</p></td>
<td class="text-right"><p>71</p></td>
</tr>
<tr class="row-even"><td class="text-left"><p></p></td>
<td class="text-left"><p></p></td>
<td class="text-left"><p></p></td>
<td class="text-left"><p></p></td>
<td class="text-right"><p></p></td>
</tr>
<tr class="row-odd"><td class="text-left"><p>LLaMA 7B</p></td>
<td class="text-left"><p>1</p></td>
<td class="text-left"><p>1</p></td>
<td class="text-left"><p>128</p></td>
<td class="text-right"><p>14</p></td>
</tr>
<tr class="row-even"><td class="text-left"><p>LLaMA 7B</p></td>
<td class="text-left"><p>1</p></td>
<td class="text-left"><p>1</p></td>
<td class="text-left"><p>2048</p></td>
<td class="text-right"><p>73</p></td>
</tr>
</tbody>
</table>
</section>
<section id="id3">
<h3>A100 GPUs (FP16)<a class="headerlink" href="#id3" title="Permalink to this heading"></a></h3>
<table class="docutils align-default">
<thead>
<tr class="row-odd"><th class="head text-left"><p>Model</p></th>
<th class="head text-left"><p>Batch Size</p></th>
<th class="head text-left"><p>TP (1)</p></th>
<th class="head text-left"><p>Input Length</p></th>
<th class="head text-right"><p>1st Token Latency (ms)</p></th>
</tr>
</thead>
<tbody>
<tr class="row-even"><td class="text-left"><p>GPT-J 6B</p></td>
<td class="text-left"><p>1</p></td>
<td class="text-left"><p>1</p></td>
<td class="text-left"><p>128</p></td>
<td class="text-right"><p>12</p></td>
</tr>
<tr class="row-odd"><td class="text-left"><p>GPT-J 6B</p></td>
<td class="text-left"><p>1</p></td>
<td class="text-left"><p>1</p></td>
<td class="text-left"><p>2048</p></td>
<td class="text-right"><p>129</p></td>
</tr>
<tr class="row-even"><td class="text-left"><p></p></td>
<td class="text-left"><p></p></td>
<td class="text-left"><p></p></td>
<td class="text-left"><p></p></td>
<td class="text-right"><p></p></td>
</tr>
<tr class="row-odd"><td class="text-left"><p>LLaMA 7B</p></td>
<td class="text-left"><p>1</p></td>
<td class="text-left"><p>1</p></td>
<td class="text-left"><p>128</p></td>
<td class="text-right"><p>16</p></td>
</tr>
<tr class="row-even"><td class="text-left"><p>LLaMA 7B</p></td>
<td class="text-left"><p>1</p></td>
<td class="text-left"><p>1</p></td>
<td class="text-left"><p>2048</p></td>
<td class="text-right"><p>133</p></td>
</tr>
<tr class="row-odd"><td class="text-left"><p></p></td>
<td class="text-left"><p></p></td>
<td class="text-left"><p></p></td>
<td class="text-left"><p></p></td>
<td class="text-right"><p></p></td>
</tr>
<tr class="row-even"><td class="text-left"><p>LLaMA 70B</p></td>
<td class="text-left"><p>1</p></td>
<td class="text-left"><p>4</p></td>
<td class="text-left"><p>128</p></td>
<td class="text-right"><p>47</p></td>
</tr>
<tr class="row-odd"><td class="text-left"><p>LLaMA 70B</p></td>
<td class="text-left"><p>1</p></td>
<td class="text-left"><p>4</p></td>
<td class="text-left"><p>2048</p></td>
<td class="text-right"><p>377</p></td>
</tr>
<tr class="row-even"><td class="text-left"><p></p></td>
<td class="text-left"><p></p></td>
<td class="text-left"><p></p></td>
<td class="text-left"><p></p></td>
<td class="text-right"><p></p></td>
</tr>
<tr class="row-odd"><td class="text-left"><p>Falcon 180B</p></td>
<td class="text-left"><p>1</p></td>
<td class="text-left"><p>8</p></td>
<td class="text-left"><p>128</p></td>
<td class="text-right"><p>61</p></td>
</tr>
<tr class="row-even"><td class="text-left"><p>Falcon 180B</p></td>
<td class="text-left"><p>1</p></td>
<td class="text-left"><p>8</p></td>
<td class="text-left"><p>2048</p></td>
<td class="text-right"><p>509</p></td>
</tr>
</tbody>
</table>
<p>(1) TP stands for Tensor Parallelism.</p>
</section>
</section>
<section id="known-issues">
<h2>Known Issues<a class="headerlink" href="#known-issues" title="Permalink to this heading"></a></h2>
<p>The following issues are being addressed to improve the efficiency of TensorRT-LLM.</p>
<section id="fused-matmul-gated-silu-llama">
<h3>Fused Matmul + Gated-SiLU (LLaMA)<a class="headerlink" href="#fused-matmul-gated-silu-llama" title="Permalink to this heading"></a></h3>
<p>There are several possible implementations of a Matmul followed by a Gated-SiLU
activation. The simplest one uses two Matmul operations and combines their results
in a separate CUDA kernel; that is the current implementation in TensorRT-LLM.
The next release will include a more efficient implementation that runs a
single Matmul.</p>
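<p>The two variants are mathematically equivalent: because both projections read the same
input, their weight matrices can be concatenated and the two GEMMs replaced by one
larger GEMM whose output is split before the gating. The following NumPy sketch
illustrates the equivalence only; it is not TensorRT-LLM code, and the function and
variable names are illustrative:</p>

```python
import numpy as np

def silu(x):
    # SiLU (a.k.a. swish): x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def gated_silu_two_matmuls(x, w_gate, w_up):
    # Current scheme: two separate GEMMs, combined afterwards.
    return silu(x @ w_gate) * (x @ w_up)

def gated_silu_fused(x, w_gate, w_up):
    # Fused scheme: concatenate weights, run a single GEMM, then split.
    w = np.concatenate([w_gate, w_up], axis=1)
    y = x @ w
    g, u = np.split(y, 2, axis=1)
    return silu(g) * u
```

<p>Fusing trades two kernel launches reading the same activations for one larger GEMM,
which generally makes better use of the GPU.</p>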
</section>
</section>
</section>
</div>
</div>
<footer><div class="rst-footer-buttons" role="navigation" aria-label="Footer">
<a href="precision.html" class="btn btn-neutral float-left" title="Numerical Precision" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left" aria-hidden="true"></span> Previous</a>
<a href="installation.html" class="btn btn-neutral float-right" title="Build From Sources" accesskey="n" rel="next">Next <span class="fa fa-arrow-circle-right" aria-hidden="true"></span></a>
</div>
<hr/>
<div role="contentinfo">
<p>&#169; Copyright 2023, NVIDIA.</p>
</div>
Built with <a href="https://www.sphinx-doc.org/">Sphinx</a> using a
<a href="https://github.com/readthedocs/sphinx_rtd_theme">theme</a>
provided by <a href="https://readthedocs.org">Read the Docs</a>.
</footer>
</div>
</div>
</section>
</div>
<script>
jQuery(function () {
SphinxRtdTheme.Navigation.enable(true);
});
</script>
</body>
</html>