[#7704][chore] Enable MathJax to fix formulas in documentation (#7744)

Signed-off-by: Kanghwan Jang <861393+karljang@users.noreply.github.com>
This commit is contained in:
Kanghwan 2025-09-19 08:42:26 -07:00 committed by GitHub
parent 8030b540ac
commit 8fcd11515d
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194
2 changed files with 26 additions and 20 deletions

View File

@ -45,44 +45,44 @@ To address this critical performance limitation, we introduce the **ADP (Attenti
We model and quantify the performance impact of load imbalance in Attention DP. Since workloads across ranks can be heterogeneous, the execution time for the Attention module in any given iteration is bounded by the rank with the highest workload:
```math
$$
time_i = \max_{0 \leq m < N} time_{i,m}
```
$$
where $time_{i,m}$ represents the execution time of rank $m$ in iteration $i$, and $N$ is the data parallel size.
To quantify load balance and theoretical performance bounds, we define two key metrics:
#### 1. Balance Ratio
The $balance\\_ratio$ measures the load distribution across ranks within the Attention module for each iteration:
The balance ratio measures the load distribution across ranks within the Attention module for each iteration:
```math
balance\_ratio = \frac{avg\_tokens}{max\_tokens}
```
$$
balance = \frac{tokens_{avg}}{tokens_{max}}
$$
where:
- $avg\\_tokens$ represents the average number of tokens across all ranks
- $max\\_tokens$ represents the maximum number of tokens across all ranks
- $tokens_{avg}$ represents the average number of tokens across all ranks
- $tokens_{max}$ represents the maximum number of tokens across all ranks
- $tokens_i$ represents the number of tokens processed by rank $i$
Note: MoE module load balancing is handled separately by the Expert Parallel Load Balancer (EPLB) module and is not considered during the early scheduling phase.
#### 2. Speed-of-Light Throughput (SOL TPS)
The $sol\\_tps$ represents the theoretical upper-bound throughput achievable with perfect load balancing:
The Speed-of-Light throughput represents the theoretical upper-bound throughput achievable with perfect load balancing:
```math
sol\_time = \sum_{i=0}^{\infty} time_i * balance\_ratio_i
```
$$
time_{sol} = \sum_{i=0}^{\infty} time_i \times balance
$$
```math
sol\_tps = \frac{elapsed\_time}{sol\_time} \times actual\_tps
```
$$
tps_{sol} = \frac{time_{elapsed}}{time_{sol}} \times tps_{actual}
$$
where:
- $time_i$: Measured execution time of iteration $i$
- $elapsed\\_time$: Total empirically measured end-to-end execution time
- $actual\\_tps$: Observed throughput in tokens per second
- $sol\\_tps$: Theoretical maximum throughput under perfect load balance
- $time_{elapsed}$: Total empirically measured end-to-end execution time
- $tps_{actual}$: Observed throughput in tokens per second
- $tps_{sol}$: Theoretical maximum throughput under perfect load balance
This theoretical framework enables us to quantify the performance gap between current and optimal system utilization, providing clear targets for optimization.

View File

@ -50,6 +50,7 @@ extensions = [
'sphinx.ext.autosummary',
'sphinx.ext.viewcode',
'sphinx.ext.napoleon',
'sphinx.ext.mathjax',
'myst_parser', # for markdown support
"breathe",
'sphinx.ext.todo',
@ -86,6 +87,8 @@ myst_heading_anchors = 4
myst_enable_extensions = [
"deflist",
"substitution",
"dollarmath",
"amsmath",
]
myst_substitutions = {
@ -167,8 +170,11 @@ def tag_role(name, rawtext, text, lineno, inliner, options=None, content=None):
def setup(app):
from helper import generate_examples, generate_llmapi
from tensorrt_llm.llmapi.utils import tag_llm_params
tag_llm_params()
try:
from tensorrt_llm.llmapi.utils import tag_llm_params
tag_llm_params()
except ImportError:
print("Warning: tensorrt_llm not available, skipping tag_llm_params")
app.add_role('tag', tag_role)