LongBench Evaluation with TensorRT-LLM and Sparse Attention
This directory contains evaluation scripts for both the LongBench v1 and LongBench v2 datasets using the TensorRT-LLM backend.
Environment Setup
1. Clone LongBench Repository
First, clone the LongBench repository, which contains the datasets and evaluation utilities, into this directory:
git clone https://github.com/THUDM/LongBench.git
2. Install Requirements
Install the required dependencies:
pip install -r requirements.txt
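Note that both this directory and the cloned LongBench repository ship their own requirements.txt. The command above refers to the file in this directory; if the LongBench repository's bundled evaluation utilities need additional packages (an assumption worth verifying against its README), install those as well:
pip install -r LongBench/requirements.txt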
3. Directory Structure
After cloning, your directory structure should look like:
sparse_attention/
├── eval_longbench_v1.py # LongBench v1 evaluation script
├── eval_longbench_v2.py # LongBench v2 evaluation script
├── README.md # This file
└── LongBench/ # Cloned LongBench repository
├── LongBench/ # LongBench v1 data and configs
│ ├── config/
│ └── ...
├── config/ # LongBench v2 configs
├── ...
└── requirements.txt
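To confirm the layout before running an evaluation, a quick listing of the two config directories from the tree above should succeed (paths assumed from this structure):
ls LongBench/LongBench/config LongBench/config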
Scripts Overview
1. eval_longbench_v1.py
This script evaluates models on the LongBench v1 benchmark, which comprises multiple task-specific datasets such as narrativeqa, qasper, and multifieldqa. Key features:
- Dataset: LongBench v1 with task-specific evaluation
- Tasks: Support for 20+ different long-context tasks
- Prompts: Task-specific prompts from LongBench v1 configuration
- Metrics: Task-specific metrics (F1, ROUGE, classification scores, etc.)
- Output: Task-level results with comprehensive summary statistics
2. eval_longbench_v2.py
This script evaluates models on the LongBench v2 dataset, which uses a standardized multiple-choice format. Key features:
- Dataset: LongBench v2 with unified multiple-choice format
- Format: All questions are A/B/C/D multiple choice
- Context Length: 8K to 2M words (majority under 128K)
- Difficulty: Easy/Hard categorization
- Length: Short/Medium/Long categorization
- Domains: Various domains (single-doc QA, multi-doc QA, code, etc.)
- CoT Support: Optional Chain-of-Thought reasoning
- Metrics: Accuracy with breakdowns by difficulty, length, and domain
Usage Examples
LongBench v1 Evaluation
Basic Usage (Standard Attention)
python eval_longbench_v1.py \
--model_path "/path/to/your/model" \
--longbench_path ./LongBench \
--output_dir results/v1_vanilla \
--attention_backend VANILLA \
--backend pytorch
Specific Tasks with Sparse Attention (RocketKV)
python eval_longbench_v1.py \
--model_path "/path/to/your/model" \
--longbench_path ./LongBench \
--dataset narrativeqa qasper \
--output_dir results/v1_rocket \
--attention_backend VANILLA \
--backend pytorch \
--rocket_sparse
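To run RocketKV over the full default task set rather than selected tasks, the same command should work with --dataset omitted (an assumption: the script then falls back to its default task list, as in the basic-usage example); results/v1_rocket_all below is just an illustrative output directory:
python eval_longbench_v1.py \
--model_path "/path/to/your/model" \
--longbench_path ./LongBench \
--output_dir results/v1_rocket_all \
--attention_backend VANILLA \
--backend pytorch \
--rocket_sparse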
LongBench v2 Evaluation
Basic Usage (Standard Attention)
python eval_longbench_v2.py \
--model_path "/path/to/your/model" \
--longbench_path ./LongBench \
--output_dir results/v2_vanilla
With Chain-of-Thought Reasoning
python eval_longbench_v2.py \
--model_path "/path/to/your/model" \
--longbench_path ./LongBench \
--output_dir results/v2_cot \
--cot
Filter by Difficulty/Length/Domain
# Easy questions only
python eval_longbench_v2.py \
--model_path "/path/to/your/model" \
--longbench_path ./LongBench \
--output_dir results/v2_easy \
--difficulty easy
# Long context only
python eval_longbench_v2.py \
--model_path "/path/to/your/model" \
--longbench_path ./LongBench \
--output_dir results/v2_long \
--length long
# Specific domain
python eval_longbench_v2.py \
--model_path "/path/to/your/model" \
--longbench_path ./LongBench \
--output_dir results/v2_code \
--domain "Code"
Limited Sample Evaluation (for testing)
python eval_longbench_v2.py \
--model_path "/path/to/your/model" \
--longbench_path ./LongBench \
--output_dir results/v2_test \
--num_samples 10
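The filters above can also be combined with --num_samples for a quick, targeted smoke test (assuming the script accepts these flags together; confirm against its --help output):
# Hard, short-context questions, 10 samples
python eval_longbench_v2.py \
--model_path "/path/to/your/model" \
--longbench_path ./LongBench \
--output_dir results/v2_hard_short_test \
--difficulty hard \
--length short \
--num_samples 10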
Output Structure
LongBench v1 Output
results/v1_experiment/
├── config.json # Experiment configuration
├── overall_summary.json # Overall experiment summary
├── narrativeqa/
│ ├── narrativeqa_results.jsonl # Detailed results
│ ├── narrativeqa_summary.json # Task summary
│ └── pred/
│ └── narrativeqa.jsonl # Predictions in LongBench format
├── qasper/
│ └── ...
└── ...
LongBench v2 Output
results/v2_experiment/
├── config.json # Experiment configuration
├── summary.json # Evaluation summary with metrics
├── longbench_v2_results.jsonl # Detailed results
└── predictions.jsonl # Predictions in LongBench v2 format
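Both layouts are plain JSON/JSONL files (as the extensions suggest), so a finished run can be inspected directly from the shell; the paths below follow the example trees above:
# LongBench v1: overall experiment summary
cat results/v1_experiment/overall_summary.json
# LongBench v2: metrics summary and the first prediction record
cat results/v2_experiment/summary.json
head -n 1 results/v2_experiment/predictions.jsonl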