LongBench Evaluation with TensorRT-LLM and Sparse Attention
This directory contains evaluation scripts for both the LongBench v1 and LongBench v2 datasets using the TensorRT-LLM backend.
Environment Setup
1. Clone LongBench Repository
First, clone the LongBench repository, which contains the datasets and evaluation utilities, into this directory:
git clone https://github.com/THUDM/LongBench.git
2. Install Requirements
Install the required dependencies:
pip install -r requirements.txt
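Note that both this directory and the cloned LongBench repository ship their own requirements.txt. The command above refers to the file in this directory; if the LongBench repository's bundled evaluation utilities need additional packages (an assumption worth verifying against its README), install those as well:
pip install -r LongBench/requirements.txt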
3. Directory Structure
After cloning, your directory structure should look like:
sparse_attention/
├── eval_longbench_v1.py # LongBench v1 evaluation script
├── eval_longbench_v2.py # LongBench v2 evaluation script
├── README.md # This file
└── LongBench/ # Cloned LongBench repository
├── LongBench/ # LongBench v1 data and configs
│ ├── config/
│ └── ...
├── config/ # LongBench v2 configs
├── ...
└── requirements.txt
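To confirm the layout before running an evaluation, a quick listing of the two config directories from the tree above should succeed (paths assumed from this structure):
ls LongBench/LongBench/config LongBench/config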
Scripts Overview
1. eval_longbench_v1.py
This script evaluates models on the LongBench v1 benchmark, which comprises multiple task-specific datasets such as narrativeqa, qasper, and multifieldqa. Key features:
- Dataset: LongBench v1 with task-specific evaluation
- Tasks: Support for 20+ different long-context tasks
- Prompts: Task-specific prompts from LongBench v1 configuration
- Metrics: Task-specific metrics (F1, ROUGE, classification scores, etc.)
- Output: Task-level results with comprehensive summary statistics
2. eval_longbench_v2.py
This script evaluates models on the LongBench v2 dataset, which uses a standardized multiple-choice format. Key features:
- Dataset: LongBench v2 with unified multiple-choice format
- Format: All questions are A/B/C/D multiple choice
- Context Length: 8K to 2M words (majority under 128K)
- Difficulty: Easy/Hard categorization
- Length: Short/Medium/Long categorization
- Domains: Various domains (single-doc QA, multi-doc QA, code, etc.)
- CoT Support: Optional Chain-of-Thought reasoning
- Metrics: Accuracy with breakdowns by difficulty, length, and domain
Usage Examples
LongBench v1 Evaluation
Basic Usage (Standard Attention)
python eval_longbench_v1.py \
--model_path "/path/to/your/model" \
--longbench_path ./LongBench \
--output_dir results/v1_vanilla \
--attention_backend VANILLA \
--backend pytorch
Specific Tasks with Sparse Attention (RocketKV)
python eval_longbench_v1.py \
--model_path "/path/to/your/model" \
--longbench_path ./LongBench \
--dataset narrativeqa qasper \
--output_dir results/v1_rocket \
--attention_backend VANILLA \
--backend pytorch \
--rocket_sparse
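To run RocketKV over the full default task set rather than selected tasks, the same command should work with --dataset omitted (an assumption: the script then falls back to its default task list, as in the basic-usage example); results/v1_rocket_all below is just an illustrative output directory:
python eval_longbench_v1.py \
--model_path "/path/to/your/model" \
--longbench_path ./LongBench \
--output_dir results/v1_rocket_all \
--attention_backend VANILLA \
--backend pytorch \
--rocket_sparse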
LongBench v2 Evaluation
Basic Usage (Standard Attention)
python eval_longbench_v2.py \
--model_path "/path/to/your/model" \
--longbench_path ./LongBench \
--output_dir results/v2_vanilla
With Chain-of-Thought Reasoning
python eval_longbench_v2.py \
--model_path "/path/to/your/model" \
--longbench_path ./LongBench \
--output_dir results/v2_cot \
--cot
Filter by Difficulty/Length/Domain
# Easy questions only
python eval_longbench_v2.py \
--model_path "/path/to/your/model" \
--longbench_path ./LongBench \
--output_dir results/v2_easy \
--difficulty easy
# Long context only
python eval_longbench_v2.py \
--model_path "/path/to/your/model" \
--longbench_path ./LongBench \
--output_dir results/v2_long \
--length long
# Specific domain
python eval_longbench_v2.py \
--model_path "/path/to/your/model" \
--longbench_path ./LongBench \
--output_dir results/v2_code \
--domain "Code"
Limited Sample Evaluation (for testing)
python eval_longbench_v2.py \
--model_path "/path/to/your/model" \
--longbench_path ./LongBench \
--output_dir results/v2_test \
--num_samples 10
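The filters above can also be combined with --num_samples for a quick, targeted smoke test (assuming the script accepts these flags together; confirm against its --help output):
# Hard, short-context questions, 10 samples
python eval_longbench_v2.py \
--model_path "/path/to/your/model" \
--longbench_path ./LongBench \
--output_dir results/v2_hard_short_test \
--difficulty hard \
--length short \
--num_samples 10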
Output Structure
LongBench v1 Output
results/v1_experiment/
├── config.json # Experiment configuration
├── overall_summary.json # Overall experiment summary
├── narrativeqa/
│ ├── narrativeqa_results.jsonl # Detailed results
│ ├── narrativeqa_summary.json # Task summary
│ └── pred/
│ └── narrativeqa.jsonl # Predictions in LongBench format
├── qasper/
│ └── ...
└── ...
LongBench v2 Output
results/v2_experiment/
├── config.json # Experiment configuration
├── summary.json # Evaluation summary with metrics
├── longbench_v2_results.jsonl # Detailed results
└── predictions.jsonl # Predictions in LongBench v2 format
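Both layouts are plain JSON/JSONL files (as the extensions suggest), so a finished run can be inspected directly from the shell; the paths below follow the example trees above:
# LongBench v1: overall experiment summary
cat results/v1_experiment/overall_summary.json
# LongBench v2: metrics summary and the first prediction record
cat results/v2_experiment/summary.json
head -n 1 results/v2_experiment/predictions.jsonl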