This merge request extends SWA KV cache support inside the KV cache manager. Before this change, the KV cache for sliding window attention (SWA) held only "window size" blocks and reused them in a cyclic manner. That design cannot take advantage of additional GPU memory, which caps the maximum batch size and therefore throughput, and it cannot support KV cache reuse. This MR changes the behavior so that the manager writes blocks in a linear manner. With linear block writing, blocks that fall out of the attention window (out-of-window, OOW) are detached as the window moves on (see the conceptual sketch below). For the sake of a correct feature first, we currently offload OOW blocks directly from the primary block pool (GPU memory) to the secondary block pool (host memory); a future improvement will delegate this block movement to the eviction policy. KV cache reuse for SWA is not developed in this merge request and will be added in a follow-up merge request. With linear block writing, the maximum number of blocks allocated for a sequence (`GenerationRequest`) is determined by the specified "max sequence length", and the `GenerationRequest` that stores the cache block bookkeeping structure now keeps blocks for "max sequence length" tokens.

Given the above, the main changes are (more context in the MR):

- Remove the "cyclic" concept from the KV cache manager; this concept originally guarded block reuse in the manager.
- Add a detach mechanism and invoke it from `KVCacheManager::addToken`. Note that detach is still guarded off for SWA when reuse is enabled; a follow-up merge request will improve this.
- Enforce "max sequence length" as a non-optional parameter to `KVCacheManager`/`BlockManager`.
- Let all window-size resource pools get an identical proportion of memory.
- Fix the free-memory calculation in `resource_manager.py`.

Signed-off-by: eopXD <yuehtingc@nvidia.com>
Co-authored-by: Tomer Asida <tasida@nvidia.com>
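To make the linear-writing and detach behavior concrete, here is a minimal Python sketch of the bookkeeping idea. It is an illustration only: `SlidingWindowBlockList`, `add_token`, and the `tokens_per_block`/`window_size`/`max_seq_len` parameters are hypothetical names invented for this sketch, not the actual `KVCacheManager`/`BlockManager` API, and offloading to the secondary pool is modeled as a plain list move.

```python
# Hypothetical model of linear block writing for an SWA KV cache.
# Blocks are appended linearly; once a block falls completely outside the
# attention window it is detached and moved to the secondary (host) pool
# instead of being overwritten cyclically.

class SlidingWindowBlockList:
    """Tracks the cache blocks of one sequence under linear block writing."""

    def __init__(self, tokens_per_block: int, window_size: int, max_seq_len: int):
        self.tokens_per_block = tokens_per_block
        self.window_size = window_size
        # Linear writing bounds a sequence at ceil(max_seq_len / tokens_per_block) blocks.
        self.max_blocks = -(-max_seq_len // tokens_per_block)
        self.num_tokens = 0
        self.primary = []    # in-window block ids (GPU / primary pool)
        self.secondary = []  # detached out-of-window block ids (host / secondary pool)

    def add_token(self) -> None:
        self.num_tokens += 1
        # 1) Linear allocation: append a fresh block when the newest token
        #    no longer fits in the blocks allocated so far.
        allocated_tokens = (len(self.primary) + len(self.secondary)) * self.tokens_per_block
        if self.num_tokens > allocated_tokens:
            new_block_id = len(self.primary) + len(self.secondary)
            assert new_block_id < self.max_blocks, "sequence exceeded max_seq_len"
            self.primary.append(new_block_id)
        # 2) Detach: a block is out of window once its last token is older
        #    than the first token still covered by the attention window.
        window_start = self.num_tokens - self.window_size
        while self.primary:
            oldest = self.primary[0]
            last_token_of_block = (oldest + 1) * self.tokens_per_block - 1
            if last_token_of_block >= window_start:
                break
            # For now OOW blocks are offloaded straight to the secondary (host)
            # pool; a later change is meant to route this through the eviction policy.
            self.secondary.append(self.primary.pop(0))


if __name__ == "__main__":
    seq = SlidingWindowBlockList(tokens_per_block=2, window_size=4, max_seq_len=16)
    for _ in range(6):
        seq.add_token()
    print(seq.primary, seq.secondary)  # -> [1, 2] [0]
```

With `tokens_per_block=2` and `window_size=4`, block 0 (tokens 0-1) is detached once the sixth token arrives, because the window then covers tokens 2-5; under the old cyclic scheme that block would instead have been overwritten in place.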
# Guide to Qwen2-Audio deployment pipeline
1. Download the Qwen2-Audio model.

    ```bash
    git lfs install
    export MODEL_PATH="tmp/Qwen2-Audio-7B-Instruct"
    git clone https://huggingface.co/Qwen/Qwen2-Audio-7B-Instruct $MODEL_PATH
    ```
2. Generate the TensorRT engine of the audio encoder.

    ```bash
    export ENGINE_DIR="./trt_engines/qwen2audio/fp16"
    python3 ../multimodal/build_multimodal_engine.py --model_type qwen2_audio --model_path $MODEL_PATH --max_batch_size 32 --output_dir ${ENGINE_DIR}/audio
    ```

    The TensorRT engine will be generated under `${ENGINE_DIR}/audio`.
3. Build the Qwen2 LLM TensorRT engine.

    3.1 Convert the checkpoint

    Install packages:

    ```bash
    pip install -r requirements.txt
    ```

    Convert the FP16 checkpoint:

    ```bash
    python3 ../qwen/convert_checkpoint.py --model_dir=$MODEL_PATH \
            --dtype=float16 \
            --output_dir=./tllm_checkpoint_1gpu_fp16
    ```

    (Optional) Convert the INT8 Weight Only checkpoint:

    ```bash
    python3 ../qwen/convert_checkpoint.py --model_dir=$MODEL_PATH \
            --dtype=float16 \
            --use_weight_only \
            --weight_only_precision=int8 \
            --output_dir=./tllm_checkpoint_1gpu_fp16_wo8
    ```

    3.2 Build the TensorRT-LLM engine

    NOTE: `max_prompt_embedding_table_size = query_token_num * max_batch_size`; therefore, if you change `max_batch_size`, `--max_prompt_embedding_table_size` must be adjusted accordingly (see the worked example at the end of this step).

    ```bash
    trtllm-build --checkpoint_dir=./tllm_checkpoint_1gpu_fp16 \
        --gemm_plugin=float16 --gpt_attention_plugin=float16 \
        --max_batch_size=1 --max_prompt_embedding_table_size=4096 \
        --output_dir=${ENGINE_DIR}/llm
    ```

    The built Qwen engines are located in `${ENGINE_DIR}/llm`. You can replace `--checkpoint_dir` with the INT8 Weight Only checkpoint to build an INT8 Weight Only engine as well.

    For more information about Qwen, refer to the README.md in `example/models/core/qwen`.
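    As a worked example of the sizing rule in the NOTE above (assuming `query_token_num` itself does not change): the command above builds with `--max_batch_size=1` and `--max_prompt_embedding_table_size=4096`, so a build with `--max_batch_size=2` would need `--max_prompt_embedding_table_size=8192`.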
4. Assemble everything into the Qwen2-Audio pipeline.

    4.1 Run with FP16 LLM engine

    ```bash
    python3 run.py \
        --tokenizer_dir=$MODEL_PATH \
        --engine_dir=${ENGINE_DIR}/llm \
        --audio_engine_path=${ENGINE_DIR}/audio/model.engine \
        --audio_url='./audio/glass-breaking-151256.mp3'
    ```

    4.2 (Optional) For multiple rounds of dialogue, you can run:

    ```bash
    python3 run_chat.py \
        --tokenizer_dir=$MODEL_PATH \
        --engine_dir=${ENGINE_DIR}/llm \
        --audio_engine_path=${ENGINE_DIR}/audio/model.engine \
        --max_new_tokens=256
    ```

Note:
- This example supports reusing the KV Cache for audio segments by assigning unique audio IDs.
- To further optimize performance, users can also cache the audio features (encoder output) to bypass the audio encoder if the original audio data remains unchanged.