# Guide to Qwen2-Audio deployment pipeline

1. Download the Qwen2-Audio model.

    ```bash
    git lfs install
    export MODEL_PATH="tmp/Qwen2-Audio-7B-Instruct"
    git clone https://huggingface.co/Qwen/Qwen2-Audio-7B-Instruct $MODEL_PATH
    ```

2. Generate the TensorRT engine for the audio encoder.

    ```bash
    export ENGINE_DIR="./trt_engines/qwen2audio/fp16"
    python3 ../multimodal/build_multimodal_engine.py --model_type qwen2_audio --model_path $MODEL_PATH --max_batch_size 32 --output_dir ${ENGINE_DIR}/audio
    ```

    The TensorRT engine will be generated under `${ENGINE_DIR}/audio`.

3. Build the Qwen2 LLM TensorRT engine.

    - Convert the checkpoint

        1. Install the required packages:

            ```bash
            pip install -r requirements.txt
            ```

        2. Convert the checkpoint:

            2.1 FP16 checkpoint

            ```bash
            python3 ../qwen/convert_checkpoint.py --model_dir=$MODEL_PATH \
                    --dtype=float16 \
                    --output_dir=./tllm_checkpoint_1gpu_fp16
            ```

            2.2 (Optional) INT8 weight-only checkpoint

            ```bash
            python3 ../qwen/convert_checkpoint.py --model_dir=$MODEL_PATH \
                    --dtype=float16 \
                    --use_weight_only \
                    --weight_only_precision=int8 \
                    --output_dir=./tllm_checkpoint_1gpu_fp16_wo8
            ```

    - Build the TensorRT-LLM engine

        NOTE: `max_prompt_embedding_table_size = query_token_num * max_batch_size`, so if you change `max_batch_size`, `--max_prompt_embedding_table_size` must be adjusted accordingly. For example, with the settings below (`max_batch_size=1`, table size 4096), raising `max_batch_size` to 2 would require a table size of 8192. (A sketch of how the prompt embedding table carries the audio features is given at the end of this guide.)

        ```bash
        trtllm-build --checkpoint_dir=./tllm_checkpoint_1gpu_fp16 \
            --gemm_plugin=float16 --gpt_attention_plugin=float16 \
            --max_batch_size=1 --max_prompt_embedding_table_size=4096 \
            --output_dir=${ENGINE_DIR}/llm
        ```

        The built Qwen engines are located in `${ENGINE_DIR}/llm`. You can also point `--checkpoint_dir` at the INT8 weight-only checkpoint to build an INT8 weight-only engine.

        For more information about Qwen, refer to the README.md in [`example/models/core/qwen`](../qwen).

4. Assemble everything into the Qwen2-Audio pipeline.

    4.1 Run with the FP16 LLM engine:

    ```bash
    python3 run.py \
        --tokenizer_dir=$MODEL_PATH \
        --engine_dir=${ENGINE_DIR}/llm \
        --audio_engine_path=${ENGINE_DIR}/audio/model.engine \
        --audio_url='./audio/glass-breaking-151256.mp3'
    ```

    4.2 (Optional) For multiple rounds of dialogue, run:

    ```bash
    python3 run_chat.py \
        --tokenizer_dir=$MODEL_PATH \
        --engine_dir=${ENGINE_DIR}/llm \
        --audio_engine_path=${ENGINE_DIR}/audio/model.engine \
        --max_new_tokens=256
    ```

Note:
- This example supports reusing the KV cache for audio segments by assigning unique audio IDs.
- To further optimize performance, you can also cache the audio features (the encoder output) and bypass the audio encoder when the original audio data is unchanged; a minimal caching sketch follows below.
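The second note above can be implemented with a small on-disk feature cache keyed by a hash of the raw audio. The sketch below is a minimal illustration under assumptions, not part of this example's code: `encode_audio` is a hypothetical stand-in for whatever call runs the TensorRT audio-encoder engine, and the one-`.npy`-file-per-clip cache layout is an arbitrary choice.

```python
import hashlib
from pathlib import Path
from typing import Callable

import numpy as np

CACHE_DIR = Path("./audio_feature_cache")  # assumed cache location
CACHE_DIR.mkdir(exist_ok=True)


def get_audio_features(
    audio_path: str,
    encode_audio: Callable[[bytes], np.ndarray],
) -> np.ndarray:
    """Return cached encoder output when this exact audio was seen before.

    `encode_audio` is a hypothetical stand-in for the call that runs the
    TensorRT audio-encoder engine on raw audio bytes.
    """
    audio_bytes = Path(audio_path).read_bytes()
    # Key the cache on the audio content, not the file name, so renamed
    # copies of the same clip still hit the cache.
    key = hashlib.sha256(audio_bytes).hexdigest()
    cache_file = CACHE_DIR / f"{key}.npy"
    if cache_file.exists():
        return np.load(cache_file)        # cache hit: bypass the encoder
    features = encode_audio(audio_bytes)  # cache miss: run the engine once
    np.save(cache_file, features)
    return features
```

For a long-running service, an in-memory LRU in front of the on-disk cache would additionally avoid repeated disk reads for hot clips.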
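For background on the `--max_prompt_embedding_table_size` note in step 3: TensorRT-LLM passes encoder output to the LLM through the prompt embedding table, using "virtual" token IDs at or above the vocabulary size to select table rows instead of ordinary word embeddings. The sketch below only illustrates that ID scheme; the vocabulary size, audio-feature shape, and text-token IDs are placeholder assumptions, not values read from this model.

```python
import numpy as np

VOCAB_SIZE = 152_064                   # assumption: real value comes from the model config
NUM_AUDIO_TOKENS, HIDDEN = 750, 4096   # assumed audio-feature shape

# Encoder output that will fill rows of the prompt embedding table.
audio_features = np.zeros((NUM_AUDIO_TOKENS, HIDDEN), dtype=np.float16)

# Virtual token IDs >= VOCAB_SIZE index into the prompt embedding table
# rather than the normal token-embedding matrix.
virtual_ids = VOCAB_SIZE + np.arange(NUM_AUDIO_TOKENS)

# The final sequence interleaves real text tokens with the virtual IDs;
# here a placeholder text prefix is followed by the audio span.
text_prefix_ids = np.array([151644, 8948], dtype=np.int64)  # placeholder IDs
input_ids = np.concatenate([text_prefix_ids, virtual_ids])

# The table must hold query_token_num * max_batch_size rows, which is
# where the --max_prompt_embedding_table_size value comes from.
assert NUM_AUDIO_TOKENS * 1 <= 4096  # max_batch_size=1, table size 4096
```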