From ecf951f58aa0347d36a174ae373e285572fd54d7 Mon Sep 17 00:00:00 2001
From: Shredded-Pork <12421147@zju.edu.cn>
Date: Sun, 23 Feb 2025 11:47:43 +0800
Subject: [PATCH] Update README.md

---
 README.md | 42 +++++++++++++++++++++---------------------
 1 file changed, 21 insertions(+), 21 deletions(-)

diff --git a/README.md b/README.md
index 06446ad..3811d29 100644
--- a/README.md
+++ b/README.md
@@ -44,27 +44,6 @@ As shown in the chart, the R1-Onevision dataset is a carefully crafted tool desi
 This is a multimodal large language model fine-tuned from Qwen2.5-VL on the **R1-Onevision** dataset. The model enhances vision-language understanding and reasoning capabilities, making it suitable for various tasks such as visual reasoning, image understanding. With its robust ability to perform multimodal reasoning, R1-Onevision emerges as a powerful AI assistant capable of addressing a wide range of problem-solving challenges across different domains.
 
-- Framework: The training process uses the open-source **LLama-Factory** library, with **Qwen2.5-VL-Instruct** as the base model. This model comes in three variants: 3B, 7B, and 72B.
-- Parameters: For efficiency, we use a resolution of 512 for image inputs to save GPU memory. The training follows a full model SFT (Supervised Fine-Tuning) approach with a learning rate of 1e-5, trained for one epoch.
-
-The training configuration is as follows:
-```python
-image_resolution: 512
-cutoff_len: 8192
-per_device_train_batch_size: 1
-gradient_accumulation_steps: 16
-learning_rate: 1.0e-5
-num_train_epochs: 1.0
-lr_scheduler_type: cosine
-warmup_ratio: 0.05
-bf16: true
-flash_attn: fa2
-```
-
-Training loss curve:
-
-
 You can load the model using the Hugging Face `transformers` library:
 
 ```python
@@ -127,5 +106,26 @@ We evaluated R1-Onevision on Mathvision, Mathverse and R1-Onevision-Bench, and o
 
 ## 🏗️ Start
 
+- Framework: The training process uses the open-source **LLaMA-Factory** library, with **Qwen2.5-VL-Instruct** as the base model. This model comes in three variants: 3B, 7B, and 72B.
+- Parameters: For efficiency, we use a resolution of 512 for image inputs to save GPU memory. The training follows a full-model SFT (Supervised Fine-Tuning) approach with a learning rate of 1e-5, trained for one epoch.
+
+The training configuration is as follows:
+```yaml
+image_resolution: 512
+cutoff_len: 8192
+per_device_train_batch_size: 1
+gradient_accumulation_steps: 16
+learning_rate: 1.0e-5
+num_train_epochs: 1.0
+lr_scheduler_type: cosine
+warmup_ratio: 0.05
+bf16: true
+flash_attn: fa2
+```
+
+Training loss curve:
+
+
 ## 🧑‍💻 Authors
 Yi Yang*, Xiaoxuan He*, Hongkun Pan*, Xiyan Jiang, Yan Deng, Xingtao Yang, Haoyu Lu, Minfeng Zhu†, Bo Zhang†, Wei Chen†
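
---

The first hunk stops at the opening fence of the README's `transformers` snippet, so the loading code itself falls outside this diff. For orientation, here is a minimal loading sketch consistent with the Qwen2.5-VL base model named in the patch; the Hub repo id `Fancy-MLLM/R1-Onevision-7B`, the image path, and the prompt are illustrative assumptions, not taken from the patch:

```python
# Minimal loading sketch for a Qwen2.5-VL-based checkpoint.
# NOTE: the repo id below is an assumed placeholder, not confirmed by the patch.
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info  # helper from the qwen-vl-utils package

MODEL_ID = "Fancy-MLLM/R1-Onevision-7B"  # hypothetical Hub repo id

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# A single-image chat turn; replace the image path and question as needed.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "example_problem.png"},
        {"type": "text", "text": "Reason step by step, then give the answer."},
    ],
}]

# Render the chat template and gather the vision inputs it references.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

# Generate, then strip the prompt tokens before decoding.
output_ids = model.generate(**inputs, max_new_tokens=1024)
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```

`process_vision_info` is the helper the Qwen2.5-VL model card uses to collect image and video inputs referenced in the chat messages, which keeps the sketch aligned with the base model's documented usage.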