I think this method looks like MM-CoT. What's the difference from it, apart from using an LLM? #45
Replies: 3 comments
-
The other important part is the synthetic GPT-4-based dataset.
-
The fine-tuning on ScienceQA is similar to MM-CoT in terms of architecture and data organization. One major finding in MM-CoT is that the prediction order matters: reason first, then answer, hence the name "CoT". In our study, we find that the "CoT" claim is not very important; see the evidence in our paper.
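For readers new to the ordering discussion, here is a small, hypothetical sketch (plain Python, not code from either paper) of the two target formats being compared: generating the rationale before the answer, as MM-CoT does, versus stating the answer first.

```python
# Hypothetical illustration of the two output orderings; the function names
# and example are made up and do not come from the MM-CoT or LLaVA code bases.

def format_target_reason_then_answer(reasoning: str, answer: str) -> str:
    """MM-CoT-style target: generate the rationale first, then the answer."""
    return f"{reasoning} The answer is {answer}."

def format_target_answer_then_reason(reasoning: str, answer: str) -> str:
    """Alternative ordering: state the answer first, then the rationale."""
    return f"The answer is {answer}. {reasoning}"

example = {
    "reasoning": "A magnet attracts iron, so the paper clip moves toward it.",
    "answer": "B",
}
print(format_target_reason_then_answer(**example))
print(format_target_answer_then_reason(**example))
```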
For both papers, most of the performance gain comes from using a vision encoder and training end-to-end on ScienceQA. This dataset is relatively small compared with VQA 2.0, so it is easy to reach high performance by training a large model on it. We hope readers keep this in mind when drawing conclusions. Beyond that, there are implementation differences between the two papers that lead to the different conclusions: (1) the choice of LLM; (2) we have a pre-training stage to connect the two modalities, which yields a 5% improvement compared with training from scratch, while MM-CoT does not have this pre-training stage (a rough sketch of this two-stage setup is given below). We hope this can be reconsidered in future development.

I'd like to clarify that ScienceQA helped us quantitatively ablate our design choices in the early stage of the project, but ScienceQA is not the single main focus of this project. We aim to help the community produce multimodal GPT-4-level capability with minimal effort: (1) shifting the focus from model-centric to data-centric AI: the multimodal instruction-following data is the key, and most of our time was spent on it; (2) achieving multimodal chat with detailed descriptions such as OCR and complex reasoning. The current demo has preliminary capabilities for this. We hope the community is inspired to scale up this approach to reach better performance.
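As an illustration of the pre-training stage mentioned in point (2), here is a minimal PyTorch sketch, not the actual LLaVA implementation: a linear projector maps features from a frozen vision encoder into the LLM's embedding space, and only that projector is trained in stage 1 before the LLM is unfrozen for end-to-end fine-tuning in stage 2. The dimensions and module names are placeholders.

```python
# Minimal two-stage training sketch (assumed, not the LLaVA code).
import torch
import torch.nn as nn

VISION_DIM, LLM_DIM = 1024, 4096  # assumed feature sizes, not from the paper


class VisionLanguageConnector(nn.Module):
    def __init__(self, vision_encoder: nn.Module, llm: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder              # e.g. a ViT, kept frozen
        self.projector = nn.Linear(VISION_DIM, LLM_DIM)   # the only new parameters
        self.llm = llm                                    # decoder-only language model

    def forward(self, images: torch.Tensor, text_embeds: torch.Tensor):
        with torch.no_grad():                              # vision tower stays frozen
            patch_feats = self.vision_encoder(images)      # (B, N, VISION_DIM)
        visual_tokens = self.projector(patch_feats)        # (B, N, LLM_DIM)
        inputs = torch.cat([visual_tokens, text_embeds], dim=1)
        return self.llm(inputs)                            # placeholder LM call


def configure_stage(model: VisionLanguageConnector, stage: int) -> None:
    """Stage 1: train the projector only. Stage 2: also unfreeze the LLM."""
    for p in model.vision_encoder.parameters():
        p.requires_grad = False
    for p in model.projector.parameters():
        p.requires_grad = True
    for p in model.llm.parameters():
        p.requires_grad = (stage == 2)
```

The point of stage 1 in this sketch is that only the small projector is updated, so the visual features get aligned to the language embedding space cheaply before the full model is fine-tuned on ScienceQA.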
-
Wow, thanks for the work!
-
Absolutely awesome work. First, thanks for the instruction-tuning data contributed to the community. I was just confused by the question mentioned in the title.