Thanks for releasing this benchmark. We tried to compute the categorical score for each ability but found low scores on several abilities, such as visual reasoning and visual perception. We computed text-matching accuracy. We also downloaded the inference results of existing works (e.g., BLIP2) and manually checked that the text-matching accuracy can hardly reach ~50%. Below are example prediction results from BLIP2 (our matching script is sketched after the examples). How do these works achieve high scores in the Tiny LVLM evaluation?
```json
{
    "question": "When does the coupon expire?",
    "answer": "it expires on january 31st",
    "gt_answers": ["12/31/87"],
    "image_path": "updated_datasets/Visual_Reasoning/001.png",
    "model_name": "BLIP2",
    "task_type": "VQA"
},
{
    "question": "What is the \u201cunit of quantity\u201d of Pulp?",
    "answer": "Pulp is a term used to refer to the amount of pulp produced by a pulp mill, or the amount of pulp produced by a",
    "gt_answers": ["Tonne"],
    "image_path": "updated_datasets/Visual_Reasoning/002.png",
    "model_name": "BLIP2",
    "task_type": "VQA"
},
{
    "question": "what is the % of sugar in ro-neet?",
    "answer": "% of sugar in ro-neet",
    "gt_answers": ["17.1%", "17.1"],
    "image_path": "updated_datasets/Visual_Reasoning/003.png",
    "model_name": "BLIP2",
    "task_type": "VQA"
},
{
    "question": "What is the total consultant costs under column \"-04\" based on \"II. CONSULTANT COSTS\"?",
    "answer": "0 0 0 0 0 0 0 0 0 0 0 0 0 0 0",
    "gt_answers": ["$1,532"],
    "image_path": "updated_datasets/Visual_Reasoning/004.png",
    "model_name": "BLIP2",
    "task_type": "VQA"
},
```
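For reference, this is a minimal sketch of how one might compute the text-matching accuracy described above: a prediction counts as correct if any ground-truth answer appears as a substring of the normalized prediction. The filename `blip2_predictions.json` and the exact matching rule are our assumptions, not the benchmark's official protocol.

```python
import json


def normalize(text: str) -> str:
    """Lowercase and strip surrounding whitespace/periods for a loose comparison."""
    return text.lower().strip().strip(".")


def text_match(pred: str, gts: list[str]) -> bool:
    """Count a prediction as correct if any ground-truth string
    appears as a substring of the normalized prediction."""
    pred_norm = normalize(pred)
    return any(normalize(gt) in pred_norm for gt in gts)


# "blip2_predictions.json" is a placeholder name for the downloaded
# inference results (a JSON array of records like the examples above).
with open("blip2_predictions.json") as f:
    records = json.load(f)

correct = sum(text_match(r["answer"], r["gt_answers"]) for r in records)
print(f"text-matching accuracy: {correct / len(records):.2%}")
```

Under this rule, all four examples above score as incorrect, which is consistent with the low per-ability accuracies we observed.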