| Data file name | Size |
| --- | --- |
| open-llava-next_instruct_mix1M.json | 1.64 GB |
| vqa_collection.zip | 30.20 GB |
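If these two files are hosted on a Hugging Face dataset repository, they can be fetched programmatically. The snippet below is a minimal sketch; the repository id `YOUR_ORG/Open-LLaVA-NeXT-data` is a placeholder, so substitute the actual location of the files in the table above.

```python
# Minimal download sketch using huggingface_hub.
# NOTE: the repo id below is a placeholder -- point it at wherever the
# files listed in the table above are actually hosted.
from huggingface_hub import hf_hub_download

REPO_ID = "YOUR_ORG/Open-LLaVA-NeXT-data"  # placeholder, not the real repo id

json_path = hf_hub_download(
    repo_id=REPO_ID,
    repo_type="dataset",
    filename="open-llava-next_instruct_mix1M.json",
    local_dir="data/open-llava-next",
)
zip_path = hf_hub_download(
    repo_id=REPO_ID,
    repo_type="dataset",
    filename="vqa_collection.zip",
    local_dir="data",
)
print(json_path, zip_path)
```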
We have made every effort to align our training data with that of LLaVA-NeXT. However, we were unable to access the tens of thousands of real user-interaction samples that LLaVA-NeXT collected, so we use 200K ALLaVA-Instruct-VFLAN-4V samples as a substitute. Additionally, since TextVQA is already included in the training data of most existing LMMs, we chose to retain it to enable fair comparisons with other LMMs.
The dataset, based on sharegpt4v_mix665k, has been expanded to include ALLaVA-Instruct-VFLAN-4V, DocVQA, SynDog-EN, ChartQA, DVQA, AI2D, and GeoQA+, totaling 1M image-text pairs.
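The 1M annotation file is assumed here to follow the LLaVA-style conversation schema (a JSON list whose entries carry an `image` path relative to `data/` and a `conversations` list); adjust the keys if your copy differs. The sketch below loads the file and summarizes which image subfolders it references, which is handy for cross-checking against the directory layout described further down.

```python
import json
from collections import Counter

# Assumes the LLaVA-style schema: a list of dicts, each with an optional
# "image" path (relative to data/) and a "conversations" list.
with open("data/open-llava-next/open-llava-next_instruct_mix1M.json") as f:
    samples = json.load(f)

print(f"total samples: {len(samples)}")

# Count samples per top-level image folder (coco, sam, gqa, ...).
folders = Counter(s["image"].split("/")[0] for s in samples if "image" in s)
for folder, count in folders.most_common():
    print(f"{folder:20s} {count}")
```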
First, download all the images we use:
- LAION-CC-SBU-558K: images.zip
- COCO: train2017
- WebData: images. Only for academic usage.
- SAM: images
- GQA: images
- OCR-VQA: download script. We save all files as `.jpg` (see the conversion sketch after this list).
- TextVQA: train_val_images
- VisualGenome: part1, part2
- A collection of several VQA datasets: DocVQA, SynDog-EN, ChartQA, DVQA, AI2D, and GeoQA+.
- ALLaVA-Instruct-VFLAN-4V: image_191-task_1k
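The OCR-VQA download script saves images in mixed formats (e.g. `.gif`, `.png`), while the annotation file expects `.jpg` paths. A minimal conversion sketch, assuming the raw downloads sit in `data/ocr_vqa/images`:

```python
from pathlib import Path
from PIL import Image

# Assumes the raw OCR-VQA downloads are in data/ocr_vqa/images in mixed
# formats; re-save everything as .jpg so paths match the annotation file.
image_dir = Path("data/ocr_vqa/images")
for path in image_dir.iterdir():
    if path.suffix.lower() == ".jpg":
        continue
    try:
        img = Image.open(path).convert("RGB")
        img.save(path.with_suffix(".jpg"), "JPEG")
        path.unlink()  # drop the original file after conversion
    except OSError as e:
        print(f"skipping {path.name}: {e}")
```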
Then, organize the data as follows:
```
Open-LLaVA-NeXT
├── ...
├── data
│   ├── llava
│   │   ├── llava_pretrain
│   │   │   ├── images
│   ├── coco
│   │   ├── train2017
│   ├── sam
│   │   ├── images
│   ├── gqa
│   │   ├── images
│   ├── ocr_vqa
│   │   ├── images
│   ├── textvqa
│   │   ├── train_images
│   ├── vg
│   │   ├── VG_100K
│   │   ├── VG_100K_2
│   ├── open-llava-next
│   │   ├── open-llava-next_instruct_mix1M.json
│   ├── web-celebrity
│   │   ├── images
│   ├── web-landmark
│   │   ├── images
│   ├── wikiart
│   │   ├── images
│   ├── allava_vflan
│   │   ├── images
│   │   │   ├── images_191task_1k
│   ├── share_textvqa
│   │   ├── images
│   ├── ai2d
│   │   ├── images
│   ├── chartqa
│   │   ├── train
│   │   │   ├── png
│   ├── docvqa
│   │   ├── train
│   │   │   ├── documents
│   ├── dvqa
│   │   ├── images
│   ├── geoqa+
│   │   ├── images
│   ├── synthdog-en
│   │   ├── images
├── ...
```
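After unpacking everything, it is easy to misplace a folder. The sketch below is a small sanity check that verifies the expected subdirectories (and the 1M annotation file) from the tree above exist under `data/` before launching training:

```python
from pathlib import Path

# Expected layout, taken from the directory tree above (relative to the repo root).
EXPECTED = [
    "data/llava/llava_pretrain/images",
    "data/coco/train2017",
    "data/sam/images",
    "data/gqa/images",
    "data/ocr_vqa/images",
    "data/textvqa/train_images",
    "data/vg/VG_100K",
    "data/vg/VG_100K_2",
    "data/open-llava-next/open-llava-next_instruct_mix1M.json",
    "data/web-celebrity/images",
    "data/web-landmark/images",
    "data/wikiart/images",
    "data/allava_vflan/images/images_191task_1k",
    "data/share_textvqa/images",
    "data/ai2d/images",
    "data/chartqa/train/png",
    "data/docvqa/train/documents",
    "data/dvqa/images",
    "data/geoqa+/images",
    "data/synthdog-en/images",
]

missing = [p for p in EXPECTED if not Path(p).exists()]
if missing:
    print("Missing paths:")
    for p in missing:
        print(f"  {p}")
else:
    print("All expected data paths are present.")
```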