Vision-Language Models

Survey

  • Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey, arXiv, 2412.18619, arxiv, pdf, cication: -1

    Liang Chen, Zekun Wang, Shuhuai Ren, ..., Tianyu Liu, Baobao Chang · (Awesome-Multimodal-Next-Token-Prediction - LMM101) Star

  • A Survey of Mathematical Reasoning in the Era of Multimodal Large Language Model: Benchmark, Method & Challenges, arXiv, 2412.11936, arxiv, pdf, cication: -1

    Yibo Yan, Jiamin Su, Jianxiang He, ..., Qingsong Wen, Xuming Hu

  • Personalized Multimodal Large Language Models: A Survey, arXiv, 2412.02142, arxiv, pdf, cication: -1

    Junda Wu, Hanjia Lyu, Yu Xia, ..., Jiebo Luo, Julian McAuley

  • Explainable and Interpretable Multimodal Large Language Models: A Comprehensive Survey, arXiv, 2412.02104, arxiv, pdf, cication: -1

    Yunkai Dang, Kaichen Huang, Jiahao Huo, ..., Hui Xiong, Xuming Hu

  • Papers I've read this week: vision language models

    · (𝕏)

  • A short survey of trends in VLMs since LLaVA 1.0 was released 𝕏

    · (huggingface) · (youtube)

  • Towards Unifying Understanding and Generation in the Era of Vision Foundation Models: A Survey from the Autoregression Perspective, arXiv, 2410.22217, arxiv, pdf, cication: -1

    Shenghao Xie, Wenqiang Zu, Mingyang Zhao, ..., Shanghang Zhang, Lei Ma

  • A Survey of Hallucination in Large Visual Language Models, arXiv, 2410.15359, arxiv, pdf, cication: -1

    Wei Lan, Wenyi Chen, Qingfeng Chen, ..., Huiyu Zhou, Yi Pan

Vision-Language Models

  • Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models, arXiv, 2501.05767, arxiv, pdf, cication: -1

    You Li, Heyu Huang, Chi Chen, ..., Ruixuan Li, Maosong Sun · (migician-vg.github) · (arxiv) · (Migician - thunlp) Star

  • SAIL-VL is a state-of-the-art vision-language model (VLM) developed by the ByteDance Douyin Content Team. 🤗

  • Moondream 2025-01-09 Release: Structured Text, Enhanced OCR, Gaze Detection

    · (𝕏) · (docs.moondream)

  • The Illusion-Illusion: Vision Language Models See Illusions Where There are None, arXiv, 2412.18613, arxiv, pdf, cication: -1

    Tomer Ullman

    · (𝕏)

  • Are Vision-Language Models Truly Understanding Multi-vision Sensor?, arXiv, 2412.20750, arxiv, pdf, cication: -1

    Sangyun Chung, Youngjoon Yu, Youngchae Chee, ..., Byung-Kwan Lee, Yong Man Ro

  • 🌟 QVQ-72B-Preview is an experimental research model developed by the Qwen team, focusing on enhancing visual reasoning capabilities. 🤗

  • SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding, arXiv, 2412.09604, arxiv, pdf, cication: -1

    Hao Li, Changyao Tian, Jie Shao, ..., Lewei Lu, Jifeng Dai

  • POINTS1.5: Building a Vision-Language Model towards Real World Applications, arXiv, 2412.08443, arxiv, pdf, cication: -1

    Yuan Liu, Le Tian, Xiao Zhou, ..., Yang Yu, Jie Zhou

  • OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation, arXiv, 2412.09585, arxiv, pdf, cication: -1

    Jitesh Jain, Zhengyuan Yang, Humphrey Shi, ..., Jianfeng Gao, Jianwei Yang · (OLA-VLM - SHI-Labs) Star · (praeclarumjj3.github)

  • 🌟 NVILA: Efficient Frontier Visual Language Models, arXiv, 2412.04468, arxiv, pdf, cication: -1

    Zhijian Liu, Ligeng Zhu, Baifeng Shi, ..., Song Han, Yao Lu

  • moondream is a small vision language model designed to run efficiently on edge devices. 🤗

  • 🌟 Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling, arXiv, 2412.05271, arxiv, pdf, cication: -1

    Zhe Chen, Weiyun Wang, Yue Cao, ..., Jifeng Dai, Wenhai Wang

  • 🌟 MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale, arXiv, 2412.05237, arxiv, pdf, cication: -1

    Jarvis Guo, Tuney Zheng, Yuelin Bai, ..., Wenhu Chen, Xiang Yue · (𝕏)

  • 🌟 Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models, arXiv, 2409.17146, arxiv, pdf, cication: -1

    Matt Deitke, Christopher Clark, Sangho Lee, ..., Ali Farhadi, Aniruddha Kembhavi · (molmo - allenai) Star

  • 🌟 Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion, arXiv, 2412.04424, arxiv, pdf, cication: -1

    Jiuhai Chen, Jianwei Yang, Haiping Wu, ..., Tianyi Zhou, Bin Xiao · (huggingface)

  • Discriminative Fine-tuning of LVLMs, arXiv, 2412.04378, arxiv, pdf, cication: -1

    Yassine Ouali, Adrian Bulat, Alexandros Xenos, ..., Brais Martinez, Georgios Tzimiropoulos

  • CompCap: Improving Multimodal Large Language Models with Composite Captions, arXiv, 2412.05243, arxiv, pdf, cication: -1

    Xiaohui Chen, Satya Narayan Shukla, Mahmoud Azab, ..., Xuewen Zhang, Baosheng He

  • Maya: An Instruction Finetuned Multilingual Multimodal Model, arXiv, 2412.07112, arxiv, pdf, cication: -1

    Nahid Alam, Karthik Reddy Kanjula, Surya Guthikonda, ..., Snehanshu Mukherjee, Alham Fikri Aji · (maya - nahidalam) Star

  • Welcome PaliGemma 2 – New vision language models by Google 🤗

    · (𝕏)

  • FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity, arXiv, 2411.15411, arxiv, pdf, cication: -1

    Hang Hua, Qing Liu, Lingzhi Zhang, ..., Jianming Zhang, Jiebo Luo

  • Llama Guard 3 Vision: Safeguarding Human-AI Image Understanding Conversations, arXiv, 2411.10414, arxiv, pdf, cication: -1

    Jianfeng Chi, Ujjwal Karn, Hongyuan Zhan, ..., Kartikeya Upasani, Mahesh Pasupuleti · (llama) · (llama-recipes - meta-llama) Star

  • Unified Generative and Discriminative Training for Multi-modal Large Language Models, arXiv, 2411.00304, arxiv, pdf, cication: -1

    Wei Chow, Juncheng Li, Qifan Yu, ..., Hanwang Zhang, Qianru Sun

  • 🌟 CLEAR: Character Unlearning in Textual and Visual Modalities, arXiv, 2410.18057, arxiv, pdf, cication: -1

    Alexey Dontsov, Dmitrii Korzh, Alexey Zhavoronkin, ..., Ivan Oseledets, Elena Tutubalina · (huggingface) · (multimodal_unlearning - somvy) Star

  • Improve Vision Language Model Chain-of-thought Reasoning, arXiv, 2410.16198, arxiv, pdf, cication: -1

    Ruohong Zhang, Bowen Zhang, Yanghao Li, ..., Ruoming Pang, Yiming Yang

    · (LLaVA-Reasoner-DPO - RifleZhang) Star

  • Mitigating Object Hallucination via Concentric Causal Attention, arXiv, 2410.15926, arxiv, pdf, cication: -1

    Yun Xing, Yiheng Li, Ivan Laptev, ..., Shijian Lu

    · (cca-llava - xing0047) Star · (arxiv)

Image

  • ChatRex: Taming Multimodal LLM for Joint Perception and Understanding, arXiv, 2411.18363, arxiv, pdf, cication: -1

    Qing Jiang, Gen Luo, Yuqin Yang, ..., Tianhe Ren, Lei Zhang · (chatrex - idea-research) Star

  • DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding, arXiv, 2411.14347, arxiv, pdf, cication: -1

    Tianhe Ren, Yihao Chen, Qing Jiang, ..., Kent Yu, Lei Zhang

  • Teach Multimodal LLMs to Comprehend Electrocardiographic Images, arXiv, 2410.19008, arxiv, pdf, cication: -1

    Ruoqi Liu, Yuelin Bai, Xiang Yue, ..., Ping Zhang

Video

  • An Empirical Study of Autoregressive Pre-training from Videos, arXiv, 2501.05453, arxiv, pdf, cication: -1

    Jathushan Rajasegaran, Ilija Radosavovic, Rahul Ravishankar, ..., Christoph Feichtenhofer, Jitendra Malik · (brjathu.github)

  • Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction, arXiv, 2501.03218, arxiv, pdf, cication: -1

    Rui Qian, Shuangrui Ding, Xiaoyi Dong, ..., Dahua Lin, Jiaqi Wang · (Dispider - Mark12Ding) Star

  • MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models, arXiv, 2501.02955, arxiv, pdf, cication: -1

    Wenyi Hong, Yean Cheng, Zhuoyi Yang, ..., Yuxiao Dong, Jie Tang · (motion-bench.github)

  • Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos, arXiv, 2501.04001, arxiv, pdf, cication: -1

    Haobo Yuan, Xiangtai Li, Tao Zhang, ..., Jiashi Feng, Ming-Hsuan Yang · (Sa2VA - magic-research) Star · (arxiv) · (huggingface) · (lxtgh.github)

  • VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM, arXiv, 2501.00599, arxiv, pdf, cication: -1

    Yuqian Yuan, Hang Zhang, Wentong Li, ..., Jianke Zhu, Lidong Bing

  • Video-Panda: Parameter-efficient Alignment for Encoder-free Video-Language Models, arXiv, 2412.18609, arxiv, pdf, cication: -1

    Jinhui Yi, Syed Talal Wasim, Yanan Luo, ..., Muzammal Naseer, Juergen Gall · (Video-Panda - jh-yi) Star

  • 🌟 Apollo: An Exploration of Video Understanding in Large Multimodal Models, arXiv, 2412.10360, arxiv, pdf, cication: -1

    Orr Zohar, Xiaohan Wang, Yann Dubois, ..., Serena Yeung-Levy, Xide Xia · (apollo-lmms.github) · (huggingface) · (Apollo - Apollo-LMMs) Star · (huggingface)

  • StreamChat: Chatting with Streaming Video, arXiv, 2412.08646, arxiv, pdf, cication: -1

    Jihao Liu, Zhiding Yu, Shiyi Lan, ..., Hongsheng Li, Jose M. Alvarez · (jihaonew.github)

  • VideoICL: Confidence-based Iterative In-context Learning for Out-of-Distribution Video Understanding, arXiv, 2412.02186, arxiv, pdf, cication: -1

    Kangsan Kim, Geon Park, Youngwan Lee, ..., Woongyeong Yeo, Sung Ju Hwang

  • Towards Universal Soccer Video Understanding, arXiv, 2412.01820, arxiv, pdf, cication: -1

    Jiayuan Rao, Haoning Wu, Hao Jiang, ..., Yanfeng Wang, Weidi Xie · (jyrao.github) · (arxiv) · (UniSoccer - jyrao) Star

  • VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection, arXiv, 2411.14794, arxiv, pdf, cication: -1

    Songhao Han, Wei Huang, Hairong Shi, ..., Yue Liao, Si Liu · (VideoEspresso - hshjerry) Star

  • TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability, arXiv, 2411.18211, arxiv, pdf, cication: -1

    Shimin Chen, Xiaohan Lan, Yitian Yuan, ..., Zequn Jie, Lin Ma · (TimeMarker - TimeMarker-LLM) Star

  • Number it: Temporal Grounding Videos like Flipping Manga, arXiv, 2411.10332, arxiv, pdf, cication: -1

    Yongliang Wu, Xinting Hu, Yuyang Sun, ..., Bernt Schiele, Xu Yang · (NumPro - yongliang-wu) Star

  • Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension, arXiv, 2411.13093, arxiv, pdf, cication: -1

    Yongdong Luo, Xiawu Zheng, Xiao Yang, ..., Jiebo Luo, Rongrong Ji

  • PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance, arXiv, 2411.02327, arxiv, pdf, cication: -1

    Ruyang Liu, Haoran Tang, Haibo Liu, ..., Chen Li, Jiankun Yang · (PPLLaVA - farewellthree) Star

  • VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos, arXiv, 2411.04923, arxiv, pdf, cication: -1

    Shehan Munasinghe, Hanan Gani, Wenqi Zhu, ..., Fahad Shahbaz Khan, Salman Khan · (mbzuai-oryx.github)

  • xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs, arXiv, 2410.16267, arxiv, pdf, cication: -1

    Michael S. Ryoo, Honglu Zhou, Shrikant Kendre, ..., Caiming Xiong, Juan Carlos Niebles

    · (salesforceairesearch)

  • LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding, arXiv, 2410.17434, arxiv, pdf, cication: -1

    Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, ..., Mohamed Elhoseiny, Vikas Chandra

    · (vision-cair.github) · (LongVU - Vision-CAIR) Star · (huggingface) · (huggingface)

  • VidEgoThink: Assessing Egocentric Video Understanding Capabilities for Embodied AI, arXiv, 2410.11623, arxiv, pdf, cication: -1

    Sijie Cheng, Kechen Fang, Yangyang Yu, ..., Lei Han, Yang Liu

  • OMCAT: Omni Context Aware Transformer, arXiv, 2410.12109, arxiv, pdf, cication: -1

    Arushi Goel, Karan Sapra, Matthieu Le, ..., Andrew Tao, Bryan Catanzaro · (om-cat.github)

Encoder

  • Unifying Specialized Visual Encoders for Video Language Models, arXiv, 2501.01426, arxiv, pdf, cication: -1

    Jihoon Chung, Tyler Zhu, Max Gonzalez Saez-Diez, ..., Honglu Zhou, Olga Russakovsky · (tylerzhu) · (merv - princetonvisualai) Star

  • LLaVA-UHD v2: an MLLM Integrating High-Resolution Feature Pyramid via Hierarchical Window Transformer, arXiv, 2412.13871, arxiv, pdf, cication: -1

    Yipeng Zhang, Yifan Liu, Zonghao Guo, ..., Tat-Seng Chua, Maosong Sun · (LLaVA-UHD - thunlp) Star

  • FastVLM: Efficient Vision Encoding for Vision Language Models, arXiv, 2412.13303, arxiv, pdf, cication: -1

    Pavan Kumar Anasosalu Vasu, Fartash Faghri, Chun-Liang Li, ..., Oncel Tuzel, Hadi Pouransari

  • TRecViT: A Recurrent Video Transformer, arXiv, 2412.14294, arxiv, pdf, cication: -1

    Viorica Pătrăucean, Xu Owen He, Joseph Heyward, ..., João Carreira, Razvan Pascanu · (trecvit - google-deepmind) Star

  • PruneVid: Visual Token Pruning for Efficient Video Large Language Models, arXiv, 2412.16117, arxiv, pdf, cication: -1

    Xiaohu Huang, Hao Zhou, Kai Han · (PruneVid - Visual-AI) Star

  • Large Motion Video Autoencoding with Cross-modal Video VAE, arXiv, 2412.17805, arxiv, pdf, cication: -1

    Yazhou Xing, Yang Fei, Yingqing He, ..., Xiaowei Chi, Qifeng Chen · (VideoVAEPlus - VideoVerses) Star

  • 🌟 VisionZip: Longer is Better but Not Necessary in Vision Language Models, arXiv, 2412.04467, arxiv, pdf, cication: -1

    Senqiao Yang, Yukang Chen, Zhuotao Tian, ..., Bei Yu, Jiaya Jia · (202.104.135) · (youtu) · (VisionZip - dvlab-research) Star

  • [CLS] Token Tells Everything Needed for Training-free Efficient MLLMs, arXiv, 2412.05819, arxiv, pdf, cication: -1

    Ao Wang, Fengyuan Sun, Hui Chen, ..., Jungong Han, Guiguang Ding · (VTC-CLS - THU-MIG) Star
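
The [CLS]-token finding above lends itself to a simple training-free recipe: rank visual tokens by how much attention the [CLS] token pays them and keep only the top-k. A minimal sketch of that idea (the function name and toy attention values are mine, not the paper's):

```python
def prune_by_cls_attention(tokens, cls_attention, keep=2):
    """Keep the `keep` visual tokens that receive the most [CLS] attention.

    tokens        -- list of token payloads (embeddings, labels, ...)
    cls_attention -- attention weight from [CLS] to each token, same length
    """
    ranked = sorted(range(len(tokens)), key=lambda i: cls_attention[i], reverse=True)
    kept = sorted(ranked[:keep])  # restore original spatial order
    return [tokens[i] for i in kept]

tokens = ["patch0", "patch1", "patch2", "patch3"]
attn = [0.05, 0.40, 0.15, 0.40]
print(prune_by_cls_attention(tokens, attn, keep=2))  # ['patch1', 'patch3']
```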

  • FocusLLaVA: A Coarse-to-Fine Approach for Efficient and Effective Visual Token Compression, arXiv, 2411.14228, arxiv, pdf, cication: -1

    Yuke Zhu, Chi Xie, Shuang Liang, ..., Bo Zheng, Sheng Guo

  • DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models, arXiv, 2411.15024, arxiv, pdf, cication: -1

    Keda Tao, Can Qin, Haoxuan You, ..., Yang Sui, Huan Wang

  • Factorized Visual Tokenization and Generation, arXiv, 2411.16681, arxiv, pdf, cication: -1

    Zechen Bai, Jianxiong Gao, Ziteng Gao, ..., Tong He, Mike Zheng Shou · (showlab.github)

  • REDUCIO! Generating 1024×1024 Video within 16 Seconds using Extremely Compressed Motion Latents, arXiv, 2411.13552, arxiv, pdf, cication: -1

    Rui Tian, Qi Dai, Jianmin Bao, ..., Zuxuan Wu, Yu-Gang Jiang · (Reducio-VAE - microsoft) Star

  • Multimodal Autoregressive Pre-training of Large Vision Encoders, arXiv, 2411.14402, arxiv, pdf, cication: -1

    Enrico Fini, Mustafa Shukor, Xiujun Li, ..., Joshua M. Susskind, Alaaeldin El-Nouby · (ml-aim - apple) Star · (huggingface)

  • Don't Look Twice: Faster Video Transformers with Run-Length Tokenization, arXiv, 2411.05222, arxiv, pdf, cication: -1

    Rohan Choudhury, Guanglei Zhu, Sihan Liu, ..., Kris M. Kitani, László Jeni · (rccchoudhury.github) · (rlt - rccchoudhury) Star · (mp.weixin.qq)
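
Run-length tokenization, as described above, skips patch tokens that repeat across consecutive frames and records how long each kept token persists. A toy sketch, assuming "repeated" means exact equality (the actual method compares patches under a similarity threshold):

```python
def run_length_tokenize(frames):
    """frames: list of frames, each a list of patch values.

    Returns (kept_tokens, run_lengths): one entry per run of identical
    patches at the same spatial position across consecutive frames.
    """
    kept, runs = [], []
    n_patches = len(frames[0])
    for p in range(n_patches):
        t = 0
        while t < len(frames):
            v = frames[t][p]
            length = 1
            while t + length < len(frames) and frames[t + length][p] == v:
                length += 1
            kept.append(v)
            runs.append(length)
            t += length
    return kept, runs

# A static background patch (value 7) next to a changing patch.
frames = [[7, 1], [7, 2], [7, 2]]
print(run_length_tokenize(frames))  # ([7, 1, 2], [3, 1, 2])
```

Static regions collapse into a single token, which is where the speedup on long videos comes from.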

  • SigLIP model pre-trained on WebLi at resolution 224x224. 🤗
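
SigLIP's key departure from CLIP is a pairwise sigmoid loss: every image-text pair is scored independently, so no batch-wide softmax normalization is needed. A minimal sketch of that loss in plain Python (toy logits, not a faithful training setup):

```python
import math

def siglip_loss(logits):
    """Pairwise sigmoid loss over an n×n image-text similarity matrix.

    Diagonal entries are positive pairs (label +1), off-diagonal entries
    negative (-1). Each pair contributes independently, so the loss can be
    accumulated chunk by chunk without materializing full-batch softmax.
    """
    n = len(logits)
    total = 0.0
    for i in range(n):
        for j in range(n):
            label = 1.0 if i == j else -1.0
            total += math.log(1.0 + math.exp(-label * logits[i][j]))
    return total / n

# Well-aligned pairs (large diagonal logits) give a small loss.
loss = siglip_loss([[4.0, -4.0], [-4.0, 4.0]])
```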

  • 🌟 LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation, arXiv, 2411.04997, arxiv, pdf, cication: -1

    Weiquan Huang, Aoqi Wu, Yifan Yang, ..., Chong Luo, Lili Qiu · (aka) · (LLM2CLIP - microsoft) Star

  • In Search of Forgotten Domain Generalization, arXiv, 2410.08258, arxiv, pdf, cication: -1

    Prasanna Mayilvahanan, Roland S. Zimmermann, Thaddäus Wiedemer, ..., Matthias Bethge, Wieland Brendel · (𝕏)

  • Adaptive Length Image Tokenization via Recurrent Allocation, arXiv, 2411.02393, arxiv, pdf, cication: -1

    Shivam Duggal, Phillip Isola, Antonio Torralba, ..., William T. Freeman

  • LARP: Tokenizing Videos with a Learned Autoregressive Generative Prior, arXiv, 2410.21264, arxiv, pdf, cication: -1

    Hanyu Wang, Saksham Suri, Yixuan Ren, ..., Hao Chen, Abhinav Shrivastava · (hywang66.github)

  • Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss, arXiv, 2410.17243, arxiv, pdf, cication: -1

    Zesen Cheng, Hang Zhang, Kehan Li, ..., Xin Li, Lidong Bing

    · (Inf-CLIP - DAMO-NLP-SG) Star · (arxiv)
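
Inf-CLIP's batch-size scaling comes from tiling the contrastive loss so the full n×n similarity matrix never has to be materialized at once. The sketch below illustrates the underlying idea with a block-wise streaming log-sum-exp over one row at a time; it is a pure-Python simplification of the paper's tiled GPU implementation, not the actual method:

```python
import math

def contrastive_loss_streaming(img, txt, block=2):
    """Image-to-text InfoNCE loss, accumulated block by block.

    Each row's log-sum-exp is updated in a streaming fashion, so only
    `block` similarity entries exist at any time instead of the full n×n
    matrix.
    """
    n = len(img)
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    total = 0.0
    for i in range(n):
        m, s = float("-inf"), 0.0  # running max and rescaled exp-sum
        for start in range(0, n, block):
            for j in range(start, min(start + block, n)):
                z = dot(img[i], txt[j])
                if z > m:
                    s = s * math.exp(m - z) + 1.0
                    m = z
                else:
                    s += math.exp(z - m)
        # -log softmax probability of the positive pair (i, i)
        total += m + math.log(s) - dot(img[i], txt[i])
    return total / n

img = txt = [[10.0, 0.0], [0.0, 10.0]]
loss = contrastive_loss_streaming(img, txt, block=1)  # near zero: pairs already aligned
```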

Alignment

  • Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment, arXiv, 2412.19326, arxiv, pdf, cication: -1

    Ziang Yan, Zhilin Li, Yinan He, ..., Limin Wang, Yi Wang · (TPO - OpenGVLab) Star · (huggingface)

  • Preference Optimization for Vision Language Models with TRL 🤗
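
Most entries in this section build on Direct Preference Optimization. The core DPO objective they share can be sketched in a few lines; the log-probabilities below are illustrative placeholders, while real training computes them from the policy and a frozen reference model:

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid(beta * margin).

    The margin is how much more the policy prefers the chosen response
    over the rejected one, relative to the frozen reference model.
    """
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return math.log(1.0 + math.exp(-beta * margin))  # == -log sigmoid

# Policy widens the preference gap beyond the reference -> lower loss.
low = dpo_loss(-5.0, -9.0, ref_chosen=-6.0, ref_rejected=-8.0)
high = dpo_loss(-9.0, -5.0, ref_chosen=-6.0, ref_rejected=-8.0)
assert low < high
```

Vision variants such as V-DPO and MIA-DPO keep this objective and change how the preference pairs are constructed (vision-guided or multi-image augmented).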

  • On Domain-Specific Post-Training for Multimodal Large Language Models, arXiv, 2411.19930, arxiv, pdf, cication: -1

    Daixuan Cheng, Shaohan Huang, Ziyu Zhu, ..., Bo Dai, Zhenliang Zhang · (huggingface)

  • 🌟 Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization, arXiv, 2411.10442, arxiv, pdf, cication: -1

    Weiyun Wang, Zhe Chen, Wenhai Wang, ..., Yu Qiao, Jifeng Dai · (internvl.github)

  • SymDPO: Boosting In-Context Learning of Large Multimodal Models with Symbol Demonstration Direct Preference Optimization, arXiv, 2411.11909, arxiv, pdf, cication: -1

    Hongrui Jia, Chaoya Jiang, Haiyang Xu, ..., Fei Huang, Shikun Zhang

  • V-DPO: Mitigating Hallucination in Large Vision Language Models via Vision-Guided Direct Preference Optimization, arXiv, 2411.02712, arxiv, pdf, cication: -1

    Yuxi Xie, Guanzhen Li, Xiao Xu, ..., Min-Yen Kan · (V-DPO - YuxiXie) Star

  • MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models, arXiv, 2410.17637, arxiv, pdf, cication: -1

    Ziyu Liu, Yuhang Zang, Xiaoyi Dong, ..., Dahua Lin, Jiaqi Wang

Reasoning

  • 🌟 LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs, arXiv, 2501.06186, arxiv, pdf, cication: -1

    Omkar Thawakar, Dinura Dissanayake, Ketan More, ..., Fahad Shahbaz Khan, Salman Khan

  • Multimodal Reasoning and its Applications to Computer Use and Robotics 🎬

  • 🌟 Virgo: A Preliminary Exploration on Reproducing o1-like MLLM, arXiv, 2501.01904, arxiv, pdf, cication: -1

    Yifan Du, Zikang Liu, Yifan Li, ..., Zhongyuan Wang, Ji-Rong Wen · (Virgo - RUCAIBox) Star

  • URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics, arXiv, 2501.04686, arxiv, pdf, cication: -1

    Ruilin Luo, Zhuofan Zheng, Yifan Wang, ..., Jin Zeng, Yujiu Yang · (ursa-math.github)

  • 🌟 Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search, arXiv, 2412.18319, arxiv, pdf, cication: -1

    Huanjin Yao, Jiaxing Huang, Wenhao Wu, ..., Li Shen, Dacheng Tao · (Mulberry - HJYao00) Star

  • 🌟 Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces, arXiv, 2412.14171, arxiv, pdf, cication: -1

    Jihan Yang, Shusheng Yang, Anjali W. Gupta, ..., Li Fei-Fei, Saining Xie · (vision-x-nyu.github) · (thinking-in-space - vision-x-nyu) Star · (huggingface)

  • 🌟 Progressive Multimodal Reasoning via Active Retrieval, arXiv, 2412.14835, arxiv, pdf, cication: -1

    Guanting Dong, Chenghao Zhang, Mengjie Deng, ..., Zhicheng Dou, Ji-Rong Wen

  • 🌟 Diving into Self-Evolving Training for Multimodal Reasoning, arXiv, 2412.17451, arxiv, pdf, cication: -1

    Wei Liu, Junlong Li, Xiwen Zhang, ..., Yu Cheng, Junxian He · (mstar-lmm.github)

  • TACO: Learning Multi-modal Action Models with Synthetic Chains-of-Thought-and-Action, arXiv, 2412.05479, arxiv, pdf, cication: -1

    Zixian Ma, Jianguo Zhang, Zhiwei Liu, ..., Ranjay Krishna, Silvio Savarese · (taco-project.github) · (TACO - SalesforceAIResearch) Star

  • 🌟 LLaVA-o1: Let Vision Language Models Reason Step-by-Step, arXiv, 2411.10440, arxiv, pdf, cication: -1

    Guowei Xu, Peng Jin, Li Hao, ..., Lichao Sun, Li Yuan · (LLaVA-o1 - PKU-YuanGroup) Star

  • Llama-3.2V-11B-cot is the first version of LLaVA-o1, a vision-language model capable of spontaneous, systematic reasoning. 🤗
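
LLaVA-o1 structures generation into four sequential stages: summary, caption, reasoning, and conclusion, each wrapped in its own tag. Below is a small helper for parsing such staged output; the tag names follow the paper's described stages, but the exact formatting may differ and the parser itself is my own illustration:

```python
import re

STAGES = ["SUMMARY", "CAPTION", "REASONING", "CONCLUSION"]

def parse_staged_output(text):
    """Extract the four reasoning stages from a staged model response.

    Returns a dict mapping each stage name to its content, or None when
    a stage is missing from the response.
    """
    out = {}
    for stage in STAGES:
        m = re.search(rf"<{stage}>(.*?)</{stage}>", text, re.DOTALL)
        out[stage] = m.group(1).strip() if m else None
    return out

resp = ("<SUMMARY>Count the apples.</SUMMARY>"
        "<CAPTION>A bowl with three apples.</CAPTION>"
        "<REASONING>Each visible apple is counted once: 3.</REASONING>"
        "<CONCLUSION>3</CONCLUSION>")
print(parse_staged_output(resp)["CONCLUSION"])  # 3
```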

  • Vision-Language Models Can Self-Improve Reasoning via Reflection, arXiv, 2411.00855, arxiv, pdf, cication: -1

    Kanzhi Cheng, Yantao Li, Fangzhi Xu, ..., Hao Zhou, Yang Liu

Evaluation

  • OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?, arXiv, 2501.05510, arxiv, pdf, cication: -1

    Yifei Li, Junbo Niu, Ziyang Miao, ..., Conghui He, Jiaqi Wang · (OVO-Bench - JoeLeelyf) Star

  • Multi-Dimensional Insights: Benchmarking Real-World Personalization in Large Multimodal Models, arXiv, 2412.12606, arxiv, pdf, cication: -1

    YiFan Zhang, Shanglin Lei, Runqi Qiao, ..., Xiaofei Wang, Honggang Zhang · (mdi-benchmark.github)

  • VisionArena: 230K Real World User-VLM Conversations with Preference Labels, arXiv, 2412.08687, arxiv, pdf, cication: -1

    Christopher Chou, Lisa Dunlap, Koki Mashita, ..., Joseph E. Gonzalez, Wei-Lin Chiang · (huggingface)

  • VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation, arXiv, 2411.13281, arxiv, pdf, cication: -1

    Ziyang Luo, Haoning Wu, Dongxu Li, ..., Mohan Kankanhalli, Junnan Li · (videoautoarena.github)

  • ViBe: A Text-to-Video Benchmark for Evaluating Hallucination in Large Multimodal Models, arXiv, 2411.10867, arxiv, pdf, cication: -1

    Vipula Rawte, Sarthak Jain, Aarush Sinha, ..., Amit P. Sheth, Amitava Das

  • M-Longdoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework, arXiv, 2411.06176, arxiv, pdf, cication: -1

    Yew Ken Chia, Liying Cheng, Hou Pong Chan, ..., Soujanya Poria, Lidong Bing · (multimodal-documents.github)

  • DynaMath: A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models, arXiv, 2411.00836, arxiv, pdf, cication: -1

    Chengke Zou, Xingang Guo, Rui Yang, ..., Bin Hu, Huan Zhang · (DynaMath - DynaMath) Star · (huggingface)

  • 🌟 Both Text and Images Leaked! A Systematic Analysis of Multimodal LLM Data Contamination, arXiv, 2411.03823, arxiv, pdf, cication: -1

    Dingjie Song, Sicheng Lai, Shunian Chen, ..., Lichao Sun, Benyou Wang · (MM-Detect - MLLM-Data-Contamination) Star

  • StreamingBench: Assessing the Gap for MLLMs to Achieve Streaming Video Understanding, arXiv, 2411.03628, arxiv, pdf, cication: -1

    Junming Lin, Zheng Fang, Chi Chen, ..., Yang Liu, Maosong Sun

  • TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models, arXiv, 2410.23266, arxiv, pdf, cication: -1

    Ziyao Shangguan, Chuhan Li, Yuxuan Ding, ..., Tesca Fitzgerald, Arman Cohan · (TOMATO - yale-nlp) Star

  • Image2Struct: Benchmarking Structure Extraction for Vision-Language Models, arXiv, 2410.22456, arxiv, pdf, cication: -1

    Josselin Somerville Roberts, Tony Lee, Chi Heem Wong, ..., Yifan Mai, Percy Liang · (crfm.stanford) · (x)

  • AVHBench: A Cross-Modal Hallucination Benchmark for Audio-Visual Large Language Models, arXiv, 2410.18325, arxiv, pdf, cication: -1

    Kim Sung-Bin, Oh Hyun-Bin, JungMok Lee, ..., Joon Son Chung, Tae-Hyun Oh

    · (AVHBench - AVHBench) Star

  • MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models, arXiv, 2410.10139, arxiv, pdf, cication: -1

    Peng Xia, Siwei Han, Shi Qiu, ..., Lijuan Wang, Huaxiu Yao · (mmie-bench.github) · (MMIE - Lillianwei-h) Star · (huggingface)

  • MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks, arXiv, 2410.10563, arxiv, pdf, cication: -1

    Jiacheng Chen, Tianhao Liang, Sherman Siu, ..., Xiang Yue, Wenhu Chen · (tiger-ai-lab.github)

  • LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content, arXiv, 2410.10783, arxiv, pdf, cication: -1

    Nimrod Shabtay, Felipe Maia Polo, Sivan Doveh, ..., Leonid Karlinsky, Raja Giryes

  • TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models, arXiv, 2410.10818, arxiv, pdf, cication: -1

    Mu Cai, Reuben Tan, Jianrui Zhang, ..., Yong Jae Lee, Jianwei Yang

  • NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples, arXiv, 2410.14669, arxiv, pdf, cication: -1

    Baiqi Li, Zhiqiu Lin, Wenxuan Peng, ..., Graham Neubig, Deva Ramanan · (arxiv) · (huggingface) · (linzhiqiu.github)

  • WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines, arXiv, 2410.12705, arxiv, pdf, cication: -1

    Genta Indra Winata, Frederikus Hudi, Patrick Amadeus Irawan, ..., Alice Oh, Chong-Wah Ngo

  • HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks, arXiv, 2410.12381, arxiv, pdf, cication: -1

    Fengji Zhang, Linquan Wu, Huiyu Bai, ..., Bei Chen, Jacky Keung

Efficient

  • LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token, arXiv, 2501.03895, arxiv, pdf, cication: -1

    Shaolei Zhang, Qingkai Fang, Zhe Yang, ..., Yang Feng

  • Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration, arXiv, 2412.13180, arxiv, pdf, cication: -1

    Mark Endo, Xiaohan Wang, Serena Yeung-Levy · (web.stanford)

  • 🌟 SmolVLM - small yet mighty Vision Language Model 🤗

    · (𝕏) · (huggingface)

  • Treat Visual Tokens as Text? But Your MLLM Only Needs Fewer Efforts to See, arXiv, 2410.06169, arxiv, pdf, cication: -1

    Zeliang Zhang, Phu Pham, Wentian Zhao, ..., Ajinkya Kale, Chenliang Xu · (YOPO_MLLM_Pruning - ZhangAIPI) Star

  • 🌟 BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices, arXiv, 2411.10640, arxiv, pdf, cication: -1

    Xudong Lu, Yinghao Chen, Cheng Chen, ..., Shuai Ren, Hongsheng Li

  • Inference Optimal VLMs Need Only One Visual Token but Larger Models, arXiv, 2411.03312, arxiv, pdf, cication: -1

    Kevin Y. Li, Sachin Goyal, Joao D. Semedo, ..., J. Zico Kolter · (llava-token-compression - locuslab) Star
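
The result above implies the visual input can be compressed very aggressively when the backing LLM is large enough. A minimal compressor that average-pools n patch embeddings into k tokens illustrates the interface such methods share (purely illustrative; the paper's compression scheme is learned, not fixed pooling):

```python
def pool_visual_tokens(patches, k=1):
    """Compress n patch embeddings into k tokens by average-pooling
    contiguous groups. patches: list of equal-length vectors."""
    n, dim = len(patches), len(patches[0])
    out = []
    for g in range(k):
        lo, hi = g * n // k, (g + 1) * n // k  # contiguous group bounds
        group = patches[lo:hi]
        out.append([sum(v[d] for v in group) / len(group) for d in range(dim)])
    return out

patches = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 6.0]]
print(pool_visual_tokens(patches, k=2))  # [[2.0, 0.0], [0.0, 4.0]]
```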

  • PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction, arXiv, 2410.17247, arxiv, pdf, cication: -1

    Long Xing, Qidong Huang, Xiaoyi Dong, ..., Feng Wu, Dahua Lin

    · (PyramidDrop - Cooperx521) Star

Generation

  • Flowing from Words to Pixels: A Framework for Cross-Modality Evolution, arXiv, 2412.15213, arxiv, pdf, cication: -1

    Qihao Liu, Xi Yin, Alan Yuille, ..., Andrew Brown, Mannat Singh · (cross-flow.github)

  • MetaMorph: Multimodal Understanding and Generation via Instruction Tuning, arXiv, 2412.14164, arxiv, pdf, cication: -1

    Shengbang Tong, David Fan, Jiachen Zhu, ..., Saining Xie, Zhuang Liu · (𝕏)

  • LlamaFusion: Adapting Pretrained Language Models for Multimodal Generation, arXiv, 2412.15188, arxiv, pdf, cication: -1

    Weijia Shi, Xiaochuang Han, Chunting Zhou, ..., Luke Zettlemoyer, Lili Yu

  • DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation, arXiv, 2412.07589, arxiv, pdf, cication: -1

    Jianzong Wu, Chao Tang, Jingbo Wang, ..., Xiangtai Li, Yunhai Tong · (jianzongwu.github) · (arxiv) · (DiffSensei - jianzongwu) Star

  • Multimodal Latent Language Modeling with Next-Token Diffusion, arXiv, 2412.08635, arxiv, pdf, cication: -1

    Yutao Sun, Hangbo Bao, Wenhui Wang, ..., Jianyong Wang, Furu Wei

  • EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via Multimodal LLM, arXiv, 2412.09618, arxiv, pdf, cication: -1

    Zhuofan Zong, Dongzhi Jiang, Bingqi Ma, ..., Yu Liu, Hongsheng Li · (easyref-gen.github)

  • TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation, arXiv, 2412.03069, arxiv, pdf, cication: -1

    Liao Qu, Huichao Zhang, Yiheng Liu, ..., Zehuan Yuan, Xinglong Wu · (byteflow-ai.github) · (TokenFlow - ByteFlow-AI) Star

  • Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation, arXiv, 2412.04432, arxiv, pdf, cication: -1

    Yuying Ge, Yizhuo Li, Yixiao Ge, ..., Ying Shan · (huggingface) · (Divot - TencentARC) Star

  • ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance, arXiv, 2412.06673, arxiv, pdf, cication: -1

    Chunwei Wang, Guansong Lu, Junwei Yang, ..., Wei Zhang, Hang Xu

  • qwen2vl-flux - erwold Star

    Unifying Image and Text Guidance for Controllable Image Generation

  • 🌟 JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation, arXiv, 2411.07975, arxiv, pdf, cication: -1

    Yiyang Ma, Xingchao Liu, Xiaokang Chen, ..., Jiaying Liu, Chong Ruan · (Janus - deepseek-ai) Star

  • 🌟 Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models, arXiv, 2411.04996, arxiv, pdf, cication: -1

    Weixin Liang, Lili Yu, Liang Luo, ..., Luke Zettlemoyer, Xi Victoria Lin

  • VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing

    · (Vitron - SkyworkAI) Star

  • Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation, arXiv, 2410.13848, arxiv, pdf, cication: -1

    Chengyue Wu, Xiaokang Chen, Zhiyu Wu, ..., Chong Ruan, Ping Luo

  • PUMA: Empowering Unified MLLM with Multi-granular Visual Generation, arXiv, 2410.13861, arxiv, pdf, cication: -1

    Rongyao Fang, Chengqi Duan, Kun Wang, ..., Hongsheng Li, Xihui Liu

Dataset

  • 🌟 2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining, arXiv, 2501.00958, arxiv, pdf, cication: -1

    Wenqi Zhang, Hang Zhang, Xin Li, ..., Yueting Zhuang, Lidong Bing · (multimodal-interleaved-textbook.github) · (multimodal_textbook - DAMO-NLP-SG) Star · (huggingface) · (zhuanlan.zhihu)

  • MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval, arXiv, 2412.14475, arxiv, pdf, cication: -1

    Junjie Zhou, Zheng Liu, Ze Liu, ..., Defu Lian, Yongping Xiong

  • MMPR, a multimodal preference dataset that adds data sources to enhance diversity and improve the performance of InternVL2.5 🤗

  • 🌟 LAION-SG: An Enhanced Large-Scale Dataset for Training Complex Image-Text Models with Structural Annotations, arXiv, 2412.08580, arxiv, pdf, cication: -1

    Zejian Li, Chenye Meng, Yize Li, ..., Jinxiong Chang, Lingyun Sun · (LAION-SG - mengcye) Star

  • Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions, arXiv, 2412.08737, arxiv, pdf, cication: -1

    Jiarui Zhang, Ollie Liu, Tianyu Yu, ..., Jinyi Hu, Willie Neiswanger

  • BigDocs: An Open and Permissively-Licensed Dataset for Training Multimodal Models on Document and Code Tasks, arXiv, 2412.04626, arxiv, pdf, cication: -1

    Juan Rodriguez, Xiangru Jian, Siba Smarak Panigrahi, ..., David Vazquez, Sai Rajeswar · (bigdocs.github)

  • BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions, arXiv, 2411.07461, arxiv, pdf, cication: -1

    Anas Awadalla, Le Xue, Manli Shu, ..., Caiming Xiong, Ran Xu · (huggingface)

  • HumanVLM: Foundation for Human-Scene Vision-Language Model, arXiv, 2411.03034, arxiv, pdf, cication: -1

    Dawei Dai, Xu Long, Li Yutang, ..., Zhang Yuanhui, Shuyin Xia

  • HourVideo: 1-Hour Video-Language Understanding, arXiv, 2411.04998, arxiv, pdf, cication: -1

    Keshigeyan Chandrasegaran, Agrim Gupta, Lea M. Hadzic, ..., Jiajun Wu, Li Fei-Fei · (hourvideo.stanford)

  • Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data, arXiv, 2410.18558, arxiv, pdf, cication: -1

    Shuhao Gu, Jialing Zhang, Siyuan Zhou, ..., Fangxiang Feng, Guang Liu

  • Marqo-GS-10M, a multimodal, fine-grained ranking dataset built from Google Shopping 🤗

  • LVD-2M: A Long-take Video Dataset with Temporally Dense Captions, arXiv, 2410.10816, arxiv, pdf, cication: -1

    Tianwei Xiong, Yuqing Wang, Daquan Zhou, ..., Jiashi Feng, Xihui Liu

    · (LVD-2M - SilentView) Star

  • Harnessing Webpage UIs for Text-Rich Visual Understanding, arXiv, 2410.13824, arxiv, pdf, cication: -1

    Junpeng Liu, Tianyue Ou, Yifan Song, ..., Graham Neubig, Xiang Yue

Projects

Products

Misc