
Awesome-Large-Multimodal-Models

This repo collects the papers of "A Survey on Large Multi-Modal Models from the Perspective of Input-Output Space Extension" and summarizes the construction of current LMMs from the perspective of input-output representation space extension.

  • Based on the structure of input-output spaces, we systematically review the existing models, including mainstream models based on discrete-continuous hybrid spaces and models with unified multi-modal discrete representations (a minimal sketch of the hybrid recipe follows this list).
  • Readers can refer to our [📖 Preprint Paper] for detailed explanations.
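
Most of the text-only-output LVLMs collected below follow the hybrid recipe: a frozen encoder produces continuous visual features, a lightweight connector (a Linear/MLP projector here) maps them into the LLM's embedding space, and the projected visual tokens are simply prepended to the text tokens. The sketch below is a toy PyTorch illustration with placeholder modules and made-up dimensions, not the implementation of any specific model:

```python
import torch
import torch.nn as nn

class ToyConnectorLVLM(nn.Module):
    def __init__(self, vis_dim=64, llm_dim=128, vocab_size=1000):
        super().__init__()
        # Stand-in for a frozen vision encoder (e.g., a CLIP ViT producing patch features).
        self.vision_encoder = nn.Linear(vis_dim, vis_dim)
        # MLP-style connector: maps continuous visual features into the LLM embedding space.
        self.connector = nn.Sequential(nn.Linear(vis_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))
        self.text_embed = nn.Embedding(vocab_size, llm_dim)
        layer = nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True)
        # Stand-in for the decoder-only LLM backbone (causal masking omitted for brevity).
        self.llm = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, image_patches, text_ids):
        vis_feats = self.vision_encoder(image_patches)     # (B, N_patches, vis_dim), continuous
        vis_tokens = self.connector(vis_feats)             # projected into the LLM embedding space
        txt_tokens = self.text_embed(text_ids)             # (B, N_text, llm_dim)
        seq = torch.cat([vis_tokens, txt_tokens], dim=1)   # visual tokens prepended to the text
        return self.lm_head(self.llm(seq))                 # next-token logits over the text vocabulary

logits = ToyConnectorLVLM()(torch.randn(1, 16, 64), torch.randint(0, 1000, (1, 8)))
print(logits.shape)  # torch.Size([1, 24, 1000]) -- the model can only emit text (Output Type 1)
```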

Table of Contents

  • Preliminary
  • Awesome Models
    • Large Vision-Language Models
      • With Text-only Output
      • With Vision and Text Output
    • Large Audio-Language Models
    • Any Modality Models

Preliminary

As presented in the figure below, the evolution of multi-modal research paradigms can be divided into three stages.

To give readers a general picture of this development, we provide a tutorial here; the collected papers are organized in the sections that follow.

Awesome Models

In the tables below, Input Type A denotes continuous features produced by a modality encoder, while B denotes discrete tokens produced by a modality tokenizer; Output Type 1 denotes text-only output, 2 denotes continuous output representations passed to an external decoder (e.g., a diffusion model), and 3 denotes discrete output tokens decoded by a de-tokenizer or vocoder.

Large Vision-Language Models

With Text-only Output

| Large Vision-Language Model | Code | Input Type | Output Type | LLM Backbone | Modality Encoder | Connection | Max Res. | Date |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Flamingo: a Visual Language Model for Few-Shot Learning | Github | A | 1 | Chinchilla | NFNet | Perceiver | 480 | 2022/04 |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | Github | A | 1 | Flan-T5 / OPT | CLIP ViT-L/14 / Eva-CLIP ViT-G/14 | Q-Former | 224 | 2023/01 |
| LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention | Github | A | 1 | LLaMA | CLIP ViT-L/14 | MLP | 224 | 2023/03 |
| MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | Github | A | 1 | Vicuna | Eva-CLIP ViT-G/14 | Q-Former | 224 | 2023/04 |
| Visual Instruction Tuning | Github | A | 1 | Vicuna | CLIP ViT-L/14 | Linear | 224 | 2023/04 |
| mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality | Github | A | 1 | LLaMA | CLIP ViT-L/14 | Abstractor | 224 | 2023/04 |
| LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model | Github | A | 1 | LLaMA | CLIP ViT-L/14 | MLP | 224 | 2023/04 |
| InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning | Github | A | 1 | Flan-T5 / Vicuna | Eva-CLIP ViT-G/14 | Q-Former | 224 | 2023/05 |
| Otter: A Multi-Modal Model with In-Context Instruction Tuning | Github | A | 1 | LLaMA | CLIP ViT-L/14 | Perceiver | 224 | 2023/05 |
| Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models | Github | A | 1 | LLaMA | CLIP ViT-L/14 | MLP | 224 | 2023/05 |
| MultiModal-GPT: A Vision and Language Model for Dialogue with Humans | Github | A | 1 | LLaMA | CLIP ViT-L/14 | Perceiver | 224 | 2023/05 |
| Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic | Github | A | 1 | Vicuna | CLIP ViT-L/14 | Linear | 224 | 2023/06 |
| Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | Github | A | 1 | Vicuna | CLIP ViT-L/14 | Linear | 224 | 2023/06 |
| Valley: Video Assistant with Large Language model Enhanced abilitY | Github | A | 1 | Stable-Vicuna | CLIP ViT-L/14 | Temporal Module + Linear | 224 | 2023/06 |
| What Matters in Training a GPT4-Style Language Model with Multimodal Inputs? | Github | A | 1 | Vicuna | EVA-1B | Resampler | 420 | 2023/07 |
| Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | Github | A | 1 | Qwen | OpenCLIP ViT-bigG | Cross-Attention | 448 | 2023/08 |
| BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions | Github | A | 1 | Flan-T5 / Vicuna | Eva-CLIP ViT-G/14 | Q-Former + MLP | 224 | 2023/08 |
| IDEFICS | Huggingface | A | 1 | LLaMA | OpenCLIP ViT-H/14 | Perceiver | 224 | 2023/08 |
| OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models | Github | A | 1 | LLaMA, MPT | CLIP ViT-L/14 | Perceiver | 224 | 2023/08 |
| InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition | Github | A | 1 | InternLM | Eva-CLIP ViT-G/14 | Perceiver | 224 | 2023/09 |
| Improved Baselines with Visual Instruction Tuning | Github | A | 1 | Vicuna 1.5 | CLIP ViT-L/14 | MLP | 336 | 2023/10 |
| MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning | Github | A | 1 | LLaMA-2 | EVA | Linear | 448 | 2023/10 |
| Fuyu-8B: A Multimodal Architecture for AI Agents | HF | A | 1 | Persimmon | - | Linear | unlimited | 2023/10 |
| UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model | Github | A | 1 | LLaMA | CLIP ViT-L/14 | Abstractor | 224*20 | 2023/10 |
| CogVLM: Visual Expert for Pretrained Language Models | Github | A | 1 | Vicuna 1.5 | EVA2-CLIP-E | MLP | 490 | 2023/11 |
| Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models | Github | A | 1 | Qwen | OpenCLIP ViT-bigG | Cross-Attention | 896 | 2023/11 |
| ShareGPT4V: Improving Large Multi-Modal Models with Better Captions | Github | A | 1 | Vicuna-1.5 | CLIP ViT-L/14 | MLP | 336 | 2023/11 |
| mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration | Github | A | 1 | LLaMA-2 | CLIP ViT-L/14 | Abstractor | 448 | 2023/11 |
| SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models | Github | A | 1 | LLaMA-2 | CLIP ViT-L/14 + CLIP ConvNeXt-XXL + DINOv2 ViT-G/14 | Linear + Q-Former | 672 | 2023/11 |
| InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | Github | A | 1 | Vicuna | InternViT | QLLaMA / MLP | 336 | 2023/12 |
| MobileVLM: A Fast, Strong and Open Vision Language Assistant for Mobile Devices | Github | A | 1 | MobileLLaMA | CLIP ViT-L/14 | LDP (conv-based) | 336 | 2023/12 |
| VILA: On Pre-training for Visual Language Models | Github | A | 1 | LLaMA-2 | CLIP ViT-L | Linear | 336 | 2023/12 |
| Osprey: Pixel Understanding with Visual Instruction Tuning | Github | A | 1 | Vicuna | CLIP ConvNeXt-L | MLP | 512 | 2023/12 |
| Honeybee: Locality-enhanced Projector for Multimodal LLM | Github | A | 1 | Vicuna-1.5 | CLIP ViT-L/14 | C-Abstractor / D-Abstractor | 336 | 2023/12 |
| Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts | - | A | 1 | UL2 | SigLIP ViT-G/14 | Linear | 1064 | 2023/12 |
| LLaVA-NeXT: Improved reasoning, OCR, and world knowledge | Github | A | 1 | Vicuna / Mistral / Hermes-2-Yi | CLIP ViT-L/14 | MLP | 672 | 2024/01 |
| InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model | Github | A | 1 | InternLM-2 | CLIP ViT-L/14 | MLP | 490 | 2024/01 |
| MouSi: Poly-Visual-Expert Vision-Language Models | Github | A | 1 | Vicuna-1.5 | CLIP ViT-L/14 + MAE + LayoutLMv3 + ConvNeXt + SAM + DINOv2 ViT-G | Poly-Expert Fusion | 1024 | 2024/01 |
| LLaVA-MoLE: Sparse Mixture of LoRA Experts for Mitigating Data Conflicts in Instruction Finetuning MLLMs | Github | A | 1 | Vicuna-1.5 | CLIP ViT-L/14 | MLP | 336 | 2024/01 |
| MoE-LLaVA: Mixture of Experts for Large Vision-Language Models | Github | A | 1 | StableLM / Qwen / Phi-2 | CLIP ViT-L/14 | MLP | 336 | 2024/01 |
| MobileVLM V2: Faster and Stronger Baseline for Vision Language Model | Github | A | 1 | MobileLLaMA | CLIP ViT-L/14 | LDP v2 | 336 | 2024/02 |
| Bunny: Efficient Multimodal Learning from Data-centric Perspective | Github | A | 1 | Phi-1.5 / LLaMA-3 / StableLM-2 / Phi-2 | SigLIP, EVA-CLIP | MLP | 1152 | 2024/02 |
| TinyLLaVA: A Framework of Small-scale Large Multimodal Models | Github | A | 1 | TinyLLaMA / Phi-2 / StableLM-2 | SigLIP-L, CLIP ViT-L | MLP | 336/384 | 2024/02 |
| SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models | Github | A | 1 | TinyLLaMA / InternLM2 / LLaMA2 / Mixtral | CLIP ConvNeXt-XXL + DINOv2 ViT-G/14 | Linear | 672 | 2024/02 |
| Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models | Github | A | 1 | Gemma / Vicuna / Mixtral / Hermes-2-Yi | CLIP ViT-L + ConvNeXt-L | Cross-Attention + MLP | 1536 | 2024/03 |
| DeepSeek-VL: Towards Real-World Vision-Language Understanding | Github | A | 1 | DeepSeek LLM | SigLIP-L, SAM-B | MLP | 1024 | 2024/03 |
| LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images | Github | A | 1 | Vicuna | CLIP ViT-L/14 | Perceiver | 336*6 | 2024/03 |
| [Yi-VL] Yi: Open Foundation Models by 01.AI | Github | A | 1 | Yi | CLIP ViT-H/14 | MLP | 448 | 2024/03 |
| MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training | Github | A | 1 | in-house LLM | CLIP ViT-H* | C-Abstractor | 1792 | 2024/03 |
| VL-Mamba: Exploring State Space Models for Multimodal Learning | Github | A | 1 | Mamba LLM | CLIP ViT-L / SigLIP-SO400M | VSS + MLP | 384 | 2024/03 |
| Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference | Github | A | 1 | Mamba-Zephyr | DINOv2 + SigLIP | MLP | 384 | 2024/03 |
| [InternVL 1.5] How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites | Github | A | 1 | InternLM2 | InternViT-6B | MLP | 448*40 | 2024/04 |
| [Phi-3-Vision] Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone | Github | A | 1 | Phi-3 | CLIP ViT-L/14 | MLP | 336*16 | 2024/04 |
| PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning | Github | A | 1 | Vicuna / Mistral / Hermes-2-Yi | CLIP ViT-L/14 | MLP + Adaptive Pooling | 336 | 2024/04 |
| TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models | Github | A | 1 | InternLM-1 | SigLIP-SO400M/14 | Resampler + MLP | unlimited | 2024/04 |
| Imp: Highly Capable Large Multimodal Models for Mobile Devices | Github | A | 1 | Phi-2 | SigLIP | MLP | 384 | 2024/05 |
| [IDEFICS2] What matters when building vision-language models? | HF | A | 1 | Mistral-v0.1 | SigLIP-SO400M/14 | Perceiver + MLP | 384*4 | 2024/05 |
| ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models | Github | A | 1 | Vicuna | CLIP-ConvNeXt-L* | MLP | 1536 | 2024/05 |
| Ovis: Structural Embedding Alignment for Multimodal Large Language Model | Github | A | 1 | LLaMA3 / Qwen1.5 | CLIP ViT-L + Visual Embedding | - | 336 | 2024/05 |
| DeCo: Decoupling Token Compression from Semantic Abstraction in Multimodal Large Language Models | Github | A | 1 | Vicuna-1.5 | CLIP ViT-L/14 | MLP + Adaptive Pooling | 336 | 2024/05 |
| CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts | Github | A | 1 | Mistral / Mixtral | CLIP ViT-L/14 | MLP | 336 | 2024/05 |
| Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs | Github | A | 1 | Vicuna-1.5 / LLaMA-3 / Hermes-2-Yi | CLIP ViT-L/14 + DINOv2 ViT-L/14 + SigLIP ViT-SO400M + OpenCLIP ConvNeXt-XXL | Spatial Vision Aggregator | 1024 | 2024/06 |
| GLM-4v | Github | A | 1 | GLM4 | EVA-CLIP-E | Conv + SwiGLU | 1120 | 2024/06 |
| InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output | Github | A | 1 | InternLM-2 | CLIP ViT-L/14 | MLP | 560*24 | 2024/07 |
| [IDEFICS3] Building and better understanding vision-language models: insights and future directions | HF | A | 1 | LLaMA 3.1 | SigLIP-SO400M/14 | Perceiver + MLP | 1820 | 2024/08 |
| mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models | Github | A | 1 | Qwen2 | SigLIP-SO400M/14 | Linear | 384*6 | 2024/08 |
| CogVLM2: Visual Language Models for Image and Video Understanding | Github | A | 1 | LLaMA3 | EVA-CLIP-E | Conv + SwiGLU | 1344 | 2024/08 |
| CogVLM2-Video: Visual Language Models for Image and Video Understanding | Github | A | 1 | LLaMA3 | EVA-CLIP-E | Conv + SwiGLU | 224 | 2024/08 |
| LLaVA-OneVision: Easy Visual Task Transfer | Github | A | 1 | Qwen-2 | SigLIP-SO400M/14 | MLP | 384*36 | 2024/09 |
| Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | Github | A | 1 | Qwen-2 | ViT-675M | MLP | unlimited | 2024/09 |
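
Besides the Linear/MLP projectors, many entries in the Connection column above (Q-Former, Perceiver, Abstractor, Resampler) compress a variable number of visual features into a fixed number of tokens through learnable queries and cross-attention. The following is a hedged toy sketch of that idea; module names and sizes are illustrative and not taken from any listed repository:

```python
import torch
import torch.nn as nn

class QueryResampler(nn.Module):
    def __init__(self, num_queries=32, vis_dim=64, llm_dim=128, heads=4):
        super().__init__()
        # A fixed set of learnable query vectors, independent of the input length.
        self.queries = nn.Parameter(torch.randn(num_queries, vis_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(vis_dim, heads, batch_first=True)
        self.proj = nn.Linear(vis_dim, llm_dim)   # final projection into the LLM embedding space

    def forward(self, vis_feats):                  # vis_feats: (B, N_patches, vis_dim), N may vary
        q = self.queries.unsqueeze(0).expand(vis_feats.size(0), -1, -1)
        out, _ = self.cross_attn(q, vis_feats, vis_feats)   # queries attend to the image features
        return self.proj(out)                      # (B, num_queries, llm_dim): fixed-length output

tokens = QueryResampler()(torch.randn(2, 257, 64))
print(tokens.shape)  # torch.Size([2, 32, 128]) regardless of the input patch count
```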

With Vision and Text Output

| Large Vision-Language Model | Code | Input Type | Output Type | LLM Backbone | Modality Encoder | Modality Decoder | Date |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GILL: Generating Images with Multimodal Language Models | Github | A | 2 | OPT | CLIP ViT-L | SD | 2023/05 |
| Emu: Generative Pretraining in Multimodality | Github | A | 2 | LLaMA | EVA-02-CLIP-1B | SD | 2023/07 |
| LaVIT: Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization | Github | A | 3 | LLaMA | Eva-CLIP ViT-G/14 + LaVIT Tokenizer | LaVIT De-Tokenizer | 2023/09 |
| [CM3Leon] Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning | Github | B | 3 | CM3Leon | Make-A-Scene | Make-A-Scene | 2023/09 |
| DreamLLM: Synergistic Multimodal Comprehension and Creation | Github | A | 2 | Vicuna | CLIP ViT-L/14 | SD | 2023/09 |
| Kosmos-G: Generating Images in Context with Multimodal Large Language Models | Github | A | 2 | MAGNETO | CLIP ViT-L/14 | SD | 2023/10 |
| SEED-LLaMA: Making LLaMA SEE and Draw with SEED Tokenizer | Github | B | 3 | Vicuna / LLaMA-2 | SEED Tokenizer | SEED De-Tokenizer | 2023/10 |
| MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens | Github | A | 2 | Vicuna | Eva-CLIP ViT-G/14 | SD | 2023/10 |
| Emu-2: Generative Multimodal Models are In-Context Learners | Github | A | 2 | LLaMA | EVA-02-CLIP-E-plus | SDXL | 2023/12 |
| Chameleon: Mixed-Modal Early-Fusion Foundation Models | Github | B | 3 | Chameleon | Make-A-Scene | Make-A-Scene | 2024/05 |
| MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts | - | B | 3 | Chameleon | Make-A-Scene | Make-A-Scene | 2024/07 |
| VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation | Github | B | 3 | LLaMA-2 | SigLIP + RQ-VAE | RQ-VAE | 2024/09 |
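
The Output Type 3 models above (e.g., CM3Leon, SEED-LLaMA, Chameleon, VILA-U) work in a unified discrete space: a VQ-style tokenizer turns images into codebook indices, the LLM vocabulary is extended with those image codes so a single autoregressive model predicts text and image tokens alike, and a de-tokenizer maps generated codes back to pixels. A minimal sketch follows, with toy sizes and placeholder modules rather than any model's real components:

```python
import torch
import torch.nn as nn

TEXT_VOCAB, IMAGE_CODES = 1000, 256          # toy sizes; real systems use far larger vocabularies

class ToyVQTokenizer(nn.Module):
    """Quantizes patch features to the nearest codebook entry (encoder/decoder networks omitted)."""
    def __init__(self, num_codes=IMAGE_CODES, dim=32):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def encode(self, patch_feats):                           # (B, N, dim) -> (B, N) discrete indices
        diff = patch_feats.unsqueeze(-2) - self.codebook.weight  # (B, N, num_codes, dim)
        return diff.pow(2).sum(-1).argmin(-1)                # nearest codebook index per patch

class UnifiedDiscreteLM(nn.Module):
    """One token space: ids in [0, TEXT_VOCAB) are text, [TEXT_VOCAB, TEXT_VOCAB+IMAGE_CODES) are image codes."""
    def __init__(self, dim=128):
        super().__init__()
        vocab = TEXT_VOCAB + IMAGE_CODES
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)   # stand-in for the LLM backbone
        self.head = nn.Linear(dim, vocab)                            # predicts text OR image tokens

    def forward(self, token_ids):
        return self.head(self.backbone(self.embed(token_ids)))

tok = ToyVQTokenizer()
image_ids = tok.encode(torch.randn(1, 16, 32)) + TEXT_VOCAB  # shift image codes past the text vocab
text_ids = torch.randint(0, TEXT_VOCAB, (1, 8))
logits = UnifiedDiscreteLM()(torch.cat([text_ids, image_ids], dim=1))
print(logits.shape)  # torch.Size([1, 24, 1256]): one next-token distribution over text + image codes
```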

Large Audio-Language Models

| Large Audio-Language Model | Code | Input Type | Output Type | Output Modality | Backbone | Modality Encoder | Modality Decoder | Date |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities | Github | B | 3 | Text, Audio | LLaMA | HuBERT | Unit Vocoder | 2023/05 |
| Speech-LLaMA: On decoder-only architecture for speech-to-text and large language model integration | - | A | 1 | Text | LLaMA | CTC compressor | - | 2023/07 |
| SALMONN: Towards Generic Hearing Abilities for Large Language Models | Github | A | 1 | Text | Vicuna | Whisper-Large-v2 + BEATs | - | 2023/10 |
| Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models | Github | A | 1 | Text | Qwen | Whisper-Large-v2 | - | 2023/11 |
| SpeechGPT-Gen: Scaling Chain-of-Information Speech Generation | Github | B | 3 | Text, Audio | LLaMA-2 | SpeechTokenizer | SpeechTokenizer | 2024/01 |
| SLAM-ASR: An Embarrassingly Simple Approach for LLM with Strong ASR Capacity | Github | A | 1 | Text | LLaMA-2 | HuBERT | - | 2024/02 |
| WavLLM: Towards Robust and Adaptive Speech Large Language Model | Github | A | 1 | Text | LLaMA-2 | Whisper-Large-v2 + WavLM-Base | - | 2024/04 |
| SpeechVerse: A Large-scale Generalizable Audio Language Model | - | A | 1 | Text | Flan-T5-XL | WavLM-Large / Best-RQ | - | 2024/05 |
| Qwen2-Audio Technical Report | Github | A | 1 | Text | Qwen | Whisper-Large-v3 | - | 2024/07 |
| LLaMA-Omni: Seamless Speech Interaction with Large Language Models | Github | A | 2 | Text, Audio | LLaMA-3.1 | Whisper-Large-v3 | Unit Vocoder | 2024/09 |
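
For the speech-output models above (e.g., SpeechGPT, LLaMA-Omni), the audible response is typically obtained by predicting discrete acoustic units (HuBERT-style cluster ids) and passing them to a unit vocoder. The sketch below is only a toy stand-in for that last step, with made-up sizes; the real systems use HiFi-GAN-style vocoders trained on such units:

```python
import torch
import torch.nn as nn

NUM_UNITS = 100                      # e.g. k-means clusters over HuBERT features (toy size)

class ToyUnitVocoder(nn.Module):
    """Maps a sequence of discrete units to a waveform by simple upsampling (stand-in for a real vocoder)."""
    def __init__(self, dim=64, hop=320):
        super().__init__()
        self.unit_embed = nn.Embedding(NUM_UNITS, dim)
        self.upsample = nn.ConvTranspose1d(dim, 1, kernel_size=hop, stride=hop)

    def forward(self, unit_ids):                       # (B, T_units) -> (B, samples)
        x = self.unit_embed(unit_ids).transpose(1, 2)  # (B, dim, T_units)
        return self.upsample(x).squeeze(1)

units = torch.randint(0, NUM_UNITS, (1, 50))           # units the LLM would have generated
wave = ToyUnitVocoder()(units)
print(wave.shape)                                      # torch.Size([1, 16000]): ~1 s of 16 kHz audio
```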

Any Modality Models

| Any Modality Model | Code | Input Type | Output Type | Output Modality | Backbone | Modality Encoder | Modality Decoder | Date |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PandaGPT: One Model To Instruction-Follow Them All | Github | A | 1 | Text | Vicuna | ImageBind | - | 2023/05 |
| ImageBind-LLM: Multi-modality Instruction Tuning | Github | A | 1 | Text | Chinese-LLaMA | ImageBind + PointBind | - | 2023/09 |
| NExT-GPT: Any-to-Any Multimodal LLM | Github | A | 2 | Text, Vision, Audio | Vicuna | ImageBind | SD + AudioLDM + Zeroscope | 2023/09 |
| CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation | Github | A | 2 | Text, Vision, Audio | LLaMA-2 | ImageBind | SD + AudioLDM2 + Zeroscope v2 | 2023/11 |
| Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action | Github | A | 3 | Text, Vision, Audio | UnifiedIO2 | OpenCLIP ViT-B + AST | VQ-GAN + ViT-VQGAN | 2023/12 |
| AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling | Github | B | 3 | Text, Vision, Audio | LLaMA-2 | SEED + Encodec + SpeechTokenizer | SEED + Encodec + SpeechTokenizer | 2024/02 |
| Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts | Github | A | 1 | Text | LLaMA | CLIP ViT-L/14 + Whisper-small + BEATs | - | 2024/05 |
