
Awesome-Large-Multimodal-Models

This repo collects the papers of "A Survey on Large Multi-Modal Models from the Perspective of Input-Output Space Extension" and summarizes the construction of current LMMs from the perspective of input-output representation space extension.

  • Based on the structure of input-output spaces, we systematically review the existing models, including mainstream models based on discrete-continuous hybrid spaces and models with unified multi-modal discrete representations (a minimal sketch of the hybrid recipe follows this list).
  • Readers can refer to our [📖 Preprint Paper] for detailed explanations.
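
Most of the text-only-output LVLMs collected below follow the hybrid recipe: a frozen encoder produces continuous visual features, a lightweight connector (a Linear/MLP projector here) maps them into the LLM's embedding space, and the projected visual tokens are simply prepended to the text tokens. The sketch below is a toy PyTorch illustration with placeholder modules and made-up dimensions, not the implementation of any specific model:

```python
import torch
import torch.nn as nn

class ToyConnectorLVLM(nn.Module):
    def __init__(self, vis_dim=64, llm_dim=128, vocab_size=1000):
        super().__init__()
        # Stand-in for a frozen vision encoder (e.g., a CLIP ViT producing patch features).
        self.vision_encoder = nn.Linear(vis_dim, vis_dim)
        # MLP-style connector: maps continuous visual features into the LLM embedding space.
        self.connector = nn.Sequential(nn.Linear(vis_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))
        self.text_embed = nn.Embedding(vocab_size, llm_dim)
        layer = nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True)
        # Stand-in for the decoder-only LLM backbone (causal masking omitted for brevity).
        self.llm = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, image_patches, text_ids):
        vis_feats = self.vision_encoder(image_patches)     # (B, N_patches, vis_dim), continuous
        vis_tokens = self.connector(vis_feats)             # projected into the LLM embedding space
        txt_tokens = self.text_embed(text_ids)             # (B, N_text, llm_dim)
        seq = torch.cat([vis_tokens, txt_tokens], dim=1)   # visual tokens prepended to the text
        return self.lm_head(self.llm(seq))                 # next-token logits over the text vocabulary

logits = ToyConnectorLVLM()(torch.randn(1, 16, 64), torch.randint(0, 1000, (1, 8)))
print(logits.shape)  # torch.Size([1, 24, 1000]) -- the model can only emit text (Output Type 1)
```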

Table of Contents

  • Preliminary
  • Awesome Models
    • Large Vision-Language Models
      • With Text-only Output
      • With Vision and Text Output
    • Large Audio-Language Models
    • Any Modality Models

Preliminary

As presented in the figure below, the evolution of multi-modal research paradigms can be divided into three stages.

To give readers a general picture of this development, we provide a tutorial here; the collected papers are organized in the sections that follow.

Awesome Models

In the tables below, Input Type A denotes continuous features produced by a modality encoder, while B denotes discrete tokens produced by a modality tokenizer; Output Type 1 denotes text-only output, 2 denotes continuous output representations passed to an external decoder (e.g., a diffusion model), and 3 denotes discrete output tokens decoded by a de-tokenizer or vocoder.

Large Vision-Language Models

With Text-only Output

| Large Vision-Language Model | Code | Input Type | Output Type | LLM Backbone | Modality Encoder | Connection | Max Res. | Date |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Flamingo: a Visual Language Model for Few-Shot Learning | Github | A | 1 | Chinchilla | NFNet | Perceiver | 480 | 2022/04 |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | Github | A | 1 | Flan-T5 / OPT | CLIP ViT-L/14 / Eva-CLIP ViT-G/14 | Q-Former | 224 | 2023/01 |
| LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention | Github | A | 1 | LLaMA | CLIP ViT-L/14 | MLP | 224 | 2023/03 |
| MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | Github | A | 1 | Vicuna | Eva-CLIP ViT-G/14 | Q-Former | 224 | 2023/04 |
| Visual Instruction Tuning | Github | A | 1 | Vicuna | CLIP ViT-L/14 | Linear | 224 | 2023/04 |
| mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality | Github | A | 1 | LLaMA | CLIP ViT-L/14 | Abstractor | 224 | 2023/04 |
| LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model | Github | A | 1 | LLaMA | CLIP ViT-L/14 | MLP | 224 | 2023/04 |
| InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning | Github | A | 1 | Flan-T5 / Vicuna | Eva-CLIP ViT-G/14 | Q-Former | 224 | 2023/05 |
| Otter: A Multi-Modal Model with In-Context Instruction Tuning | Github | A | 1 | LLaMA | CLIP ViT-L/14 | Perceiver | 224 | 2023/05 |
| Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models | Github | A | 1 | LLaMA | CLIP ViT-L/14 | MLP | 224 | 2023/05 |
| MultiModal-GPT: A Vision and Language Model for Dialogue with Humans | Github | A | 1 | LLaMA | CLIP ViT-L/14 | Perceiver | 224 | 2023/05 |
| Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic | Github | A | 1 | Vicuna | CLIP ViT-L/14 | Linear | 224 | 2023/06 |
| Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | Github | A | 1 | Vicuna | CLIP ViT-L/14 | Linear | 224 | 2023/06 |
| Valley: Video Assistant with Large Language model Enhanced abilitY | Github | A | 1 | Stable-Vicuna | CLIP ViT-L/14 | Temporal Module + Linear | 224 | 2023/06 |
| What Matters in Training a GPT4-Style Language Model with Multimodal Inputs? | Github | A | 1 | Vicuna | EVA-1B | Resampler | 420 | 2023/07 |
| Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | Github | A | 1 | Qwen | OpenCLIP ViT-bigG | Cross-Attention | 448 | 2023/08 |
| BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions | Github | A | 1 | Flan-T5 / Vicuna | Eva-CLIP ViT-G/14 | Q-Former + MLP | 224 | 2023/08 |
| IDEFICS | Huggingface | A | 1 | LLaMA | OpenCLIP ViT-H/14 | Perceiver | 224 | 2023/08 |
| OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models | Github | A | 1 | LLaMA, MPT | CLIP ViT-L/14 | Perceiver | 224 | 2023/08 |
| InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition | Github | A | 1 | InternLM | Eva-CLIP ViT-G/14 | Perceiver | 224 | 2023/09 |
| Improved Baselines with Visual Instruction Tuning | Github | A | 1 | Vicuna 1.5 | CLIP ViT-L/14 | MLP | 336 | 2023/10 |
| MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning | Github | A | 1 | LLaMA-2 | EVA | Linear | 448 | 2023/10 |
| Fuyu-8B: A Multimodal Architecture for AI Agents | HF | A | 1 | Persimmon | - | Linear | unlimited | 2023/10 |
| UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model | Github | A | 1 | LLaMA | CLIP ViT-L/14 | Abstractor | 224*20 | 2023/10 |
| CogVLM: Visual Expert for Pretrained Language Models | Github | A | 1 | Vicuna 1.5 | EVA2-CLIP-E | MLP | 490 | 2023/11 |
| Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models | Github | A | 1 | Qwen | OpenCLIP ViT-bigG | Cross-Attention | 896 | 2023/11 |
| ShareGPT4V: Improving Large Multi-Modal Models with Better Captions | Github | A | 1 | Vicuna-1.5 | CLIP ViT-L/14 | MLP | 336 | 2023/11 |
| mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration | Github | A | 1 | LLaMA-2 | CLIP ViT-L/14 | Abstractor | 448 | 2023/11 |
| SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models | Github | A | 1 | LLaMA-2 | CLIP ViT-L/14 + CLIP ConvNeXt-XXL + DINOv2 ViT-G/14 | Linear + Q-Former | 672 | 2023/11 |
| InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | Github | A | 1 | Vicuna | InternViT | QLLaMA / MLP | 336 | 2023/12 |
| MobileVLM: A Fast, Strong and Open Vision Language Assistant for Mobile Devices | Github | A | 1 | MobileLLaMA | CLIP ViT-L/14 | LDP (conv-based) | 336 | 2023/12 |
| VILA: On Pre-training for Visual Language Models | Github | A | 1 | LLaMA-2 | CLIP ViT-L | Linear | 336 | 2023/12 |
| Osprey: Pixel Understanding with Visual Instruction Tuning | Github | A | 1 | Vicuna | CLIP ConvNeXt-L | MLP | 512 | 2023/12 |
| Honeybee: Locality-enhanced Projector for Multimodal LLM | Github | A | 1 | Vicuna-1.5 | CLIP ViT-L/14 | C-Abstractor / D-Abstractor | 336 | 2023/12 |
| Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts | - | A | 1 | UL2 | SigLIP ViT-G/14 | Linear | 1064 | 2023/12 |
| LLaVA-NeXT: Improved reasoning, OCR, and world knowledge | Github | A | 1 | Vicuna / Mistral / Hermes-2-Yi | CLIP ViT-L/14 | MLP | 672 | 2024/01 |
| InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model | Github | A | 1 | InternLM-2 | CLIP ViT-L/14 | MLP | 490 | 2024/01 |
| MouSi: Poly-Visual-Expert Vision-Language Models | Github | A | 1 | Vicuna-1.5 | CLIP ViT-L/14 + MAE + LayoutLMv3 + ConvNeXt + SAM + DINOv2 ViT-G | Poly-Expert Fusion | 1024 | 2024/01 |
| LLaVA-MoLE: Sparse Mixture of LoRA Experts for Mitigating Data Conflicts in Instruction Finetuning MLLMs | Github | A | 1 | Vicuna-1.5 | CLIP ViT-L/14 | MLP | 336 | 2024/01 |
| MoE-LLaVA: Mixture of Experts for Large Vision-Language Models | Github | A | 1 | StableLM / Qwen / Phi-2 | CLIP ViT-L/14 | MLP | 336 | 2024/01 |
| MobileVLM V2: Faster and Stronger Baseline for Vision Language Model | Github | A | 1 | MobileLLaMA | CLIP ViT-L/14 | LDP v2 | 336 | 2024/02 |
| Bunny: Efficient Multimodal Learning from Data-centric Perspective | Github | A | 1 | Phi-1.5 / LLaMA-3 / StableLM-2 / Phi-2 | SigLIP, EVA-CLIP | MLP | 1152 | 2024/02 |
| TinyLLaVA: A Framework of Small-scale Large Multimodal Models | Github | A | 1 | TinyLLaMA / Phi-2 / StableLM-2 | SigLIP-L, CLIP ViT-L | MLP | 336/384 | 2024/02 |
| SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models | Github | A | 1 | TinyLLaMA / InternLM2 / LLaMA2 / Mixtral | CLIP ConvNeXt-XXL + DINOv2 ViT-G/14 | Linear | 672 | 2024/02 |
| Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models | Github | A | 1 | Gemma / Vicuna / Mixtral / Hermes-2-Yi | CLIP ViT-L + ConvNeXt-L | Cross-Attention + MLP | 1536 | 2024/03 |
| DeepSeek-VL: Towards Real-World Vision-Language Understanding | Github | A | 1 | DeepSeek LLM | SigLIP-L, SAM-B | MLP | 1024 | 2024/03 |
| LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images | Github | A | 1 | Vicuna | CLIP ViT-L/14 | Perceiver | 336*6 | 2024/03 |
| [Yi-VL] Yi: Open Foundation Models by 01.AI | Github | A | 1 | Yi | CLIP ViT-H/14 | MLP | 448 | 2024/03 |
| MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training | Github | A | 1 | in-house LLM | CLIP ViT-H* | C-Abstractor | 1792 | 2024/03 |
| VL-Mamba: Exploring State Space Models for Multimodal Learning | Github | A | 1 | Mamba LLM | CLIP ViT-L / SigLIP-SO400M | VSS + MLP | 384 | 2024/03 |
| Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference | Github | A | 1 | Mamba-Zephyr | DINOv2 + SigLIP | MLP | 384 | 2024/03 |
| [InternVL 1.5] How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites | Github | A | 1 | InternLM2 | InternViT-6B | MLP | 448*40 | 2024/04 |
| [Phi-3-Vision] Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone | Github | A | 1 | Phi-3 | CLIP ViT-L/14 | MLP | 336*16 | 2024/04 |
| PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning | Github | A | 1 | Vicuna / Mistral / Hermes-2-Yi | CLIP ViT-L/14 | MLP + Adaptive Pooling | 336 | 2024/04 |
| TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models | Github | A | 1 | InternLM-1 | SigLIP-SO400M/14 | Resampler + MLP | unlimited | 2024/04 |
| Imp: Highly Capable Large Multimodal Models for Mobile Devices | Github | A | 1 | Phi-2 | SigLIP | MLP | 384 | 2024/05 |
| [IDEFICS2] What matters when building vision-language models? | HF | A | 1 | Mistral-v0.1 | SigLIP-SO400M/14 | Perceiver + MLP | 384*4 | 2024/05 |
| ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models | Github | A | 1 | Vicuna | CLIP-ConvNeXt-L* | MLP | 1536 | 2024/05 |
| Ovis: Structural Embedding Alignment for Multimodal Large Language Model | Github | A | 1 | LLaMA3 / Qwen1.5 | CLIP ViT-L + Visual Embedding | - | 336 | 2024/05 |
| DeCo: Decoupling Token Compression from Semantic Abstraction in Multimodal Large Language Models | Github | A | 1 | Vicuna-1.5 | CLIP ViT-L/14 | MLP + Adaptive Pooling | 336 | 2024/05 |
| CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts | Github | A | 1 | Mistral / Mixtral | CLIP ViT-L/14 | MLP | 336 | 2024/05 |
| Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs | Github | A | 1 | Vicuna-1.5 / LLaMA-3 / Hermes-2-Yi | CLIP ViT-L/14 + DINOv2 ViT-L/14 + SigLIP ViT-SO400M + OpenCLIP ConvNeXt-XXL | Spatial Vision Aggregator | 1024 | 2024/06 |
| GLM-4v | Github | A | 1 | GLM4 | EVA-CLIP-E | Conv + SwiGLU | 1120 | 2024/06 |
| InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output | Github | A | 1 | InternLM-2 | CLIP ViT-L/14 | MLP | 560*24 | 2024/07 |
| [IDEFICS3] Building and better understanding vision-language models: insights and future directions | HF | A | 1 | LLaMA 3.1 | SigLIP-SO400M/14 | Perceiver + MLP | 1820 | 2024/08 |
| mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models | Github | A | 1 | Qwen2 | SigLIP-SO400M/14 | Linear | 384*6 | 2024/08 |
| CogVLM2: Visual Language Models for Image and Video Understanding | Github | A | 1 | LLaMA3 | EVA-CLIP-E | Conv + SwiGLU | 1344 | 2024/08 |
| CogVLM2-Video: Visual Language Models for Image and Video Understanding | Github | A | 1 | LLaMA3 | EVA-CLIP-E | Conv + SwiGLU | 224 | 2024/08 |
| LLaVA-OneVision: Easy Visual Task Transfer | Github | A | 1 | Qwen-2 | SigLIP-SO400M/14 | MLP | 384*36 | 2024/09 |
| Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | Github | A | 1 | Qwen-2 | ViT-675M | MLP | unlimited | 2024/09 |
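
Besides the Linear/MLP projectors, many entries in the Connection column above (Q-Former, Perceiver, Abstractor, Resampler) compress a variable number of visual features into a fixed number of tokens through learnable queries and cross-attention. The following is a hedged toy sketch of that idea; module names and sizes are illustrative and not taken from any listed repository:

```python
import torch
import torch.nn as nn

class QueryResampler(nn.Module):
    def __init__(self, num_queries=32, vis_dim=64, llm_dim=128, heads=4):
        super().__init__()
        # A fixed set of learnable query vectors, independent of the input length.
        self.queries = nn.Parameter(torch.randn(num_queries, vis_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(vis_dim, heads, batch_first=True)
        self.proj = nn.Linear(vis_dim, llm_dim)   # final projection into the LLM embedding space

    def forward(self, vis_feats):                  # vis_feats: (B, N_patches, vis_dim), N may vary
        q = self.queries.unsqueeze(0).expand(vis_feats.size(0), -1, -1)
        out, _ = self.cross_attn(q, vis_feats, vis_feats)   # queries attend to the image features
        return self.proj(out)                      # (B, num_queries, llm_dim): fixed-length output

tokens = QueryResampler()(torch.randn(2, 257, 64))
print(tokens.shape)  # torch.Size([2, 32, 128]) regardless of the input patch count
```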

With Vision and Text Output

| Large Vision-Language Model | Code | Input Type | Output Type | LLM Backbone | Modality Encoder | Modality Decoder | Date |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GILL: Generating Images with Multimodal Language Models | Github | A | 2 | OPT | CLIP ViT-L | SD | 2023/05 |
| Emu: Generative Pretraining in Multimodality | Github | A | 2 | LLaMA | EVA-02-CLIP-1B | SD | 2023/07 |
| LaVIT: Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization | Github | A | 3 | LLaMA | Eva-CLIP ViT-G/14 + LaVIT Tokenizer | LaVIT De-Tokenizer | 2023/09 |
| [CM3Leon] Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning | Github | B | 3 | CM3Leon | Make-A-Scene | Make-A-Scene | 2023/09 |
| DreamLLM: Synergistic Multimodal Comprehension and Creation | Github | A | 2 | Vicuna | CLIP ViT-L/14 | SD | 2023/09 |
| Kosmos-G: Generating Images in Context with Multimodal Large Language Models | Github | A | 2 | MAGNETO | CLIP ViT-L/14 | SD | 2023/10 |
| SEED-LLaMA: Making LLaMA SEE and Draw with SEED Tokenizer | Github | B | 3 | Vicuna / LLaMA-2 | SEED Tokenizer | SEED De-Tokenizer | 2023/10 |
| MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens | Github | A | 2 | Vicuna | Eva-CLIP ViT-G/14 | SD | 2023/10 |
| Emu-2: Generative Multimodal Models are In-Context Learners | Github | A | 2 | LLaMA | EVA-02-CLIP-E-plus | SDXL | 2023/12 |
| Chameleon: Mixed-Modal Early-Fusion Foundation Models | Github | B | 3 | Chameleon | Make-A-Scene | Make-A-Scene | 2024/05 |
| MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts | - | B | 3 | Chameleon | Make-A-Scene | Make-A-Scene | 2024/07 |
| VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation | Github | B | 3 | LLaMA-2 | SigLIP + RQ-VAE | RQ-VAE | 2024/09 |
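
The Output Type 3 models above (e.g., CM3Leon, SEED-LLaMA, Chameleon, VILA-U) work in a unified discrete space: a VQ-style tokenizer turns images into codebook indices, the LLM vocabulary is extended with those image codes so a single autoregressive model predicts text and image tokens alike, and a de-tokenizer maps generated codes back to pixels. A minimal sketch follows, with toy sizes and placeholder modules rather than any model's real components:

```python
import torch
import torch.nn as nn

TEXT_VOCAB, IMAGE_CODES = 1000, 256          # toy sizes; real systems use far larger vocabularies

class ToyVQTokenizer(nn.Module):
    """Quantizes patch features to the nearest codebook entry (encoder/decoder networks omitted)."""
    def __init__(self, num_codes=IMAGE_CODES, dim=32):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def encode(self, patch_feats):                           # (B, N, dim) -> (B, N) discrete indices
        diff = patch_feats.unsqueeze(-2) - self.codebook.weight  # (B, N, num_codes, dim)
        return diff.pow(2).sum(-1).argmin(-1)                # nearest codebook index per patch

class UnifiedDiscreteLM(nn.Module):
    """One token space: ids in [0, TEXT_VOCAB) are text, [TEXT_VOCAB, TEXT_VOCAB+IMAGE_CODES) are image codes."""
    def __init__(self, dim=128):
        super().__init__()
        vocab = TEXT_VOCAB + IMAGE_CODES
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)   # stand-in for the LLM backbone
        self.head = nn.Linear(dim, vocab)                            # predicts text OR image tokens

    def forward(self, token_ids):
        return self.head(self.backbone(self.embed(token_ids)))

tok = ToyVQTokenizer()
image_ids = tok.encode(torch.randn(1, 16, 32)) + TEXT_VOCAB  # shift image codes past the text vocab
text_ids = torch.randint(0, TEXT_VOCAB, (1, 8))
logits = UnifiedDiscreteLM()(torch.cat([text_ids, image_ids], dim=1))
print(logits.shape)  # torch.Size([1, 24, 1256]): one next-token distribution over text + image codes
```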

Large Audio-Language Models

| Large Audio-Language Model | Code | Input Type | Output Type | Output Modality | Backbone | Modality Encoder | Modality Decoder | Date |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities | Github | B | 3 | Text, Audio | LLaMA | HuBERT | Unit Vocoder | 2023/05 |
| Speech-LLaMA: On decoder-only architecture for speech-to-text and large language model integration | - | A | 1 | Text | LLaMA | CTC compressor | - | 2023/07 |
| SALMONN: Towards Generic Hearing Abilities for Large Language Models | Github | A | 1 | Text | Vicuna | Whisper-Large-v2 + BEATs | - | 2023/10 |
| Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models | Github | A | 1 | Text | Qwen | Whisper-Large-v2 | - | 2023/11 |
| SpeechGPT-Gen: Scaling Chain-of-Information Speech Generation | Github | B | 3 | Text, Audio | LLaMA-2 | SpeechTokenizer | SpeechTokenizer | 2024/01 |
| SLAM-ASR: An Embarrassingly Simple Approach for LLM with Strong ASR Capacity | Github | A | 1 | Text | LLaMA-2 | HuBERT | - | 2024/02 |
| WavLLM: Towards Robust and Adaptive Speech Large Language Model | Github | A | 1 | Text | LLaMA-2 | Whisper-Large-v2 + WavLM-Base | - | 2024/04 |
| SpeechVerse: A Large-scale Generalizable Audio Language Model | - | A | 1 | Text | Flan-T5-XL | WavLM-Large / Best-RQ | - | 2024/05 |
| Qwen2-Audio Technical Report | Github | A | 1 | Text | Qwen | Whisper-Large-v3 | - | 2024/07 |
| LLaMA-Omni: Seamless Speech Interaction with Large Language Models | Github | A | 2 | Text, Audio | LLaMA-3.1 | Whisper-Large-v3 | Unit Vocoder | 2024/09 |
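
For the speech-output models above (e.g., SpeechGPT, LLaMA-Omni), the audible response is typically obtained by predicting discrete acoustic units (HuBERT-style cluster ids) and passing them to a unit vocoder. The sketch below is only a toy stand-in for that last step, with made-up sizes; the real systems use HiFi-GAN-style vocoders trained on such units:

```python
import torch
import torch.nn as nn

NUM_UNITS = 100                      # e.g. k-means clusters over HuBERT features (toy size)

class ToyUnitVocoder(nn.Module):
    """Maps a sequence of discrete units to a waveform by simple upsampling (stand-in for a real vocoder)."""
    def __init__(self, dim=64, hop=320):
        super().__init__()
        self.unit_embed = nn.Embedding(NUM_UNITS, dim)
        self.upsample = nn.ConvTranspose1d(dim, 1, kernel_size=hop, stride=hop)

    def forward(self, unit_ids):                       # (B, T_units) -> (B, samples)
        x = self.unit_embed(unit_ids).transpose(1, 2)  # (B, dim, T_units)
        return self.upsample(x).squeeze(1)

units = torch.randint(0, NUM_UNITS, (1, 50))           # units the LLM would have generated
wave = ToyUnitVocoder()(units)
print(wave.shape)                                      # torch.Size([1, 16000]): ~1 s of 16 kHz audio
```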

Any Modality Models

| Any Modality Model | Code | Input Type | Output Type | Output Modality | Backbone | Modality Encoder | Modality Decoder | Date |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PandaGPT: One Model To Instruction-Follow Them All | Github | A | 1 | Text | Vicuna | ImageBind | - | 2023/05 |
| ImageBind-LLM: Multi-modality Instruction Tuning | Github | A | 1 | Text | Chinese-LLaMA | ImageBind + PointBind | - | 2023/09 |
| NExT-GPT: Any-to-Any Multimodal LLM | Github | A | 2 | Text, Vision, Audio | Vicuna | ImageBind | SD + AudioLDM + Zeroscope | 2023/09 |
| CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation | Github | A | 2 | Text, Vision, Audio | LLaMA-2 | ImageBind | SD + AudioLDM2 + Zeroscope v2 | 2023/11 |
| Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action | Github | A | 3 | Text, Vision, Audio | UnifiedIO2 | OpenCLIP ViT-B + AST | VQ-GAN + ViT-VQGAN | 2023/12 |
| AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling | Github | B | 3 | Text, Vision, Audio | LLaMA-2 | SEED + Encodec + SpeechTokenizer | SEED + Encodec + SpeechTokenizer | 2024/02 |
| Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts | Github | A | 1 | Text | LLaMA | CLIP ViT-L/14 + Whisper-small + BEATs | - | 2024/05 |
