input-output representation space extension
- Based on the structure of input-output spaces, we systematically review existing models, including mainstream models based on discrete-continuous hybrid spaces and models with unified multi-modal discrete representations.
- Readers can refer to our [📖 Preprint Paper] for detailed explanations.
As presented in the figure below, the evolution of multi-modal research paradigms can be divided into three stages.
To give readers a general picture of this development, we provide a tutorial here. Its contents are summarized as follows:
- Part 1: Vision-Language Pre-Training
- Part 2: Architectures and Training of LMMs
- Part 3: Evaluation of LMMs
- Part 4: Further Capability of LMMs
- Part 5: Extension to Embodied Agents
Any_Modality_Model | Code | Input Type | Output Type | Output Modality | Backbone | Modality Encoder | Modality Decoder | Date |
---|---|---|---|---|---|---|---|---|
PandaGPT: One Model To Instruction-Follow Them All | Github | A | 1 | Text | Vicuna | ImageBind | - | 2023/05 |
ImageBind-LLM: Multi-modality Instruction Tuning | Github | A | 1 | Text | Chinese-LLaMA | ImageBind + PointBind | - | 2023/09 |
NExT-GPT: Any-to-Any Multimodal LLM | Github | A | 2 | Text, Vision, Audio | Vicuna | ImageBind | SD + AudioLDM + ZeroScope | 2023/09 |
CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation | Github | A | 2 | Text, Vision, Audio | LLaMA-2 | ImageBind | SD + AudioLDM2 + ZeroScope v2 | 2023/11 |
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action | Github | A | 3 | Text, Vision, Audio | UnifiedIO2 | OpenCLIP ViT-B + AST | VQ-GAN + ViT-VQGAN | 2023/12 |
AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling | Github | B | 3 | Text, Vision, Audio | LLaMA-2 | SEED + Encodec + SpeechTokenizer | SEED + Encodec + SpeechTokenizer | 2024/02 |
Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts | Github | A | 1 | Text | LLaMA | CLIP ViT-L/14 + Whisper-small + BEATs | - | 2024/05 |