From e82cfc3159933135045bedc9602d73588b2f64aa Mon Sep 17 00:00:00 2001
From: Xiangtai Li
Date: Wed, 15 Jan 2025 11:47:19 +0800
Subject: [PATCH] Update README.md

---
 README.md | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 87b1ced..9eb3342 100644
--- a/README.md
+++ b/README.md
@@ -5,7 +5,7 @@
 [**Haobo Yuan**](https://yuanhaobo.me/)1* · [**Xiangtai Li**](https://scholar.google.com/citations?user=NmHgX-wAAAAJ)2*† · [**Tao Zhang**](https://zhang-tao-whu.github.io/)2,3* · [**Zilong Huang**](http://speedinghzl.github.io/)2 · [**Shilin Xu**](https://xushilin1.github.io/)4 · [**Shunping Ji**](https://scholar.google.com/citations?user=FjoRmF4AAAAJ&hl=en)3 · [**Yunhai Tong**](https://scholar.google.com/citations?user=T4gqdPkAAAAJ&hl=zh-CN)4 ·
-[**Lu Qi**](https://luqi.info/)2 · [**Jiashi Feng**](https://sites.google.com/site/jshfeng/)2 · [**Ming-Hsuan Yang**](https://faculty.ucmerced.edu/mhyang/)1
+[**Lu Qi**](https://luqi.info/)2 · [**Jiashi Feng**](https://scholar.google.com/citations?user=Q8iay0gAAAAJ&hl=en)2 · [**Ming-Hsuan Yang**](https://faculty.ucmerced.edu/mhyang/)1
 
 1UC Merced    2ByteDance Seed    3WHU    4PKU
 
@@ -13,11 +13,21 @@
 
 ![Teaser](assets/images/teaser.jpg)
 
+## Open-source Progress
+
+- [x] Release the 1B, 4B, 8B, and 26B models.
+- [x] Release training code.
+- [x] Release inference and testing code.
+- [x] Release demo code.
+
+
+
 ## Overview
 This repository contains the code for the paper "Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos".
 
 Sa2VA is the first unified model for dense grounded understanding of both images and videos. Unlike existing multi-modal large language models, which are often limited to specific modalities and tasks, Sa2VA supports a wide range of image and video tasks, including referring segmentation and conversation, with minimal one-shot instruction tuning. Sa2VA combines SAM-2, a foundation video segmentation model, with LLaVA, an advanced vision-language model, and unifies text, image, and video into a shared LLM token space.
 
+
 ## Model Zoo
 We provide the following models:
 | Model Name | Base MLLM | Language Part | HF Link |
@@ -25,6 +35,7 @@
 | Sa2VA-1B | [InternVL2.0-1B](https://huggingface.co/OpenGVLab/InternVL2-1B) | [Qwen2-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) | [🤗 link](https://huggingface.co/ByteDance/Sa2VA-1B) |
 | Sa2VA-4B | [InternVL2.5-4B](https://huggingface.co/OpenGVLab/InternVL2_5-4B) | [Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct) | [🤗 link](https://huggingface.co/ByteDance/Sa2VA-4B) |
 | Sa2VA-8B | [InternVL2.5-8B](https://huggingface.co/OpenGVLab/InternVL2_5-8B) | [internlm2_5-7b-chat](https://huggingface.co/internlm/internlm2_5-7b-chat) | [🤗 link](https://huggingface.co/ByteDance/Sa2VA-8B) |
+| Sa2VA-26B | [InternVL2.5-26B](https://huggingface.co/OpenGVLab/InternVL2_5-26B) | [internlm2_5-20b-chat](https://huggingface.co/internlm/internlm2_5-20b-chat) | [🤗 link](https://huggingface.co/ByteDance/Sa2VA-26B) |
 
 ## Gradio Demos
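For quick reference alongside the Model Zoo table, the sketch below shows one way to run single-image inference with a released checkpoint. It is a minimal sketch, assuming the Hugging Face checkpoints ship custom modeling code that exposes a `predict_forward` helper returning a `prediction` string (as described on the Sa2VA model cards); the chosen checkpoint, the local image path `demo.jpg`, and the prompt are placeholders.

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

# Any checkpoint from the Model Zoo table can be substituted here.
model_path = "ByteDance/Sa2VA-4B"

# The Sa2VA checkpoints rely on modeling code shipped with the weights,
# so trust_remote_code=True is required.
model = AutoModel.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True, use_fast=False)

# Placeholder inputs: a local image and an image-grounded prompt.
image = Image.open("demo.jpg").convert("RGB")
prompt = "<image>Please describe the image."

# predict_forward is the helper exposed by the checkpoint's remote code
# (an assumption based on the model cards); segmentation-style prompts
# additionally return predicted masks in the output dictionary.
result = model.predict_forward(image=image, text=prompt, tokenizer=tokenizer)
print(result["prediction"])
```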