Awesome-3D-Visual-Grounding

A continually updated collection of papers related to Text-guided 3D Visual Grounding (T-3DVG).

Text-guided 3D visual grounding (T-3DVG), which aims to locate the specific object in a complicated 3D scene that semantically corresponds to a language query, has drawn increasing attention in the 3D research community over the past few years. T-3DVG offers great potential because of its close proximity to the real world, and poses considerable challenges because of the complexity of data collection and 3D point cloud processing.

We have summarized existing T-3DVG methods in our survey paper👍.

A Survey on Text-guided 3D Visual Grounding: Elements, Recent Advances, and Future Directions.

If you find that some important work is missing, it would be super helpful to let me know ([email protected]). Thanks!

If you find our survey useful for your research, please consider citing:

@article{liu2024survey,
  title={A Survey on Text-guided 3D Visual Grounding: Elements, Recent Advances, and Future Directions},
  author={Liu, Daizong and Liu, Yang and Huang, Wencan and Hu, Wei},
  journal={arXiv preprint arXiv:2406.05785},
  year={2024}
}

Fully-Supervised-Two-Stage

Fully-Supervised-One-Stage

  • 3DVG-Transformer: Relation Modeling for Visual Grounding on Point Clouds | Github
  • 3D-SPS: Single-Stage 3D Visual Grounding via Referred Point Progressive Selection | Github
  • Bottom Up Top Down Detection Transformers for Language Grounding in Images and Point Clouds | Github
    • Ayush Jain, Nikolaos Gkanatsios, Ishita Mediratta, Katerina Fragkiadaki
    • Carnegie Mellon University, Meta AI
    • [ECCV2022] https://arxiv.org/abs/2112.08879
    • One-stage approach, unified detection-interaction
  • EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual Grounding | Github
    • Yanmin Wu, Xinhua Cheng, Renrui Zhang, Zesen Cheng, Jian Zhang
    • Peking University, The Chinese University of Hong Kong, Peng Cheng Laboratory, Shanghai AI Laboratory
    • [CVPR2023] https://arxiv.org/abs/2209.14941
    • One-stage approach, unified detection-interaction, text-decoupling, dense
  • Dense Object Grounding in 3D Scenes
    • Wencan Huang, Daizong Liu, Wei Hu
    • Peking University
    • [ACMMM2023] https://arxiv.org/abs/2309.02224
    • One-stage approach, unified detection-interaction, transformer
  • 3DRP-Net: 3D Relative Position-aware Network for 3D Visual Grounding
    • Zehan Wang, Haifeng Huang, Yang Zhao, Linjun Li, Xize Cheng, Yichen Zhu, Aoxiong Yin, Zhou Zhao
    • Zhejiang University, ByteDance
    • [EMNLP2023] https://aclanthology.org/2023.emnlp-main.656/
    • One-stage approach, unified detection-interaction, relative position
  • LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | Github
    • Zhenfei Yin, Jiong Wang, Jianjian Cao, Zhelun Shi, Dingning Liu, Mukai Li, Lu Sheng, Lei Bai, Xiaoshui Huang, Zhiyong Wang, Jing Shao, Wanli Ouyang
    • Shanghai AI Lab, Beihang University, The Chinese University of Hong Kong (Shenzhen), Fudan University, Dalian University of Technology, The University of Sydney
    • [NeurIPS2023] https://arxiv.org/abs/2306.06687
    • A dataset, One-stage approach, regression-based, multi-task
  • PATRON: Perspective-Aware Multitask Model for Referring Expression Grounding Using Embodied Multimodal Cues
  • Toward Fine-Grained 3D Visual Grounding through Referring Textual Phrases | Github
    • Zhihao Yuan, Xu Yan, Zhuo Li, Xuhao Li, Yao Guo, Shuguang Cui, Zhen Li
    • CUHK-Shenzhen, Shanghai Jiao Tong University
    • [Arxiv2023] https://arxiv.org/abs/2207.01821
    • A dataset, One-stage approach, unified detection-interaction
  • A Unified Framework for 3D Point Cloud Visual Grounding | Github
    • Haojia Lin, Yongdong Luo, Xiawu Zheng, Lijiang Li, Fei Chao, Taisong Jin, Donghao Luo, Yan Wang, Liujuan Cao, Rongrong Ji
    • Xiamen University, Peng Cheng Laboratory
    • [Arxiv2023] https://arxiv.org/abs/2308.11887
    • One-stage approach, unified detection-interaction, superpoint
  • Uni3DL: Unified Model for 3D and Language Understanding
    • Xiang Li, Jian Ding, Zhaoyang Chen, Mohamed Elhoseiny
    • King Abdullah University of Science and Technology, Ecole Polytechnique
    • [Arxiv2023] https://arxiv.org/abs/2312.03026
    • One-stage approach, regression-based, multi-task
  • 3D-STMN: Dependency-Driven Superpoint-Text Matching Network for End-to-End 3D Referring Expression Segmentation | Github
    • Changli Wu, Yiwei Ma, Qi Chen, Haowei Wang, Gen Luo, Jiayi Ji, Xiaoshuai Sun
    • Xiamen University
    • [AAAI2024] https://arxiv.org/abs/2308.16632
    • One-stage approach, unified detection-interaction, superpoint
  • Vision-Language Pre-training with Object Contrastive Learning for 3D Scene Understanding
    • Taolin Zhang, Sunan He, Tao Dai, Zhi Wang, Bin Chen, Shu-Tao Xia
    • Tsinghua University, Hong Kong University of Science and Technology, Shenzhen University, Harbin Institute of Technology (Shenzhen), Peng Cheng Laboratory
    • [AAAI2024] https://arxiv.org/abs/2305.10714
    • One-stage approach, regression-based, pre-training
  • Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding | Github
    • Zhihao Yuan, Jinke Ren, Chun-Mei Feng, Hengshuang Zhao, Shuguang Cui, Zhen Li
    • The Chinese University of Hong Kong (Shenzhen), A*STAR, The University of Hong Kong
    • [CVPR2024] https://arxiv.org/abs/2311.15383
    • One-stage approach, zero-shot, data construction
  • G3-LQ: Marrying Hyperbolic Alignment with Explicit Semantic-Geometric Modeling for 3D Visual Grounding
  • PointCloud-Text Matching: Benchmark Datasets and a Baseline
    • Yanglin Feng, Yang Qin, Dezhong Peng, Hongyuan Zhu, Xi Peng, Peng Hu
    • Sichuan University, A*STAR
    • [Arxiv2024] https://arxiv.org/abs/2403.19386
    • A dataset, One-stage approach, regression-based, pre-training
  • PD-TPE: Parallel Decoder with Text-guided Position Encoding for 3D Visual Grounding
    • Chenshu Hou, Liang Peng, Xiaopei Wu, Wenxiao Wang, Xiaofei He
    • Zhejiang University, FABU Inc.
    • [Arxiv2024] https://arxiv.org/abs/2407.14491
    • A dataset, One-stage approach
  • Grounding 3D Scene Affordance From Egocentric Interactions
    • Cuiyu Liu, Wei Zhai, Yuhang Yang, Hongchen Luo, Sen Liang, Yang Cao, Zheng-Jun Zha
    • University of Science and Technology of China, Northeastern University
    • [Arxiv2024] https://arxiv.org/abs/2409.19650
    • A dataset, One-stage approach, video
  • Multi-branch Collaborative Learning Network for 3D Visual Grounding | Github
    • Zhipeng Qian, Yiwei Ma, Zhekai Lin, Jiayi Ji, Xiawu Zheng, Xiaoshuai Sun, Rongrong Ji
    • Xiamen University
    • [ECCV2024] https://arxiv.org/abs/2407.05363
    • One-stage approach, regression-based
  • Joint Top-Down and Bottom-Up Frameworks for 3D Visual Grounding

Weakly-supervised

Semi-supervised

  • Cross-Task Knowledge Transfer for Semi-supervised Joint 3D Grounding and Captioning
  • Bayesian Self-Training for Semi-Supervised 3D Segmentation
    • Ozan Unal, Christos Sakaridis, Luc Van Gool
    • ETH Zurich, Huawei Technologies, KU Leuven, INSAIT
    • [ECCV2024] https://arxiv.org/abs/2409.08102
    • semi-supervised, self-training

Other-Modality

  • Refer-it-in-RGBD: A Bottom-up Approach for 3D Visual Grounding in RGBD Images | Github
    • Haolin Liu, Anran Lin, Xiaoguang Han, Lei Yang, Yizhou Yu, Shuguang Cui
    • CUHK-Shenzhen, Deepwise AI Lab, The University of Hong Kong
    • [CVPR2021] https://arxiv.org/pdf/2103.07894
    • No point cloud input, RGB-D image
  • PATRON: Perspective-Aware Multitask Model for Referring Expression Grounding Using Embodied Multimodal Cues
  • Mono3DVG: 3D Visual Grounding in Monocular Images | Github
    • Yang Zhan, Yuan Yuan, Zhitong Xiong
    • Northwestern Polytechnical University, Technical University of Munich
    • [AAAI2024] https://arxiv.org/pdf/2312.08022
    • No point cloud input, monocular image
  • EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI | Github
    • Tai Wang, Xiaohan Mao, Chenming Zhu, Runsen Xu, Ruiyuan Lyu, Peisen Li, Xiao Chen, Wenwei Zhang, Kai Chen, Tianfan Xue, Xihui Liu, Cewu Lu, Dahua Lin, Jiangmiao Pang
    • Shanghai AI Laboratory, Shanghai Jiao Tong University, The University of Hong Kong, The Chinese University of Hong Kong, Tsinghua University
    • [CVPR2024] https://arxiv.org/abs/2312.16170
    • A dataset, No point cloud input, RGB-D image
  • WildRefer: 3D Object Localization in Large-scale Dynamic Scenes with Multi-modal Visual Data and Natural Language
    • Zhenxiang Lin, Xidong Peng, Peishan Cong, Yuenan Hou, Xinge Zhu, Sibei Yang, Yuexin Ma
    • ShanghaiTech University, Shanghai AI Laboratory, The Chinese University of Hong Kong
    • [Arxiv2023] https://arxiv.org/abs/2304.05645
    • No point cloud input, wild point cloud, additional multi-modal input
  • HiFi-CS: Towards Open Vocabulary Visual Grounding For Robotic Grasping Using Vision-Language Models
    • Vineet Bhat, Prashanth Krishnamurthy, Ramesh Karri, Farshad Khorrami
    • New York University
    • [Arxiv2024] https://arxiv.org/abs/2409.10419
    • No point cloud input, RGB image

LLMs-based

  • ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding with GPT and Prototype Guidance | Github
    • Zoey Guo, Yiwen Tang, Ray Zhang, Dong Wang, Zhigang Wang, Bin Zhao, Xuelong Li
    • Shanghai Artificial Intelligence Laboratory, The Chinese University of Hong Kong, Northwestern Polytechnical University
    • [ICCV2023] https://arxiv.org/pdf/2303.16894
    • LLMs-based, enriching text description
  • LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | Github
    • Zhenfei Yin, Jiong Wang, Jianjian Cao, Zhelun Shi, Dingning Liu, Mukai Li, Lu Sheng, Lei Bai, Xiaoshui Huang, Zhiyong Wang, Jing Shao, Wanli Ouyang
    • Shanghai AI Lab, Beihang University, The Chinese University of Hong Kong (Shenzhen), Fudan University, Dalian University of Technology, The University of Sydney
    • [NeurIPS2023] https://arxiv.org/abs/2306.06687
    • LLMs-based, LLM architecture
  • Transcribe3D: Grounding LLMs Using Transcribed Information for 3D Referential Reasoning with Self-Corrected Finetuning
    • Jiading Fang, Xiangshan Tan, Shengjie Lin, Hongyuan Mei, Matthew R. Walter
    • Toyota Technological Institute at Chicago
    • [CoRL2023] https://openreview.net/forum?id=7j3sdUZMTF
    • LLMs-based, enriching text description
  • LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent
    • Jianing Yang, Xuweiyi Chen, Shengyi Qian, Nikhil Madaan, Madhavan Iyengar, David F. Fouhey, Joyce Chai
    • University of Michigan, New York University
    • [Arxiv2023] https://arxiv.org/abs/2309.12311
    • LLMs-based, enriching text description
  • Mono3DVG: 3D Visual Grounding in Monocular Images | Github
    • Yang Zhan, Yuan Yuan, Zhitong Xiong
    • Northwestern Polytechnical University, Technical University of Munich
    • [AAAI2024] https://arxiv.org/pdf/2312.08022
    • LLMs-based, enriching text description
  • CoT3DRef: Chain-of-Thoughts Data-Efficient 3D Visual Grounding | Github
    • Eslam Mohamed Bakr, Mohamed Ayman, Mahmoud Ahmed, Habib Slim, Mohamed Elhoseiny
    • King Abdullah University of Science and Technology
    • [ICLR2024] https://arxiv.org/abs/2310.06214
    • LLMs-based, Chain-of-Thoughts, reasoning
  • Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding | Github
    • Zhihao Yuan, Jinke Ren, Chun-Mei Feng, Hengshuang Zhao, Shuguang Cui, Zhen Li
    • The Chinese University of Hong Kong (Shenzhen), A*STAR, The University of Hong Kong
    • [CVPR2024] https://arxiv.org/abs/2311.15383
    • LLMs-based, construct text description
  • Naturally Supervised 3D Visual Grounding with Language-Regularized Concept Learners | Github
  • 3DMIT: 3D Multi-modal Instruction Tuning for Scene Understanding | Github
    • Zeju Li, Chao Zhang, Xiaoyan Wang, Ruilong Ren, Yifan Xu, Ruifei Ma, Xiangde Liu
    • Beijing University of Posts and Telecommunications, Beijing Digital Native Digital City Research Center, Peking University, Beihang University, Beijing University of Science and Technology
    • [Arxiv2024] https://arxiv.org/abs/2401.03201
    • LLMs-based, LLM architecture
  • DOrA: 3D Visual Grounding with Order-Aware Referring
    • Tung-Yu Wu, Sheng-Yu Huang, Yu-Chiang Frank Wang
    • National Taiwan University, NVIDIA
    • [Arxiv2024] https://arxiv.org/abs/2403.16539
    • LLMs-based, Chain-of-Thoughts
  • SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding | Github
    • Baoxiong Jia, Yixin Chen, Huanyue Yu, Yan Wang, Xuesong Niu, Tengyu Liu, Qing Li, Siyuan Huang
    • Beijing Institute for General Artificial Intelligence
    • [Arxiv2024] https://arxiv.org/abs/2401.09340
    • A dataset, LLMs-based, LLM architecture
  • Language-Image Models with 3D Understanding | Github
    • Jang Hyun Cho, Boris Ivanovic, Yulong Cao, Edward Schmerling, Yue Wang, Xinshuo Weng, Boyi Li, Yurong You, Philipp Krähenbühl, Yan Wang, Marco Pavone
    • UT Austin, NVIDIA Research
    • [Arxiv2024] https://arxiv.org/abs/2405.03685
    • A dataset, LLMs-based
  • Task-oriented Sequential Grounding in 3D Scenes | Github
    • Zhuofan Zhang, Ziyu Zhu, Pengxiang Li, Tengyu Liu, Xiaojian Ma, Yixin Chen, Baoxiong Jia, Siyuan Huang, Qing Li
    • BIGAI, Tsinghua University, Beijing Institute of Technology
    • [Arxiv2024] https://arxiv.org/abs/2408.04034
    • A dataset, LLMs-based
  • Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding | Github
    • Yunze Man, Shuhong Zheng, Zhipeng Bao, Martial Hebert, Liang-Yan Gui, Yu-Xiong Wang
    • University of Illinois Urbana-Champaign, Carnegie Mellon University
    • [Arxiv2024] https://arxiv.org/abs/2409.03757
    • Foundation model
  • Robin3D: Improving 3D Large Language Model via Robust Instruction Tuning
    • Weitai Kang, Haifeng Huang, Yuzhang Shang, Mubarak Shah, Yan Yan
    • Illinois Institute of Technology, Zhejiang University, University of Central Florida, University of Illinois at Chicago
    • [Arxiv2024] https://arxiv.org/abs/2410.00255
    • LLMs-based
  • Solving Zero-Shot 3D Visual Grounding as Constraint Satisfaction Problems | Github
    • Qihao Yuan, Jiaming Zhang, Kailai Li, Rainer Stiefelhagen
    • Karlsruhe Institute of Technology, University of Groningen
    • [Arxiv2024] https://arxiv.org/abs/2411.14594
    • LLMs-based, zero-shot
  • SeeGround: See and Ground for Zero-Shot Open-Vocabulary 3D Visual Grounding | Github
    • Rong Li, Shijie Li, Lingdong Kong, Xulei Yang, Junwei Liang
    • HKUST, A*STAR, National University of Singapore
    • [Arxiv2024] https://arxiv.org/abs/2412.04383
    • LLMs-based, zero-shot
  • Empowering 3D Visual Grounding with Reasoning Capabilities | Github
    • Chenming Zhu, Tai Wang, Wenwei Zhang, Kai Chen, Xihui Liu
    • The University of Hong Kong, Shanghai AI Laboratory
    • [ECCV2024] https://arxiv.org/abs/2407.01525
    • LLMs-based, LLM architecture, A dataset
  • VLM-Grounder: A VLM Agent for Zero-Shot 3D Visual Grounding | Github
    • Runsen Xu, Zhiwei Huang, Tai Wang, Yilun Chen, Jiangmiao Pang, Dahua Lin
    • The Chinese University of Hong Kong, Zhejiang University, Shanghai AI Laboratory, Centre for Perceptual and Interactive Intelligence
    • [CoRL2024] https://arxiv.org/abs/2410.13860
    • LLMs-based, zero-shot
  • ViGiL3D: A Linguistically Diverse Dataset for 3D Visual Grounding
    • Austin T. Wang, ZeMing Gong, Angel X. Chang
    • Simon Fraser University, Alberta Machine Intelligence Institute
    • [Arxiv2025] https://arxiv.org/abs/2501.01366
    • LLMs-based, new dataset

Outdoor-Scenes

  • Language Prompt for Autonomous Driving | Github
    • Dongming Wu, Wencheng Han, Tiancai Wang, Yingfei Liu, Xiangyu Zhang, Jianbing Shen
    • Beijing Institute of Technology, University of Macau, MEGVII Technology, Beijing Academy of Artificial Intelligence
    • [Arxiv2023] https://arxiv.org/abs/2309.04379
    • Outdoor scene, autonomous driving
  • Talk2Radar: Bridging Natural Language with 4D mmWave Radar for 3D Referring Expression Comprehension | Github
    • Runwei Guan, Ruixiao Zhang, Ningwei Ouyang, Jianan Liu, Ka Lok Man, Xiaohao Cai, Ming Xu, Jeremy Smith, Eng Gee Lim, Yutao Yue, Hui Xiong
    • JITRI, University of Liverpool, University of Southampton, Vitalent Consulting, Xi’an Jiaotong-Liverpool University, HKUST (GZ)
    • [Arxiv2024] https://arxiv.org/abs/2405.12821
    • Outdoor scene, autonomous driving
  • Talk to Parallel LiDARs: A Human-LiDAR Interaction Method Based on 3D Visual Grounding
    • Yuhang Liu, Boyi Sun, Guixu Zheng, Yishuo Wang, Jing Wang, Fei-Yue Wang
    • Chinese Academy of Sciences, South China Agricultural University, Beijing Institute of Technology
    • [Arxiv2024] https://arxiv.org/abs/2405.15274
    • Outdoor scene, autonomous driving
  • LidaRefer: Outdoor 3D Visual Grounding for Autonomous Driving with Transformers