---
layout: diy
title: Publications
---

<head>
<style>
a { text-decoration : none; }
a:hover { text-decoration : underline; }
a, a:visited { color : #6e6f71; }
p { font-size : 16px; }
h3 { font-size : 18px; margin : 8px; padding : 0; }
h4 { font-size : 16px; margin : 6px; padding : 0; }
.container { width : 1000px;}
.publogo { width : 200px; margin-right : 20px; float : left; border : 10px;}
.publication { clear : left; padding-bottom : 0px; }
.publication p { height : 180px; padding-top : 0px;}
.publication strong { font-size : 17px; color : #990036; }
.publication strong a { font-size : 17px; color : #990036; }
</style>
</head>

<div class="container">


<h3>2024</h3>

<div class="publication">
  <img src="../static/pubs/Mov24.png" class="publogo" width="200" height="150">
  <p> 
    <strong>
      MovieChat: From Dense Token to Sparse Memory in Long Video Understanding
    </strong>
    <br>
     Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, Yan Lu, Jenq-Neng Hwang, Gaoang Wang
    <br>
    <font color="#E89B00">
    <em>Computer Vision and Pattern Recognition (CVPR), 2024</em>
    </font>
    <br>
    <a href="https://arxiv.org/abs/2307.16449">[Paper]</a>
    <a href="https://github.com/rese1f/MovieChat">[Code]</a>
    <a href="https://huggingface.co/datasets/Enxin/MovieChat-1K_train">[Dataset]</a>
    <a href="https://rese1f.github.io/MovieChat/">[Website]</a>
    <img alt="" src="https://img.shields.io/github/stars/rese1f/MovieChat?style=social">
    <br>
    <font color="grey" size="2">
      MovieChat achieves state-of-the-art performance in long video understanding by introducing a memory mechanism.
    </font>
  </p>
</div>

<div class="publication">
    <img src="../static/pubs/MedM2G24.png" class="publogo" width="200" height="150">
    <p> 
      <strong>
        MedM2G: Unifying Medical Multi-Modal Generation via Cross-Guided Diffusion with Visual Invariant
      </strong>
      <br>
      Chenlu Zhan, Yu Lin, Gaoang Wang, Hongwei Wang, Jian Wu
      <br>
      <font color="#E89B00">
      <em>Computer Vision and Pattern Recognition (CVPR), 2024</em>
      </font>
      <br>
      <a href="https://arxiv.org/pdf/2403.04290.pdf">[Paper]</a>
      <br>
      <font color="grey" size="2">
        MedM2G unifies medical image-to-text and text-to-image conversion, as well as the generation of various medical modalities such as CT, MRI, and X-ray.</font>
    </p>
</div>

<div class="publication">
  <img src="../static/pubs/Uni24.png" class="publogo" width="200" height="150">
  <p> 
    <strong>
      UniAP: Towards Universal Animal Perception in Vision via Few-Shot Learning
    </strong>
    <br>
    Meiqi Sun, Zhonghan Zhao, Wenhao Chai, Hanjun Luo, Shidong Cao, Yanting Zhang, Jenq-Neng Hwang, Gaoang Wang
    <br>
    <font color="#E89B00">
    <em>AAAI Conference on Artificial Intelligence (AAAI), 2024</em>
    </font>
    <br>
    <a href="https://arxiv.org/abs/2308.09953">[Paper]</a>
    <a href="https://github.com/rese1f/UniAP">[Code]</a>
    <a href="https://rese1f.github.io/UniAP/">[Website]</a>
    <img alt="" src="https://img.shields.io/github/stars/rese1f/UniAP?style=social">
    <br>
    <font color="grey" size="2">
      UniAP is a novel Universal Animal Perception model that leverages few-shot learning to enable cross-species perception across various visual tasks.
    </font>
  </p>
</div>

<div class="publication">
  <img src="../static/pubs/EBAAAI24.png" class="publogo" width="200" height="150">
  <p> 
    <strong>
      Multi-Step Denoising Scheduled Sampling: Towards Alleviating Exposure Bias for Diffusion Models
    </strong>
    <br>
    Zhiyao Ren, Yibing Zhan, Liang Ding, Gaoang Wang, Chaoyue Wang, Zhongyi Fan, Dacheng Tao
    <br>
    <font color="#E89B00">
    <em>AAAI Conference on Artificial Intelligence (AAAI), 2024</em>
    </font>
    <br>
    <a href="https://ojs.aaai.org/index.php/AAAI/article/download/28267/28525">[Paper]</a>
    <br>
    <font color="grey" size="2">
    We propose a multi-step denoising scheduled sampling (MDSS) strategy to alleviate the exposure bias for DDPMs.
    </font>
  </p>
</div>

<h3>2023</h3>
  

<div class="publication">
  <img src="../static/pubs/DIV23.png" class="publogo" width="200" height="160">
  <p> 
    <strong>
      DIVOTrack: A Novel Dataset and Baseline Method for Cross-View Multi-Object Tracking in DIVerse Open Scenes
    </strong>
    <br>
    Shengyu Hao, Peiyuan Liu, Yibing Zhan, Kaixun Jin, Zuozhu Liu, Mingli Song, Jenq-Neng Hwang, Gaoang Wang
    <br>
    <font color="#E89B00">
    <em>International Journal of Computer Vision (IJCV), 2023</em>
    </font>
    <br>
    <a href="https://arxiv.org/abs/2302.07676">[Paper]</a>
    <a href="https://huggingface.co/datasets/syhao777/DIVOTrack">[Dataset]</a>
    <a href="https://github.com/shengyuhao/DIVOTrack">[Code]</a>
    <img alt="" src="https://img.shields.io/github/stars/shengyuhao/DIVOTrack?style=social">
    <br>
    <font color="grey" size="2">
      A new cross-view multi-object tracking dataset for DIVerse Open scenes with densely tracked pedestrians.
    </font>
  </p>
</div>


<div class="publication">
    <img src="../static/pubs/TIP24.png" class="publogo" width="200" height="160">
    <p> 
      <strong>
        Self-paced Multi-grained Cross-modal Interaction Modeling for Referring Expression Comprehension
      </strong>
      <br>
      Peihan Miao, Wei Su, Gaoang Wang, Xuewei Li, Xi Li
      <br>
      <font color="#E89B00">
      <em>IEEE Transactions on Image Processing (TIP), 2023</em>
      </font>
      <br>
      <a href="https://ieeexplore.ieee.org/abstract/document/10345481">[Paper]</a>
      <br>
      <font color="grey" size="2">
        We propose a Self-paced Multi-grained Cross-modal Interaction Modeling framework, which improves the language-to-vision localization ability through innovations in network structure and learning mechanism.</font>
    </p>
</div>


<div class="publication">
  <img src="../static/pubs/RFD23.png" class="publogo" width="200" height="150">
  <p> 
    <strong>
      DiffFashion: Reference-based Fashion Design with Structure-aware Transfer by Diffusion Models
    </strong>
    <br>
    Shidong Cao, Wenhao Chai, Shengyu Hao, Yanting Zhang, Hangyue Chen, Gaoang Wang
    <br>
    <font color="#E89B00">
    <em>IEEE Transactions on Multimedia (TMM), 2023</em>
    </font>
    <br>
    <a href="https://arxiv.org/abs/2302.06826">[Paper]</a>
    <a href="https://github.com/Rem105-210/DiffFashion">[Code]</a>
    <img alt="" src="https://img.shields.io/github/stars/Rem105-210/DiffFashion?style=social">
    <br>
    <font color="grey" size="2">
      We focus on a new fashion design task, where we aim to transfer a reference appearance image onto a clothing image while preserving the structure of the clothing image. 
    </font>
  </p>
</div>


<div class="publication">
  <img src="../static/pubs/STC23.png" class="publogo" width="200" height="150">
  <p> 
    <strong>
    StableVideo: Text-driven Consistency-aware Diffusion Video Editing
    </strong>
    <br>
    Wenhao Chai, Xun Guo, Gaoang Wang, Yan Lu
    <br>
    <font color="#E89B00">
    <em>International Conference on Computer Vision (ICCV), 2023</em>
    </font>
    <br>
    <a href="https://rese1f.github.io/StableVideo/">[Website]</a>
    <a href="https://arxiv.org/abs/2308.09592">[Paper]</a>
    <a href="https://huggingface.co/spaces/Reself/StableVideo">[Demo]</a>
    <a href="https://github.com/rese1f/StableVideo">[Code]</a>
    <img alt="" src="https://img.shields.io/github/stars/rese1f/StableVideo?style=social">
    <br>
    <font color="grey" size="2">
    We introduce temporal dependency into existing text-driven diffusion models, which allows them to generate a consistent appearance for new objects.
    </font>
  </p>
</div>


<div class="publication">
  <img src="../static/pubs/GAM23.png" class="publogo" width="200" height="160">
  <p> 
    <strong>
    Global Adaptation meets Local Generalization: Unsupervised Domain Adaptation for 3D Human Pose Estimation
    </strong>
    <br>
    Wenhao Chai, Zhongyu Jiang, Jenq-Neng Hwang, Gaoang Wang
    <br>
    <font color="#E89B00">
    <em>International Conference on Computer Vision (ICCV), 2023</em>
    </font>
    <br>
    <a href="https://arxiv.org/abs/2303.16456">[Paper]</a>
    <a href="https://github.com/rese1f/PoseDA">[Code]</a>
    <img alt="" src="https://img.shields.io/github/stars/rese1f/PoseDA?style=social">
    <br>
    <font color="grey" size="2">
    A simple yet effective framework of unsupervised domain adaptation for 3D human pose estimation.
    </font>
  </p>
</div>


<div class="publication">
    <img src="../static/pubs/DOD23.png" class="publogo" width="200" height="160">
    <p> 
      <strong>
        Bridging Cross-task Protocol Inconsistency for Distillation in Dense Object Detection
      </strong>
      <br>
      Longrong Yang, Xianpan Zhou, Xuewei Li, Liang Qiao, Zheyang Li, Ziwei Yang, Gaoang Wang, Xi Li
      <br>
      <font color="#E89B00">
      <em>International Conference on Computer Vision (ICCV), 2023</em>
      </font>
      <br>
      <a href="https://openaccess.thecvf.com/content/ICCV2023/papers/Yang_Bridging_Cross-task_Protocol_Inconsistency_for_Distillation_in_Dense_Object_Detection_ICCV_2023_paper.pdf">[Paper]</a>
      <br>
      <font color="grey" size="2">
        A novel distillation method with cross-task consistent protocols, tailored for dense object detection.</font>
    </p>
  </div>


<div class="publication">
  <img src="../static/pubs/PMP23.png" class="publogo" width="200" height="150">
  <p> 
    <strong>
    PoSynDA: Multi-Hypothesis Pose Synthesis Domain Adaptation for Enhanced 3D Human Pose Estimation
    </strong>
    <br>
    Hanbing Liu, Jun-Yan He, Zhi-Qi Cheng, Wangmeng Xiang, Qize Yang, Wenhao Chai, Gaoang Wang, Xu Bao, Bin Luo, Yifeng Geng, Xuansong Xie 
    <br>
    <font color="#E89B00">
    <em>ACM Multimedia (ACM MM), 2023</em>
    </font>
    <br>
    <a href="https://arxiv.org/abs/2308.09678">[Paper]</a>
    <a href="https://github.com/hbing-l/PoSynDA">[Code]</a>
    <img alt="" src="https://img.shields.io/github/stars/hbing-l/PoSynDA?style=social">
    <br>
    <font color="grey" size="2">
    PoSynDA offers a state-of-the-art domain adaptation solution for 3D pose estimation.
    </font>
  </p>
</div>


<div class="publication">
    <img src="../static/pubs/SGA23.png" class="publogo" width="200" height="160">
    <p> 
      <strong>
        SGAT4PASS: Spherical Geometry-Aware Transformer for PAnoramic Semantic Segmentation
      </strong>
      <br>
      Xuewei Li, Tao Wu, Zhongang Qi, Gaoang Wang, Ying Shan, Xi Li
      <br>
      <font color="#E89B00">
      <em>International Joint Conference on Artificial Intelligence (IJCAI), 2023</em>
      </font>
      <br>
      <a href="https://arxiv.org/abs/2306.03403">[Paper]</a>
      <br>
      <font color="grey" size="2">
        As an important and challenging problem in computer vision, PAnoramic Semantic Segmentation gives complete scene perception based on an ultra-wide angle of view.</font>
    </p>
</div>


<div class="publication">
    <img src="../static/pubs/LAW23.png" class="publogo" width="200" height="160">
    <p> 
      <strong>
        Language Adaptive Weight Generation for Multi-task Visual Grounding
      </strong>
      <br>
      Wei Su, Peihan Miao, Huanzhang Dou, Gaoang Wang, Liang Qiao, Zheyang Li, Xi Li
      <br>
      <font color="#E89B00">
      <em>IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023</em>
      </font>
      <br>
      <a href="https://arxiv.org/abs/2303.00313">[Paper]</a>
      <br>
      <font color="grey" size="2">
        Despite impressive performance in visual grounding, prevailing approaches usually exploit the visual backbone in a passive way.</font>
</p>
</div>


<div class="publication">
    <img src="../static/pubs/TCF23.png" class="publogo" width="200" height="160">
    <p> 
      <strong>
        Temporal Constrained Feasible Subspace Learning for Human Pose Forecasting
      </strong>
      <br>
      Gaoang Wang, Mingli Song
      <br>
      <font color="#E89B00">
      <em>International Joint Conference on Artificial Intelligence (IJCAI), 2023</em>
      </font>
      <br>
      <a href="https://www.ijcai.org/proceedings/2023/0161.pdf">[Paper]</a>
      <br>
      <font color="grey" size="2">
        Human pose forecasting is a sequential modeling task that aims to predict future poses from historical motions.</font>
</p>
</div>


<h3>2022</h3>

<div class="publication">
    <img src="../static/pubs/HSS22.png" class="publogo" width="200" height="160">
    <p> 
      <strong>
        Hierarchical Semi-Supervised Contrastive Learning for Contamination-Resistant Anomaly Detection
      </strong>
      <br>
      Gaoang Wang, Yibing Zhan, Xinchao Wang, Mingli Song, Klara Nahrstedt
      <br>
      <font color="#E89B00">
      <em>European Conference on Computer Vision (ECCV), 2022</em>
      </font>
      <br>
      <a href="https://www.ecva.net/papers/eccv_2022/papers_ECCV/papers/136850107.pdf">[Paper]</a>
      <br>
      <font color="grey" size="2">
        Contrastive learning has provided a successful way to sample representation that enables effective discrimination on anomalies.</font>
    </p>
  </div>


<div class="publication">
    <img src="../static/pubs/SCA22.png" class="publogo" width="200" height="160">
    <p> 
      <strong>
        Split and Connect: A Universal Tracklet Booster for Multi-Object Tracking
      </strong>
      <br>
      Gaoang Wang, Yizhou Wang, Renshu Gu, Weijie Hu, Jenq-Neng Hwang
      <br>
      <font color="#E89B00">
      <em>IEEE Transactions on Multimedia (TMM), 2022</em>
      </font>
      <br>
      <a href="https://arxiv.org/abs/2105.02426">[Paper]</a>
      <br>
      <font color="grey" size="2">
        We propose a novel tracklet boosting model, consisting of a Splitter and a Connector, to directly address the temporal association errors that exist in almost all trackers in the MOT field.</font>
    </p>
  </div>


<h3>2021</h3>

<div class="publication">
    <img src="../static/pubs/TAL21.png" class="publogo" width="200" height="160">
    <p> 
      <strong>
        Track without Appearance: Learn Box and Tracklet Embedding with Local and Global Motion Patterns for Vehicle Tracking
      </strong>
      <br>
      Gaoang Wang, Renshu Gu, Zuozhu Liu, Weijie Hu, Mingli Song, Jenq-Neng Hwang
      <br>
      <font color="#E89B00">
      <em>IEEE International Conference on Computer Vision (ICCV), 2021</em>
      </font>
      <br>
      <a href="https://arxiv.org/abs/2108.06029">[Paper]</a>
      <a href="https://github.com/GaoangW/LGMTracker">[Code]</a>
      <img alt="" src="https://img.shields.io/github/stars/GaoangW/LGMTracker?style=social">
      <br>
      <font color="grey" size="2">
        We try to explore the significance of motion patterns for vehicle tracking without appearance information.
      </font>
    </p>
</div>


</div>