- [2016] Shuffle and Learn: Unsupervised Learning Using Temporal Order Verification
- [2017] Unsupervised Representation Learning by Sorting Sequences
- [2017] Self-Supervised Video Representation Learning With Odd-One-Out Networks
- [2018] Geometry Guided Convolutional Neural Networks for Self-Supervised Video Representation Learning
- [2019] Self-Supervised Spatiotemporal Learning via Video Clip Order Prediction
- [2019] Self-Supervised Spatio-Temporal Representation Learning for Videos by Predicting Motion and Appearance Statistics
- [2019] Self-Supervised Spatiotemporal Feature Learning via Video Rotation Prediction
- [2019] Video jigsaw: Unsupervised learning of spatiotemporal context for video action recognition
- [2019] Self-supervised video representation learning with space-time cubic puzzles
- [2020] Video representation learning by recognizing temporal transformations
- [2020] Memory-Augmented Dense Predictive Coding for Video Representation Learning
- [2020] Video Playback Rate Perception for Self-Supervised Spatio-Temporal Representation Learning
- [2020] SpeedNet: Learning the Speediness in Videos
- [2020] Self-supervised Video Representation Learning by Pace Prediction
- [2020] Self-supervised Video Representation Using Pretext-Contrastive Learning
- [2020] Labelling unlabelled videos from scratch with multi-modal self-supervision
- [2020] Self-supervised co-training for video representation learning
- [2020] Multi-modal Self-Supervision from Generalized Data Transformations
- [2021] Self-Supervised Video Representation Learning with Constrained Spatiotemporal Jigsaw
- [2021] Enhancing unsupervised video representation learning by decoupling the scene and the motion
- [2016] Deep multi-scale video prediction beyond mean square error
- [2016] Generating videos with scene dynamics
- [2017] Dual Motion GAN for Future-Flow Embedded Video Prediction
- [2019] Learning Video Representations using Contrastive Bidirectional Transformer
- [2019] VideoBERT: A joint model for video and language representation learning
- [2020] Self-supervised motion representation via scattering local motion cues
- [2020] UniVL: A unified video and language pre-training model for multimodal understanding and generation
- [2020] HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training
- [2021] Less Is More: ClipBERT for Video-and-Language Learning via Sparse Sampling
- [2021] A Large-Scale Study on Unsupervised Spatiotemporal Representation Learning
- [2021] VideoMoCo: Contrastive Video Representation Learning with Temporally Adversarial Examples
- [2021] VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding
- [2022] BEVT: BERT Pretraining of Video Transformers
- [2022] Masked Autoencoders As Spatiotemporal Learners
- [2022] Masked Feature Prediction for Self-Supervised Visual Pre-Training
- [2022] VideoMAE: Masked Autoencoders Are Data-Efficient Learners for Self-Supervised Video Pre-Training
- [2022] Self-supervised Video Representation Learning with Motion-Aware Masked Autoencoders
- [2020] Unsupervised Learning From Video With Deep Neural Embeddings
- [2020] Unsupervised learning of video representations via dense trajectory clustering
- [2020] Temporally coherent embeddings for self-supervised video representation learning
- [2020] Temporal contrastive pretraining for video action recognition
- [2020] Contrastive multiview coding
- [2020] Self-supervised Co-training for Video Representation Learning
- [2020] Representation learning with video deep infomax
- [2020] Self-supervised video representation learning by maximizing mutual information
- [2020] Self-supervised Video Representation Learning Using Inter-Intra Contrastive Framework
- [2020] A simple framework for contrastive learning of visual representations
- [2020] Unsupervised learning of visual features by contrasting cluster assignments
- [2020] Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning
- [2021] CoCon: Cooperative-Contrastive Learning
- [2021] SeCo: Exploring Sequence Supervision for Unsupervised Representation Learning
- [2021] Spatiotemporal contrastive video representation learning
- [2022] TCLR: Temporal contrastive learning for video representation
- [2018] Learning a text-video embedding from incomplete and heterogeneous data
- [2018] Cooperative learning of audio and video models from self-supervised synchronization
- [2019] Learning Video Representations using Contrastive Bidirectional Transformer
- [2019] VideoBERT: A joint model for video and language representation learning
- [2019] Use What You Have: Video retrieval using representations from collaborative experts
- [2019] Fine-grained action retrieval through multiple parts-of-speech embeddings
- [2020] End-to-end learning of visual representations from uncurated instructional videos
- [2020] COOT: Cooperative hierarchical transformer for video-text representation learning
- [2020] ActBERT: Learning global-local video-text representations
- [2020] UniVL: A unified video and language pre-training model for multimodal understanding and generation
- [2020] AVLnet: Learning Audio-Visual Language Representations from Instructional Videos
- [2020] HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training
- [2020] Multi-modal transformer for video retrieval
- [2020] Support-set bottlenecks for video-text representation learning
- [2020] Self-supervised multimodal versatile networks
- [2020] Self-supervised learning by cross-modal audio-video clustering
- [2021] Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval
- [2021] Less Is More: ClipBERT for Video-and-Language Learning via Sparse Sampling
- [2021] VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding
- [2021] VideoCLIP: Contrastive pre-training for zero-shot video-text understanding
- [2021] Noise estimation using density estimation for self-supervised multimodal learning
- [2021] VATT: Transformers for multimodal self-supervised learning from raw video, audio and text
- [2021] Audio-visual instance discrimination with cross-modal agreement