A curated list of papers and benchmark results on self-supervised representation learning.
- Self-supervised Video Representation Learning
- Self-supervised Visual Representation Learning
- Self-supervised Multi-modal Representation Learning
- [STS] Self-supervised Video Representation Learning by Uncovering Spatio-temporal Statistics - Jiangliu Wang et al, TPAMI 2021
- [VideoMoCo] VideoMoCo: Contrastive Video Representation Learning with Temporally Adversarial Examples - Tian Pan et al, CVPR 2021
- [BE] Removing the Background by Adding the Background: Towards Background Robust Self-supervised Video Representation Learning - Jinpeng Wang et al, CVPR 2021
- [RSPNet] RSPNet: Relative Speed Perception for Unsupervised Video Representation Learning - Peihao Chen et al, AAAI 2021
- [DSM] Enhancing Unsupervised Video Representation Learning by Decoupling the Scene and the Motion - Jinpeng Wang et al, AAAI 2021
- [TCLR] TCLR: Temporal Contrastive Learning for Video Representation - Ishan Dave et al, Arxiv 2021
- [CCL] Cycle-Contrast for Self-Supervised Video Representation Learning - Quan Kong et al, NeurIPS 2020
- [CoCLR] Self-supervised Co-training for Video Representation Learning - Tengda Han et al, NeurIPS 2020
- [PRP] Video Playback Rate Perception for Self-Supervised Spatio-Temporal Representation Learning - Yuan Yao et al, CVPR 2020
- [VRE-MRA] Exploiting Motion Information from Unlabeled Videos for Static Image Action Recognition - Yiyi Zhang et al, AAAI 2020
- [VCP] Video Cloze Procedure for Self-Supervised Spatio-Temporal Learning - Dezhao Luo et al, AAAI 2020
- [IIC] Self-supervised Video Representation Learning Using Inter-intra Contrastive Framework - Li Tao et al, ACMMM 2020
- [RTT] Video Representation Learning by Recognizing Temporal Transformations - Simon Jenni et al, ECCV 2020
- [VPP] Self-Supervised Video Representation Learning by Pace Prediction - Jiangliu Wang et al, ECCV 2020
- [DTG-Net] DTG-Net: Differentiated Teachers Guided Self-Supervised Video Action Recognition - Ziming Liu et al, Arxiv 2020
- [VTDL] Self-supervised Temporal Discriminative Learning for Video Representation Learning - Jinpeng Wang et al, Arxiv 2020
- [CVRL] Spatiotemporal Contrastive Video Representation Learning - Rui Qian et al, Arxiv 2020
- [PCL] Self-Supervised Video Representation Using Pretext-Contrastive Learning - Li Tao et al, Arxiv 2020
- [TCE] Temporally Coherent Embeddings for Self-Supervised Video Representation Learning - Joshua Knights et al, Arxiv 2020
- [HDC] Hierarchically Decoupled Spatial-Temporal Contrast for Self-supervised Video Representation Learning - Zehua Zhang et al, Arxiv 2020
- [CEP] Back to the Future: Cycle Encoding Prediction for Self-supervised Contrastive Video Representation Learning - Xinyu Yang et al, Arxiv 2020
- [TaCo] Can Temporal Information Help with Contrastive Self-Supervised Learning? - Yutong Bai et al, Arxiv 2020
- [DPC] Video Representation Learning by Dense Predictive Coding - Tengda Han et al, ICCV Workshops 2019
- Space-Time Correspondence as a Contrastive Random Walk - Allan Jabri et al, NeurIPS 2020
- Exploiting Temporal Coherence for Self-Supervised One-shot Video Re-identification - Dripta S. Raychaudhuri et al, ECCV 2020
- Self-Supervision by Prediction for Object Discovery in Videos - Beril Besbinar et al, Arxiv 2021
- UCF101
- HMDB51
| Method | Venue | Network | Input size | Pretraining dataset | UCF101 Acc. (%) | HMDB51 Acc. (%) |
|---|---|---|---|---|---|---|
| [TCLR] | Arxiv 2021 | R(2+1)D | 112x112 | Kinetics-400 | 84.3 | 54.2 |
| | | R3D | 112x112 | Kinetics-400 | 84.1 | 53.6 |
| [DSM] | AAAI 2021 | R3D-34 | 224x224 | Kinetics-400 | 78.2 | 52.8 |
| | | I3D | 224x224 | Kinetics-400 | 74.8 | 52.5 |
| [RSPNet] | AAAI 2021 | S3D-G | 112x112 | Kinetics-400 | 93.7 | 64.7 |
| | | R(2+1)D | 112x112 | Kinetics-400 | 81.1 | 44.6 |
| | | C3D | 112x112 | Kinetics-400 | 76.7 | 44.6 |
| | | R3D | 112x112 | Kinetics-400 | 74.3 | 41.8 |
| [BE] | CVPR 2021 | R3D | 224x224 | Kinetics-400 | 87.1 | 56.2 |
| | | I3D | 224x224 | Kinetics-400 | 86.8 | 55.4 |
| [VideoMoCo] | CVPR 2021 | R(2+1)D | 112x112 | Kinetics-400 | 78.7 | 49.2 |
| | | R3D | 112x112 | Kinetics-400 | 74.1 | 43.6 |
| [STS] | TPAMI 2021 | S3D-G | 224x224 | Kinetics-400 | 89.0 | 62.0 |
| [CEP] | Arxiv 2020 | R(2+1)D | 224x224 | Kinetics-400 | 76.3 | 36.8 |
| | | SlowFast | 128x128 | Kinetics-400 | 68.5 | 36.8 |
| [HDC] | Arxiv 2020 | R(2+1)D | 112x112 | Kinetics-400 | 76.2 | 39.8 |
| | | C3D | 112x112 | Kinetics-400 | 72.3 | 39.3 |
| | | R3D | 112x112 | Kinetics-400 | 68.5 | 38.1 |
| [TCE] | Arxiv 2020 | R2D-50 | 224x224 | Kinetics-400 | 71.2 | 36.6 |
| | | R2D-18 | 224x224 | Kinetics-400 | 68.8 | 34.2 |
| [PCL] | Arxiv 2020 | R3D | 112x112 | Kinetics-400 | 82.3 | 43.2 |
| | | R(2+1)D | 112x112 | Kinetics-400 | 80.7 | 44.6 |
| | | C3D | 112x112 | Kinetics-400 | 79.7 | 42.3 |
| | | R3D | 112x112 | Kinetics-400 | 79.5 | 41.7 |
| [CVRL] | Arxiv 2020 | R3D-101 | 224x224 | Kinetics-600 | 93.6 | 69.4 |
| | | R3D-101 | 224x224 | Kinetics-400 | 92.9 | 66.7 |
| [VTDL] | Arxiv 2020 | R(2+1)D | 224x224 | Kinetics-400 | 84.9 | 52.5 |
| | | I3D | 224x224 | Kinetics-400 | 82.1 | 52.9 |
| | | R3D | 224x224 | Kinetics-400 | 78.4 | 49.1 |
| [DTG-Net] | Arxiv 2020 | TSN-ResNet18 | - | Kinetics-400 | 69.1 | - |
| [VPP] | ECCV 2020 | R(2+1)D | 224x224 | Kinetics-400 | 77.1 | 36.6 |
| [RTT] | ECCV 2020 | R3D | 112x112 | Kinetics-400 | 79.3 | 49.8 |
| | | C3D | 112x112 | Kinetics-400 | 69.9 | 39.6 |
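
The table mixes backbones and input resolutions, so accuracies are easiest to compare between rows that share the same network and input size. The sketch below is one possible way to do that in Python; the `Result` type, the `comparable` helper, and the choice of rows are illustrative assumptions and not part of any listed paper's code. The numbers are copied from the table above.

```python
from typing import List, NamedTuple


class Result(NamedTuple):
    """One benchmark row (hypothetical container, for illustration only)."""
    method: str
    network: str
    input_size: str
    ucf101: float
    hmdb51: float


# A handful of rows copied from the table above (not the full table).
RESULTS = [
    Result("TCLR", "R(2+1)D", "112x112", 84.3, 54.2),
    Result("RSPNet", "S3D-G", "112x112", 93.7, 64.7),
    Result("RSPNet", "R(2+1)D", "112x112", 81.1, 44.6),
    Result("HDC", "R(2+1)D", "112x112", 76.2, 39.8),
    Result("PCL", "R(2+1)D", "112x112", 80.7, 44.6),
    Result("RTT", "R3D", "112x112", 79.3, 49.8),
]


def comparable(results: List[Result], network: str, input_size: str) -> List[Result]:
    """Keep rows with the given backbone and input size, best UCF101 first."""
    rows = [r for r in results
            if r.network == network and r.input_size == input_size]
    return sorted(rows, key=lambda r: r.ucf101, reverse=True)


ranking = comparable(RESULTS, "R(2+1)D", "112x112")
for r in ranking:
    print(f"{r.method:8s} UCF101 {r.ucf101:.1f}  HMDB51 {r.hmdb51:.1f}")
```

Filtering before sorting keeps the S3D-G and R3D rows out of the R(2+1)D ranking, which avoids crediting a method for a stronger backbone.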
- Context Matters: Graph-based Self-supervised Representation Learning for Medical Images - Li Sun et al, AAAI 2021
- CompRess: Self-Supervised Learning by Compressing Representations - Soroush Abbasi Koohpayegani et al, NeurIPS 2020
- Self-Supervised Visual Representation Learning from Hierarchical Grouping - Xiao Zhang et al, NeurIPS 2020
- [BYOL] Bootstrap Your Own Latent - A New Approach to Self-Supervised Learning - Jean-Bastien Grill et al, NeurIPS 2020
- Enhancing Audio-Visual Association with Self-Supervised Curriculum Learning - Jingran Zhang et al, AAAI 2021
- Self-Supervised Learning by Cross-Modal Audio-Video Clustering - Humam Alwassel et al, NeurIPS 2020
- Self-Supervised MultiModal Versatile Networks - Jean-Baptiste Alayrac et al, NeurIPS 2020
- Labelling unlabelled videos from scratch with multi-modal self-supervision - Yuki Asano et al, NeurIPS 2020
- [AVSA] Learning Representations from Audio-Visual Spatial Alignment - Pedro Morgado et al, NeurIPS 2020
- [ELO] Evolving Losses for Unsupervised Video Representation Learning - AJ Piergiovanni et al, CVPR 2020
- Audio-Visual Instance Discrimination with Cross-Modal Agreement - Pedro Morgado et al, Arxiv 2020