Video Representation Learning for Global and Local Features

This ongoing project uses self-supervised learning to learn video representations that are useful both for coarser tasks involving global information and for finer-grained tasks involving local information.


  • Franklin Wang, UC Berkeley
  • Avideh Zakhor, UC Berkeley
  • Yale Song, Microsoft
  • Du Tran, Facebook
  • Aravind Kalaiah, Facebook


Self-supervised video representation learning offers new opportunities for computer vision: it can take full advantage of the wealth of unlabeled video data available and, when successful, it can improve performance on a variety of downstream tasks, reduce the large storage footprint of video, and reduce the time spent on feature engineering. Existing video representation learning frameworks generally learn global representations of videos, usually at the clip level. These representations are typically evaluated on action recognition benchmarks (which have a strong bias towards global appearance information) and are ill-suited for local tasks involving fine details and dense prediction, such as action segmentation and tracking. In this work, we propose to learn representations that are optimized for both global and local tasks by developing contrastive learning methods that operate in a spatiotemporally denser regime, beyond the clip level. Our self-supervised framework will learn global and local representations from RGB frames and from motion features such as optical flow, yielding coarse and fine-grained representations of both appearance and motion.
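
To make the idea of combining clip-level and denser contrastive objectives concrete, here is a minimal sketch in plain Python. It is not the project's implementation: the InfoNCE loss is a common choice for contrastive learning that we use here as a stand-in, and the names `info_nce`, `mean_pool`, `global_local_loss`, the `temperature` value, and the `local_weight` parameter are all hypothetical. The sketch also assumes that the two augmented views of each clip are spatiotemporally aligned, so that the feature at a given position in one view is the positive for the same position in the other view.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(queries, keys, temperature=0.1):
    """Mean InfoNCE loss where queries[i] and keys[i] form the positive pair.

    Every other key acts as a negative for queries[i].
    """
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarity to every key, scaled by the temperature.
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        # Numerically stable log-sum-exp over all keys.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(q)


def mean_pool(features):
    """Average a list of local feature vectors into one clip-level vector."""
    n = len(features)
    return [sum(col) / n for col in zip(*features)]


def global_local_loss(view_a, view_b, local_weight=1.0):
    """Combine a clip-level loss with a dense, position-level loss.

    view_a, view_b: one entry per clip; each clip is a list of local
    feature vectors, one per spatiotemporal position. Positions are
    assumed to correspond across the two views (an assumption of this
    sketch, not a claim about the project's method).
    """
    # Global term: contrast pooled clip embeddings across the batch.
    global_loss = info_nce(
        [mean_pool(clip) for clip in view_a],
        [mean_pool(clip) for clip in view_b],
    )
    # Local term: contrast every position against all positions in the batch.
    local_a = [f for clip in view_a for f in clip]
    local_b = [f for clip in view_b for f in clip]
    local_loss = info_nce(local_a, local_b)
    return global_loss + local_weight * local_loss
```

In this sketch, the global term only sees pooled clip embeddings, while the local term treats each spatiotemporal position as its own instance. Positions whose features are identical become hard negatives for one another, which is one reason a denser objective pushes the encoder to keep fine detail. The same two terms could in principle be applied to features computed from optical flow as well as from RGB frames.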