Low-Data Learning for Assistive Video Description

Automatic video captioning aims to train models that generate text descriptions for all segments of a video; however, the most effective approaches require large amounts of manual annotation, which is slow and expensive. Unsupervised learning, semi-supervised learning, and active learning are methods designed to improve the learning efficacy of models in low-data domains. In this project, we explore how such techniques can be used to develop state-of-the-art video description systems that leverage multimodal data.



Globally, over 285 million people have some form of visual impairment, more than 40 million of whom are blind. In many cases, traditional media has remained accessible to the visually impaired through Descriptive Video Services (DVS), an additional narration track intended to convey a video's visual content to visually impaired users. DVS is a time-intensive manual process requiring annotators to watch videos, decide which elements are important for conveying the visual information in the scene, and write and perform a script that captures that information. Such annotation is possible, but expensive (~$20/min). Online video media has therefore remained inaccessible to those with visual impairments due to the cost and limited scalability of current DVS annotation procedures. With online interaction becoming a societal norm, making video content available to all users has become an even higher priority.

The goal of the proposed project is to continue the development of a human-in-the-loop system for automated DVS (ADVS). Designing such a system requires investigation into a number of complex tasks, including understanding useful description formats; developing data collection methods such as active learning systems; semi-supervised and unsupervised video-to-text translation; video understanding and modeling; and evaluating video descriptions. This project has the potential to significantly advance ADVS, with the aim of developing user-testable end-to-end systems that can watch, understand, and describe videos.
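As a concrete illustration of the active learning data collection mentioned above, an annotation pipeline can prioritize the video segments on which the current model is least confident, so that expensive human description effort is spent where it helps most. The sketch below shows uncertainty sampling in its simplest form; all names and confidence scores are hypothetical placeholders, not part of any existing system.

```python
# Minimal uncertainty-sampling sketch for active learning: rank unlabeled
# video segments by the model's caption confidence and send the least
# confident ones to human annotators. All values here are hypothetical.

def select_for_annotation(segments, confidences, budget):
    """Return the `budget` segment ids with the lowest model confidence."""
    ranked = sorted(zip(segments, confidences), key=lambda pair: pair[1])
    return [seg for seg, _ in ranked[:budget]]

# Toy example: five segments with made-up caption confidences.
segments = ["clip_a", "clip_b", "clip_c", "clip_d", "clip_e"]
confidences = [0.91, 0.32, 0.77, 0.15, 0.58]

to_label = select_for_annotation(segments, confidences, budget=2)
print(to_label)  # -> ['clip_d', 'clip_b'], the two least-confident clips
```

In a real loop, the model would be retrained on the newly annotated segments and the confidences recomputed before the next selection round.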

Currently, the development of ADVS is heavily limited by the availability of high-quality training data. Professionally annotated datasets focus primarily on long-form media such as TV shows and movies, feature difficult semantic syntax, and contain only a few hundred videos each. Large-scale data is available, but relies on inexperienced Mechanical Turk workers or automated speech recognition, leading to descriptions that are inadequate for visually impaired users.

Recently, major advances have been made in semi-supervised and unsupervised translation algorithms. Such algorithms have been highly under-explored in video description and can further increase the data efficiency of our models, providing complementary benefits to our recent advances in active learning. In year two of this project, we propose expanding the scope beyond active learning to make use of such techniques, primarily focusing on unsupervised video-to-text understanding and drawing on the large unsupervised translation literature.
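One staple of the unsupervised translation literature referenced above is back-translation: a model translating in one direction produces synthetic pairs that are used to train the reverse direction, so no manually paired data is needed. The analogous idea for ADVS would pair a video-to-text captioner with a text-to-video retrieval or embedding model. The toy sketch below shows only the synthetic-pair generation step, with a lookup table standing in for a trained captioner; everything here is a hypothetical illustration under that assumption.

```python
# Toy back-translation sketch: caption unlabeled videos with a (stand-in)
# forward model, and collect the resulting (caption, video) pairs as
# synthetic supervision for training the reverse, text-to-video direction.
# The "captioner" here is a lookup table, not a real model.

def back_translate(unpaired_videos, video_to_text, synthetic_pairs):
    """Generate synthetic (caption, video) pairs from unlabeled videos."""
    for vid in unpaired_videos:
        caption = video_to_text(vid)            # forward model: video -> text
        synthetic_pairs.append((caption, vid))  # supervision for reverse model
    return synthetic_pairs

# Hypothetical stand-in forward model.
fake_captioner = {"vid1": "a dog runs", "vid2": "two people talk"}.get

pairs = back_translate(["vid1", "vid2"], fake_captioner, [])
print(pairs)  # -> [('a dog runs', 'vid1'), ('two people talk', 'vid2')]
```

In a full system, the reverse model trained on these synthetic pairs would in turn generate pairs for the forward model, and the two directions would be alternated until they converge.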

Unsupervised ADVS is an exciting, emerging research area that aligns with Google Research's priorities to advance weakly supervised, unsupervised, and cross-modal vision+language learning methods. Further, given the cost of annotating descriptive video data and the fact that most pre-existing data is controlled by movie and TV studios (and thus unavailable for use), unpaired video-to-text translation may be the only viable, scalable approach to creating ADVS systems for the visually impaired.