Long Term Video Understanding

While comprehending the long-range temporal structure of events in a video stream is a fundamental problem in computer vision, progress beyond short-range comprehension has been limited. The roadblocks include (1) severe memory constraints imposed by on-device GPU RAM, (2) the lack of effective learning techniques that can avoid spurious correlations and learn in low signal-to-noise-ratio regimes such as long-term comprehension, and (3) the scarcity of annotated, high-quality long-term video data. In this project, we aim to tackle these hard challenges and move towards holistic long-term video understanding.




We aim to design a causal, hierarchical video model that can serve as a shared backbone for several video understanding tasks by exploiting long-term temporal context.

To study the problem, we plan to draw task-based supervision from multiple datasets, such as Charades, Atomic Visual Actions (AVA), and the EPIC-Kitchens dataset (8 million videos, 2-5 minutes each). We propose a model that maintains explicit hierarchical representations updated at different temporal frequencies. For each new short-term clip from the video, the model uses a short-term video understanding backbone such as SlowFast or I3D to extract local features, which are aggregated with previously observed ones into long-term features; these long-term features can then drive multiple video understanding tasks such as video object localization, action recognition, and tracking. Aggregation is performed by late fusion with an autoregressive architecture, for instance a suitably modified transformer encoder that attends over previously extracted context, providing effective fusion within a tractable computational budget.

This design has several advantages: i. The proposed fusion method allows adaptive frame selection through sparse attention, choosing fewer frames under predictable, low-entropy dynamics without losing information on high-entropy, chaotic scenes. ii. The model can serve as a holistic backbone for a range of video understanding tasks rather than a single narrow, specialized task. iii. Multi-scale hierarchical approaches have been shown to provide effective temporal context for several short-term understanding tasks [13, 14]; this model aims to extend those benefits to longer time horizons while overcoming memory constraints.
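The late-fusion step above can be sketched in PyTorch. This is a minimal illustration, not the proposed implementation: all module names, dimensions, and the fixed-size clip cache are hypothetical. A short-term trunk (e.g., SlowFast or I3D, here stubbed by random features) would emit one feature vector per clip; a causal transformer encoder then attends over previously extracted clip features to produce a fused long-term representation for the newest clip.

```python
# Hypothetical sketch of late fusion over cached clip features with a
# causally masked transformer encoder (names and sizes are illustrative).
import torch
import torch.nn as nn


class LongTermFusion(nn.Module):
    """Fuses the newest clip feature with attention over earlier clips."""

    def __init__(self, feat_dim=256, n_heads=4, n_layers=2, max_clips=64):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Bounding the cache keeps memory tractable for long videos.
        self.max_clips = max_clips

    def forward(self, clip_feats):
        # clip_feats: (batch, n_clips, feat_dim), oldest clip first,
        # as produced by a short-term backbone such as SlowFast or I3D.
        clip_feats = clip_feats[:, -self.max_clips:]
        n = clip_feats.size(1)
        # Causal mask: each clip attends only to itself and earlier clips,
        # giving the autoregressive structure described above.
        mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
        fused = self.encoder(clip_feats, mask=mask)
        # Long-term feature for the most recent clip, usable by task heads.
        return fused[:, -1]


# Usage with stand-in clip features (2 videos, 10 clips, 256-dim each).
feats = torch.randn(2, 10, 256)
fusion = LongTermFusion()
long_term = fusion(feats)
print(long_term.shape)  # torch.Size([2, 256])
```

Sparse or adaptive attention over the cache (e.g., dropping low-entropy clips) would slot into the masking step; the dense causal mask here is the simplest baseline.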