Human behavior understanding in videos is a complex, still unsolved problem and requires to accurately model motion at both the local (pixel-wise dense prediction) and global (aggregation of motion cues) levels. Current approaches based on supervised learning require large amounts of annotated data, whose scarce availability is one of the main limiting factors to the development of general solutions. Unsupervised learning can instead leverage the vast amount of videos available on the web and it is a promising solution for overcoming the existing limitations. In this paper, we propose an adversarial GAN-based framework that learns video representations and dynamics through a self-supervision mechanism in order to perform dense and global prediction in videos. Our approach synthesizes videos by (1) factorizing the process into the generation of static visual content and motion, (2) learning a suitable representation of a motion latent space in order to enforce spatio-temporal coherency of object trajectories, and (3) incorporating motion estimation and pixel-wise dense prediction into the training procedure. Self-supervision is enforced by using motion masks produced by the generator, as a co-product of its generation process, to supervise the discriminator network in performing dense prediction. Performance evaluation, carried out on standard benchmarks, shows that our approach is able to learn, in an unsupervised way, both local and global video dynamics. The learned representations, then, support the training of video object segmentation methods with sensibly less (about 50%) annotations, giving performance comparable to the state of the art.

Adversarial Framework for Unsupervised Learning of Motion Dynamics in Videos

Spampinato C
;
Palazzo S;Giordano D;
2019

Abstract

Human behavior understanding in videos is a complex, still unsolved problem and requires to accurately model motion at both the local (pixel-wise dense prediction) and global (aggregation of motion cues) levels. Current approaches based on supervised learning require large amounts of annotated data, whose scarce availability is one of the main limiting factors to the development of general solutions. Unsupervised learning can instead leverage the vast amount of videos available on the web and it is a promising solution for overcoming the existing limitations. In this paper, we propose an adversarial GAN-based framework that learns video representations and dynamics through a self-supervision mechanism in order to perform dense and global prediction in videos. Our approach synthesizes videos by (1) factorizing the process into the generation of static visual content and motion, (2) learning a suitable representation of a motion latent space in order to enforce spatio-temporal coherency of object trajectories, and (3) incorporating motion estimation and pixel-wise dense prediction into the training procedure. Self-supervision is enforced by using motion masks produced by the generator, as a co-product of its generation process, to supervise the discriminator network in performing dense prediction. Performance evaluation, carried out on standard benchmarks, shows that our approach is able to learn, in an unsupervised way, both local and global video dynamics. The learned representations, then, support the training of video object segmentation methods with sensibly less (about 50%) annotations, giving performance comparable to the state of the art.
File in questo prodotto:
Non ci sono file associati a questo prodotto.

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: http://hdl.handle.net/20.500.11769/373622
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 10
  • ???jsp.display-item.citation.isi??? 2
social impact