Egocentric action anticipation is the task of predicting future actions from videos collected by means of a wearable camera. Action anticipation methods should be able to continuously 1) summarize the past and 2) predict possible future actions. We observe that action anticipation benefits from explicitly disentangling these two tasks. To this end, we introduce a learning architecture which makes use of a 'rolling' LSTM to continuously summarize the past and an 'unrolling' LSTM to anticipate future actions at multiple temporal scales. The model includes a spatial and a temporal branch, which process RGB images and optical flow fields independently. The predictions of the two branches are fused using a novel modality attention mechanism which leverages the complementary nature of the two modalities. Experiments on the EPIC-KITCHENS dataset show that the proposed method surpasses the state of the art by +4.02% and +6.39% in Top-1 and Top-5 accuracy respectively. Please see the project webpage at http://iplab.dmi.unict.it/rulstm/.
|Title:||Egocentric Action Anticipation by Disentangling Encoding and Inference|
|Publication date:||2019|
|Appears in categories:||4.1 Contribution in Conference Proceedings|
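The disentangled design described in the abstract can be illustrated with a minimal, self-contained sketch. This is not the authors' implementation: for brevity it uses plain tanh recurrent cells in place of the rolling and unrolling LSTMs, random untrained weights, and toy dimensions (`D`, `N_CLASSES` are arbitrary). It only shows the control flow: one encoder summarizes past observations per modality, one decoder is unrolled over several future steps, and a softmax-normalized modality attention fuses the RGB and optical-flow predictions.

```python
import math
import random

random.seed(0)

D = 8          # toy feature dimension (assumption, not from the paper)
N_CLASSES = 5  # toy number of action classes


def rand_mat(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def add(a, b):
    return [x + y for x, y in zip(a, b)]


def tanh_vec(v):
    return [math.tanh(x) for x in v]


def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]


class Branch:
    """One modality branch: a 'rolling' encoder summarizes the past,
    an 'unrolling' decoder anticipates actions at multiple future steps.
    (Simplified tanh RNN cells stand in for the LSTMs of the paper.)"""

    def __init__(self):
        self.W_enc = rand_mat(D, D)   # rolling recurrence weights
        self.U_enc = rand_mat(D, D)
        self.W_dec = rand_mat(D, D)   # unrolling recurrence weights
        self.W_out = rand_mat(N_CLASSES, D)

    def roll(self, observations):
        h = [0.0] * D
        for x in observations:        # continuously summarize the past
            h = tanh_vec(add(matvec(self.W_enc, x), matvec(self.U_enc, h)))
        return h

    def unroll(self, h, steps):
        preds = []
        for _ in range(steps):        # anticipate at multiple temporal scales
            h = tanh_vec(matvec(self.W_dec, h))
            preds.append(softmax(matvec(self.W_out, h)))
        return preds


def fuse(p_rgb, p_flow, score_rgb, score_flow):
    """Modality attention: softmax weights over the two branch predictions."""
    w = softmax([score_rgb, score_flow])
    return [w[0] * r + w[1] * f for r, f in zip(p_rgb, p_flow)]


# toy past observations for each modality (4 time steps each)
past_rgb = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(4)]
past_flow = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(4)]

rgb, flow = Branch(), Branch()
preds_rgb = rgb.unroll(rgb.roll(past_rgb), steps=3)
preds_flow = flow.unroll(flow.roll(past_flow), steps=3)

# fixed toy attention scores; the paper learns these from the data
fused = [fuse(pr, pf, 0.2, -0.1) for pr, pf in zip(preds_rgb, preds_flow)]
for t, p in enumerate(fused, 1):
    print(f"anticipation at step {t}: class {p.index(max(p))}")
```

Separating `roll` (encoding) from `unroll` (inference) is the disentanglement the abstract refers to: the same past summary can be decoded into predictions at several anticipation horizons without re-encoding the video.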