Action anticipation from multimodal data

Rotondo, Tiziana; Farinella, G. M.; Battiato, S.
2019

Abstract

Multi-sensor data fusion combines data from different sensors to provide more accurate and complementary information for solving a specific task. Our goal is to build a shared representation of data coming from different domains, such as images, audio signals, heart rate, and acceleration, in order to anticipate the daily activities of a user wearing multimodal sensors. To this aim, we consider the Stanford-ECM Dataset, which contains synchronized data acquired with different sensors: video, acceleration, and heart rate signals. The dataset is adapted to our action prediction task by identifying the transitions from the generic “Unknown” class to a specific “Activity”. We discuss and compare a Siamese Network with a Multi-Layer Perceptron and a 1D CNN, where the input is an unknown observation and the output is the next activity to be observed. The feature representations obtained with the considered deep architectures are classified with SVM or KNN classifiers. Experimental results indicate that prediction from multimodal data is feasible and suggest that multimodality improves both classification and prediction. Nevertheless, reliably predicting the next action remains an open problem that requires further investigation, as well as the availability of multimodal datasets built specifically for prediction purposes.
Action Anticipation; Multimodal Learning; Siamese Network
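
The dataset adaptation described in the abstract, identifying transitions from the generic “Unknown” class to a specific activity, can be illustrated with a minimal sketch. It assumes a per-frame sequence of string labels; the function name `find_transitions`, the tuple layout, and the activity names in the example are hypothetical and are not taken from the paper or from the Stanford-ECM Dataset.

```python
from typing import List, Tuple

UNKNOWN = "Unknown"  # class name as quoted in the abstract


def find_transitions(labels: List[str],
                     unknown: str = UNKNOWN) -> List[Tuple[int, int, str]]:
    """Return (start, end, next_activity) for every run of `unknown` labels
    that is immediately followed by a specific activity.

    `start` is inclusive and `end` is exclusive, so labels[start:end] is the
    unknown segment that would serve as the observation, and `next_activity`
    is the label to be anticipated. A trailing unknown run with no following
    activity is dropped, since it has no target.
    """
    transitions = []
    run_start = None  # index where the current unknown run began, if any
    for i, label in enumerate(labels):
        if label == unknown:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            # First non-unknown label after an unknown run: a transition.
            transitions.append((run_start, i, label))
            run_start = None
    return transitions


# Hypothetical label sequence, for illustration only.
example = ["Walking", "Unknown", "Unknown", "Cooking",
           "Cooking", "Unknown", "Running", "Unknown"]
result = find_transitions(example)
```

Each returned tuple pairs an unknown segment with the activity that follows it, which is the input/output pairing the abstract describes for the prediction task. How the segments are windowed and turned into features is not specified in the abstract and is left out here.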

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: http://hdl.handle.net/20.500.11769/369893