Exocentric-to-Egocentric Adaptation for Temporal Action Segmentation with Unlabeled Synchronized Video Pairs

Quattrocchi, Camillo; Furnari, Antonino; Di Mauro, Daniele; Giuffrida, Mario Valerio; Farinella, Giovanni Maria
2026-01-01

Abstract

We address the challenge of adapting a temporal action segmentation model originally developed for exocentric (fixed) cameras to an egocentric setting, where wearable cameras capture the video data. Standard supervised methods require extensive, manually labeled egocentric videos for model adaptation, which are costly and labor-intensive to obtain. In contrast, we introduce a novel approach that leverages available labeled exocentric videos alongside a set of synchronized exocentric-egocentric video pairs that do not require new temporal action segmentation labels. Our method is based on knowledge distillation, which we explore at both the feature level and the temporal action segmentation model level. Experiments on the Assembly101, Ego-Exo4D, and CMU-MMAC datasets validate the effectiveness of our approach, which outperforms traditional unsupervised domain adaptation and temporal alignment techniques. Our best model matches the performance of fully supervised models trained on labeled egocentric data, without using any egocentric labels. On the Assembly101 dataset, our method improves the edit score by +15.99 (28.59 vs. 12.60) compared to a baseline trained only on exocentric data. Similarly, we observe a +3.32 improvement on the challenging Ego-Exo4D benchmark and a +12.89 improvement on the CMU-MMAC dataset. Code is available at the following link: https://github.com/fpv-iplab/synchronization-is-all-you-need.
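For illustration, the sketch below shows one plausible form of the two distillation levels the abstract mentions, in PyTorch. All function and tensor names here are assumptions chosen for exposition, not the authors' actual implementation (see the linked repository for that): feature-level distillation matches ego features to frozen exo features frame by frame, while model-level distillation uses the exo segmentation model's per-frame predictions as soft pseudo-labels for the ego model.

```python
import torch.nn.functional as F

def distillation_losses(exo_feat_teacher, ego_feat_student,
                        exo_seg_logits, ego_seg_logits):
    """Hypothetical feature- and model-level distillation losses
    computed on one synchronized exocentric-egocentric video pair.

    exo_feat_teacher: (T, D) features from a frozen exocentric extractor
    ego_feat_student: (T, D) features from a trainable egocentric extractor
    exo_seg_logits:   (T, C) per-frame logits from the exocentric TAS model
    ego_seg_logits:   (T, C) per-frame logits from the egocentric TAS model
    """
    # Feature level: synchronization gives a one-to-one frame
    # correspondence, so ego features can regress exo features directly.
    l_feat = F.mse_loss(ego_feat_student, exo_feat_teacher.detach())

    # Model level: standard KL-based knowledge distillation, treating the
    # exo model's per-frame distribution as a soft target for the ego model.
    l_model = F.kl_div(
        F.log_softmax(ego_seg_logits, dim=-1),
        F.softmax(exo_seg_logits, dim=-1).detach(),
        reduction="batchmean",
    )
    return l_feat, l_model
```

In such a setup, no egocentric action labels are needed: supervision on the ego stream comes entirely from the synchronized exocentric teacher.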
2026
Egocentric Vision
Temporal Action Segmentation
View Adaptation
Files associated with this record:
There are no files associated with this record.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this record: https://hdl.handle.net/20.500.11769/713510
Citations
  • PMC: ND
  • Scopus: 0
  • Web of Science (ISI): 0