Exocentric-to-Egocentric Adaptation for Temporal Action Segmentation with Unlabeled Synchronized Video Pairs

Quattrocchi, Camillo; Furnari, Antonino; Di Mauro, Daniele; Giuffrida, Mario Valerio; Farinella, Giovanni Maria
2026-01-01

Abstract

We address the challenge of adapting a temporal action segmentation model originally developed for exocentric (fixed) cameras to an egocentric setting, where wearable cameras capture the video data. Standard supervised methods require extensive, manually labeled egocentric videos for model adaptation, which are costly and labor-intensive to obtain. In contrast, we introduce a novel approach that leverages available labeled exocentric videos alongside a set of synchronized exocentric-egocentric video pairs that do not require new temporal action segmentation labels. Our method is based on knowledge distillation, which we explore at both the feature level and the temporal action segmentation model level. Experiments on the Assembly101, Ego-Exo4D, and CMU-MMAC datasets validate the effectiveness of our approach, which outperforms traditional unsupervised domain adaptation and temporal alignment techniques. Our best model matches the performance of fully supervised models trained on labeled egocentric data, without using any egocentric labels. On the Assembly101 dataset, our method improves the edit score by +15.99 (28.59 vs. 12.60) compared to a baseline trained only on exocentric data. Similarly, we observe a +3.32 improvement on the challenging Ego-Exo4D benchmark and a +12.89 improvement on the CMU-MMAC dataset. Code is available at the following link: https://github.com/fpv-iplab/synchronization-is-all-you-need.
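For illustration, the sketch below shows one plausible form of the two distillation levels the abstract mentions, in PyTorch. All function and tensor names here are assumptions chosen for exposition, not the authors' actual implementation (see the linked repository for that): feature-level distillation matches ego features to frozen exo features frame by frame, while model-level distillation uses the exo segmentation model's per-frame predictions as soft pseudo-labels for the ego model.

```python
import torch.nn.functional as F

def distillation_losses(exo_feat_teacher, ego_feat_student,
                        exo_seg_logits, ego_seg_logits):
    """Hypothetical feature- and model-level distillation losses
    computed on one synchronized exocentric-egocentric video pair.

    exo_feat_teacher: (T, D) features from a frozen exocentric extractor
    ego_feat_student: (T, D) features from a trainable egocentric extractor
    exo_seg_logits:   (T, C) per-frame logits from the exocentric TAS model
    ego_seg_logits:   (T, C) per-frame logits from the egocentric TAS model
    """
    # Feature level: synchronization gives a one-to-one frame
    # correspondence, so ego features can regress exo features directly.
    l_feat = F.mse_loss(ego_feat_student, exo_feat_teacher.detach())

    # Model level: standard KL-based knowledge distillation, treating the
    # exo model's per-frame distribution as a soft target for the ego model.
    l_model = F.kl_div(
        F.log_softmax(ego_seg_logits, dim=-1),
        F.softmax(exo_seg_logits, dim=-1).detach(),
        reduction="batchmean",
    )
    return l_feat, l_model
```

In such a setup, no egocentric action labels are needed: supervision on the ego stream comes entirely from the synchronized exocentric teacher.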
2026
Egocentric Vision
Temporal Action Segmentation
View Adaptation
Files associated with this record:
There are no files associated with this record.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this record: https://hdl.handle.net/20.500.11769/713510
Citations
  • PMC: ND
  • Scopus: 0
  • Web of Science (ISI): 0