Knowledge distillation meets video foundation models: A video saliency prediction case study

Moradi M.; Moradi M.; Spampinato C.; Borji A.; Palazzo S.

2026-01-01

Abstract

Modeling spatio-temporal dynamics remains a major challenge and critical factor for effective video saliency prediction (VSP). The evolution from LSTM and 3D convolutional networks to vision transformers has sparked numerous innovations for tackling this complex video understanding task. However, current technologies still struggle to capture short- and long-term frame dependencies simultaneously. The emergence of large-scale video models has introduced unprecedented opportunities to overcome these limitations but poses significant practical challenges due to their substantial parameter counts and computational costs. To address this, we propose leveraging knowledge distillation—an approach yet to be fully explored in VSP solutions. Specifically, we employ THTD-Net, a leading transformer-based VSP architecture, as the student network, guided by a newly developed large-scale VSP model serving as the teacher. Evaluations on benchmark datasets confirm the efficacy of this novel approach, demonstrating promising performance and substantially reducing the complexity required for real-world applications.
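The abstract does not state which distillation objective is used, so the following is only a minimal sketch of how a student saliency map could be supervised by both a ground-truth map and a teacher prediction. The function names, the KL-divergence formulation, and the weighting parameter `alpha` are illustrative assumptions, not the authors' method.

```python
import math


def to_distribution(saliency_map, eps=1e-8):
    """Flatten a 2D saliency map and normalize it so its values sum to 1."""
    flat = [max(v, 0.0) for row in saliency_map for v in row]
    total = sum(flat) + eps * len(flat)
    # eps keeps every entry strictly positive so the logarithm is defined.
    return [(v + eps) / total for v in flat]


def kl_divergence(target, prediction):
    """KL(target || prediction) for two discrete distributions."""
    return sum(t * math.log(t / p) for t, p in zip(target, prediction))


def distillation_loss(student_map, teacher_map, ground_truth_map, alpha=0.5):
    """Hypothetical loss: weighted sum of a supervised and a distillation term.

    alpha = 0 uses only the ground truth; alpha = 1 uses only the teacher.
    """
    student = to_distribution(student_map)
    teacher = to_distribution(teacher_map)
    truth = to_distribution(ground_truth_map)
    supervised = kl_divergence(truth, student)
    distilled = kl_divergence(teacher, student)
    return (1.0 - alpha) * supervised + alpha * distilled
```

In this sketch the loss is zero when the student map matches both the teacher and the ground truth, and grows as the student diverges from either one. A real training setup would apply the same idea per frame on tensors inside a deep learning framework.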

Year: 2026

Keywords: Knowledge distillation; Spatio-temporal transformer; Video foundation model; Video saliency prediction; Visual attention

Appears in types: 1.1 Journal article


Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.11769/715573
