Knowledge distillation meets video foundation models: A video saliency prediction case study
Moradi M.; Moradi M.; Spampinato C.; Palazzo S.
2026-01-01
Abstract
Modeling spatio-temporal dynamics remains a major challenge and a critical factor for effective video saliency prediction (VSP). The evolution from LSTM and 3D convolutional networks to vision transformers has sparked numerous innovations for tackling this complex video understanding task. However, current technologies still struggle to capture short- and long-term frame dependencies simultaneously. The emergence of large-scale video models has introduced unprecedented opportunities to overcome these limitations but poses significant practical challenges due to their substantial parameter counts and computational costs. To address this, we propose leveraging knowledge distillation, an approach yet to be fully explored in VSP solutions. Specifically, we employ THTD-Net, a leading transformer-based VSP architecture, as the student network, guided by a newly developed large-scale VSP model serving as the teacher. Evaluations on benchmark datasets confirm the efficacy of this novel approach, demonstrating promising performance and substantially reducing the complexity required for real-world applications.


