Video saliency prediction (VSP) aims to estimate regions of interest in dynamic scenes by modeling human visual attention over time. Despite recent advances, existing methods often struggle to effectively leverage temporal structure. Inspired by cognitive theories of human attention, we propose LSTD-Net, a novel VSP framework that integrates both long- and short-term temporal modeling. At the encoding stage, we employ UnMasked Teacher (UMT), a transformer-based video foundation model pretrained on large-scale video data, to extract high-level spatio-temporal features with strong generalization capabilities. To mimic the human ability to segment and prioritize dynamic information over time, the encoded features are partitioned into four temporal segments, each processed independently by specialized decoder branches. A hierarchical multi-stage decoding strategy progressively fuses local and global temporal dependencies while preserving spatial structure, enhancing prediction accuracy. Extensive experiments across six benchmark datasets, including general-purpose, audio-visual, and driver attention scenarios, show that LSTD-Net consistently outperforms state-of-the-art models across multiple evaluation metrics. These results underscore the effectiveness of combining foundation models with structured temporal reasoning to advance video saliency prediction.
Learning long- and short-term dynamics for human attention prediction using large video models
Moradi M.;Moradi M.;Proietto Salanitri F.;Bellitto G.;Rundo F.;Palazzo S.;Spampinato C.
2026-01-01
Abstract
Video saliency prediction (VSP) aims to estimate regions of interest in dynamic scenes by modeling human visual attention over time. Despite recent advances, existing methods often struggle to effectively leverage temporal structure. Inspired by cognitive theories of human attention, we propose LSTD-Net, a novel VSP framework that integrates both long- and short-term temporal modeling. At the encoding stage, we employ UnMasked Teacher (UMT), a transformer-based video foundation model pretrained on large-scale video data, to extract high-level spatio-temporal features with strong generalization capabilities. To mimic the human ability to segment and prioritize dynamic information over time, the encoded features are partitioned into four temporal segments, each processed independently by specialized decoder branches. A hierarchical multi-stage decoding strategy progressively fuses local and global temporal dependencies while preserving spatial structure, enhancing prediction accuracy. Extensive experiments across six benchmark datasets, including general-purpose, audio-visual, and driver attention scenarios, show that LSTD-Net consistently outperforms state-of-the-art models across multiple evaluation metrics. These results underscore the effectiveness of combining foundation models with structured temporal reasoning to advance video saliency prediction.I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.


