Learning long- and short-term dynamics for human attention prediction using large video models

Moradi M.;Moradi M.;Proietto Salanitri F.;Bellitto G.;Rundo F.;Palazzo S.;Spampinato C.
2026-01-01

Abstract

Video saliency prediction (VSP) aims to estimate regions of interest in dynamic scenes by modeling human visual attention over time. Despite recent advances, existing methods often struggle to effectively leverage temporal structure. Inspired by cognitive theories of human attention, we propose LSTD-Net, a novel VSP framework that integrates both long- and short-term temporal modeling. At the encoding stage, we employ UnMasked Teacher (UMT), a transformer-based video foundation model pretrained on large-scale video data, to extract high-level spatio-temporal features with strong generalization capabilities. To mimic the human ability to segment and prioritize dynamic information over time, the encoded features are partitioned into four temporal segments, each processed independently by specialized decoder branches. A hierarchical multi-stage decoding strategy progressively fuses local and global temporal dependencies while preserving spatial structure, enhancing prediction accuracy. Extensive experiments across six benchmark datasets, including general-purpose, audio-visual, and driver attention scenarios, show that LSTD-Net consistently outperforms state-of-the-art models across multiple evaluation metrics. These results underscore the effectiveness of combining foundation models with structured temporal reasoning to advance video saliency prediction.
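The segment-and-fuse idea described in the abstract can be illustrated with a minimal sketch. This is not the LSTD-Net implementation: the abstract does not specify how the segments are formed, what the decoder branches compute, or how fusion is staged. The sketch assumes contiguous, near-equal segments, uses mean pooling as a stand-in for each learned decoder branch, and fuses neighbouring segments pairwise. All names (`split_into_segments`, `mean_pool`, `decode`) are hypothetical.

```python
from typing import Callable, List, Sequence

Frame = List[float]  # hypothetical per-frame feature vector


def split_into_segments(frames: Sequence[Frame], num_segments: int = 4) -> List[List[Frame]]:
    """Partition a frame sequence into contiguous segments of near-equal length."""
    if num_segments <= 0 or len(frames) < num_segments:
        raise ValueError("need at least one frame per segment")
    base, extra = divmod(len(frames), num_segments)
    segments, start = [], 0
    for i in range(num_segments):
        length = base + (1 if i < extra else 0)  # earlier segments absorb the remainder
        segments.append(list(frames[start:start + length]))
        start += length
    return segments


def mean_pool(frames: Sequence[Frame]) -> Frame:
    """Stand-in for a learned decoder branch: average the features over time."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def decode(frames: Sequence[Frame], branches: Sequence[Callable[[Sequence[Frame]], Frame]]) -> Frame:
    """Run one branch per segment (short-term), then fuse pairwise (long-term)."""
    segments = split_into_segments(frames, len(branches))
    level = [branch(segment) for branch, segment in zip(branches, segments)]
    # Assumed fusion scheme: merge adjacent outputs until a single one remains.
    while len(level) > 1:
        level = [mean_pool(level[i:i + 2]) for i in range(0, len(level), 2)]
    return level[0]


frames = [[float(t), 1.0] for t in range(8)]  # 8 toy frames, 2 features each
fused = decode(frames, [mean_pool] * 4)       # four branches, as in the abstract
```

In a real model each branch would be a trained decoder operating on spatio-temporal feature maps rather than flat vectors, and the fusion stages would be learned rather than averaged.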
2026
Spatio-temporal transformer
Video foundation model
Video saliency prediction
Visual attention
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.11769/715574