Distilling Knowledge from Large Video Models for Driver Visual Attention Prediction
Morteza Moradi; Mohammad Moradi; Simone Palazzo
2025-01-01
Abstract
Driver attention prediction has gained significant attention recently due to its role in developing advanced driver assistance systems (ADAS) and intelligent vehicles. The emergence of video foundation models (VFMs) has opened up new possibilities for improving video understanding tasks like video saliency prediction (VSP). However, these large models are often not cost-effective for ADAS and intelligent vehicles due to their size and resource demands. To address this, we present an early effort to use knowledge distillation for predicting driver visual attention, employing the first VFM-based VSP model, SalFoM, as the teacher network. Given that driver attention prediction datasets are smaller than those used for large models, fine-tuning such models is challenging due to their high parameter count. To overcome this, we designed a VFM-based driver attention prediction network with fewer parameters than the teacher network. Experimental results show our model’s effectiveness on benchmark datasets. © 2025 IEEE.

| File | Size | Format |
|---|---|---|
| Distilling_Knowledge_from_Large_Video_Models_for_Driver_Visual_Attention_Prediction_compressed (1).pdf (Editorial Version, PDF; license: non-public, private/restricted access; viewable by archive managers only) | 226.38 kB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.
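The abstract describes a teacher–student setup in which a compact student network is trained to mimic the saliency maps of the larger SalFoM teacher. The paper's exact training objective is not given in the abstract; a common formulation for saliency distillation combines a KL-divergence term against the teacher's predicted map with one against the ground-truth fixation map. The sketch below illustrates that generic idea only — the function names, the `alpha` weighting, and the loss composition are all hypothetical, not taken from the paper.

```python
import math

def normalize(saliency_map, eps=1e-8):
    """Normalize a flattened saliency map into a discrete probability
    distribution (values sum to 1)."""
    total = sum(saliency_map) + eps
    return [v / total for v in saliency_map]

def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) between two discrete distributions, e.g. flattened,
    normalized saliency maps. eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def distillation_loss(student_map, teacher_map, gt_map, alpha=0.5):
    """Hypothetical combined objective: pull the student's predicted
    saliency toward both the teacher's soft prediction (distillation
    term) and the ground-truth fixation map (supervised term)."""
    s = normalize(student_map)
    distill_term = kl_divergence(normalize(teacher_map), s)
    supervised_term = kl_divergence(normalize(gt_map), s)
    return alpha * distill_term + (1 - alpha) * supervised_term
```

In practice a framework implementation would operate on dense tensors and may add further saliency metrics (e.g. correlation coefficient or NSS) as auxiliary terms; this pure-Python version only shows the structure of a weighted two-term distillation loss.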


