Distilling Knowledge from Large Video Models for Driver Visual Attention Prediction
Morteza Moradi; Mohammad Moradi; Simone Palazzo
2025-01-01
Abstract
Driver attention prediction has gained significant attention recently due to its role in developing advanced driver assistance systems (ADAS) and intelligent vehicles. The emergence of video foundation models (VFMs) has opened up new possibilities for improving video understanding tasks like video saliency prediction (VSP). However, these large models are often not cost-effective for ADAS and intelligent vehicles due to their size and resource demands. To address this, we present an early effort to use knowledge distillation for predicting driver visual attention, employing the first VFM-based VSP model, SalFoM, as the teacher network. Given that driver attention prediction datasets are smaller than those used for large models, fine-tuning such models is challenging due to their high parameter count. To overcome this, we designed a VFM-based driver attention prediction network with fewer parameters than the teacher network. Experimental results show our model’s effectiveness on benchmark datasets. © 2025 IEEE.

| File | Size | Format |
|---|---|---|
| Distilling_Knowledge_from_Large_Video_Models_for_Driver_Visual_Attention_Prediction_compressed (1).pdf (Editorial Version, PDF; license: non-public, private/restricted access; viewable by archive managers only) | 226.38 kB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.
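The abstract describes a teacher–student setup in which a compact student network is trained to mimic the saliency maps of the larger SalFoM teacher. The paper's exact training objective is not given in the abstract; a common formulation for saliency distillation combines a KL-divergence term against the teacher's predicted map with one against the ground-truth fixation map. The sketch below illustrates that generic idea only — the function names, the `alpha` weighting, and the loss composition are all hypothetical, not taken from the paper.

```python
import math

def normalize(saliency_map, eps=1e-8):
    """Normalize a flattened saliency map into a discrete probability
    distribution (values sum to 1)."""
    total = sum(saliency_map) + eps
    return [v / total for v in saliency_map]

def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) between two discrete distributions, e.g. flattened,
    normalized saliency maps. eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def distillation_loss(student_map, teacher_map, gt_map, alpha=0.5):
    """Hypothetical combined objective: pull the student's predicted
    saliency toward both the teacher's soft prediction (distillation
    term) and the ground-truth fixation map (supervised term)."""
    s = normalize(student_map)
    distill_term = kl_divergence(normalize(teacher_map), s)
    supervised_term = kl_divergence(normalize(gt_map), s)
    return alpha * distill_term + (1 - alpha) * supervised_term
```

In practice a framework implementation would operate on dense tensors and may add further saliency metrics (e.g. correlation coefficient or NSS) as auxiliary terms; this pure-Python version only shows the structure of a weighted two-term distillation loss.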


