How Far Can Off-the-Shelf Multimodal Large Language Models Go in Online Episodic Memory Question Answering?

We investigate whether off-the-shelf Multimodal Large Language Models (MLLMs) can tackle Online Episodic-Memory Video Question Answering (OEM-VQA) without additional training. Our pipeline converts a streaming egocentric video into a lightweight textual memory, only a few kilobytes per minute, via an MLLM descriptor module, and answers multiple-choice questions by querying this memory with an LLM reasoner module. On the QAEgo4D-Closed benchmark, our best configuration attains 56.0% accuracy with ∼3.6 kB per minute storage, matching the performance of dedicated state-of-the-art systems while being 10⁴–10⁵ times more memory-efficient. Extensive ablations provide insights into the role of each component and design choice and highlight directions for improvement in future research.
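The two-stage pipeline described in the abstract can be pictured with a minimal Python sketch. Everything named here is a hypothetical stand-in and not the authors' code: `TextMemory`, `build_memory`, and `answer` are illustrative names, the `describe` and `reason` callables stand in for the MLLM descriptor and LLM reasoner, and the fixed clip length, prompt wording, and letter-parsing rule are assumptions made only for the example.

```python
import re
import string
from dataclasses import dataclass, field
from typing import Callable, List, Sequence


@dataclass
class TextMemory:
    """Append-only textual memory: one timestamped caption per video clip."""
    entries: List[str] = field(default_factory=list)

    def append(self, start_s: float, end_s: float, caption: str) -> None:
        self.entries.append(f"[{start_s:.0f}-{end_s:.0f}s] {caption}")

    def as_text(self) -> str:
        return "\n".join(self.entries)

    def kb_per_minute(self, duration_s: float) -> float:
        # Storage footprint of the memory, taking 1 kB = 1000 bytes.
        size_kb = len(self.as_text().encode("utf-8")) / 1000
        return size_kb / (duration_s / 60)


def build_memory(clips: Sequence[object],
                 describe: Callable[[object], str],
                 clip_len_s: float) -> TextMemory:
    """Online pass: each clip is described once, then only its caption is kept."""
    memory = TextMemory()
    for i, clip in enumerate(clips):
        memory.append(i * clip_len_s, (i + 1) * clip_len_s, describe(clip))
    return memory


def answer(memory: TextMemory, question: str, options: Sequence[str],
           reason: Callable[[str], str]) -> int:
    """Query the textual memory and return the index of the chosen option."""
    letters = string.ascii_uppercase[:len(options)]
    listed = "\n".join(f"{l}. {o}" for l, o in zip(letters, options))
    prompt = (f"Video log:\n{memory.as_text()}\n\n"
              f"Question: {question}\n{listed}\n"
              "Reply with the letter of the correct option.")
    reply = reason(prompt)
    # Simplistic parsing (an assumption): take the first standalone option letter.
    match = re.search(rf"\b([{letters}])\b", reply)
    if match is None:
        raise ValueError(f"no option letter found in reply: {reply!r}")
    return letters.index(match.group(1))
```

In this sketch the raw video is never stored: only the captions persist, which is what keeps the footprint in the range of kilobytes per minute. Swapping in real models would mean passing functions that call them as `describe` and `reason`.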

Lando G.; Forte R.; Farinella G. M.; Furnari A.

2026-01-01

Year: 2026
ISBN: 9783032101914; 9783032101921
Keywords: Episodic memory; Multimodal LLM; Online VideoQA; Prompt engineering
Appears in types: 4.1 Contribution in conference proceedings

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.11769/713536
