How Far Can Off-the-Shelf Multimodal Large Language Models Go in Online Episodic Memory Question Answering?

Lando G.; Forte R.; Farinella G. M.; Furnari A.
2026-01-01

Abstract

We investigate whether off-the-shelf Multimodal Large Language Models (MLLMs) can tackle Online Episodic-Memory Video Question Answering (OEM-VQA) without additional training. Our pipeline uses an MLLM descriptor module to convert a streaming egocentric video into a lightweight textual memory of only a few kilobytes per minute, and answers multiple-choice questions by querying this memory with an LLM reasoner module. On the QAEgo4D-Closed benchmark, our best configuration attains 56.0% accuracy with ∼3.6 kB of storage per minute, matching the performance of dedicated state-of-the-art systems while being 10⁴–10⁵ times more memory-efficient. Extensive ablations provide insights into the role of each component and design choice and highlight directions for improvement in future research.
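The descriptor-memory-reasoner flow summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `TextualMemory`, `build_memory`, `answer`, and the `descriptor` and `reasoner` callables are hypothetical names introduced here, the prompt wording is invented, and 1 kB is assumed to mean 1000 bytes.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Sequence, Tuple


@dataclass
class TextualMemory:
    """Append-only log of timestamped clip descriptions (hypothetical design)."""
    entries: List[str] = field(default_factory=list)

    def add(self, start_s: float, end_s: float, description: str) -> None:
        self.entries.append(f"[{start_s:.0f}s-{end_s:.0f}s] {description}")

    def as_text(self) -> str:
        return "\n".join(self.entries)

    def kb_per_minute(self, video_seconds: float) -> float:
        # Storage footprint of the memory, assuming 1 kB = 1000 bytes.
        size_kb = len(self.as_text().encode("utf-8")) / 1000
        return size_kb / (video_seconds / 60)


def build_memory(
    clips: Iterable[Tuple[float, float, object]],
    descriptor: Callable[[object], str],
) -> TextualMemory:
    """Describe each incoming clip once, then keep only the text."""
    memory = TextualMemory()
    for start_s, end_s, clip in clips:
        memory.add(start_s, end_s, descriptor(clip))
    return memory


def answer(
    memory: TextualMemory,
    question: str,
    options: Sequence[str],
    reasoner: Callable[[str], str],
) -> int:
    """Return the index of the chosen option, or -1 if the reply is unusable."""
    letters = "ABCDEFGH"[: len(options)]
    choices = "\n".join(f"{l}. {o}" for l, o in zip(letters, options))
    prompt = (
        f"Memory:\n{memory.as_text()}\n\n"
        f"Question: {question}\n{choices}\nAnswer with one letter."
    )
    reply = reasoner(prompt).strip().upper()
    return letters.index(reply[0]) if reply and reply[0] in letters else -1
```

In practice `descriptor` would wrap a call to an MLLM and `reasoner` a call to an LLM; both are left as plain callables here so the sketch makes no assumption about any particular model API.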
Publication year: 2026
ISBN: 9783032101914; 9783032101921
Keywords: Episodic memory; Multimodal LLM; Online VideoQA; Prompt engineering
Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.11769/713536