TI-PREGO: Chain of Thought and In-Context Learning for online mistake detection in PRocedural EGOcentric videos


Plini L.;Scofano L.;De Matteis E.;di Melendugno G. M. D.;Flaborea A.;Sanchietti A.;Farinella G. M.;Galasso F.;Furnari A.

2026-01-01

Abstract

Identifying procedural errors online from egocentric videos is a critical yet challenging task across various domains, including manufacturing, healthcare, and skill-based training. Such mistakes are inherently open-set: unforeseen or novel errors may occur, so robust detection systems cannot rely on prior examples of failure. No existing technique can reliably detect open-set procedural mistakes in an online setting. We propose a dual-branch architecture to address this problem: the recognition branch takes frames from the egocentric video as input, predicts the current action, and aggregates frame-level results into action tokens, while the anticipation branch leverages the pattern-matching capabilities of Large Language Models (LLMs) to predict action tokens based on previously predicted ones. Mistakes are detected as mismatches between the currently recognized action and the action predicted by the anticipation branch. Extensive experiments on two novel procedural datasets demonstrate the challenges and opportunities of leveraging a dual-branch architecture for mistake detection, showcasing the effectiveness of our proposed approach.
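The mismatch rule described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that frame-level predictions are plain string labels and that action tokens are formed by collapsing consecutive identical labels. The anticipation branch is replaced by a hypothetical lookup function, `toy_anticipator`, standing in for the LLM. All function names and action labels are invented for illustration.

```python
from itertools import groupby
from typing import Callable, List, Sequence


def aggregate_frames(frame_labels: Sequence[str]) -> List[str]:
    """Collapse runs of identical frame-level labels into action tokens."""
    return [label for label, _ in groupby(frame_labels)]


def detect_mistakes(
    frame_labels: Sequence[str],
    anticipate: Callable[[List[str]], str],
) -> List[int]:
    """Return indices of action tokens that differ from the anticipated action."""
    tokens = aggregate_frames(frame_labels)
    mistakes = []
    # The first token has no history, so checking starts at the second one.
    for i in range(1, len(tokens)):
        expected = anticipate(tokens[:i])  # only past tokens are visible
        if tokens[i] != expected:
            mistakes.append(i)
    return mistakes


def toy_anticipator(history: List[str]) -> str:
    """Hypothetical stand-in for the LLM: next step of one fixed procedure."""
    procedure = ["pick_screw", "insert_screw", "tighten_screw", "attach_wheel"]
    last = history[-1]
    if last in procedure and procedure.index(last) + 1 < len(procedure):
        return procedure[procedure.index(last) + 1]
    return "<unknown>"


# Invented frame-level output: the "tighten_screw" step is skipped.
frames = ["pick_screw"] * 3 + ["insert_screw"] * 2 + ["attach_wheel"] * 4
mistake_indices = detect_mistakes(frames, toy_anticipator)
print(mistake_indices)  # [2]
```

In this toy run the third action token (`attach_wheel`) is flagged because the stand-in predictor expected `tighten_screw`. No example of the error itself is needed, which is the open-set property the abstract emphasizes.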


Year: 2026

Keywords: Chain of Thought; Egocentric vision; In-context learning; Large language models; Procedural mistake detection; Video understanding

Publication type: 1.1 Journal article


Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.11769/713531
