Task Graph Maximum Likelihood Estimation for Procedural Activity Understanding in Egocentric Videos
Seminara, Luigi;Farinella, Giovanni Maria;Furnari, Antonino
2026-01-01
Abstract
Humans engage daily in procedural activities such as cooking a recipe or fixing a bike, which can be described as goal-oriented sequences of key-steps following certain ordering constraints. Task graphs mined from videos or textual descriptions have recently gained popularity as a human-readable, holistic representation of procedural activities encoding a partial ordering over key-steps, and have shown promise in supporting downstream video understanding tasks. While previous works generally relied on hand-crafted procedures to extract task graphs from videos, this paper introduces an approach based on gradient-based maximum likelihood optimization of edge weights, which can be used to directly estimate an adjacency matrix and can also be naturally plugged into more complex neural network architectures. We validate the ability of the proposed approach to generate accurate task graphs on the CaptainCook4D and EgoPER datasets. Moreover, we extend our validation analysis to the EgoProceL dataset, which we manually annotate with task graphs as an additional contribution. The three datasets together constitute a new benchmark for task graph learning, where our approach obtains improvements of +14.5%, +10.2% and +13.6% in F1 score, respectively, over previous approaches. Thanks to the differentiability of the proposed framework, we also introduce a feature-based approach for predicting task graphs from key-step textual or video embeddings, which exhibits emerging video understanding abilities. Beyond that, task graphs learned with our approach obtain top performance on the Ego-Exo4D procedure understanding benchmark, which includes five downstream tasks, with gains of up to +4.61%, +0.10%, +5.02%, +8.62%, and +15.16% in finding Previous Keysteps, Optional Keysteps, Procedural Mistakes, Missing Keysteps, and Future Keysteps, respectively.
We finally show significant enhancements to the challenging task of online mistake detection in procedural egocentric videos, achieving notable gains of +19.8% and +6.4% in the Assembly101-O and EPIC-Tent-O datasets, respectively, compared to the state of the art.
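
To make the notion of a task graph concrete, the sketch below shows a toy graph stored as a mapping from each key-step to its preconditions, a check for whether a key-step sequence respects the resulting partial ordering, and an F1 score between predicted and ground-truth edges. It is only an illustration and not the method or evaluation protocol of the paper: the key-step names, the helper names `first_violation`, `edges` and `edge_f1`, and the choice of exact edge matching for F1 are all assumptions made here.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical representation: key-step -> set of key-steps that must precede it.
TaskGraph = Dict[str, Set[str]]
Edge = Tuple[str, str]  # (key-step, precondition)


def first_violation(sequence: List[str], graph: TaskGraph) -> int:
    """Return the index of the first key-step executed before one of its
    preconditions, or -1 if the sequence respects the partial ordering."""
    done: Set[str] = set()
    for index, step in enumerate(sequence):
        if not graph.get(step, set()) <= done:
            return index
        done.add(step)
    return -1


def edges(graph: TaskGraph) -> Set[Edge]:
    """Flatten a task graph into a set of (key-step, precondition) edges."""
    return {(step, pre) for step, pres in graph.items() for pre in pres}


def edge_f1(predicted: Set[Edge], truth: Set[Edge]) -> float:
    """F1 between two edge sets, assuming exact edge matching."""
    true_positives = len(predicted & truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


# Toy ground-truth graph (assumed): whisking needs cracked eggs,
# cooking needs both whisked eggs and a heated pan.
truth_graph: TaskGraph = {"whisk": {"crack"}, "cook": {"whisk", "heat"}}
# Toy predicted graph with one missing and one spurious edge.
predicted_graph: TaskGraph = {
    "whisk": {"crack"},
    "cook": {"whisk"},
    "heat": {"crack"},
}

valid_index = first_violation(["crack", "heat", "whisk", "cook"], truth_graph)
mistake_index = first_violation(["crack", "whisk", "cook", "heat"], truth_graph)
score = edge_f1(edges(predicted_graph), edges(truth_graph))
```

In this toy example the second sequence cooks before heating the pan, so `first_violation` flags the `cook` step, which loosely mirrors how a task graph could support mistake detection. The predicted graph shares two of its three edges with the ground truth, giving precision and recall of 2/3 each.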


