Scaling Egocentric Vision: The EPIC-KITCHENS Dataset

Farinella, Giovanni Maria; Furnari, Antonino; Moltisanti, Davide
2018-01-01

Abstract

First-person vision is gaining interest as it offers a unique viewpoint on people’s interaction with objects, their attention, and even intention. However, progress in this challenging domain has been relatively slow due to the lack of sufficiently large datasets. In this paper, we introduce EPIC-KITCHENS, a large-scale egocentric video benchmark recorded by 32 participants in their native kitchen environments. Our videos depict non-scripted daily activities: we simply asked each participant to start recording every time they entered their kitchen. Recording took place in 4 cities (in North America and Europe) by participants belonging to 10 different nationalities, resulting in highly diverse cooking styles. Our dataset features 55h of video consisting of 11.5M frames, which we densely labelled for a total of 39.6K action segments and 454.3K object bounding boxes. Our annotation is unique in that we had the participants narrate their own videos (after recording), thus reflecting true intention, and we crowd-sourced ground-truths based on these. We describe our object, action and anticipation challenges, and evaluate several baselines over two test splits, seen and unseen kitchens.
ISBN: 9783030012243
Keywords: Action recognition and anticipation; Benchmarks; Dataset; Egocentric object detection; Egocentric vision; First-person vision; Theoretical Computer Science; Computer Science (all)
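
To make the annotation structure described in the abstract concrete, below is a minimal Python sketch of how the narrated action segments and the seen/unseen evaluation split could be represented. This is an illustrative assumption, not the dataset's released annotation schema or official tooling: the ActionSegment class, its field names, and the split_seen_unseen helper are hypothetical.

from dataclasses import dataclass

@dataclass
class ActionSegment:
    participant_id: str   # e.g. "P01" .. "P32" (32 participants; identifiers assumed)
    video_id: str         # recording identifier (hypothetical)
    start_frame: int      # first frame of the action segment
    stop_frame: int       # last frame of the action segment
    verb: str             # action verb taken from the participant's narration
    noun: str             # object noun taken from the participant's narration

def split_seen_unseen(segments, unseen_participants):
    """Assign each segment to the 'seen' or 'unseen' test split.

    'Unseen' here means the participant's kitchen is held out from training
    entirely, mirroring the two test splits mentioned in the abstract.
    """
    seen, unseen = [], []
    for seg in segments:
        (unseen if seg.participant_id in unseen_participants else seen).append(seg)
    return seen, unseen

if __name__ == "__main__":
    demo = [
        ActionSegment("P01", "P01_01", 120, 310, "open", "fridge"),
        ActionSegment("P32", "P32_05", 45, 200, "cut", "onion"),
    ]
    seen, unseen = split_seen_unseen(demo, unseen_participants={"P32"})
    print(len(seen), "seen /", len(unseen), "unseen segments")

In this sketch the "unseen" split simply collects all segments from held-out participants, which reflects the abstract's description of evaluating baselines separately on seen and unseen kitchens.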

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.11769/365585
Citations
  • Scopus: 392