
Image representation using consensus vocabulary and food images classification / Moltisanti, Marco. - (2015 Dec 11).

Image representation using consensus vocabulary and food images classification

MOLTISANTI, MARCO
2015-12-11

Abstract

Digital images are the result of many physical factors, such as illumination, viewpoint, and the thermal noise of the sensor. These elements may be irrelevant for a specific Computer Vision task; for instance, in object detection, the viewpoint and the color of the object should not matter when answering the question "Is the object present in the image?". Nevertheless, an image depends crucially on all such parameters, and it is simply not possible to ignore them in the analysis. Hence, finding a representation that, given a specific task, keeps the significant features of the image and discards the less useful ones is the first step towards building a robust Computer Vision system.

One of the most popular models for representing images is the Bag-of-Visual-Words (BoW) model. Derived from text analysis, this model is based on the generation of a codebook (also called a vocabulary), which is subsequently used to produce the actual image representation. Given a set of images, the typical pipeline consists of the following steps (a minimal code sketch is given after the abstract):

1. Select a subset of images as the training set for the model;
2. Extract the desired features from all the images;
3. Run a clustering algorithm on the features extracted from the training set: each cluster is a codeword, and the set of all clusters is the codebook;
4. For each feature point, find the closest codeword according to a distance function or metric;
5. Build a normalized histogram of the occurrences of each word.

The choices made in the design phase strongly influence the final outcome of the representation. In this work we discuss how to aggregate different kinds of features to obtain more powerful representations, presenting some state-of-the-art methods from the Computer Vision community. We focus on Clustering Ensemble techniques, presenting the theoretical framework and a new approach (Section 2.5).

Understanding food in everyday life (e.g., recognizing dishes and the related ingredients, estimating quantities, etc.) is a problem that has been considered in different research areas because of its impact on medical, social and anthropological aspects. For instance, an unhealthy diet can harm a person's general health. Since health is strictly linked to diet, advanced Computer Vision tools that recognize food images (e.g., acquired with mobile/wearable cameras), as well as their properties (e.g., calories, volume), can support diet monitoring by providing useful information to experts (e.g., nutritionists) for assessing the food intake of patients (e.g., to combat obesity). On the other hand, the wide diffusion of low-cost image acquisition devices embedded in smartphones allows people to take pictures of food and share them on the Internet (e.g., on social media); the automatic analysis of the posted images could provide information on the relationship between people and their meals, and could be exploited by food retailers to better understand the preferences of a person for further recommendations of food and related products. Image representation plays a key role when trying to infer information about the food items depicted in an image. We propose a deep review of the state of the art and two different novel representation techniques.
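The following is a minimal sketch of the generic BoW pipeline described in the abstract, assuming local descriptors (e.g., SIFT-like vectors) are already extracted and available as NumPy arrays, with k-means as the clustering step and Euclidean distance for codeword assignment. Codebook size, function names and parameters are illustrative, not the thesis' exact settings.

```python
# Bag-of-Visual-Words sketch: build a codebook by clustering training
# descriptors, then represent an image as a normalized codeword histogram.
import numpy as np
from sklearn.cluster import KMeans


def build_codebook(training_descriptors: np.ndarray, n_words: int = 100) -> KMeans:
    """Step 3: cluster the training descriptors; each centroid is a codeword."""
    kmeans = KMeans(n_clusters=n_words, random_state=0, n_init=10)
    kmeans.fit(training_descriptors)
    return kmeans


def bow_histogram(image_descriptors: np.ndarray, codebook: KMeans) -> np.ndarray:
    """Steps 4-5: assign each descriptor to its closest codeword and
    build a normalized histogram of codeword occurrences."""
    words = codebook.predict(image_descriptors)            # nearest codeword per descriptor
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / hist.sum()                               # L1-normalized representation


if __name__ == "__main__":
    # Toy data standing in for descriptors extracted from training images (step 2).
    rng = np.random.default_rng(0)
    train_feats = rng.normal(size=(1000, 128))             # e.g., 128-D SIFT-like descriptors
    codebook = build_codebook(train_feats, n_words=50)

    # Represent a new image by the histogram of its descriptors' codewords.
    image_feats = rng.normal(size=(200, 128))
    representation = bow_histogram(image_feats, codebook)
    print(representation.shape, representation.sum())      # (50,) 1.0
```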
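The abstract also points to Clustering Ensemble techniques as the basis of the consensus vocabulary. As an illustration of that general family only (not the thesis' specific new approach), the sketch below uses the common evidence-accumulation / co-association scheme: several k-means partitions are combined into a co-association matrix, which is then cut with average-linkage hierarchical clustering. All names and parameters are assumptions made for the example.

```python
# Consensus clustering sketch via evidence accumulation (co-association matrix).
import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform


def consensus_clusters(X: np.ndarray, n_partitions: int = 10,
                       k_range=(5, 15), n_consensus: int = 8) -> np.ndarray:
    n = X.shape[0]
    coassoc = np.zeros((n, n))

    # Evidence accumulation: count how often two points fall in the same cluster
    # across base partitions obtained with different k and random seeds.
    for seed in range(n_partitions):
        k = int(np.random.default_rng(seed).integers(k_range[0], k_range[1] + 1))
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        coassoc += (labels[:, None] == labels[None, :]).astype(float)
    coassoc /= n_partitions

    # Consensus partition: average-linkage clustering on 1 - co-association.
    dist = 1.0 - coassoc
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="average")
    return fcluster(Z, t=n_consensus, criterion="maxclust")


if __name__ == "__main__":
    X = np.random.default_rng(0).normal(size=(300, 16))
    print(np.unique(consensus_clusters(X)))
```

In a consensus-vocabulary setting, the consensus partition would play the role of the codebook-building step (step 3 of the BoW pipeline) in place of a single clustering run.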
consensus vocabulary, image classification, food image analysis
Files in this item:

File: TesiMoltisanti-PDFA.pdf (open access)
Type: Doctoral thesis
License: PUBLIC - Public with Copyright
Size: 4.11 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.11769/582886