A NLP and YOLOv8-Integrated Approach for Enabling Visually Impaired Individuals to Interpret Their Environment
Avanzato R.; Randieri C.
2023-01-01
Abstract
Image captioning is a significant challenge in Computer Vision. The task takes an image as input, identifies the objects within it, comprehends the relationships between those objects, including their implicit characteristics, and produces a concise description as output. Given the vast number of possible interactions, acquiring sufficient training examples for every relationship is difficult. Prior research has shown that even when objects and predicates take part in rare relationships, each occurs far more frequently on its own. Consequently, the proposed solution trains two distinct visual models, one for objects and one for predicates, which are subsequently combined to capture as many relationships as possible. This study leverages the information obtained about detected objects and their relationships to generate a comprehensive description of the image. By enabling user interaction through Visual Question Answering, a task that bridges Computer Vision and Natural Language Processing, we obtain an interactive approach to image captioning. Given a question and an image, the system reasons over both the image content and general knowledge to generate an accurate answer. We believe that such a system provides an effective way for visually impaired individuals to understand the content of an image.
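As an illustrative sketch only, not the code released with the paper, the following Python snippet assembles a comparable pipeline from off-the-shelf components: YOLOv8 for object detection and a pretrained ViLT checkpoint for Visual Question Answering. The checkpoint names, the flat caption template, and the `describe_and_answer` helper are assumptions for demonstration; the separate object/predicate relationship models described in the abstract are not reproduced here.

```python
# Illustrative sketch (assumed components, not the authors' implementation):
# detect objects with YOLOv8, build a naive caption from the detections,
# then answer a user question with a pretrained VQA model.
from ultralytics import YOLO       # pip install ultralytics
from transformers import pipeline  # pip install transformers

def describe_and_answer(image_path: str, question: str) -> tuple[str, str]:
    # 1. Object detection with a small pretrained YOLOv8 checkpoint.
    detector = YOLO("yolov8n.pt")
    result = detector(image_path)[0]
    labels = [result.names[int(c)] for c in result.boxes.cls]

    # 2. Naive caption from the detected object labels. This stands in for the
    #    paper's combined object/predicate relationship model.
    if labels:
        caption = "The image contains: " + ", ".join(sorted(set(labels))) + "."
    else:
        caption = "No objects were detected in the image."

    # 3. Visual Question Answering with an off-the-shelf ViLT checkpoint.
    vqa = pipeline("visual-question-answering",
                   model="dandelin/vilt-b32-finetuned-vqa")
    answer = vqa(image=image_path, question=question)[0]["answer"]
    return caption, answer

if __name__ == "__main__":
    caption, answer = describe_and_answer("street.jpg", "How many people are there?")
    print(caption)
    print("Answer:", answer)
```

In an assistive setting, the caption and the answer would typically be passed to a text-to-speech engine so that a visually impaired user can hear the description and query the scene interactively.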