From Foundations to GPT in Text Classification: A Comprehensive Survey on Current Approaches and Future Trends

Text classification is a cornerstone of Natural Language Processing (NLP), particularly from the perspective of computer science and engineering. Over the past decade, deep learning has revolutionized text classification, driving advances in text retrieval, categorization, information extraction, and summarization. The scholarly literature covers datasets, models, and evaluation criteria; English is the predominant language of focus, although studies also address Arabic, Chinese, Hindi, and others. The efficacy of text classification models relies heavily on their ability to capture intricate textual relationships and non-linear correlations, which calls for a comprehensive examination of the entire text classification pipeline. In NLP, many text representation techniques and model architectures have emerged, with Large Language Models (LLMs) and Generative Pre-trained Transformers (GPTs) at the forefront. These models are adept at transforming extensive textual data into meaningful vector representations that encapsulate semantic information. The multidisciplinary nature of text classification, which spans data mining, linguistics, and information retrieval, highlights the importance of collaborative research to advance the field. This work integrates traditional and contemporary text mining methodologies, fostering a holistic understanding of text classification. This monograph explores the text classification pipeline in depth, with particular emphasis on evaluating the impact of each component on the overall performance of text classification models. The pipeline includes state-of-the-art datasets, text preprocessing techniques, text representation methods, classification models, evaluation metrics, and future trends. Each section examines these stages, presenting technical innovations and recent findings. The work assesses various classification strategies, offering comparative analyses, examples, and case studies. These contributions extend beyond a typical survey, providing a detailed and insightful exploration of the field.

Siino, Marco (first author); Tinnirello, Ilenia; La Cascia, Marco

2025-01-01

Year

2025

Appears in types:

1.1 Journal article

Files in this record:

File	Size	Format
Pubblicazione_nr_1.pdf (archive managers only; Type: Publisher's version (PDF); License: NOT PUBLIC - Private/restricted access)	3.56 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.11769/681850
