Selection of diagnosis with oncologic relevance information from histopathology free text reports: A machine learning approach

Carmelo Viscosi;Paolo Fidelbo;Andrea Benedetto;Massimo Varvarà;Margherita Ferrante
2022-01-01

Abstract

Histopathology reports are a primary data source for the case definition phase of a Cancer Registry. By reading the histopathology report, the operator who evaluates an oncology case can define the morphology and topography of the cancer and validate the case with the highest basis of diagnosis. The key problem for the Catania-Messina-Enna Integrated Cancer Registry (RTI) is that these reports are written in natural language, and only a small fraction of the histopathology reports produced each year contains information relevant to cancer evaluation. In this population-based retrospective cohort study, we aim to reduce the time RTI operators spend searching for and selecting the right information among the histopathology reports of the eastern Sicily population. To do so, we developed a binary classifier on a training set of labeled historical data and validated its output on a test set of labeled data created by the operators over the years. Using a machine learning algorithm, we built a classification model that evaluates each free-text report and returns a score indicating the probability that it contains oncologically relevant information. Of the eight algorithms analyzed in this study, the best performing was LightGBM, which reached an F1-score of 98.9%. Using the chosen classifier, we shortened the time for case evaluation, improving the timeliness of cancer statistics.
2022
Machine learning, Binary classification, Natural language processing, Cancer registry
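The workflow summarized in the abstract (train a binary classifier on labeled free-text reports, return a probability score per report, evaluate with the F1-score) can be illustrated with a minimal, dependency-free sketch. It is not the authors' pipeline: a small bag-of-words logistic regression stands in for LightGBM, the reports are invented English toy examples, and every name (`TRAIN`, `HELD_OUT`, `train`, `score`, `f1_score`) and the 0.5 threshold are assumptions made for illustration. The held-out reports are reworded variants of training reports, so the result is only a sanity check.

```python
import math

# Hypothetical toy reports: 1 = oncologically relevant, 0 = not relevant.
TRAIN = [
    ("infiltrating ductal carcinoma of the breast", 1),
    ("adenocarcinoma of the colon with nodal metastasis", 1),
    ("squamous cell carcinoma of the skin", 1),
    ("chronic gastritis without dysplasia", 0),
    ("hyperplastic polyp of the colon no malignancy", 0),
    ("benign nevus of the skin", 0),
]
# Hypothetical held-out reports (reworded variants, for a sanity check only).
HELD_OUT = [
    ("infiltrating ductal carcinoma of the left breast", 1),
    ("squamous cell carcinoma of the scalp skin", 1),
    ("mild chronic gastritis without dysplasia", 0),
    ("benign intradermal nevus of the skin", 0),
]

def tokenize(text):
    """Lowercase whitespace tokenization into a set (binary bag of words)."""
    return set(text.lower().split())

def featurize(text, vocab):
    """Indices of the known tokens in a report; unseen tokens are ignored."""
    return [vocab[tok] for tok in tokenize(text) if tok in vocab]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(samples, epochs=1000, lr=0.5):
    """Full-batch gradient descent on the mean logistic loss."""
    tokens = sorted({tok for text, _ in samples for tok in tokenize(text)})
    vocab = {tok: i for i, tok in enumerate(tokens)}
    rows = [(featurize(text, vocab), label) for text, label in samples]
    weights, bias, n = [0.0] * len(vocab), 0.0, len(rows)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(vocab), 0.0
        for idx, label in rows:
            err = sigmoid(bias + sum(weights[i] for i in idx)) - label
            grad_b += err
            for i in idx:
                grad_w[i] += err
        bias -= lr * grad_b / n
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
    return vocab, weights, bias

def score(text, vocab, weights, bias):
    """Estimated probability that a report is oncologically relevant."""
    return sigmoid(bias + sum(weights[i] for i in featurize(text, vocab)))

def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

vocab, weights, bias = train(TRAIN)
scores = [score(text, vocab, weights, bias) for text, _ in HELD_OUT]
predictions = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold
labels = [label for _, label in HELD_OUT]
held_out_f1 = f1_score(labels, predictions)
print(f"held-out F1 on toy data: {held_out_f1:.3f}")
```

In a registry setting, the score would more likely be used to rank or filter reports for operator review than to replace it, and the threshold would be tuned on real labeled data.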
Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.11769/520818