Machine Learning Approaches for Handling Missing Data in Antimicrobial Resistance Databases

Condorelli, Chiara;Carchiolo, Vincenza;Frasca, Mattia;Gambuzza, Lucia Valentina
2025-01-01

Abstract

Antimicrobial resistance (AMR) is a growing global health threat, and data-driven approaches play a critical role in monitoring, understanding, and predicting resistance patterns. However, AMR datasets often suffer from missing values, which can significantly compromise the performance and reliability of statistical analyses and machine learning models. In this study, we investigate the effectiveness of various machine learning-based imputation techniques for handling missing data in AMR datasets. Specifically, we address two types of missing data, missing completely at random (MCAR) and missing not at random (MNAR), and evaluate their impact on both binary datasets, where entries indicate the presence or absence of resistance genes, and continuous datasets, where values represent the relative abundance of antimicrobial resistance genes. For binary datasets, we assess the robustness of each imputation method using standard classification metrics, including accuracy, precision, recall, F1-score, and the area under the receiver operating characteristic curve. For continuous datasets, we assess imputation performance using regression metrics, including normalized mean absolute error and normalized root mean squared error. Our results demonstrate that advanced imputation techniques substantially improve data completeness and model performance across both data types. These findings highlight the importance of tailored imputation strategies in enhancing the quality and reliability of AMR surveillance and predictive systems.
Keywords: Imputation, Machine learning, Immune system, Robustness, Databases, Data models, Accuracy, Surveillance, Strain, Standards
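
The evaluation protocol summarized in the abstract, for the continuous case, can be illustrated with a minimal sketch: hide known entries under an MCAR or MNAR mechanism, impute them, and score the imputed values against the hidden ground truth. Everything below is an assumption made for illustration and is not taken from the paper: the function names, the synthetic data, the specific MNAR rule (only below-median values can go missing), the column-mean imputer (a simple baseline standing in for the machine learning imputers studied), and the choice to normalize errors by the range of the true values.

```python
import math
import random


def mcar_mask(data, rate, rng):
    """MCAR: every entry is hidden with the same probability `rate`."""
    return [[rng.random() < rate for _ in row] for row in data]


def mnar_mask(data, rate, rng):
    """MNAR: the chance of being hidden depends on the hidden value itself.

    Illustrative rule (an assumption): values below the column median are
    hidden with probability 2 * rate, all other values are never hidden.
    """
    medians = [sorted(col)[len(col) // 2] for col in zip(*data)]
    p = min(1.0, 2 * rate)
    return [[value < medians[j] and rng.random() < p
             for j, value in enumerate(row)] for row in data]


def column_mean_impute(data, mask):
    """Baseline imputer: fill hidden entries with the observed column mean."""
    means = []
    for j in range(len(data[0])):
        observed = [row[j] for row, mrow in zip(data, mask) if not mrow[j]]
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if hidden else value
             for j, (value, hidden) in enumerate(zip(row, mrow))]
            for row, mrow in zip(data, mask)]


def normalized_errors(truth, imputed, mask):
    """NMAE and NRMSE over hidden entries, divided by the range of `truth`."""
    pairs = [(t, p)
             for trow, prow, mrow in zip(truth, imputed, mask)
             for t, p, hidden in zip(trow, prow, mrow) if hidden]
    flat = [v for row in truth for v in row]
    span = max(flat) - min(flat)
    mae = sum(abs(t - p) for t, p in pairs) / len(pairs)
    rmse = math.sqrt(sum((t - p) ** 2 for t, p in pairs) / len(pairs))
    return mae / span, rmse / span


# Synthetic "relative abundance" matrix: 200 samples x 5 genes (made up).
rng = random.Random(0)
data = [[rng.random() for _ in range(5)] for _ in range(200)]

for name, make_mask in (("MCAR", mcar_mask), ("MNAR", mnar_mask)):
    mask = make_mask(data, 0.2, rng)
    imputed = column_mean_impute(data, mask)
    nmae, nrmse = normalized_errors(data, imputed, mask)
    print(f"{name}: NMAE={nmae:.3f} NRMSE={nrmse:.3f}")
```

For binary presence/absence data, the same masking step would apply, with the imputed entries scored by classification metrics such as accuracy or F1-score instead of NMAE and NRMSE.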
Files in this record:
Condorelli_Machine_Learning_Approaches_for_Handling_Missing_Data_in_Antimicrobial_Resistance_Databases 2025.pdf
Access: open access
Type: Publisher's version (PDF)
License: Creative Commons
Size: 1.91 MB
Format: Adobe PDF

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.11769/710732