Machine Learning Approaches for Handling Missing Data in Antimicrobial Resistance Databases
Condorelli, Chiara; Carchiolo, Vincenza; Frasca, Mattia; Gambuzza, Lucia Valentina
2025-01-01
Abstract
Antimicrobial resistance (AMR) is a growing global health threat, and data-driven approaches play a critical role in monitoring, understanding, and predicting resistance patterns. However, AMR datasets often contain missing values, which can significantly compromise the performance and reliability of statistical analyses and machine learning models. In this study, we investigate the effectiveness of various machine learning-based imputation techniques for handling missing data in AMR datasets. Specifically, we address two missingness mechanisms, missing completely at random (MCAR) and missing not at random (MNAR), and evaluate their impact on both binary datasets, where entries indicate the presence or absence of resistance genes, and continuous datasets, where values represent the relative abundance of antimicrobial resistance genes. For binary datasets, we assess the robustness of each imputation method using standard classification metrics: accuracy, precision, recall, F1-score, and the area under the receiver operating characteristic curve. For continuous datasets, we assess imputation performance using regression metrics: normalized mean absolute error and normalized root mean squared error. Our results demonstrate that advanced imputation techniques substantially improve data completeness and model performance across both data types. These findings highlight the importance of tailored imputation strategies in enhancing the quality and reliability of AMR surveillance and predictive systems.

| File | Access | Type | License | Size | Format |
|---|---|---|---|---|---|
| Condorelli_Machine_Learning_Approaches_for_Handling_Missing_Data_in_Antimicrobial_Resistance_Databases 2025.pdf | Open access | Publisher's version (PDF) | Creative Commons | 1.91 MB | Adobe PDF |
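The evaluation described in the abstract for continuous data can be illustrated with a minimal sketch: hide entries completely at random, impute them, and score the imputed values against the hidden truth. Everything here is an assumption made for illustration, not the authors' pipeline: the synthetic `abundance` matrix, the 20% missing rate, the column-mean imputer (a simple baseline standing in for the machine learning imputers the paper studies), and the choice to normalize both errors by the range of the true data, which is one of several common conventions.

```python
import numpy as np


def mcar_mask(shape, missing_rate, rng):
    """Return a boolean mask (True = hidden); each entry is dropped independently."""
    return rng.random(shape) < missing_rate


def mean_impute(x_missing):
    """Baseline imputer (assumed for illustration): fill NaNs with the column mean."""
    col_means = np.nanmean(x_missing, axis=0)
    return np.where(np.isnan(x_missing), col_means, x_missing)


def nmae(truth, imputed, mask):
    """Mean absolute error on hidden entries, divided by the range of the true data."""
    err = np.abs(truth[mask] - imputed[mask]).mean()
    return err / (truth.max() - truth.min())


def nrmse(truth, imputed, mask):
    """Root mean squared error on hidden entries, divided by the range of the true data."""
    err = np.sqrt(((truth[mask] - imputed[mask]) ** 2).mean())
    return err / (truth.max() - truth.min())


rng = np.random.default_rng(0)
abundance = rng.random((200, 10))        # synthetic stand-in for relative abundances
mask = mcar_mask(abundance.shape, 0.2, rng)
observed = np.where(mask, np.nan, abundance)
imputed = mean_impute(observed)

scores = {
    "nmae": nmae(abundance, imputed, mask),
    "nrmse": nrmse(abundance, imputed, mask),
}
print(scores)
```

An MNAR scenario would replace `mcar_mask` with a mask whose drop probability depends on the hidden value itself, for example hiding low abundances more often; the scoring functions would stay the same.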
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.


