Machine Learning Approaches for Handling Missing Data in Antimicrobial Resistance Databases

Condorelli, Chiara; Carchiolo, Vincenza; Frasca, Mattia; Gambuzza, Lucia Valentina

2025-01-01

Abstract

Antimicrobial resistance (AMR) is a growing global health threat, and data-driven approaches play a critical role in monitoring, understanding, and predicting resistance patterns. However, AMR datasets often suffer from missing values, which can significantly compromise the performance and reliability of statistical analyses and machine learning models. In this study, we investigate the effectiveness of various machine learning-based imputation techniques for handling missing data in AMR datasets. Specifically, we address two types of missing data, missing completely at random (MCAR) and missing not at random (MNAR), and evaluate their impact on both binary datasets, where entries indicate the presence or absence of resistance genes, and continuous datasets, where values represent the relative abundance of antimicrobial resistance genes. For binary datasets, we assess the robustness of each imputation method using standard classification metrics, including accuracy, precision, recall, F1-score, and the area under the receiver operating characteristic curve. For continuous datasets, we assess imputation performance using regression metrics, including normalized mean absolute error and normalized root mean squared error. Our results demonstrate that advanced imputation techniques substantially improve data completeness and model performance across both data types. These findings highlight the importance of tailored imputation strategies in enhancing the quality and reliability of AMR surveillance and predictive systems.
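
The evaluation described in the abstract (hide known entries, impute them, score the imputed values against the truth) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the article: the function names, the rank-based MNAR mechanism, the column-mean baseline imputer, and the choice to normalize both errors by the range of the true values.

```python
import math
import random
from bisect import bisect_left


def mask_mcar(rows, rate, rng):
    """MCAR: hide each entry with the same probability, whatever its value."""
    return [[rng.random() < rate for _ in row] for row in rows]


def mask_mnar(rows, rate, rng):
    """One simple MNAR mechanism (assumption): larger values are hidden more often."""
    flat = sorted(v for row in rows for v in row)
    top = max(len(flat) - 1, 1)
    # The hiding probability grows linearly with the value's rank among all entries.
    return [[rng.random() < min(1.0, 2.0 * rate * bisect_left(flat, v) / top)
             for v in row] for row in rows]


def impute_column_mean(rows, mask):
    """Baseline imputer: fill hidden entries with the mean of the observed column values.

    Assumes every column keeps at least one observed value.
    """
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row, m in zip(rows, mask) if not m[j]]
        means.append(sum(observed) / len(observed))
    return [[means[j] if m[j] else row[j] for j in range(n_cols)]
            for row, m in zip(rows, mask)]


def _errors(truth, imputed, mask):
    """Signed errors on the hidden entries only."""
    return [t - p
            for t_row, p_row, m_row in zip(truth, imputed, mask)
            for t, p, m in zip(t_row, p_row, m_row) if m]


def _value_range(truth):
    flat = [v for row in truth for v in row]
    return max(flat) - min(flat)


def nmae(truth, imputed, mask):
    """Mean absolute error on hidden entries, divided by the range of the true values."""
    errs = _errors(truth, imputed, mask)
    return sum(abs(e) for e in errs) / len(errs) / _value_range(truth)


def nrmse(truth, imputed, mask):
    """Root mean squared error on hidden entries, divided by the range of the true values."""
    errs = _errors(truth, imputed, mask)
    return math.sqrt(sum(e * e for e in errs) / len(errs)) / _value_range(truth)


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic stand-in for a relative-abundance matrix (not real AMR data).
    data = [[rng.gammavariate(2.0, 1.0) for _ in range(5)] for _ in range(200)]
    for name, make_mask in (("MCAR", mask_mcar), ("MNAR", mask_mnar)):
        hidden = make_mask(data, 0.2, rng)
        filled = impute_column_mean(data, hidden)
        print(name, "NMAE:", nmae(data, filled, hidden),
              "NRMSE:", nrmse(data, filled, hidden))
```

Under a mechanism like this MNAR one, the observed values are a biased sample of each column, which is one reason a column-mean baseline can be expected to fare worse than under MCAR. For binary presence/absence data, the same hide-and-score loop would instead compare imputed and true labels with classification metrics.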

Year: 2025

Keywords: Imputation, Machine learning, Immune system, Robustness, Databases, Data models, Accuracy, Surveillance, Strain, Standards

Appears in types: 1.1 Journal article

Files in this record:

File	Size	Format
Condorelli_Machine_Learning_Approaches_for_Handling_Missing_Data_in_Antimicrobial_Resistance_Databases 2025.pdf (open access; type: publisher's version (PDF); license: Creative Commons)	1.91 MB	Adobe PDF


Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.11769/710732
