Prediction of non-coding RNAs in Fusobacterium nucleatum-infected mice using machine learning


Yadalam P. K.; Sivasankari T.; Saravanan M.; Srivastava K. C.; Shrivastava D.; Marrapodi M. M.; Cicciù M.; Minervini G.

2026-01-01

Abstract

Background. The anaerobic commensal Fusobacterium nucleatum is scarce in healthy subgingival dental biofilms but highly prevalent in periodontal pockets. Numerous genome-wide association studies and gene expression studies using microarrays or RNA sequencing (RNA-Seq) have been performed to better understand the genetic architecture of periodontal disease. However, these investigations have limited predictive capacity for identifying RNAs, particularly non-coding RNAs (ncRNAs). How F. nucleatum regulates ncRNAs to alter disease progression in mice has not been thoroughly investigated.

Objectives. The aim of the study was to predict previously uncharacterized ncRNAs in F. nucleatum-infected mice using machine learning (ML).

Material and methods. Long non-coding RNAs (lncRNAs) and circular RNAs (circRNAs) were identified from the periodontitis gene expression dataset GSE225589, obtained from the Gene Expression Omnibus (GEO) database, and subsequently preprocessed. The lncRNAs and circRNAs were labeled based on gene expression. Transcriptomic features were analyzed using 3 ML algorithms: random forest (RF), adaptive boosting (AdaBoost) and naïve Bayes (NB). The labeled dataset was divided into training (80%) and testing (20%) subsets, with cross-validation. Receiver operating characteristic (ROC) curves, confusion matrices and area under the ROC curve (AUC) values were determined.

Results. The RF and AdaBoost models outperformed the NB model in classifying lncRNAs and circRNAs. Both RF and AdaBoost achieved an AUC of 100%, whereas the NB model achieved a lower AUC of 92%.

Conclusions. This study is the first to apply ML to predict ncRNAs in F. nucleatum-infected mice using transcriptomic data. Random forest and AdaBoost showed superior classification performance in identifying lncRNAs and circRNAs associated with the infection. Further studies with larger cohorts and external validation are needed to confirm these findings.
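The evaluation workflow described in the Material and methods (an 80%/20% split, cross-validation, and comparison of RF, AdaBoost and NB by AUC and confusion matrix) can be sketched as follows. This is only an illustration under stated assumptions, not the authors' pipeline: it uses scikit-learn, a synthetic feature matrix in place of the GSE225589 data, a hypothetical binary label, and arbitrary hyperparameters, fold count and random seeds.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for a labeled expression matrix (NOT the GSE225589 data).
# Rows play the role of transcripts, columns of expression-derived features;
# y is a hypothetical binary class label.
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=8, random_state=0
)

# 80% training / 20% testing, stratified so both subsets keep the class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0
)

# Hyperparameters are illustrative assumptions, not values from the study.
models = {
    "RF": RandomForestClassifier(n_estimators=200, random_state=0),
    "AdaBoost": AdaBoostClassifier(n_estimators=100, random_state=0),
    "NB": GaussianNB(),
}

results = {}
for name, model in models.items():
    # 5-fold cross-validated AUC on the training subset only.
    cv_auc = cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc")

    # Refit on the full training subset, then score the held-out test subset.
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, proba)  # points of the ROC curve

    results[name] = {
        "cv_auc": float(cv_auc.mean()),
        "test_auc": float(roc_auc_score(y_test, proba)),
        "roc": (fpr, tpr),
        "confusion": confusion_matrix(y_test, model.predict(X_test)),
    }

for name, r in results.items():
    print(f"{name}: CV AUC = {r['cv_auc']:.3f}, test AUC = {r['test_auc']:.3f}")
    print(r["confusion"])
```

Running cross-validation on the training subset only keeps the 20% test subset untouched until the final comparison, so the test AUC is not influenced by any choice made during model fitting.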

Year: 2026

Keywords: machine learning; non-coding RNAs; periodontal disease; transcriptomics

Appears in categories: 1.1 Journal article

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.11769/715537