CNNMC: a convolutional neural network with Monte Carlo dropout for speaker recognition
Spata M. O. (first author): Writing – Review & Editing; Ortis A. (second author): Conceptualization; Fargetta G. (penultimate author): Conceptualization; Battiato S. (last author): Supervision
2025-01-01
Abstract
Speaker recognition is the task of identifying or verifying a person’s identity using their voice. This problem involves challenges such as variation in speech due to emotional state and health condition, heterogeneity of microphone models, and different environments and background noise. Accurate speaker recognition is critical for security, personalization, and forensic applications. Applying a CNN with Monte Carlo dropout can enhance speaker recognition by enabling robust uncertainty-aware predictions, making the presented architecture particularly effective for smaller, noisy datasets without the need for large-scale pre-training. This approach helps mitigate overfitting and improves generalization, making it effective in handling diverse speech patterns. The designed deep learning model achieves a peak validation accuracy of 93.27% for speaker recognition on a dataset recorded in the wild over the phone, and an equal error rate (EER) of 0.030, showing competitive performance with respect to state-of-the-art baselines.
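The core idea of Monte Carlo dropout mentioned in the abstract is to keep dropout active at inference time and average several stochastic forward passes; the spread of those passes gives an uncertainty estimate alongside the class prediction. The sketch below illustrates this on a toy dense head with NumPy; the layer sizes, dropout rate, and number of passes are illustrative assumptions, not the paper's CNNMC architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy trained weights: stand-ins for the dense head of a speaker-recognition CNN.
W1 = rng.standard_normal((16, 8)) * 0.1   # embedding -> hidden
W2 = rng.standard_normal((8, 4)) * 0.1    # hidden -> 4 speaker classes

def stochastic_forward(x, p=0.5):
    """One forward pass with (inverted) dropout deliberately kept ON."""
    h = np.maximum(x @ W1, 0.0)                      # ReLU hidden layer
    mask = (rng.random(h.shape) > p) / (1.0 - p)     # inverted dropout mask
    h = h * mask
    logits = h @ W2
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)         # softmax over speakers

def mc_dropout_predict(x, T=100, p=0.5):
    """Average T stochastic passes; per-class std is an uncertainty estimate."""
    probs = np.stack([stochastic_forward(x, p=p) for _ in range(T)])
    return probs.mean(axis=0), probs.std(axis=0)

x = rng.standard_normal(16)                  # stand-in for an utterance embedding
mean_probs, uncertainty = mc_dropout_predict(x)
predicted_speaker = int(np.argmax(mean_probs))
```

In practice the same pattern applies to a full CNN (e.g. by leaving the dropout layers in training mode at test time): a high per-class standard deviation flags utterances, such as very noisy phone recordings, whose identity decision should be treated with caution.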


