Breaking the Silence: Detecting AI-Converted Voices in the Quietest Moments
Stanco F. (Writing – Review & Editing); Battiato S. (Funding Acquisition); Allegra D. (Supervision)
2025-01-01
Abstract
Recent advances in artificial intelligence have facilitated the creation of audio deepfakes: synthetic voice recordings that closely mimic human speech. While this technology offers potential benefits in various fields, it also raises significant ethical and legal concerns, particularly regarding its potential misuse in fraud, misinformation, and identity theft. This paper addresses the critical need for reliable and explainable deepfake audio detection methods by developing and evaluating an approach that prioritizes handcrafted audio features over complex neural networks. Our method uses a Random Forest classifier to analyze these handcrafted features, with a particular focus on the silent segments of audio samples. Using the DEEP-VOICE dataset, which contains audio samples produced by voice conversion techniques, we aim to demonstrate that analyzing the silent parts of an audio sample alone can achieve high-performance deepfake detection. Experiments were conducted on both entire audio samples and their silent segments. The results indicate no significant difference in accuracy between the two approaches, showing that discriminative features of audio deepfakes are present even in the quietest moments. This finding underscores the efficacy of our method and suggests potential advantages in computational efficiency, robustness, and explainability due to the use of interpretable, handcrafted features.
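The abstract describes a three-step pipeline: isolate the silent segments of a recording, compute handcrafted features on them, and classify with a Random Forest. The sketch below illustrates one plausible realization of that pipeline, assuming librosa for silence detection and scikit-learn for the classifier; the feature set (MFCC statistics, spectral centroid, RMS energy, zero-crossing rate), the top_db silence threshold, and all function names are illustrative assumptions, not the paper's exact configuration.

```python
# Hypothetical sketch of silence-based deepfake detection with handcrafted
# features and a Random Forest. Feature choices and thresholds are assumed,
# not taken from the paper.
import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier


def silent_segments(y, top_db=30):
    """Keep only the samples OUTSIDE librosa's non-silent intervals."""
    voiced = librosa.effects.split(y, top_db=top_db)  # non-silent (start, end) pairs
    mask = np.ones(len(y), dtype=bool)
    for start, end in voiced:
        mask[start:end] = False
    silent = y[mask]
    # Fall back to the full signal if no frame falls below the threshold.
    return silent if len(silent) > 0 else y


def handcrafted_features(y, sr):
    """Summary statistics of a few standard handcrafted descriptors."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)
    rms = librosa.feature.rms(y=y)
    zcr = librosa.feature.zero_crossing_rate(y)
    return np.concatenate([
        mfcc.mean(axis=1), mfcc.std(axis=1),
        [centroid.mean(), rms.mean(), zcr.mean()],
    ])


def features_for_file(path, silence_only=True):
    """Load a file and return one feature vector (whole audio or silence only)."""
    y, sr = librosa.load(path, sr=None)
    if silence_only:
        y = silent_segments(y)
    return handcrafted_features(y, sr)


# Training: one feature vector per file, label 1 = AI-converted, 0 = real, e.g.
#   X = np.stack([features_for_file(p) for p in paths])
#   clf.fit(X, labels); clf.predict(X_test)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
```

Running the same pipeline twice, once with silence_only=True and once with False, would mirror the paper's comparison between silent segments and entire recordings; a further practical benefit of the Random Forest here is that its feature importances indicate which handcrafted descriptors drive each decision, which is the explainability advantage the abstract points to.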


