Deepfake Audio Detection via MFCC Features Using Machine Learning
Corresponding authors: Abdul Rehman Javed (abdulrehman.cs@au.edu.pk) and Natalia Kryvinska (natalia.kryvinska@fm.uniba.sk)
ABSTRACT Deepfake content is created or altered synthetically using artificial intelligence (AI) approaches to appear real. It can include synthesized audio, video, images, and text. Deepfakes can now produce natural-looking content, making them harder to identify. Much progress has been achieved in identifying video deepfakes in recent years; nevertheless, most investigations of audio deepfake detection have employed the ASVspoof or AVSpoof datasets together with various machine learning and deep learning algorithms. This research uses machine learning and deep learning-based approaches to identify deepfake audio. The mel-frequency cepstral coefficients (MFCCs) technique is used to acquire the most useful information from the audio. We choose the Fake-or-Real dataset, the most recent benchmark dataset, which was created with text-to-speech models and is divided into four sub-datasets: for-rerec, for-2sec, for-norm, and for-original, partitioned according to audio length and bit rate. The experimental results show that the support vector machine (SVM) outperformed the other machine learning (ML) models in terms of accuracy on the for-rerec and for-2sec datasets, while the gradient boosting model performed very well on the for-norm dataset. The VGG-16 model produced highly encouraging results when applied to the for-original dataset and outperforms other state-of-the-art approaches.
INDEX TERMS Deepfakes, deepfake audio, synthetic audio, machine learning, acoustic data.
Deepfakes were expected to increase to over 730 by the end of 2020, according to predictions made on July 24 [9]. The authors of [10] found that most of the focus is on video deepfakes, particularly on developing video deepfakes.

Deepfakes are increasingly detrimental to privacy, social security, and authenticity. Recent works have focused on deepfake video detection, achieving high accuracy. However, audio spoofing and calls from malicious sources are also generated through deepfakes, and handling them requires a specially trained model. Deepfake audio detection based purely on audio is less explored than image- and video-based approaches, as those works simultaneously utilize the audio and the spatio-temporal information in the video to train the deep learning model. Nevertheless, classification and detection with an audio-only classifier are highly significant. Hence, to this end, we propose an approach based on multiple machine learning algorithms, among them Random Forest, Decision Tree, and SVM, to improve the accuracy of the classification models. We provide comparative results and analysis of the baseline models. We conducted our experiments on the Fake-or-Real dataset and its four sub-datasets.
ASVspoof2015 [11] is the first automatic speaker verification spoofing and countermeasures dataset that stimulated research in this field. It decreases the equal error rate (EER) to less than 1.5%, although some attacks still reach 50% EER, and unknown attacks can have a five times higher EER. Further, ASVspoof2017 [12] worked on the limits of replay spoofing attack detection: an EER of 6.73% was reached, and instantaneous frequency cosine coefficients (IFCC) drastically improve countermeasure performance. ASVspoof2019 [13] then put more emphasis on countermeasures concerning automatic speaker verification and spoofed audio detection. Beyond that, computer vision algorithms such as convolutional neural networks (CNNs) are used on low-quality audio spectrograms for synthetic speech detection [14]. Time information can be lost in CNN-based models; hence, probabilistic forecasting with a temporal convolutional neural network is used to improve automatic speaker verification and spoofed audio detection [15].
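Since the spoofing results above are all reported as equal error rates, a minimal sketch of how an EER is typically computed from detector scores may be useful as background; the toy labels and scores below are illustrative only and are not taken from any of the cited challenges:

import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """Approximate the EER: the operating point where the
    false-acceptance rate equals the false-rejection rate."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))  # threshold where FAR ~= FRR
    return (fpr[idx] + fnr[idx]) / 2

# Toy usage: 1 = spoofed, 0 = bona fide; scores are spoof likelihoods.
labels = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])
print(f"EER = {equal_error_rate(labels, scores):.3f}")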
This research aims to derive a methodology for identifying deepfake audio from non-synthetic or real audio. It provides the following contributions to identify deepfake audio effectively by resolving the restrictions discussed above:
• Propose a transfer learning-based approach to detect deepfakes.
• Extend work on deepfake audio detection on the Fake-or-Real dataset by conducting detailed experiments on its sub-datasets using machine and deep learning-based approaches.
• Use a superior feature extraction approach to obtain MFCC features from audio sources.
• Results reveal that the SVM model outperforms the other ML models on all dataset sub-sets except the for-original dataset, on which the VGG-16 model produced highly encouraging results.

The paper proceeds as follows. Section II presents the literature review. The suggested approach and algorithms are described in Section III. The analysis and results of the experiments are provided in Section IV. Section V presents the discussion of the proposed approach. Finally, Section VI presents the overall conclusion.

II. LITERATURE REVIEW
Deepfake audio is generated, edited, or synthesized using artificial intelligence so that it appears real. Detecting audio deepfakes is critical since they have been used in several illegal actions in banking, customer service, and call centers. To detect audio deepfakes, one must first understand the generation procedures. Audio deepfake algorithms are classified into three types: replay attack, speech synthesis, and voice conversion. This section gives the reader each subcategory's most recent and relevant frameworks.

Audio forensics is a branch of forensics used to authenticate, enhance, and analyze audio information to aid in investigating various crimes. Audio used as forensic evidence must be processed and analyzed before criminal prosecution; more significantly, it must be validated to demonstrate that it is genuine and has not been tampered with. Several methods, primarily AI/ML-based techniques, have been used to detect audio events in the last decade. A deep learning framework was employed by the authors of [16] for audio deepfake detection. Model separability is increased using a long short-term memory (LSTM)-based network, which is used to recognize events in sub-sampled signals [17]. To reduce the audio signal complexity and ease reconstruction, frequencies higher than the Nyquist frequency [18] are encoded, and the authors of [19] utilized non-uniform sampling for audio subsampling.

Replay attacks consist of repeatedly playing back a recording of the voice of the intended victim. They come in two forms: the first is far-field detection, and the second is copy-and-paste detection [20], [21]. At present, deep convolutional networks are used as a method for detecting complete replay attacks [22]. Several methods have been developed for identifying replay attacks, and they center on the characteristics that are provided to the network. The method of using deep convolutional networks to detect replay attacks was found to have an equal error rate (EER) of zero percent on the ASVspoof2017 training and test dataset [12].

Speech synthesis (SS) is the digital recreation of human speech, typically using computer software or hardware. Text-to-speech (TTS) is a component of SS that takes in written material and outputs spoken language based on that text according to predetermined linguistic rules. Text reading and AI personal assistants are just two applications of speech synthesis. Another perk of speech synthesis is that it can mimic various voices and dialects without relying on canned recordings.
Lyrebird (https://www.descript.com/lyrebird), a powerful speech synthesis company, employs deep learning models to synthesize 1,000 sentences in a second. The success of a TTS system is highly dependent on the quality of the speech corpus upon which it is built, and regrettably, collecting and annotating speech samples is expensive [23]. Char2Wav is a framework for end-to-end speech synthesis. PixelCNN is also the foundation of WaveNet [24], an SS framework. WaveGlow prioritizes stage two of the two-stage process (encoder and decoder) generally used by text-to-speech synthesis systems; therefore, WaveGlow is concerned with modifying specific time-aligned data, incorporating information into sound files by using encodings such as a mel-spectrogram. The Tacotron 2 [25] system comprises two parts. The first component is an attention-based recurrent sequence-to-sequence feature prediction network whose output is a sequence of predicted mel-spectrogram frames. A modified WaveNet vocoder is the second component.

For audio data, [26], [27] used GAN-based generative models. The model operates on mel spectrograms and employs a fully convolutional feed-forward network as the generator. The authors give a summary of their recently created dataset, which comprises 117,985 generated audio segments in 16-bit Pulse Code Modulation (PCM) WAV format and is available on Zenodo (https://zenodo.org/record/5642694).
Current studies show poor validation and testing performance when detecting deepfake audio. Feature-based techniques are required to improve the outputs of machine learning models. Deep learning approaches show better results but require greater training time and computational resources. Hence, the potential of machine learning approaches for deepfake detection is explored here, while the limitation of handling higher-dimensional feature sets and complexities can be addressed through a transfer learning-based deep learning approach.
III. PROPOSED METHODOLOGY
In machine learning, training a model always involves the trade-off between over-fitting and under-fitting, which negatively impacts the model's real-time performance. It is difficult to handle this trade-off so that models neither over-fit nor under-fit. One of the major issues in deepfake detection is the high false-positive rate, which occurs because most models classify an unseen pattern as abnormal if it is not included in the training set. This stems from the model's inability to be trained on a dataset that covers all possible patterns and cases, deepfake and real, which is regarded as a theoretical concept that cannot be implemented practically. Hence, the Fake-or-Real dataset [28] is divided into four sub-datasets: for-rerec, for-2sec, for-norm, and for-original, where the for-original dataset is the collection of the other three datasets without much preprocessing.

This research aims to develop a technique to classify deepfake synthetic audio under different background noises, audio sizes, and durations. We propose a framework that handles the big training set and performs detection using different supervised and unsupervised machine learning algorithms. The following subsections explain the proposed framework for all sub-datasets, including data handling, preprocessing, feature engineering, and the classification phase. Figure 1 shows the detailed architecture of the proposed framework, consisting of 1) data preprocessing, 2) feature extraction, and 3) classification models. The detailed description of each phase is as follows:
A. DATA PREPROCESSING
More than 195,000 real human and synthetic computer-generated speech samples are included in the Fake-or-Real (FoR) collection. Classifiers may be trained on the dataset to better identify fake speech. It includes output from Deep Voice 3 [29] and Google WaveNet TTS [24], as well as various human voice recordings. The dataset may be accessed in four different varieties: 1) for-original, 2) for-norm, 3) for-2sec, and 4) for-rerec. The original version includes the files without any changes from when they were first extracted from the speech sources. The for-norm version contains the same files as the first, but they have been standardized in terms of sampling rate, volume, and number of channels to achieve gender and class parity. The second version is the basis for the third (for-2sec), except that the files are truncated after 2 seconds instead of keeping their original length. The third and final version (for-rerec) is a re-recorded version of the for-2sec dataset, created to simulate an attacker transmitting an utterance via a voice channel. However, these datasets suffer from duplicate files, 0-bit files, and different bit rates in the audio signals, which negatively affect ML model training and performance. Hence, we preprocess the dataset to remove the duplicate and 0-bit files, which do not contribute to model training. Also, the bit rate is standardized by zero-padding any audio waveform with fewer than 16,000 samples, conforming to an operationally viable bit rate for the TensorFlow audio signal processing library. Finally, the data is normalized using a standard scaler to ease model training.
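A minimal sketch of the preprocessing just described (dropping 0-bit and duplicate files, zero-padding short waveforms to 16,000 samples, and standard scaling) is given below; the per-clip scaling and the MD5-based duplicate check are our assumptions, since the paper does not spell out these details:

import hashlib
import os
import numpy as np

TARGET_LEN = 16000  # samples; shorter waveforms are zero-padded

def keep_file(path: str, seen: set) -> bool:
    """Reject 0-bit files and exact duplicates (assumed MD5-based check)."""
    if os.path.getsize(path) == 0:
        return False
    digest = hashlib.md5(open(path, "rb").read()).hexdigest()
    if digest in seen:
        return False
    seen.add(digest)
    return True

def preprocess(wave: np.ndarray) -> np.ndarray:
    """Zero-pad to a fixed length, then standardize (standard-scaler style)."""
    if len(wave) < TARGET_LEN:
        wave = np.pad(wave, (0, TARGET_LEN - len(wave)))
    return (wave - wave.mean()) / (wave.std() + 1e-8)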
B. FEATURE EXTRACTION
A deepfake audio signal often has a feature set similar to that of the original signal, and distinguishing the two becomes increasingly challenging as deep learning approaches to generating deepfakes advance. Hence, the extracted features strongly affect the model's predictive power and accuracy. It is observed that audio signals in the frequency domain can provide features that are helpful in the detection and classification of deepfake audio that can deceive a human listener.
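As a concrete illustration of the frequency-domain features used throughout, here is a minimal MFCC extraction sketch with librosa; the 40-coefficient setting matches the MFCC-40 features mentioned later, while averaging over frames to obtain a fixed-size vector for the classical ML models is our assumption:

import librosa
import numpy as np

def extract_mfcc(path: str, sr: int = 16000, n_mfcc: int = 40) -> np.ndarray:
    """Compute an (n_mfcc x frames) MFCC matrix for one audio file."""
    wave, sr = librosa.load(path, sr=sr, mono=True)
    return librosa.feature.mfcc(y=wave, sr=sr, n_mfcc=n_mfcc)

# Averaging over time frames gives one fixed-size vector per clip,
# suitable for the classical ML models described below, e.g.:
# features = extract_mfcc("clip.wav").mean(axis=1)   # shape: (40,)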
FIGURE 3. In (a) and (b), the comparison between the deepfake and the real audio signal is shown as spectrograms, where the difference in amplitude is apparent. In (c) and (d), the amplitude is shown in decibels (dB) to expose the auditory parts of the audio signal.
of a feature for node j, while X′j is its normalized feature importance. RFXj is the importance of feature j averaged over all trees in the random forest, and Xjt is the normalized importance of feature j with respect to tree t. The model makes predictions based on the important features obtained, as mentioned in Figure 3.
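The averaged, normalized importance described above is what scikit-learn exposes as feature_importances_; a minimal sketch follows, with synthetic stand-in data since the real MFCC matrices are not reproduced here:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 40))      # stand-in for 40-dim MFCC vectors
y = rng.integers(0, 2, size=200)    # 0 = real, 1 = fake

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# feature_importances_ is the per-tree normalized importance averaged
# over all trees (it sums to 1), matching the RFXj quantity above.
ranking = np.argsort(rf.feature_importances_)[::-1]
print("most informative feature indices:", ranking[:5])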
2) SUPPORT VECTOR MACHINE (SVM)
SVM is a supervised learning method that relies primarily on two assumptions: 1) converting data into a high-dimensional space may reduce complex classification issues with complex decision surfaces to simpler problems that can be solved by making them linearly separable, and 2) only training patterns near the decision surface provide the most sensitive details for classification. Assume the deepfake detection problem is a binary classification with linearly separable vectors xi ∈ Rn, where the decision surface used to classify a pattern as belonging to one of the two classes is the hyperplane H0. If x is a vector in Rn, we define

f(x) = w · x + b    (4)

The dot product is represented by (·) in equation (4). The set of all vectors x that satisfy f(x) = 0 is denoted by H0. Assuming two hyperplanes H1 and H2, the distance between them is referred to as their margin, which can be represented as follows:

margin = 2 / ‖w‖    (5)
The decision hyperplane H0 depends on the vectors closest to the two parallel hyperplanes, called support vectors. The margin must be maximal to obtain a classifier that is not overly adapted to the training data. Consider a collection of training data vectors X = {x1, . . . , xL}, xi ∈ Rn, and a set of matching labels Y = {y1, . . . , yL}, yi ∈ {+1, −1}. We consider the hyperplane H0 to be optimally separating if the vectors are categorized without error and the margin is greatest. To be accurately categorized, the vectors must satisfy

f(xi) ≥ +1 for yi = +1    (6)
f(xi) ≤ −1 for yi = −1    (7)

Hence, finding the SVM classifying function H0 can be stated as follows:

minimize (1/2)‖w‖²    (8)
subject to yi f(xi) ≥ 1, ∀i    (9)
The SVM was chosen for its properties that aid in classifying deepfake audio. It performs well when there is a clear margin of separation between samples and is effective in high-dimensional environments. It employs a subset of the training points in the decision function, making it memory efficient, and it works well when the number of dimensions exceeds the number of samples. SVM does not perform very well on our for-original dataset because the required training time and the noise in that dataset are higher. It also does not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation that takes a long time to train. However, the clean datasets extracted from the for-original dataset perform better on the classification task. SVM has been shown to perform effectively on higher-dimensional data, most notably when detecting events in audio data. Hence, for deepfake audio, we implemented it using the Scikit-learn library with a radial basis function (RBF) kernel, C = 4, and probability = True.
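A minimal scikit-learn sketch with the settings reported above (RBF kernel, C = 4, probability = True); the synthetic data and train/test split are placeholders:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 40))        # stand-in MFCC feature vectors
y = rng.integers(0, 2, size=400)      # 0 = real, 1 = fake
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# probability=True triggers the internal (expensive) cross-validated
# probability calibration mentioned above.
svm = SVC(kernel="rbf", C=4, probability=True).fit(X_tr, y_tr)
fake_prob = svm.predict_proba(X_te)[:, 1]
print("test accuracy:", svm.score(X_te, y_te))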
3) MULTI-LAYER PERCEPTRON (MLP)
MLP is adequate for classification tasks; a multilayer perceptron can, through its layers, effectively filter the relevant features from the data and tune the parameters of the model for optimal predictions. There are at least three levels in the MLP model: an input layer, a hidden layer of computation nodes, and an output layer of processing nodes. In this study, we use the following MLP classifier hyperparameters: hidden layer size = 100, solver = adam or RMSprop (RMSprop is used for the smaller datasets), shuffle = True, verbose = False, and activation function = relu.
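The same configuration expressed in scikit-learn is sketched below; note that MLPClassifier supports the adam solver but not RMSprop, so the RMSprop runs mentioned above would require a different library (e.g., Keras), a detail the paper does not specify:

from sklearn.neural_network import MLPClassifier

# Reported settings: one hidden layer of 100 units, relu activation,
# adam solver, shuffle=True, verbose=False.
mlp = MLPClassifier(hidden_layer_sizes=(100,), activation="relu",
                    solver="adam", shuffle=True, verbose=False,
                    random_state=0)
# mlp.fit(X_train, y_train)   # X_train/y_train: MFCC feature vectors, labels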
4) EXTREME GRADIENT BOOSTING (XGB)
XGB is a parallel and optimized version of gradient boosting that combines efficiency and resource management. It implements gradient-boosted decision trees in an iterative model, combining weak base models into a stronger learner. The residual is utilized to refine the loss function and improve the prior prediction at each iteration of the gradient boosting algorithm. We use a learning rate of 0.1 and 10,000 estimators for the XGBoost algorithm. However, it is vulnerable to outliers because each successive classifier is compelled to correct the mistakes made by its prior learners; the estimators rely on historical predictions to determine their accuracy, and for this reason, streamlining the process is complex.
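With the xgboost package, the stated settings translate directly; everything beyond the learning rate and the number of estimators is left at library defaults, which is our assumption:

from xgboost import XGBClassifier

# Reported settings: learning rate 0.1 and 10,000 boosting rounds.
xgb = XGBClassifier(learning_rate=0.1, n_estimators=10000,
                    eval_metric="logloss")
# xgb.fit(X_train, y_train)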
IV. EXPERIMENTS AND RESULTS
About 195,000 human and synthetic speech samples were used to create the Fake-or-Real (FoR) dataset; Table 1 offers a summary of the dataset. Classifiers may be trained on the dataset to better identify fake speech. The dataset is an amalgamation of information from the following recent sources: first, text-to-speech programs such as Deep Voice 3 and Google WaveNet TTS [29]; secondly, many different types of recorded human voice, including recordings from the Arctic Dataset, LJSpeech Dataset, VoxForge Dataset, and user-submitted recordings [33], [34], [35]. The four dataset versions available for public consumption are for-original, for-norm, for-2sec, and for-rerec. The for-original folder stores the raw data from the speech sources.

The for-norm version has some duplicate files but is otherwise well-balanced across demographic categories (gender and socioeconomic status) and technical parameters (sample rate, volume, and number of channels). The third version is like the second, only the files are cut off after 2 seconds, and it is called for-2sec. The last variant, dubbed for-rerec, is a re-recording of the for-2sec dataset meant to mimic a situation in which an attacker transmits speech over a vocal channel like a phone call or voice message. We provide the outcomes of our binary classification analysis of the suggested method; Table 2 shows the experimental findings for spotting deepfakes.

TABLE 2. Accuracy comparison for machine learning models.

The experiments were also performed using noisy audio signals. For this purpose, we added synthetic noise to each audio signal of three datasets (for-2sec, for-norm, and for-rerec). This method kept both the original and the noisy audio in the dataset and increased the number of audio samples. The length of the original for-2sec dataset is 17,870 audio samples; after adding noise, the new dataset comprises 35,740 audio samples, and the same holds for the for-rerec and for-norm datasets.

TABLE 3. Accuracy comparison for noisy audio signals using machine learning models.
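The paper does not state the noise type or level used for this augmentation; the sketch below assumes additive white Gaussian noise at a fixed signal-to-noise ratio, which is one common choice:

import numpy as np

def add_white_noise(wave: np.ndarray, snr_db: float = 20.0,
                    rng=None) -> np.ndarray:
    """Return a noisy copy of `wave` at the requested SNR (assumed AWGN)."""
    rng = rng or np.random.default_rng(0)
    signal_power = np.mean(wave ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=wave.shape)
    return wave + noise

# Keeping the clean and noisy version of every clip doubles the dataset,
# e.g. 17,870 -> 35,740 samples for for-2sec.
clips = [np.sin(np.linspace(0, 100, 16000))]   # toy stand-in waveform list
augmented = [v for clip in clips for v in (clip, add_white_noise(clip))]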
A. FOR-REREC DATASET
The results for the for-rerec dataset are presented in Table 2. Multiple ML models are applied to obtain better results. The machine learning algorithms achieve the following accuracies: Support Vector Machine (SVM) 98.83%, Decision Tree 88.28%, Random Forest Classifier 96.60%, AdaBoost 87.67%, Gradient Boosting 93.51%, and XGB Classifier 93.40%. The SVM model exhibited the highest results on the for-rerec dataset.

The results of the noisy for-rerec audio signal classification are presented in Table 3. The MLP and SVM models obtained the highest accuracy scores of 98.66% and 98.43% compared to the other ML models. Other ML models such as DT, LR, and XGB obtained 82.12%, 88%, and 88.92% accuracy, respectively.
B. FOR-2SEC DATASET
The for-2sec dataset consists of audio in two-second clips. The audio is complex, as the information in such a small interval is limited; however, it is much easier for machine learning algorithms to process data in this form, and hence we observe better performance. The results are depicted in Table 2. We observe an MLP classifier accuracy of 94.69%, Random Forest of 94.44%, SVM of 97.57%, gradient boosting of 94.30%, and AdaBoost of 90.23%. The SVM model outperforms the other ML models in terms of accuracy.

Table 3 shows the results of the for-2sec dataset with noisy audio signal classification. Several ML models are used to get better outcomes: SVM obtained 99.59% accuracy, MLP 99.49%, and DT 87.52%, among others. The SVM exhibited the highest accuracy compared to the other ML models on noisy audio signals.
C. FOR-NORM DATASET
This dataset contains recorded audio in 12-second intervals. The results for the for-norm dataset are shown in Table 2: MLP Classifier 86.82%, Random Forest Classifier 90.60%, Extra Trees 91.46%, Gradient Boosting 92.63%, XGB Classifier 92.60%, LDA 91.35%, Gaussian NB 81.81%, and AdaBoost 89.40%. However, some algorithms show only average results, such as QDA at 61.36% and KNN at 64.21%. The Gradient Boosting classifier obtained the highest results compared to the other ML models.

The results for noisy audio from the for-norm dataset are presented in Table 3 and are lower than those of the other two datasets. The XGB model obtained the highest results using noisy audio from the for-norm dataset. All other ML models obtained reasonably good, but less impressive, results.

The deep learning experiments use the same visual MFCC features extracted from the audio data. These visual features train the VGG-16-based model and the LSTM to perform deepfake-or-real audio classification. Finally, the VGG16 model outperformed the LSTM model with a testing accuracy of 93%; the LSTM model obtained 91% accuracy. The VGG-16 model uses ImageNet weights and an input shape of (64 x 64 x 3). A validation accuracy of 0.94 and a validation loss of 0.14 are obtained, while the testing accuracy is 93%. Figure 4a shows the training and validation accuracy, while Figure 4b shows the training and validation loss of the VGG16 model.
FIGURE 4. Comparison between the validation and training (accuracy and loss).
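A minimal Keras sketch of the transfer-learning setup described above (ImageNet weights, 64 x 64 x 3 MFCC images); the classification head, frozen backbone, and optimizer are our assumptions, since the paper reports only the backbone, weights, and input shape:

import tensorflow as tf
from tensorflow.keras import layers, models

base = tf.keras.applications.VGG16(include_top=False, weights="imagenet",
                                   input_shape=(64, 64, 3))
base.trainable = False      # transfer learning: keep ImageNet features fixed

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(1, activation="sigmoid"),   # real vs. fake
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
# model.fit(mfcc_images, labels, validation_split=0.2, epochs=20)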
The Fake-or-Real dataset has been used in only one previous study [36]; because of this, the suggested method cannot be compared against many other studies. Our technique shows potential in terms of classification accuracy. This work obtained comparatively better results with ensemble-based machine learning models such as boosting algorithms, as the XGBoost algorithm shows greater accuracy than the baseline model. The models' accuracies for the three sub-datasets are shown in Tables 2 and 3.

Tables 2 and 3 compare the machine learning model results of the feature-based approach used to train the machine learning algorithms. Our approach of selecting the best features and ML classifiers obtained promising results on three datasets (for-rerec, for-2sec, and for-norm). However, the for-norm dataset does not perform well with our approach when using a simple SVM algorithm, as the data is high-dimensional; without dimensionality reduction on a complex dataset, it performs poorly. This dataset contains audio of length greater than 12 seconds; hence, a windowing technique can perform better in combination with MFCC. The proposed approach is compared with the baseline approach that used the for-original dataset for experimentation [36]. The existing approach used various ML models (SVM, RF, KNN, XGB) to detect deepfakes in the for-original dataset. The proposed approach obtains the highest testing score of 93%, which is 26% higher than the best score of the existing work using the SVM model. It is concluded that the proposed approach can efficiently detect deepfake audio.
The proposed and existing approaches' experimental settings are similar (dataset and data split). In addition, a comparative analysis of the proposed method against state-of-the-art feature extraction techniques is presented in Table 4. The proposed approach combines features from multiple feature extraction techniques and extracts the most optimal features for classification. Two deep learning models are employed in this research: the proposed approach uses VGG16 and LSTM models with a feature ensemble of MFCC-40, roll-off point, centroid, contrast, and bandwidth features. The features extracted from each method are combined for model classification.
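A sketch of how the feature ensemble above (MFCC-40 plus spectral roll-off, centroid, contrast, and bandwidth) could be assembled with librosa; summarizing each feature by its per-clip mean is our assumption:

import librosa
import numpy as np

def feature_ensemble(path: str, sr: int = 16000) -> np.ndarray:
    """Concatenate per-clip means of MFCC-40 and four spectral features."""
    y, sr = librosa.load(path, sr=sr, mono=True)
    feats = [
        librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40),        # (40, frames)
        librosa.feature.spectral_rolloff(y=y, sr=sr),       # (1, frames)
        librosa.feature.spectral_centroid(y=y, sr=sr),      # (1, frames)
        librosa.feature.spectral_contrast(y=y, sr=sr),      # (7, frames)
        librosa.feature.spectral_bandwidth(y=y, sr=sr),     # (1, frames)
    ]
    return np.concatenate([f.mean(axis=1) for f in feats])  # shape: (50,)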
The VGG16 model obtained the highest results compared to the existing study, with an accuracy of 93%; the LSTM model obtained an accuracy of 91%. The existing approach proposed by Khochare et al. used MFCC features and various machine learning models for deepfake audio detection [36]. They utilized 20 MFCC features for each audio clip and employed multiple machine learning models (SVM, RF, KNN, and XGB); using the 20 MFCC features with the SVM model, they obtained their highest accuracy of 67%. Another study, by Reimao and Tzerpos, used both machine learning and deep learning techniques along with various feature extraction methods [28]. The authors used timbre model analysis features (brightness, hardness, depth, roughness) with multiple ML models (NB, SVM, DT, and RF); according to the classification results, the SVM model using the various feature extraction methods obtained a 73.46% accuracy rate. Furthermore, STFT, mel-spectrogram, MFCC, and CQT feature extraction methods were used with the VGG19 model and obtained 89.79% accuracy. Compared to this previous research, our VGG16 model achieved the highest results, with an accuracy of 93%, and the LSTM model achieved 91% accuracy. The VGG16 model loss and the training and validation accuracy are shown in Figure 4. The proposed approach with the features mentioned in Section III-B outperforms the previous state-of-the-art feature extraction techniques, as presented in Table 4.
TABLE 4. Comparison between results of the proposed approach and existing approach.
V. DISCUSSION
This research extended the work on deepfake audio by extending the work on the Fake-or-Real dataset, a state-of-the-art dataset in audio detection and classification. We improved upon the performance of the algorithms previously trained with feature-based approaches by using MFCC-based features, indicating considerable improvements in accuracy. Our features outperform the feature-based approach by 10 to 20 percent on average across these datasets. The for-norm dataset performs poorly with our approach when using simple SVM algorithms; windowing techniques, in combination with MFCC, can perform better. We conduct additional experiments on machine learning algorithms categorized into (1) statistical models, such as QDA, LDA, and Gaussian Naive Bayes, for dimensionality reduction to reduce noise in the data; (2) tree-based models, such as Decision Tree, Extra Trees, and Random Forest, which can handle multidimensional data, do not require domain knowledge or parameter setting, and are appropriate for exploratory pattern detection; and (3) boosting models, namely AdaBoost, Gradient Boosting, and XGBoost, which fundamentally create several weak learners and combine their predictions to build a strong rule, helping to increase the accuracy of a model on feature-rich audio data. These three classes of ML algorithms were chosen for our approach to explore and improve their performance on MFCC-based feature sets. Besides this, we proposed a VGG-16-based deep learning model for the bigger dataset, which is the superset of the other three datasets. It uses transfer learning and is trained on MFCC image features. We obtained an accuracy of 93% while using half of the original dataset; a larger amount of data correlates with higher model accuracy, and we attempted to reach this performance with a limited dataset. The entire dataset can be explored for even better results in the future.
VI. CONCLUSION
The detection of deepfake audio is significant as an essential tool for enhancing security against scamming and spoofing. Deepfake audio has garnered significant public attention as society rapidly recognizes its possible security danger; however, deepfake audio is mostly studied in combination with the spatio-temporal data of video. This study improves upon work on the Fake-or-Real (FoR) dataset, which comprises state-of-the-art audio datasets and custom audio for deepfake audio classification, and is further compiled into four sub-datasets. This study conducted experiments with multiple audio data features to detect deepfakes in audio data. This work extracts MFCC features from audio for feature engineering, and several machine learning algorithms are applied to the selected feature set to detect deepfake audio. This approach gave higher accuracy and results in all cases than other state-of-the-art studies for audio data. This study obtained 97.57% accuracy with SVM on the for-2sec dataset compared to other ML models, while 92.63% was obtained by the Gradient Boosting classifier on the for-norm dataset, and the highest accuracy of 98.83% was obtained with the SVM model on the for-rerec dataset. We plan to explore different window sizes for MFCC and various input sizes for the models in the future. Future work can also evaluate these models against potential fluctuation and distortion in the audio signal to understand which has the greater effect. Moreover, studies on state-of-the-art few-shot learning and Bidirectional Encoder Representations from Transformers (BERT)-based models can be conducted. Furthermore, we plan to evaluate our models in ambient noise and reverberation circumstances. We intend to use feature extraction methods such as i-vector, x-vector, a combination of MFCC and GFCC, and a combination of DWT and MFCC, which were not taken into account in the current set of experiments because this is the beginning of our journey to identify deepfake audio.

REFERENCES
[1] A. Abbasi, A. R. R. Javed, A. Yasin, Z. Jalil, N. Kryvinska, and U. Tariq, ''A large-scale benchmark dataset for anomaly detection and rare event classification for audio forensics,'' IEEE Access, vol. 10, pp. 38885–38894, 2022.
[2] A. R. Javed, W. Ahmed, M. Alazab, Z. Jalil, K. Kifayat, and T. R. Gadekallu, ''A comprehensive survey on computer forensics: State-of-the-art, tools, techniques, challenges, and future directions,'' IEEE Access, vol. 10, pp. 11065–11089, 2022.
[3] A. R. Javed, Z. Jalil, W. Zehra, T. R. Gadekallu, D. Y. Suh, and M. J. Piran, ''A comprehensive survey on digital video forensics: Taxonomy, challenges, and future directions,'' Eng. Appl. Artif. Intell., vol. 106, Nov. 2021, Art. no. 104456.
[4] A. Ahmed, A. R. Javed, Z. Jalil, G. Srivastava, and T. R. Gadekallu, ''Privacy of web browsers: A challenge in digital forensics,'' in Proc. Int. Conf. Genetic Evol. Comput., Springer, 2021, pp. 493–504.
[5] A. R. Javed, F. Shahzad, S. U. Rehman, Y. B. Zikria, I. Razzak, Z. Jalil, and G. Xu, ''Future smart cities: Requirements, emerging technologies, applications, challenges, and future aspects,'' Cities, vol. 129, Oct. 2022, Art. no. 103794.
[6] A. Abbasi, A. R. Javed, F. Iqbal, Z. Jalil, T. R. Gadekallu, and N. Kryvinska, ''Authorship identification using ensemble learning,'' Sci. Rep., vol. 12, no. 1, pp. 1–16, Jun. 2022.
[7] S. Anwar, M. O. Beg, K. Saleem, Z. Ahmed, A. R. Javed, and U. Tariq, ''Social relationship analysis using state-of-the-art embeddings,'' ACM Trans. Asian Low-Resource Lang. Inf. Process., Jun. 2022.
[8] C. Stupp, ''Fraudsters used AI to mimic CEO's voice in unusual cybercrime case,'' Wall Street J., vol. 30, no. 8, pp. 1–2, 2019.
[9] T. T. Nguyen, Q. V. H. Nguyen, D. T. Nguyen, D. T. Nguyen, T. Huynh-The, S. Nahavandi, T. T. Nguyen, Q.-V. Pham, and C. M. Nguyen, ''Deep learning for deepfakes creation and detection: A survey,'' 2019, arXiv:1909.11573.
[10] Z. Khanjani, G. Watson, and V. P. Janeja, ''How deep are the fakes? Focusing on audio deepfake: A survey,'' 2021, arXiv:2111.14203.
[11] Z. Wu, T. Kinnunen, N. Evans, J. Yamagishi, C. Hanilçi, M. Sahidullah, and A. Sizov, ''ASVspoof 2015: The first automatic speaker verification spoofing and countermeasures challenge,'' in Proc. Interspeech, Sep. 2015.
[12] T. Kinnunen, M. Sahidullah, H. Delgado, M. Todisco, N. Evans, J. Yamagishi, and K. A. Lee, ''The ASVspoof 2017 challenge: Assessing the limits of replay spoofing attack detection,'' in Proc. 18th Annu. Conf. Int. Speech Commun. Assoc., 2017, pp. 2–6.
[13] J. Yamagishi, M. Todisco, M. Sahidullah, H. Delgado, X. Wang, N. Evans, T. Kinnunen, K. A. Lee, V. Vestman, and A. Nautsch. (2019). ASVspoof 2019: Automatic Speaker Verification Spoofing and Countermeasures Challenge Evaluation Plan. [Online]. Available: http://www.asvspoof.org
[14] S. Ö. Arık, H. Jun, and G. Diamos, ''Fast spectrogram inversion using multi-head convolutional neural networks,'' IEEE Signal Process. Lett., vol. 26, no. 1, pp. 94–98, Jan. 2019.
[15] Y. Chen, Y. Kang, Y. Chen, and Z. Wang, ''Probabilistic forecasting with temporal convolutional neural network,'' Neurocomputing, vol. 399, pp. 491–501, Jul. 2020.
[16] Y. Kawaguchi, ''Anomaly detection based on feature reconstruction from subsampled audio signals,'' in Proc. 26th Eur. Signal Process. Conf. (EUSIPCO), Sep. 2018, pp. 2524–2528.
[17] Y. Kawaguchi and T. Endo, ''How can we detect anomalies from subsampled audio signals?'' in Proc. IEEE 27th Int. Workshop Mach. Learn. Signal Process. (MLSP), Sep. 2017, pp. 1–6.
[18] H. J. Landau, ''Sampling, data transmission, and the Nyquist rate,'' Proc. IEEE, vol. 55, no. 10, pp. 1701–1706, Oct. 1967.
[19] H. Yu, Z.-H. Tan, Z. Ma, R. Martin, and J. Guo, ''Spoofing detection in automatic speaker verification systems using DNN classifiers and dynamic acoustic features,'' IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 10, pp. 4633–4644, Oct. 2018.
[20] S. Pradhan, W. Sun, G. Baig, and L. Qiu, ''Combating replay attacks against voice assistants,'' Proc. ACM Interact., Mobile, Wearable Ubiquitous Technol., vol. 3, no. 3, pp. 1–26, Sep. 2019.
[21] J. Villalba and E. Lleida, ''Preventing replay attacks on speaker verification systems,'' in Proc. Carnahan Conf. Secur. Technol., Oct. 2011, pp. 1–8.
[22] F. Tom, M. Jain, and P. Dey, ''End-to-end audio replay attack detection using deep convolutional networks with attention,'' in Proc. Interspeech, Hyderabad, 2018, pp. 681–685.
[23] K. Kuligowska, P. Kisielewicz, and A. Włodarz, ''Speech synthesis systems: Disadvantages and limitations,'' Int. J. Res. Eng. Technol., vol. 7, no. 83, pp. 234–239, 2018.
[24] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, ''WaveNet: A generative model for raw audio,'' 2016, arXiv:1609.03499.
[25] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan, R. A. Saurous, Y. Agiomyrgiannakis, and Y. Wu, ''Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions,'' in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Apr. 2018, pp. 4779–4783.
[26] J. Frank and L. Schönherr, ''WaveFake: A data set to facilitate audio deepfake detection,'' 2021, arXiv:2111.02813.
[27] M. Hassaballah, M. A. Hameed, and M. H. Alkinani, ''Introduction to digital image steganography,'' in Digital Media Steganography. Amsterdam, The Netherlands: Elsevier, 2020, pp. 1–15.
[28] R. Reimao and V. Tzerpos, ''FoR: A dataset for synthetic speech detection,'' in Proc. Int. Conf. Speech Technol. Hum.-Comput. Dialogue (SpeD), Oct. 2019, pp. 1–10.
[29] W. Ping, K. Peng, A. Gibiansky, S. O. Arik, A. Kannan, S. Narang, J. Raiman, and J. Miller, ''Deep voice 3: Scaling text-to-speech with convolutional sequence learning,'' 2017, arXiv:1710.07654.
[30] F. M. Rammo and M. N. Al-Hamdani, ''Detecting the speaker language using CNN deep learning algorithm,'' Iraqi J. Comput. Sci. Math., vol. 3, no. 1, pp. 43–52, Jan. 2022.
[31] S. Ahmed, Z. A. Abbood, H. M. Farhan, B. T. Yasen, M. R. Ahmed, and A. D. Duru, ''Speaker identification model based on deep neural networks,'' Iraqi J. Comput. Sci. Math., vol. 3, no. 1, pp. 108–114, Jan. 2022.
[32] A. Winursito, R. Hidayat, and A. Bejo, ''Improvement of MFCC feature extraction accuracy using PCA in Indonesian speech recognition,'' in Proc. Int. Conf. Inf. Commun. Technol. (ICOIACT), Mar. 2018, pp. 379–383.
[33] J. Kominek and A. W. Black, ''The CMU Arctic speech databases,'' in Proc. 5th ISCA Workshop Speech Synth., 2004.
[34] K. Ito and L. Johnson. (2017). The LJ Speech Dataset. [Online]. Available: https://keithito.com/LJ-Speech-Dataset/
[35] K. MacLean. (2018). VoxForge. [Online]. Available: http://www.voxforge.org/home
[36] J. Khochare, C. Joshi, B. Yenarkar, S. Suratkar, and F. Kazi, ''A deep learning framework for audio deepfake detection,'' Arabian J. Sci. Eng., vol. 47, pp. 1–12, Nov. 2021.

AMEER HAMZA is currently pursuing the master's degree in artificial intelligence with the Department of Creative Technology, Air University, Islamabad, Pakistan.

ABDUL REHMAN JAVED (Member, IEEE) received the master's degree in computer science from the FAST National University of Computer and Emerging Sciences, Islamabad, Pakistan. He has worked at the National Cybercrimes and Forensics Laboratory, Air University, Islamabad, where he is currently a Lecturer with the Department of Cyber Security. He is also a cyber security researcher and practitioner with industry and academic experience. He is supervising/co-supervising several graduate (B.S. and M.S.) students on health informatics, cybersecurity, mobile computing, and digital forensics topics. He has reviewed more than 150 scientific research articles for various well-known journals and has authored more than 50 peer-reviewed research articles. His research interests include mobile and ubiquitous computing, data analysis, knowledge discovery, data mining, natural language processing, smart homes, and their applications in human activity analysis, human motion analysis, and e-health. He aims to contribute to interdisciplinary research in computer science and human-related disciplines. He is a member of ACM and a TPC Member of CID2021 (Fourth International Workshop on Cybercrime Investigation and Digital Forensics) and the 44th International Conference on Telecommunications and Signal Processing. He has served as a Moderator for the 1st IEEE International Conference on Cyber Warfare and Security (ICCWS).
FARKHUND IQBAL (Member, IEEE) received the master's and Ph.D. degrees from Concordia University, Canada, in 2005 and 2011, respectively. He is currently working as an Associate Professor at the College of Technological Innovation, Zayed University, United Arab Emirates. He is also an Affiliate Professor with the School of Information Studies, McGill University, Canada, and an Adjunct Professor with the Faculty of Business and Information Technology, Ontario Tech University, Canada. He leads the Cyber Security and Digital Forensics (CAD) Research Group, Center for Smart Cities and Intelligent Systems, Zayed University. He has published more than 120 papers in high-ranked journals and conferences. His research interests include artificial intelligence, machine learning, and data analytics techniques for problem-solving in cybersecurity, health care, and cybercrime investigation in the smart city domain. He has served as the chair and co-chair for several IEEE/ACM conferences and has been a guest editor and reviewer for multiple high-rank journals.

NATALIA KRYVINSKA received the Ph.D. degree in electrical and IT engineering from the Vienna University of Technology, Austria, and the Habilitation (Docent Title) degree in management information systems from Comenius University in Bratislava, Bratislava, Slovakia. She received her Professor title and was appointed to the professorship by the President of the Slovak Republic. She is currently a Full Professor and the Head of the Department of Information Systems, Faculty of Management, Comenius University in Bratislava. Previously, she served as a University Lecturer and a Senior Researcher at the Department of e-Business, School of Business Economics and Statistics, University of Vienna. Her research interests include complex service systems engineering, service analytics, and applied mathematics.

AHMAD S. ALMADHOR (Member, IEEE) received the B.S.E. degree in computer science from Jouf University (formerly Jouf College), Saudi Arabia, in 2005, the M.E. degree in computer science and engineering from the University of South Carolina, Columbia, SC, USA, in 2010, and the Ph.D. degree in electrical and computer engineering from the University of Denver, Denver, CO, USA, in 2019. From 2006 to 2008, he was a Teaching Assistant and the College of Sciences Manager, then a Lecturer, from 2011 to 2012, at Jouf University. He was then a Senior Graduate Assistant and a Tutor Advisor at the University of Denver, in 2013 and 2019, respectively. He is currently an Assistant Professor of CEN and VD with the College of Computer and Information Science, Jouf University. His research interests include AI, blockchain, networks, smart and microgrid cyber security, integration, image processing, video surveillance systems, PV, EV, and machine and deep learning. He was a recipient of awards and honors, including the Aljouf University Scholarship (Royal Embassy of Saudi Arabia in D.C.) and the Aljouf Governor's Award for Excellency.

ZUNERA JALIL received the master's degree in computer science from the Higher Education Commission of Pakistan, in 2007, and the Ph.D. degree in computer science with a specialization in information security from the FAST National University of Computer and Emerging Sciences, Islamabad, Pakistan, in 2010. She has served as a full-time Faculty Member at International Islamic University, Islamabad; Iqra University, Islamabad; and Saudi Electronic University, Riyadh, Saudi Arabia. She is currently an Assistant Professor with the Department of Cyber Security, Faculty of Computing and Artificial Intelligence, Air University, Islamabad, and a Senior Researcher with the National Cybercrimes and Forensics Laboratory, National Center for Cyber Security, Islamabad. Her research interests include computer forensics, machine learning, criminal profiling, software watermarking, intelligent systems, and data privacy protection. She received a scholarship for her master's degree.

ROUBA BORGHOL received the master's degree in applied mathematics from the University of Claude Bernard II, Lyon, and the Ph.D. degree in mathematics from the University of Tours, France, in December 2005. She was a Lecturer at the University of Tours, from 2005 to 2007, a Research Fellow at the Polytechnic School, Palaiseau, France, in 2008, and an Assistant Professor at Lebanese University, Lebanon, from 2008 to 2009, and at the College of Applied Science and Dhofar University, from 2010 to 2013. She is currently an Assistant Professor of mathematics with the Rochester Institute of Technology of Dubai. Throughout her 15 years in academia, she has taught several courses and topics, such as pure and applied mathematics courses for both undergraduate and graduate programs.