Deepfake Audio Detection via MFCC Features Using Machine Learning
Corresponding authors: Abdul Rehman Javed (abdulrehman.cs@au.edu.pk) and Natalia Kryvinska (natalia.kryvinska@fm.uniba.sk)
ABSTRACT Deepfake content is created or altered synthetically using artificial intelligence (AI) approaches to appear real. It can include synthesized audio, video, images, and text. Deepfakes can now produce natural-looking content, making them harder to identify. Much progress has been achieved in identifying video deepfakes in recent years; nevertheless, most investigations of audio deepfake detection have employed the ASVspoof or AVSpoof datasets together with various machine learning and deep learning algorithms. This research uses machine learning and deep learning-based approaches to identify deepfake audio. The mel-frequency cepstral coefficients (MFCCs) technique is used to acquire the most useful information from the audio. We choose the Fake-or-Real dataset, the most recent benchmark dataset, which was created with text-to-speech models and is divided into four sub-datasets: for-rerec, for-2sec, for-norm, and for-original, partitioned according to audio length and bit rate. The experimental results show that the support vector machine (SVM) outperformed the other machine learning (ML) models in terms of accuracy on the for-rerec and for-2sec datasets, while the gradient boosting model performed very well on the for-norm dataset. The VGG-16 model produced highly encouraging results when applied to the for-original dataset and outperforms other state-of-the-art approaches.
INDEX TERMS Deepfakes, deepfake audio, synthetic audio, machine learning, acoustic data.
Deepfakes were expected to increase to over 730 by the end of 2020, according to predictions made on July 24 [9]. The authors of [10] found that most of the focus is on video deepfakes, particularly on developing video deepfakes.

Deepfakes are increasingly detrimental to privacy, social security, and authenticity. Recent works have focused on deepfake video detection, achieving high accuracy. However, audio spoofing and calls from malicious sources are also generated through deepfakes, and handling them requires a specially trained model. Deepfake audio detection based purely on audio is less explored than image- and video-based approaches, as those works simultaneously utilize the audio and the spatio-temporal information in the video to train the deep learning model. Nevertheless, classification and detection with an audio-only classifier are highly significant. Hence, to this end, we propose an approach based on multiple machine learning algorithms, among them Random Forest, Decision Tree, and SVM, to improve the accuracy of the classification models. We provide comparative results and analysis of the baseline models. We conducted our experiments on the Fake-or-Real dataset and its four sub-datasets.
ASVspoof2015 [11] is the first automatic speaker verification spoofing and countermeasures dataset that stimulated research in this field. It decreases the equal error rate (EER) to less than 1.5%, although some attacks still reach 50% EER, and unknown attacks can have a five times higher EER. Further, ASVspoof2017 [12] worked on the limits of replay spoofing attack detection: an EER of 6.73% was reached, and instantaneous frequency cosine coefficients (IFCC) drastically improve countermeasure performance. ASVspoof2019 [13] then put more emphasis on countermeasures concerning automatic speaker verification and spoofed audio detection. Beyond that, computer vision algorithms such as convolutional neural networks (CNNs) are used on low-quality audio spectrograms for synthetic speech detection [14]. Time information can be lost in CNN-based models; hence, probabilistic forecasting with a temporal convolutional neural network is used to improve automatic speaker verification and spoofed audio detection [15].
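Since the spoofing results above are all reported as equal error rates, a minimal sketch of how an EER is typically computed from detector scores may be useful as background; the toy labels and scores below are illustrative only and are not taken from any of the cited challenges:

import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """Approximate the EER: the operating point where the
    false-acceptance rate equals the false-rejection rate."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))  # threshold where FAR ~= FRR
    return (fpr[idx] + fnr[idx]) / 2

# Toy usage: 1 = spoofed, 0 = bona fide; scores are spoof likelihoods.
labels = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])
print(f"EER = {equal_error_rate(labels, scores):.3f}")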
This research aims to derive a methodology for identifying deepfake audio from non-synthetic or real audio. It provides the following contributions to identify deepfake audio effectively by resolving the restrictions discussed above:
• Propose a transfer learning-based approach to detect deepfakes.
• Extend work on deepfake audio detection on the Fake-or-Real dataset by conducting detailed experiments on its sub-datasets using machine and deep learning-based approaches.
• Use a superior feature extraction approach to obtain MFCC features from audio sources.
• Results reveal that the SVM model outperforms the other ML models on all dataset sub-sets except the for-original dataset, on which the VGG-16 model produced highly encouraging results.

The paper proceeds as follows. Section II presents the literature review. The suggested approach and algorithms are described in Section III. The analysis and results of the experiments are provided in Section IV. Section V presents the discussion of the proposed approach. Finally, Section VI presents the overall conclusion.

II. LITERATURE REVIEW
Deepfake audio is generated, edited, or synthesized using artificial intelligence so that it appears real. Detecting audio deepfakes is critical since they have been used in several illegal actions in banking, customer service, and call centers. To detect audio deepfakes, one must first understand the generation procedures. Audio deepfake algorithms are classified into three types: replay attack, speech synthesis, and voice conversion. This section gives the reader each subcategory's most recent and relevant frameworks.

Audio forensics is a branch of forensics used to authenticate, enhance, and analyze audio information to aid in investigating various crimes. Audio used as forensic evidence must be processed and analyzed before criminal prosecution; more significantly, it must be validated to demonstrate that it is genuine and has not been tampered with. Several methods, primarily AI/ML-based techniques, have been used to detect audio events in the last decade. A deep learning framework was employed by the authors of [16] for audio deepfake detection. Model separability is increased using a long short-term memory (LSTM)-based network, which is used to recognize events in sub-sampled signals [17]. To reduce the audio signal complexity and ease reconstruction, frequencies higher than the Nyquist frequency [18] are encoded, and the authors of [19] utilized non-uniform sampling for audio subsampling.

Replay attacks consist of repeatedly playing back a recording of the voice of the intended victim. They come in two forms: the first is far-field detection, and the second is copy-and-paste detection [20], [21]. At present, deep convolutional networks are used as a method for detecting complete replay attacks [22]. Several methods have been developed for identifying replay attacks, and they center on the characteristics that are provided to the network. The method of using deep convolutional networks to detect replay attacks was found to have an equal error rate (EER) of zero percent on the ASVspoof2017 training and test dataset [12].

Speech synthesis (SS) is the digital recreation of human speech, typically using computer software or hardware. Text-to-speech (TTS) is a component of SS that takes in written material and outputs spoken language based on that text according to predetermined linguistic rules. Text reading and AI personal assistants are just two applications of speech synthesis. Another perk of speech synthesis is that it can mimic various voices and dialects without relying on canned recordings.
Lyrebird (https://www.descript.com/lyrebird), a powerful speech synthesis company, employs deep learning models to synthesize 1,000 sentences in a second. The success of a TTS system is highly dependent on the quality of the speech corpus upon which it is built, and regrettably, collecting and annotating speech samples is expensive [23]. Char2Wav is a framework for end-to-end speech synthesis. PixelCNN is also the foundation of WaveNet [24], an SS framework. WaveGlow prioritizes stage two of the two-stage process (encoder and decoder) generally used by text-to-speech synthesis systems; therefore, WaveGlow is concerned with modifying specific time-aligned data, incorporating information into sound files by using encodings such as a mel-spectrogram. The Tacotron 2 [25] system comprises two parts. The first component is an attention-based recurrent sequence-to-sequence feature prediction network whose output is a sequence of predicted mel-spectrogram frames. A modified WaveNet vocoder is the second component.

For audio data, [26], [27] used GAN-based generative models. The model operates on mel spectrograms and employs a fully convolutional feed-forward network as the generator. The authors give a summary of their recently created dataset, which comprises 117,985 generated audio segments in 16-bit Pulse Code Modulation (PCM) WAV format and is available on Zenodo (https://zenodo.org/record/5642694).
Current studies show poor validation and testing performance when detecting deepfake audio. Feature-based techniques are required to improve the outputs of machine learning models. Deep learning approaches show better results but require greater training time and computational resources. Hence, the potential of machine learning approaches for deepfake detection is explored here, while the limitation of handling higher-dimensional feature sets and complexities can be addressed through a transfer learning-based deep learning approach.
III. PROPOSED METHODOLOGY
In machine learning, training a model always involves the trade-off between over-fitting and under-fitting, which negatively impacts the model's real-time performance. It is difficult to handle this trade-off so that models neither over-fit nor under-fit. One of the major issues in deepfake detection is the high false-positive rate, which occurs because most models classify an unseen pattern as abnormal if it is not included in the training set. This stems from the model's inability to be trained on a dataset that covers all possible patterns and cases, deepfake and real, which is regarded as a theoretical concept that cannot be implemented practically. Hence, the Fake-or-Real dataset [28] is divided into four sub-datasets: for-rerec, for-2sec, for-norm, and for-original, where the for-original dataset is the collection of the other three datasets without much preprocessing.

This research aims to develop a technique to classify deepfake synthetic audio under different background noises, audio sizes, and durations. We propose a framework that handles the big training set and performs detection using different supervised and unsupervised machine learning algorithms. The following subsections explain the proposed framework for all sub-datasets, including data handling, preprocessing, feature engineering, and the classification phase. Figure 1 shows the detailed architecture of the proposed framework, consisting of 1) data preprocessing, 2) feature extraction, and 3) classification models. The detailed description of each phase is as follows:
A. DATA PREPROCESSING
More than 195,000 real human and synthetic computer-generated speech samples are included in the Fake-or-Real (FoR) collection. Classifiers may be trained on the dataset to better identify fake speech. It includes output from Deep Voice 3 [29] and Google WaveNet TTS [24], as well as various human voice recordings. The dataset may be accessed in four different varieties: 1) for-original, 2) for-norm, 3) for-2sec, and 4) for-rerec. The original version includes the files without any changes from when they were first extracted from the speech sources. The for-norm version contains the same files as the first, but they have been standardized in terms of sampling rate, volume, and number of channels to achieve gender and class parity. The second version is the basis for the third (for-2sec), except that the files are truncated after 2 seconds instead of keeping their original length. The third and final version (for-rerec) is a re-recorded version of the for-2sec dataset, created to simulate an attacker transmitting an utterance via a voice channel. However, these datasets suffer from duplicate files, 0-bit files, and different bit rates in the audio signals, which negatively affect ML model training and performance. Hence, we preprocess the dataset to remove the duplicate and 0-bit files, which do not contribute to model training. Also, the bit rate is standardized by zero-padding any audio waveform with fewer than 16,000 samples, conforming to an operationally viable bit rate for the TensorFlow audio signal processing library. Finally, the data is normalized using a standard scaler to ease model training.
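A minimal sketch of the preprocessing just described (dropping 0-bit and duplicate files, zero-padding short waveforms to 16,000 samples, and standard scaling) is given below; the per-clip scaling and the MD5-based duplicate check are our assumptions, since the paper does not spell out these details:

import hashlib
import os
import numpy as np

TARGET_LEN = 16000  # samples; shorter waveforms are zero-padded

def keep_file(path: str, seen: set) -> bool:
    """Reject 0-bit files and exact duplicates (assumed MD5-based check)."""
    if os.path.getsize(path) == 0:
        return False
    digest = hashlib.md5(open(path, "rb").read()).hexdigest()
    if digest in seen:
        return False
    seen.add(digest)
    return True

def preprocess(wave: np.ndarray) -> np.ndarray:
    """Zero-pad to a fixed length, then standardize (standard-scaler style)."""
    if len(wave) < TARGET_LEN:
        wave = np.pad(wave, (0, TARGET_LEN - len(wave)))
    return (wave - wave.mean()) / (wave.std() + 1e-8)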
B. FEATURE EXTRACTION
A deepfake audio signal often has a feature set similar to that of the original signal, and distinguishing the two becomes increasingly challenging as deep learning approaches to generating deepfakes advance. Hence, the extracted features strongly affect the model's predictive power and accuracy. It is observed that audio signals in the frequency domain can provide features that are helpful in the detection and classification of deepfake audio that can deceive a human listener.
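As a concrete illustration of the frequency-domain features used throughout, here is a minimal MFCC extraction sketch with librosa; the 40-coefficient setting matches the MFCC-40 features mentioned later, while averaging over frames to obtain a fixed-size vector for the classical ML models is our assumption:

import librosa
import numpy as np

def extract_mfcc(path: str, sr: int = 16000, n_mfcc: int = 40) -> np.ndarray:
    """Compute an (n_mfcc x frames) MFCC matrix for one audio file."""
    wave, sr = librosa.load(path, sr=sr, mono=True)
    return librosa.feature.mfcc(y=wave, sr=sr, n_mfcc=n_mfcc)

# Averaging over time frames gives one fixed-size vector per clip,
# suitable for the classical ML models described below, e.g.:
# features = extract_mfcc("clip.wav").mean(axis=1)   # shape: (40,)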
FIGURE 3. In (a) and (b), the comparison between the deepfake and the real audio signal is shown as spectrograms, where the difference in amplitude is apparent. In (c) and (d), the amplitude is shown in decibels (dB) to expose the auditory parts of the audio signal.
of a feature for node j, while X′j is its normalized feature importance. RFXj is the importance of feature j averaged over all trees in the random forest, and Xjt is the normalized importance of feature j with respect to tree t. The model makes predictions based on the important features obtained, as mentioned in Figure 3.
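The averaged, normalized importance described above is what scikit-learn exposes as feature_importances_; a minimal sketch follows, with synthetic stand-in data since the real MFCC matrices are not reproduced here:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 40))      # stand-in for 40-dim MFCC vectors
y = rng.integers(0, 2, size=200)    # 0 = real, 1 = fake

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# feature_importances_ is the per-tree normalized importance averaged
# over all trees (it sums to 1), matching the RFXj quantity above.
ranking = np.argsort(rf.feature_importances_)[::-1]
print("most informative feature indices:", ranking[:5])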
2) SUPPORT VECTOR MACHINE (SVM)
SVM is a supervised learning method that relies primarily on two assumptions: 1) converting data into a high-dimensional space may reduce complex classification issues with complex decision surfaces to simpler problems that can be solved by making them linearly separable, and 2) only training patterns near the decision surface provide the most sensitive details for classification. Assume the deepfake detection problem is a binary classification with linearly separable vectors xi ∈ Rn, where the decision surface used to classify a pattern as belonging to one of the two classes is the hyperplane H0. If x is a vector in Rn, we define

f(x) = w · x + b    (4)

The dot product is represented by (·) in equation (4). The set of all vectors x that satisfy f(x) = 0 is denoted by H0. Assuming two hyperplanes H1 and H2, the distance between them is referred to as their margin, which can be represented as follows:

margin = 2 / ‖w‖    (5)
The decision hyperplane H0 depends on the vectors closest to the two parallel hyperplanes, called support vectors. The margin must be maximal to obtain a classifier that is not overly adapted to the training data. Consider a collection of training data vectors X = {x1, . . . , xL}, xi ∈ Rn, and a set of matching labels Y = {y1, . . . , yL}, yi ∈ {+1, −1}. We consider the hyperplane H0 to be optimally separating if the vectors are categorized without error and the margin is greatest. To be accurately categorized, the vectors must satisfy

f(xi) ≥ +1 for yi = +1    (6)
f(xi) ≤ −1 for yi = −1    (7)

Hence, finding the SVM classifying function H0 can be stated as follows:

minimize (1/2)‖w‖²    (8)
subject to yi f(xi) ≥ 1, ∀i    (9)
The SVM was chosen for its properties that aid in classifying deepfake audio. It performs well when there is a clear margin of separation between samples and is effective in high-dimensional environments. It employs a subset of the training points in the decision function, making it memory efficient, and it works well when the number of dimensions exceeds the number of samples. SVM does not perform very well on our for-original dataset because the required training time and the noise in that dataset are higher. It also does not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation that takes a long time to train. However, the clean datasets extracted from the for-original dataset perform better on the classification task. SVM has been shown to perform effectively on higher-dimensional data, most notably when detecting events in audio data. Hence, for deepfake audio, we implemented it using the Scikit-learn library with a radial basis function (RBF) kernel, C = 4, and probability = True.
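A minimal scikit-learn sketch with the settings reported above (RBF kernel, C = 4, probability = True); the synthetic data and train/test split are placeholders:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 40))        # stand-in MFCC feature vectors
y = rng.integers(0, 2, size=400)      # 0 = real, 1 = fake
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# probability=True triggers the internal (expensive) cross-validated
# probability calibration mentioned above.
svm = SVC(kernel="rbf", C=4, probability=True).fit(X_tr, y_tr)
fake_prob = svm.predict_proba(X_te)[:, 1]
print("test accuracy:", svm.score(X_te, y_te))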
3) MULTI-LAYER PERCEPTRON (MLP)
MLP is adequate for classification tasks; a multilayer perceptron can, through its layers, effectively filter the relevant features from the data and tune the parameters of the model for optimal predictions. There are at least three levels in the MLP model: an input layer, a hidden layer of computation nodes, and an output layer of processing nodes. In this study, we use the following MLP classifier hyperparameters: hidden layer size = 100, solver = adam or RMSprop (RMSprop is used for the smaller datasets), shuffle = True, verbose = False, and activation function = relu.
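The same configuration expressed in scikit-learn is sketched below; note that MLPClassifier supports the adam solver but not RMSprop, so the RMSprop runs mentioned above would require a different library (e.g., Keras), a detail the paper does not specify:

from sklearn.neural_network import MLPClassifier

# Reported settings: one hidden layer of 100 units, relu activation,
# adam solver, shuffle=True, verbose=False.
mlp = MLPClassifier(hidden_layer_sizes=(100,), activation="relu",
                    solver="adam", shuffle=True, verbose=False,
                    random_state=0)
# mlp.fit(X_train, y_train)   # X_train/y_train: MFCC feature vectors, labels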
4) EXTREME GRADIENT BOOSTING (XGB)
XGB is a parallel and optimized version of gradient boosting that combines efficiency and resource management. It implements gradient-boosted decision trees in an iterative model, combining weak base models into a stronger learner. The residual is utilized to refine the loss function and improve the prior prediction at each iteration of the gradient boosting algorithm. We use a learning rate of 0.1 and 10,000 estimators for the XGBoost algorithm. However, it is vulnerable to outliers because each successive classifier is compelled to correct the mistakes made by its prior learners; the estimators rely on historical predictions to determine their accuracy, and for this reason, streamlining the process is complex.
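With the xgboost package, the stated settings translate directly; everything beyond the learning rate and the number of estimators is left at library defaults, which is our assumption:

from xgboost import XGBClassifier

# Reported settings: learning rate 0.1 and 10,000 boosting rounds.
xgb = XGBClassifier(learning_rate=0.1, n_estimators=10000,
                    eval_metric="logloss")
# xgb.fit(X_train, y_train)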
IV. EXPERIMENTS AND RESULTS
About 195,000 human and synthetic speech samples were used to create the Fake-or-Real (FoR) dataset; Table 1 offers a summary of the dataset. Classifiers may be trained on the dataset to better identify fake speech. The dataset is an amalgamation of information from the following recent sources: first, text-to-speech programs such as Deep Voice 3 and Google WaveNet TTS [29]; secondly, many different types of recorded human voice, including recordings from the Arctic Dataset, LJSpeech Dataset, VoxForge Dataset, and user-submitted recordings [33], [34], [35]. The four dataset versions available for public consumption are for-original, for-norm, for-2sec, and for-rerec. The for-original folder stores the raw data from the speech sources.

The for-norm version has some duplicate files but is otherwise well-balanced across demographic categories (gender and socioeconomic status) and technical parameters (sample rate, volume, and number of channels). The third version is like the second, only the files are cut off after 2 seconds, and it is called for-2sec. The last variant, dubbed for-rerec, is a re-recording of the for-2sec dataset meant to mimic a situation in which an attacker transmits speech over a vocal channel like a phone call or voice message. We provide the outcomes of our binary classification analysis of the suggested method; Table 2 shows the experimental findings for spotting deepfakes.

TABLE 2. Accuracy comparison for machine learning models.

The experiments were also performed using noisy audio signals. For this purpose, we added synthetic noise to each audio signal of three datasets (for-2sec, for-norm, and for-rerec). This method kept both the original and the noisy audio in the dataset and increased the number of audio samples. The length of the original for-2sec dataset is 17,870 audio samples; after adding noise, the new dataset comprises 35,740 audio samples, and the same holds for the for-rerec and for-norm datasets.

TABLE 3. Accuracy comparison for noisy audio signals using machine learning models.
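The paper does not state the noise type or level used for this augmentation; the sketch below assumes additive white Gaussian noise at a fixed signal-to-noise ratio, which is one common choice:

import numpy as np

def add_white_noise(wave: np.ndarray, snr_db: float = 20.0,
                    rng=None) -> np.ndarray:
    """Return a noisy copy of `wave` at the requested SNR (assumed AWGN)."""
    rng = rng or np.random.default_rng(0)
    signal_power = np.mean(wave ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=wave.shape)
    return wave + noise

# Keeping the clean and noisy version of every clip doubles the dataset,
# e.g. 17,870 -> 35,740 samples for for-2sec.
clips = [np.sin(np.linspace(0, 100, 16000))]   # toy stand-in waveform list
augmented = [v for clip in clips for v in (clip, add_white_noise(clip))]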
A. FOR-REREC DATASET
The results for the for-rerec dataset are presented in Table 2. Multiple ML models are applied to obtain better results. The machine learning algorithms achieve the following accuracies: Support Vector Machine (SVM) 98.83%, Decision Tree 88.28%, Random Forest Classifier 96.60%, AdaBoost 87.67%, Gradient Boosting 93.51%, and XGB Classifier 93.40%. The SVM model exhibited the highest results on the for-rerec dataset.

The results of the noisy for-rerec audio signal classification are presented in Table 3. The MLP and SVM models obtained the highest accuracy scores of 98.66% and 98.43% compared to the other ML models. Other ML models such as DT, LR, and XGB obtained 82.12%, 88%, and 88.92% accuracy, respectively.
B. FOR-2SEC DATASET
The for-2sec dataset consists of audio in two-second clips. The audio is complex, as the information in such a small interval is limited; however, it is much easier for machine learning algorithms to process data in this form, and hence we observe better performance. The results are depicted in Table 2. We observe an MLP classifier accuracy of 94.69%, Random Forest of 94.44%, SVM of 97.57%, gradient boosting of 94.30%, and AdaBoost of 90.23%. The SVM model outperforms the other ML models in terms of accuracy.

Table 3 shows the results of the for-2sec dataset with noisy audio signal classification. Several ML models are used to get better outcomes: SVM obtained 99.59% accuracy, MLP 99.49%, and DT 87.52%, among others. The SVM exhibited the highest accuracy compared to the other ML models on noisy audio signals.
C. FOR-NORM DATASET
This dataset contains recorded audio in 12-second intervals. The results for the for-norm dataset are shown in Table 2: MLP Classifier 86.82%, Random Forest Classifier 90.60%, Extra Trees 91.46%, Gradient Boosting 92.63%, XGB Classifier 92.60%, LDA 91.35%, Gaussian NB 81.81%, and AdaBoost 89.40%. However, some algorithms show only average results, such as QDA at 61.36% and KNN at 64.21%. The Gradient Boosting classifier obtained the highest results compared to the other ML models.

The results for noisy audio from the for-norm dataset are presented in Table 3 and are lower than those of the other two datasets. The XGB model obtained the highest results using noisy audio from the for-norm dataset. All other ML models obtained reasonably good, but less impressive, results.

The deep learning experiments use the same visual MFCC features extracted from the audio data. These visual features train the VGG-16-based model and the LSTM to perform deepfake-or-real audio classification. Finally, the VGG16 model outperformed the LSTM model with a testing accuracy of 93%; the LSTM model obtained 91% accuracy. The VGG-16 model uses ImageNet weights and an input shape of (64 x 64 x 3). A validation accuracy of 0.94 and a validation loss of 0.14 are obtained, while the testing accuracy is 93%. Figure 4a shows the training and validation accuracy, while Figure 4b shows the training and validation loss of the VGG16 model.
FIGURE 4. Comparison between the validation and training (accuracy and loss).
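A minimal Keras sketch of the transfer-learning setup described above (ImageNet weights, 64 x 64 x 3 MFCC images); the classification head, frozen backbone, and optimizer are our assumptions, since the paper reports only the backbone, weights, and input shape:

import tensorflow as tf
from tensorflow.keras import layers, models

base = tf.keras.applications.VGG16(include_top=False, weights="imagenet",
                                   input_shape=(64, 64, 3))
base.trainable = False      # transfer learning: keep ImageNet features fixed

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(1, activation="sigmoid"),   # real vs. fake
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
# model.fit(mfcc_images, labels, validation_split=0.2, epochs=20)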
The Fake-or-Real dataset has been used in only one previous study [36]; because of this, the suggested method cannot be compared against many other studies. Our technique shows potential in terms of classification accuracy. This work obtained comparatively better results with ensemble-based machine learning models such as boosting algorithms, as the XGBoost algorithm shows greater accuracy than the baseline model. The models' accuracies for the three sub-datasets are shown in Tables 2 and 3.

Tables 2 and 3 compare the machine learning model results of the feature-based approach used to train the machine learning algorithms. Our approach of selecting the best features and ML classifiers obtained promising results on three datasets (for-rerec, for-2sec, and for-norm). However, the for-norm dataset does not perform well with our approach when using a simple SVM algorithm, as the data is high-dimensional; without dimensionality reduction on a complex dataset, it performs poorly. This dataset contains audio of length greater than 12 seconds; hence, a windowing technique can perform better in combination with MFCC. The proposed approach is compared with the baseline approach that used the for-original dataset for experimentation [36]. The existing approach used various ML models (SVM, RF, KNN, XGB) to detect deepfakes in the for-original dataset. The proposed approach obtains the highest testing score of 93%, which is 26% higher than the best score of the existing work using the SVM model. It is concluded that the proposed approach can efficiently detect deepfake audio.
The proposed and existing approaches' experimental settings are similar (dataset and data split). In addition, a comparative analysis of the proposed method against state-of-the-art feature extraction techniques is presented in Table 4. The proposed approach combines features from multiple feature extraction techniques and extracts the most optimal features for classification. Two deep learning models are employed in this research: the proposed approach uses VGG16 and LSTM models with a feature ensemble of MFCC-40, roll-off point, centroid, contrast, and bandwidth features. The features extracted from each method are combined for model classification.
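A sketch of how the feature ensemble above (MFCC-40 plus spectral roll-off, centroid, contrast, and bandwidth) could be assembled with librosa; summarizing each feature by its per-clip mean is our assumption:

import librosa
import numpy as np

def feature_ensemble(path: str, sr: int = 16000) -> np.ndarray:
    """Concatenate per-clip means of MFCC-40 and four spectral features."""
    y, sr = librosa.load(path, sr=sr, mono=True)
    feats = [
        librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40),        # (40, frames)
        librosa.feature.spectral_rolloff(y=y, sr=sr),       # (1, frames)
        librosa.feature.spectral_centroid(y=y, sr=sr),      # (1, frames)
        librosa.feature.spectral_contrast(y=y, sr=sr),      # (7, frames)
        librosa.feature.spectral_bandwidth(y=y, sr=sr),     # (1, frames)
    ]
    return np.concatenate([f.mean(axis=1) for f in feats])  # shape: (50,)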
The VGG16 model obtained the highest results compared to the existing study, with an accuracy of 93%; the LSTM model obtained an accuracy of 91%. The existing approach proposed by Khochare et al. used MFCC features and various machine learning models for deepfake audio detection [36]. They utilized 20 MFCC features for each audio clip and employed multiple machine learning models (SVM, RF, KNN, and XGB); using the 20 MFCC features with the SVM model, they obtained their highest accuracy of 67%. Another study, by Reimao and Tzerpos, used both machine learning and deep learning techniques along with various feature extraction methods [28]. The authors used timbre model analysis features (brightness, hardness, depth, roughness) with multiple ML models (NB, SVM, DT, and RF); according to the classification results, the SVM model using the various feature extraction methods obtained a 73.46% accuracy rate. Furthermore, STFT, mel-spectrogram, MFCC, and CQT feature extraction methods were used with the VGG19 model and obtained 89.79% accuracy. Compared to this previous research, our VGG16 model achieved the highest results, with an accuracy of 93%, and the LSTM model achieved 91% accuracy. The VGG16 model loss and the training and validation accuracy are shown in Figure 4. The proposed approach with the features mentioned in Section III-B outperforms the previous state-of-the-art feature extraction techniques, as presented in Table 4.
TABLE 4. Comparison between results of the proposed approach and existing approach.
V. DISCUSSION
This research extended the work on deepfake audio by extending the work on the Fake-or-Real dataset, a state-of-the-art dataset in audio detection and classification. We improved upon the performance of the algorithms previously trained with feature-based approaches by using MFCC-based features, indicating considerable improvements in accuracy. Our features outperform the feature-based approach by 10 to 20 percent on average across these datasets. The for-norm dataset performs poorly with our approach when using simple SVM algorithms; windowing techniques, in combination with MFCC, can perform better. We conduct additional experiments on machine learning algorithms categorized into (1) statistical models, such as QDA, LDA, and Gaussian Naive Bayes, for dimensionality reduction to reduce noise in the data; (2) tree-based models, such as Decision Tree, Extra Trees, and Random Forest, which can handle multidimensional data, do not require domain knowledge or parameter setting, and are appropriate for exploratory pattern detection; and (3) boosting models, namely AdaBoost, Gradient Boosting, and XGBoost, which fundamentally create several weak learners and combine their predictions to build a strong rule, helping to increase the accuracy of a model on feature-rich audio data. These three classes of ML algorithms were chosen for our approach to explore and improve their performance on MFCC-based feature sets. Besides this, we proposed a VGG-16-based deep learning model for the bigger dataset, which is the superset of the other three datasets. It uses transfer learning and is trained on MFCC image features. We obtained an accuracy of 93% while using half of the original dataset; a larger amount of data correlates with higher model accuracy, and we attempted to reach this performance with a limited dataset. The entire dataset can be explored for even better results in the future.
VI. CONCLUSION
The detection of deepfake audio is significant as an essential tool for enhancing security against scamming and spoofing. Deepfake audio has garnered significant public attention as society rapidly recognizes its possible security danger; however, deepfake audio is mostly studied in combination with the spatio-temporal data of video. This study improves upon work on the Fake-or-Real (FoR) dataset, which comprises state-of-the-art audio datasets and custom audio for deepfake audio classification, and is further compiled into four sub-datasets. This study conducted experiments with multiple audio data features to detect deepfakes in audio data. This work extracts MFCC features from audio for feature engineering, and several machine learning algorithms are applied to the selected feature set to detect deepfake audio. This approach gave higher accuracy and results in all cases than other state-of-the-art studies for audio data. This study obtained 97.57% accuracy with SVM on the for-2sec dataset compared to other ML models, while 92.63% was obtained by the Gradient Boosting classifier on the for-norm dataset, and the highest accuracy of 98.83% was obtained with the SVM model on the for-rerec dataset. We plan to explore different window sizes for MFCC and various input sizes for the models in the future. Future work can also evaluate these models against potential fluctuation and distortion in the audio signal to understand which has the greater effect. Moreover, studies on state-of-the-art few-shot learning and Bidirectional Encoder Representations from Transformers (BERT)-based models can be conducted. Furthermore, we plan to evaluate our models in ambient noise and reverberation circumstances. We intend to use feature extraction methods such as i-vector, x-vector, a combination of MFCC and GFCC, and a combination of DWT and MFCC, which were not taken into account in the current set of experiments because this is the beginning of our journey to identify deepfake audio.

REFERENCES
[1] A. Abbasi, A. R. R. Javed, A. Yasin, Z. Jalil, N. Kryvinska, and U. Tariq, ''A large-scale benchmark dataset for anomaly detection and rare event classification for audio forensics,'' IEEE Access, vol. 10, pp. 38885–38894, 2022.
[2] A. R. Javed, W. Ahmed, M. Alazab, Z. Jalil, K. Kifayat, and T. R. Gadekallu, ''A comprehensive survey on computer forensics: State-of-the-art, tools, techniques, challenges, and future directions,'' IEEE Access, vol. 10, pp. 11065–11089, 2022.
[3] A. R. Javed, Z. Jalil, W. Zehra, T. R. Gadekallu, D. Y. Suh, and M. J. Piran, ''A comprehensive survey on digital video forensics: Taxonomy, challenges, and future directions,'' Eng. Appl. Artif. Intell., vol. 106, Nov. 2021, Art. no. 104456.
[4] A. Ahmed, A. R. Javed, Z. Jalil, G. Srivastava, and T. R. Gadekallu, ''Privacy of web browsers: A challenge in digital forensics,'' in Proc. Int. Conf. Genetic Evol. Comput., Springer, 2021, pp. 493–504.
[5] A. R. Javed, F. Shahzad, S. U. Rehman, Y. B. Zikria, I. Razzak, Z. Jalil, and G. Xu, ''Future smart cities: Requirements, emerging technologies, applications, challenges, and future aspects,'' Cities, vol. 129, Oct. 2022, Art. no. 103794.
[6] A. Abbasi, A. R. Javed, F. Iqbal, Z. Jalil, T. R. Gadekallu, and N. Kryvinska, ''Authorship identification using ensemble learning,'' Sci. Rep., vol. 12, no. 1, pp. 1–16, Jun. 2022.
[7] S. Anwar, M. O. Beg, K. Saleem, Z. Ahmed, A. R. Javed, and U. Tariq, ''Social relationship analysis using state-of-the-art embeddings,'' ACM Trans. Asian Low-Resource Lang. Inf. Process., Jun. 2022.
[8] C. Stupp, ''Fraudsters used AI to mimic CEO's voice in unusual cybercrime case,'' Wall Street J., vol. 30, no. 8, pp. 1–2, 2019.
[9] T. T. Nguyen, Q. V. H. Nguyen, D. T. Nguyen, D. T. Nguyen, T. Huynh-The, S. Nahavandi, T. T. Nguyen, Q.-V. Pham, and C. M. Nguyen, ''Deep learning for deepfakes creation and detection: A survey,'' 2019, arXiv:1909.11573.
[10] Z. Khanjani, G. Watson, and V. P. Janeja, ''How deep are the fakes? Focusing on audio deepfake: A survey,'' 2021, arXiv:2111.14203.
[11] Z. Wu, T. Kinnunen, N. Evans, J. Yamagishi, C. Hanilçi, M. Sahidullah, and A. Sizov, ''ASVspoof 2015: The first automatic speaker verification spoofing and countermeasures challenge,'' in Proc. Interspeech, Sep. 2015.
[12] T. Kinnunen, M. Sahidullah, H. Delgado, M. Todisco, N. Evans, J. Yamagishi, and K. A. Lee, ''The ASVspoof 2017 challenge: Assessing the limits of replay spoofing attack detection,'' in Proc. 18th Annu. Conf. Int. Speech Commun. Assoc., 2017, pp. 2–6.
[13] J. Yamagishi, M. Todisco, M. Sahidullah, H. Delgado, X. Wang, N. Evans, T. Kinnunen, K. A. Lee, V. Vestman, and A. Nautsch. (2019). ASVspoof 2019: Automatic Speaker Verification Spoofing and Countermeasures Challenge Evaluation Plan. [Online]. Available: http://www.asvspoof.org
[14] S. Ö. Arık, H. Jun, and G. Diamos, ''Fast spectrogram inversion using multi-head convolutional neural networks,'' IEEE Signal Process. Lett., vol. 26, no. 1, pp. 94–98, Jan. 2019.
[15] Y. Chen, Y. Kang, Y. Chen, and Z. Wang, ''Probabilistic forecasting with temporal convolutional neural network,'' Neurocomputing, vol. 399, pp. 491–501, Jul. 2020.
[16] Y. Kawaguchi, ''Anomaly detection based on feature reconstruction from subsampled audio signals,'' in Proc. 26th Eur. Signal Process. Conf. (EUSIPCO), Sep. 2018, pp. 2524–2528.
[17] Y. Kawaguchi and T. Endo, ''How can we detect anomalies from subsampled audio signals?'' in Proc. IEEE 27th Int. Workshop Mach. Learn. Signal Process. (MLSP), Sep. 2017, pp. 1–6.
[18] H. J. Landau, ''Sampling, data transmission, and the Nyquist rate,'' Proc. IEEE, vol. 55, no. 10, pp. 1701–1706, Oct. 1967.
[19] H. Yu, Z.-H. Tan, Z. Ma, R. Martin, and J. Guo, ''Spoofing detection in automatic speaker verification systems using DNN classifiers and dynamic acoustic features,'' IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 10, pp. 4633–4644, Oct. 2018.
[20] S. Pradhan, W. Sun, G. Baig, and L. Qiu, ''Combating replay attacks against voice assistants,'' Proc. ACM Interact., Mobile, Wearable Ubiquitous Technol., vol. 3, no. 3, pp. 1–26, Sep. 2019.
[21] J. Villalba and E. Lleida, ''Preventing replay attacks on speaker verification systems,'' in Proc. Carnahan Conf. Secur. Technol., Oct. 2011, pp. 1–8.
[22] F. Tom, M. Jain, and P. Dey, ''End-to-end audio replay attack detection using deep convolutional networks with attention,'' in Proc. Interspeech, Hyderabad, 2018, pp. 681–685.
[23] K. Kuligowska, P. Kisielewicz, and A. Włodarz, ''Speech synthesis systems: Disadvantages and limitations,'' Int. J. Res. Eng. Technol., vol. 7, no. 83, pp. 234–239, 2018.
[24] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, ''WaveNet: A generative model for raw audio,'' 2016, arXiv:1609.03499.
[25] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan, R. A. Saurous, Y. Agiomyrgiannakis, and Y. Wu, ''Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions,'' in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Apr. 2018, pp. 4779–4783.
[26] J. Frank and L. Schönherr, ''WaveFake: A data set to facilitate audio deepfake detection,'' 2021, arXiv:2111.02813.
[27] M. Hassaballah, M. A. Hameed, and M. H. Alkinani, ''Introduction to digital image steganography,'' in Digital Media Steganography. Amsterdam, The Netherlands: Elsevier, 2020, pp. 1–15.
[28] R. Reimao and V. Tzerpos, ''FoR: A dataset for synthetic speech detection,'' in Proc. Int. Conf. Speech Technol. Hum.-Comput. Dialogue (SpeD), Oct. 2019, pp. 1–10.
[29] W. Ping, K. Peng, A. Gibiansky, S. O. Arik, A. Kannan, S. Narang, J. Raiman, and J. Miller, ''Deep voice 3: Scaling text-to-speech with convolutional sequence learning,'' 2017, arXiv:1710.07654.
[30] F. M. Rammo and M. N. Al-Hamdani, ''Detecting the speaker language using CNN deep learning algorithm,'' Iraqi J. Comput. Sci. Math., vol. 3, no. 1, pp. 43–52, Jan. 2022.
[31] S. Ahmed, Z. A. Abbood, H. M. Farhan, B. T. Yasen, M. R. Ahmed, and A. D. Duru, ''Speaker identification model based on deep neural networks,'' Iraqi J. Comput. Sci. Math., vol. 3, no. 1, pp. 108–114, Jan. 2022.
[32] A. Winursito, R. Hidayat, and A. Bejo, ''Improvement of MFCC feature extraction accuracy using PCA in Indonesian speech recognition,'' in Proc. Int. Conf. Inf. Commun. Technol. (ICOIACT), Mar. 2018, pp. 379–383.
[33] J. Kominek and A. W. Black, ''The CMU Arctic speech databases,'' in Proc. 5th ISCA Workshop Speech Synth., 2004.
[34] K. Ito and L. Johnson. (2017). The LJ Speech Dataset. [Online]. Available: https://keithito.com/LJ-Speech-Dataset/
[35] K. MacLean. (2018). VoxForge. [Online]. Available: http://www.voxforge.org/home
[36] J. Khochare, C. Joshi, B. Yenarkar, S. Suratkar, and F. Kazi, ''A deep learning framework for audio deepfake detection,'' Arabian J. Sci. Eng., vol. 47, pp. 1–12, Nov. 2021.

AMEER HAMZA is currently pursuing the master's degree in artificial intelligence with the Department of Creative Technology, Air University, Islamabad, Pakistan.

ABDUL REHMAN JAVED (Member, IEEE) received the master's degree in computer science from the FAST National University of Computer and Emerging Sciences, Islamabad, Pakistan. He has worked at the National Cybercrimes and Forensics Laboratory, Air University, Islamabad, where he is currently a Lecturer with the Department of Cyber Security. He is also a cyber security researcher and practitioner with industry and academic experience. He is supervising/co-supervising several graduate (B.S. and M.S.) students on health informatics, cybersecurity, mobile computing, and digital forensics topics. He has reviewed more than 150 scientific research articles for various well-known journals and has authored more than 50 peer-reviewed research articles. His research interests include mobile and ubiquitous computing, data analysis, knowledge discovery, data mining, natural language processing, smart homes, and their applications in human activity analysis, human motion analysis, and e-health. He aims to contribute to interdisciplinary research in computer science and human-related disciplines. He is a member of ACM and a TPC Member of CID2021 (Fourth International Workshop on Cybercrime Investigation and Digital Forensics) and the 44th International Conference on Telecommunications and Signal Processing. He has served as a Moderator for the 1st IEEE International Conference on Cyber Warfare and Security (ICCWS).
FARKHUND IQBAL (Member, IEEE) received the master's and Ph.D. degrees from Concordia University, Canada, in 2005 and 2011, respectively. He is currently working as an Associate Professor at the College of Technological Innovation, Zayed University, United Arab Emirates. He is also an Affiliate Professor with the School of Information Studies, McGill University, Canada, and an Adjunct Professor with the Faculty of Business and Information Technology, Ontario Tech University, Canada. He leads the Cyber Security and Digital Forensics (CAD) Research Group, Center for Smart Cities and Intelligent Systems, Zayed University. He has published more than 120 papers in high-ranked journals and conferences. His research interests include artificial intelligence, machine learning, and data analytics techniques for problem-solving in cybersecurity, health care, and cybercrime investigation in the smart city domain. He has served as the chair and co-chair for several IEEE/ACM conferences and has been a guest editor and reviewer for multiple high-rank journals.

NATALIA KRYVINSKA received the Ph.D. degree in electrical and IT engineering from the Vienna University of Technology, Austria, and the Habilitation (Docent Title) degree in management information systems from Comenius University in Bratislava, Bratislava, Slovakia. She received her Professor title and was appointed to the professorship by the President of the Slovak Republic. She is currently a Full Professor and the Head of the Department of Information Systems, Faculty of Management, Comenius University in Bratislava. Previously, she served as a University Lecturer and a Senior Researcher at the Department of e-Business, School of Business Economics and Statistics, University of Vienna. Her research interests include complex service systems engineering, service analytics, and applied mathematics.

AHMAD S. ALMADHOR (Member, IEEE) received the B.S.E. degree in computer science from Jouf University (formerly Jouf College), Saudi Arabia, in 2005, the M.E. degree in computer science and engineering from the University of South Carolina, Columbia, SC, USA, in 2010, and the Ph.D. degree in electrical and computer engineering from the University of Denver, Denver, CO, USA, in 2019. From 2006 to 2008, he was a Teaching Assistant and the College of Sciences Manager, then a Lecturer, from 2011 to 2012, at Jouf University. He was then a Senior Graduate Assistant and a Tutor Advisor at the University of Denver, in 2013 and 2019, respectively. He is currently an Assistant Professor of CEN and VD with the College of Computer and Information Science, Jouf University. His research interests include AI, blockchain, networks, smart and microgrid cyber security, integration, image processing, video surveillance systems, PV, EV, and machine and deep learning. He was a recipient of awards and honors, including the Aljouf University Scholarship (Royal Embassy of Saudi Arabia in D.C.) and the Aljouf Governor's Award for Excellency.

ZUNERA JALIL received the master's degree in computer science from the Higher Education Commission of Pakistan, in 2007, and the Ph.D. degree in computer science with a specialization in information security from the FAST National University of Computer and Emerging Sciences, Islamabad, Pakistan, in 2010. She has served as a full-time Faculty Member at International Islamic University, Islamabad; Iqra University, Islamabad; and Saudi Electronic University, Riyadh, Saudi Arabia. She is currently an Assistant Professor with the Department of Cyber Security, Faculty of Computing and Artificial Intelligence, Air University, Islamabad, and a Senior Researcher with the National Cybercrimes and Forensics Laboratory, National Center for Cyber Security, Islamabad. Her research interests include computer forensics, machine learning, criminal profiling, software watermarking, intelligent systems, and data privacy protection. She received a scholarship for her master's degree.

ROUBA BORGHOL received the master's degree in applied mathematics from the University of Claude Bernard II, Lyon, and the Ph.D. degree in mathematics from the University of Tours, France, in December 2005. She was a Lecturer at the University of Tours, from 2005 to 2007, a Research Fellow at the Polytechnic School, Palaiseau, France, in 2008, and an Assistant Professor at Lebanese University, Lebanon, from 2008 to 2009, and at the College of Applied Science and Dhofar University, from 2010 to 2013. She is currently an Assistant Professor of mathematics with the Rochester Institute of Technology of Dubai. Throughout her 15 years in academia, she has taught several courses and topics, such as pure and applied mathematics courses for both undergraduate and graduate programs.