
This article has been accepted for publication in IEEE Access. This is the author's version, which has not been fully edited; content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2023.3251417

An Improved Dense CNN Architecture for Deepfake Image Detection

YOGESH PATEL1, SUDEEP TANWAR1 (Senior Member, IEEE), PRONAYA BHATTACHARYA2 (Member, IEEE), RAJESH GUPTA1, TURKI M. ALSUWIAN3, INNOCENT EWEAN DAVIDSON4 (Senior Member, IEEE), THOKOZILE F. MAZIBUKO4

1 Department of Computer Science and Engineering, Institute of Technology, Nirma University, Ahmedabad, Gujarat, India (e-mails: 20MCEC13@nirmauni.ac.in, sudeep.tanwar@nirmauni.ac.in, rajesh.gupta@nirmauni.ac.in)
2 Department of Computer Science and Engineering, Amity School of Engineering and Technology, Amity University, Kolkata, India (e-mail: pbhattacharya@kol.amity.edu)
3 Electrical Engineering Department, College of Engineering, Najran University, Najran 11001, Kingdom of Saudi Arabia (e-mail: tmalsuwian@nu.edu.sa)
4 Department of Electrical Power Engineering, Durban University of Technology, Durban, South Africa (e-mails: ThokozileM1@dut.ac.za, inno.davidson@gmail.com)

Corresponding authors: Sudeep Tanwar (e-mail: sudeep.tanwar@nirmauni.ac.in), Pronaya Bhattacharya (e-mail: pbhattacharya@kol.amity.edu), and Rajesh Gupta (e-mail: rajesh.gupta@nirmauni.ac.in)

ABSTRACT Recent advancements in computer vision processing have provided potent tools to create realistic deepfakes. A generative adversarial network (GAN) can fake captured media streams, such as images, audio, and video, and make them visually fit other environments. The dissemination of such fake media streams creates havoc in social communities and can destroy the reputation of a person or a community; moreover, it manipulates public sentiment and opinion toward the person or community. Recent studies have suggested the convolutional neural network (CNN) as an effective tool to detect deepfakes in the network, but most techniques cannot capture the inter-frame dissimilarities of the collected media streams. Motivated by this, this paper presents a novel and improved deep CNN (D-CNN) architecture for deepfake detection with reasonable accuracy and high generalizability. Images from multiple sources are used to train the model, improving its overall generalizability. The images are re-scaled and fed to the D-CNN model. Binary cross-entropy and the Adam optimizer are utilized to improve the learning of the D-CNN model. We have considered seven different datasets from the reconstruction challenge, with 5000 deepfake images and 10000 real images. The proposed model yields an accuracy of 98.33% on AttGAN (Facial Attribute Editing by Only Changing What You Want), 99.33% on GDWCT (group-wise deep whitening-and-coloring transformation), 95.33% on StyleGAN, 94.67% on StyleGAN2, and 99.17% on StarGAN (a GAN capable of learning mappings among multiple domains) real and deepfake images, which indicates its viability in experimental setups.

INDEX TERMS Deepfake detection, CNN, convolutional neural network, GAN

I. INTRODUCTION
Artificial intelligence (AI) has progressed in diverse domains, including computer vision, speech generation and analysis, and the design of multi-agent systems in industry. In a similar direction, generative deep learning (DL) techniques have made a transformative shift in multimedia processing, where deepfakes (DFs) have recently emerged, allowing the creation of synthetic content based on captured images and videos of persons. In a DF, a person's eyes, lips, and face movements are captured and superimposed on another external environment, forming a realistic vision of that person in a simulated fake environment. With the world becoming more connected and networked through social media circles, DFs are increasingly used to create synthetic data of politicians, communities, actors, and media, giving rise to fake news generation and dissemination. One effective algorithm for generating DFs is the generative adversarial network (GAN), initially proposed by Goodfellow et al. in 2014 [1] [2]. GANs [3] make it easy to create fake synthetic image, audio, and video content that is presented as real.
Technically, the GAN model comprises two networks: a generator network (which aims to generate synthetic
content out of the noise vector) and a discriminator network (which aims to classify these generated synthetic images). An iterative process is followed in the generator-discriminator network, where the discriminator's feedback is supplied to the generator network. Over time, the generator learns to create synthetic content that looks extremely real and spoofs the discriminator [4] [5]. Thus, the generator-discriminator network in DF raises concerns about the authenticity of published content on social platforms, as it is tough to differentiate between real and fake content. A minimal sketch of this adversarial feedback loop is given below.
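The following is a minimal illustrative sketch (not the paper's method) of the generator-discriminator loop just described, assuming TensorFlow/Keras; the layer sizes and latent dimension are assumptions chosen only for illustration.

# A minimal sketch of the generator-discriminator feedback loop described
# above; all layer sizes are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models

latent_dim = 100  # assumed size of the noise vector

# Generator: maps a noise vector to a 160x160x3 synthetic image.
generator = models.Sequential([
    layers.Dense(256, activation="relu", input_shape=(latent_dim,)),
    layers.Dense(160 * 160 * 3, activation="tanh"),
    layers.Reshape((160, 160, 3)),
])

# Discriminator: classifies an image as real (1) or synthetic (0).
discriminator = models.Sequential([
    layers.Flatten(input_shape=(160, 160, 3)),
    layers.Dense(256, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
discriminator.compile(optimizer="adam", loss="binary_crossentropy")

# Combined model: inside it the discriminator is frozen, so its feedback
# only updates the generator.
discriminator.trainable = False
gan = models.Sequential([generator, discriminator])
gan.compile(optimizer="adam", loss="binary_crossentropy")

def train_step(real_images, batch_size=32):
    noise = tf.random.normal((batch_size, latent_dim))
    fake_images = generator.predict(noise, verbose=0)
    # 1) Train the discriminator on real vs. generated images.
    discriminator.train_on_batch(real_images, tf.ones((batch_size, 1)))
    discriminator.train_on_batch(fake_images, tf.zeros((batch_size, 1)))
    # 2) Train the generator so the discriminator labels its output "real".
    gan.train_on_batch(noise, tf.ones((batch_size, 1)))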
Some notable examples of DF tools include DFaker, DeepFaceLab, Faceswap, Faceswap-GAN, STGAN, StarGAN, and Face Swapping GAN (FSGAN), among many others [6]. DeepFaceLab allows a user to swap a person's face with another person's face, change the age of a person, and synchronize the lip and eye movements in a video [7]. Face2Face [8] allows real-time face reenactment based on RGB video output, emulating the input expressions. DF tools are also used to generate pornographic content that hurts public sentiment [9] [10]. Hate speech is another widely used form of propaganda in social circles. For example, a video of the 44th President of the United States, Barack Hussein Obama II, published by BuzzFeed shows the former president cursing former president Donald Trump; it was created through GAN technology. It was massively distributed in social media circles as official news, but the content is synthetic [11].
This raises a prime concern about the authenticity of news content. To overcome the aforementioned issues of DF GANs, a robust and highly generalizable DF detection system is required, one that can accurately separate manipulated and synthetic content from authentic content. Recent approaches published in the literature point to the design of a robust DF detection scheme, but most approaches lack robustness, effectiveness in training the DF detection model, and integration of generalizability and interpretability in the model [12] [13] [14]. As indicated by Yu et al. in [12], robustness in DF detection means that the system should be able to detect manipulation of both high-quality and low-quality image/video content; the system's effectiveness should not drop based on the resolution of the content. Generally, the performance of DF detection systems drops over low-quality content. Generalizability refers to the condition where each DF generation tool utilizes a different approach to generate DF content; thus, the DF system should be able to detect manipulations from these different tools in a single shot [12]. Interpretability refers to the condition in the DF detection ecosystem where a model should be able to predict which parts of the image (a person's face, for example) are real or fake and label the bounding boxes with fake probabilities. It is crucial because it enables a system to understand the dynamics of the generated synthetic content and presents a visual explanation of the abnormalities in the images [15]. Current systems analyze DF detection on a sequential frame-by-frame basis, which results in higher temporal inconsistencies in the model. Thus, there is a stringent requirement for effective DF detection models that can form an optimal mix of the aforementioned conditions [16].
Recent approaches have suggested convolutional neural networks (CNNs) as an effective fit for DF detection models [17]. Usually, pre-trained CNN models are applied on single frames, while other approaches have considered recurrent convolutional networks, where frames can be grouped to form the decision. In addition, some approaches consider facial expression patterns to capture fake content. Most CNN-based approaches are black boxes, where the models overfit. In other cases, the validation, testing, and training splits are not uniformly distributed, which leads to different interpretations of the same datasets under different operating conditions. For example, a DF detection model for the Facebook DF detection challenge dataset is proposed in [18]. The model scored an average precision of 82.56% on that dataset, but the performance drastically drops to 65.18% on the validation dataset, as it is collected from various sources. Thus, generalization through a CNN on one dataset does not carry over to another dataset [19]. These inconsistencies can be mitigated through an effective deep CNN (D-CNN) model that addresses cross-domain interpretability while maintaining the robustness and generalizability of the DF detection scheme, yielding high accuracy through an effective ensemble of the proposed CNN approaches.

A. NOVELTY
Existing CNN-based DF detection models should conform to the abilities of high generalizability, robustness, and interpretability [12] [20]. The lack of these abilities can be seen in existing systems such as MesoNet and MesoInceptionNet, which are well-known compact CNN-based DF detection models focused on detecting low-quality deepfake images. Even though they yield promising results on their test sets, these models lack generalizability, which is a well-discussed challenge in the domain of DF detection. Accuracy drops by a huge margin whenever these DF detection methods are tested against DF images generated using different methods, because DF detection systems learn features particular to the generation methods whose images were used to train them. For example, a DF detection model developed and trained over images from StarGAN and then tested over a reserved unseen test set will yield good results, but when tested against images from some other DF generation method, say StyleGAN, its accuracy will drop by a huge margin, sometimes to the point that the model's output becomes a random guess. This indicates that the lack of generalizability is the challenge at large. Regardless, CNN approaches have mostly treated DF detection as a binary classification problem, where cross-domain interoperability is required [18]. The proposed work presents an improvement over MesoNet and MesoInceptionNet: a D-CNN model that extracts deep features
from input images through the convolution layers to address the aforementioned challenges. It captures the manipulation traces left behind as features and forms a classification model based on the similarities between real and fake images. The similarities are projected to the closest match, which improves the model's predictability, as it captures complex inconsistencies through the deep network. Furthermore, the model is trained over synthetic and real images from different sources, improving generalizability and cross-learning accuracy.

B. RESEARCH CONTRIBUTION
Following are the major contributions of the paper.
• We analyze various existing approaches to DF detection using the CNN model and highlight their advantages and potential pitfalls.
• We propose a novel D-CNN-based architecture to classify DF image and video contents. The proposed model is trained over images from seven data sources to increase its generalizability.
• We then evaluate the performance of the proposed architecture using accuracy, precision, recall, and F1-score metrics over the reserved test set.

C. ARTICLE LAYOUT
The layout of the article is as follows. Section II presents the existing approaches of DF detection models. Section III presents the problem formulation of the proposed DF classification scheme. Section IV details the proposed model approach and the systematic explanation of the model processing. Section V discusses the performance evaluation of the model based on various metrics. Section VI presents the discussion and future challenges of the proposed scheme, and finally, Section IX presents the work's conclusion and future scope.

II. RELATED WORKS
From the literature, it can be seen that researchers have already adopted different types of approaches to create an efficient DF detection system. Even so, the underlying principle of most approaches remains consistent, focusing on the exploitation of inconsistencies and manipulation traces left behind by GAN tools during generation [6]. Nowadays, DF spans multiple modalities, such as audio, video, image, or hybrid modality-based models. Among these, image/video-based DF is the most prominent; thus, most research is directed toward identifying image and video DFs. Image/video DF detection models are generally classified into three domains: physical/physiological features, signal-level features, and data-driven models [21]. DF detection approaches involving more than one modality, i.e., combined audio and video, are termed multi-modal approaches, where the classification rests on computing the disharmony (or entropy difference) between the two modalities in DF manipulations [22] [23]. Table 1 presents a comparative analysis of the proposed D-CNN model against existing approaches in terms of approach and the proposed method. The following subsections present the existing approaches in the classified domains.

A. PHYSICAL/PHYSIOLOGICAL FEATURES
In physical and physiological feature-based approaches, visible discrepancies in the image/video content are exploited to classify whether the submitted content is synthetic or real. The visible discrepancies primarily include improper shadows, irregular geometry, missing details in facial features such as teeth or ears, inconsistent eye colors, head movements, and other features. For example, Li et al. [24] leveraged inconsistencies in eye-blinking patterns, which DF tools cannot mimic in a video stream. The authors in [29] worked on inconsistencies between head pose movements and the rest of the body movements in DF images and videos to identify synthetic content. They identified 68 different landmarks on the whole body, including 17 facial landmarks. The direction of movement is considered from the center of the face, and if the directions of two or more landmarks agree, the content is classified as authentic; otherwise, it is classified as synthetic. Matern et al. [30] used inconsistencies in other visual artifacts, such as inconsistent geometry of teeth, shadows, lighting, and eye colors. While this approach is sound, the latest DF generation tools have learned the geometry of faces and can thus easily spoof such models. Therefore, to overcome feature-based inconsistencies, researchers shifted to other representations, including signal-level feature extraction.

B. SIGNAL-LEVEL FEATURES
In signal-level approaches, deep features are extracted using either feature descriptors or feature extraction algorithms. Low-level features are extracted using steganalysis, which the classification algorithm can use to decide whether the input content is a DF. Kharbat et al. [31] presented a combination model of different signal-level feature descriptors based on HOG, ORB, SURF, and others. The extracted deep features are then fed as input to an SVM classifier to decide whether the image is a DF. The authors in [35] utilized a feature extraction approach known as the scale-invariant feature transform (SIFT), which extracts and analyzes key pixel features. Similar to the study in [31], Akhtar et al. [32] used local image descriptors such as LBP, LPQ, PHOG, SURF, BSIF, and IQM. The results suggested that IQM performed more accurately than the other descriptors. However, as DF tools became more sophisticated, GAN models fooled signal-level feature descriptors, and research shifted toward data-driven DF detection models.

C. DATA-DRIVEN MODELS
In data-driven DF detection, deep neural networks (DNNs) are used instead of hand-crafted features to extract and learn the features. Based on this learning, the model classifies the submitted content as DF or real images/videos. However,

TABLE 1: A comparative analysis of the proposed model with the existing approaches.

Author | Year | Approach | Algorithm | Method | Remarks
Li et al. [24] | 2018 | Physical attributes-based detection | Long-term recurrent CNNs | Used eye-blinking patterns to detect DF videos | Advanced DF videos are hard to detect using visual feature sets
Marra et al. [25] | 2018 | Data-driven models | XceptionNet | Performed a comparative study of InceptionNet, DenseNet, and XceptionNet models; among these, XceptionNet performed best | Lacks generalizability
Lee et al. [26] | 2018 | Data-driven models | CNN | Proposed a five-layer CNN architecture called Deep Forgery Discriminator | Provides good results but lacks generalizability
Afchar et al. [27] | 2018 | Data-driven models | CNN | A CNN model that utilizes the inception module as its architecture backbone | Worked well with compressed videos, but XceptionNet outperformed it on every dataset
Güera et al. [28] | 2018 | Data-driven models | RNN | RNN-based temporal feature model | Accuracy is not especially high and can be outperformed by other models
Yang et al. [29] | 2019 | Physical attributes-based detection | Support vector machine (SVM) classifier | Exploited inconsistencies between the head pose of the face and other parts of the body using various facial landmarks | Visual features are not reliable with advanced DF datasets
Matern et al. [30] | 2019 | Physical attribute-based detection | Ensemble model with multi-layer perceptron and logistic regression | Used visual artifacts such as differences in eye colors, disproportionate shadows, details of invisible light reflections, and shape geometry | Visual features are not reliable with advanced DFs
Kharbat et al. [31] | 2019 | Signal-level feature-based detection | SVM classifier | Combined multiple feature-point descriptors, such as histogram of oriented gradients (HOG), features from accelerated segment test (FAST), binary robust independent elementary features (BRIEF), binary robust invariant scalable keypoints (BRISK), KAZE, speeded-up robust features (SURF), and oriented FAST and rotated BRIEF (ORB); HOG achieved an accuracy of 94.5% with the SVM classifier | With advanced DFs coming up every year, extracting features is getting difficult
Akhtar et al. [32] | 2019 | Signal-level feature-based detection | SVM classifier | Used local image descriptors, such as local binary pattern (LBP), local phase quantization (LPQ), pyramid histogram of oriented gradients (PHOG), binary Gabor pattern (BGP), and image quality metric (IQM) | IQM performed best among the descriptors
Nguyen et al. [33] | 2019 | Data-driven models | Capsule network | The capsule network consists of 3 primary capsules and 2 output capsules; features extracted from VGG-19 are provided as input | Worked as well as MesoNet, but XceptionNet outperforms all the networks
Amerini et al. [34] | 2019 | Data-driven models | CNN | Exploited discrepancies in motion across successive frames at f(t) and f(t+1); used a CNN as the classification algorithm | Other algorithms outperform the proposed model
Proposed Model | 2022 | Data-driven model | CNN | Proposed D-CNN-based architecture trained over images from seven different data sources | Data pipeline in the proposed architecture over DF videos

to train the DNN model, a sufficient amount of data must be supplied, and thus the approach is called data-driven. Marra et al. [25] used networks such as InceptionNet, DenseNet, and XceptionNet with a large dataset of samples collected from different categories of image-to-image translation created using CycleGAN. Their experimental results suggested that XceptionNet outperforms all the other networks considered in the study. However, the issue of generalizability remained, which was addressed by the authors in [26], where they proposed a deep forgery discriminator network, essentially a five-layer CNN architecture based on embedding the contrastive loss. The results were promising, but the lack of generalizability remained a problem. Another CNN-based approach, known as MesoNet, was proposed by Afchar et al. [27]; it performed well because it focused on the mesoscopic features of images. Nguyen et al. [33] proposed a capsule network with features extracted from VGG-19. The model performed as well as MesoNet, but XceptionNet still outperforms it. Similar approaches exist where the authors used the temporal component of the video to identify DF videos: Güera et al. [28] proposed a recurrent neural network (RNN) model, and Amerini et al. [34] used a CNN with the concept of exploiting discrepancies across frames.
As outlined above, the data-driven models normally outperform the physiological and signal-based approaches. Thus, we consider a data-driven approach in the proposed scheme and propose a D-CNN model that captures deep features with improved generalization and model predictability.

III. PROBLEM FORMULATION
This section presents the problem formulation of the proposed approach. The proposed model is a data-driven D-CNN model for DF detection that predicts the respective
class of input images based on their features. To formulate the problem, we consider a certain number of available images, represented as $I_{total} = \{I_1, I_2, \ldots, I_n\}$. $I_{total}$ is sent for training and is classified into real images, represented as $I_r$, or DF images, represented as $I_{df}$. $I_r$ is constructed from $p$ different data sources of real images, where any image $i \in I_r$ is represented as follows:

$$I_r = \left[ N_{i=1}^{k=1}, N_{i=2}^{k=1}, \ldots, N_{i=x}^{k=1}, N_{i=1}^{k=2}, \ldots, N_{i=x}^{k=2}, \ldots, N_{i=x}^{k=p} \right] \qquad (1)$$

Considering that each data source consists of $x$ real images, it can be denoted as $N^k$. Thus, $I_r$ is further denoted as follows:

$$I_r = \sum_{k=1}^{p} N^k \qquad (2)$$

Similarly, for the DF images $I_{df}$, there are $q$ data sources of deepfake images, and each source consists of $z$ images. This is illustrated as follows:

$$I_{df} = \left[ N_{i=1}^{j=1}, N_{i=2}^{j=1}, \ldots, N_{i=z}^{j=1}, N_{i=1}^{j=2}, \ldots, N_{i=z}^{j=2}, \ldots, N_{i=z}^{j=q} \right] \qquad (3)$$

Similar to equation (2), we assume that the $z$ images of a source are denoted as a set $N^j$. Thus, $I_{df}$ is represented as:

$$I_{df} = \sum_{j=1}^{q} N^j \qquad (4)$$

Based on equations (2) and (4), $I_{total}$ is described as follows:

$$I_{total} = I_r + I_{df} = \sum_{k=1}^{p} N^k + \sum_{j=1}^{q} N^j \qquad (5)$$

The labels for the corresponding classes can be defined as:

$$Y = [y_1, y_2, \ldots, y_m] \qquad (6)$$

where $m$ is the total number of images in $I_{total}$. The proposed architecture addresses a binary classification problem with only two classes, i.e., $y = 0$ indicating an $I_r$ image and $y = 1$ indicating an $I_{df}$ image.
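As a concrete check of equation (5), consider the balanced subset used later in Section V: $p = 2$ real sources (CelebA and FFHQ) with $x = 2500$ images each, and $q = 5$ deepfake sources with $z = 1000$ images each, which gives

$$|I_{total}| = \sum_{k=1}^{2} 2500 + \sum_{j=1}^{5} 1000 = 5000 + 5000 = 10000 = m.$$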
IV. PROPOSED APPROACH
As discussed in Section III, the proposed model is a binary classification model, where the input $I_{total}$ is classified into the $I_r$ or $I_{df}$ class, with multiple data sources for each class. For DF detection, a CNN is a prominent choice. Thus, we augment the CNN model with deep layers and present the D-CNN model, which extracts deep features from input images using convolutional layers. Convolution operations performed over the images in the earlier stages allow us to extract much deeper features that can be used to classify the input images as DF. Figure 1 presents the details of our proposed model.

Algorithm 1: Working of the proposed approach.
Input: I - RGB images of faces, D - destination address of stored images, M - destination address of the pretrained model
Output: L - predicted likelihood, P - predicted label
procedure DEEPFAKE_DETECTION()
    Ht (Height) <- 160
    Wt (Width) <- 160
    DataGen <- ImageDataGenerator()
    Generator <- DataGen.flow_dir(D, Ht, Wt)
    model <- load_model(M)
    i <- 1
    while i <= len(Generator.labels) do
        I <- Generator.next()
        L <- model.predict(I)
        P <- round(L)
        Display likelihood L
        Display predicted label P
        Display image I
        i <- i + 1
    end while
end procedure

A. ALGORITHM
Algorithm 1 shows the flow of the proposed architecture. It takes facial images and the directory address where the images are stored as input; it outputs the predicted likelihood and the predicted label, and prints the image that has been processed. The procedure first sets the Height and Width to 160. Then an object of ImageDataGenerator, called DataGen, is created with all the necessary arguments for the required data augmentation techniques. Using this DataGen object, we can flow the images one by one or in batches based on the arguments given to flow_from_directory(). flow_from_directory() accepts the destination address of the stored images and the Height and Width as arguments; it resizes the input images to the given size, applies the data augmentation techniques when required, and returns an object called Generator. Next, the pretrained model is loaded into an object called model. The user can then read images one by one from the input directory and feed them to the model to predict the likelihood. Rounding the predicted likelihood gives the predicted label of the class, and at the end, the image is printed along with these two outputs.
The predicted likelihood ranges from 0 to 1. The closer it is to zero, the more confident the model is that the image is real; conversely, the closer the likelihood is to 1, the more confident the model is that the image is a deepfake. A likelihood close to 0.5 is much like a random guess. Rounding the predicted likelihood thus gives the predicted label: real is indicated by '0' and deepfake by '1'. The image is also printed alongside these results for the user. A runnable sketch of this procedure is given below.
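The following is a minimal Python/Keras rendering of Algorithm 1. The directory and model paths, the rescaling step, and the use of generator.samples (for Algorithm 1's len(Generator.labels)) are assumptions; Algorithm 1 itself does not fix them.

# A minimal Python/Keras sketch of Algorithm 1 (paths are placeholders).
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.image import ImageDataGenerator

HT, WT = 160, 160                        # input height and width (Algorithm 1)

def deepfake_detection(d, m):
    # DataGen applies whatever preprocessing/augmentation is configured here.
    datagen = ImageDataGenerator(rescale=1.0 / 255)   # assumed rescaling
    generator = datagen.flow_from_directory(
        d, target_size=(HT, WT), batch_size=1, class_mode="binary")
    model = load_model(m)                # pretrained D-CNN saved as a full model
    for _ in range(generator.samples):   # one image at a time
        image, _ = next(generator)       # resized, augmented batch of size 1
        likelihood = float(model.predict(image, verbose=0)[0][0])
        label = int(round(likelihood))   # 0 = real, 1 = deepfake
        print(f"likelihood={likelihood:.4f}, label={label}")

deepfake_detection("data/test_images", "models/dcnn_best.h5")  # assumed paths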

FIGURE 1: The proposed D-CNN model.

B. PROPOSED ARCHITECTURE
This section discusses the proposed CNN-based architecture (Figure 2). In general, a CNN architecture consists of both convolutional and pooling layers. Convolutional layers extract deep features from input images, whereas pooling layers reduce the dimensionality of the input feature maps. After the convolutional layers, the feature maps are flattened into a one-dimensional array and given as input to the fully connected layers; after the fully connected layers, the output layer predicts the class of the input image. Our proposed architecture follows the same approach, where the earlier layers are convolutional. After the convolutional layers, a flatten layer is used, followed by a series of fully connected layers. At the end, the sigmoid function is used to predict the likelihood of the predicted output. Batch normalization is used after certain layers to stabilize the training process, whereas average pooling decreases the dimensionality of the feature maps over the subsequent layers. The black-box diagram of the proposed architecture can be seen in Figure 1.
The proposed architecture reads input images with a height and width of 160 pixels each and a batch size of 64. Various data augmentation techniques, such as rescaling the input array, rotating the input image randomly between 0 and 360 degrees, horizontal and vertical flips, a shear range of 0.2, and a zoom range of 0.2, are applied using the Keras preprocessing library.
Thus, the proposed architecture accepts input images of size (160,160,3) with all the data augmentation techniques applied. The flow diagram of the proposed architecture can be seen in Figure 2. At the first layer, 2D convolution operations are performed on the input using a filter size of (3,3) and 8 different filters, with Leaky ReLU as the activation function. Since the first layer extracts high-level features of the input images, the filter size is kept small, i.e., (3,3), instead of a larger filter such as (5,5) or (7,7). With this, we have the initial feature maps extracted from the input images, but the distributions of input batches can vary considerably depending on the types of images included in them. This can create problems with the convergence of the optimization algorithm and destabilize the training process; it is therefore helpful when the input to each layer is unit Gaussian. To achieve this, the feature maps are batch normalized, which speeds up training (faster convergence) and decreases the dependency on weight initialization.
The batch-normalized tensor of size (160,160,8) is then passed to the next block, which contains two convolutional layers performing convolution operations with a filter size of (3,3) and 16 different filters each, with Leaky ReLU as activation. This allows us to extract deeper features that can be more meaningful in detecting deepfake images. The extracted feature maps are once again batch normalized. Generally, in deeper CNNs, a larger number of filters is used in the deeper layers to extract deep features. Because of this, the dimensions of the feature maps keep increasing, requiring more and more computation as we proceed. To tackle this issue, pooling layers are used to decrease the dimensionality of the extracted feature maps. With this goal in mind, we use an average pooling layer of size (2,2), which essentially halves the dimensions of the feature maps. The output of this block, after the average pooling layer, is of size (80,80,16). This is accepted as input by the next block, which has a similar structure and differs only in having three convolutional layers with a filter size of (3,3) and 32 different filters, with Leaky ReLU as activation. Batch normalization and average pooling follow once again, and with this pooling layer the feature map dimension becomes (40,40,32). It is then taken as input by the next block.

FIGURE 2: Flow diagram of the proposed model.

This next block consists of four consecutive convolutional layers with a filter size of (3,3) and 64 different filters, with Leaky ReLU as activation, again followed by batch normalization and an average pooling layer. With this, the following block receives an input of size (20,20,64). Through the previous four blocks we have extracted deep image features that can be used to classify images as deepfake or not, so for the next two blocks we use a larger filter of size (5,5). The current block uses a convolutional layer with a (5,5) filter and 128 different filters, with Leaky ReLU as activation, followed by batch normalization and a max pooling layer, which reduces the output dimensions to (10,10,128). The next block accepts this output and applies a convolutional layer with a (5,5) filter size and 256 different filters, with Leaky ReLU as activation, again followed by batch normalization and a max pooling layer, which gives an output dimension of (5,5,256).
The output from the previous block is transformed into a one-dimensional array using the flatten layer. Following the flatten layer, there is a dropout layer with a value of 0.5, which randomly sets half of the input units to zero; it helps our model avoid overfitting the training data. Being an improvement over MesoNet, the dropout value has not been changed from its predecessor, which experimentally also yields the best results in terms of avoiding overfitting. Next, there is a fully connected layer with 32 neurons/units, also utilizing Leaky ReLU as the activation function, followed by a dropout layer with a value of 0.5. Similarly, there are two consecutive blocks of fully connected layers with 16 neurons/units and the Leaky ReLU activation function, each followed by a dropout layer with a value of 0.5. Finally, there is an output layer with a single neuron and a sigmoid activation function, which predicts whether the input image is a deepfake: if the value is less than 0.5, the predicted output is real; otherwise, it is a deepfake image. The loss function used during training is binary cross-entropy, and the optimizer is Adam with a learning rate of 0.01. The black-box diagram of the architecture can be seen in Figure 1, and Table 2 describes the output dimensions of each layer along with the number of parameters. A code sketch of this architecture follows.
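The following is a Keras sketch of the D-CNN, reconstructed from the description above and the layer dimensions in Table 2. Details not stated in the paper, such as the LeakyReLU slope, are assumptions; "same" padding is implied by the unchanged spatial sizes in Table 2.

# A sketch of the proposed D-CNN in Keras, reconstructed from Table 2.
from tensorflow.keras import Sequential, layers
from tensorflow.keras.optimizers import Adam

model = Sequential()
model.add(layers.Input(shape=(160, 160, 3)))

# Convolutional blocks: (number of conv layers, filters, kernel, pooling).
blocks = [
    (1, 8,   (3, 3), None),     # input conv block, no pooling (Table 2, 1-2)
    (2, 16,  (3, 3), "avg"),
    (3, 32,  (3, 3), "avg"),
    (4, 64,  (3, 3), "avg"),
    (1, 128, (5, 5), "max"),
    (1, 256, (5, 5), "max"),
]
for n, filters, kernel, pool in blocks:
    for _ in range(n):
        model.add(layers.Conv2D(filters, kernel, padding="same"))
        model.add(layers.LeakyReLU())        # assumed default slope
    model.add(layers.BatchNormalization())
    if pool == "avg":
        model.add(layers.AveragePooling2D((2, 2)))
    elif pool == "max":
        model.add(layers.MaxPooling2D((2, 2)))

# Classifier head: flatten (5*5*256 = 6400), dropout 0.5, dense 32/16/16,
# and a single sigmoid output neuron (Table 2, layers 24-35).
model.add(layers.Flatten())
model.add(layers.Dropout(0.5))
for units in (32, 16, 16):
    model.add(layers.Dense(units))
    model.add(layers.LeakyReLU())
    model.add(layers.Dropout(0.5))
model.add(layers.Dense(1, activation="sigmoid"))

model.compile(optimizer=Adam(learning_rate=0.01),
              loss="binary_crossentropy", metrics=["accuracy"])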
V. RESULTS AND DISCUSSION
This section discusses the performance delivered by the proposed architecture and the results achieved.

A. SIMULATION SETUP
Google Colab Pro was used for training, which usually assigns a Tesla T4 or Tesla P100 GPU. Since Google Colab restricts prolonged GPU usage, checkpointing was used during training to save the best-performing model based on the lowest validation loss value. If necessary, training could be resumed from the last best saved model, but this was never needed.

B. DATASET DESCRIPTION
The dataset we used was part of the Deepfake Images Detection and Reconstruction Challenge [36]. It consists of real images from the CelebA and FFHQ image datasets, with 5000 images each, whereas 1000 images each from the GDWCT, AttGAN, StarGAN, StyleGAN, and StyleGAN2 datasets are included for deepfake detection.

TABLE 2: Output dimensions and parameters for each layer.

Layer | Layer Type | Output Dimension | No. of Parameters
1 | Input (Convolution 2-D) | (160 x 160 x 8) | 224
2 | Batch Normalization | (160 x 160 x 8) | 32
3 | Convolution 2-D | (160 x 160 x 16) | 1168
4 | Convolution 2-D | (160 x 160 x 16) | 2320
5 | Batch Normalization | (160 x 160 x 16) | 64
6 | Average Pooling 2-D | (80 x 80 x 16) | 0
7 | Convolution 2-D | (80 x 80 x 32) | 4640
8 | Convolution 2-D | (80 x 80 x 32) | 9248
9 | Convolution 2-D | (80 x 80 x 32) | 9248
10 | Batch Normalization | (80 x 80 x 32) | 128
11 | Average Pooling 2-D | (40 x 40 x 32) | 0
12 | Convolution 2-D | (40 x 40 x 64) | 18496
13 | Convolution 2-D | (40 x 40 x 64) | 36928
14 | Convolution 2-D | (40 x 40 x 64) | 36928
15 | Convolution 2-D | (40 x 40 x 64) | 36928
16 | Batch Normalization | (40 x 40 x 64) | 256
17 | Average Pooling 2-D | (20 x 20 x 64) | 0
18 | Convolution 2-D | (20 x 20 x 128) | 204928
19 | Batch Normalization | (20 x 20 x 128) | 512
20 | Max Pooling 2-D | (10 x 10 x 128) | 0
21 | Convolution 2-D | (10 x 10 x 256) | 819456
22 | Batch Normalization | (10 x 10 x 256) | 1024
23 | Max Pooling 2-D | (5 x 5 x 256) | 0
24 | Flatten | (6400) | 0
25 | Dropout | (6400) | 0
26 | Dense | (32) | 204832
27 | Leaky ReLU | (32) | 0
28 | Dropout | (32) | 0
29 | Dense | (16) | 528
30 | Leaky ReLU | (16) | 0
31 | Dropout | (16) | 0
32 | Dense | (16) | 272
33 | Leaky ReLU | (16) | 0
34 | Dropout | (16) | 0
35 | Dense | (1) | 17

Total Parameters: 1,388,177
Trainable Parameters: 1,387,169
Non-Trainable Parameters: 1,008

Since the images provided are taken from different types of GAN architectures and datasets, images from the different sources had different resolutions, ranging from 1024x1024 at the largest to 178x218 at the smallest. The resolutions of the images are listed in Table 3.

TABLE 3: Resolution of images from each data source.
Type of Image | Dataset | Resolution | No. of Images
Deepfake | GDWCT | 216 x 216 | 1000
Deepfake | AttGAN | 256 x 256 | 1000
Deepfake | StarGAN | 245 x 256 | 1000
Deepfake | StyleGAN | 1024 x 1024 | 1000
Deepfake | StyleGAN2 | 1024 x 1024 | 1000
Real | CelebA | 178 x 218 | 5000
Real | FFHQ | 1024 x 1024 | 5000

Thus, there were 10000 real images and 5000 deepfake images. To make a balanced set, we decided to use only 5000 real images: we randomly sampled 2500 images from CelebA and 2500 from FFHQ, giving a total of 5000 randomly sampled real images from these two sources alongside the 5000 deepfake images.
We divided the image dataset into train, validation, and test sets: 60% of the images are used for training, 10% for validation, and 30% for testing (the sets are properly balanced). First, 70% of the real images were randomly sampled from both data sources: 1750 images from CelebA and 1750 from FFHQ, making 3500 real images. Out of these 3500, we selected every 10th image to be reserved for the validation set, which preserved the ratio of real images from both data sources and gave 350 real images for validation. We then followed the same strategy with the deepfakes: we sampled 70% of each type of GAN image, i.e., 700 images each from GDWCT, AttGAN, StarGAN, StyleGAN, and StyleGAN2, giving a well-balanced set of 3500 deepfake images, and we similarly selected every 10th image from this training set for the validation set, giving 350 deepfake images. Thus, we have 3150 real and 3150 deepfake images for training, along with 350 real and 350 deepfake images for validation, and the remaining 30% of the images were used for testing the model's performance after training. A sketch of this splitting strategy is shown below.
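The following is a minimal sketch of the split just described, applied to one data source; the file naming, shuffling seed, and directory layout are assumptions, since the paper only specifies the proportions and the every-10th-image rule.

# A sketch of the 70%/30% split with every 10th training image held out
# for validation; paths and seed are illustrative assumptions.
import random

def split_source(files, train_frac=0.7, val_every=10, seed=42):
    # Split one data source's file list into train/val/test lists.
    random.Random(seed).shuffle(files)
    n_train = int(len(files) * train_frac)
    train_pool, test = files[:n_train], files[n_train:]
    # Every 10th image of the training pool is reserved for validation.
    val = train_pool[::val_every]
    train = [f for i, f in enumerate(train_pool) if i % val_every != 0]
    return train, val, test

# e.g., 2500 CelebA images -> 1575 train, 175 val, 750 test for this source
celeba_train, celeba_val, celeba_test = split_source(
    [f"celeba/{i}.jpg" for i in range(2500)])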

FIGURE 3: Training accuracy and Training Loss over the training epochs

FIGURE 4: Validation accuracy and Validation Loss over the training epochs

For training purposes, data augmentation was applied to these images. The augmentations include vertical flipping, horizontal flipping, zooming by 0.2, a shear range of 0.2, width and height shift ranges of 0.2, and random rotation of up to 360 degrees. These help the model learn to detect deepfake images while maintaining spatial and scale invariance. Since the training images consisted only of upright faces positioned at the center of the image, there was a high possibility that the D-CNN model would learn to discriminate between DF and real images based only on features at the center of the image, and only for upright faces. Data augmentation techniques were therefore used to ensure the dataset contained facial images from different angles, at different spatial positions within the image, and at different scales. This helps the model learn spatially and scale-invariant features, which are of utmost importance for a DF detection system in the wild; one such augmentation configuration is sketched below.
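The following is a sketch of the Keras augmentation configuration listed above. The parameter values are the ones stated in the text; the rescaling factor and the directory layout are assumptions.

# A sketch of the data augmentation configuration described above.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(
    rescale=1.0 / 255,        # assumed rescaling of the input array
    rotation_range=360,       # random rotation between 0 and 360 degrees
    horizontal_flip=True,
    vertical_flip=True,
    shear_range=0.2,
    zoom_range=0.2,
    width_shift_range=0.2,
    height_shift_range=0.2,
)

train_generator = train_datagen.flow_from_directory(
    "data/train",             # assumed layout: data/train/<class>/...
    target_size=(160, 160),
    batch_size=64,
    class_mode="binary",
)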
The input image size was set to 160 x 160. Historically, detecting low-resolution and low-quality deepfake images has been considered a difficult task, since there is much less information to work with. In addition, conventional social media sites downscale high-resolution images to reduce transmission and storage costs. Hence, a CNN with an input size of (160,160,3) was selected in the hope of ensuring the model's usefulness in real-world use cases; low-resolution inputs also help keep computational costs to a minimum. There is scope for future work, either by moving to a variable-sized-input network with global pooling layers or by experimenting with various efficient upscaling techniques to see whether performance improves.

C. TRAINING
During training, the Adam optimizer is used with a learning rate of 0.01, and the number of epochs is 550. Due to the hardware and usage-time limitations of Google Colab, checkpointing and a CSVLogger have been used to record the training accuracy and loss as well as the validation accuracy and loss during the training phase. The batch size is set to 64, which stabilises the training phase considerably. We save the entire model instead of the weights only; a sketch of these callbacks is given below.
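The following is a sketch of the checkpointing and logging setup described above, reusing the model and generators from the earlier sketches; the output file names are assumptions.

# A sketch of the checkpointing and CSV logging used during training.
from tensorflow.keras.callbacks import ModelCheckpoint, CSVLogger

callbacks = [
    # Save the entire best-performing model (not just the weights), judged
    # by the lowest validation loss, so training can resume after a timeout.
    ModelCheckpoint("dcnn_best.h5", monitor="val_loss",
                    save_best_only=True, save_weights_only=False),
    # Record per-epoch training/validation accuracy and loss.
    CSVLogger("training_log.csv", append=True),
]

history = model.fit(train_generator, validation_data=val_generator,
                    epochs=550, callbacks=callbacks)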
From Figure 3, we can see that the training accuracy steadily increases until the 200th epoch, after which the change in accuracy slowly plateaus over the subsequent epochs. The same can be seen in the loss values during the training phase. Even though the change is not huge, performance slowly
increases over the epochs. The same trend is also seen in the validation accuracy and loss values in Figure 4. Fluctuations can be seen in the validation accuracy and loss; the most probable reason is the use of batch normalization together with the dropout layer, which aggravates the situation. Apart from those fluctuations, it can be seen in Figure 4 that the validation accuracy and validation loss closely follow the same trend as the training accuracy and training loss values. Although the validation accuracy and loss fluctuate slightly, they closely follow the training loss value, indicating no overfitting in the model. It can be observed from Figure 5 that there is not much numerical difference between the training loss and the validation loss, which is a good indication of a generalized rather than an overfitting model.

FIGURE 5: Validation accuracy and loss values over the training epochs.

D. EVALUATION METRICS
The proposed model yields 97.2% accuracy on the test dataset, which consists of 1500 real and 1500 deepfake images with all the data augmentation techniques applied. As already seen during the training phase on the validation set, the model's performance is relatively good and there are no signs of overfitting. In addition, the accuracy on the testing dataset shows that the model has been trained properly and is not overfitted to the training dataset: the accuracy on the training, validation, and testing datasets was around 97%. Along with accuracy, the precision, recall, and F1-score values are used to evaluate the model's performance, and we present the confusion matrix to understand the classification capabilities of the model; a code sketch computing these metrics follows the list.

• Precision refers to the model's ability to classify positives correctly out of all positive predictions made. It is a metric that indicates how many images were truly deepfake out of all the images predicted as deepfake by the proposed model. Truly deepfake images classified as deepfake are considered true positives; in contrast, images predicted as deepfake that were truly real are false positives. With the true positives denoted TP and the false positives FP, the precision formula is

$$\text{Precision} = \frac{TP}{TP + FP} \qquad (7)$$

The precision we get for our model on the test set is 0.97 for real images, whereas the precision for deepfake images is 0.98.

• Recall refers to the model's ability to classify true positives correctly. It is a metric that indicates how many images were classified as deepfake out of all the truly deepfake images submitted to the model. Images that were truly deepfake and classified as deepfake are considered true positives, and those that were truly deepfake but misclassified as real are false negatives. With TP denoting true positives and FN false negatives, the recall formula is

$$\text{Recall} = \frac{TP}{TP + FN} \qquad (8)$$

The recall we get for our model on the test set is 0.98 for real images, whereas the recall for deepfake images is 0.97.

• F1 score indicates the balance between precision and recall: it is the harmonic mean of the precision and recall of the proposed approach, taking both false positives and false negatives into consideration. The F1 score is calculated as follows:

$$F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (9)$$

The F1 score for both classes is 0.97, indicating a good balance between the precision and recall values.
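The following is a sketch of how metrics (7)-(9) and the confusion matrix can be computed with scikit-learn; it assumes the model and a non-shuffled test generator from the earlier sketches.

# A sketch of the evaluation: per-class precision/recall/F1 plus the
# confusion matrix; test_generator must be created with shuffle=False so
# that its .classes align with the prediction order.
from sklearn.metrics import classification_report, confusion_matrix

y_true = test_generator.classes                      # 0 = real, 1 = deepfake
y_prob = model.predict(test_generator, verbose=0)    # predicted likelihoods
y_pred = (y_prob.ravel() >= 0.5).astype(int)         # rounded labels

# Precision, recall, and F1 per class, plus macro/weighted averages.
print(classification_report(y_true, y_pred, target_names=["Real", "Deepfake"]))
print(confusion_matrix(y_true, y_pred))              # [[TN, FP], [FN, TP]]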

The classification report of our model's performance over the test dataset can be seen in Table 4, showing the precision, recall, f1-score, macro average, and weighted average. From all these results, the model's precision, recall, and f1-score show promising results. To further understand the classification capabilities of the proposed model, we have also generated a confusion matrix, shown in Figure 6, where '0' indicates a real and '1' a deepfake label. The true label in the figure is the label actually assigned to an image, whereas the predicted label is the label predicted by our model. These create four categories: the category with '0' as the true label and '0' as the predicted label is a true negative, since both labels suggest a real image, i.e., the image is classified as negative for deepfake, which is correct. Similarly, the category with both the true and predicted labels as '1' is a true positive. The category with '0' as the true label and '1' as the predicted label is a false positive, since these are real images misclassified as deepfake by our proposed model. Likewise, the category with '1' as the true label and '0' as the predicted label is a false negative, since these are truly deepfake images predicted as real.

TABLE 4: Classification report.
                 | Precision | Recall | f1-score | Support
Real             | 0.97      | 0.98   | 0.97     | 1500
Deepfake         | 0.98      | 0.97   | 0.97     | 1500
Macro Average    | 0.97      | 0.97   | 0.97     | 3000
Weighted Average | 0.97      | 0.97   | 0.97     | 3000

FIGURE 6: Confusion matrix.

In simpler terms, the confusion matrix shows that, out of 1500 real test images, our model classified 1471 correctly, whereas 29 real images were misclassified as deepfake; out of 1500 deepfake images, it classified 1450 correctly and misclassified 50 as real (Figure 6). As a check, this gives a deepfake-class precision of 1450/(1450 + 29) ≈ 0.98 and a deepfake-class recall of 1450/1500 ≈ 0.97, matching Table 4.

VI. DISCUSSION
To further understand the proposed model's performance, we extend our analysis by evaluating the model over the images of each data source separately. This gives more insight into the generalizability capabilities of the proposed model. In the test set, we had 1500 deepfake images from 5 GAN architectures, i.e., 300 deepfake images from each data source. We then combined the images from each data source with real images separately: we randomly sampled 300 real images, 150 from CelebA and 150 from FFHQ, and evaluated each subset individually. Our model yielded 98.33% accuracy on AttGAN vs CelebA+FFHQ images, 99.33% on GDWCT vs CelebA+FFHQ, 95.33% on StyleGAN vs CelebA+FFHQ, 94.67% on StyleGAN2 vs CelebA+FFHQ, and, finally, 99.17% on StarGAN vs CelebA+FFHQ images (a sketch of this per-source evaluation is given below).
CelebA+FFHQ images.
We then evaluated the proposed model on the imbalanced
set. We already had 300 images for each data source stored
separately. We fed all 300 deepfake images separately for
each data source to see our model’s performance. Our model
gave complete 100% accuracy in classifying deepfake images
generated from AttGAN with a loss value of 0.0051, whereas
on GDWCT, it gave an accuracy of 99.33% with a loss value
of 0.0141. Our model performed well over StyleGAN and
StyleGAN2 with an accuracy of 95.66% and 93.99%. In con-
trast, our model gave 99.33% accuracy in classifying images
generated using StarGAN. Table 5 presents the performance
of the proposed model under different image databases with
real images.
The model indicates promising results over the reserved
Test set images. The model’s performance is balanced over
all the different image data sources. When we evaluated our
FIGURE 6: Confusion matrix. model over all the different data sources separately, we got
more insight into the model’s performance. It is essential to
understand that the accuracy of the combined data set might
out of 1500 Real test images, our model has classified 1471 look promising, but the model might lack performance over
images correctly, whereas 29 Real images were misclassified certain kinds of images. When we look into it that way, it is
as Deepfake. And out of 1500 Deepfake images, our model seen that our model shows extraordinary performance over
classifies 1450 images correctly and misclassifies 50 images the images from AttGAN, GDWCT, and StarGAN. In con-
as Real. Figure 6. trast, performance drops a bit over images from StyleGAN
and StyleGAN2.
VI. DISCUSSION
When investigated further, it is found that StyleGAN and
To further understand the proposed model’s performance.
StyleGAN2 images are very high-resolution images, whereas
We extend our analysis by evaluating our model over images
images from AttGAN, GDWCT, and StarGAN are low-
of all these data sources separately. It will allow us to un-
resolution images. Thus, it suggests that our model performs
derstand more about the generalizability capabilities of the
extraordinarily over low-resolution images but drops a bit
proposed model. So, in the test set, we had 1500 deepfake
(not much) over high-resolution images. Although still, the
images from 5 GAN architectures. Thus, it means we had
performance is quite promising and impressive, even for
300 deepfake images from each data source. When then
high-resolution images. But the overall performance, consid-
combined these images from different data sources with real
ering the images with such different data sources and resolu-
tions, is still pretty impressive. Some of the results are shown
TABLE 5: Performance of the proposed model on individual in Figure 7. As it can be seen in the figure, model outputs
data source. it’s results in terms of confidence score which essentially is
probability of that image being a deepfake image or not. If
Subset of Test set Proposed Model the model confidence score is closer to ’0’, it is extremely
AttGAN images + Real images 98.33%
confident about the image being real and vice versa. When
the confidence score comes closer to ’0.5’, it indicates that
GDWCT images + Real images 99.33% the model is bit confused. And it can be seen in the figure
for misclassified Deepfake images and misclassified Real
StyleGAN images + Real images 95.33%
images, the confidence score is closer to ’0.5’. Initial analysis
StyleGAN2 + Real images 94.67% has suggested that since there are manipulation traces and
little blurriness left behind for deepfake images, the neu-
StarGAN + Real images 99.17%
rons activation suggests that background areas are activated
strongly than facial features such as the eyes, nose, and mouth. This is inverted completely for real images, where the eye, nose, and mouth areas are strongly activated. It suggests that the eyes, nose, and mouth are far more detailed in real images, and this becomes the basis of the discriminating capabilities of the proposed system.

VII. EXPERIMENT
To compare the performance and generalizability capabilities of the proposed model with existing models, we perform a small experiment where we test the proposed model against MesoNet and the MesoInception network over the CelebDF dataset. In the literature, it is considered a challenging dataset for deepfake detection. Since none of the models are trained over this dataset, it provides an ideal condition to test the generalizability capabilities of these three models. Figure 8 shows the experimental setup of the proposed model.
The CelebDF dataset consists of 795 deepfake videos and 408 real videos; the real videos are divided into 158 real videos of celebrities provided by the authors and 250 YouTube videos. We decided to work on the 795 deepfake videos and the 158 real videos. To simulate deepfake detection in the wild, for both the real and the deepfake videos, we extracted every 50th frame of each video. We performed face detection using the Haar cascade algorithm, selected for its excellent capability of identifying faces irrespective of their scale and location within the image. The detected faces were cropped and stored, resulting in a total of 4877 facial images, of which 3816 were deepfake images and 1061 were real images; a sketch of this pipeline is given below.
1061 were real images. [2] K. N. Ramadhani and R. Munir, “A comparative study of deepfake video
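A minimal sketch of this preprocessing step is given below, assuming OpenCV and its bundled frontal-face Haar cascade. The 50-frame stride follows the description above, while the function name, output naming scheme, and detector parameters are our own illustrative choices.

```python
import os
import cv2

# Frontal-face Haar cascade shipped with OpenCV.
CASCADE = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def extract_faces(video_path: str, out_dir: str, stride: int = 50) -> int:
    """Extract every `stride`-th frame, detect faces, crop, and save them."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    saved, idx = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break  # end of video
        if idx % stride == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            # detectMultiScale returns (x, y, w, h) boxes at any scale/location.
            for (x, y, w, h) in CASCADE.detectMultiScale(gray, 1.1, 5):
                crop = frame[y:y + h, x:x + w]
                cv2.imwrite(os.path.join(out_dir, f"face_{idx}_{saved}.png"), crop)
                saved += 1
        idx += 1
    cap.release()
    return saved
```

Running this over all selected CelebDF videos would populate the real and deepfake face-crop folders used in the comparison below.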
We also implemented the MesoNet and MesoInception networks on our local system and imported the pretrained weights provided by their authors. The resulting accuracies for MesoNet, the MesoInception network, and the proposed model are 57%, 50.73%, and 77%, respectively. MesoNet delivers 89% accuracy on its native test set of Face2Face images and MesoInception delivers 91%, so on CelebDF their accuracies drop to 57% and 50.73%, respectively. Table 6 shows the experimental results of the proposed model alongside the existing models. It reflects how achieving generalizability is an arduous task but, at the same time, of utmost importance for a real-world use case. This drop in accuracy can also be seen in the proposed model, but it still manages to hold its ground. There still lies scope for future work, which is discussed in the next section.

TABLE 6: Experiment results.

Model            Accuracy
MesoNet          57%
MesoInception    50.73%
Proposed model   77%
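The scoring protocol behind these numbers can be sketched as follows, assuming Keras models and a common 0.5 decision threshold. The helper name and the fixed input size are assumptions for illustration; each network under comparison expects its own input resolution, and loading the MesoNet/MesoInception weights is not shown here.

```python
import numpy as np
import tensorflow as tf

def accuracy_on_faces(model, face_crops, labels, input_size=(256, 256)):
    """Accuracy of a binary deepfake classifier on cropped face images.

    `face_crops` is a list of RGB crops of varying sizes; `labels`
    holds 1 for deepfake and 0 for real.
    """
    batch = np.stack([tf.image.resize(c, input_size).numpy()
                      for c in face_crops]) / 255.0
    probs = model.predict(batch, verbose=0).ravel()
    preds = (probs >= 0.5).astype(int)  # 0.5 decision threshold
    return float((preds == np.asarray(labels)).mean())
```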
VIII. FUTURE SCOPE
As already discussed above, for real-world use, an efficient deepfake detection method must be robust, generalizable, computationally efficient, and quick. Moreover, there is always a trade-off between accuracy and time latency. For the proposed model, a future direction could involve experimentation with a variable-input neural network along with global pooling, so that the resolutions of the input images are not downscaled; a sketch of this idea follows below. Furthermore, experimentation with various efficient image-upscaling algorithms and their effect on performance could also be analyzed for better insights.
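To make this direction concrete, the sketch below shows one way a Keras CNN can accept variable-resolution inputs: the spatial dimensions are left unspecified and the convolutional stack ends with global average pooling. The layer sizes are illustrative only and do not reproduce the proposed D-CNN.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_variable_input_cnn() -> tf.keras.Model:
    """Binary deepfake classifier that accepts any input resolution.

    (None, None, 3) leaves height and width unspecified, and
    GlobalAveragePooling2D collapses whatever spatial grid remains,
    so images need not be downscaled to a fixed size.
    """
    inputs = layers.Input(shape=(None, None, 3))
    x = layers.Conv2D(32, 3, activation="relu", padding="same")(inputs)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(64, 3, activation="relu", padding="same")(x)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(128, 3, activation="relu", padding="same")(x)
    x = layers.GlobalAveragePooling2D()(x)  # resolution-independent features
    outputs = layers.Dense(1, activation="sigmoid")(x)
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```

With unspecified spatial dimensions, each batch must still share a single resolution (or images must be fed one at a time), which ties back to the accuracy-versus-latency trade-off noted above.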
IX. CONCLUSION
It has always been challenging to detect deepfake content, as it is generated at a different level of abstraction. Deepfake detection has generally been treated as a binary classification problem with real and deepfake class labels, and a CNN is therefore a prominent solution for detecting deepfake images. Motivated by this, we have proposed a CNN-based architecture to detect deepfake images in this paper. The proposed architecture offers 97.2% accuracy considering images from five different data sources for deepfake images and two different data sources for real images. Even though there is a huge difference between the resolutions of these images, the proposed architecture provides a well-balanced performance over all data sources. The work can be further extended to classify video deepfake content: each video frame is extracted, the face is detected and cropped, and the crop is then fed to the model to identify deepfake manipulations, which can easily be done by creating a pipeline to process the video data. Thus, the proposed CNN-based model performs well and has a balanced performance over the given dataset with all the data augmentation techniques applied. Furthermore, it shows good generalizability and performance over unseen reserved test sets.
FIGURE 7: Classification of deepfake, real, and misclassified images by the proposed model.

FIGURE 8: Experiment setup for comparing the performance capabilities of the proposed and the existing models.

YOGESH PATEL has completed his Master of Technology in Computer Engineering from the Institute of Technology, Nirma University. He has an active interest in domains such as Deep Learning, Data Science, and Blockchain. He is working on presenting solutions to integrate Generative Adversarial Networks into adversarial learning techniques across a wide range of domains, such as Healthcare, Vehicular Networks, and emerging communication networks.

PRONAYA BHATTACHARYA (M'22) is currently employed as an Associate Professor with the Computer Science and Engineering Department, Amity School of Engineering and Technology, Amity University, Kolkata, India. He completed his PhD from Dr. A. P. J. Abdul Kalam Technical University, Lucknow, Uttar Pradesh, India. He has over ten years of teaching experience. He has authored or coauthored more than 100 research papers in leading SCI journals and top core IEEE COMSOC A* conferences. Some of his top-notch findings are published in reputed SCI journals, like IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, IEEE TRANSACTIONS ON VEHICULAR TECHNOLOGY, IEEE INTERNET OF THINGS JOURNAL, IEEE TRANSACTIONS ON NETWORK SCIENCE AND ENGINEERING, IEEE TRANSACTIONS ON COMPUTATIONAL SOCIAL SYSTEMS, IEEE TRANSACTIONS ON NETWORK AND SERVICE MANAGEMENT, IEEE ACCESS, IEEE SENSORS, IEEE INTERNET OF THINGS MAGAZINE, IEEE COMMUNICATION STANDARDS MAGAZINE, ETT (Wiley), Expert Systems (Wiley), CCPE (Wiley), FGCS (Elsevier), OQEL (Springer), WPC (Springer), ACM-MOBICOM, IEEE-INFOCOM, IEEE-ICC, IEEE-CITS, IEEE-ICIEM, IEEE-CCCI, and IEEE-ECAI. He has an H-index of 19 and an i10-index of 32. His research interests include healthcare analytics, optical switching and networking, federated learning, blockchain, and the IoT. He has been appointed in the capacity of keynote speaker, technical committee member, and session chair across the globe. He was awarded eight best paper awards at Springer ICRIC-2019, IEEE-ICIEM-2021, IEEE-ECAI-2021, Springer COMS2-2021, and IEEE-ICIEM-2022. He is a reviewer for 21 reputed SCI journals, like IEEE INTERNET OF THINGS JOURNAL, IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS, IEEE TRANSACTIONS ON VEHICULAR TECHNOLOGY, IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, IEEE ACCESS, IEEE NETWORK, ETT (Wiley), IJCS (Wiley), MTAP (Springer), OSN (Elsevier), WPC (Springer), and others. He is also an active member of ST Research Laboratory (www.sudeeptanwar.in).

SUDEEP TANWAR (M’15, SM’21) is currently
working as a Professor with the Computer Science
and Engineering Department, Institute of Tech-
nology, Nirma University, India. He is a Visiting
Professor at Jan Wyzykowski University, Polkow-
ice, Poland; and the University of Pitesti, Pitesti,
Romania. He has authored two books, edited 13
books, and more than 270 technical papers, in-
cluding top journals and top conferences, such
as IEEE TRANSACTIONS ON NETWORK SCI-
ENCE AND ENGINEERING, IEEE TRANSACTIONS ON VEHICULAR
TECHNOLOGY, IEEE TRANSACTIONS ON INDUSTRIAL INFOR-
MATICS, IEEE WIRELESS COMMUNICATIONS, IEEE Network, ICC,
GLOBECOM, and INFOCOM. He initiated the research field of blockchain
technology adoption in various verticals, in 2017. His H-index is 58. He ac-
tively serves his research communities in various roles. His research interests
include blockchain technology, wireless sensor networks, fog computing,
smart grid, and the IoT. He is a member of the Technical Committee on
Tactile Internet of the IEEE Communication Society. He is a Senior Member
of CSI, IAENG, ISTE, and CSTA. He has been awarded the Best Research
Paper Awards from IEEE GLOBECOM 2018, IEEE ICC 2019, and Springer
ICRIC-2019. He has served many international conferences as a member
of the organizing committee, such as the Publication Chair for FTNCT-
2020, ICCIC 2020, and WiMob2019; a member of the Advisory Board
for ICACCT-2021 and ICACI 2020; the Workshop Co-Chair for CIS 2021;
and the General Chair for IC4S 2019 and 2020 and ICCSDF 2020. He is
serving on the editorial boards for Frontiers of Blockchain, Cyber Security
and Applications, Computer Communications, the International Journal of
Communication Systems, and Security and Privacy.

RAJESH GUPTA is working as an Assistant Professor at Nirma University, Ahmedabad, Gujarat, India. He received his Ph.D. in Computer Science and Engineering in 2023 from Nirma University under the supervision of Dr. Sudeep Tanwar. He received his Bachelor of Engineering in 2008 from the University of Jammu, India, and his Master's in Technology in 2013 from Shri Mata Vaishno Devi University, Jammu, India. He has authored/co-authored several publications (including papers in SCI-indexed journals and IEEE ComSoc-sponsored international conferences). Some of his research findings are published in top-cited journals and conferences such as IEEE Transactions on Industrial Informatics, IEEE Transactions on Network and Service Management, IEEE Transactions on Network Science and Engineering, IEEE Transactions on Green Communications and Networking, IEEE Transactions on Computational Social Systems, IEEE Network Magazine, IEEE Internet of Things Journal, IEEE IoT Magazine, Computer Communications, Computers and Electrical Engineering, International Journal of Communication Systems (Wiley), Transactions on Emerging Telecommunications Technologies (Wiley), Physical Communication (Elsevier), IEEE ICC, IEEE INFOCOM, IEEE GLOBECOM, IEEE CITS, and many more. His research interests include Device-to-Device Communication, Network Security, Blockchain Technology, 5G Communication Networks, and Machine Learning. His h-index is 27 and his i10-index is 37. He is a recipient of a Doctoral Scholarship from the Ministry of Electronics and Information Technology, Govt. of India, under the Visvesvaraya Ph.D. Scheme, and of a Student Travel Grant from WICE-IEEE to attend IEEE ICC 2021 held in Canada. He has been awarded best research paper awards from IEEE ECAI 2021, IEEE ICCCA 2021, IEEE IWCMC 2021, and IEEE SCIoT 2022. His name has been included in the list of the Top 2% scientists worldwide published by Stanford University, USA, consecutively in 2021 and 2022. He was felicitated by Nirma University for his research achievements in 2019-20 and 2021-22. He is also an active member of ST Research Laboratory (www.sudeeptanwar.in).

INNOCENT EWEAN DAVIDSON (M'92, SM'02), IEEE, USA; Fellow, Institute of Engineering and Technology, UK; Fellow, South African Institute of Electrical Engineers; Chartered Engineer, UK; registered Professional Engineer (P Eng.), Engineering Council of South Africa. He received the Bachelor of Engineering, BSc (Eng.) with Honours, and Masters of Engineering, MSc (Eng.), degrees in Electrical Engineering from the University of Ilorin, in 1984 and 1987; the Doctor of Philosophy, PhD, in Electrical Engineering from the University of Cape Town, 1998; a Post-graduate Diploma in Business Management from the University of KwaZulu-Natal, 2004; an Associate Certificate in Sustainable Energy Management (SEMAC) from the British Columbia Institute of Technology, Burnaby, BC, Canada, 2011; and a Course Certificate in Artificial Intelligence from the University of California at Berkeley, USA, in 2020. He is a Full Professor and the Chair, Department of Electrical Power Engineering; Research Leader, Smart Grid Research Centre; and Program Manager, DUT-DSI Space Science and CNS Research Program, Durban University of Technology, Durban, South Africa. His current research interests include HVdc power transmission, grid integration of renewable energy, applied artificial intelligence, and space technology. He has managed over US$3 million in research funds and is the recipient of numerous international best paper awards and of awards at DUT's annual research and innovation events. Prof. Davidson has supervised 5 Post-doctoral Research Fellows and has graduated 55 PhD/Masters students and over 1200 engineers, technologists, and technicians. He is the author/co-author of over 350 technical papers in accredited journals and peer-reviewed conference proceedings. He is a member of the Western Canada Group of Chartered Engineers (WCGCE); the Institute of Engineering and Technology (IET Canada), British Columbia Chapter; and the IEEE Collabratec Communities on Smart Cities and IEEE (South Africa Chapter). He was the General Chair of the 30th IEEE Southern Africa Universities Power Engineering Conference, 2022. He is the Host & Convenor of the DSI-DUT-SANSA-ATNS Space Science and CNS Symposium, and a Guest Speaker in several forums, including the Science Forum of South Africa and the International Conference on Sustainable Development.

TURKI M. ALSUWIAN was awarded a B.Sc. degree in Electrical Engineering from King Saud University, Riyadh, Saudi Arabia. He worked as an electrical power engineer at a Saudi Electricity company from April 2004 until January 2009. In 2011, he received an M.Sc. degree in Electrical Engineering from Gannon University, Pennsylvania, USA. In 2018, he obtained a Ph.D. degree in Electrical Engineering from the University of Dayton, Ohio, USA. Currently, he is an Assistant Professor in the Electrical Engineering Department, Najran University, Saudi Arabia. His main research interests are applied control in different fields such as flight control, power quality control, power electronics control, communication control, adaptive control, modeling control, and artificial intelligence.

THOKOZILE F. MAZIBUKO obtained the National
Diploma in Engineering (Electrical) in 2008 from
Durban University of Technology. She served
as a Tutor/Students Assistant at DUT in 2006-
2007; Engineering Trainee at Anglo Platinum,
2007-2008. She joined Anglo American/Hatch in
Rustenburg, South Africa as Planner Assistant
and Quality Control Coordinator from 2008-2009.
She obtained Bachelor of Technology (Electrical)
from Tshwane University of Technology (TUT) in
2011. From 2013 to 2014, she pursued the Master's degree in Electrical Engineering at TUT and RWTH Aachen University, Germany, conducting research on the implementation of a real-time platform, namely the application of PTP-synchronised PMUs to power system small-signal stability and the transient stability analysis of a multi-machine system based on synchro-phasors, and was awarded the Master's degree (Cum Laude) from TUT. She was with
Council for Scientific and Industrial Research (CSIR), Pretoria, until 2016,
and joined Rand Water, Johannesburg in 2017-2018. She was employed as
a Lecturer at University of Johannesburg, 2018-2020, and joined DUT in
January 2021 as a Lecturer. She is currently a PhD student at DUT. Her
research interests are in Smart Micro-Grids; network optimization, control,
and applied artificial intelligence.
