An Improved Dense CNN Architecture for Deepfake Image Detection

This is the author's version, which has not been fully edited; content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2023.3251417
Corresponding authors: Sudeep Tanwar (e-mail: sudeep.tanwar@nirmauni.ac.in), Pronaya Bhattacharya (e-mail: pbhattacharya@kol.amity.edu), and Rajesh Gupta (e-mail: rajesh.gupta@nirmauni.ac.in)
ABSTRACT Recent advancements in computer vision processing have produced potent tools for creating realistic deepfakes. A generative adversarial network (GAN) can fake captured media streams, such as images, audio, and video, and make them visually fit other environments. The dissemination of fake media streams creates havoc in social communities and can destroy the reputation of a person or a community. Moreover, it manipulates public sentiments and opinions toward the person or community. Recent studies have suggested using the convolutional neural network (CNN) as an effective tool to detect deepfakes in the network, but most techniques cannot capture the inter-frame dissimilarities of the collected media streams. Motivated by this, this paper presents a novel and improved deep-CNN (D-CNN) architecture for deepfake detection with reasonable accuracy and high generalizability. Images from multiple sources are captured to train the model, improving its overall generalizability. The images are re-scaled and fed to the D-CNN model. Binary cross-entropy and the Adam optimizer are utilized to improve the learning of the D-CNN model. We have considered seven different datasets from the Deepfake Images Detection and Reconstruction Challenge, with 5000 deepfake images and 10000 real images. The proposed model yields an accuracy of 98.33% on AttGANa, 99.33% on GDWCTb, 95.33% on StyleGAN, 94.67% on StyleGAN2, and 99.17% on StarGANc real and deepfake images, which indicates its viability in experimental setups.
a Facial attribute editing by only changing what you want (AttGAN)
b Group-wise deep whitening-and-coloring transformation (GDWCT)
c A GAN capable of learning mappings among multiple domains (StarGAN)
A GAN consists of a generator network (which creates synthetic content out of a noise vector) and a discriminator network (which aims to classify these generated synthetic images). An iterative process is followed in the generator-discriminator network, where the discriminator feedback is supplied to the generator network. Over time, the generator learns to create synthetic content that looks extremely real and spoofs the discriminator [4] [5]. Thus, the generator-discriminator network in DF raises concerns about the authenticity of the published content on social platforms, as it is tough to differentiate between real and fake content. Some notable examples of DF tools include DFaker, DeepFaceLab, Faceswap, Faceswap-GAN, STGAN, StarGAN, Face Swapping GAN (FSGAN), and many others [6]. DeepFaceLab allows a user to swap a person's face with another person's face, change the age of a person, and synchronize the lip and eye movements in the video [7]. Face2Face [8] allows real-time face enactment based on RGB video output, emulating the input expressions. DF tools are also used to generate pornographic content that hurts public sentiment [9] [10]. Hate speech is another widely used form of propaganda in social circles. For example, a video of former United States 44th President Barack Hussein Obama II published by BuzzFeed shows the former president cursing former President Donald Trump, which was done through GAN technology. It was massively distributed in social media circles as official news, but the content is synthetic [11].

Thus, it raises a prime concern about the authenticity of news content. To overcome the aforementioned issues of DF GANs, a robust and highly generalizable DF detection system is required. A good DF detection system can distinguish highly accurate manipulated and synthetic content from authentic content. Recent approaches published in the literature point to the design of a robust DF detection scheme, but most approaches lack robustness, effectiveness in training the DF detection model, and integration of generalizability and interpretability in the model [12] [13] [14]. As indicated by Yu et al. in [12], robustness in DF detection means that the system should be able to detect manipulation of high-quality and low-quality image/video contents; the system's effectiveness should not drop based on the resolution of the contents. Generally, the performance of DF detection systems drops over low-quality content. Generalizability refers to the condition where each DF generation tool utilizes a different approach to generate the DF contents; thus, the DF system should be able to detect manipulations from these different tools in a single shot [12]. Interpretability refers to the condition in the DF detection ecosystem where a model should be able to predict which parts of the image (a person's face, for example) are real or fake and label the bounding boxes with fake probabilities. It is crucial, as it enables a system to understand the dynamics of generated synthetic content and presents a visual explanation of the abnormalities in the images [15]. Current systems analyze DF detection on a sequential frame-by-frame basis, which results in higher temporal inconsistencies in the model. Thus, there is a stringent requirement for effective DF detection models that can form an optimal mix of the aforementioned conditions [16].

Recent approaches have suggested convolutional neural networks (CNN) as an effective fit for DF detection models [17]. Usually, pre-trained CNN models are applied on single frames, while other approaches have considered recurrent convolutional networks, where frames can be grouped to form the decision. In addition, some approaches consider facial expression patterns to capture fake content. Most CNN-based approaches are black boxes, and the models are prone to overfitting. In other cases, the validation, testing, and training splits are not uniformly distributed, which leads to different interpretations of the same datasets under different operating conditions. For example, a DF detection model on the Facebook DF detection challenge dataset is proposed in [18]. The model scored an average precision of 82.56% on these datasets, but the performance drastically drops to 65.18% on the validation dataset, as it is collected from various sources. Thus, generalization through a CNN on one dataset does not guarantee cross-performance on another dataset [19]. These inconsistencies can be mitigated through an effective deep CNN (D-CNN) model that addresses cross-domain interpretability while maintaining the robustness and generalizability of the DF detection scheme, yielding high accuracy through an effective ensemble of the proposed CNN approaches.

A. NOVELTY
Existing CNN-based DF detection models should conform to the abilities of high generalizability, robustness, and interpretability [12] [20]. The lack of the above-mentioned abilities can be seen in existing systems such as MesoNet, MesoInceptionNet, and many others. These are well-known compact CNN-based DF detection models focused on detecting deepfakes in low-quality images. Even though they yield promising results on the test set, these models lack generalizability, which is a well-discussed challenge in the domain of DF detection. Accuracy drops by a huge margin whenever these DF detection methods are tested against DF images generated using different methods. DF detection systems learn features particular to the generation methods whose images were used to train them. For example, a DF detection model developed and trained over images from StarGAN will yield good results when tested over a reserved, unseen test set from the same source, but when tested against images from some other DF generation method, say StyleGAN, its accuracy will drop by a huge margin, sometimes to the point that the model becomes just a random guess. This indicates that the lack of generalizability is the challenge at large. Regardless, CNN approaches have mostly treated DF detection as a binary classification problem, where cross-domain interoperability is required [18]. The proposed work presents an improvement over MesoNet and MesoInceptionNet, a D-CNN model that extracts deep features
from input images through the convolution layers to address the aforementioned challenges. It captures the manipulation traces left behind as features and forms a classification model based on the similarities between real and fake images. The similarities are projected to the closest match, which improves the model predictability, as it captures the complex inconsistencies through the deep network. Furthermore, the model is trained over synthetic and real images from different sources, improving the generalizability and cross-learning accuracy.

B. RESEARCH CONTRIBUTION
Following are the major contributions of the paper.
• We analyze various existing approaches to DF detection using CNN models and highlight their advantages and potential pitfalls.
• We propose a novel D-CNN-based architecture to classify DF image and video contents. The proposed model is trained over images from seven data sources to increase its generalizability.
• We evaluate the performance of the proposed architecture using accuracy, precision, recall, and F1-score metrics over the reserved test set.

C. ARTICLE LAYOUT
The layout of the article is as follows. Section II presents the existing approaches of DF detection models. Section III presents the problem formulation of the proposed DF classification scheme. Section IV details the proposed model approach and the systematic explanation of the model processing. Section V discusses the performance evaluation of the model based on various metrics. Section VI presents the discussion and future challenges of the proposed scheme, and finally, Section VII presents the work's conclusion and future scope.

II. RELATED WORKS
From the literature, it can be seen that researchers have already adopted different types of approaches to create an efficient DF detection system. Even so, the underlying principle of most approaches remains consistent: exploiting the inconsistencies and manipulation traces left behind by GAN tools during the generation process [6]. Nowadays, DF spans multiple modalities, such as audio, video, image, or hybrid modality-based models. Among these, image/video-based DF is the most prominent; thus, most research is directed toward identifying image and video DFs. The image/video DF detection models are generally classified into three domains: physical/physiological features, signal-level features, and data-driven models [21]. DF detection approaches involving more than one modality, i.e., combined audio and video, are termed multi-modal approaches, where the classification rests on computing the disharmony (or entropy difference) between two different modalities in DF manipulations [22] [23]. Table 1 presents a comparative analysis of the proposed D-CNN model against existing approaches in terms of approach and the proposed method. The following subsections present the existing approaches in the classified domains.

A. PHYSICAL/PHYSIOLOGICAL FEATURES
In physical and physiological feature-based approaches, visible discrepancies in the image/video content are exploited to classify whether the submitted content is synthetic or real. The visible discrepancies primarily include improper shadows, irregular geometry, missing details in facial features such as teeth or ears, inconsistent eye colors, head movements, and other features. For example, Li et al. [24] leveraged inconsistencies in blinking eye patterns, which the DF tools cannot mimic in a video stream. The authors in [29] worked on the inconsistencies of head pose movements compared to the rest of the body movements in DF images and videos and identified the synthetic content in the data. The authors identified 68 different landmarks on the whole body, including 17 facial landmarks. The direction of movement is considered from the center of the face, and if the directions of two or more landmarks agree, the content is classified as authentic or synthetic accordingly. Matern et al. [30] used inconsistencies in other visual artifacts, such as inconsistent geometry of teeth, shadows, lighting, and eye colors. Although the considered approach is good, the latest DF generation tools have learned about the geometry of faces and can thus easily spoof such models. To overcome the feature-based inconsistencies, researchers shifted to other representations, including signal-level feature extraction.

B. SIGNAL-LEVEL FEATURES
In signal-level features, deep features are extracted using either feature descriptors or feature extraction algorithms. Low-level features are extracted using steganalysis, which the classification algorithm can use to classify whether the input content is DF. Kharbat et al. [31] presented a combination model of different signal-level feature descriptors based on HOG, ORB, SURF, and others. The extracted deep features are then fed as input to an SVM classifier to decide whether the image is DF. The authors in [35] utilized a feature extraction approach known as the scale-invariant feature transform (SIFT), which extracts key pixel features and analyzes them. Similar to the study of [31], Akhtar et al. [32] used local image descriptors such as LBP, LPQ, PHOG, SURF, BSIF, and IQM. The results suggested that IQM performed more accurately than the other models. However, as DF tools became more sophisticated, GAN models fooled signal-level feature descriptors. Thus, the research shifted towards data-driven DF detection models.

C. DATA-DRIVEN MODELS
In data-driven DF detection, deep neural networks (DNN) are used instead of specific features to extract and learn the features. Based on the learning, the model classifies the submitted content as DF or real images/videos.
TABLE 1: A comparative analysis of the proposed model with the existing approaches

| Author | Year | Approach | Algorithm | Method | Remarks |
|---|---|---|---|---|---|
| Li et al. [24] | 2018 | Physical attributes-based detection | Long-term recurrent CNNs | Used eye blinking patterns to detect DF videos | Advanced DF videos are hard to detect using visual feature sets |
| Marra et al. [25] | 2018 | Data-driven models | XceptionNet | Performed a comparative study of InceptionNet, DenseNet, and XceptionNet models; among these, XceptionNet performed best | Lack of generalizability |
| Hsu et al. [26] | 2018 | Data-driven models | CNN | Proposed a five-layer CNN architecture called Deep Forgery Discriminator | Provides good results but lacked generalizability |
| Afchar et al. [27] | 2018 | Data-driven models | CNN | A CNN model that utilizes the inception module as its architectural backbone | Worked well with compressed videos, but XceptionNet outperformed it on every dataset |
| Güera et al. [28] | 2018 | Data-driven models | RNN | RNN-based temporal feature model | Accuracy is not effectively high and can be outperformed by other models |
| Yang et al. [29] | 2019 | Physical attributes-based detection | Support vector machine (SVM) classifier | Exploited inconsistencies between the head pose of the face and other parts of the body using various facial landmarks | Visual features are not reliable with advanced DF datasets |
| Matern et al. [30] | 2019 | Physical attribute-based detection | Ensemble model with multi-layer perceptron and logistic regression | Used visual artifacts, such as differences in eye colors, disproportionate shadows, details of invisible light reflections, and shape geometry | Visual features are not reliable with advanced DFs |
| Kharbat et al. [31] | 2019 | Signal-level feature-based detection | SVM classifier | Combined multiple feature-point descriptors, such as histogram of oriented gradients (HOG), features from accelerated segment test (FAST), binary robust independent elementary features (BRIEF), binary robust invariant scalable keypoints (BRISK), KAZE, speeded-up robust features (SURF), and oriented FAST and rotated BRIEF (ORB); HOG achieved an accuracy of 94.5% with the SVM classifier | With advanced DFs coming up every year, extracting features is getting difficult |
| Akhtar et al. [32] | 2019 | Signal-level feature-based detection | SVM classifier | Used local image descriptors, such as local binary pattern (LBP), local phase quantization (LPQ), pyramid histogram of oriented gradients (PHOG), binary Gabor pattern (BGP), and image quality metric (IQM) | IQM performed best among the other models |
| Nguyen et al. [33] | 2019 | Data-driven models | Capsule network | The capsule network consists of 3 primary capsules and 2 output capsules; features extracted from VGG-19 are provided as input | It worked as well as MesoNet, but XceptionNet outperforms all the networks |
| Amerini et al. [34] | 2019 | Data-driven models | CNN | Exploited discrepancies in motion across successive frames at f(t) and f(t+1); used a CNN as the classification algorithm | Other algorithms outperform the proposed model |
| Proposed Model | 2022 | Data-driven model | CNN | Proposed D-CNN-based architecture trained over images from seven different data sources | Data pipeline in the proposed architecture over DF videos |
However, to train the DNN model, a sufficient amount of data must be supplied, and thus the approach is named data-driven. Marra et al. [25] used networks such as InceptionNet, DenseNet, and XceptionNet, with a large dataset of samples collected from different categories of image-to-image translation created using CycleGAN. The results of their experiments suggested that XceptionNet outperforms all the other networks considered in the study. However, the issue of generalizability remains, which was addressed by the authors in [26], who proposed a deep forgery discriminator network, essentially a five-layer CNN architecture based on embedding the contrastive loss. The results were promising, but the lack of generalizability remained a problem. Another CNN-based approach, known as MesoNet, was proposed by Afchar et al. [27]; it performed well as it focused on the mesoscopic features of the images. Nguyen et al. [33] proposed a capsule network with features extracted from VGG-19 as input. The model performed as well as MesoNet, but XceptionNet still outperforms it. Similar approaches exist where the authors used the temporal component of the video to identify DF videos: Güera et al. [28] proposed a recurrent neural network (RNN) model, and Amerini et al. [34] used a CNN with the concept of discrepancies across frames to identify DF videos.

As outlined in the literature review, the data-driven models normally outperformed the physiological and signal-based approaches. Thus, we consider a data-driven approach in the proposed scheme and propose a D-CNN model that captures the deep features with improved generalization and model predictability.

III. PROBLEM FORMULATION
This section presents the problem formulation of the proposed approach. The proposed model is a data-driven D-CNN model for DF detection that predicts the respective
class of input images based on their features. To formulate the problem, we consider a certain amount of available images, represented as $I_{total} = \{I_1, I_2, \ldots, I_n\}$. $I_{total}$ are sent for training and are classified into real images, represented as $I_r$, or DF images, represented as $I_{df}$. $I_r$ is constructed from $p$ different data sources of real images, where any image $i \in I_r$ is represented as follows:

$$I_r = \left[ N^{k=1}_{i=1}, N^{k=1}_{i=2}, \ldots, N^{k=1}_{i=x}, N^{k=2}_{i=1}, \ldots, N^{k=2}_{i=x}, \ldots, N^{k=p}_{i=x} \right] \quad (1)$$

Considering that each data source consists of $x$ real images, it can be denoted as $N^k$. Thus, $I_r$ is further denoted as follows:

$$I_r = \sum_{k=1}^{p} N^k \quad (2)$$

Similarly, for DF images $I_{df}$, there are $q$ data sources of deepfake images, and each source consists of $z$ images. The same is illustrated as follows:

$$I_{df} = \left[ N^{j=1}_{i=1}, N^{j=1}_{i=2}, \ldots, N^{j=1}_{i=z}, N^{j=2}_{i=1}, \ldots, N^{j=2}_{i=z}, \ldots, N^{j=q}_{i=z} \right] \quad (3)$$

Algorithm 1 Working of the proposed approach.
Input: I - RGB images of faces, D - destination address of stored images, M - destination address of the pretrained model
Output: L - predicted likelihood, P - predicted label
procedure DEEPFAKE_DETECTION()
    Ht (height) ← 160
    Wt (width) ← 160
    DataGen ← ImageDataGenerator()
    Generator ← DataGen.flow_dir(D, Ht, Wt)
    model ← load_model(M)
    i ← 1
    while i ≤ len(Generator.labels) do
        I ← Generator.next()
        L ← model.predict(I)
        P ← round(L)
        Display likelihood L
        Display predicted label P
        Display image I
        i ← i + 1
    end while
end procedure
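Algorithm 1 maps directly onto the Keras API. The following is a minimal Python sketch of the same loop, assuming the trained model was saved as a whole (as described later in the training setup) and that the test images sit in a directory layout readable by `flow_from_directory`; the file and directory names are illustrative placeholders, not from the paper:

```python
# Minimal sketch of Algorithm 1 using Keras; paths are illustrative.
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.image import ImageDataGenerator

HT, WT = 160, 160  # input height and width expected by the D-CNN

def deepfake_detection(d, m):
    """Print the predicted likelihood L and label P for each image in d."""
    datagen = ImageDataGenerator(rescale=1.0 / 255)
    generator = datagen.flow_from_directory(
        d, target_size=(HT, WT), batch_size=1,
        class_mode=None, shuffle=False)
    model = load_model(m)
    for _ in range(len(generator.filenames)):
        image = next(generator)                    # one (1, 160, 160, 3) batch
        likelihood = float(model.predict(image, verbose=0)[0][0])
        label = round(likelihood)                  # 0 -> real, 1 -> deepfake
        print(f"L = {likelihood:.4f}, P = {'deepfake' if label else 'real'}")

deepfake_detection("test_images/", "dcnn_model.h5")  # hypothetical paths
```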
B. PROPOSED ARCHITECTURE
This section discusses the proposed CNN-based architecture (Figure 2). In general, a CNN architecture consists of both convolutional and pooling layers. Convolutional layers extract deep features from input images, whereas pooling layers reduce the dimensionality of the input feature maps. After the convolutional layers, the feature maps are made into a one-dimensional array using a flatten layer and given as input to the fully connected layers. After the fully connected layers, the output layer predicts the class of the input image. Our proposed architecture follows the same approach, where the earlier layers are convolutional. After the convolutional layers, a flatten layer is used, followed by a series of fully connected layers. In the end, a sigmoid function is used to predict the likelihood of the output. Batch normalization is used after certain layers to stabilize the training process, whereas average pooling decreases the dimensionality of the feature maps over the succeeding layers. The block diagram of the proposed architecture can be seen in Figure 1.

The proposed architecture reads input images with a height and width of 160 pixels each and a batch size of 64. Various data augmentation techniques are then applied using the Keras preprocessing library, such as rescaling the input array, rotating the input image randomly between 0 and 360 degrees, horizontal and vertical flips, and a shear range and zoom range of 0.2.

Thus, the proposed architecture accepts input images of size (160,160,3) with all the data augmentation techniques applied. The flow diagram of the proposed architecture can be seen in Figure 2. At the first layer, 2D convolution operations are performed using a (3,3) filter size and 8 different filters, with Leaky ReLU as the activation function. Since the first layer extracts high-level features of the input images, the filter size is kept small, i.e., (3,3), instead of a larger filter such as (5,5) or (7,7). With this, we now have the initial feature maps extracted from the input images, but the distributions of input batches can vary a lot across batches based on the types of images included in them. This can create problems for the optimizer's convergence, destabilizing the training process; training is more stable when the input to each layer is approximately unit Gaussian. To that end, these feature maps are batch normalized, which speeds up the training process (faster convergence) and decreases the dependency on the weight initialization.

The batch-normalized tensor of size (160,160,8) is then passed on to the next block, which applies two convolutional layers, each performing convolution operations with a (3,3) filter size and 16 different filters, with Leaky ReLU as activation. This allows us to extract deeper features that could be more meaningful in detecting deepfake images. These extracted feature maps are once again batch normalized. Generally, in deeper CNNs, a larger number of filters is used in the deeper layers to extract deep features. But due to this, the dimensions of the feature maps keep increasing, requiring many computations as we proceed further. To tackle this issue, pooling layers are used to decrease the dimensionality of the extracted feature maps. With this goal in mind, we have used an average pooling layer of size (2,2), which essentially halves the dimensions of the feature maps. The output from the previous block with the average pooling layer is of size (80,80,16). This is accepted as input for the next block, which has a similar structure to the previous block. It differs only by having three convolutional layers with a (3,3) filter size and 32 different filters, with Leaky ReLU as activation. Then again, batch normalization and average pooling layers follow. With this pooling layer, the dimension of the feature map becomes (40,40,32). It is then taken as input for the next block, which consists of
4 consecutive convolutional layers with a (3,3) filter size and 64 different filters, with Leaky ReLU as activation, again followed by batch normalization and an average pooling layer. With this, the next block receives an input of size (20,20,64). Through the previous four blocks, we have extracted deep image features that can be used to classify images as deepfake or not. For the next two blocks, we use a larger filter of size (5,5). The current block uses a convolutional layer with a (5,5) filter and 128 different filters, with Leaky ReLU as activation, followed by batch normalization and a max pooling layer. This reduces the output dimensions to (10,10,128). The next block accepts the output from the previous block and applies a convolutional layer with a (5,5) filter size and 256 different filters, with Leaky ReLU as activation, again followed by batch normalization and a max pooling layer, which gives an output dimension of (5,5,256).

The output from the previous block is transformed into a one-dimensional array using the flatten layer. Following the flatten layer, there is a dropout layer with a value of 0.5, which randomly sets half of the input units to zero; this helps our model avoid overfitting the training data. Being an improvement over MesoNet, the dropout value has not been changed from its predecessor, which experimentally also yields the best results in terms of avoiding overfitting. Next, there is a fully connected layer with 32 neurons, also utilizing Leaky ReLU as the activation function, followed by a dropout layer with a value of 0.5. Similarly, there are two consecutive blocks of fully connected layers with 16 neurons and the Leaky ReLU activation function, each followed by a dropout layer with a value of 0.5. Finally, there is an output layer with a single neuron and a sigmoid activation function, which predicts whether the input image is a deepfake or not: if the value is less than 0.5, the predicted output is real; otherwise, it is a deepfake image. The loss function used during training is binary cross-entropy, and the optimizer is Adam with a learning rate of 0.01. The block diagram of the architecture can be seen in Figure 1, and Table 2 describes the output dimensions of each layer along with the number of parameters.
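The layer stack described above can be written down compactly in Keras. The following sketch reproduces the stated filter counts, kernel sizes, pooling, dropout, and output dimensions; anything the text does not specify, such as `same` padding (needed to keep the reported feature-map sizes) and the default LeakyReLU slope, is an assumption:

```python
# Sketch of the described D-CNN in Keras. 'same' padding and the default
# LeakyReLU slope are assumptions; filter counts, kernel sizes, pooling,
# dropout, and the resulting dimensions follow the text.
from tensorflow.keras import Sequential, layers, optimizers

def conv_block(model, n_conv, filters, kernel, pool=None):
    for _ in range(n_conv):
        model.add(layers.Conv2D(filters, kernel, padding="same"))
        model.add(layers.LeakyReLU())
    model.add(layers.BatchNormalization())
    if pool == "avg":
        model.add(layers.AveragePooling2D((2, 2)))  # halves height and width
    elif pool == "max":
        model.add(layers.MaxPooling2D((2, 2)))

model = Sequential()
model.add(layers.Conv2D(8, (3, 3), padding="same",
                        input_shape=(160, 160, 3)))   # -> (160, 160, 8)
model.add(layers.LeakyReLU())
model.add(layers.BatchNormalization())
conv_block(model, 2, 16, (3, 3), "avg")   # -> (80, 80, 16)
conv_block(model, 3, 32, (3, 3), "avg")   # -> (40, 40, 32)
conv_block(model, 4, 64, (3, 3), "avg")   # -> (20, 20, 64)
conv_block(model, 1, 128, (5, 5), "max")  # -> (10, 10, 128)
conv_block(model, 1, 256, (5, 5), "max")  # -> (5, 5, 256)
model.add(layers.Flatten())
model.add(layers.Dropout(0.5))            # dropout of 0.5, kept from MesoNet
for units in (32, 16, 16):                # fully connected head
    model.add(layers.Dense(units))
    model.add(layers.LeakyReLU())
    model.add(layers.Dropout(0.5))
model.add(layers.Dense(1, activation="sigmoid"))
model.compile(optimizer=optimizers.Adam(learning_rate=0.01),
              loss="binary_crossentropy", metrics=["accuracy"])
```

Calling `model.summary()` on this sketch prints the per-layer output dimensions, which can be checked against those reported in Table 2.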
V. RESULTS AND DISCUSSION
This section discusses the performance delivered by the proposed architecture and the results achieved.

A. SIMULATION SETUP
Google Colab Pro has been used for training, which usually assigns a Tesla T4 or Tesla P100 GPU. Since Google Colab restricts prolonged usage of GPUs, checkpointing has been used during the training to save the best-performing model based on the lowest validation loss value. If necessary, training could be resumed from the last best model saved, though this was never needed.

B. DATASET DESCRIPTION
The dataset we used was part of the Deepfake Images Detection and Reconstruction Challenge [36]. The dataset consists of real images from the CelebA and FFHQ image datasets, which contribute 5000 images each, whereas 1000 images each from the GDWCT, AttGAN, StarGAN, StyleGAN, and StyleGAN2 datasets are included for deepfake detection.
Since the images provided are taken from different types of GAN architectures and datasets, images from these different sources have different resolutions, ranging from 1024x1024 (the largest) to 178x218 (the smallest). The resolutions of the images are listed in Table 3.

TABLE 3: Resolution of images from each data source

| Type of Image | Dataset | Resolution | No. of Images |
|---|---|---|---|
| Deepfake | GDWCT | 216 x 216 | 1000 |
| Deepfake | AttGAN | 256 x 256 | 1000 |
| Deepfake | StarGAN | 245 x 256 | 1000 |
| Deepfake | StyleGAN | 1024 x 1024 | 1000 |
| Deepfake | StyleGAN2 | 1024 x 1024 | 1000 |
| Real | CelebA | 178 x 218 | 5000 |
| Real | FFHQ | 1024 x 1024 | 5000 |

Thus, there were 10000 real images and 5000 deepfake images. To make a balanced set, we decided to use only 5000 real images: we randomly sampled 2500 images from CelebA and 2500 from FFHQ, giving a total of 5000 randomly sampled real images from these two sources against the 5000 deepfake images.

We divided the image dataset into training, validation, and test sets: 60% of the images are used for training, 10% for validation, and 30% for testing, with the sets kept properly balanced. First, 70% of the real images were randomly sampled for training from both data sources: 1750 images from CelebA and 1750 from FFHQ, making 3500 real images. Out of these 3500, we selected every 10th image to be reserved for the validation set, which preserved the ratio of real images from both data sources and gave 350 real images for the validation set. We then followed the same strategy with the deepfakes: we sampled 70% of each type of GAN image, i.e., 700 images each from GDWCT, AttGAN, StarGAN, StyleGAN, and StyleGAN2, giving a well-balanced set of 3500 deepfake images. Similarly, we selected every 10th image from this training set for the validation set, giving 350 deepfake images for validation. Thus, we have 3150 real images and 3150 deepfake images for training, along with 350 real and 350 deepfake images for validation.
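The sampling and split described above reduce to a few lines of Python. In this sketch, the per-source file listings and the random seed are illustrative assumptions, while the 70%/every-10th/30% logic follows the text:

```python
# Sketch of the balanced sampling and train/validation/test split described
# above. Directory names and the seed are illustrative assumptions.
import random
from pathlib import Path

random.seed(0)  # the paper does not state a seed

def split_source(files, n_keep):
    """Keep n_keep random images; 70% go to the training pool, every 10th
    of those moves to validation, and the remaining 30% form the test set."""
    files = random.sample(files, n_keep)
    n_train = int(0.7 * n_keep)
    train_pool, test = files[:n_train], files[n_train:]
    val = train_pool[9::10]                  # every 10th image -> validation
    val_set = set(val)
    train = [f for f in train_pool if f not in val_set]
    return train, val, test

sources = {"celeba": 2500, "ffhq": 2500,         # real: 5000 of 10000 kept
           "gdwct": 1000, "attgan": 1000, "stargan": 1000,
           "stylegan": 1000, "stylegan2": 1000}  # deepfake: all kept
splits = {name: split_source(sorted(Path(name).glob("*.png")), n)
          for name, n in sources.items()}
# Totals: 3150 real + 3150 deepfake train, 350 + 350 validation, rest test.
```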
FIGURE 3: Training accuracy and Training Loss over the training epochs
FIGURE 4: Validation accuracy and Validation Loss over the training epochs
The remaining 30% of the images were used for testing the model's performance after training. For training purposes, data augmentation has been applied to these images: vertical flipping, horizontal flipping, zooming by 0.2, a shear range of 0.2, width and height shift ranges of 0.2, as well as random rotation of up to 360 degrees. These help the model learn to detect deepfake images while maintaining spatial- and scale-invariance properties. Since the training images consisted only of upright faces positioned at the center of the image, there was a very high possibility that the D-CNN model would learn to discriminate between DF and real images based only on the features at the center of the images, and only for upright faces. To ensure the dataset contained facial images from different angles, at different spatial positions within the image, and at different scales, data augmentation techniques were used. This helps the model learn spatially and scale-invariant features, which are of utmost importance for a DF detection system in the wild.

The input image size was set to 160 x 160. Historically, detecting low-resolution and low-quality deepfake images has been considered a difficult task, since there is much less information to work with. In addition, conventional social media sites downscale high-resolution images to save transmission and storage costs. Hence, a CNN with an input size of (160,160,3) is selected in the hope of ensuring the usefulness of the model in real-world use cases. Low-resolution inputs also help to keep the computational costs to a minimum. There is scope for future work, either by moving to a variable-sized-input network with global pooling layers or by experimenting with various efficient upscaling techniques to see whether performance improves.

C. TRAINING
During training, the Adam optimizer is used with a learning rate of 0.01, and the number of epochs is 550. Due to the limitations of hardware and time usage on Google Colab, checkpointing and CSVLogger have been used to record the training accuracy and loss as well as the validation accuracy and loss during the training phase. The batch size is set to 64, which stabilizes the training phase considerably. We save the entire model instead of the weights only.

From Figure 3, we can see that the training accuracy steadily increases until the 200th epoch, after which the change in accuracy slowly plateaus over the subsequent epochs. The same can be seen in the loss values during the training phase: even though the change is not huge, performance slowly improves.
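Putting the augmentation and training settings together, the configuration described above corresponds to roughly the following Keras sketch; `model` is the D-CNN defined earlier, and the file and directory names are illustrative assumptions:

```python
# Sketch of the training setup described above: Keras augmentation,
# checkpointing on validation loss, and CSV logging. File and directory
# names are illustrative assumptions; 'model' is the D-CNN defined earlier.
from tensorflow.keras.callbacks import CSVLogger, ModelCheckpoint
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(
    rescale=1.0 / 255, rotation_range=360,
    horizontal_flip=True, vertical_flip=True,
    shear_range=0.2, zoom_range=0.2,
    width_shift_range=0.2, height_shift_range=0.2)
val_datagen = ImageDataGenerator(rescale=1.0 / 255)  # no augmentation

train_gen = train_datagen.flow_from_directory(
    "data/train", target_size=(160, 160), batch_size=64, class_mode="binary")
val_gen = val_datagen.flow_from_directory(
    "data/val", target_size=(160, 160), batch_size=64, class_mode="binary")

callbacks = [
    # save the full model (not just weights) with the lowest validation loss
    ModelCheckpoint("best_model.h5", monitor="val_loss",
                    save_best_only=True, save_weights_only=False),
    CSVLogger("training_log.csv"),  # per-epoch accuracy and loss values
]
history = model.fit(train_gen, validation_data=val_gen,
                    epochs=550, callbacks=callbacks)
```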
TABLE 4: Classification report

| Class | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| Real | 0.97 | 0.98 | 0.97 | 1500 |
| Deepfake | 0.98 | 0.97 | 0.97 | 1500 |
| Macro average | 0.97 | 0.97 | 0.97 | 3000 |
| Weighted average | 0.97 | 0.97 | 0.97 | 3000 |

FIGURE 6: Confusion matrix.

Out of 1500 real test images, our model classified 1471 images correctly, whereas 29 real images were misclassified as deepfake. Out of 1500 deepfake images, our model classified 1450 images correctly and misclassified 50 images as real, as shown in Figure 6.

VI. DISCUSSION
To further understand the proposed model's performance, we extend our analysis by evaluating the model over images from each of these data sources separately. This allows us to understand more about the generalizability of the proposed model. In the test set, we had 1500 deepfake images from 5 GAN architectures, i.e., 300 deepfake images from each data source. We then combined the images from each data source with real images separately: we randomly sampled 300 real images, 150 from CelebA and 150 from FFHQ, and evaluated each subset individually. Our model yielded 98.33% accuracy on AttGAN vs. CelebA+FFHQ images, whereas it gave 99.33% accuracy on GDWCT vs. CelebA+FFHQ images. It gave 95.33% accuracy on StyleGAN vs. CelebA+FFHQ and 94.67% on StyleGAN2 vs. CelebA+FFHQ images. Finally, our model yielded 99.17% accuracy on StarGAN vs. CelebA+FFHQ images.

We then evaluated the proposed model on the imbalanced set. We already had 300 images for each data source stored separately, and we fed all 300 deepfake images of each data source to the model to see its performance. Our model gave complete 100% accuracy in classifying deepfake images generated from AttGAN, with a loss value of 0.0051, whereas on GDWCT it gave an accuracy of 99.33% with a loss value of 0.0141. Our model performed well over StyleGAN and StyleGAN2, with accuracies of 95.66% and 93.99%, respectively. In contrast, our model gave 99.33% accuracy in classifying images generated using StarGAN. Table 5 presents the performance of the proposed model on the different image databases combined with real images.

TABLE 5: Performance of the proposed model on individual data sources

| Subset of Test set | Proposed Model |
|---|---|
| AttGAN images + Real images | 98.33% |
| GDWCT images + Real images | 99.33% |
| StyleGAN images + Real images | 95.33% |
| StyleGAN2 + Real images | 94.67% |
| StarGAN + Real images | 99.17% |

The model indicates promising results over the reserved test set images, and its performance is balanced across all the different image data sources. Evaluating the model over all the data sources separately gives more insight into its performance: the accuracy on the combined dataset might look promising while the model lacks performance over certain kinds of images. Viewed this way, our model shows extraordinary performance over the images from AttGAN, GDWCT, and StarGAN, whereas performance drops a bit over images from StyleGAN and StyleGAN2.

When investigated further, it is found that StyleGAN and StyleGAN2 images are very high-resolution images, whereas images from AttGAN, GDWCT, and StarGAN are low-resolution images. This suggests that our model performs extraordinarily well over low-resolution images but drops slightly over high-resolution images. Still, the performance is quite promising even for high-resolution images, and the overall performance, considering images from such different data sources and resolutions, is impressive. Some of the results are shown in Figure 7. As can be seen in the figure, the model outputs its results in terms of a confidence score, which is essentially the probability of an image being a deepfake. If the confidence score is close to 0, the model is extremely confident that the image is real, and vice versa. When the confidence score comes closer to 0.5, it indicates that the model is somewhat confused; it can be seen in the figure that, for the misclassified deepfake and misclassified real images, the confidence score is close to 0.5. Initial analysis suggests that, since there are manipulation traces and a little blurriness left behind in deepfake images, the neuron activations indicate that background areas are activated.
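The reported classification report, confusion matrix, and per-source accuracies can be reproduced with scikit-learn once predictions are collected. In this sketch, `test_gen` and the per-source generators are assumed to come from the earlier pipeline and to be built with `shuffle=False`:

```python
# Sketch of the evaluation reported above. 'model', 'test_gen', and the
# per-source generators are assumed from the earlier pipeline and must be
# created with shuffle=False so predictions align with generator.classes.
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

def predict_labels(model, generator):
    y_prob = model.predict(generator, verbose=0).ravel()  # confidence scores
    return np.round(y_prob).astype(int), generator.classes

# Full balanced test set: 1500 real + 1500 deepfake images
y_pred, y_true = predict_labels(model, test_gen)
print(classification_report(y_true, y_pred, target_names=["Real", "Deepfake"]))
print(confusion_matrix(y_true, y_pred))

# Per-source subsets: 300 deepfakes of one GAN vs. 150 CelebA + 150 FFHQ
for source, gen in subset_generators.items():  # hypothetical dict of generators
    y_pred, y_true = predict_labels(model, gen)
    print(f"{source}: accuracy = {(y_pred == y_true).mean():.4f}")
```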
FIGURE 7: Classification of deepfake, real, and misclassified images by the proposed model.
FIGURE 8: Experiment setup for comparing the performance capabilities of the proposed and the existing models.
[9] P. Korshunov and S. Marcel, "Vulnerability assessment and detection of deepfake videos," in 2019 International Conference on Biometrics (ICB), pp. 1–6, 2019.
[10] Y. Mirsky and W. Lee, "The creation and detection of deepfakes," ACM Computing Surveys, vol. 54, no. 1, 2021.
[11] "A video that appeared to show Obama calling Trump a 'dipsh-t' is a warning about a disturbing new trend called 'deepfakes'." https://www.businessinsider.in/tech/a-video-that-appeared-to-show-obama-calling-trump-a-dipsh-t-is-a-warning-about-a-disturbing-new-trend-called-deepfakes/articleshow/63807263.cms. Accessed: 2022-05-25.
[12] P. Yu, Z. Xia, J. Fei, and Y. Lu, "A survey on deepfake video detection," IET Biometrics, vol. 10, no. 6, pp. 607–624, 2021.
[13] Y.-J. Heo, Y.-J. Choi, Y.-W. Lee, and B.-G. Kim, "Deepfake detection scheme based on vision transformer and distillation," 2021.
[14] R. Caldelli, L. Galteri, I. Amerini, and A. Del Bimbo, "Optical flow based CNN for detection of unlearnt deepfake manipulations," Pattern Recognition Letters, vol. 146, pp. 31–37, 2021.
[15] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, pp. 2242–2251, 2017.
[16] K. Patel, D. Mehta, C. Mistry, R. Gupta, S. Tanwar, N. Kumar, and M. Alazab, "Facial sentiment analysis using AI techniques: State-of-the-art, taxonomies, and challenges," IEEE Access, vol. 8, pp. 90495–90519, 2020.
[17] H. S. Shad, M. M. Rizvee, N. T. Roza, S. M. A. Hoq, M. Monirujjaman Khan, A. Singh, A. Zaguia, S. Bourouis, and S. K. Gupta, "Comparative analysis of deepfake image detection method using convolutional neural network," Intell. Neuroscience, vol. 2021, Jan. 2021.
[18] B. Dolhansky, J. Bitton, B. Pflaum, J. Lu, R. Howes, M. Wang, and C. C. Ferrer, "The deepfake detection challenge (DFDC) dataset," 2020.
[19] J. Hathaliya, R. Parekh, N. Patel, R. Gupta, S. Tanwar, F. Alqahtani, M. Elghatwary, O. Ivanov, M. S. Raboaca, and B.-C. Neagu, "Convolutional neural network-based Parkinson disease classification using SPECT imaging data," Mathematics, vol. 10, no. 15, 2022.
[20] D. Afchar, V. Nozick, J. Yamagishi, and I. Echizen, "MesoNet: A compact facial video forgery detection network," in 2018 IEEE International Workshop on Information Forensics and Security (WIFS), pp. 1–7, 2018.
[21] Y. Li, X. Yang, P. Sun, H. Qi, and S. Lyu, "Celeb-DF: A large-scale challenging dataset for deepfake forensics," in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3204–3213, 2020.
[22] K. Chugh, P. Gupta, A. Dhall, and R. Subramanian, "Not made for each other- audio-visual dissonance-based deepfake detection and localization," 2020.
[23] Y. Zhang, J. Zhan, W. Jiang, and Z. Fan, "Deepfake detection based on incompatibility between multiple modes," in 2021 International Conference on Intelligent Technology and Embedded Systems (ICITES), pp. 1–7, 2021.
[24] Y. Li, M.-C. Chang, and S. Lyu, "In ictu oculi: Exposing AI created fake videos by detecting eye blinking," in 2018 IEEE International Workshop on Information Forensics and Security (WIFS), pp. 1–7, 2018.
[25] F. Marra, D. Gragnaniello, D. Cozzolino, and L. Verdoliva, "Detection of GAN-generated fake images over social networks," in 2018 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), pp. 384–389, 2018.
[26] C.-C. Hsu, C.-Y. Lee, and Y.-X. Zhuang, "Learning to detect fake face images in the wild," in 2018 International Symposium on Computer, Consumer and Control (IS3C), pp. 388–391, 2018.
[27] D. Afchar, V. Nozick, J. Yamagishi, and I. Echizen, "MesoNet: A compact facial video forgery detection network," in 2018 IEEE International Workshop on Information Forensics and Security (WIFS), pp. 1–7, 2018.
[28] D. Güera and E. J. Delp, "Deepfake video detection using recurrent neural networks," in 2018 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pp. 1–6, 2018.
[29] X. Yang, Y. Li, and S. Lyu, "Exposing deep fakes using inconsistent head poses," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8261–8265, 2019.
[30] F. Matern, C. Riess, and M. Stamminger, "Exploiting visual artifacts to expose deepfakes and face manipulations," in 2019 IEEE Winter Applications of Computer Vision Workshops (WACVW), pp. 83–92, 2019.
[31] F. F. Kharbat, T. Elamsy, A. Mahmoud, and R. Abdullah, "Image feature detectors for deepfake video detection," in 2019 IEEE/ACS 16th International Conference on Computer Systems and Applications (AICCSA), pp. 1–4, 2019.
[32] Z. Akhtar and D. Dasgupta, "A comparative evaluation of local feature descriptors for deepfakes detection," in 2019 IEEE International Symposium on Technologies for Homeland Security (HST), pp. 1–5, 2019.
[33] H. H. Nguyen, J. Yamagishi, and I. Echizen, "Capsule-forensics: Using capsule networks to detect forged images and videos," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2307–2311, 2019.
[34] I. Amerini, L. Galteri, R. Caldelli, and A. Del Bimbo, "Deepfake video detection through optical flow based CNN," in 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pp. 1205–1207, 2019.
[35] M. Dordevic, M. Milivojevic, and A. Gavrovska, "Deepfake video analysis using SIFT features," in 2019 27th Telecommunications Forum (TELFOR), pp. 1–4, 2019.
[36] "Deepfake images detection and reconstruction challenge – 21st international conference on image analysis and processing." https://iplab.dmi.unict.it/Deepfakechallenge/. Accessed: 2023-01-05.

YOGESH PATEL completed his Master of Technology in Computer Engineering at the Institute of Technology, Nirma University. He has an active interest in domains like deep learning, data science, and blockchain. He is working on presenting solutions that integrate generative adversarial networks into adversarial learning techniques in a wide range of domains, like healthcare, vehicular networks, and emerging communication networks.

PRONAYA BHATTACHARYA (M'22) is currently employed as an Associate Professor with the Computer Science and Engineering Department, Amity School of Engineering and Technology, Amity University, Kolkata, India. He completed his PhD at Dr. A. P. J. Abdul Kalam Technical University, Lucknow, Uttar Pradesh, India. He has over ten years of teaching experience. He has authored or coauthored more than 100 research papers in leading SCI journals and top core IEEE COMSOC A* conferences. Some of his top-notch findings are published in reputed SCI journals, like IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, IEEE TRANSACTIONS ON VEHICULAR TECHNOLOGY, IEEE INTERNET OF THINGS JOURNAL, IEEE TRANSACTIONS ON NETWORK SCIENCE AND ENGINEERING, IEEE TRANSACTIONS ON COMPUTATIONAL SOCIAL SYSTEMS, IEEE TRANSACTIONS ON NETWORK AND SERVICE MANAGEMENT, IEEE ACCESS, IEEE SENSORS, IEEE INTERNET OF THINGS MAGAZINE, IEEE COMMUNICATION STANDARDS MAGAZINE, ETT (Wiley), Expert Systems (Wiley), CCPE (Wiley), FGCS (Elsevier), OQEL (Springer), WPC (Springer), ACM-MOBICOM, IEEE-INFOCOM, IEEE-ICC, IEEE-CITS, IEEE-ICIEM, IEEE-CCCI, and IEEE-ECAI. He has an H-index of 19 and an i10-index of 32. His research interests include healthcare analytics, optical switching and networking, federated learning, blockchain, and the IoT. He has been appointed in the capacity of keynote speaker, technical committee member, and session chair across the globe. He was awarded eight best paper awards at Springer ICRIC-2019, IEEE-ICIEM-2021, IEEE-ECAI-2021, Springer COMS2-2021, and IEEE-ICIEM-2022. He is a reviewer of 21 reputed SCI journals, like IEEE INTERNET OF THINGS JOURNAL, IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS, IEEE TRANSACTIONS ON VEHICULAR TECHNOLOGY, IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, IEEE ACCESS, IEEE NETWORK, ETT (Wiley), IJCS (Wiley), MTAP (Springer), OSN (Elsevier), WPC (Springer), and others. He is also an active member of the ST Research Laboratory (www.sudeeptanwar.in).