Unpaired Image To Image Translation Using Vision Transformer
A Project Report
Submitted by
Prajakta Khaire 122122006
INFORMATION TECHNOLOGY,
CERTIFICATE
Certified that this project, titled “Unpaired image-to-image translation for the agricultural
domain”, is approved for the fulfilment of the requirements for the degree of “M.Tech. Com-
puter Engineering”.
SIGNATURE SIGNATURE
Dr. Vahida Attar Dr. P. K. Deshmukh
A lack of adequate training instances, in both quantity and variety, is typically
the barrier to implementing solutions in any field. For agricultural applications,
systems developed for tasks like plant and weed classification and disease detection
usually depend on the plant type, so each application needs its own customized dataset.
A large amount of data is required to develop any ML-based or DL-based model
for a particular task. In the case of classification, data covering different climates and
lighting conditions should be available, so that the system can classify crops precisely in any
condition. Image-to-image (I2I) translation is a technique that converts an image
from one form to another while maintaining the contents of the image. I2I translation
has become more popular in recent years due to its wide range of applications. Many
computer vision and image processing applications such as image segmentation, data
annotation, image enhancement, image synthesis, pose prediction, and style transfer use
image-to-image translation. Various deep learning architectures such as RNNs, CNNs, and
GANs are used for I2I translation. Transformers are another approach that can be
used for I2I translation. This work presents the study of I2I translation, the architecture
of the vision transformer, and how it can be used to implement image translation on the
image dataset.
Contents

List of Tables
Nomenclature
1 Introduction
1.1 Overview
1.2 Motivation
2 Existing Solutions
3 Image Translation
3.3.3 Decoder
3.4 Architecture
3.5 Transformer Encoder
4 Translation
4.2 Preprocessing
5 Experimental Setup
7 Results
9 Timeline
10 Publication Details
List of Tables

7.1 Quantitative evaluation with FID metric, SSIM index and PSNR for the Cityscapes dataset
7.2 Quantitative evaluation with FID metric, SSIM index and PSNR for the ImageNet dataset
7.3 Quantitative evaluation with FID metric, SSIM index and PSNR for the Soybean dataset

List of Figures
Nomenclature
µ Mean
σ Standard deviation
Chapter 1
Introduction
1.1 Overview
Weed detection and control play a vital role in agriculture, as weeds reduce crop
yields, degrade product quality, and raise production costs. A plant that grows along
with valuable agricultural crops is called a weed. Weed detection is the task of
accurately recognizing the regions where weeds grow. Weeds compete with crops for
nutrients and therefore harm them. Because weeds affect the growth of crops, they need
to be detected and removed at an early stage.
To design an autonomous weed control system, a crucial initial step is to locate and
identify weeds accurately. Weed detection in crops is difficult because weeds and crops
frequently have similar textures, colors, and forms. Common issues in recognizing and
distinguishing crops and weeds include similarity in color and texture, crop shadows in
sunlight, and changes in texture and color caused by lighting conditions and brightness.
Multiple approaches have therefore been developed for automatic weed detection using
image data. The unavailability of large datasets becomes the bottleneck for such problems.
Data collection is very important, as the whole system depends on the data. To classify
crops precisely, data captured in different climates and lighting conditions should be
available. Collecting data in such varied conditions is challenging. To deal with this
situation, we can use I2I translation to transform existing images into new conditions.
Deep learning techniques are used for most computer vision tasks. In recent years,
I2I translation has gained more attention due to a variety of applications like image
restoration, style transfer, and image synthesis. I2I translation is a method that
converts an image from one domain to another, i.e. one image is mapped into another.
The success of convolutional networks in computer vision has made them the default
choice for I2I translation tasks as well. Autoencoder and U-Net models built from
convolutional layers are most commonly used for I2I translation. However, convolutional
models struggle to capture long-range relationships within an image. Innovative ideas
such as Transformers, patch embeddings, and attention have extended the range of deep
learning and provided different approaches to address these issues.
1.2 Motivation
One of the primary issues in deep learning for the agricultural domain is dealing
with sparse datasets and limited samples, especially when algorithms require labeled data
and numerous training instances. Although public datasets are accessible online, most are
restricted in size and suited to particular applications; data from diverse areas and at
various growth stages is rarely available. The difficulty of class imbalance is also a
universal concern in computer vision and machine learning. The primary motivation for
this work is to develop an efficient method of generating images to supplement the
training set, which can help improve accuracy in a variety of fields.
1.3 Problem Definition and Objectives
Problem Definition

To implement unpaired image translation using a vision transformer for the agricultural
domain.

Objectives

1. To generate images that can supplement training datasets for agricultural applications.
2. To conduct translation operations for sunny to night, sunny to rainy, and sunny to cloudy conditions.
Chapter 2

Existing Solutions
Several methodologies involving CNNs have been applied to I2I translation. Most of
the older approaches depended on autoencoders for I2I translation. In an autoencoder, the
input is passed through the encoder, which downsamples it until a bottleneck layer [1].
At the bottleneck, a latent vector is formed. A decoder then reverses the process by
upsampling the latent vector until the required image is generated. The difficulty with
these systems is that low-level information valuable for output image creation is lost
at the bottleneck. Some successful systems have utilised U-Net architectures to overcome
this difficulty. U-Net is particularly effective due to its skip connections, which
concatenate encoder features to the decoder [1]. This lets the decoder reuse low-level
details from the encoder when constructing the output image.
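To make the skip-connection idea concrete, here is a minimal, hypothetical PyTorch sketch; the shapes, channel counts, and the fusing convolution are illustrative assumptions rather than the architecture used in this work.

```python
import torch
import torch.nn as nn

# Minimal illustration of a U-Net style skip connection: the encoder feature
# map is concatenated onto the decoder feature map so low-level detail is not
# lost at the bottleneck. Shapes and channel counts are illustrative.
enc_feat = torch.randn(1, 64, 128, 128)    # feature map from an encoder stage
dec_feat = torch.randn(1, 64, 128, 128)    # upsampled feature map in the decoder

merged = torch.cat([dec_feat, enc_feat], dim=1)          # skip connection
fuse = nn.Conv2d(128, 64, kernel_size=3, padding=1)      # mix the two streams
print(fuse(merged).shape)                                # torch.Size([1, 64, 128, 128])
```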
Transformers were first introduced for NLP. Generally, Transformers are pre-trained
on text and fine-tuned for specific applications. Transformers do not suffer from the
memory shortcomings of recurrent networks, thanks to their self-attention mechanism and
the way they use multilayer perceptrons (MLPs). Recurrent layers cannot remember tokens
beyond a certain distance. Some experiments demonstrate that the memory of recurrent
layers can be extended with the help of an attention mechanism, but it is still not
comparable with the Transformer.
iGPT takes the input as pixels and tries to produce similar pixels, whereas vision
transformers take images as input and divide them into patches. iGPT treats pixels simply
like tokens; it works similarly to a text generation model, receiving pixels as input and
creating target pixels.
The original Transformer takes a one-dimensional sequence as input, since it was first
built for NLP. In contrast, when utilized for image classification in computer vision,
the input to the Transformer model is a two-dimensional picture. Vision transformers
(ViT) are a type of transformer architecture that primarily employs encoders together
with patch embeddings. An image is flattened into patches, which are then projected
linearly and combined with position embeddings. In position embedding, the patches are
numbered so that the required output can be generated after translation. The patches
with positional embeddings are given to the transformer encoder.
The following entries summarize representative existing solutions:

2. S. Kim et al. [3], 2022: Performed I2I translation on the Cityscapes dataset, considering instance-level and global-level features.
3. H. Nazki et al. [5], 2020: Implemented image translation for improved plant disease recognition.
4. P. Isola et al. [2], 2016: Learned the mapping from the input to the output image as well as the loss function used to train this mapping.
Chapter 3

Image Translation
I2I translation is an area of computer vision that transforms an image from one
domain to another. The goal of image translation is to learn the mapping between an
input image and an output image that have similar characteristics but different styles.
1. Paired image translation: In this type of translation, the input and the ground-truth
images are aligned. The drawback of paired image translation is that such aligned image
pairs are difficult to collect.
2. Unpaired image translation: In this type of translation, the input and output images
come from different domains, so the goal is to find a mapping between images that are
not paired.
Unpaired I2I translation is a type of image generation technique where the input
and output images have no direct correspondence. In contrast to paired I2I translation,
where a model is trained on a dataset of paired input and output images, unpaired
image-to-image translation requires two separate datasets: one dataset of input images
and one dataset of output images with the desired style or characteristics.
Unpaired I2I translation has many practical applications, such as generating images
in the style of a target domain. For example, it can be used to translate images from a
day scene to a night scene. Overall, it is a powerful technique for generating images
with a desired style or characteristics without relying on paired examples. It has
numerous practical applications and can be implemented using deep generative models.
The vision transformer (ViT) architecture for I2I translation involves modifying the
original ViT architecture by replacing the final classification layer with a decoder
that generates the output image.
The ViT encoder consists of a stack of transformer layers that process the input
image. Each transformer layer includes multi-head self-attention and feedforward neural
network sub-layers. In the original ViT architecture, position embeddings were added to
the input image patches to retain spatial information. For I2I translation, the model
also needs to capture global information about the input image. Therefore, we can add
global position embeddings to the input image by averaging the output of the final
transformer layer and adding the result back to the patch embeddings.
3.3.3 Decoder
The decoder is a generative model that takes the output of the ViT encoder and
generates the output image. The decoder can be a CNN, GAN, or another type of
generative model. During training, the input image is given to the ViT encoder, and the
output of the encoder is then provided to the decoder. The decoder generates the output
image, and the loss function is calculated based on the difference between the generated
image and the ground truth output image. The model is trained using backpropagation.
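As an illustration of the encoder-decoder pairing described above, the following PyTorch sketch wires a small ViT-style encoder to a convolutional decoder and computes a simple L1 loss; the module names, patch size, embedding dimension, and depth are assumptions chosen for readability, not the exact configuration used in this work.

```python
import torch
import torch.nn as nn

class ViTEncoder(nn.Module):
    """Minimal ViT-style encoder: patch embedding + transformer layers."""
    def __init__(self, img_size=256, patch=16, dim=256, depth=6, heads=8):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        num_patches = (img_size // patch) ** 2
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        return self.encoder(tokens + self.pos_embed)

class ConvDecoder(nn.Module):
    """Upsamples the token grid back to an RGB image."""
    def __init__(self, dim=256, patch=16):
        super().__init__()
        ups, ch = [], dim
        for _ in range(4):  # 16x16 token grid -> 256x256 image
            ups += [nn.ConvTranspose2d(ch, ch // 2, 4, stride=2, padding=1),
                    nn.InstanceNorm2d(ch // 2), nn.ReLU(inplace=True)]
            ch //= 2
        ups += [nn.Conv2d(ch, 3, 3, padding=1), nn.Tanh()]
        self.net = nn.Sequential(*ups)

    def forward(self, tokens):
        b, n, d = tokens.shape
        side = int(n ** 0.5)
        grid = tokens.transpose(1, 2).reshape(b, d, side, side)
        return self.net(grid)

encoder, decoder = ViTEncoder(), ConvDecoder()
img = torch.randn(1, 3, 256, 256)
out = decoder(encoder(img))          # (1, 3, 256, 256)
loss = nn.L1Loss()(out, img)         # reconstruction-style training signal
loss.backward()
```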
3.4 Architecture
The transformer encoder block in the context of I2I translation is slightly different
from the one used in natural language processing tasks, as it includes some additional
layers.
The first layer in the transformer encoder block for I2I translation is typically a
convolutional layer. This layer applies a set of convolutional filters to the input
patches to extract local spatial features from the image. The output of this layer is
then passed to the multi-head self-attention layer. This layer operates on the output of
the convolutional layer and computes a weighted sum of the input patches, where the
weights are computed from the similarity between each input patch and every other patch
in the image. This layer allows the model to capture global dependencies between the
patches and to attend to the most relevant patches for the translation. A feedforward
neural network layer then operates on the output of the above-mentioned layer. This
layer is typically composed of linear transformations with a non-linear activation in
between.
As in natural language processing tasks, normalization is also used in the transformer
encoder block for image-to-image translation. The residual connections allow information
from the input image to be propagated directly to the output of the block, while the
layer normalization stabilizes training.
Overall, the steps involved in using the ViT architecture for image translation
are similar to those involved in using other deep learning models for image translation.
It includes a convolutional layer at the beginning to extract local spatial features from
the image. The multi-head self-attention layer and the feedforward neural network layer
are used to capture global dependencies and add non-linearity and capacity to the model.
The use of ViT as an encoder provides several advantages, such as the ability to process
input images efficiently using self-attention and the ability to handle variable-size input
images.
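A minimal sketch of the modified encoder block described in this section is given below, assuming a 16x16 token grid and a 256-dimensional embedding; the ordering of normalization and residual connections follows one common pre-norm convention and may differ from the actual implementation.

```python
import torch
import torch.nn as nn

class I2ITransformerBlock(nn.Module):
    """One encoder block as described above: a convolution over the patch grid
    for local features, multi-head self-attention for global dependencies, and
    an MLP, each combined with layer norm and a residual connection."""
    def __init__(self, dim=256, heads=8, grid=16):
        super().__init__()
        self.grid = grid
        self.local_conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1)
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, tokens):                      # tokens: (B, N, dim)
        b, n, d = tokens.shape
        grid = tokens.transpose(1, 2).reshape(b, d, self.grid, self.grid)
        tokens = tokens + self.local_conv(grid).flatten(2).transpose(1, 2)
        h = self.norm1(tokens)
        tokens = tokens + self.attn(h, h, h, need_weights=False)[0]
        tokens = tokens + self.mlp(self.norm2(tokens))
        return tokens

block = I2ITransformerBlock()
x = torch.randn(2, 16 * 16, 256)   # 256 patch tokens of dimension 256
print(block(x).shape)              # torch.Size([2, 256, 256])
```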
Chapter 4
Translation
Figure 4.2: Layers of the Transformer Block
4.2 Preprocessing
The first step in using the Vision Transformer (ViT) architecture for image processing
is to split the input image into smaller patches. These patches are typically square
regions of the image with a fixed size, such as 16x16 pixels or 32x32 pixels. By
splitting the image into patches, we reduce the dimensionality of the input that the
transformer has to process.
Once the patches have been extracted from the input image, they are flattened into
vectors. This involves reshaping the pixel values of each patch into a one-dimensional
array. The resulting vectors represent the flattened patch images and are used as the
input tokens to the model. The flattened patch vectors are then fed through a linear
projection layer to produce patch embeddings. This projection is a fully connected
neural network layer with a smaller output dimension than the input, which helps reduce
the computational complexity of the ViT model and make it more efficient.
In order to preserve the spatial information of the input image, we add positional em-
beddings to the flattened patch embeddings. These embeddings encode the spatial
relationship between the patches and help the ViT model to understand the loca-
tion of each patch in the image. The positional embeddings are typically learned
during training and are added to the flattened patch embeddings using element-wise
addition.
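The preprocessing steps above can be illustrated with the short PyTorch sketch below; the patch size of 16 and embedding dimension of 256 are assumed values chosen for demonstration.

```python
import torch
import torch.nn as nn

# Sketch of the preprocessing described above: split into 16x16 patches,
# flatten, project linearly, and add learned positional embeddings.
patch_size, embed_dim = 16, 256
image = torch.randn(1, 3, 256, 256)                      # a dummy RGB input

# 1) Split into patches and flatten each to a one-dimensional vector.
patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * patch_size ** 2)

# 2) Linear projection to a smaller embedding dimension.
projection = nn.Linear(3 * patch_size ** 2, embed_dim)
tokens = projection(patches)                              # (1, 256, 256)

# 3) Learned positional embeddings, added element-wise.
pos_embed = nn.Parameter(torch.zeros(1, tokens.shape[1], embed_dim))
tokens = tokens + pos_embed
print(tokens.shape)                                       # torch.Size([1, 256, 256])
```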
The first step is to use a Vision Transformer to extract features from the input images.
ViT is a state-of-the-art model that divides the input image into smaller patches and
processes them through self-attention mechanisms. The ViT model consists of multiple
Transformer encoder layers to encode both global and local information from the image
patches.
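For reference, one possible way to obtain such patch-level features is to reuse a pretrained ViT backbone, for example through the timm library as sketched below; the specific checkpoint name and the output shape are assumptions that depend on the library version, and this is not necessarily how the features were extracted in this work.

```python
import torch
import timm  # assumes the timm library is installed

# Load an assumed off-the-shelf ViT checkpoint; not necessarily the backbone
# used in this work.
vit = timm.create_model("vit_base_patch16_224", pretrained=True)
vit.eval()

images = torch.randn(2, 3, 224, 224)          # batch of resized inputs
with torch.no_grad():
    tokens = vit.forward_features(images)     # patch-token features in recent timm versions
print(tokens.shape)
```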
4.4 Image Generation
1. Generator (G):
The generator takes images from one domain as input and tries to transform them into
the target domain. In this case, we have two generators: Gs2c (Sunny to Cloudy) and
Gc2s (Cloudy to Sunny).
2. Discriminator (D):
The discriminator tries to distinguish between real images from the target domain and
the images generated by the respective generators. There are two discriminators, one
for each target domain.
To reconstruct an image, the generators Gs2c and Gc2s can be applied consecutively.
For example, to reconstruct the sunny image after making it cloudy and then sunny again:

Sunny Image (Is) -> Gs2c -> Cloudy Image (Ic) -> Gc2s -> Reconstructed Sunny Image

Similarly, to reconstruct the cloudy image after making it sunny and then cloudy again:

Cloudy Image (Ic) -> Gc2s -> Sunny Image (Is) -> Gs2c -> Reconstructed Cloudy Image
The reconstruction process helps ensure that the model is learning meaningful trans-
lations and not introducing artifacts or losing information during the process.
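The round-trip reconstruction and the associated cycle-consistency objective can be sketched as follows; the placeholder generators stand in for the actual sunny-to-cloudy and cloudy-to-sunny networks, and the loss form is illustrative.

```python
import torch
import torch.nn as nn

# Hedged sketch of the cycle described above. G_s2c and G_c2s stand in for the
# two generators (sunny->cloudy and cloudy->sunny); any image-to-image network
# with matching input/output shapes could be plugged in here.
G_s2c = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1))   # placeholder generator
G_c2s = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1))   # placeholder generator

l1 = nn.L1Loss()
I_s = torch.rand(1, 3, 256, 256)   # sunny image
I_c = torch.rand(1, 3, 256, 256)   # cloudy image

rec_s = G_c2s(G_s2c(I_s))          # sunny -> cloudy -> reconstructed sunny
rec_c = G_s2c(G_c2s(I_c))          # cloudy -> sunny -> reconstructed cloudy

# Cycle-consistency loss penalizes information lost during the round trip.
cycle_loss = l1(rec_s, I_s) + l1(rec_c, I_c)
cycle_loss.backward()
```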
Chapter 5
Experimental Setup
1. Cityscapes:
This dataset has been widely utilized for various computer vision tasks, including
semantic segmentation, object detection, and image translation. The dataset consists of
a total of 2975 images designated for training and 500 images for validation. Each image
depicts a street-level view of urban scenes. In the context of image translation, the
images in the Cityscapes dataset are initially categorized into two distinct classes:
sunny and cloudy. This categorization reflects the different weather conditions present
in the urban scenes. By segregating the data into these two classes, the model can learn
to transform images between sunny and cloudy conditions, simulating different weather
scenarios and enabling diverse applications in the computer vision domain. After the
initial classification into sunny and cloudy classes, the dataset is further split into
three distinct subsets: training, testing, and validation data. The training set,
consisting of the majority of the images, is used to train the image translation model
on the task of converting images between sunny and cloudy conditions. The testing set is
used to evaluate the model's performance on previously unseen data. Finally, the
validation set is used to fine-tune the model and select hyperparameters.
2. ImageNet:
The dataset used in this research is derived from the renowned ImageNet project, which
serves as a vast visual database primarily employed in the advancement of computer
vision research. It offers a collection of over 20,000 categories, providing a diverse
range of visual data for research purposes. For this specific study, a subset of the
ImageNet dataset consisting of cloudy and sunny images was selected. The subset,
obtained from source [7], includes 5000 images for each class. To ensure proper
evaluation and generalization of the model, the dataset was split into three subsets: a
training set with 70% of the images (3500 images), a validation set with 15%, and a test
set with the remaining 15% (a minimal split sketch is given after this dataset list). As
part of the preprocessing steps, the dataset was centered to enhance data consistency.
3. Soybean crops:
This dataset contains images of three classes: Caterpillar, Diabrotica Speciosa, and
Healthy. These images portray soybean leaves damaged by caterpillars and diabrotica
speciosa, as well as healthy leaves without any damage from the mentioned insects.
Captured in a real environment, the images reflect the natural interferences of wind,
sun, shadows, and cloudy conditions, ensuring a realistic representation of soybean leaf
conditions. In total, the dataset comprises 6,410 images, distributed across three
folders: caterpillar (3,309 images), diabrotica speciosa (2,205 images), and healthy
(896 images) [4]. To increase the dataset size, the images have been augmented:
rotations of 45, 90, and 180 degrees have been applied. This augmentation strategy
facilitates better model generalization and robustness during training. The data's
natural quality enables researchers to apply various filters and pre-processing tools as
needed for their specific applications. This flexibility allows for greater adaptability
to different image translation and classification tasks, enhancing the dataset's
versatility. The images were captured using smartphones and drones, ensuring a diverse
set of viewpoints; the inclusion of drone images further enriches the dataset by
providing aerial views of soybean leaf damage, which can be particularly useful for
certain agricultural analyses. In preparation for image translation tasks, the dataset's
images are classified into two classes: sunny and cloudy. This classification allows the
development of an image translation model that converts images between the two weather
conditions. Subsequently, the dataset is divided into three subsets: training, testing,
and validation data, following standard practices in machine learning. This division
ensures that the image translation model is trained on a significant portion of the
data, evaluated on unseen samples, and fine-tuned for optimal performance. Overall, the
”Soybean Leaf Damage Dataset” serves as a valuable resource for researchers and
practitioners and can facilitate the development of robust and accurate models for
agricultural analysis and decision-making.
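The 70/15/15 split mentioned above can be reproduced with a few lines of PyTorch; the directory path and class-folder layout used below are assumptions for illustration.

```python
import torch
from torch.utils.data import random_split
from torchvision import datasets, transforms

# Minimal sketch of a 70/15/15 split. The directory layout (one sub-folder per
# class, e.g. sunny/ and cloudy/) and the path are assumptions.
tfm = transforms.Compose([transforms.Resize((256, 256)), transforms.ToTensor()])
full = datasets.ImageFolder("data/imagenet_weather", transform=tfm)

n_train = int(0.7 * len(full))
n_val = int(0.15 * len(full))
n_test = len(full) - n_train - n_val
train_set, val_set, test_set = random_split(
    full, [n_train, n_val, n_test],
    generator=torch.Generator().manual_seed(42))  # reproducible split
print(len(train_set), len(val_set), len(test_set))
```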
Software used: PyTorch.
1. Dataset Preprocessing:
The implementation of the Vision Transformer (ViT) involved preprocessing the datasets:
ImageNet, Cityscapes, and Soybean Crop. Images were resized to a uniform dimension of
256x256 pixels, and a binary classification was applied, categorizing images into sunny
and cloudy classes to facilitate the image translation task. The datasets were further
divided into training, testing, and validation subsets.
2. Model Architecture:
The chosen model for unpaired image translation was the Vision Transformer, a
state-of-the-art architecture known for its effectiveness in handling visual data. The
ViT model was configured with multiple transformer layers, self-attention mechanisms,
and feed-forward neural networks. The model was adapted for conditional image
translation between sunny and cloudy conditions.
3. Training:
The ViT-based image translation model was trained using an adversarial learning
framework with cycle-consistency loss. The model was optimized using the Adam optimizer,
with a learning rate schedule and gradient clipping to stabilize training. The training
process involved minimizing the adversarial loss between translated and real images,
ensuring accurate translations while preserving the image's content (a minimal sketch of
this setup follows this list).
4. Testing:
The performance of the implemented ViT model was evaluated using several metrics: the
Structural Similarity Index (SSIM), the Peak Signal-to-Noise Ratio (PSNR), and the
Fréchet Inception Distance (FID). These metrics provided quantitative insights into the
quality of the generated images and the model's ability to produce faithful translations.
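The training setup referenced in step 3 (Adam, a learning-rate schedule, gradient clipping, and a combined adversarial plus cycle-consistency objective) could look roughly like the sketch below; the placeholder networks, learning rate, and loss weight are assumptions, not the exact values used in the experiments.

```python
import itertools
import torch
import torch.nn as nn

# Placeholder networks; real generators/discriminators would be far deeper.
G_s2c = nn.Conv2d(3, 3, 3, padding=1)            # sunny -> cloudy generator
G_c2s = nn.Conv2d(3, 3, 3, padding=1)            # cloudy -> sunny generator
D_c = nn.Sequential(nn.Conv2d(3, 1, 4, stride=2, padding=1))  # cloudy-domain critic

opt_G = torch.optim.Adam(itertools.chain(G_s2c.parameters(), G_c2s.parameters()),
                         lr=2e-4, betas=(0.5, 0.999))
sched_G = torch.optim.lr_scheduler.StepLR(opt_G, step_size=50, gamma=0.5)
adv_loss, l1 = nn.MSELoss(), nn.L1Loss()

I_s = torch.rand(4, 3, 256, 256)                 # a batch of sunny images
fake_c = G_s2c(I_s)                              # translate sunny -> cloudy
rec_s = G_c2s(fake_c)                            # cycle back to sunny

# Generator objective: fool D_c on fake cloudy images + stay cycle-consistent.
pred_fake = D_c(fake_c)
g_loss = adv_loss(pred_fake, torch.ones_like(pred_fake)) + 10.0 * l1(rec_s, I_s)

opt_G.zero_grad()
g_loss.backward()
torch.nn.utils.clip_grad_norm_(itertools.chain(G_s2c.parameters(),
                                               G_c2s.parameters()), max_norm=1.0)
opt_G.step()
sched_G.step()
```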
1. Quantitative Evaluation:
The quantitative evaluation yielded promising results. The model achieved high SSIM and
PSNR scores, indicating a close resemblance between the translated and real images.
Additionally, the FID score demonstrated the model's ability to generate images that
were statistically similar to the real dataset, showcasing its capability to learn
meaningful representations.
2. Qualitative Evaluation:
Visual inspection of the generated images demonstrated the model's proficiency in
transforming sunny images to cloudy conditions and vice versa. The model effectively
captured weather-specific attributes and successfully translated images while preserving
crucial details like object shapes and scene context.
3. Comparative Analysis:
The performance of the implemented ViT model was compared with existing approaches; the
comparison indicated the superiority of the ViT model in terms of image quality,
fidelity, and translation consistency.
FID is a metric used to evaluate the quality and diversity of generated images compared
to real images. It computes the Fréchet distance between the feature representations of
the real and generated images, extracted with an Inception network. FID does not have a
direct physical unit like length or weight; FID values are real numbers that indicate
the dissimilarity between two image distributions. Lower FID values indicate better
image quality and greater similarity between the distributions.
Mathematical Equation:
Let µ1 and Σ1 be the mean and covariance of the real image features, and µ2 and Σ2 be
the mean and covariance of the generated image features. The FID score is computed as
follows:

FID = ||µ1 − µ2||² + Tr(Σ1 + Σ2 − 2(Σ1Σ2)^(1/2))

Here, ||.|| denotes the L2 norm, and Tr() denotes the trace of a matrix.
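A small NumPy/SciPy sketch of this formula is given below; the random arrays stand in for Inception features, whose extraction is omitted here.

```python
import numpy as np
from scipy.linalg import sqrtm

def fid_score(feats_real, feats_fake):
    """FID from two arrays of features, shape (N, D) each.
    Implements ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^(1/2))."""
    mu1, mu2 = feats_real.mean(0), feats_fake.mean(0)
    s1 = np.cov(feats_real, rowvar=False)
    s2 = np.cov(feats_fake, rowvar=False)
    covmean = sqrtm(s1 @ s2)
    if np.iscomplexobj(covmean):          # numerical noise can add tiny imaginary parts
        covmean = covmean.real
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(s1 + s2 - 2 * covmean))

# Toy usage with random "features"; real use would extract Inception activations.
real = np.random.randn(500, 64)
fake = np.random.randn(500, 64)
print(fid_score(real, fake))
```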
SSIM is a metric that measures the structural similarity between two images. SSIM values
range between -1 and 1, where 1 indicates perfect similarity and -1 indicates maximum
dissimilarity. SSIM does not have a unit in the traditional sense, as it is a normalized
measure that assesses the image's structural content, luminance, and contrast. A value
of 1 indicates that the compared images are identical in terms of structure, luminance,
contrast, and texture. A value of -1 indicates that the images are maximally dissimilar.
Higher SSIM values generally indicate better image quality.
Mathematical Equation:
For images I1 and I2, let µ1 and µ2 be their mean intensities, σ1² and σ2² their
variances, and σ12 their covariance. The SSIM index is computed as follows:

SSIM(I1, I2) = ((2µ1µ2 + C1)(2σ12 + C2)) / ((µ1² + µ2² + C1)(σ1² + σ2² + C2))

where C1 and C2 are small constants that stabilize the division.
PSNR (Peak Signal-to-Noise Ratio) is defined as the ratio between the maximum possible
pixel value (peak signal) and the mean squared error between two images (noise). The
unit of PSNR is usually decibels (dB), a logarithmic unit for expressing the ratio
between the original signal's maximum power and the power of the noise. Higher PSNR
values indicate better image quality and less perceptible distortion.
Mathematical Equation:
The PSNR between two images is computed as follows:

PSNR = 10 · log10(MAX_I² / MSE)

where,
- MAX_I is the maximum pixel value (e.g., 255 for 8-bit images).
- MSE is the mean squared error between the two images I1 and I2 .
In the equations above, I1 and I2 represent the real and generated images, respectively,
for the FID and SSIM calculations. For PSNR, I1 is the real image and I2 is the
generated image.
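In practice, SSIM and PSNR can be computed with off-the-shelf routines; the sketch below uses scikit-image on dummy 8-bit RGB arrays and assumes a recent library version.

```python
import numpy as np
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

# Hedged sketch of the SSIM and PSNR evaluation; assumes 8-bit RGB arrays and
# scikit-image >= 0.19 (which accepts the channel_axis argument).
real = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)       # I1
generated = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)  # I2

ssim = structural_similarity(real, generated, channel_axis=-1, data_range=255)
psnr = peak_signal_noise_ratio(real, generated, data_range=255)       # in dB
print(f"SSIM={ssim:.3f}  PSNR={psnr:.2f} dB")
```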
Chapter 7

Results
Figure 7.1: Results of Cloudy to Sunny translation. Column 1 is the cloudy input image.
Figure 7.2: Results of Sunny to Cloudy translation. Column 1 is the sunny input image.
Figure 7.3: Results of Cloudy to Sunny translation. Column 1 is the cloudy input image.
Figure 7.5: Results of Cloudy to Sunny translation. Column 1 is the cloudy input image.
Chapter 8

Conclusion
In this work, we explored the application of the Vision Transformer (ViT) to unpaired
image translation on three datasets: Cityscapes, ImageNet, and Soybean Crop. We set out
to investigate the effectiveness of the ViT model on these datasets and to compare its
performance against existing approaches on the benchmark Cityscapes dataset. Moreover,
we aimed to assess the applicability of the approach to the agricultural domain.
The results obtained from our experiments showcase the potential of the Vision
Transformer for unpaired image translation tasks. The model achieved strong performance
on the Cityscapes dataset, surpassing the results achieved by previously implemented
approaches. This highlights the strength of the ViT approach and its capability to
handle the complex and diverse urban scene images present in the Cityscapes dataset. The
success of ViT on Cityscapes also makes it a promising candidate for addressing similar
challenges in other large-scale datasets like ImageNet.
Furthermore, the application of the Vision Transformer to the Soybean Crop dataset
revealed promising outcomes. To the best of our knowledge, this is the first instance of
image translation being applied to an agricultural dataset. While the results on Soybean
Crop might not have reached the levels of the Cityscapes dataset, this work lays the
groundwork for future research on domain-specific image translation tasks. This
pioneering effort in the agriculture domain opens new possibilities for data
augmentation in agricultural applications. Based on our experiments, we can conclude
that the model's effectiveness on the benchmark Cityscapes dataset serves as an
encouraging indication of its potential for other datasets like ImageNet and Soybean
Crop. The ViT model's ability to understand and adapt to diverse visual data makes it a
versatile and robust choice for unpaired image translation tasks across various domains.
Chapter 9
Timeline
9.1 Gantt Chart
Chapter 10

Publication Details
The work was submitted to a conference focused on research in the fields of computing,
communication, and networking. This event is organized annually with the intention of
providing an excellent platform for leading academics, researchers, industrial
participants, and students to share their research.
Bibliography
[2] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[3] Soohyun Kim, Jongbeom Baek, Jihye Park, Gyeongnyeon Kim, and Seungryong Kim. InstaFormer: Instance-aware image-to-image translation with transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[4] Maria Eloisa Mignoni, Aislan Honorato, Rafael Kunst, Rodrigo Righi, and Angélica Massuquetti. Soybean images dataset for caterpillar and diabrotica speciosa pest detection and classification. Data in Brief, 2022.
[5] Haseeb Nazki, Sook Yoon, Alvaro Fuentes, and Dong Sun Park. Unsupervised image translation using adversarial networks for improved plant disease recognition. Computers and Electronics in Agriculture, 2020.
[6] Yingxue Pang, Jianxin Lin, Tao Qin, and Zhibo Chen. Image-to-image translation: Methods and applications. IEEE Transactions on Multimedia, 2021.
[7] Wanfeng Zheng, Qiang Li, Guoxin Zhang, Pengfei Wan, and Zhongyuan Wang. ITTR: Unpaired image-to-image translation with transformers. arXiv preprint arXiv:2203.16015, 2022.