Unpaired Image To Image Translation Using Vision Transformer
A Project Report
Submitted by
Prajakta Khaire 122122006
INFORMATION TECHNOLOGY,
CERTIFICATE
Certified that this project, titled “Unpaired image-to-image translation for the agricultural
domain”, is approved for the fulfilment of the requirements for the degree of “M.Tech. Com-
puter Engineering”.
SIGNATURE SIGNATURE
Dr. Vahida Attar Dr. P. K. Deshmukh
A lack of adequate training instances, in both quantity and variety, is typically
the barrier to implementing solutions in any field. For agricultural applications,
systems developed for tasks like plant and weed classification and disease detection
usually depend on the plant type, so each application needs its own customized dataset.
A large amount of data is required to develop any ML-based or DL-based model
for a particular task. In the case of classification, data covering different climates and
lighting conditions should be available, so that the system can classify crops precisely in any
condition. Image-to-image (I2I) translation is a technique that converts an image
from one form to another while maintaining the contents of the image. I2I translation
has become more popular in recent years due to its wide range of applications. Many
computer vision and image processing applications such as image segmentation, data
annotation, image enhancement, image synthesis, pose prediction, and style transfer use
image-to-image translation. Various deep learning architectures such as RNNs, CNNs, and
GANs are used for I2I translation. Transformers are another approach that can be
used for I2I translation. This work presents the study of I2I translation, the architecture
of the vision transformer, and how it can be used to implement image translation on the
image dataset.
Contents

List of Tables
Nomenclature
1 Introduction
1.1 Overview
1.2 Motivation
2 Existing Solutions
3 Image Translation
3.3.3 Decoder
3.4 Architecture
3.5 Transformer Encoder
4 Translation
4.2 Preprocessing
5 Experimental Setup
7 Results
9 Timeline
10 Publication Details
List of Tables

7.1 Quantitative evaluation with FID metric, SSIM index and PSNR for the Cityscapes dataset
7.2 Quantitative evaluation with FID metric, SSIM index and PSNR for the ImageNet dataset
7.3 Quantitative evaluation with FID metric, SSIM index and PSNR for the Soybean dataset

List of Figures
Nomenclature
µ Mean
σ Standard deviation
Chapter 1
Introduction
1.1 Overview
Weed detection and control play a vital role in agriculture, as weeds reduce crop
yields, degrade product quality, and raise production costs. A plant that grows along
with valuable agricultural crops is called a weed. Weed detection is the task of
accurately recognizing the regions where weeds grow. Weeds compete with crops for
nutrients and therefore harm them. Because weeds affect the growth of crops, they need
to be detected and removed at an early stage.
To design an autonomous weed control system, a crucial initial step is to locate and
identify weeds accurately. Weed detection in crops is difficult because weeds and crops
frequently have similar textures, colors, and forms. Common issues in recognizing and
distinguishing crops and weeds include similarity in color and texture, crop shadows in
sunlight, and changes in texture and color caused by lighting conditions and brightness.
Multiple approaches have therefore been developed for automatic weed detection using
image data. The unavailability of large datasets becomes the bottleneck for such problems.
Data collection is very important, as the whole system depends on the data. To classify
crops precisely, data captured in different climates and lighting conditions should be
available. Collecting data in such varied conditions is challenging. To deal with this
situation, we can use I2I translation to transform existing images into new conditions.
Deep learning techniques are used for most computer vision tasks. In recent years,
I2I translation has gained more attention due to a variety of applications like image
restoration, style transfer, and image synthesis. I2I translation is a method that
converts an image from one domain to another, i.e. one image is mapped into another.
The success of convolutional networks in computer vision has made them the default
choice for I2I translation tasks as well. Autoencoder and U-Net models built from
convolutional layers are most commonly used for I2I translation. However, convolutional
models struggle to capture long-range relationships within an image. Innovative ideas
such as Transformers, patch embeddings, and attention have extended the range of deep
learning and provided different approaches to address these issues.
1.2 Motivation
One of the primary issues in deep learning for the agricultural domain is dealing
with sparse datasets and limited samples, especially when algorithms require labeled data
and numerous training instances. Although public datasets are accessible online, most are
restricted in size and suited to particular applications; data from diverse areas and at
various growth stages is rarely available. The difficulty of class imbalance is also a
universal concern in computer vision and machine learning. The primary motivation for
this work is to develop an efficient method of generating images to supplement the
training set, which can help improve accuracy in a variety of fields.
1.3 Problem Definition and Objectives
Problem Definition

To implement unpaired image translation using a vision transformer for the agricultural
domain.

Objectives

1. To generate images that can supplement training datasets for agricultural applications.
2. To conduct translation operations for sunny to night, sunny to rainy, and sunny to cloudy conditions.
Chapter 2

Existing Solutions
Several methodologies involving CNNs have been applied to I2I translation. Most of
the older approaches depended on autoencoders for I2I translation. In an autoencoder, the
input is passed through the encoder, which downsamples it until a bottleneck layer [1].
At the bottleneck, a latent vector is formed. A decoder then reverses the process by
upsampling the latent vector until the required image is generated. The difficulty with
these systems is that low-level information valuable for output image creation is lost
at the bottleneck. Some successful systems have utilised U-Net architectures to overcome
this difficulty. U-Net is particularly effective due to its skip connections, which
concatenate encoder features to the decoder [1]. This lets the decoder reuse low-level
details from the encoder when constructing the output image.
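To make the skip-connection idea concrete, here is a minimal, hypothetical PyTorch sketch; the shapes, channel counts, and the fusing convolution are illustrative assumptions rather than the architecture used in this work.

```python
import torch
import torch.nn as nn

# Minimal illustration of a U-Net style skip connection: the encoder feature
# map is concatenated onto the decoder feature map so low-level detail is not
# lost at the bottleneck. Shapes and channel counts are illustrative.
enc_feat = torch.randn(1, 64, 128, 128)    # feature map from an encoder stage
dec_feat = torch.randn(1, 64, 128, 128)    # upsampled feature map in the decoder

merged = torch.cat([dec_feat, enc_feat], dim=1)          # skip connection
fuse = nn.Conv2d(128, 64, kernel_size=3, padding=1)      # mix the two streams
print(fuse(merged).shape)                                # torch.Size([1, 64, 128, 128])
```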
Transformers were first introduced for NLP. Generally, Transformers are pre-trained
on text and fine-tuned for specific applications. Transformers do not suffer from the
memory shortcomings of recurrent networks, thanks to their self-attention mechanism and
the way they use multilayer perceptrons (MLPs). Recurrent layers cannot remember tokens
beyond a certain distance. Some experiments demonstrate that the memory of recurrent
layers can be extended with the help of an attention mechanism, but it is still not
comparable with the Transformer.
iGPT takes the input as pixels and tries to produce similar pixels, whereas vision
transformers take images as input and divide them into patches. iGPT treats pixels simply
like tokens; it works similarly to a text generation model, receiving pixels as input and
creating target pixels.
The original Transformer takes a one-dimensional sequence as input, since it was first
built for NLP. In contrast, when utilized for image classification in computer vision,
the input to the Transformer model is a two-dimensional picture. Vision transformers
(ViT) are a type of transformer architecture that primarily employs encoders together
with patch embeddings. An image is flattened into patches, which are then projected
linearly and combined with position embeddings. In position embedding, the patches are
numbered so that the required output can be generated after translation. The patches
with positional embeddings are given to the transformer encoder.
The following entries summarize representative existing solutions:

2. S. Kim et al. [3], 2022: Performed I2I translation on the Cityscapes dataset, considering instance-level and global-level features.
3. H. Nazki et al. [5], 2020: Implemented image translation for improved plant disease recognition.
4. P. Isola et al. [2], 2016: Learned the mapping from the input to the output image as well as the loss function used to train this mapping.
Chapter 3

Image Translation
I2I translation is an area of computer vision that transforms an image from one
domain to another. The goal of image translation is to learn the mapping between an
input image and an output image that have similar characteristics but different styles.
1. Paired image translation: In this type of translation, the input and the ground-truth
images are aligned. The drawback of paired image translation is that such aligned image
pairs are difficult to collect.
2. Unpaired image translation: In this type of translation, the input and output images
come from different domains, so the goal is to find a mapping between images that are
not paired.
Unpaired I2I translation is a type of image generation technique where the input
and output images have no direct correspondence. In contrast to paired I2I translation,
where a model is trained on a dataset of paired input and output images, unpaired
image-to-image translation requires two separate datasets: one dataset of input images
and one dataset of output images with the desired style or characteristics.
Unpaired I2I translation has many practical applications, such as generating images
in the style of a target domain. For example, it can be used to translate images from a
day scene to a night scene. Overall, it is a powerful technique for generating images
with a desired style or characteristics without relying on paired examples. It has
numerous practical applications and can be implemented using deep generative models.
The vision transformer (ViT) architecture for I2I translation involves modifying the
original ViT architecture by replacing the final classification layer with a decoder
that generates the output image.
The ViT encoder consists of a stack of transformer layers that process the input
image. Each transformer layer includes multi-head self-attention and feedforward neural
network sub-layers. In the original ViT architecture, position embeddings were added to
the input image patches to retain spatial information. For I2I translation, the model
also needs to capture global information about the input image. Therefore, we can add
global position embeddings to the input image by averaging the output of the final
transformer layer and adding the result back to the patch embeddings.
3.3.3 Decoder
The decoder is a generative model that takes the output of the ViT encoder and
generates the output image. The decoder can be a CNN, GAN, or another type of
generative model. During training, the input image is given to the ViT encoder, and the
output of the encoder is then provided to the decoder. The decoder generates the output
image, and the loss function is calculated based on the difference between the generated
image and the ground truth output image. The model is trained using backpropagation.
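As an illustration of the encoder-decoder pairing described above, the following PyTorch sketch wires a small ViT-style encoder to a convolutional decoder and computes a simple L1 loss; the module names, patch size, embedding dimension, and depth are assumptions chosen for readability, not the exact configuration used in this work.

```python
import torch
import torch.nn as nn

class ViTEncoder(nn.Module):
    """Minimal ViT-style encoder: patch embedding + transformer layers."""
    def __init__(self, img_size=256, patch=16, dim=256, depth=6, heads=8):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        num_patches = (img_size // patch) ** 2
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        return self.encoder(tokens + self.pos_embed)

class ConvDecoder(nn.Module):
    """Upsamples the token grid back to an RGB image."""
    def __init__(self, dim=256, patch=16):
        super().__init__()
        ups, ch = [], dim
        for _ in range(4):  # 16x16 token grid -> 256x256 image
            ups += [nn.ConvTranspose2d(ch, ch // 2, 4, stride=2, padding=1),
                    nn.InstanceNorm2d(ch // 2), nn.ReLU(inplace=True)]
            ch //= 2
        ups += [nn.Conv2d(ch, 3, 3, padding=1), nn.Tanh()]
        self.net = nn.Sequential(*ups)

    def forward(self, tokens):
        b, n, d = tokens.shape
        side = int(n ** 0.5)
        grid = tokens.transpose(1, 2).reshape(b, d, side, side)
        return self.net(grid)

encoder, decoder = ViTEncoder(), ConvDecoder()
img = torch.randn(1, 3, 256, 256)
out = decoder(encoder(img))          # (1, 3, 256, 256)
loss = nn.L1Loss()(out, img)         # reconstruction-style training signal
loss.backward()
```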
3.4 Architecture
The transformer encoder block in the context of I2I translation is slightly different
from the one used in natural language processing tasks, as it includes some additional
layers.
The first layer in the transformer encoder block for I2I translation is typically a
convolutional layer. This layer applies a set of convolutional filters to the input
patches to extract local spatial features from the image. The output of this layer is
then passed to the multi-head self-attention layer. This layer operates on the output of
the convolutional layer and computes a weighted sum of the input patches, where the
weights are computed from the similarity between each input patch and every other patch
in the image. This layer allows the model to capture global dependencies between the
patches and to attend to the most relevant patches for the translation. A feedforward
neural network layer then operates on the output of the above-mentioned layer. This
layer is typically composed of linear transformations with a non-linear activation in
between.
As in natural language processing tasks, normalization is also used in the transformer
encoder block for image-to-image translation. The residual connections allow information
from the input image to be propagated directly to the output of the block, while the
layer normalization stabilizes training.
Overall, the steps involved in using the ViT architecture for image translation
are similar to those involved in using other deep learning models for image translation.
It includes a convolutional layer at the beginning to extract local spatial features from
the image. The multi-head self-attention layer and the feedforward neural network layer
are used to capture global dependencies and add non-linearity and capacity to the model.
The use of ViT as an encoder provides several advantages, such as the ability to process
input images efficiently using self-attention and the ability to handle variable-size input
images.
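A minimal sketch of the modified encoder block described in this section is given below, assuming a 16x16 token grid and a 256-dimensional embedding; the ordering of normalization and residual connections follows one common pre-norm convention and may differ from the actual implementation.

```python
import torch
import torch.nn as nn

class I2ITransformerBlock(nn.Module):
    """One encoder block as described above: a convolution over the patch grid
    for local features, multi-head self-attention for global dependencies, and
    an MLP, each combined with layer norm and a residual connection."""
    def __init__(self, dim=256, heads=8, grid=16):
        super().__init__()
        self.grid = grid
        self.local_conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1)
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, tokens):                      # tokens: (B, N, dim)
        b, n, d = tokens.shape
        grid = tokens.transpose(1, 2).reshape(b, d, self.grid, self.grid)
        tokens = tokens + self.local_conv(grid).flatten(2).transpose(1, 2)
        h = self.norm1(tokens)
        tokens = tokens + self.attn(h, h, h, need_weights=False)[0]
        tokens = tokens + self.mlp(self.norm2(tokens))
        return tokens

block = I2ITransformerBlock()
x = torch.randn(2, 16 * 16, 256)   # 256 patch tokens of dimension 256
print(block(x).shape)              # torch.Size([2, 256, 256])
```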
Chapter 4
Translation
Figure 4.2: Layers of the Transformer Block
4.2 Preprocessing
The first step in using the Vision Transformer (ViT) architecture for image processing
is to split the input image into smaller patches. These patches are typically square
regions of the image with a fixed size, such as 16x16 pixels or 32x32 pixels. By
splitting the image into patches, we reduce the dimensionality of the input that the
transformer has to process.
Once the patches have been extracted from the input image, they are flattened into
vectors. This involves reshaping the pixel values of each patch into a one-dimensional
array. The resulting vectors represent the flattened patch images and are used as the
input tokens to the model. The flattened patch vectors are then fed through a linear
projection layer to produce patch embeddings. This projection is a fully connected
neural network layer with a smaller output dimension than the input, which helps reduce
the computational complexity of the ViT model and make it more efficient.
In order to preserve the spatial information of the input image, we add positional em-
beddings to the flattened patch embeddings. These embeddings encode the spatial
relationship between the patches and help the ViT model to understand the loca-
tion of each patch in the image. The positional embeddings are typically learned
during training and are added to the flattened patch embeddings using element-wise
addition.
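The preprocessing steps above can be illustrated with the short PyTorch sketch below; the patch size of 16 and embedding dimension of 256 are assumed values chosen for demonstration.

```python
import torch
import torch.nn as nn

# Sketch of the preprocessing described above: split into 16x16 patches,
# flatten, project linearly, and add learned positional embeddings.
patch_size, embed_dim = 16, 256
image = torch.randn(1, 3, 256, 256)                      # a dummy RGB input

# 1) Split into patches and flatten each to a one-dimensional vector.
patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * patch_size ** 2)

# 2) Linear projection to a smaller embedding dimension.
projection = nn.Linear(3 * patch_size ** 2, embed_dim)
tokens = projection(patches)                              # (1, 256, 256)

# 3) Learned positional embeddings, added element-wise.
pos_embed = nn.Parameter(torch.zeros(1, tokens.shape[1], embed_dim))
tokens = tokens + pos_embed
print(tokens.shape)                                       # torch.Size([1, 256, 256])
```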
The first step is to use a Vision Transformer to extract features from the input images.
ViT is a state-of-the-art model that divides the input image into smaller patches and
processes them through self-attention mechanisms. The ViT model consists of multiple
Transformer encoder layers to encode both global and local information from the image
patches.
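For reference, one possible way to obtain such patch-level features is to reuse a pretrained ViT backbone, for example through the timm library as sketched below; the specific checkpoint name and the output shape are assumptions that depend on the library version, and this is not necessarily how the features were extracted in this work.

```python
import torch
import timm  # assumes the timm library is installed

# Load an assumed off-the-shelf ViT checkpoint; not necessarily the backbone
# used in this work.
vit = timm.create_model("vit_base_patch16_224", pretrained=True)
vit.eval()

images = torch.randn(2, 3, 224, 224)          # batch of resized inputs
with torch.no_grad():
    tokens = vit.forward_features(images)     # patch-token features in recent timm versions
print(tokens.shape)
```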
4.4 Image Generation
1. Generator (G):
The generator takes images from one domain as input and tries to transform them into
the target domain. In this case, we have two generators: Gs2c (Sunny to Cloudy) and
Gc2s (Cloudy to Sunny).
2. Discriminator (D):
The discriminator tries to distinguish between real images from the target domain and
the images generated by the respective generators. There are two discriminators, one
for each target domain.
To reconstruct an image, the generators Gs2c and Gc2s can be applied consecutively.
For example, to reconstruct the sunny image after making it cloudy and then sunny again:

Sunny Image (Is) -> Gs2c -> Cloudy Image (Ic) -> Gc2s -> Reconstructed Sunny Image

Similarly, to reconstruct the cloudy image after making it sunny and then cloudy again:

Cloudy Image (Ic) -> Gc2s -> Sunny Image (Is) -> Gs2c -> Reconstructed Cloudy Image
The reconstruction process helps ensure that the model is learning meaningful trans-
lations and not introducing artifacts or losing information during the process.
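The round-trip reconstruction and the associated cycle-consistency objective can be sketched as follows; the placeholder generators stand in for the actual sunny-to-cloudy and cloudy-to-sunny networks, and the loss form is illustrative.

```python
import torch
import torch.nn as nn

# Hedged sketch of the cycle described above. G_s2c and G_c2s stand in for the
# two generators (sunny->cloudy and cloudy->sunny); any image-to-image network
# with matching input/output shapes could be plugged in here.
G_s2c = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1))   # placeholder generator
G_c2s = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1))   # placeholder generator

l1 = nn.L1Loss()
I_s = torch.rand(1, 3, 256, 256)   # sunny image
I_c = torch.rand(1, 3, 256, 256)   # cloudy image

rec_s = G_c2s(G_s2c(I_s))          # sunny -> cloudy -> reconstructed sunny
rec_c = G_s2c(G_c2s(I_c))          # cloudy -> sunny -> reconstructed cloudy

# Cycle-consistency loss penalizes information lost during the round trip.
cycle_loss = l1(rec_s, I_s) + l1(rec_c, I_c)
cycle_loss.backward()
```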
Chapter 5
Experimental Setup
1. Cityscapes:
This dataset has been widely utilized for various computer vision tasks, including
semantic segmentation, object detection, and image translation. The dataset consists of
a total of 2975 images designated for training and 500 images for validation. Each image
depicts a street-level view of urban scenes. In the context of image translation, the
images in the Cityscapes dataset are initially categorized into two distinct classes:
sunny and cloudy. This categorization reflects the different weather conditions present
in the urban scenes. By segregating the data into these two classes, the model can learn
to transform images between sunny and cloudy conditions, simulating different weather
scenarios and enabling diverse applications in the computer vision domain. After the
initial classification into sunny and cloudy classes, the dataset is further split into
three distinct subsets: training, testing, and validation data. The training set,
consisting of the majority of the images, is used to train the image translation model
on the task of converting images between sunny and cloudy conditions. The testing set is
used to evaluate the model's performance on previously unseen data. Finally, the
validation set is used to fine-tune the model and select hyperparameters.
2. ImageNet:
The dataset used in this research is derived from the renowned ImageNet project, which
serves as a vast visual database primarily employed in the advancement of computer
vision research. It offers a collection of over 20,000 categories, providing a diverse
range of visual data for research purposes. For this specific study, a subset of the
ImageNet dataset consisting of cloudy and sunny images was selected. The subset,
obtained from source [7], includes 5000 images for each class. To ensure proper
evaluation and generalization of the model, the dataset was split into three subsets: a
training set with 70% of the images (3500 images), a validation set with 15%, and a test
set with the remaining 15% (a minimal split sketch is given after this dataset list). As
part of the preprocessing steps, the dataset was centered to enhance data consistency.
3. Soybean crops:
This dataset contains images of three classes: Caterpillar, Diabrotica Speciosa, and
Healthy. These images portray soybean leaves damaged by caterpillars and diabrotica
speciosa, as well as healthy leaves without any damage from the mentioned insects.
Captured in a real environment, the images reflect the natural interferences of wind,
sun, shadows, and cloudy conditions, ensuring a realistic representation of soybean leaf
conditions. In total, the dataset comprises 6,410 images, distributed across three
folders: caterpillar (3,309 images), diabrotica speciosa (2,205 images), and healthy
(896 images) [4]. To increase the dataset size, the images have been augmented:
rotations of 45, 90, and 180 degrees have been applied. This augmentation strategy
facilitates better model generalization and robustness during training. The data's
natural quality enables researchers to apply various filters and pre-processing tools as
needed for their specific applications. This flexibility allows for greater adaptability
to different image translation and classification tasks, enhancing the dataset's
versatility. The images were captured using smartphones and drones, ensuring a diverse
set of viewpoints; the inclusion of drone images further enriches the dataset by
providing aerial views of soybean leaf damage, which can be particularly useful for
certain agricultural analyses. In preparation for image translation tasks, the dataset's
images are classified into two classes: sunny and cloudy. This classification allows the
development of an image translation model that converts images between the two weather
conditions. Subsequently, the dataset is divided into three subsets: training, testing,
and validation data, following standard practices in machine learning. This division
ensures that the image translation model is trained on a significant portion of the
data, evaluated on unseen samples, and fine-tuned for optimal performance. Overall, the
”Soybean Leaf Damage Dataset” serves as a valuable resource for researchers and
practitioners and can facilitate the development of robust and accurate models for
agricultural analysis and decision-making.
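The 70/15/15 split mentioned above can be reproduced with a few lines of PyTorch; the directory path and class-folder layout used below are assumptions for illustration.

```python
import torch
from torch.utils.data import random_split
from torchvision import datasets, transforms

# Minimal sketch of a 70/15/15 split. The directory layout (one sub-folder per
# class, e.g. sunny/ and cloudy/) and the path are assumptions.
tfm = transforms.Compose([transforms.Resize((256, 256)), transforms.ToTensor()])
full = datasets.ImageFolder("data/imagenet_weather", transform=tfm)

n_train = int(0.7 * len(full))
n_val = int(0.15 * len(full))
n_test = len(full) - n_train - n_val
train_set, val_set, test_set = random_split(
    full, [n_train, n_val, n_test],
    generator=torch.Generator().manual_seed(42))  # reproducible split
print(len(train_set), len(val_set), len(test_set))
```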
Software used: PyTorch.
1. Dataset Preprocessing:
The implementation of the Vision Transformer (ViT) involved preprocessing the datasets:
ImageNet, Cityscapes, and Soybean Crop. Images were resized to a uniform dimension of
256x256 pixels, and a binary classification was applied, categorizing images into sunny
and cloudy classes to facilitate the image translation task. The datasets were further
divided into training, testing, and validation subsets.
2. Model Architecture:
The chosen model for unpaired image translation was the Vision Transformer, a
state-of-the-art architecture known for its effectiveness in handling visual data. The
ViT model was configured with multiple transformer layers, self-attention mechanisms,
and feed-forward neural networks. The model was adapted for conditional image
translation between sunny and cloudy conditions.
3. Training:
The ViT-based image translation model was trained using an adversarial learning
framework with cycle-consistency loss. The model was optimized using the Adam optimizer,
with a learning rate schedule and gradient clipping to stabilize training. The training
process involved minimizing the adversarial loss between translated and real images,
ensuring accurate translations while preserving the image's content (a minimal sketch of
this setup follows this list).
4. Testing:
The performance of the implemented ViT model was evaluated using several metrics: the
Structural Similarity Index (SSIM), the Peak Signal-to-Noise Ratio (PSNR), and the
Fréchet Inception Distance (FID). These metrics provided quantitative insights into the
quality of the generated images and the model's ability to produce faithful translations.
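The training setup referenced in step 3 (Adam, a learning-rate schedule, gradient clipping, and a combined adversarial plus cycle-consistency objective) could look roughly like the sketch below; the placeholder networks, learning rate, and loss weight are assumptions, not the exact values used in the experiments.

```python
import itertools
import torch
import torch.nn as nn

# Placeholder networks; real generators/discriminators would be far deeper.
G_s2c = nn.Conv2d(3, 3, 3, padding=1)            # sunny -> cloudy generator
G_c2s = nn.Conv2d(3, 3, 3, padding=1)            # cloudy -> sunny generator
D_c = nn.Sequential(nn.Conv2d(3, 1, 4, stride=2, padding=1))  # cloudy-domain critic

opt_G = torch.optim.Adam(itertools.chain(G_s2c.parameters(), G_c2s.parameters()),
                         lr=2e-4, betas=(0.5, 0.999))
sched_G = torch.optim.lr_scheduler.StepLR(opt_G, step_size=50, gamma=0.5)
adv_loss, l1 = nn.MSELoss(), nn.L1Loss()

I_s = torch.rand(4, 3, 256, 256)                 # a batch of sunny images
fake_c = G_s2c(I_s)                              # translate sunny -> cloudy
rec_s = G_c2s(fake_c)                            # cycle back to sunny

# Generator objective: fool D_c on fake cloudy images + stay cycle-consistent.
pred_fake = D_c(fake_c)
g_loss = adv_loss(pred_fake, torch.ones_like(pred_fake)) + 10.0 * l1(rec_s, I_s)

opt_G.zero_grad()
g_loss.backward()
torch.nn.utils.clip_grad_norm_(itertools.chain(G_s2c.parameters(),
                                               G_c2s.parameters()), max_norm=1.0)
opt_G.step()
sched_G.step()
```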
1. Quantitative Evaluation:
The quantitative evaluation yielded promising results. The model achieved high SSIM and
PSNR scores, indicating a close resemblance between the translated and real images.
Additionally, the FID score demonstrated the model's ability to generate images that
were statistically similar to the real dataset, showcasing its capability to learn
meaningful representations.
2. Qualitative Evaluation:
Visual inspection of the generated images demonstrated the model's proficiency in
transforming sunny images to cloudy conditions and vice versa. The model effectively
captured weather-specific attributes and successfully translated images while preserving
crucial details like object shapes and scene context.
3. Comparative Analysis:
The performance of the implemented ViT model was compared with existing approaches; the
comparison indicated the superiority of the ViT model in terms of image quality,
fidelity, and translation consistency.
FID is a metric used to evaluate the quality and diversity of generated images compared
to real images. It computes the Fréchet distance between the feature representations of
the real and generated images, extracted with an Inception network. FID does not have a
direct physical unit like length or weight; FID values are real numbers that indicate
the dissimilarity between two image distributions. Lower FID values indicate better
image quality and greater similarity between the distributions.
Mathematical Equation:
Let µ1 and Σ1 be the mean and covariance of the real image features, and µ2 and Σ2 be
the mean and covariance of the generated image features. The FID score is computed as
follows:

FID = ||µ1 − µ2||² + Tr(Σ1 + Σ2 − 2(Σ1Σ2)^(1/2))

Here, ||.|| denotes the L2 norm, and Tr() denotes the trace of a matrix.
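A small NumPy/SciPy sketch of this formula is given below; the random arrays stand in for Inception features, whose extraction is omitted here.

```python
import numpy as np
from scipy.linalg import sqrtm

def fid_score(feats_real, feats_fake):
    """FID from two arrays of features, shape (N, D) each.
    Implements ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^(1/2))."""
    mu1, mu2 = feats_real.mean(0), feats_fake.mean(0)
    s1 = np.cov(feats_real, rowvar=False)
    s2 = np.cov(feats_fake, rowvar=False)
    covmean = sqrtm(s1 @ s2)
    if np.iscomplexobj(covmean):          # numerical noise can add tiny imaginary parts
        covmean = covmean.real
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(s1 + s2 - 2 * covmean))

# Toy usage with random "features"; real use would extract Inception activations.
real = np.random.randn(500, 64)
fake = np.random.randn(500, 64)
print(fid_score(real, fake))
```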
SSIM is a metric that measures the structural similarity between two images. SSIM values
range between -1 and 1, where 1 indicates perfect similarity and -1 indicates maximum
dissimilarity. SSIM does not have a unit in the traditional sense, as it is a normalized
measure that assesses the image's structural content, luminance, and contrast. A value
of 1 indicates that the compared images are identical in terms of structure, luminance,
contrast, and texture. A value of -1 indicates that the images are maximally dissimilar.
Higher SSIM values generally indicate better image quality.
Mathematical Equation:
For images I1 and I2, let µ1 and µ2 be their mean intensities, σ1² and σ2² their
variances, and σ12 their covariance. The SSIM index is computed as follows:

SSIM(I1, I2) = ((2µ1µ2 + C1)(2σ12 + C2)) / ((µ1² + µ2² + C1)(σ1² + σ2² + C2))

where C1 and C2 are small constants that stabilize the division.
PSNR (Peak Signal-to-Noise Ratio) is defined as the ratio between the maximum possible
pixel value (peak signal) and the mean squared error between two images (noise). The
unit of PSNR is usually decibels (dB), a logarithmic unit for expressing the ratio
between the original signal's maximum power and the power of the noise. Higher PSNR
values indicate better image quality and less perceptible distortion.
Mathematical Equation:
The PSNR between two images is computed as follows:

PSNR = 10 · log10(MAX_I² / MSE)

where,
- MAX_I is the maximum pixel value (e.g., 255 for 8-bit images).
- MSE is the mean squared error between the two images I1 and I2 .
In the equations above, I1 and I2 represent the real and generated images, respectively,
for the FID and SSIM calculations. For PSNR, I1 is the real image and I2 is the
generated image.
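In practice, SSIM and PSNR can be computed with off-the-shelf routines; the sketch below uses scikit-image on dummy 8-bit RGB arrays and assumes a recent library version.

```python
import numpy as np
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

# Hedged sketch of the SSIM and PSNR evaluation; assumes 8-bit RGB arrays and
# scikit-image >= 0.19 (which accepts the channel_axis argument).
real = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)       # I1
generated = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)  # I2

ssim = structural_similarity(real, generated, channel_axis=-1, data_range=255)
psnr = peak_signal_noise_ratio(real, generated, data_range=255)       # in dB
print(f"SSIM={ssim:.3f}  PSNR={psnr:.2f} dB")
```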
Chapter 7

Results
Figure 7.1: Results of Cloudy to Sunny translation. Column 1 is the cloudy input image.
Figure 7.2: Results of Sunny to Cloudy translation. Column 1 is the sunny input image.
Figure 7.3: Results of Cloudy to Sunny translation. Column 1 is the cloudy input image.
Figure 7.5: Results of Cloudy to Sunny translation. Column 1 is the cloudy input image.
Chapter 8

Conclusion
In this work, we explored the application of the Vision Transformer (ViT) to unpaired
image translation on three datasets: Cityscapes, ImageNet, and Soybean Crop. We set out
to investigate the effectiveness of the ViT model on these datasets and to compare its
performance against existing approaches on the benchmark Cityscapes dataset. Moreover,
we aimed to assess the applicability of the approach to the agricultural domain.
The results obtained from our experiments showcase the potential of the Vision
Transformer for unpaired image translation tasks. The model achieved strong performance
on the Cityscapes dataset, surpassing the results achieved by previously implemented
approaches. This highlights the strength of the ViT approach and its capability to
handle the complex and diverse urban scene images present in the Cityscapes dataset. The
success of ViT on Cityscapes also makes it a promising candidate for addressing similar
challenges in other large-scale datasets like ImageNet.
Furthermore, the application of the Vision Transformer to the Soybean Crop dataset
revealed promising outcomes. To the best of our knowledge, this is the first instance of
image translation being applied to an agricultural dataset. While the results on Soybean
Crop might not have reached the levels of the Cityscapes dataset, this work lays the
groundwork for future research on domain-specific image translation tasks. This
pioneering effort in the agriculture domain opens new possibilities for data
augmentation in agricultural applications. Based on our experiments, we can conclude
that the model's effectiveness on the benchmark Cityscapes dataset serves as an
encouraging indication of its potential for other datasets like ImageNet and Soybean
Crop. The ViT model's ability to understand and adapt to diverse visual data makes it a
versatile and robust choice for unpaired image translation tasks across various domains.
Chapter 9
Timeline
9.1 Gantt Chart
Chapter 10

Publication Details
The work was submitted to a conference focused on research in the fields of computing,
communication, and networking. This event is organized annually with the intention of
providing an excellent platform for leading academics, researchers, industrial
participants, and students to share their research.
Bibliography
[2] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[3] Soohyun Kim, Jongbeom Baek, Jihye Park, Gyeongnyeon Kim, and Seungryong Kim. InstaFormer: Instance-aware image-to-image translation with transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[4] Maria Eloisa Mignoni, Aislan Honorato, Rafael Kunst, Rodrigo Righi, and Angélica Massuquetti. Soybean images dataset for caterpillar and diabrotica speciosa pest detection and classification. Data in Brief, 2022.
[5] Haseeb Nazki, Sook Yoon, Alvaro Fuentes, and Dong Sun Park. Unsupervised image translation using adversarial networks for improved plant disease recognition. Computers and Electronics in Agriculture, 2020.
[6] Yingxue Pang, Jianxin Lin, Tao Qin, and Zhibo Chen. Image-to-image translation: Methods and applications. IEEE Transactions on Multimedia, 2021.
[7] Wanfeng Zheng, Qiang Li, Guoxin Zhang, Pengfei Wan, and Zhongyuan Wang. ITTR: Unpaired image-to-image translation with transformers. arXiv preprint arXiv:2203.16015, 2022.