Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                

Khaire, Prajakta Unpaired Image To Image Translation Using Vision Transformer

Download as pdf or txt
Download as pdf or txt
You are on page 1of 48

Unpaired image-to-image translation for agricultural

images using Vision Transformer

A Project Report
Submitted by
Prajakta Khaire 122122006

in fulfilment for the award of the degree


of

M.Tech Computer Engineering

Under the guidance of


Dr. Vahida Attar
College of Engineering, Pune

DEPARTMENT OF COMPUTER ENGINEERING AND


INFORMATION TECHNOLOGY,
COEP Technological University
August, 2023
DEPARTMENT OF COMPUTER ENGINEERING AND

INFORMATION TECHNOLOGY,

COEP Technological University

CERTIFICATE

Certified that this project, titled “Unpaired image-to-image translation for agricul-

tural images using Vision Transformer ” has been successfully completed by

Prajakta Khaire 122122006

and is approved for the fulfilment of the requirements for the degree of “M.Tech. Com-

puter Engineering”.

SIGNATURE SIGNATURE
Dr. Vahida Attar Dr. P. K. Deshmukh

Project Guide Head

Department of Computer Engineering Department of Computer Engineering

and Information Technology, and Information Technology,

COEP Technological University, COEP Technological Univesity,

Shivajinagar, Pune - 5. Shivajinagar, Pune - 5.


Abstract

A lack of adequate training instances, including both quantity and variety, is typically

the barrier in the implementation of solutions in any field. For agricultural applications,

systems developed for tasks like plant and weed classification, and disease detection will

usually rely on the plant type. So each application needs its unique customized dataset.

A large amount of data is required to develop any kind of ML-based or DL-based model

for a particular task. In the case of classification, the data of different climates, and

light conditions should be available, so that system can classify the crops precisely in any

condition. The image-to-image(I2I) translation is a technique used to translate images

from one form to another while maintaining the contents of the image. I2I translation

has become more popular in recent years due to its wide range of applications. Many

computer vision and image processing applications such as image segmentation, data

annotation, image enhancement, image synthesis, pose prediction, and style transfer use

the image to image translation. Various Deep learning algorithms such as RNN, CNN,

GAN, etc. are used for I2I translation. Transformers is another approach that can be

used for I2I translation. This work presents the study of I2I translation, the architecture

of the vision transformer, and how it can be used to implement image translation on the

image dataset.
Contents

List of Tables ii

List of Figures iii

Nomenclature iv

1 Introduction 1

1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.3 Problem Definition and Objectives . . . . . . . . . . . . . . . . . . . . . 3

2 Existing Solutions 4

2.1 Review of Related Literature . . . . . . . . . . . . . . . . . . . . . . . . . 4

3 Image Translation 7

3.1 Image-to-image translation . . . . . . . . . . . . . . . . . . . . . . . . . . 7

3.2 Unpaired I2I translation . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

3.3 Vision Transformer(ViT) . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

3.3.1 ViT Encoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

3.3.2 Global Position Embeddings . . . . . . . . . . . . . . . . . . . . . 8

3.3.3 Decoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

3.4 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

1
3.5 Transformer Encoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

3.5.1 Convolutional Layer . . . . . . . . . . . . . . . . . . . . . . . . . 10

3.5.2 Multi-Head Self-Attention Layer . . . . . . . . . . . . . . . . . . . 10

3.5.3 Feedforward Neural Network Layer . . . . . . . . . . . . . . . . . 10

3.5.4 Normalization Layer . . . . . . . . . . . . . . . . . . . . . . . . . 11

4 Vision Transformer for Image Translation 12

4.1 System Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

4.2 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

4.3 Vision Transformers (ViT) for Image Translation . . . . . . . . . . . . . 14

4.4 Image Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

4.5 Loss Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

4.6 Image Reconstruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

5 Experimental Setup 17

5.1 Dataset Requirement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

5.2 Software Requirement . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

6 Implementation and Result 21

6.1 Implementation Steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

6.2 Evaluation Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

6.3 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

7 Results 26

7.1 Result of CityScape Dataset . . . . . . . . . . . . . . . . . . . . . . . . . 26

7.2 Result of ImageNet Dataset . . . . . . . . . . . . . . . . . . . . . . . . . 28

7.3 Result of Soybean Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . 30


8 Conclusion 33

9 Timeline 35

9.1 Gantt Chart . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

10 Publication Details 37
List of Tables

2.1 Review of Related Literature . . . . . . . . . . . . . . . . . . . . . . . . . 6

7.1 Quantitative evaluation with FID metric, SSIM index and PSNR for cityscape

dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

7.2 Quantitative evaluation with FID metric, SSIM index and PSNR for Ima-

geNet dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

7.3 Quantitative evaluation with FID metric, SSIM index and PSNR for Soy-

bean dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

7.4 Quantitative evaluation on Cityscape Dataset . . . . . . . . . . . . . . . 32

9.1 Action Plan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

ii
List of Figures

3.1 Architecture of vision transformer . . . . . . . . . . . . . . . . . . . . . . 9

4.1 Vision Transformer Architecture for image translation . . . . . . . . . . 12

4.2 Layers of Transforms Block . . . . . . . . . . . . . . . . . . . . . . . . . 13

7.1 Results of Cloudy to Sunny translation Column 1 is Cloudy input image

Column 2 is patches of input image and Column 3 is Sunny output image 26

7.2 Results of Sunny to Cloudy translation Column 1 is Sunny input image

Column 2 is patches of input image and Column 3 is Cloudy output image 27

7.3 Results of Cloudy to Sunny translation Column 1 is Cloudy input image

Column 2 is patches of input image and Column 3 is Sunny output image 28

7.4 Results of Sunny to Cloudy translation Column 1 is Sunny input image

Column 2 is patches of input image and Column 3 is Cloudy output image 29

7.5 Results of Cloudy to Sunny translation Column 1 is Cloudy input image

Column 2 is patches of input image and Column 3 is Sunny output image 30

7.6 Results of Sunny to Cloudy translation Column 1 is Sunny input image

Column 2 is patches of input image and Column 3 is Cloudy output image 31

9.1 Gantt Chart . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

iii
Nomenclature

µ Mean

σ Standard deviation

iv
Chapter 1

Introduction

1.1 Overview

Weed detection and control play a vital role in agriculture, as weeds reduce crop

yields, degrade product quality, and raise production costs. The plant that grows along

with valuable agricultural goods is called a weed. Weed detection is the challenge of

properly recognizing the area of weeds. Weeds may eliminate the nutrients in crops; for

this reason, they will harm crops. As weeds may affect the growth of crops, these weeds

should be detected and removed.

To design an autonomous weed control system, a crucial initial step is to locate and

identify weeds accurately. Weed detection in crops is problematic since weeds and crops

frequently have similar textures, colors, and forms. Common issues in recognizing and

distinguishing crops and weeds include similarity in color and texture, crop shadow in

sunlight, and texture and color change because of lighting conditions and brightness. So

multiple approaches have been developed for automatic weed detection using image data.

Unavailability of the large data becomes the bottleneck for such a problem.

Data collection is very important as the whole system depends on the data. To clas-

sify the crops precisely the data in different climates and lightning conditions should be

1
available. Collecting data in such various conditions becomes challenging. So, to deal

with this situation, we can use an I2I translation that can help to transform images.

Deep learning techniques are used for most of computer vision tasks. In recent years,

I2I translation has gained more attention due to a variety of applications like image

restoration, style transfer, and image synthesis. The I2I translation is a method that

converts an image from one domain to another, i.e. one image is mapped into another.

The successful implementation of Convolutional networks in many computer vision tasks

has made it popular that it got the priority for I2I translation tasks too. Autoencoders

and U-net models in convolutional networks are mostly used for I2I translation.

Considering its effectiveness, convolutional layers have trouble representing long-

range relationships. Innovative ideas, Transformers, patch, and attention extended the

range of deep learning and provided different approaches to address issues. Transformers

employ self-attention that is applied to the complete input.

1.2 Motivation

One of the primary issues in Deep Learning associated with the agricultural domain

is dealing with sparse datasets and limited samples, especially when algorithms require

labeled data and numerous training instances. Although public datasets are accessible

online, most datasets are restricted in size and suitable for particular applications. Data

collection is a complicated and costly activity. It demands the involvement of individuals

from diverse areas at various stages. Also, the difficulty of class imbalance has been a

universal concern in computer vision as well as in machine learning. The primary moti-

vation for this work is to develop an efficient method of generating images to supplement

the training set, which can help improve accuracy in a variety of fields.
1.3 Problem Definition and Objectives

Problem Definition

To implement unpaired image translation using a vision transformer for the agricul-

tural crop images.

Objectives

1. To analyse the publicly available agricultural image dataset.

2. To study the vision transformer architecture for image translation task.

3. To implement a transformer-based architecture for the conversion of unpaired im-

ages.

4. To conduct translation operations for sunny to night, sunny to rainy, sunny to cloudy

and, vice-versa on the cityscapes benchmark dataset.

5. To validate the performance of the above transformer-based architecture on the

agricultural crop image dataset.


Chapter 2

Existing Solutions

2.1 Review of Related Literature

Several methodologies involving CNNs were applied for I2I translation. Most of

the older approaches have depended on Autoencoders for I2I translation. An input is

transmitted via the encoder in an autoencoder. Encoder downsamples the input until a

bottleneck layer.[1] At the bottleneck latent vector is developed. After that, a decoder

reverses the process by upsampling a latent vector till the required image is generated.

The difficulty with these systems is that low-level information valuable for output image

creation is lost at bottleneck. Some successful systems have utilised U-Net technologies

to overcome this difficulty. U-Net uses skip connections. U-Net is particularly effective

due to skip connections, which concatenate the encoder to the decoder.[1] This lets the

low-level information to be kept and promptly transmitted to the output.

Transformers were first introduced for NLP. Generally, Transformers are pre-trained

on the text and fine-tuned for unique applications. Transformers do not suffer from

memory shortages as compared to recurrent networks. Because the transformer has self-

attention mechanisms and the way with which it uses Multilayer perceptrons(MLPs).

Recurrent layers cannot remember tokens after a specified amount of time. Some exper-

4
iments demonstrate that the memory of the recurrent layers can be expanded with the

help of the attention mechanism but still it is not compatible with the transformer.

Image GPT(IGPT) is another type of transformer-based architecture that takes in-

put as pixels and tries to produce similar pixels. Vision transformers take images as input

and divide it into patches. IGPT operates pixels simply like tokens. IGPT works simi-

larly like the text generation model where it receives pixels as input and creates target

pixels.

The transformer model gets a one-dimensional sequence of word embeddings as in-

put since it was first built for NLP. In contrast, when utilized for the purpose of image

classification in computer vision, the input data to the Transformer model is provided in

the form of two-dimensional pictures. Vision transformers (Vit) are a type of transformer

architecture that primarily employs encoders and unique patch embeddings. An images

are flattened into patches and then projected linearly to add position embedding. In

position embedding the patches are numbered so as to generate required output after

translation. The pathches with positional embedding are given to transformer encoder

for further processing.


Table 2.1: Review of Related Literature

Sr.No. Author Year Major takeaway/ Research gap

In this article, the taxonomy of different methods and ap-


1 Y. Pang et al. [6] 2021
plications of image translation has been provided.

In this, they have implemented image translation on the

2 S. Kim et al. [3] 2022 cityscapes dataset. They have considered instance and

global-level features.

Based on image translation and augmentation this article

focuses on how the improve the performance of plant disease

3 H. Nazki et al. [5] 2020 recognition. They have implemented the image translation

on healthy tomato crop leaves which are translated to the

infected leaves to get a larger size of balanced data.

For I2I translation, they have developed networks that learn

4 P. Isola et al. [2] 2016 the mapping from the input to the output image as well as

a loss function to train the mapping.

In this, they have developed a system using adversarial-

consistency loss GAN (ACL-GAN) which translates the im-


5 W. Zheng et al. [7] 2022
ages into anime. But has a limitation in that the model is

not able to handle the data with complex background.


Chapter 3

Image Translation

3.1 Image-to-image translation

The I2I translation is an area in computer vision that transforms the image from one

domain to another. The goal of image translation is to understand the mapping between

the input image and output image that has similar characteristics but different styles.

There are two types of I2I translation.

1. Paired image translation In this type of translation, the input and the ground truth

image are aligned. The drawback of paired image translation is that it is difficult to

obtain the paired training samples.

2. Unpaired image translation In this type of translation, the domain of input and

output images is different so the goal is to find the mapping between different images

using unpaired training data.

3.2 Unpaired I2I translation

Unpaired I2I translation is a type of image generation technique where the input

image and output images has no direct correspondence. In contrast to paired I2I transla-

7
tion, where a model is trained on a dataset of paired input and output images, unpaired

image-to-image translation requires two separate datasets: one dataset of input images

and one dataset of output images with the desired style or characteristics.

Unpaired I2I translation has many practical applications, such as generating images

in a different style, enhancing the quality of images, or converting images to a different

domain. For example, it can be used to translate images from a day scene to a night scene

or to convert a photograph of a horse to a zebra. Unpaired I2I translation is a powerful

technique for generating images with a desired style or characteristics without relying on

paired examples. It has numerous practical applications and can be implemented using

various machine learning techniques.

3.3 Vision Transformer(ViT)

The vision transformer (ViT) architecture for I2I translation involves modifying the

original ViT architecture by replacing the final classification layer with a decoder that

generates the output image. Modified ViT architecture is given below:

3.3.1 ViT Encoder

The ViT encoder consists of a stack of transformer layers that process the input

image. Each transformer layer includes multi-head self-attention and feedforward neural

networks to extract features from the input image.

3.3.2 Global Position Embeddings

In the original ViT architecture, position embeddings were added to the input image

to preserve spatial information. However, in image-to-image translation, we also need to

capture global information about the input image. Therefore, we can add global position
embeddings to the input image by averaging the output of the final transformer layer and

concatenating it with the input image.

3.3.3 Decoder

The decoder is a generative model that takes the output of the ViT encoder and

generates the output image. The decoder can be a CNN, GAN, or another type of

generative model. During training, the input image is given to the ViT encoder, and the

output of the encoder is then provided to the decoder. The decoder generates the output

image, and the loss function is calculated based on the difference between the generated

image and the ground truth output image. The model is trained using backpropagation

to optimize the parameters.

3.4 Architecture

Figure 3.1: Architecture of vision transformer


3.5 Transformer Encoder

The transformer encoder block in this context of I2I translation is slightly different than

the one used in natural language processing tasks, as it includes some additional layers

that are specific to image processing:

3.5.1 Convolutional Layer

The first layer in the transformer encoder block for I2I translation is typically a

convolutional layer. This layer applies a set of convolutional filters to the input patches

to extract local spatial features from the image. The output of the this layer is then

passed to the following layers.

3.5.2 Multi-Head Self-Attention Layer

This layer operates on the output of convolutional layer and computes a weighted

sum of the input patches, where weights are computed by the similarity between each

input patch and every other patch in the image. This layer allows the model to capture

global dependencies between the patches and to attend to the most relevant patches for

the image-to-image translation task.

3.5.3 Feedforward Neural Network Layer

The feedforward neural network layer applies a non-linear transformation to the

output of the above mentioned layer. This layer is typically composed of linear transfor-

mations separated by a activation function. The purpose of the feedforward layer is to

add additional capacity and non-linearity to the transformer model.


3.5.4 Normalization Layer

Like natural language processing tasks, normalization is also used in the transformer

encoder block for image-to-image translation. The residual connections allow information

from the input image to be directly propagated to the output of the block, while the layer

normalization stabilizes the learning process.

Overall, the steps involved in using the ViT architecture for image translation

are similar to those involved in using other deep learning models for image translation.

It includes a convolutional layer at the beginning to extract local spatial features from

the image. The multi-head self-attention layer and the feedforward neural network layer

are used to capture global dependencies and add non-linearity and capacity to the model.

The use of ViT as an encoder provides several advantages, such as the ability to process

input images efficiently using self-attention and the ability to handle variable-size input

images.
Chapter 4

Vision Transformer for Image

Translation

4.1 System Architecture

Figure 4.1: Vision Transformer Architecture for image translation

12
Figure 4.2: Layers of Transforms Block

4.2 Preprocessing

1. Splitting an Image into Patches:

The first step in using the Vision Transformer (ViT) architecture for image process-

ing is to split the input image into smaller patches. These patches are typically

square regions of the image with a fixed size, such as 16x16 pixels or 32x32 pixels.

By splitting the image into patches, we can reduce the dimensionality of the input

and make it easier to process with the ViT model.

2. Flattening the Patches:

Once the patches have been extracted from the input image, they are flattened into
vectors. This involves reshaping the pixel values of each patch into a one-dimensional

array. The resulting vectors represent the flattened patch images and are used as

the input to the ViT model.

3. Producing Linear Embeddings:

The flattened patch vectors are then fed through a linear projection layer to produce

lower-dimensional embeddings. This linear projection layer is implemented as a fully

connected neural network layer with a smaller output dimension than the input

dimension. By reducing the dimensionality of the patch vectors, we can further

reduce the computational complexity of the ViT model and make it more efficient.

4. Adding Positional Embeddings:

In order to preserve the spatial information of the input image, we add positional em-

beddings to the flattened patch embeddings. These embeddings encode the spatial

relationship between the patches and help the ViT model to understand the loca-

tion of each patch in the image. The positional embeddings are typically learned

during training and are added to the flattened patch embeddings using element-wise

addition.

4.3 Vision Transformers (ViT) for Image Translation

The first step is to use a Vision Transformer to extract features from the input images.

ViT is a state-of-the-art model that divides the input image into smaller patches and

processes them through self-attention mechanisms. The ViT model consists of multiple

Transformer encoder layers to encode both global and local information from the image

patches.
4.4 Image Generation

CycleGAN is a framework that uses cycle-consistency loss to enforce the recon-

structed images to be similar to their original counterparts. It consists of two main

components: the generator and the discriminator.

1. Generator (G):

The generator takes images from one domain as input and tries to transform

them to the target domain. In this case, we have two generators: Gs 2c (Sunny to

Cloudy) and Gc 2s (Cloudy to Sunny).

2. Discriminator (D):

The discriminator tries to distinguish between the real images from the tar-

get domain and the generated images by the respective generators. There are two

discriminators: Ds (Sunny Discriminator) and Dc (Cloudy Discriminator).

4.5 Loss Function

1. Generator (G):

The generator takes images from one domain as input and tries to transform

them to the target domain. In this case, we have two generators: Gs 2c (Sunny to

Cloudy) and Gc 2s (Cloudy to Sunny).

2. Discriminator (D):

The discriminator tries to distinguish between the real images from the tar-

get domain and the generated images by the respective generators. There are two

discriminators: Ds (Sunny Discriminator) and Dc (Cloudy Discriminator).


4.6 Image Reconstruction

To reconstruct the image, you can apply the generators Gs 2c and Gc 2s consecutively.

For example, to reconstruct the sunny image after making it cloudy and then sunny again:

Sunny Image (Is ) − > Gs 2c − > Cloudy Image (Ic ) − > Gc 2s − > Reconstructed

Sunny Image

Similarly, to reconstruct the cloudy image after making it sunny and then cloudy

again:

Cloudy Image (Ic ) − > Gc 2s − > Sunny Image (Is ) − > Gs 2c − > Reconstructed

Cloudy Image

The reconstruction process helps ensure that the model is learning meaningful trans-

lations and not introducing artifacts or losing information during the process.
Chapter 5

Experimental Setup

5.1 Dataset Requirement

1. Cityscapes:

Cityscapes is a comprehensive and widely-used dataset comprising labeled videos

captured from vehicles traveling in various urban environments across Germany.

This dataset has been widely utilized for various computer vision tasks, including se-

mantic segmentation, object detection, and image translation. The dataset consists

of a total of 2975 images designated for training and 500 images for validation. Each

image is of dimensions 256x512 pixels, ensuring a high-resolution representation of

the urban scenes. In the context of image translation, the images in the Cityscapes

dataset are initially categorized into two distinct classes: sunny and cloudy. This

classification is essential to provide meaningful input to our image translation model,

as weather conditions greatly impact the appearance and characteristics of urban

scenes. By segregating the data into these two classes, the model can learn to trans-

form images between the sunny and cloudy conditions, simulating different weather

scenarios and enabling diverse applications in the computer vision domain. After

the initial classification into sunny and cloudy classes, the dataset is further split

17
into three distinct subsets: training, testing, and validation data. The training set,

consisting of a majority of the images, is used to train the image translation model

on the task of converting images between sunny and cloudy conditions. The testing

set is employed to evaluate the model’s performance and generalization abilities on

previously unseen data. Finally, the validation set is used to fine-tune the model and

make adjustments to hyperparameters, ensuring optimal performance and prevent-

ing overfitting. The Cityscapes dataset’s unique combination of real-world urban

scenes, high-quality annotations, and diverse weather conditions makes it an ideal

choice for testing and benchmarking image translation models.

2. Imagenet:

The dataset used in this research is derived from the renowned ImageNet

project, which serves as a vast visual database primarily employed in the advance-

ment of visual object recognition algorithms. ImageNet comprises an extensive col-

lection of over 20,000 categories, providing a diverse range of visual data for research

purposes. For this specific study, a subset of the ImageNet dataset consisting of

cloudy and sunny images was selected. The subset, obtained from source [7], in-

cludes 5000 images for each class. To ensure proper evaluation and generalization of

the model, the dataset was split into three subsets: a training set with 70% of the

images (3500 images), a validation set with 15%, and a test set with the remain-

ing 15As part of the preprocessing steps, the dataset was centered to enhance data

consistency and improve model performance during training and evaluation.

3. Soyabean crops:

The dataset, titled ”Soybean Leaf Damage Dataset,” is a valuable collection of

soybean leaf images encompassing three distinct categories: Caterpillar, Diabrotica

Speciosa, and Healthy. These images portray soybean leaves damaged by caterpil-
lars, diabrotica speciosa, as well as healthy leaves without any damage from the

mentioned insects. Captured in a real environment, the images capture the natural

interferences of wind, sun, shadows, and cloudy conditions, ensuring a realistic rep-

resentation of soybean leaf conditions. In total, the dataset comprises 6,410 images,

distributed across three folders: caterpillar (3,309 images), diabrotica speciosa (2,205

images), and healthy (896 images) [4]. To increase the dataset size, the images have

been standardized to dimensions of 500 x 500 pixels, and augmentations such as

rotations of 45, 90, and 180 degrees have been applied. This augmentation strategy

facilitates better model generalization and robustness during training. The data’s

natural quality enables researchers to apply various filters and pre-processing tools as

needed for their specific applications. This flexibility allows for greater adaptability

to different image translation and classification tasks, enhancing the dataset’s versa-

tility. The images were captured using smartphones and drones, ensuring a diverse

range of perspectives and capturing conditions. The availability of drone-captured

images further enriches the dataset by providing aerial views of soybean leaf damage,

which can be particularly useful for certain agricultural analyses. In preparation for

image translation tasks, the dataset’s images are classified into two classes: sunny

and cloudy. This classification allows the development of an image translation model

capable of transforming soybean leaf images between different weather conditions.

Subsequently, the dataset is divided into three subsets: training, testing, and vali-

dation data, following standard practices in machine learning. This division ensures

that the image translation model is trained on a significant portion of the data,

evaluated on unseen samples, and fine-tuned for optimal performance. Overall, the

”Soybean Leaf Damage Dataset” serves as a valuable resource for researchers and

practitioners in the field of agriculture, particularly those working on soybean leaf


damage detection, image translation, and other related tasks. Its comprehensive

collection of real-world images, diverse conditions, and clear categorization facili-

tate the development of robust and accurate models for agricultural analysis and

decision-making.

5.2 Software Requirement

1. Google Colab/ Jupyter Notebook

2. Pytorch

3. Python 3 Programming language


Chapter 6

Implementation and Result

6.1 Implementation Steps

1. Dataset Preprocessing:

The implementation of unpaired image translation using Vision Transformer

(ViT) involved preprocessing the datasets - ImageNet, Cityscape, and Soybean Crop.

Images were resized to a uniform dimension of 256x256 pixels, and a binary classi-

fication was applied, categorizing images into sunny and cloudy classes to facilitate

the image translation task. The datasets were further divided into training, testing,

and validation sets using standard proportions.

2. Model Architecture:

The chosen model for unpaired image translation was the Vision Transformer, a

state-of-the-art architecture known for its effectiveness in handling visual data. The

ViT model was configured with multiple transformer layers, self-attention mecha-

nisms, and feed-forward neural networks. The model was adapted for conditional

image-to-image translation, enabling it to transform images between sunny and

cloudy conditions.

3. Training:

21
The ViT-based image translation model was trained using an adversarial learn-

ing framework with cycle-consistency loss. The model was optimized using the Adam

optimizer, with a learning rate schedule and gradient clipping to stabilize training.

The training process involved minimizing the adversarial loss between translated

and real images, ensuring accurate translations while maintaining the image’s con-

tent and style.

4. Testing:

The performance of the implemented ViT model was evaluated using various

metrics, including Structural Similarity Index (SSIM), Peak Signal-to-Noise Ratio

(PSNR), and the Frchet Inception Distance (FID). These metrics provided quanti-

tative insights into the quality of the generated images and the model’s ability to

preserve essential visual features during translation.

6.2 Evaluation Parameters

1. Quantitative Evaluation:

The quantitative evaluation of the ViT-based image translation model revealed

promising results. The model achieved high SSIM and PSNR scores, indicating a

close resemblance between the translated and real images. Additionally, the FID

score demonstrated the model’s ability to generate images that were statistically

similar to the real dataset, showcasing its capability to learn meaningful representa-

tions.

2. Qualitative Evaluation:

Qualitative analysis of the translated images demonstrated the ViT model’s

proficiency in transforming sunny images to cloudy conditions and vice versa. The
model effectively captured weather-specific attributes and successfully translated

images while preserving crucial details like object shapes and scene context.

3. Comparative Analysis:

The performance of the implemented ViT model was compared with existing

state-of-the-art methods for unpaired image translation. The results highlighted

the superiority of the ViT model in terms of image quality, fidelity, and translation

accuracy, solidifying its position as a leading approach in this domain.

6.3 Evaluation Metrics

1. Frchet Inception Distance (FID):

FID is a metric used to evaluate the quality and diversity of generated images

compared to real images. It computes the Frchet distance between the feature rep-

resentations of real and generated images using an Inception network.

It doesn’t have a direct physical unit like length or weight. FID values are real

numbers that indicate the dissimilarity between two image distributions. Lower FID

values indicate better image quality and similarity between the distributions.

Mathematical Equation:

Let µ1 and Σ1 be the mean and covariance of the real image features, and µ2

and Σ2 be the mean and covariance of the generated image features. The FID score

is computed as follows:

F ID = kµ1 − µ2 k2 + Tr (Σ1 + Σ2 − 2 (Σ1 ∗ Σ2 ))0 .5

Here, ||.|| denotes the L2 norm, and Tr() denotes the trace of a matrix.

2. Structural Similarity Index (SSIM):

SSIM is a metric that measures the structural similarity between two images.
SSIM values range between -1 and 1, where 1 indicates perfect similarity and -1

indicates maximum dissimilarity. SSIM doesn’t have a unit in the traditional sense,

as it’s a normalized measure that assesses the image’s structural content, luminance,

contrast, and texture.

A value of 1 indicates that the compared images are identical in terms of struc-

ture, luminance, contrast, and texture. A value of -1 indicates that the images are

maximally dissimilar. Higher SSIM values generally indicate better image quality

and greater similarity between images.

Mathematical Equation:

Let µ1 and Σ1 be the mean and covariance of the real image features, and µ2

and Σ2 be the mean and covariance of the generated image features. The FID score

is computed as follows:

SSIM (I1 , I2 ) = (2 ∗ µ1 ∗ µ2 + c1 ) ∗ (2 ∗ σ1 2 + c2 ) / (µ21 + µ22 + c1 ) ∗ (σ12 + σ22 + c2 )

where,

- I1 and I2 are the two images being compared.

- µ1 and µ2 are the means of the two images.

- σ1 and σ2 are the standard deviations of the two images.

- σ1 2 is the covariance between the images.

- c1 and c2 are constants to stabilize the division.

3. Peak Signal-to-Noise Ratio (PSNR):

PSNR is a metric commonly used to measure the quality of images. It calculates

the ratio between the maximum possible pixel value (peak signal) and the mean

squared error between two images (noise). The unit of PSNR is usually decibels

(dB), which is a logarithmic unit for expressing the ratio between the original signal’s

maximum power and the power of noise. Higher PSNR values indicate better image
quality and less perceptible distortion.

Mathematical Equation:

Let µ1 and Σ1 be the mean and covariance of the real image features, and µ2

and Σ2 be the mean and covariance of the generated image features. The FID score

is computed as follows:

PSNR (I1 , I2 ) = 20 ∗ log 10 (M AXI ) − 10 ∗ log 10(M SE)

where,

- M AXI is the maximum pixel value (e.g., 255 for 8-bit images).

- MSE is the mean squared error between the two images I1 and I2 .

In the equations above, I1 and I2 represent the real and generated images,

respectively, for FID and SSIM calculations. For PSNR, I1 is the real image, and I2

is the generated (reconstructed) image.


Chapter 7

Results

7.1 Result of CityScape Dataset

Figure 7.1: Results of Cloudy to Sunny translation Column 1 is Cloudy input image Column

2 is patches of input image and Column 3 is Sunny output image

26
Figure 7.2: Results of Sunny to Cloudy translation Column 1 is Sunny input image Column

2 is patches of input image and Column 3 is Cloudy output image


Table 7.1: Quantitative evaluation with FID metric, SSIM index and PSNR for cityscape dataset

Cityscape FID SSIM PSNR

Sunny to Cloudy 86.35 0.23 11.72

Cloudy to Sunny 85.68 0.22 11.91

7.2 Result of ImageNet Dataset

Figure 7.3: Results of Cloudy to Sunny translation Column 1 is Cloudy input image Column

2 is patches of input image and Column 3 is Sunny output image


Figure 7.4: Results of Sunny to Cloudy translation Column 1 is Sunny input image Column

2 is patches of input image and Column 3 is Cloudy output image


Table 7.2: Quantitative evaluation with FID metric, SSIM index and PSNR for ImageNet dataset

ImageNet FID SSIM PSNR

Sunny to Cloudy 111.66 0.04 8.05

Cloudy to Sunny 136.04 0.07 8.28

7.3 Result of Soybean Dataset

Figure 7.5: Results of Cloudy to Sunny translation Column 1 is Cloudy input image Column

2 is patches of input image and Column 3 is Sunny output image


Figure 7.6: Results of Sunny to Cloudy translation Column 1 is Sunny input image Column

2 is patches of input image and Column 3 is Cloudy output image


Table 7.3: Quantitative evaluation with FID metric, SSIM index and PSNR for Soybean dataset

Soybean FID SSIM PSNR

Sunny to Cloudy 104.81 0.17 9.74

Cloudy to Sunny 100.66 0.14 9.76

Table 7.4: Quantitative evaluation on Cityscape Dataset

Experiment CycleGAN UNIT ViT

CIS CIS CIS

Sunny to Cloudy 0.014 0.081 0.293

Cloudy to Sunny 0.090 0.219 0.189


Chapter 8

Conclusion

In this I explored the application of Vision Transformer (ViT) for unpaired image

translation, specifically focusing on three diverse datasets: ImageNet, Cityscape, and

Soybean Crop. We set out to investigate the effectiveness of the ViT model on these

datasets and to determine its performance in comparison to existing methods, particularly

on the benchmark dataset Cityscape. Moreover, we aimed to assess the applicability of the

ViT model on previously unexplored datasets, such as agricultural imagery represented

by the Soybean Crop dataset.

The results obtained from our experiments showcased the remarkable potential of the

Vision Transformer for unpaired image translation tasks. The model achieved exceptional

performance on the Cityscape dataset, surpassing the results achieved by previously im-

plemented methods. This significant improvement demonstrates the effectiveness of the

ViT approach and its capability to tackle complex and diverse urban scene images present

in the Cityscape dataset. The success of ViT on Cityscape also makes it a promising can-

didate for addressing similar challenges in other urban-centric datasets like ImageNet.

Furthermore, the application of the Vision Transformer on the Soybean Crop dataset

revealed promising outcomes. To the best of our knowledge, this is the first instance of

image translation being applied to an agricultural dataset. While the results on Soybean

33
Crop might not have reached the levels of the Cityscape dataset, it lays the groundwork

for potential future improvements and advancements in applying ViT to agriculture-

specific image translation tasks. This pioneering effort in the agriculture domain opens

up exciting possibilities for further research and development in this field.

Considering the overall performance and generalizability of the Vision Transformer, we

can confidently conclude that its effectiveness on the benchmark Cityscape dataset serves

as an encouraging indication of its potential for other datasets like ImageNet and Soybean

Crop. The ViT model’s ability to understand and adapt to diverse visual data makes it

a versatile and robust choice for unpaired image translation tasks across various domains.
Chapter 9

Timeline

Table 9.1: Action Plan

Sr.No. Timeline Activity

1. Analysis of various publicly available crop image dataset

2. Literature Review, Problem statement, and objectives


1 Phase 1 Presentation

3. Study of I2I translation and working vision transformer

1. Study of transformer encoder

2 Phase 2 Presentation 2. Implementation of image patches and Patch embeddings

1. Implementation of vision transformer on cityscapes, imagenet and

3 Phase 3 Presentation soybean dataset

1. Result and Conclusion

4 Phase 4 Presentation 2. Research Paper

35
9.1 Gantt Chart

Figure 9.1: Gantt Chart


Chapter 10

Publication Details

The 14th International Conference on Computing, Communication and Networking

Technologies (ICCCNT) is a premier conference. This conference is highly esteemed for

its research in the fields of computing, communication and networking, which have been

proven to have a wealth of applications.

This event is organized annually with the intention of providing an excellent platform

for leading academics, researchers, industrial participants and students to share their re-

search findings with renowned experts.

1. Paper Title:A Comprehensive Survey of Weed Detection and Classification Datasets

for Precision Agriculture

2. Name of the Conference: International Conference on Computing, Communication

and Networking Technologies (ICCCNT)

3. Organized By: IIT - Delhi, Delhi, India

4. Paper Accepted / Rejected: Accepted

5. Corrective Actions: None

37
Bibliography

[1] Yiğit Gündüç. Tensor-to-image: Image-to-image translation with vision transformers.

arXiv preprint arXiv:2110.08037, 2021.

[2] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image trans-

lation with conditional adversarial networks. In Proceedings of the IEEE conference

on computer vision and pattern recognition, pages 1125–1134, 2017.

[3] Soohyun Kim, Jongbeom Baek, Jihye Park, Gyeongnyeon Kim, and Seungryong Kim.

Instaformer: Instance-aware image-to-image translation with transformer. In Pro-

ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,

pages 18321–18331, 2022.

[4] Maria Eloisa Mignoni, Aislan Honorato, Rafael Kunst, Rodrigo Righi, and Angélica

Massuquetti. Soybean images dataset for caterpillar and diabrotica speciosa pest

detection and classification. Data in Brief, 40:107756, 2022.

[5] Haseeb Nazki, Sook Yoon, Alvaro Fuentes, and Dong Sun Park. Unsupervised image

translation using adversarial networks for improved plant disease recognition. Com-

puters and Electronics in Agriculture, 168:105117, 2020.

[6] Yingxue Pang, Jianxin Lin, Tao Qin, and Zhibo Chen. Image-to-image translation:

Methods and applications. IEEE Transactions on Multimedia, 24:3859–3881, 2021.

38
[7] Wanfeng Zheng, Qiang Li, Guoxin Zhang, Pengfei Wan, and Zhongyuan Wang.

Ittr: Unpaired image-to-image translation with transformers. arXiv preprint

arXiv:2203.16015, 2022.

You might also like