Optimisation of Semantic Segmentation Algorithm For Autonomous Driving Using U-NET Architecture
Corresponding Author:
Mrinal R. Bachute
Symbiosis Centre for Applied Artificial Intelligence, Symbiosis International (Deemed University)
Pune, India
Email: mrinal.bachute@sitpune.edu.in
1. INTRODUCTION
Software simulation can expedite the research process and reduce the cost of research in autonomous
urban driving. This paper uses the car learning to act (CARLA) simulator platform to generate a data set of
images. CARLA is an open-source simulation platform built on Unreal Engine 4 for research in
autonomous driving [1]. CARLA was developed from the ground up to support the development, training, and
validation of autonomous driving systems in urban environments. The CARLA simulator operates as a server-client system, as shown
in Figure 1. The platform supports flexible sensor suites and environmental conditions [2].
CARLA permits flexible configuration of the agent and its sensors. It provides authentically designed
towns with pedestrians and road traffic [3]. Town 1 is intended for training, while Town 2 is used solely for
testing. CARLA also reports additional statistics such as distance travelled, collisions, and infractions such as
drifting onto the opposite lane or onto a sidewalk [4]. CARLA thus provides a means for developing and training autonomous
systems and evaluating them in controlled scenarios [5]. In this work, we use the CARLA simulator software
platform to spawn a vehicle with sensors in the town and gather semantic segmentation data.
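As an illustration of this server-client arrangement, the following minimal Python sketch connects a client to a locally running CARLA server, queries the loaded town, and spawns a vehicle; the host, port, timeout, and blueprint choice are the CARLA defaults or illustrative values, not the exact settings used in the experiments reported here.

```python
import carla  # CARLA Python API client library

# The simulator acts as the server; this script is the client.
# Host and port below are the CARLA defaults.
client = carla.Client("localhost", 2000)
client.set_timeout(10.0)          # seconds to wait for the server
world = client.get_world()        # handle to the currently loaded town
print(world.get_map().name)       # e.g. "Town01"

# Spawn an ego vehicle at one of the map's predefined spawn points.
blueprint = world.get_blueprint_library().filter("vehicle.*")[0]
spawn_point = world.get_map().get_spawn_points()[0]
vehicle = world.spawn_actor(blueprint, spawn_point)
vehicle.set_autopilot(True)       # let the traffic manager drive the vehicle
```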
To identify the different objects seen by an autonomous driving system, semantic
segmentation is needed to differentiate the various categories or classes of objects such as trees, cars, and
pedestrians. Semantic segmentation partitions an image into meaningful segments by predicting, for every pixel,
the class to which it belongs. The task is challenging because of scene complexity, complicated object
boundaries, and small objects.
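The per-pixel nature of the task can be illustrated with a small NumPy sketch: given hypothetical per-pixel class probabilities (for example, the softmax output of a segmentation network), the predicted label map is obtained by taking the most probable class at every pixel. All shapes and values below are illustrative.

```python
import numpy as np

# Hypothetical model output for one image: per-pixel class probabilities
# of shape (height, width, num_classes), e.g. from a softmax layer.
height, width, num_classes = 4, 4, 3
probabilities = np.random.rand(height, width, num_classes)
probabilities /= probabilities.sum(axis=-1, keepdims=True)

# Semantic segmentation assigns each pixel the class with the highest
# probability, producing a (height, width) label map.
label_map = np.argmax(probabilities, axis=-1)
print(label_map.shape)  # (4, 4); each entry is a class index in [0, num_classes)
```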
Many deep-learning models exist for semantic segmentation, which forms part of the perception task of
the autonomous driving system [7]. Some of these models, including the U-Net, PSPNet,
DeepLab, PPANet, and SegNet architectures, are discussed below. This paper focuses on the U-Net
model for semantic segmentation on the CARLA image data set and presents a comparative study of optimisation
techniques by building and testing the models in Keras.
2. RELATED WORK
The advanced semantic segmentation system consists of three essential components: i) a fully
convolutional network, first introduced by replacing the final fully connected layers with
convolution layers to enable efficient end-to-end learning and to accept inputs of arbitrary
size; ii) conditional random fields, which capture local and long-range dependencies in a scene to refine the
estimated map; and iii) dilated convolution (or atrous convolution), which increases the resolution of
intermediate feature maps to produce more precise estimates while maintaining the same computational cost [8].
Semantic segmentation methods can be categorised into two-stage and one-stage pipelines. In the two-stage pipeline,
region proposals are generated first and then refined, primarily for instance-level segmentation. The one-stage
pipeline, based on a fully convolutional network, is the other standard semantic segmentation method. Image
semantic scene segmentation of single images is achieved with both supervised and unsupervised
techniques. Different approaches are used for semantic segmentation. The region-based semantic segmentation
approach first extracts free-form regions from an image and describes them, followed by region-based
classification [9]. In this approach, concatenating whole-region features with foreground features as the region
representation improves results. Drawbacks of this approach are that the features are not well suited to the
segmentation task, more spatial information is needed to generate exact boundaries, and it is time
consuming, which can significantly affect the final performance. A fully convolutional network is another
approach to semantic segmentation. In this approach, the fully convolutional network learns a pixel-to-pixel
mapping without extracting region proposals. Since a fully convolutional network contains only convolution
and pooling layers, it has the merit of making predictions on arbitrarily sized inputs. Its drawback is that the
direct predictions are typically of low resolution. Semantic segmentation plays an essential role in critical
tasks of self-driving technology. For example, during the path planning phase, it is crucial to understand
an object's pose, size, and distance and decide whether to apply the brakes or negotiate the obstacle.
Many semantic segmentation model architectures exist; some of them are discussed in the following
sections.
The U-Net has no fully connected layers and uses only the valid part of each convolution, i.e., the
segmentation map contains only pixels for which the full context is available in the input image. U-Net combines this
network with a training strategy that relies on data augmentation to make efficient use of the available annotated samples
[10]. The contracting path follows the typical architecture of a convolutional network. It consists of the repeated
application of two 3×3 convolutions, each followed by a rectified linear unit (ReLU), and a 2×2
max-pooling operation with stride 2 for down-sampling. At every down-sampling step, the number of feature
channels is doubled. Each step of the expansive path consists of an up-sampling of the feature map followed
by a 2×2 convolution that halves the number of feature channels, a concatenation with the correspondingly cropped
feature map from the contracting path, and two 3×3 convolutions, each followed by a
ReLU. Cropping is necessary because edge pixels are lost in every convolution. At the final layer, a 1×1 convolution maps each
64-component feature vector to the desired number of classes. Overall, 23 convolutional layers
are included in the network. The U-Net network's advantages include high accuracy given a suitable dataset
and training time, and, being fully convolutional, it does not depend on a specific input size. As a
shortcoming, the input dimensions must be compatible with the pooling structure of the network, and its
many layers increase the training time [13].
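The following Keras sketch illustrates these U-Net building blocks with only two resolution levels. It assumes "same" padding, so no cropping of the skip connections is needed (the original U-Net uses unpadded convolutions and therefore crops), and the input shape, number of classes, and filter counts are illustrative rather than the exact configuration used in this paper.

```python
from tensorflow.keras import layers, Model

def conv_block(x, filters):
    # Two 3x3 convolutions, each followed by ReLU, as in the U-Net design.
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return x

def build_unet(input_shape=(256, 256, 3), num_classes=13, base_filters=16):
    inputs = layers.Input(input_shape)

    # Contracting path: feature channels double at every down-sampling step.
    c1 = conv_block(inputs, base_filters)
    p1 = layers.MaxPooling2D(2)(c1)
    c2 = conv_block(p1, base_filters * 2)
    p2 = layers.MaxPooling2D(2)(c2)

    b = conv_block(p2, base_filters * 4)  # bottleneck

    # Expansive path: up-sample, halve the channels, and concatenate the
    # corresponding contracting-path feature map (skip connection).
    u2 = layers.Conv2DTranspose(base_filters * 2, 2, strides=2, padding="same")(b)
    c3 = conv_block(layers.concatenate([u2, c2]), base_filters * 2)
    u1 = layers.Conv2DTranspose(base_filters, 2, strides=2, padding="same")(c3)
    c4 = conv_block(layers.concatenate([u1, c1]), base_filters)

    # Final 1x1 convolution maps each feature vector to the class scores.
    outputs = layers.Conv2D(num_classes, 1, activation="softmax")(c4)
    return Model(inputs, outputs)
```

Deeper variants simply repeat the contracting and expansive blocks, doubling and halving the filter counts at each level.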
Another variant of U-Net, termed Smoke-UNet, is built on the U-Net architecture and uses an
attention mechanism together with residual blocks [14]. The residual block improves the feature learning ability
of the network. U-Net has been applied successfully in many medical image analysis applications; one variant
uses two different encoders and a single common decoder [15]. Ghost-Net can be combined with U-Net to form a new model
architecture called Ghost-UNet, which combines low-level spatial information with high-level feature
maps [16]. Another version of U-Net, the more-residual (MR) U-Net, is used for commodity
segmentation; it enhances the model's ability to extract details of goods and uses bilinear
interpolation instead of deconvolution to reduce the checker-board effect [17]. The concatenated
residual attention (CRA) U-Net combines a residual structure with a channel attention mechanism [18]; its
convolutional block channel attention (CBCA) module enhances the extraction of deep convolutional features
and replaces the skip connections of the U-Net model. The faster heterogeneous image (FHI)-U-Net uses an
encoder and decoder for faster heterogeneous image segmentation. This structure has fewer layers,
resulting in fewer parameters and higher inference speed, and FHI-U-Net performs autonomous feature extraction
on the images [19]. UNet++ contains U-Nets of varying depths whose decoders are connected at the same resolution
by redesigned skip connections. Because UNet++ embeds U-Nets of varying depth, it is not sensitive to the
choice of network depth [20]. The Swin transformer boosted U-Net (ST-UNet) incorporates a Swin transformer into a
convolutional neural network (CNN) based U-Net and consists of a novel twin encoder structure of Swin
transformer and CNN [21]. Eff-UNet uses EfficientNet as the encoder for feature
extraction and the U-Net decoder for reconstructing the segmentation map. This architecture exploits high-level
and low-level features along with spatial information for segmentation [22]. UNet++ can also be used for mapping the cattle
rib-eye area; the UNet++ architecture can thus segment cattle rib-eye areas in ultrasound images
[23]. U-Net can likewise enhance the UV map completion generative adversarial network (UV-GAN):
each skip connection is boosted by an attention gate, which suppresses irrelevant low-level information
from the encoders [24].
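As a hedged sketch of the attention-gating idea used to boost skip connections, the following Keras function implements a common additive attention gate; it assumes the gating signal has already been resized to the spatial size of the skip connection, and the exact formulation used in [24] may differ.

```python
from tensorflow.keras import layers

def attention_gate(x, g, inter_channels):
    """Additive attention gate, as used in attention U-Net variants.

    x: encoder skip-connection feature map.
    g: gating signal from the decoder, assumed here to already have the
       same spatial size as x.
    """
    theta_x = layers.Conv2D(inter_channels, 1)(x)
    phi_g = layers.Conv2D(inter_channels, 1)(g)
    # Additive attention: combine, apply a non-linearity, then produce a
    # per-pixel weight in [0, 1].
    f = layers.Activation("relu")(layers.add([theta_x, phi_g]))
    psi = layers.Conv2D(1, 1, activation="sigmoid")(f)
    # Suppress irrelevant low-level responses in the skip connection.
    return layers.multiply([x, psi])
```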
Figure 3. PSP-Net architecture, (a) input image, (b) feature map, (c) pyramid pooling module, and (d) final
prediction [25]
These architectures balance runtime and accuracy within an encoder-decoder [13] structure. DeepLabv3+ extends DeepLabv3 by adding
a simple yet effective decoder module to improve segmentation, mainly at object boundaries. DeepLabv3+
obtains rich semantic information from the encoder module, whereas the simple yet effective decoder module
recovers precise object boundaries. The encoder module allows features to be extracted at an arbitrary resolution
using atrous convolution [26]. The DeepLab V3+ semantic segmentation model provides segmentation frames for
model input in the natural environment.
The DeepLab model is shown in Figure 5. A visual geometry group (VGG)-16 or ResNet-101 network, which
is trained for the task of image classification, is adapted for semantic segmentation by i) converting
all the fully connected layers to convolutional layers and ii) increasing the feature resolution through atrous convolutional
layers, allowing feature responses to be computed every 8 pixels instead of every 32 pixels as in the original
network. Bilinear interpolation is then used to up-sample the score map by a factor of eight to recover the original
image resolution, and the result is passed as input to a CRF to refine the segmentation. The three critical merits
of the DeepLab V1 system are speed, owing to atrous convolution, accuracy, and simplicity.
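The effect of atrous convolution and the subsequent bilinear up-sampling can be sketched in Keras as follows; the feature-map size, class count, and dilation rate are illustrative, and the CRF refinement step is omitted.

```python
import tensorflow as tf
from tensorflow.keras import layers

features = tf.random.normal([1, 32, 32, 64])  # a coarse feature map (stride 8)

# Atrous (dilated) 3x3 convolution: dilation_rate=2 enlarges the receptive
# field while keeping the feature-map resolution and parameter count fixed.
scores = layers.Conv2D(filters=19, kernel_size=3, padding="same",
                       dilation_rate=2)(features)

# Bilinear up-sampling of the coarse score map by a factor of 8 recovers the
# original image resolution, as in DeepLab v1/v2 (CRF refinement omitted).
full_res = layers.UpSampling2D(size=8, interpolation="bilinear")(scores)
print(full_res.shape)  # (1, 256, 256, 19)
```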
DeepLabv3 enhances the atrous spatial pyramid pooling module. The CRF post-processing step initially used
in DeepLabv1 and DeepLabv2 is dropped in DeepLabv3. DeepLabv3 uses atrous convolution
with up-sampled filters to obtain dense feature maps and capture long-range context. In DeepLabv3, the
cascaded module progressively doubles the atrous rates to encode multiscale information, while the atrous
spatial pyramid pooling module, augmented with image-level features, probes the features using filters at
various sampling rates and effective fields of view. In DeepLabv3, the ground truths are kept intact. DeepLabv3
significantly improves over the earlier DeepLab versions without CRF post-processing and attains
comparable performance.
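A minimal Keras sketch of an ASPP-style module with image-level features is shown below; the number of filters and the atrous rates (6, 12, 18) follow common DeepLabv3 settings but are illustrative, and a fixed, known input size is assumed so that the pooled branch can be up-sampled.

```python
from tensorflow.keras import layers

def aspp(x, filters=256, rates=(6, 12, 18)):
    # Parallel atrous convolutions at several rates capture multi-scale
    # context; a 1x1 branch and an image-level pooling branch are added,
    # then everything is fused by a final 1x1 convolution (DeepLabv3-style).
    branches = [layers.Conv2D(filters, 1, padding="same", activation="relu")(x)]
    for rate in rates:
        branches.append(layers.Conv2D(filters, 3, padding="same",
                                      dilation_rate=rate, activation="relu")(x))
    # Image-level features: global average pool, 1x1 conv, up-sample back.
    # A fixed, known spatial size is assumed here.
    pooled = layers.GlobalAveragePooling2D()(x)
    pooled = layers.Reshape((1, 1, x.shape[-1]))(pooled)
    pooled = layers.Conv2D(filters, 1, activation="relu")(pooled)
    pooled = layers.UpSampling2D(size=(x.shape[1], x.shape[2]),
                                 interpolation="bilinear")(pooled)
    branches.append(pooled)
    merged = layers.concatenate(branches)
    return layers.Conv2D(filters, 1, padding="same", activation="relu")(merged)

# Illustrative usage on a stride-16 feature map of a fixed-size input:
features = layers.Input((32, 32, 512))
context = aspp(features)
```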
DeepLabv3+ extends DeepLabv3 by adding a simple yet effective decoder module that improves
segmentation, particularly at object boundaries. DeepLabv3+ obtains rich semantic information
from the encoder module, whereas the simple yet effective decoder module recovers the precise object
boundaries, and the encoder module allows features to be extracted at an arbitrary resolution using atrous convolution.
At the output of DeepLabv3, the enriched semantic information is encoded by atrous convolution, which allows
the density of the encoder features to be controlled according to the available computational budget.
Moreover, the decoder module enables detailed recovery of object boundaries. DeepLabv3+ adds the
decoder module on top of this output. For the decoder, three different design options are
considered, namely: i) the 1×1 convolution used to reduce the channels of the low-level feature map from
the encoder, ii) the 3×3 convolutions used to obtain sharper segmentation results, and iii) which
low-level encoder features should be used. DeepLabv3+ can also semantically segment synthetic aperture radar
(SAR) images [28]. A DeepLabv3+ model with a cross-attention mechanism can overcome the deficiency in
accurately extracting the intended features at image edges. Similarly, it models the relationships among the
local features of a large-scale target to preserve the integrity of large-scale target segmentation.
The module allows features to be extracted at an arbitrary resolution by using atrous convolution.
DeepLabv3+ extends DeepLabv3 by adding a simple yet effective decoder for recovering object
boundaries: the enriched semantic information is encoded at the end of DeepLabv3 by atrous convolution, allowing
one to regulate the density of the encoder features depending on the computational budget, while the decoder
module enables detailed object boundary recovery.
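The decoder design options listed above can be sketched in Keras as follows; the channel counts (48 low-level channels, 256 refinement filters) follow common DeepLabv3+ settings, and the stride-16 encoder output and stride-4 low-level features are assumptions about the surrounding network rather than details taken from this paper.

```python
from tensorflow.keras import layers

def deeplabv3plus_decoder(encoder_out, low_level_feat, num_classes=19):
    # (i) A 1x1 convolution reduces the channels of the low-level features so
    # they do not outweigh the rich encoder features after concatenation.
    low = layers.Conv2D(48, 1, padding="same", activation="relu")(low_level_feat)

    # Up-sample the encoder output by 4 to match the low-level resolution.
    x = layers.UpSampling2D(size=4, interpolation="bilinear")(encoder_out)
    x = layers.concatenate([x, low])

    # (ii) 3x3 convolutions refine the merged features for sharper boundaries.
    x = layers.Conv2D(256, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(256, 3, padding="same", activation="relu")(x)

    # Final classifier and up-sampling by 4 back to the input resolution.
    x = layers.Conv2D(num_classes, 1)(x)
    return layers.UpSampling2D(size=4, interpolation="bilinear")(x)

# Illustrative usage with fixed-size feature maps:
enc = layers.Input((32, 32, 256))    # stride-16 ASPP output
low = layers.Input((128, 128, 256))  # stride-4 low-level features
seg = deeplabv3plus_decoder(enc, low)
```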
The point-wise pyramid attention (PPA) module illustrated in Figure 8 effectively gathers global context information. The PPA
comprises two blocks: a non-local block and vortex pooling. The non-local block produces dense
pixel-wise weights and extracts long-range dependencies. The vortex atrous convolution helps detect objects at
various scales. The advantages of the two modules are combined in a single PPA module by applying the vortex
pooling alongside the non-local dependency. Three parallel vortex atrous convolution blocks with dilation rates
of 3, 6, and 9 and one non-atrous convolution block are included in the PPA. The PPA is used to exploit
contextual information effectively.
The PPA module enlarges the receptive field and provides rich context information. However,
a drawback is the limited-memory problem, caused by the repeated down-sampling and up-sampling and the
storage of parameter values at every step. It is also challenging to preserve local details because the whole image is
fed into the network.
The encoder network contains thirteen convolutional layers corresponding to the first convolutional layers of the VGG16
network designed for object classification. Every encoder layer has a
corresponding decoder layer; thus, the decoder network also has thirteen layers, and to yield class probabilities
for every pixel independently, the final decoder output is fed to a multiclass soft-max classifier. Each encoder
in the network performs convolution with a filter bank to produce a set of feature maps.
Afterwards, batch normalisation is applied, followed by an element-wise ReLU, max(0, x). Following that, max-
pooling with a 2×2 window and stride 2 (non-overlapping windows) is performed, and the resulting
output is sub-sampled by a factor of 2. Max-pooling achieves translation invariance over small
spatial shifts in the input image, and the sub-sampling results in a large input-image context for every
pixel in the feature map.
In the decoder network, the appropriate decoder up-samples its input feature map using the memorised
max-pooling indices from the corresponding encoder feature map. The result of this stage is a sparse
feature map. These feature maps are convolved with a decoder filter bank to create dense feature maps. A batch
normalisation step is then applied to each map. The decoder corresponding to the first encoder produces a
feature map with many channels, even though its encoder input has three channels (red, green, and blue);
this is unlike the other decoders in the network, which produce feature maps with the same size and number of channels as
their encoder inputs. The high-dimensional feature representation at the output of the
final decoder is fed to a soft-max classifier, which classifies every pixel independently. The output of the
soft-max classifier is a K-channel image of probabilities, where K is the number of classes. The predicted
segmentation corresponds to the class with maximum probability at every pixel. Because only the max-pooling indices
of the feature maps are stored, SegNet is memory-efficient.
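The index-based up-sampling that makes SegNet memory-efficient can be sketched in TensorFlow as follows; this is a minimal illustration using tf.nn.max_pool_with_argmax and a scatter operation, not the authors' implementation, and the tensor sizes are illustrative.

```python
import tensorflow as tf

def segnet_pool(x):
    # Encoder side: 2x2 max-pooling with stride 2; only the argmax indices
    # are kept, which is what makes SegNet memory-efficient.
    pooled, indices = tf.nn.max_pool_with_argmax(
        x, ksize=2, strides=2, padding="SAME", include_batch_in_index=True)
    return pooled, tf.cast(indices, tf.int32)

def segnet_unpool(pooled, indices, output_shape):
    # Decoder side: scatter the pooled values back to the positions recorded
    # by the encoder, giving a sparse map that decoder convolutions densify.
    flat_size = tf.reduce_prod(output_shape)
    unpooled = tf.scatter_nd(indices=tf.reshape(indices, [-1, 1]),
                             updates=tf.reshape(pooled, [-1]),
                             shape=tf.reshape(flat_size, [1]))
    return tf.reshape(unpooled, output_shape)

x = tf.random.normal([1, 8, 8, 4])
pooled, idx = segnet_pool(x)
restored = segnet_unpool(pooled, idx, tf.shape(x))  # sparse (1, 8, 8, 4) map
```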
3. METHOD
The methodology involves two stages: the CARLA simulation to generate the data set and the U-Net implementation
with different optimisation techniques. First, the client is implemented in Python with the sensor placed at the
appropriate coordinates on the vehicle. The vehicle is spawned once the client runs. During the simulation, the
captured images are stored in a local directory and serve as the dataset for the algorithm implementation. The
methodology diagram is shown in Figure 10.
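A minimal sketch of such a client is given below; it assumes a CARLA server is already running on the default port, and the vehicle blueprint, camera mounting position, image size, and output directory are illustrative choices rather than the exact settings used to generate the data set.

```python
import carla

client = carla.Client("localhost", 2000)
client.set_timeout(10.0)
world = client.get_world()

# Spawn the ego vehicle (as in the earlier sketch).
vehicle_bp = world.get_blueprint_library().filter("vehicle.*")[0]
vehicle = world.spawn_actor(vehicle_bp, world.get_map().get_spawn_points()[0])
vehicle.set_autopilot(True)

# Attach a semantic segmentation camera at an illustrative mounting point.
cam_bp = world.get_blueprint_library().find("sensor.camera.semantic_segmentation")
cam_bp.set_attribute("image_size_x", "800")
cam_bp.set_attribute("image_size_y", "600")
camera = world.spawn_actor(
    cam_bp, carla.Transform(carla.Location(x=1.5, z=2.4)), attach_to=vehicle)

def save_frame(image):
    # Each captured frame is written to a local directory and forms the
    # dataset; the CityScapes palette colour-codes the semantic classes.
    # image.frame is the frame number in recent CARLA versions.
    image.save_to_disk("dataset/%06d.png" % image.frame,
                       carla.ColorConverter.CityScapesPalette)

camera.listen(save_frame)
```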
Figure 12. CARLA simulation-camera view
Figure 13. CARLA simulation-dataset creation
The ensemble combines three trained models: M1, the U-Net architecture without batch normalisation; M2, the U-Net
architecture with batch normalisation; and M3, the U-Net architecture with reduced filters. The weights associated with M1, M2, and M3 are W1,
W2, and W3, respectively. These fixed weights are multiplied by the predictions made by the corresponding models and used
in the weighted average prediction calculation. Figure 16 represents the weighted average ensemble modelling technique; a
sketch of this ensembling is given after the figures below.
Figure 15. Average ensemble modelling technique
Figure 16. Weighted average ensemble modelling technique
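A minimal sketch of both ensembling techniques is given below: with equal weights it reduces to the average ensemble, and with fixed unequal weights it implements the weighted average ensemble. The model objects and weight values are placeholders.

```python
import numpy as np

def weighted_average_ensemble(models, weights, images):
    # Each model's softmax prediction is scaled by its fixed weight and the
    # scaled predictions are summed, giving a weighted average prediction;
    # equal weights (e.g. 1/3 each) reproduce the plain average ensemble.
    predictions = np.array([model.predict(images) for model in models])
    weighted = np.tensordot(np.asarray(weights, dtype=float), predictions, axes=1)
    # Final per-pixel label: the class with the highest averaged probability.
    return np.argmax(weighted, axis=-1)

# Illustrative usage (m1, m2, m3 are the trained U-Net variants; the weights
# W1, W2, W3 are placeholders that should sum to 1):
# labels = weighted_average_ensemble([m1, m2, m3], [w1, w2, w3], test_images)
```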
− No batch normalisation: the training loss reduces from 3.8962 to 0.1151 after 100 epochs. The validation
loss reduces from 2.1547 to 0.4084 after 100 epochs. The graphs for training and validation losses for 100
epochs are shown in Figure 17. The training accuracy increased from 36.27% to 95.71% after 100 epochs.
The validation accuracy rose from 39.52% to 90.28% after 100 epochs. The graphs for training and
validation accuracy increase are shown in Figure 18. Figure 19(a) shows the test image, Figure 19(b) shows
the ground truth label, and Figure 19(c) shows the prediction on the test image after the model is trained
for 100 epochs. The mean IoU observed was 0.46047738 after 100 epochs. The dice metric was 0.8832081,
and the dice_coefficient_loss was 0.116791904 (a sketch of how these metrics are computed is given after the results, before the conclusion).
Figure 17. Training and validation loss after 100 epochs without batch normalization technique
Figure 18. Training and validation accuracy after 100 epochs without batch normalization technique
Figure 19. Comparison of (a) test image, (b) ground truth, and (c) predicted image for without batch
normalization technique
− With batch normalisation, the training loss was reduced from 1.2546 to 0.0807 after 100 epochs. The
validation loss reduces from 5.974 to 0.3011 after 100 epochs. The graphs for training and validation losses
are shown in Figure 20. The training accuracy increased from 65.33% to 96.89% after 100 epochs. The
validation accuracy rose from 39.52% to 91.61% after 100 epochs. The graphs for training and validation
accuracy increase are shown in Figure 21. Figure 22(a) shows the test image, Figure 22(b) shows the ground
truth label, and Figure 22(c) shows the prediction on the test image after the model is trained for 100 epochs.
Figure 20. Training and validation loss after 100 epochs with batch normalization technique
Figure 21. Training and validation accuracy after 100 epochs with batch normalization technique
Figure 22. Comparison of (a) test image, (b) ground truth, and (c) predicted image with batch normalization
technique
The mean IoU observed was 0.48857138 after 100 epochs. The dice metric was 0.9041302, and the
dice_coefficient_loss was 0.09586978.
− Filter reduction: the training loss reduced from 2.1707 to 0.1299 after 100 epochs. The validation loss
reduced from 1.7368 to 0.3923 after 100 epochs. The graphs for training and validation losses are shown
in Figure 23. The training accuracy increased from 30.22% to 95.26% after 100 epochs. The validation
accuracy rose from 37.32% to 89.80% after 100 epochs. The graphs for training and validation accuracy
increase are shown in Figure 24. Figure 25(a) shows the test image, Figure 25(b) shows the ground truth
label, and Figure 25(c) shows the prediction on the test image after the model is trained for 100 epochs.
The mean IoU observed was 0.4433159 after 100 epochs. The dice metric was 0.88062567, and the dice
coefficient loss was 0.119374335.
− Average ensemble: the training loss reduced from 0.1145 to 0.0339 after 100 epochs. The validation loss
increases from 0.2686 to 0.4325 after 100 epochs. The graphs for training and validation losses are shown
in Figure 26. The training accuracy rose from 95.74% to 98.44% after 100 epochs. The validation accuracy
increased from 91.69% to 92.10% after 100 epochs. The graphs for training and validation accuracy
increase are shown in Figure 27.
Figure 23. Training and validation loss after 100 epochs (filter reduction technique)
Figure 24. Training and validation accuracy after 100 epochs (filter reduction technique)
Figure 25. Comparison of (a) test image, (b) ground truth, and (c) predicted image for filter reduction
technique
Figure 26. Training and validation loss after 100 epochs (average ensemble technique)
Figure 27. Training and validation accuracy after 100 epochs (average ensemble technique)
Figure 28(a) shows the test image, Figure 28(b) shows the ground truth label, and Figure 28(c) shows
the prediction on the test image after the model is trained for 100 epochs. The mean IoU observed was
0.5070416 after 100 epochs. The dice metric was 0.9050604, and the dice_coefficient_loss was 0.09493959.
Figure 28. Comparison of (a) test image, (b) ground truth, and (c) predicted image for average ensemble
technique
− Weighted average ensemble: since model 2 has the highest accuracy, it is assigned a weight of 0.6; the next
highest accuracy is for model 3, which is given a weight of 0.3, and model 1 is assigned a weight of 0.1. For
this combination of weights, the weighted average ensemble model [32] training loss reduced from 0.1038
to 0.0219 after 100 epochs. The validation loss increases from 0.2557 to 0.4315 after 100 epochs. The
graphs for training and validation losses are shown in Figure 29. The training accuracy rose from 96.32%
to 99.39% after 100 epochs. The validation accuracy increased from 91.39% to 91.90% after 100 epochs.
The graphs for training and validation accuracy increase are shown in Figure 30.
Figure 29. Training and validation loss after 100 epochs (weighted average ensemble technique)
Figure 30. Training and validation accuracy after 100 epochs (weighted average ensemble technique)
Figure 31(a) shows the test image, Figure 31(b) shows the ground truth label, and Figure 31(c) shows
the prediction on the test image after the model is trained for 100 epochs. The mean IoU observed was
0.5254637 after 100 epochs. The dice metric was 0.91319996, and the dice_coefficient_loss was 0.08680004.
The comparison of results for the optimisation techniques, i) without batch normalisation, ii)
with batch normalisation, iii) with reduced filters, iv) with the average ensemble, and v) with the weighted average ensemble,
on the CARLA dataset after 100 epochs is shown in Table 2.
Figure 31. Comparison of (a) test image, (b) ground truth, and (c) predicted image for weighted average
ensemble technique
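Since the loss values reported above always equal one minus the corresponding dice metric, the metrics can be sketched in Keras as follows; the smoothing constant and the number of classes are illustrative assumptions, not values stated in the paper.

```python
import tensorflow as tf
from tensorflow.keras import backend as K

def dice_coefficient(y_true, y_pred, smooth=1.0):
    # Dice metric over one-hot ground truth and softmax predictions; the
    # smoothing constant avoids division by zero on empty classes.
    y_true_f = K.flatten(y_true)
    y_pred_f = K.flatten(y_pred)
    intersection = K.sum(y_true_f * y_pred_f)
    return (2.0 * intersection + smooth) / (K.sum(y_true_f) + K.sum(y_pred_f) + smooth)

def dice_coefficient_loss(y_true, y_pred):
    # Matches the reported relationship: dice_coefficient_loss = 1 - dice metric.
    return 1.0 - dice_coefficient(y_true, y_pred)

# Mean IoU over class-index label maps; the class count is illustrative.
mean_iou = tf.keras.metrics.MeanIoU(num_classes=13)
# mean_iou.update_state(tf.argmax(y_true, -1), tf.argmax(y_pred, -1))
```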
5. CONCLUSION
The study evaluates different optimisation techniques for semantic segmentation intended for
autonomous driving using the U-Net model. The validation accuracy was lowest for the model in Sr. No. 3
and highest for the model in Sr. No. 5. The computation time was lowest for the model in Sr. No. 3 and highest
for the model in Sr. No. 5. There are only minor variations in class IoU across the different optimisation techniques.
The dice metric is highest for the model in Sr. No. 5 and lowest for the model in Sr. No. 3. From the study, it
can be concluded that the ensemble techniques help improve the model's accuracy at the cost of computation
time. We hope these findings on semantic segmentation using the U-Net architecture will benefit other researchers
in the autonomous driving ecosystem. Future work will consider additional experimentation and simulations
to optimise other semantic segmentation algorithms in the autonomous driving system. A similar experiment
on instance segmentation for an autonomous driving application can be planned using different
algorithms to extend this research. Other data sets can also be used for experimentation.
ACKNOWLEDGEMENTS
The authors would like to thank the staff of Symbiosis Institute of Technology Pune.
REFERENCES
[1] CARLA, “CARLA documentation,” CARLA Simulator. Accessed: Jul. 04, 2021. [Online]. Available:
https://carla.readthedocs.io/en/latest/#getting-started
[2] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, “CARLA: an open urban driving simulator,” in 1st Annual
Conference on Robot Learning, pp. 1-16, 2017.
[3] F. Codevilla, M. Miiller, A. Lopez, V. Koltun, and A. Dosovitskiy, “End-to-end driving via conditional imitation learning,” 2018
IEEE International Conference on Robotics and Automation (ICRA), Brisbane, QLD, Australia, pp. 4693–4700, 2018, doi:
10.1109/ICRA.2018.8460487.
[4] L. G. Cuenca, E. Puertas, J. F. Andrés, and N. Aliane, “Autonomous driving in roundabout maneuvers using reinforcement learning
with q-learning,” Electronics, vol. 8, no. 12, 2019, doi: 10.3390/electronics8121536.
[5] T. Buhet, E. Wirbel, and X. Perrotton, “Conditional vehicle trajectories prediction in carla urban environment,” 2019 International
Conference on Computer Vision Workshop, ICCVW 2019, pp. 2310–2319, 2019, doi: 10.1109/ICCVW.2019.00284.
[6] CARLA, “CARLA: open-source simulator for autonomous driving research.,” CARLA, 2022. [Online]. Available: https://carla.org//
[7] M. R. Bachute and J. M. Subhedar, “Autonomous driving architectures: insights of machine learning and deep learning algorithms,”
Machine Learning with Applications, vol. 6, 2021, doi: 10.1016/j.mlwa.2021.100164.
[8] X. Wang, Y. Qian, C. Wang, and M. Yang, “Map-enhanced ego-lane detection in the missing feature scenarios,” IEEE Access, vol.
8, pp. 107958–107968, 2020, doi: 10.1109/ACCESS.2020.3000777.
[9] Y. Guo, Y. Liu, T. Georgiou, and M. S. Lew, “A review of semantic segmentation using deep neural networks,” International
Journal of Multimedia Information Retrieval, vol. 7, no. 2, pp. 87–93, 2018, doi: 10.1007/s13735-017-0141-z.
[10] W. Weng and X. Zhu, “UNet: convolutional networks for biomedical image segmentation,” IEEE Access, vol. 9, pp. 16591–16603,
2021, doi: 10.1109/ACCESS.2021.3053408.
[11] O. Ronneberger, P. Fischer, and T. Brox, “Dental X-ray image segmentation using a U-shaped deep convolutional network,”
International Symposium on Biomedical Imaging, pp. 1–13, 2015.
[12] O Ronneberger, “U-Net: convolutional networks for biomedical image segmentation,” Vision: Pattern Recognition and Image
Processing, 2021. Accessed: Aug. 02, 2021. [Online]. Available: https://lmb.informatik.uni-freiburg.de/people/ronneber/u-net/
[13] A. Gurita and I. G. Mocanu, “Image segmentation using encoder-decoder with deformable convolutions,” Sensors, vol. 21, no. 5,
pp. 1–27, 2021, doi: 10.3390/s21051570.
[14] G. Men, G. He, and G. Wang, “Concatenated residual attention unet for semantic segmentation of urban green space,” Forests, vol.
12, no. 11, 2021, doi: 10.3390/f12111441.
[15] J. L. Arrastia et al., “Deeply supervised unet for semantic segmentation to assist dermatopathological assessment of basal cell
carcinoma,” Journal of Imaging, vol. 7, no. 4, 2021, doi: 10.3390/jimaging7040071.
[16] I. A. Kazerouni, G. Dooly, and D. Toal, “Ghost-UNet: an asymmetric encoder-decoder architecture for semantic segmentation from
scratch,” IEEE Access, vol. 9, pp. 97457–97465, 2021, doi: 10.1109/ACCESS.2021.3094925.
[17] Z. Wu, L. Zhao, and H. Zhang, “MR-UNet commodity semantic segmentation based on transfer learning,” IEEE Access, vol. 9, pp.
159447–159456, 2021, doi: 10.1109/ACCESS.2021.3130578.
[18] Q. Yang, T. Ku, and K. Hu, “Efficient attention pyramid network for semantic segmentation,” IEEE Access, vol. 9, pp. 18867–
18875, 2021, doi: 10.1109/ACCESS.2021.3053316.
[19] M. H. Sheu, S. M. S. Morsalin, S. H. Wang, L. K. Wei, S. C. Hsia, and C. Y. Chang, “FHI-Unet: faster heterogeneous images
semantic segmentation design and edge AI implementation for visible and thermal images processing,” IEEE Access, vol. 10, pp.
18596–18607, 2022, doi: 10.1109/ACCESS.2022.3151375.
[20] Z. Zhou, M. M. R. Siddiquee, N. Tajbakhsh, and J. Liang, “UNet++: redesigning skip connections to exploit multiscale features in
image segmentation,” IEEE Transactions on Medical Imaging, vol. 39, no. 6, pp. 1856–1867, 2020, doi:
10.1109/TMI.2019.2959609.
[21] X. He, Y. Zhou, J. Zhao, D. Zhang, R. Yao, and Y. Xue, “Swin transformer embedding UNet for remote sensing image semantic
segmentation,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, 2022, doi: 10.1109/TGRS.2022.3144165.
[22] B. Baheti, S. Innani, S. Gajre, and S. Talbar, “Eff-UNet: A novel architecture for semantic segmentation in unstructured
environment,” IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, vol. 2020-June, pp.
1473–1481, 2020, doi: 10.1109/CVPRW50498.2020.00187.
[23] M. J. D. Melo et al., “Automatic segmentation of cattle rib-eye area in ultrasound images using the UNet++ deep neural network,”
Computers and Electronics in Agriculture, vol. 195, 2022, doi: 10.1016/j.compag.2022.106818.
[24] I. S. Na, C. Tran, D. Nguyen, and S. Dinh, “Facial UV map completion for pose-invariant face recognition: a novel adversarial
approach based on coupled attention residual UNets,” Human-centric Computing and Information Sciences, vol. 10, no. 1, 2020,
doi: 10.1186/s13673-020-00250-w.
[25] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” 30th IEEE Conference on Computer Vision and
Pattern Recognition, CVPR 2017, vol. 2017, pp. 6230–6239, 2017, doi: 10.1109/CVPR.2017.660.
[26] M. A. M. Elhassan et al., “PPANet: point-wise pyramid attention network for semantic segmentation,” Wireless Communications
and Mobile Computing, vol. 2021, no. 1, 2021, doi: 10.1155/2021/5563875.
[27] Y. Kong, Y. Liu, B. Yan, H. Leung, and X. Peng, “A novel deeplabv3+ network for sar imagery semantic segmentation based on
the potential energy loss function of gibbs distribution,” Remote Sensing, vol. 13, no. 3, pp. 1–13, 2021, doi: 10.3390/rs13030454.
[28] H. Zeng, S. Peng, and D. Li, “Deeplabv3+ semantic segmentation model based on feature cross attention mechanism,” Journal of
Physics: Conference Series, vol. 1678, no. 1, 2020, doi: 10.1088/1742-6596/1678/1/012106.
[29] V. Badrinarayanan, A. Kendall, and R. Cipolla, “SegNet: a deep convolutional encoder-decoder architecture for image
segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 12, pp. 2481–2495, 2017, doi:
10.1109/TPAMI.2016.2644615.
[30] S. Bhattiprolu, “Python for microscopists,” GitHub, 2021. Accessed: Nov. 13, 2021. [Online]. Available:
https://github.com/bnsreenu/python_for_microscopists/blob/master/219-unet_model_with_functions_of_blocks.py
[31] A. Abdollahi, B. Pradhan, and A. M. Alamri, “An ensemble architecture of deep convolutional Segnet and Unet networks for
building semantic segmentation from high-resolution aerial images,” Geocarto International, vol. 37, no. 12, pp. 3355–3370, 2022,
doi: 10.1080/10106049.2020.1856199.
[32] M. Aslam, T. M. Khan, S. S. Naqvi, G. Holmes, and R. Naffa, “Ensemble convolutional neural networks with knowledge transfer
for leather defect classification in industrial settings,” IEEE Access, vol. 8, pp. 198600–198614, 2020, doi:
10.1109/ACCESS.2020.3034731.
BIOGRAPHIES OF AUTHORS
Javed Subhedar was born in India. He received his degree in B.E. [Electronics]
from Walchand College of Engineering Sangli, Maharashtra State, India, and M.S. [Automotive
Electronics] from Coventry University, and he is pursuing a Ph.D. from Symbiosis Institute of
Technology, Pune Symbiosis International (Deemed University). He has rich experience in the
automotive industry, and his current research area includes autonomous driving. He can be
contacted at email: javed.subhedar.phd2020@sitpune.edu.in.
Ketan Kotecha holds a Ph.D. and an M.Tech. from IIT Bombay and currently holds
the positions of Head, Symbiosis Centre for Applied A.I. (SCAAI); Director, Symbiosis Institute
of Technology; CEO, Symbiosis Centre for Entrepreneurship and Innovation (SCEI); and Dean,
Faculty of Engineering, Symbiosis International (Deemed University). He has expertise and
experience in cutting-edge research and projects in AI and deep learning over the last 25+ years.
He has published 100+ papers in excellent peer-reviewed journals on various
topics ranging from cutting-edge AI to education policies, teaching-learning practices, and AI for
all. He is a recipient of two SPARC projects worth INR 166 lacs from the MHRD, Govt of
India in AI in collaboration with Arizona State University, USA and the University of
Queensland Australia, and also the recipient of numerous prestigious awards like Erasmus+
faculty mobility grant to Poland, DUO-India professors fellowship for research in Responsible
AI in collaboration with Brunel University, UK, LEAP grant at Cambridge University UK,
UKIERI grant with Aston University UK, and a grant from Royal Academy of Engineering,
the UK under the Newton Bhabha Fund. He has 3 patents and has delivered keynote speeches
at various national and international forums, including at Machine Intelligence Lab, USA, at
IIT Bombay under the World Bank project, at the International Indian Science Festival
organised by the Department of Science Technology, Govt of India and many more. He is also
an academic editor of the Peerj Computer Science journal and Associate Editor of IEEE Access
journal. He can be contacted at email: head@scaai.siu.edu.in.