On the Effect of Image Resolution on Semantic Segmentation

Ritambhara Singh Abhishek Jain Pietro Perona Shivani Agarwal Junfeng Yang
Department of Computer Science, Duke University
rsingh@cs.duke.edu

Abstract

High-resolution semantic segmentation requires substantial computational resources. Traditional approaches in the field typically downscale the input images before processing and then upscale the low-resolution outputs back to their original dimensions. While this strategy effectively identifies broad regions, it often misses finer details. In this study, we demonstrate that a streamlined model capable of directly producing high-resolution segmentations can match the performance of more complex systems that generate lower-resolution results. By simplifying the network architecture, we enable the processing of images at their native resolution. Our approach leverages a bottom-up information propagation technique across various scales, which we have empirically shown to enhance segmentation accuracy. We have rigorously tested our method using leading-edge semantic segmentation datasets. Specifically, for the Cityscapes dataset, we further boost accuracy by applying the Noisy Student Training technique.

1 Introduction

Deep convolutional neural networks have set new benchmarks across a range of computer vision applications, including image classification, object detection, semantic segmentation, and human pose estimation, among others. Semantic segmentation, the task of classifying each pixel in an image into a category label, offers a detailed scene understanding by identifying the label, location, and shape of every element within an image. This capability has significant implications for autonomous driving, robotic perception, and other areas.

One of the inherent challenges in semantic segmentation involves balancing the trade-offs between inference resolutions. Certain predictions, especially those requiring attention to minute details like object edges or slender structures, benefit from processing at higher resolutions. This approach allows for finer granularity in the segmentation output. Conversely, the identification of larger structural elements within an image, which necessitates a broader contextual understanding, tends to be more effective at lower resolutions. Here, the network’s receptive field can encompass a larger portion of the scene, providing the context needed for accurate large-scale segmentation. This duality highlights the need for adaptable strategies in deep convolutional neural networks (DCNNs) to optimally process and interpret images across different scales and contexts.

Contemporary leading-edge approaches typically downscale input images to 1/2 or 1/4 of their original resolution before processing. This reduction makes it possible to handle images within the available memory and processing capacity, so lower-resolution processing becomes a practical necessity for complex neural networks. Additionally, feature maps at reduced resolutions tend to be information-rich, contributing to more precise overall network predictions. However, this practice has the significant drawback of omitting fine details.

Some strategies [1, 36, 22, 19] aim to produce high-resolution outputs by upsampling these lower-resolution feature maps and applying further convolutions. Yet, these methods are prone to overfitting, primarily because the convolutions at higher resolutions are not complemented by analogous operations at lower resolutions, leaving no mechanism within these networks to refine or adjust the ultimate output based on broader contextual information.

More recently developed architectures [45, 40, 35] incorporate intricate designs that have shown promising results. Despite their advancements, the complexity of these models presents a significant challenge: it remains impractical to train them directly on high-resolution images due to the extensive computational demands. This limitation underscores the ongoing challenge in neural network design: balancing the need for detailed image processing with the practical constraints of available computational resources.

Numerous approaches in the field leverage network architectures that have demonstrated efficacy in image classification tasks, including variations of ResNet [48] and EfficientNet [30]. Utilizing networks pre-trained on auxiliary classification tasks offers a significant advantage by initializing a substantial portion of the model’s weights, thereby shortening the training duration and frequently leading to better performance than models trained from scratch, especially when the data for the specific application may be scarce.

However, a principal challenge with adopting such pre-trained models is the constraints they impose on the innovation of new methodologies. These limitations are particularly evident in the inability to seamlessly integrate novel network components, such as batch normalization [16] or innovative activation functions, into the pre-existing architectures. This restriction can stifle the development of unique and potentially more effective approaches by limiting the exploration of the architectural design space.

Recent works [38, 42, 18] adopt Noisy Student Training, which proceeds in three steps. First, a teacher model is trained on the labeled images. Next, the teacher model is used to generate pseudo labels for unlabeled images. Finally, a student model is trained to minimize the combined cross-entropy loss on both the labeled and the pseudo-labeled images.

In this study, we introduce a streamlined network architecture that demonstrates enhanced generalizability and is capable of producing high-resolution segmentation outputs directly. Leveraging a classifier network that has been pre-trained on a comprehensive dataset like ImageNet [37], our design benefits from the robust foundational knowledge gained from extensive pre-training. We rigorously evaluate our model across several prestigious semantic segmentation datasets, including Mapillary Vistas [31], Cityscapes [8], CamVid [2], COCO [25], and PASCAL-VOC2012 [9], to validate its effectiveness.

Our findings demonstrate that a simplified yet strategically designed network, by directly delivering high-resolution segmentations, can attain leading-edge performance metrics. Specifically, within the Cityscapes dataset, we further refine our model’s precision by incorporating the Noisy Student Training algorithm, showcasing a significant advancement in accuracy. This approach underscores the potential of minimalist designs in achieving remarkable results in semantic segmentation tasks.

Figure 1: The proposed network. The residual blocks are the same as in [13]. The gray areas are the bottom-up propagation stages.

2 Related Work

Current state-of-the-art methods for semantic segmentation are based on convolutional neural networks, which fall into a few architectural families.

Encoder-decoder or hourglass networks are used in many computer vision tasks, such as object detection [26, 20], human pose estimation [32, 17], image-based localization [29, 14], and semantic segmentation [27, 1, 33]. They consist of an encoder, which gradually reduces the resolution of the feature maps and captures high-level semantic information, and a decoder, which gradually recovers the low-level details. Because these networks lose image details along the encoder path, they cannot achieve the best results without skip connections. U-net [36] recovers low-level image details by reusing feature maps from the encoder in the decoder (skip connections). However, in U-net’s decoder the high-resolution convolutions are not supervised by lower-resolution convolutions, so going deeper by adding convolutional layers does not substantially improve accuracy.

Spatial pyramid pooling models perform spatial pyramid pooling [23, 11, 28] at different grid scales or apply several parallel atrous convolutions [4] with different rates. Well-known examples are PSPNet [48] and DeepLab [5]. DeepLabv3+ [6] adds one skip connection to utilize some of the low-level image details.

High-resolution representation networks [40, 15, 10, 49] try to maintain a high-resolution hidden state from input to output. Running low-resolution convolutions in parallel streams yields high-level features without losing low-level details. Since these networks require a lot of memory, they first downsample the input image to a lower resolution before the main body.

Some approaches [4, 3, 21] apply post-processing, such as conditional random fields, to the network’s output to improve segmentation details, especially around object boundaries. These approaches add processing overhead to both training and testing.

Pyramid pooling techniques learn square context regions because pooling and dilation are typically employed in a symmetric fashion. Relational context methods, in contrast, build context by attending to the relationships between pixels and are not bound to square regions. This allows context to be built based on image composition, so such techniques can build more appropriate context for unusual semantic regions, such as scattered regions or a tall thin column. OCRNet [45], DANet [43], and CFNet [46] augment the representation of each pixel by aggregating the representations of its contextual pixels, where the context consists of all the pixels. These works compute the relation (or similarity) between pixels based on the self-attention scheme [41, 39] and perform a weighted aggregation with the similarities as the weights. These methods are used as extensions to an existing segmentation method.

Self-training has previously been used to improve classification networks [44]. In [42], self-training with the Noisy Student algorithm is used to achieve a new state of the art on ImageNet [37]. In Cityscapes, a significant portion of each coarse image is unlabelled due to the coarseness of the labels. In [38], the authors use Noisy Student Training but generate hard-thresholded labels instead of soft labels for the Cityscapes coarse set, because storing soft labels for this set would require around 3.2 TB of storage space [38].
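The order of magnitude of the soft-label storage figure can be checked with a short calculation. The sketch below assumes one float32 probability per evaluated class per pixel at the full 2048 x 1024 Cityscapes resolution; these storage details are assumptions made for illustration and are not taken from [38].

```python
# Back-of-the-envelope storage estimate for soft pseudo labels.
# Assumptions (not stated in the text): float32 probabilities, one value per
# evaluated class per pixel, full 2048 x 1024 Cityscapes resolution.
NUM_COARSE_IMAGES = 20_000   # size of the coarse set, as given in the paper
HEIGHT, WIDTH = 1024, 2048   # Cityscapes image resolution
NUM_CLASSES = 19             # classes used for evaluation
BYTES_PER_VALUE = 4          # float32 (assumed)


def soft_label_bytes(images: int, height: int, width: int,
                     classes: int, bytes_per_value: int) -> int:
    """Bytes needed to store a dense per-class probability map per image."""
    return images * height * width * classes * bytes_per_value


def hard_label_bytes(images: int, height: int, width: int) -> int:
    """Bytes for one uint8 class index per pixel (uncompressed)."""
    return images * height * width


soft_tb = soft_label_bytes(NUM_COARSE_IMAGES, HEIGHT, WIDTH,
                           NUM_CLASSES, BYTES_PER_VALUE) / 1e12
hard_gb = hard_label_bytes(NUM_COARSE_IMAGES, HEIGHT, WIDTH) / 1e9
print(f"soft labels: {soft_tb:.2f} TB, hard labels: {hard_gb:.1f} GB")
```

Under these assumptions the soft labels come to roughly 3.2 TB, in line with the quoted figure, while uncompressed hard labels need about two orders of magnitude less space.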

Based on the pros and cons of the aforementioned methods, in this work we design an independent network which processes images at the original, high-resolution input, and directly outputs the high-resolution segmentation with high generalization ability.

3 Method

We introduce a dual-component network architecture consisting of an initial classifier unit followed by a dedicated segmentation head.

3.1 Classifier Network

The initial segment of our proposed framework functions as a classifier network, designed to produce six feature maps across a spectrum of resolutions, ranging from high to low. Traditional classifiers typically downscale the input resolution early in the process, resulting in the absence of high-resolution feature maps in their outputs. To address this limitation, we have developed a custom network that closely mirrors the architecture of ResNet-34 [13], albeit with minor adjustments to better suit our requirements, as illustrated in Figure 1. Specifically, we enhance the network by integrating additional residual blocks aimed at refining the processing capabilities at the first and second scales.

3.2 Segmentation Head

The latter portion of our architecture is dedicated to the task of semantic segmentation. Following the classifier network’s operation, which yields primary feature maps alongside their associated classes for broadly defined regions, the focus shifts towards refining these maps to generate the definitive segmentation output. Initially, the process involves the upward propagation of information from the coarser, low-resolution feature maps to their higher-resolution counterparts, a technique we refer to as Bottom-Up Propagation. This approach enriches the higher-resolution feature maps with greater contextual depth and an expanded receptive field.

Subsequently, the architecture employs a series of stacked residual blocks [13] to construct final segmentations across various scales. The culmination of this process involves applying the Bottom-Up Propagation technique once again, this time to amalgamate the multi-scale segmentations into a singular, high-resolution outcome. This methodological framework ensures that the resulting segmentation is both detailed and contextually informed, leveraging the strengths of multi-scale processing to achieve superior accuracy.

Bottom-Up Propagation. The process commences by feeding the two feature maps of the lowest resolutions into a Merge Module. Subsequently, the output of this module, along with the next feature map in ascending order of resolution, is input into another Merge Module. This sequential operation continues until it encompasses the feature map with the highest resolution, ensuring a progressive enhancement of detail and context as the resolution increases.
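The ordering of the merges described above can be sketched as a simple fold. The function name `bottom_up_propagation` and the `merge` callback are hypothetical stand-ins for illustration; the callback plays the role of the Merge Module and is not the authors' implementation.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def bottom_up_propagation(feature_maps: List[T],
                          merge: Callable[[T, T], T]) -> T:
    """Fold feature maps from lowest to highest resolution.

    `feature_maps` is ordered from highest resolution (index 0) to lowest
    resolution (last index). `merge(low, high)` stands in for the Merge Module
    and must return a map at the resolution of `high`.
    """
    if len(feature_maps) < 2:
        raise ValueError("need at least two feature maps to merge")
    merged = feature_maps[-1]                 # start at the lowest resolution
    for high in reversed(feature_maps[:-1]):  # walk towards higher resolution
        merged = merge(merged, high)
    return merged


# Toy usage: label each scale and record the order in which merges happen.
scales = ["s1", "s2", "s3", "s4", "s5", "s6"]  # s1 = H x W, s6 = H/32 x W/32
trace = bottom_up_propagation(scales, lambda low, high: f"M({low},{high})")
print(trace)  # M(M(M(M(M(s6,s5),s4),s3),s2),s1)
```

The nested trace makes the data flow explicit: the two coarsest maps are merged first, and each later merge adds one finer scale.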

Merge Module. Within the Merge Module, the feature map of lower resolution undergoes bilinear interpolation to match the dimensions of its higher-resolution counterpart. Following this resizing, the two feature maps are concatenated to form a unified representation. To streamline the combined feature map for efficient processing, a convolutional layer is then applied to condense the channel count, effectively refining the feature integration. This procedure is graphically depicted in Figure 2, illustrating the transformation and consolidation steps integral to the Merge Module’s function.
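A dependency-free sketch of these three steps (resize, concatenate, reduce channels) is given below. The kernel size of the convolution and the corner-alignment convention of the interpolation are not specified in the text, so the 1x1 kernel and the aligned-corner mapping used here are assumptions, as are all function names.

```python
from typing import List

Channel = List[List[float]]   # one H x W plane
FeatureMap = List[Channel]    # C planes


def bilinear_resize(plane: Channel, out_h: int, out_w: int) -> Channel:
    """Bilinear interpolation with corner alignment (an assumed convention)."""
    in_h, in_w = len(plane), len(plane[0])
    out = []
    for i in range(out_h):
        # Map the output row to a fractional source row.
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for j in range(out_w):
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            top = plane[y0][x0] * (1 - wx) + plane[y0][x1] * wx
            bottom = plane[y1][x0] * (1 - wx) + plane[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


def merge_module(low: FeatureMap, high: FeatureMap,
                 weights: List[List[float]]) -> FeatureMap:
    """Upsample `low`, concatenate it with `high`, then mix channels.

    `weights[o][c]` is a 1x1 convolution kernel (kernel size is assumed)
    mapping the concatenated channels to `len(weights)` output channels.
    """
    h, w = len(high[0]), len(high[0][0])
    stacked = high + [bilinear_resize(p, h, w) for p in low]  # channel concat
    assert all(len(row) == len(stacked) for row in weights)
    return [
        [[sum(wc * stacked[c][i][j] for c, wc in enumerate(row))
          for j in range(w)] for i in range(h)]
        for row in weights
    ]


low = [[[0.0, 2.0], [4.0, 6.0]]]                        # 1 channel, 2 x 2
high = [[[1.0] * 3 for _ in range(3)]]                  # 1 channel, 3 x 3
merged = merge_module(low, high, weights=[[1.0, 1.0]])  # 2 channels -> 1
```

In a deep learning framework the same module would normally be a bilinear upsampling call, a channel-wise concatenation, and a learned convolution; the explicit loops here only make the arithmetic visible.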

3.3 Noisy Student Training

Building upon the success of recent studies [38, 42] that have demonstrated the effectiveness of Noisy Student Training, we incorporate this technique within our framework for the Cityscapes dataset to enhance both the volume and the quality of the dataset. Initially, we train a teacher model on the labeled images, employing the conventional cross-entropy loss method. Following this, the teacher model is utilized to generate pseudo labels for the images in the coarse set.

In alignment with the approach outlined by [38], we opt for generating hard labels rather than soft labels, addressing the challenge of storage constraints. Subsequently, a student model is trained to minimize the combined cross-entropy loss on both the originally labeled images and the pseudo-labeled images. Within our experimental setup, data augmentation serves as the source of "noise" for the student model, and we choose not to iterate the process further. This strategy allows us to leverage the robustness introduced by the Noisy Student Training algorithm, thereby improving the model’s performance on semantic segmentation tasks.
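The combined objective can be illustrated with a small sketch. The text does not say how the two cross-entropy terms are weighted, so the equal weighting below is an assumption, as are the function names and the ignore index of 255 used to skip pixels that carry no label.

```python
import math
from typing import List, Sequence, Tuple

Batch = Tuple[Sequence[Sequence[float]], Sequence[int]]  # (probabilities, labels)


def cross_entropy(probs: Sequence[Sequence[float]], labels: Sequence[int],
                  ignore_index: int = 255) -> float:
    """Mean negative log-likelihood over pixels whose label is not ignored."""
    losses: List[float] = [
        -math.log(p[y]) for p, y in zip(probs, labels) if y != ignore_index
    ]
    return sum(losses) / len(losses) if losses else 0.0


def student_loss(labeled: Batch, pseudo: Batch) -> float:
    """Sum of the two cross-entropy terms (equal weighting is assumed)."""
    return cross_entropy(*labeled) + cross_entropy(*pseudo)


# One labeled pixel, and two pseudo-labeled pixels of which one is ignored.
labeled_batch = ([[0.5, 0.5]], [0])
pseudo_batch = ([[0.25, 0.75], [0.5, 0.5]], [1, 255])
loss = student_loss(labeled_batch, pseudo_batch)
```

Here the student's predicted probabilities are passed in directly; in practice they would come from a softmax over the network output for augmented inputs.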

4 Experiments

Our classifier network undergoes an initial pretraining phase on the ImageNet dataset [37], establishing a foundational layer of knowledge that is leveraged across all subsequent experiments. We further refine our model through pretraining on the Mapillary Vistas dataset [31], followed by dedicated training and evaluation phases on the Cityscapes [8] and CamVid [2] datasets to assess its performance across different urban scenes.

Additionally, the model is pretrained on the COCO dataset [25], with subsequent training and testing conducted on the PASCAL-VOC2012 dataset [9], allowing us to validate its efficacy in a broad range of visual contexts. Across all these experiments, segmentation accuracy is quantified using the standard mean Intersection over Union (mIoU) metric, providing a consistent measure of the model’s ability to accurately delineate and classify various elements within an image.
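For reference, mIoU averages the per-class ratio TP / (TP + FP + FN). The sketch below computes it from a confusion matrix over flattened label lists; skipping classes that never occur and the optional ignore index are common conventions assumed here, not details taken from the paper.

```python
from typing import List, Optional


def mean_iou(pred: List[int], target: List[int], num_classes: int,
             ignore_index: Optional[int] = None) -> float:
    """Mean IoU over flattened label lists, computed from a confusion matrix.

    Classes absent from both prediction and ground truth are skipped so they
    do not distort the mean (a common convention, assumed here).
    """
    confusion = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        confusion[t][p] += 1  # rows: ground truth, columns: prediction

    ious = []
    for c in range(num_classes):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp
        fp = sum(confusion[r][c] for r in range(num_classes)) - tp
        union = tp + fp + fn
        if union > 0:
            ious.append(tp / union)
    return sum(ious) / len(ious) if ious else 0.0


pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)  # IoUs 1/3, 2/3, 1/2 -> 0.5
```

Benchmark servers typically accumulate the confusion matrix over the whole test set before taking the ratio, rather than averaging per-image scores.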

Scale	Output Size	Residual Blocks	Channels
1	$H\times W$	1	50
2	$H/2\times W/2$	1	75
3	$H/4\times W/4$	2	125
4	$H/8\times W/8$	4	200
5	$H/16\times W/16$	4	320
6	$H/32\times W/32$	4	450

Table 1: Our classifier network architecture. We use the same residual blocks as in [13]. Downsampling is performed at scales 2 to 6.
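To make the table concrete, the following sketch lists the feature-map shape at each scale for a given input size. The rows are copied from Table 1; the helper name and the assumption that the input size divides evenly by 32 are illustrative.

```python
# (scale, residual blocks, channels) rows copied from Table 1.
TABLE_1 = [(1, 1, 50), (2, 1, 75), (3, 2, 125),
           (4, 4, 200), (5, 4, 320), (6, 4, 450)]


def feature_map_shapes(height: int, width: int):
    """Return (scale, channels, h, w) for each scale of the classifier.

    Scale s has a stride of 2**(s - 1); sizes are assumed to divide evenly.
    """
    shapes = []
    for scale, _blocks, channels in TABLE_1:
        stride = 2 ** (scale - 1)
        shapes.append((scale, channels, height // stride, width // stride))
    return shapes


for scale, channels, h, w in feature_map_shapes(768, 768):
    print(f"scale {scale}: {channels} x {h} x {w} = {channels * h * w:,} values")
```

For a 768 x 768 crop, the full-resolution scale alone holds 50 x 768 x 768 values per feature map, which shows why the channel counts at the first two scales are kept small.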

4.1 ImageNet

Table 1 presents the configuration of our classifier network in detail. For training image preparation, we employ a data augmentation strategy consistent with the one described by Russakovsky et al. [37]. Specifically, image resizing is conducted by randomly selecting the length of the shorter side from the range [256, 480], followed by cropping to dimensions of 224 $\times$ 224. Subsequent steps include the application of random horizontal flips and conventional color augmentation techniques to enhance model robustness and generalizability.

The training regimen spans 100 epochs, utilizing batches of 256 images. Optimization is carried out using Stochastic Gradient Descent (SGD) with a weight decay parameter set at 0.0001. The initial learning rate is 0.1 and is reduced by a factor of 10 at the 30th, 60th, and 90th epochs.
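This step schedule can be written as a small function. Whether the drop takes effect at the start or the end of the named epochs is not specified, so the 0-indexed convention below (the rate changes once `epoch` reaches a milestone) is an assumption.

```python
def step_lr(epoch: int, base_lr: float = 0.1,
            milestones=(30, 60, 90), gamma: float = 0.1) -> float:
    """Learning rate for a 0-indexed epoch under step decay."""
    drops = sum(1 for m in milestones if epoch >= m)  # milestones passed so far
    return base_lr * gamma ** drops


for epoch in (0, 29, 30, 60, 90, 99):
    print(f"epoch {epoch:3d}: lr = {step_lr(epoch):.4f}")
```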

Method	Road	Sidewalk	Building	Wall	Fence	Pole	Traffic Light	Traffic Sign	Vegetation	Terrain	Sky	Person	Rider	Car	Truck	Bus	Train	Motorcycle	Bicycle	mIoU
ResNeSt200 [47]	98.9	88.4	94.3	66.0	66.0	72.5	78.6	82.5	94.2	72.9	96.3	88.4	74.8	96.6	77.0	92.3	90.0	73.2	79.1	83.3
GALD-Net [24]	98.8	87.7	94.2	65.0	66.7	73.1	79.3	82.4	94.2	72.9	96.0	88.4	76.2	96.5	79.8	89.6	87.7	74.1	79.9	83.3
EfficientPS [30]	98.8	88.2	94.3	67.6	67.7	73.4	80.2	83.3	94.3	74.4	96.0	88.7	75.3	96.6	83.5	94.0	91.1	73.5	79.6	84.2
Panoptic Deeplab [7]	98.8	88.1	94.5	68.1	68.1	74.5	80.5	83.5	94.2	74.4	96.1	89.2	77.1	96.5	78.9	91.8	89.1	76.4	79.3	84.2
Ours	98.8	87.5	94.3	65.7	65.5	72.9	80.0	82.7	94.2	73.1	96.0	88.9	75.6	96.8	81.2	93.7	90.3	73.2	79.8	83.7

Table 2: Results on Cityscapes [8] test set. All the methods are pretrained on Mapillary Vistas [31].

4.2 Mapillary Vistas

The Mapillary Vistas dataset (research edition) [31] serves as a comprehensive resource for street-level image analysis, featuring 25,000 images with dense annotations. These are divided into sets of 18,000 for training, 2,000 for validation, and 5,000 for testing. The dataset encompasses 65 object categories alongside a single void class, with the images presenting a diverse array of aspect ratios and resolutions, extending up to 22 Megapixels.

For training, we employ Stochastic Gradient Descent (SGD) with a weight decay parameter of 0.0001 and a batch size of 10. The learning rate follows a polynomial adjustment policy, characterized by a poly exponent of 0.9 and an initial rate of 0.01, over a total of 500 epochs. Data augmentation techniques are applied to enhance model robustness, including random cropping to dimensions of $768\times 768$ , scaling within the range of [0.5, 2.0], and random horizontal flipping. Notably, our training process utilizes both the designated training and validation sets to maximize the learning potential from the available data.
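The polynomial policy scales the initial rate by (1 - t/T) raised to the poly exponent. The sketch below is a minimal version; whether the rate is updated per iteration or per epoch is not stated, so `step` and `total_steps` are left generic.

```python
def poly_lr(step: int, total_steps: int, base_lr: float = 0.01,
            power: float = 0.9) -> float:
    """Polynomial decay: base_lr * (1 - step / total_steps) ** power."""
    if not 0 <= step <= total_steps:
        raise ValueError("step must lie in [0, total_steps]")
    return base_lr * (1 - step / total_steps) ** power


for step in (0, 125, 250, 375, 500):
    print(f"step {step:3d} of 500: lr = {poly_lr(step, 500):.5f}")
```

With an exponent of 0.9 the curve is close to linear decay and reaches zero at the final step.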

4.3 Cityscapes

The Cityscapes dataset [8] is a benchmark collection of 5,000 high-definition street images, meticulously annotated at the pixel level for detailed semantic analysis. The dataset is stratified into subsets comprising 2,975 training images, 500 validation images, and 1,525 test images that are finely annotated. Additionally, it includes 20,000 images with coarse annotations, broadening the scope for model training and evaluation. Out of the 30 categorically distinct classes provided, 19 are earmarked for evaluation purposes.

For comprehensive assessment on the test set, we report not only the mean Intersection over Union (mIoU) for class-wise accuracy but also three supplementary scores: IoU category, iIoU class, and iIoU category, offering a nuanced insight into model performance across different segmentation categories.

The training regimen employs Stochastic Gradient Descent (SGD) with a weight decay setting of 0.0001 and a batch size of 10. We adhere to a "polynomial" learning rate adjustment strategy with a poly exponent of 0.9 and an initial learning rate of 0.01, spanning a total of 300 epochs. To enhance the model’s adaptability to varied scene compositions, data augmentation techniques such as random cropping to $768\times 768$ pixels, scaling within a [0.5, 2.0] range, and random horizontal flipping are applied.
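One way to sample such an augmentation is sketched below. The order of operations (scale first, then crop) and the padding applied when the scaled image is smaller than the crop are assumptions, and `AugmentParams` and `sample_augmentation` are hypothetical names; the paper only lists the three transformations and their ranges.

```python
import random
from dataclasses import dataclass


@dataclass
class AugmentParams:
    scale: float
    out_h: int   # image height after scaling
    out_w: int   # image width after scaling
    pad_h: int   # padding needed to reach the crop size (assumed behaviour)
    pad_w: int
    top: int     # crop offset inside the scaled and padded image
    left: int
    flip: bool   # apply a horizontal flip


def sample_augmentation(height: int, width: int, crop: int = 768,
                        scale_range=(0.5, 2.0), rng=random) -> AugmentParams:
    """Sample scale, crop window, and flip for one training image."""
    scale = rng.uniform(*scale_range)
    out_h, out_w = round(height * scale), round(width * scale)
    pad_h, pad_w = max(crop - out_h, 0), max(crop - out_w, 0)
    top = rng.randint(0, out_h + pad_h - crop)
    left = rng.randint(0, out_w + pad_w - crop)
    return AugmentParams(scale, out_h, out_w, pad_h, pad_w,
                         top, left, rng.random() < 0.5)


# Example: parameters for one 1024 x 2048 image, with a seeded generator.
params = sample_augmentation(1024, 2048, rng=random.Random(0))
print(params)
```

The same parameters must be applied to the image and to its label map, with nearest-neighbour resizing for the labels so that no new class ids are created.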

In pursuit of achieving optimal accuracy, our training strategy encompasses not just the fine-annotated training and validation images, but also the extensive set of coarsely annotated images, thereby leveraging the full spectrum of available data within the Cityscapes dataset.

Noisy Student Training. We use hard-thresholded labelling: for any given pixel, the class with the highest prediction probability from the teacher network is designated as the top class. This prediction is accepted as the label only if the teacher network’s output probability for the top class meets or exceeds a specified confidence threshold. Pixels that fail this criterion are assigned to an "ignore" class, which excludes them from the loss calculation and subsequent training steps. In our implementation, we set this confidence threshold to 0.9, striking a balance between precision and coverage in the generated pseudo labels.
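The thresholding rule can be sketched in a few lines. The numeric id of the ignore class is not given in the text, so the value 255 is an assumption, as is the function name.

```python
from typing import List

IGNORE_INDEX = 255  # assumed id for the "ignore" class; the paper gives none


def hard_pseudo_labels(probs: List[List[float]], threshold: float = 0.9,
                       ignore_index: int = IGNORE_INDEX) -> List[int]:
    """Turn per-pixel class probabilities into hard labels.

    `probs[i]` holds the teacher's class probabilities for pixel i. A pixel
    keeps its top class only if that probability is at least `threshold`.
    """
    labels = []
    for pixel_probs in probs:
        top_class = max(range(len(pixel_probs)), key=pixel_probs.__getitem__)
        confident = pixel_probs[top_class] >= threshold
        labels.append(top_class if confident else ignore_index)
    return labels


teacher_probs = [
    [0.95, 0.03, 0.02],  # confident -> class 0
    [0.50, 0.45, 0.05],  # ambiguous -> ignore
    [0.05, 0.05, 0.90],  # exactly at the threshold -> class 2
]
labels = hard_pseudo_labels(teacher_probs)
print(labels)  # [0, 255, 2]
```

Raising the threshold yields fewer but cleaner pseudo labels, while lowering it increases coverage at the cost of more label noise.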

4.4 CamVid

Compared to Cityscapes [8], CamVid [2] is a much smaller dataset focusing on semantic segmentation for driving scenarios. The original version is composed of 701 annotated images in 32 classes with size 960 $\times$ 720 from five video sequences. However, most literature follows the protocol proposed in [1], which splits the dataset into 367/101/233 images for training, validation, and testing with 11 classes. We follow this protocol for splitting the dataset and train on the training and validation sets to get the highest accuracy on the test set. We use SGD with a weight decay of 0.0001 and batch size of 16. We apply the "polynomial" learning rate policy with a poly exponent of 0.9 and initial learning rate of 0.01, and train for 300 epochs. The data are augmented by random cropping ( $768\times 768$ ), random scaling in the range of [0.5, 2.0], and random horizontal flipping.

4.5 COCO

The Microsoft COCO [25] dataset contains 118k/5k images for training and validation in 80 object categories. The images have varying aspect ratios and sizes. We use SGD with a weight decay of 0.0001 and batch size of 12. We apply the "polynomial" learning rate policy with a poly exponent of 0.9 and initial learning rate of 0.01, and train for 90 epochs. The data are augmented by random cropping ( $480\times 480$ ), random scaling in the range of [0.5, 2.0], and random horizontal flipping.

4.6 PASCAL-VOC2012

The PASCAL-VOC2012 [9] segmentation dataset contains 20 object categories and one background class. The dataset has 1,464 training, 1,449 validation, and 1,456 test images. We augment the dataset with the extra annotations provided by [12], resulting in 10,582 training images. We use SGD with a weight decay of 0.0001 and batch size of 12. We apply the "polynomial" learning rate policy with a poly exponent of 0.9 and initial learning rate of 0.01, and train for 300 epochs. The data are augmented by random cropping ( $480\times 480$ ), random scaling in the range of [0.5, 2.0], and random horizontal flipping.

5 Results

Our method has been evaluated across multiple datasets, with the outcomes benchmarked against leading state-of-the-art techniques.

5.1 Cityscapes

The performance of our approach on the Cityscapes [8] test set is detailed in Table 2. Each method under comparison has undergone pretraining on the Mapillary Vistas dataset [31] to ensure a consistent basis for evaluation. Our training regimen encompasses the comprehensive use of Cityscapes’ training, validation, and coarse annotation sets to maximize the learning potential from the available data.

6 Conclusion

In this study, we introduced a simplified network architecture that directly produces high-resolution semantic segmentations, challenging the conventional methodology that relies on downscaling and upscaling processes. Our findings demonstrate that this streamlined approach not only matches but, in certain instances, surpasses the performance of more complex, lower-resolution systems. Key to our success is the implementation of a bottom-up information propagation strategy, which effectively enhances segmentation accuracy by ensuring that higher-resolution feature maps are informed by contextually rich, lower-resolution counterparts.

Extensive experimentation across various leading semantic segmentation datasets has validated the efficacy of our model. Notably, our application of the Noisy Student Training technique on the Cityscapes dataset signifies a notable advancement in segmentation accuracy, showcasing the potential of leveraging semi-supervised learning methods within high-resolution segmentation tasks.

The implications of our research are twofold. Firstly, it affirms that high-resolution segmentation can be achieved without the computational penalties traditionally associated with processing at native image resolutions. Secondly, it underscores the versatility and adaptability of simplified network structures in handling complex visual tasks, opening avenues for further exploration in efficient network design and training methodologies.

Future work will focus on refining the bottom-up propagation mechanism to further enhance detail capture and exploring the integration of additional semi-supervised and unsupervised learning techniques to expand the model’s applicability and robustness across diverse and challenging segmentation scenarios.

References

Badrinarayanan et al. [2017] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE transactions on pattern analysis and machine intelligence, 39(12):2481–2495, 2017.
Brostow et al. [2008] Gabriel J. Brostow, Julien Fauqueur, and Roberto Cipolla. Semantic object classes in video: A high-definition ground truth database. Pattern Recognition Letters, xx(x):xx–xx, 2008.
Chandra and Kokkinos [2016] Siddhartha Chandra and Iasonas Kokkinos. Fast, exact and multi-scale inference for semantic image segmentation with deep gaussian crfs. In European Conference on Computer Vision, pages 402–418. Springer, 2016.
Chen et al. [2017a] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 40(4):834–848, 2017a.
Chen et al. [2017b] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017b.
Chen et al. [2018] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European conference on computer vision (ECCV), pages 801–818, 2018.
Cheng et al. [2019] Bowen Cheng, Maxwell D Collins, Yukun Zhu, Ting Liu, Thomas S Huang, Hartwig Adam, and Liang-Chieh Chen. Panoptic-deeplab. arXiv preprint arXiv:1910.04751, 2019.
Cordts et al. [2016] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3213–3223, 2016.
Everingham et al. [2012] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html, 2012.
Fourure et al. [2017] Damien Fourure, Rémi Emonet, Elisa Fromont, Damien Muselet, Alain Tremeau, and Christian Wolf. Residual conv-deconv grid network for semantic segmentation. arXiv preprint arXiv:1707.07958, 2017.
Grauman and Darrell [2005] Kristen Grauman and Trevor Darrell. The pyramid match kernel: Discriminative classification with sets of image features. In Tenth IEEE International Conference on Computer Vision (ICCV’05) Volume 1, pages 1458–1465. IEEE, 2005.
Hariharan et al. [2011] Bharath Hariharan, Pablo Arbeláez, Lubomir Bourdev, Subhransu Maji, and Jitendra Malik. Semantic contours from inverse detectors. In 2011 International Conference on Computer Vision, pages 991–998. IEEE, 2011.
He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
Hosseini et al. [2022] Parisa Hosseini, Seyedalireza Khoshsirat, Mohammad Jalayer, Subasish Das, and Huaguo Zhou. Application of text mining techniques to identify actual wrong-way driving (wwd) crashes in police reports. International Journal of Transportation Science and Technology, 2022.
Huang et al. [2017] Gao Huang, Danlu Chen, Tianhong Li, Felix Wu, Laurens van der Maaten, and Kilian Q Weinberger. Multi-scale dense networks for resource efficient image classification. arXiv preprint arXiv:1703.09844, 2017.
Ioffe and Szegedy [2015] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
Khoshsirat and Kambhamettu [2022] Seyedalireza Khoshsirat and Chandra Kambhamettu. Semantic segmentation using neural ordinary differential equations. In International Symposium on Visual Computing, pages 284–295. Springer, 2022.
Khoshsirat and Kambhamettu [2023a] Seyedalireza Khoshsirat and Chandra Kambhamettu. Empowering visually impaired individuals: A novel use of apple live photos and android motion photos. In 25th Irish Machine Vision and Image Processing Conference, 2023a.
Khoshsirat and Kambhamettu [2023b] Seyedalireza Khoshsirat and Chandra Kambhamettu. Sentence attention blocks for answer grounding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6080–6090, 2023b.
Khoshsirat and Kambhamettu [2023c] Seyedalireza Khoshsirat and Chandra Kambhamettu. A transformer-based neural ode for dense prediction. Machine Vision and Applications, 34(6):1–11, 2023c.
Khoshsirat and Kambhamettu [2023d] Seyedalireza Khoshsirat and Chandra Kambhamettu. Embedding attention blocks for the vizwiz answer grounding challenge. VizWiz Grand Challenge Workshop, 2023d.
Khoshsirat and Kambhamettu [2024] Seyedalireza Khoshsirat and Chandra Kambhamettu. Improving normalization with the james-stein estimator. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024.
Lazebnik et al. [2006] Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), pages 2169–2178. IEEE, 2006.
Li et al. [2019] Xiangtai Li, Li Zhang, Ansheng You, Maoke Yang, Kuiyuan Yang, and Yunhai Tong. Global aggregation then local distribution in fully convolutional networks. arXiv preprint arXiv:1909.07229, 2019.
Lin et al. [2014] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
Lin et al. [2017] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2117–2125, 2017.
Long et al. [2015] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015.
Maserat et al. [2017] Elham Maserat, Reza Safdari, Hamid Asadzadeh Aghdaei, Alireza Khoshsirat, and Mohammad Reza Zali. 43: Designing evidence based risk assessment system for cancer screening as an applicable approach for the estimating of treatment roadmap. BMJ Open, 7(Suppl 1):bmjopen–2016, 2017.
Melekhov et al. [2017] Iaroslav Melekhov, Juha Ylioinas, Juho Kannala, and Esa Rahtu. Image-based localization using hourglass networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 879–886, 2017.
Mohan and Valada [2020] Rohit Mohan and Abhinav Valada. EfficientPS: Efficient panoptic segmentation. arXiv preprint arXiv:2004.02307, 2020.
Neuhold et al. [2017] Gerhard Neuhold, Tobias Ollmann, Samuel Rota Bulo, and Peter Kontschieder. The mapillary vistas dataset for semantic understanding of street scenes. In Proceedings of the IEEE International Conference on Computer Vision, pages 4990–4999, 2017.
Newell et al. [2016] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In European conference on computer vision, pages 483–499. Springer, 2016.
Noh et al. [2015] Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han. Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE international conference on computer vision, pages 1520–1528, 2015.
Odena et al. [2016] Augustus Odena, Vincent Dumoulin, and Chris Olah. Deconvolution and checkerboard artifacts. Distill, 1(10):e3, 2016.
Pohlen et al. [2017] Tobias Pohlen, Alexander Hermans, Markus Mathias, and Bastian Leibe. Full-resolution residual networks for semantic segmentation in street scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4151–4160, 2017.
Ronneberger et al. [2015] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
Russakovsky et al. [2015] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International journal of computer vision, 115(3):211–252, 2015.
Tao et al. [2020] Andrew Tao, Karan Sapra, and Bryan Catanzaro. Hierarchical multi-scale attention for semantic segmentation. arXiv preprint arXiv:2005.10821, 2020.
Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
Wang et al. [2019] Jingdong Wang, Ke Sun, Tianheng Cheng, Borui Jiang, Chaorui Deng, Yang Zhao, Dong Liu, Yadong Mu, Mingkui Tan, Xinggang Wang, et al. Deep high-resolution representation learning for visual recognition. arXiv preprint arXiv:1908.07919, 2019.
Wang et al. [2018] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7794–7803, 2018.
Xie et al. [2020] Qizhe Xie, Minh-Thang Luong, Eduard Hovy, and Quoc V Le. Self-training with noisy student improves ImageNet classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10687–10698, 2020.
Xue et al. [2019] Haolan Xue, Chang Liu, Fang Wan, Jianbin Jiao, Xiangyang Ji, and Qixiang Ye. DANet: Divergent activation for weakly supervised object localization. In Proceedings of the IEEE International Conference on Computer Vision, pages 6589–6598, 2019.
Yalniz et al. [2019] I Zeki Yalniz, Hervé Jégou, Kan Chen, Manohar Paluri, and Dhruv Mahajan. Billion-scale semi-supervised learning for image classification. arXiv preprint arXiv:1905.00546, 2019.
Yuan et al. [2019] Yuhui Yuan, Xilin Chen, and Jingdong Wang. Object-contextual representations for semantic segmentation. arXiv preprint arXiv:1909.11065, 2019.
Zhang et al. [2019] Hang Zhang, Han Zhang, Chenguang Wang, and Junyuan Xie. Co-occurrent features in semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 548–557, 2019.
Zhang et al. [2020] Hang Zhang, Chongruo Wu, Zhongyue Zhang, Yi Zhu, Zhi Zhang, Haibin Lin, Yue Sun, Tong He, Jonas Mueller, R Manmatha, et al. ResNeSt: Split-attention networks. arXiv preprint arXiv:2004.08955, 2020.
Zhao et al. [2017] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2881–2890, 2017.
Zhou et al. [2015] Yisu Zhou, Xiaolin Hu, and Bo Zhang. Interlinked convolutional neural networks for face parsing. In International symposium on neural networks, pages 222–231. Springer, 2015.