1 Introduction
Many modern devices such as smartphones, drones, augmented reality headsets, vehicles and other Internet of Things (IoT) devices are equipped with high-quality cameras that can capture high-resolution images and videos. With the help of image stitching techniques, camera arrays [126, 157], gigapixel acquisition robots [110] and whole-slide scanners [41], capture resolutions can be increased to billions of pixels (commonly referred to as gigapixels), such as the image depicted in Figure 1. One could attempt to define high-resolution based on the capabilities of the human visual system. However, many deep learning tasks rely on data captured by equipment that behaves very differently from the human eye, such as microscopes, satellites and infrared cameras. Furthermore, utilizing more detail than the eye can sense is beneficial in many deep learning tasks, such as in the applications discussed in Section 2. The amount of detail that can be captured, and that is useful if processed, varies greatly from task to task. Therefore, the definition of high-resolution is task-dependent. For instance, in image classification and computed tomography (CT) scan processing, a resolution of 512 \(\times\) 512 pixels is considered high [17, 37]. In visual crowd counting, datasets with High-Definition (HD) resolutions or higher are common [45], and whole-slide images (WSIs) in histopathology, which is the study of diseases of the tissues, or remote sensing data, which are captured by aircraft or satellites, can easily reach gigapixel resolutions [134, 135].
Moreover, with the constant advancement of hardware and methodologies, what the deep learning literature considers high-resolution has shifted over time. For instance, in the late 1990s, processing the 32 \(\times\) 32-pixel MNIST images with neural networks was an accomplishment [78], whereas in the early 2010s, the 256 \(\times\) 256-pixel images in ImageNet were considered high-resolution [76]. This trend can also be seen in the consistent increase of the average resolution of images in popular deep learning datasets, such as crowd counting [45] and anomaly detection [101] datasets. Therefore, the definition of high-resolution is also period-dependent. Based on the task- and period-dependence properties, it is clear that the term "high-resolution" is technical, not fundamental or universal. Therefore, instead of trying to derive such a definition, we shift our focus to resolutions that create technical challenges in deep learning at the time of this writing.
Using high-resolution images and videos directly as inputs to deep learning models creates challenges during both the training and inference phases. With the exception of fully-convolutional networks (FCNs), the number of parameters in deep learning models typically increases with larger input sizes. Moreover, the amount of computation, which is commonly measured in terms of floating point operations (FLOPs), and therefore inference/training time, as well as GPU memory consumption, increase with higher-resolution inputs, as shown in Figure 2. This issue is especially problematic in Vision Transformer (ViT) architectures, where the computation and memory costs of the self-attention mechanism scale quadratically with the number of input tokens, and thus with input resolution [37, 122]. These issues are exacerbated when training or inference needs to be done on resource-constrained devices, such as smartphones, that have limited computational capabilities compared to high-end computing equipment, such as workstations or servers.
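To make the quadratic scaling concrete, the following back-of-the-envelope sketch counts the dominant self-attention multiply-accumulates of a plain ViT at different resolutions; the 16 \(\times\) 16 patch size and embedding dimension of 768 are assumptions (ViT-Base-like values), not the settings of any specific method discussed here.

```python
# Rough self-attention cost of a plain ViT; patch size and embedding
# dimension are assumptions (ViT-Base-like values).
def attention_flops(height, width, patch=16, dim=768):
    n = (height // patch) * (width // patch)  # number of tokens
    # QK^T scores plus the attention-weighted sum of values:
    # roughly 2 * n^2 * dim multiply-accumulates per layer.
    return 2 * n * n * dim

for h, w in [(512, 512), (2048, 2048), (8192, 8192)]:
    print(f"{h}x{w}: {attention_flops(h, w):.2e} FLOPs per layer")
```

Quadrupling the side length multiplies the token count by 16 and the attention cost by 256, which is why high-resolution ViT methods target this term specifically.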
Even though methods such as model parallelism can be used to split the model between multiple GPUs during both the training [113, 146] and inference [39] phases, and thus avoid memory and latency issues, these methods require a large amount of resources, such as a large number of GPUs and servers, which can incur high costs, especially when working with extreme resolutions such as gigapixel images. Furthermore, in many applications, such as self-driving cars and drone image processing, there is a limit on the hardware that can be mounted, and offloading the computation to external servers is not always possible because of the unreliability of the network connection due to movement and the time-critical nature of the application. Therefore, the most common approach for deep learning training and inference is to load the full model on each single GPU instance. Multi-GPU setups are instead typically used to speed up training by increasing the overall batch size, to test multiple sets of hyper-parameters in parallel, or to distribute the inference load. Consequently, in many cases, there is an effective maximum resolution that can be processed by deep learning models. As an example, the maximum resolution for inference using SASNet [116], the state-of-the-art model for crowd counting on the Shanghai Tech dataset [162] at the time of this writing, is around 1024 \(\times\) 768 (less than HD) on Nvidia 2080 Ti GPUs, which have 11 GB of video memory.
Although newer generations of GPUs are getting faster and have more memory available, the resolution of images and videos captured by devices is also increasing. Figure 3 shows this trend across recent years for multiple types of devices. Therefore, the aforementioned issues will likely persist even with advances in computation hardware technology. Furthermore, current imaging technologies are nowhere near the physical limits of image resolution, which is estimated to be in the petapixel range [11].
Whether or not capturing and processing a higher resolution leads to improvements depends on the particular problem at hand. For instance, in image classification, it is unlikely that increasing the resolution of images of objects or animals to gigapixels would reveal more beneficial details and improve the accuracy. On the other hand, if the goal is to count the total number of people in scenes such as the one presented in Figure 1, using an HD resolution instead of gigapixels would mean that several people could be represented by a single pixel, which significantly increases the error. Similarly, it has been shown that using higher resolutions in histopathology can lead to better results [89].
Assuming there is an effective maximum resolution for a particular problem due to hardware limitations or latency requirements, there are two simple baseline approaches for processing the original captured inputs, which are commonly used in the deep learning literature [21, 30, 102]. The popularity of these baselines can be attributed to the simplicity of their implementation. The first one is to resize (downsample) the original input to the desired resolution; however, this leads to lower accuracy if details important for the problem at hand are lost. This approach is called uniform downsampling (UD), since the quality is reduced uniformly throughout the image. The second approach is to cut the original input into smaller patches that each fit within the maximum resolution, process the patches independently, and aggregate the results, for instance, by summing them up for regression problems or majority voting for classification problems. We call this approach cutting into patches (CIP). There are two issues with this approach. First, many deep learning models rely on global features, which will be lost since features extracted from each patch are not shared with other patches, leading to decreased accuracy. For instance, crowd counting methods typically rely heavily on global information such as perspective or illumination [45, 116], and in object detection, objects near the boundaries may be split between multiple patches. Second, since multiple inference passes are performed, that is, one pass for each patch, inference takes much longer. This issue is worse when patches overlap.
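As a concrete illustration, the following minimal PyTorch sketch implements both baselines for a dense regression task such as crowd counting; `model`, the tensor layout and the density-map output are assumptions for illustration, not any specific published implementation.

```python
import torch
import torch.nn.functional as F

def count_ud(model, image, max_h, max_w):
    """UD: uniformly downsample the full image to the maximum resolution."""
    small = F.interpolate(image, size=(max_h, max_w),
                          mode="bilinear", align_corners=False)
    return model(small).sum()  # sum of the predicted density map = count

def count_cip(model, image, rows, cols):
    """CIP: split into a grid of patches, one forward pass each, sum counts."""
    _, _, h, w = image.shape
    total = torch.zeros(())
    for i in range(rows):
        for j in range(cols):
            patch = image[:, :, i * h // rows:(i + 1) * h // rows,
                                j * w // cols:(j + 1) * w // cols]
            total = total + model(patch).sum()  # one pass per patch
    return total
```

For classification tasks, majority voting over per-patch predictions would replace the sum in the CIP aggregation step.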
To highlight these issues, we test the two baseline approaches (UD and CIP) on the Shanghai Tech Part B dataset [162] for crowd counting, which contains images of size 1024 \(\times\) 768 pixels, as well as the PANDA dataset [144], which contains gigapixel images. However, we resize the gigapixel images to 2560 \(\times\) 1440 in order to comply with our hardware limitations. We reduce the original image size by factors of 4 and 16 and measure the mean absolute error (MAE) for both baselines. To test UD, we take a pre-trained SASNet model [116] and fine-tune it for the target input size using the AdamW optimizer [88]; note that the original SASNet paper uses the Adam optimizer [71]. We train the model for 100 epochs with a batch size of 12 per GPU instance using 3 \(\times\) Nvidia A6000 GPUs for the Shanghai Tech Part B experiments, and a batch size of 1 for the PANDA experiments. To test CIP, we cut the original image into 4 and 16 patches, obtain the count for each patch using the pre-trained SASNet mentioned above, and aggregate the results by summing up the predicted counts, since we empirically found that fine-tuning does not improve the accuracy of cutting into patches.
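A minimal sketch of this UD fine-tuning loop is given below; the learning rate and the density-map MSE loss are assumptions, while the AdamW optimizer and the epoch count follow the setup described above.

```python
import torch

def finetune_ud(model, train_loader, epochs=100, lr=1e-5):
    # Hedged sketch of the UD fine-tuning described in the text; lr and the
    # MSE density-map loss are assumed, AdamW and 100 epochs follow the text.
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    criterion = torch.nn.MSELoss()
    for _ in range(epochs):
        for images, density_maps in train_loader:  # inputs already resized
            optimizer.zero_grad()
            loss = criterion(model(images), density_maps)
            loss.backward()
            optimizer.step()
    return model
```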
The results of these experiments are shown in Table 1. It can be observed that uniform downsampling significantly increases the error compared to processing the original input size. Keep in mind that even though the increase in error is not as drastic with cutting into patches, and there are even improvements in some cases, the inference time of this approach increases by the same factor (i.e., 4 or 16) when using the effective maximum resolution possible for the hardware. This is due to the fact that patches cannot be processed in parallel, as the entire hardware is required to process a single patch. Indeed, in the PANDA experiments, which are close to the maximum effective resolution of our hardware, we can see this drastic increase in computation time when using CIP compared to UD.
Since these baseline approaches are far from ideal, several alternative methods have been proposed in the literature in recent years to improve accuracy and speed while complying with the maximum resolution limitation caused either by memory constraints or by speed requirements. The goal of this survey is to summarize and categorize these contributions. To the best of our knowledge, no other survey on the topic of high-resolution deep learning exists. However, there are some surveys that include aspects relevant to this topic. A survey on methods for reducing the computational complexity of Transformer architectures is provided in [122], which discusses the issues related to the quadratic time and memory complexity of self-attention and analyzes various aspects of efficiency, including memory footprint and computational cost. While reducing the computational complexity of Transformer models can contribute to efficient processing of high-resolution inputs, in this survey we only include Vision Transformer methods that explicitly focus on high-resolution images. Some application-specific surveys include high-resolution datasets and methods that operate on such data. For instance, a survey on deep learning for histopathology, which mentions the challenges of processing giga-resolution WSIs, is provided in [118]; a survey of methods that achieve greater spatial resolution in CT is provided in [111], which highlights the improved diagnostic accuracy of ultra-high-resolution CT and briefly discusses deep learning methods for noise reduction and reconstruction; a survey on crowd counting, where many of the available datasets are high-resolution, is provided in [45]; a survey on deep learning methods for land cover classification and object detection in high-resolution remote sensing imagery is provided in [161]; and a survey on deep learning-based change detection in high-resolution remote sensing images is provided in [66].
It is important to mention that some methods operate on high-resolution inputs, yet do not make any effort to address the aforementioned challenges. For instance, multi-column (also known as multi-scale) networks [45, 116] incorporate multiple columns of layers in their architecture, where each column is responsible for processing a specific scale, as shown in Figure 4. However, since the columns process the same resolution as the original input, most of these methods in fact require even more memory and computation than processing the original scale alone. The primary goal of these methods is instead to increase accuracy by taking into account the scale variations that occur in high-resolution images, although there are some multi-scale methods that improve both accuracy and efficiency [15, 138, 164]. Therefore, these methods do not fall within the scope of this survey, unless they explicitly address the efficiency aspect for high-resolution inputs. ZoomCount [109], Locality-Aware Crowd Counting [167], RAZ-Net [86] and Learn to Scale [149] are all examples of multi-scale methods in crowd counting; DMMN [57] and KGZNet [139] are examples in medical image processing.
The primary purpose of this survey is to collect and describe methods in the deep learning literature that can be used in situations where the high resolution of input images and videos creates the aforementioned technical challenges regarding memory, computation and time. The rest of this paper is organized as follows: Section 2 lists applications where high-resolution images and videos are processed using deep learning. Section 3 categorizes efficient methods for high-resolution deep learning into five general categories and provides several examples for each category. This section also briefly discusses alternative approaches for solving the memory and processing time issues caused by high-resolution inputs. Section 4 lists existing high-resolution datasets for various deep learning problems and provides details for each of them. Section 5 discusses the advantages and disadvantages of using efficient high-resolution methods belonging to different categories and provides recommendations about which method to use in different situations. Finally, Section 6 concludes the paper by summarizing the current state and trends in high-resolution deep learning, as well as suggestions for future research. The code for the experiments conducted in this survey is available at https://gitlab.au.dk/maleci/high-resolution-deep-learning.
5 Discussion and Open Issues
Each of the approaches introduced in Section 3 has its advantages and disadvantages and is useful in certain situations, which are summarized in Table 5. NUD (Section 3.1) works well in cases where the salient area is small compared to the entire image, and thus it is possible to sample many pixels from such areas. This requirement is satisfied in gaze estimation or object detection problems. Our conjecture is that it would also work well in problems such as hand gesture detection and non-cropped facial expression recognition, although these tasks have not yet been explored in the literature in combination with NUD. However, when the salient area is large, for instance, densely populated scenes in crowd counting or a scene fully covered with objects in object detection, the quality gain obtained by sampling from salient areas will be negligible, and the result of NUD will be similar to that of uniform downsampling [8].
Similarly, SZS methods (Section 3.2) require the salient area to be small, otherwise they zoom everywhere and save little time and computation. This also means that the effectiveness of NUD and SZS methods may vary based on the specific input. For instance, the more people there are in an image processed for crowd counting, or the more tumors there are in cancer detection, the less efficient such methods will be, unless there are specific safeguards that prevent them from performing an enormous number of computations, such as in GigaDet [18], which processes at most K patch candidates.
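A sketch of such a safeguard is shown below: regardless of how crowded the scene is, only the K highest-scoring patch candidates are processed at high resolution. The scoring setup and names are illustrative assumptions, not GigaDet's actual implementation.

```python
import torch

def select_topk_candidates(patch_scores, k):
    # patch_scores: (N,) saliency/objectness score per coarse patch.
    # Return the indices of at most k patches to process at high resolution,
    # capping the cost even when the whole scene is salient.
    k = min(k, patch_scores.numel())
    return torch.topk(patch_scores, k=k).indices
```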
Furthermore, NUD methods are not effective when the target resolution is much smaller than the input resolution, for instance, when gigapixel inputs need to be resized down to HD, as this would result in highly distorted images, which makes it difficult for the task DNN to perform well. Even when the gap between the two resolutions is not extremely large, NUD can lead to severe distortions in some cases; for instance, it may completely distort the shape of the edges of a gastrointestinal lesion, making it difficult for the task network to detect useful features. This may reduce accuracy despite the fact that more pixels are sampled from salient areas. As explained in Section 3.1, some methods try to mitigate the distortion by using structured grids. However, this may limit the benefits obtained by NUD.
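To make the idea concrete, the following sketch implements a separable, saliency-driven sampler using PyTorch's grid_sample; the precomputed saliency map, the inverse-CDF sampling scheme and all parameter choices are illustrative assumptions rather than any specific published method.

```python
import torch
import torch.nn.functional as F

def nud_downsample(image, saliency, out_h, out_w, eps=0.1):
    # image: (1, C, H, W); saliency: (1, 1, H, W), non-negative.
    sal = saliency + eps                 # keep some sampling mass everywhere
    row_w = sal.sum(dim=3).flatten()     # marginal saliency per row, (H,)
    col_w = sal.sum(dim=2).flatten()     # marginal saliency per column, (W,)

    def inverse_cdf(weights, n):
        # Place n sample coordinates so their density follows `weights`.
        cdf = torch.cumsum(weights, 0)
        cdf = cdf / cdf[-1]
        targets = torch.linspace(0.0, 1.0, n)
        idx = torch.searchsorted(cdf, targets)
        idx = idx.clamp(max=weights.numel() - 1)
        return idx.float() / (weights.numel() - 1) * 2.0 - 1.0  # to [-1, 1]

    ys, xs = inverse_cdf(row_w, out_h), inverse_cdf(col_w, out_w)
    grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
    grid = torch.stack([grid_x, grid_y], dim=-1).unsqueeze(0)  # (1,H',W',2)
    return F.grid_sample(image, grid, align_corners=True)
```

Rows and columns with higher marginal saliency receive proportionally more output samples, while the separable structure keeps the grid axis-aligned and limits distortion, mirroring the structured-grid mitigation mentioned above.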
In addition, since NUD enlarges some parts of the image compared to uniform downsampling, other areas of the resulting image will be smaller than they would be with uniform downsampling. Thus, if the saliency map is not of high quality, unimportant areas will be enlarged and the ones important for the final task will shrink, resulting in accuracy loss. This is directly at odds with the requirement that the saliency detection method should be low-overhead, creating another trade-off that needs to be carefully balanced. Moreover, as explained in Section 3.1, some variations of NUD require an external supervision signal or regularization term to train the saliency detection network, which can be difficult to design. In NUD or SZS methods that detect saliency in videos based on the results obtained from previous frames, such as SALISA [8] and REMIX [67], when the difference between subsequent frames is high, the method needs to be reset to processing the entire high-resolution image. When this occurs frequently, the obtained benefits are diminished.
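This reuse-and-reset pattern can be summarized by a sketch along the following lines; the frame-difference threshold and the `saliency_net` and `task_net` placeholders are assumptions, not the actual SALISA or REMIX pipelines, and `nud_downsample` refers to the sketch above.

```python
def process_video(frames, saliency_net, task_net, out_hw, threshold=0.05):
    # Reuse the saliency map across frames and recompute it (reset) only
    # when consecutive frames differ too much; frequent resets erode the
    # savings, as discussed above.
    prev, saliency, outputs = None, None, []
    for frame in frames:  # each frame: (1, C, H, W), values in [0, 1]
        changed = prev is None or (frame - prev).abs().mean() > threshold
        if changed:
            saliency = saliency_net(frame)  # full-resolution saliency pass
        small = nud_downsample(frame, saliency, *out_hw)
        outputs.append(task_net(small))
        prev = frame
    return outputs
```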
As mentioned in Section 3.3, LSNs need to be designed, trained and well optimized for the specific problem at hand, which is not an easy task. Furthermore, since LSNs produce an output for each scanned area of the input, they are suitable for tasks where the output has the form of a map, such as dense classification or dense regression problems. Moreover, the scanning nature of LSNs means that all areas of the image are treated similarly; therefore, they are better suited for situations where there is no perspective and objects of the same type have the same size regardless of their location, such as WSIs and remote sensing, as opposed to surveillance and crowd counting, where people close to the camera appear larger than people far away.
Since TOIC methods extract representations that are both compressed and suitable for the task at hand, they often need to be tailored to the specific problem, which requires substantial domain knowledge. Both Slide Graph [89] and MCAT [20], presented in Section 3.4, are based on domain knowledge about the cellular structure of tissues and the biological function of genes, respectively. Almost all frequency-domain DNNs try to preserve the architecture of the CNNs they are based on. However, since the interpretation of features in the frequency domain is different, and such features have certain properties, such as being non-negative, it might be better to customize the architectural elements for the frequency domain, as CS-Fnet [90] does.
Most high-resolution Vision Transformer methods try to reduce the quadratic cost of self-attention to linear, and then compensate for the accuracy loss by learning data transformations using convolutions. To keep the overhead of the convolutions low, depth-wise convolution is typically used. Additionally, most high-resolution ViTs utilize a multi-scale architecture in order to capture features of various scales. High-resolution ViTs are more general-purpose than other high-resolution deep learning methods and are often used for a large variety of tasks.
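A typical instance of this pattern, sketched under assumed shapes, applies a depth-wise convolution over the spatial token grid to reintroduce local inductive bias at low cost; this is an illustrative module, not the block design of any particular method.

```python
import torch.nn as nn

class DepthwiseTokenMixer(nn.Module):
    # Cheap local mixing over the token grid: groups=dim makes the
    # convolution depth-wise, so its cost grows linearly with dim.
    def __init__(self, dim):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=3, padding=1,
                                groups=dim)

    def forward(self, tokens, h, w):
        # tokens: (B, N, dim) with N = h * w
        b, n, d = tokens.shape
        x = tokens.transpose(1, 2).reshape(b, d, h, w)
        x = self.dwconv(x)
        return x.flatten(2).transpose(1, 2)  # back to (B, N, dim)
```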
Quantitative comparison of the various methods is a serious challenge in efficient high-resolution deep learning. As methods available in the literature rarely provide code, comparing them against the same benchmark requires reproducing them from scratch, which demands massive effort. The next best approach is to compare these methods based on results reported on the same benchmark. However, methods rarely use the same datasets and metrics in their experiments. To shed some light on these challenges, consider Table 6 as an example. Although a single common benchmark among these methods does not exist, several pairs include experiments on the same dataset. However, upon further inspection, it is not possible to make fair comparisons. GigaDet and REMIX both use the PANDA dataset, and ViT and GG-Transformer both use COCO; however, both pairs belong to the same category of methods, so there is little benefit in comparing them. SALISA and MMNet both use ImageNet VID, and they do not belong to the same category of methods. However, SALISA uses GFLOPs as its efficiency metric, which is hardware-agnostic, while MMNet evaluates efficiency using frames-per-second (FPS), which is hardware-dependent. Slide Graph, MCAT and HIPT all use TCGA-BRCA; however, neither MCAT nor HIPT reports any efficiency metrics. Finally, Fast ScanNet and [123] both use CAMELYON16; however, Fast ScanNet reports performance using the AUC and FROC metrics, while [123] reports performance in terms of c-Index and does not measure efficiency. Due to the trade-off between efficiency and performance, both metrics must be taken into account to properly compare methods and draw meaningful conclusions.
6 Conclusion and Outlook
Processing high-resolution images and videos with deep learning is crucial in various domains of science and technology, yet few methods exist that address the associated computational challenges. Among existing methods, the trend of designing solutions specifically for the problem at hand is clearly visible. This can be an issue in tasks for which high-resolution datasets are not available. As with model compression, both modifying existing methods and designing an efficient high-resolution method from scratch are viable approaches.
Efficient high-resolution deep learning is in its infancy and there is a lot of room for improvement. For instance, a number of attention-free MLP-based methods have recently been proposed as lightweight alternatives to Transformers [51], which try to mimic the global receptive field of Transformers without the self-attention mechanism. Exploiting such architectures for efficient processing of high-resolution inputs would be an interesting research direction. Furthermore, the multimodal co-attention in MCAT [20] can be applied to many other multimodal tasks, especially ones with audio, vision and language modalities. Moreover, frequency-domain representations can be explored as inputs to ViTs, which can lead to greater efficiency than frequency-domain CNNs. For instance, ViTs can take separate patches from the DCT-Cb, DCT-Cr and DCT-Y components, bypassing the need to upsample DCT-Cb and DCT-Cr to match the dimensions of DCT-Y.
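A sketch of this direction, under assumed JPEG-style 8 \(\times\) 8 coefficient blocks and 4:2:0 chroma subsampling, could look as follows; the projection layout is hypothetical and merely illustrates how the chroma upsampling step can be bypassed.

```python
import torch
import torch.nn as nn

class DCTTokenizer(nn.Module):
    # Embed DCT-Y blocks and paired DCT-Cb/Cr blocks as separate token
    # streams instead of upsampling chroma to match luma dimensions.
    def __init__(self, dim=768):
        super().__init__()
        self.proj_y = nn.Linear(64, dim)   # one 8x8 luma block -> one token
        self.proj_c = nn.Linear(128, dim)  # paired Cb+Cr blocks -> one token

    def forward(self, dct_y, dct_cb, dct_cr):
        # dct_y: (B, Ny, 64); dct_cb/dct_cr: (B, Nc, 64) with Nc = Ny / 4
        tokens_y = self.proj_y(dct_y)
        tokens_c = self.proj_c(torch.cat([dct_cb, dct_cr], dim=-1))
        return torch.cat([tokens_y, tokens_c], dim=1)  # (B, Ny + Nc, dim)
```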
The combination of efficient high-resolution deep learning with other efficient deep learning methods, such as model compression [23], dynamic inference [53], collaborative inference [16] and continual inference [56], is an unexplored area of research. For instance, if the saliency detection network is a lightweight version of the task network, NUD can be combined with early exiting, where the output of the saliency detection network serves as a fast, but less accurate, early result. This is simple to implement in dense regression problems such as depth estimation and crowd counting, where the output of the task can be interpreted as a form of saliency.
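The following sketch illustrates this hypothetical combination for crowd counting; all names are assumed placeholders, and `nud_downsample` refers to the earlier sketch.

```python
def count_with_early_exit(image, saliency_net, task_net, out_hw,
                          deadline_hit):
    # The lightweight saliency network doubles as a coarse density
    # estimator, so its output can serve as a fast early exit when the
    # latency budget is spent.
    coarse = saliency_net(image)      # cheap pass; also the saliency map
    if deadline_hit():
        return coarse.sum()           # early, less accurate count
    small = nud_downsample(image, coarse, *out_hw)
    return task_net(small).sum()      # refined count on the NUD input
```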
Moreover, with the adoption of edge and cloud computing, transmission of high-resolution inputs to servers for processing is a real challenge. As a solution, efficient high-resolution deep learning methods can be combined with edge computing paradigms. For instance, the downsampled images in NUD and the compressed representations in TOIC can be transmitted instead of the original inputs. This would be a form of split computing (also known as collaborative intelligence) [6, 94], where the initial portion of the computation is performed on a resource-constrained end-device, and the compact intermediate representation is then transmitted to a server where the rest of the computation is carried out. A study using this idea for high-resolution images captured by drones is reported in [10].
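A minimal sketch of such a split, with all transport details abstracted away and tensor shapes assumed for illustration, could look like this (`nud_downsample` again refers to the earlier sketch):

```python
import torch

def device_side(image, saliency_net, out_hw=(768, 1024)):
    # On-device: lightweight saliency plus non-uniform downsampling,
    # then serialize the compact result for transmission.
    saliency = saliency_net(image)
    small = nud_downsample(image, saliency, *out_hw)
    return small.detach().cpu().half().numpy().tobytes()  # compact payload

def server_side(payload, task_net, shape=(1, 3, 768, 1024)):
    # Server: deserialize the payload and run the heavy task network.
    small = torch.frombuffer(bytearray(payload), dtype=torch.float16)
    return task_net(small.reshape(shape).float())
```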
Finally, we strongly recommend that future research on high-resolution deep learning methods begin by examining the datasets employed in previous approaches and incorporate relevant datasets into their experimental evaluation, as this facilitates more accurate comparisons among different methods. Furthermore, it is essential to employ evaluation metrics consistent with the relevant literature. Additionally, to facilitate a thorough comparison of methods and determine their positions on the accuracy-efficiency spectrum, it is crucial to report both efficiency and performance metrics. Moreover, metrics that are independent of hardware, such as FLOPs, are preferred for the evaluation of efficiency, whereas efficiency metrics tied to specific hardware, such as FPS, are difficult to reproduce consistently.