1 Introduction
Many modern devices such as smartphones, drones, augmented reality headsets, vehicles and other Internet of Things (IoT) devices are equipped with high-quality cameras that can capture high-resolution images and videos. With the help of image stitching techniques, camera arrays [126, 157], gigapixel acquisition robots [110] and whole-slide scanners [41], capture resolutions can be increased to billions of pixels (commonly referred to as gigapixels), such as the image depicted in Figure 1. One could attempt to define high-resolution based on the capabilities of the human visual system. However, many deep learning tasks rely on data captured by equipment that behaves very differently from the human eye, such as microscopes, satellites and infrared cameras. Furthermore, utilizing more detail than the eye can sense is beneficial in many deep learning tasks, such as in the applications discussed in Section 2. The amount of detail that can be captured, and that is useful if processed, varies greatly from task to task. Therefore, the definition of high-resolution is task-dependent. For instance, in image classification and computed tomography (CT) scan processing, a resolution of 512 \(\times\) 512 pixels is considered high [17, 37]. In visual crowd counting, datasets with High-Definition (HD) resolutions or higher are common [45], and whole-slide images (WSIs) in histopathology, which is the study of diseases of the tissues, or remote sensing data, which are captured by aircraft or satellites, can easily reach gigapixel resolutions [134, 135].
Moreover, with the constant advancement of hardware and methodologies, what the deep learning literature considers high-resolution has shifted over time. For instance, in the late 1990s, processing the 32 \(\times\) 32-pixel MNIST images with neural networks was an accomplishment [78], whereas in the early 2010s, the 256 \(\times\) 256-pixel images in ImageNet were considered high-resolution [76]. This trend can also be seen in the consistent increase of the average resolution of images in popular deep learning datasets, such as crowd counting [45] and anomaly detection [101] datasets. Therefore, the definition of high-resolution is also period-dependent. Based on the task- and period-dependence properties, it is clear that the term "high-resolution" is technical, not fundamental or universal. Therefore, instead of trying to derive such a definition, we shift our focus to resolutions that create technical challenges in deep learning at the time of this writing.
Using high-resolution images and videos directly as inputs to deep learning models creates challenges during both the training and inference phases. With the exception of fully-convolutional networks (FCNs), the number of parameters in deep learning models typically increases with larger input sizes. Moreover, the amount of computation, which is commonly measured in terms of floating point operations (FLOPs), and therefore inference/training time, as well as GPU memory consumption, increase with higher-resolution inputs, as shown in Figure 2. This issue is especially problematic in Vision Transformer (ViT) architectures, where the computation and memory costs of the self-attention mechanism scale quadratically with the number of input tokens, and thus with input resolution [37, 122]. These issues are exacerbated when training or inference needs to be done on resource-constrained devices, such as smartphones, that have limited computational capabilities compared to high-end computing equipment, such as workstations or servers.
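To make the quadratic scaling concrete, the following back-of-the-envelope sketch counts the dominant self-attention multiply-accumulates of a plain ViT at different resolutions; the 16 \(\times\) 16 patch size and embedding dimension of 768 are assumptions (ViT-Base-like values), not the settings of any specific method discussed here.

```python
# Rough self-attention cost of a plain ViT; patch size and embedding
# dimension are assumptions (ViT-Base-like values).
def attention_flops(height, width, patch=16, dim=768):
    n = (height // patch) * (width // patch)  # number of tokens
    # QK^T scores plus the attention-weighted sum of values:
    # roughly 2 * n^2 * dim multiply-accumulates per layer.
    return 2 * n * n * dim

for h, w in [(512, 512), (2048, 2048), (8192, 8192)]:
    print(f"{h}x{w}: {attention_flops(h, w):.2e} FLOPs per layer")
```

Quadrupling the side length multiplies the token count by 16 and the attention cost by 256, which is why high-resolution ViT methods target this term specifically.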
Even though methods such as model parallelism can be used to split the model between multiple GPUs during both the training [113, 146] and inference [39] phases, and thus avoid memory and latency issues, these methods require a large amount of resources, such as a large number of GPUs and servers, which can incur high costs, especially when working with extreme resolutions such as gigapixel images. Furthermore, in many applications, such as self-driving cars and drone image processing, there is a limit on the hardware that can be mounted, and offloading the computation to external servers is not always possible because of the unreliability of the network connection due to movement and the time-critical nature of the application. Therefore, the most common approach for deep learning training and inference is to load the full model on each single GPU instance. Multi-GPU setups are instead typically used to speed up training by increasing the overall batch size, to test multiple sets of hyper-parameters in parallel, or to distribute the inference load. Consequently, in many cases, there is an effective maximum resolution that can be processed by deep learning models. As an example, the maximum resolution for inference using SASNet [116], the state-of-the-art model for crowd counting on the Shanghai Tech dataset [162] at the time of this writing, is around 1024 \(\times\) 768 (less than HD) on Nvidia 2080 Ti GPUs, which have 11 GB of video memory.
Although newer generations of GPUs are getting faster and have more memory available, the resolution of images and videos captured by devices is also increasing. Figure 3 shows this trend across recent years for multiple types of devices. Therefore, the aforementioned issues will likely persist even with advances in computation hardware technology. Furthermore, current imaging technologies are nowhere near the physical limits of image resolution, which is estimated to be in the petapixel range [11].
Whether or not capturing and processing a higher resolution leads to improvements depends on the particular problem at hand. For instance, in image classification, it is unlikely that increasing the resolution of images of objects or animals to gigapixels would reveal more beneficial details and improve the accuracy. On the other hand, if the goal is to count the total number of people in scenes such as the one presented in Figure 1, using an HD resolution instead of gigapixels would mean that several people could be represented by a single pixel, which significantly increases the error. Similarly, it has been shown that using higher resolutions in histopathology can lead to better results [89].
Assuming there is an effective maximum resolution for a particular problem due to hardware limitations or latency requirements, there are two simple baseline approaches for processing the original captured inputs, which are commonly used in the deep learning literature [21, 30, 102]. The popularity of these baselines can be attributed to the simplicity of their implementation. The first one is to resize (downsample) the original input to the desired resolution; however, this leads to lower accuracy if details important for the problem at hand are lost. This approach is called uniform downsampling (UD), since the quality is reduced uniformly throughout the image. The second approach is to cut the original input into smaller patches that each fit within the maximum resolution, process the patches independently, and aggregate the results, for instance, by summing them up for regression problems or majority voting for classification problems. We call this approach cutting into patches (CIP). There are two issues with this approach. First, many deep learning models rely on global features, which will be lost since features extracted from each patch are not shared with other patches, leading to decreased accuracy. For instance, crowd counting methods typically rely heavily on global information such as perspective or illumination [45, 116], and in object detection, objects near the boundaries may be split between multiple patches. Second, since multiple inference passes are performed, that is, one pass for each patch, inference takes much longer. This issue is worse when patches overlap.
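As a concrete illustration, the following minimal PyTorch sketch implements both baselines for a dense regression task such as crowd counting; `model`, the tensor layout and the density-map output are assumptions for illustration, not any specific published implementation.

```python
import torch
import torch.nn.functional as F

def count_ud(model, image, max_h, max_w):
    """UD: uniformly downsample the full image to the maximum resolution."""
    small = F.interpolate(image, size=(max_h, max_w),
                          mode="bilinear", align_corners=False)
    return model(small).sum()  # sum of the predicted density map = count

def count_cip(model, image, rows, cols):
    """CIP: split into a grid of patches, one forward pass each, sum counts."""
    _, _, h, w = image.shape
    total = torch.zeros(())
    for i in range(rows):
        for j in range(cols):
            patch = image[:, :, i * h // rows:(i + 1) * h // rows,
                                j * w // cols:(j + 1) * w // cols]
            total = total + model(patch).sum()  # one pass per patch
    return total
```

For classification tasks, majority voting over per-patch predictions would replace the sum in the CIP aggregation step.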
To highlight these issues, we test the two baseline approaches (UD and CIP) on the Shanghai Tech Part B dataset [162] for crowd counting, which contains images of size 1024 \(\times\) 768 pixels, as well as the PANDA dataset [144], which contains gigapixel images. However, we resize the gigapixel images to 2560 \(\times\) 1440 in order to comply with our hardware limitations. We reduce the original image size by factors of 4 and 16 and measure the mean absolute error (MAE) for both baselines. To test UD, we take a pre-trained SASNet model [116] and fine-tune it for the target input size using the AdamW optimizer [88]; note that the original SASNet paper uses the Adam optimizer [71]. We train the model for 100 epochs with a batch size of 12 per GPU instance using 3 \(\times\) Nvidia A6000 GPUs for the Shanghai Tech Part B experiments, and a batch size of 1 for the PANDA experiments. To test CIP, we cut the original image into 4 and 16 patches, obtain the count for each patch using the pre-trained SASNet mentioned above, and aggregate the results by summing up the predicted counts, since we empirically found that fine-tuning does not improve the accuracy of cutting into patches.
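A minimal sketch of this UD fine-tuning loop is given below; the learning rate and the density-map MSE loss are assumptions, while the AdamW optimizer and the epoch count follow the setup described above.

```python
import torch

def finetune_ud(model, train_loader, epochs=100, lr=1e-5):
    # Hedged sketch of the UD fine-tuning described in the text; lr and the
    # MSE density-map loss are assumed, AdamW and 100 epochs follow the text.
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    criterion = torch.nn.MSELoss()
    for _ in range(epochs):
        for images, density_maps in train_loader:  # inputs already resized
            optimizer.zero_grad()
            loss = criterion(model(images), density_maps)
            loss.backward()
            optimizer.step()
    return model
```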
The results of these experiments are shown in Table 1. It can be observed that uniform downsampling significantly increases the error compared to processing the original input size. Keep in mind that even though the increase in error is not as drastic with cutting into patches, and there are even improvements in some cases, the inference time of this approach increases by the same factor (i.e., 4 or 16) when using the effective maximum resolution possible for the hardware. This is due to the fact that patches cannot be processed in parallel, as the entire hardware is required to process a single patch. Indeed, in the PANDA experiments, which are close to the maximum effective resolution of our hardware, we can see this drastic increase in computation time when using CIP compared to UD.
Since these baseline approaches are far from ideal, several alternative methods have been proposed in the literature in recent years to improve accuracy and speed while complying with the maximum resolution limitation caused either by memory constraints or by speed requirements. The goal of this survey is to summarize and categorize these contributions. To the best of our knowledge, no other survey on the topic of high-resolution deep learning exists. However, there are some surveys that include aspects relevant to this topic. A survey on methods for reducing the computational complexity of Transformer architectures is provided in [122], which discusses the issues related to the quadratic time and memory complexity of self-attention and analyzes various aspects of efficiency, including memory footprint and computational cost. While reducing the computational complexity of Transformer models can contribute to efficient processing of high-resolution inputs, in this survey we only include Vision Transformer methods that explicitly focus on high-resolution images. Some application-specific surveys include high-resolution datasets and methods that operate on such data. For instance, a survey on deep learning for histopathology, which mentions the challenges of processing giga-resolution WSIs, is provided in [118]; a survey of methods that achieve greater spatial resolution in CT is provided in [111], which highlights the improved diagnostic accuracy of ultra-high-resolution CT and briefly discusses deep learning methods for noise reduction and reconstruction; a survey on crowd counting, where many of the available datasets are high-resolution, is provided in [45]; a survey on deep learning methods for land cover classification and object detection in high-resolution remote sensing imagery is provided in [161]; and a survey on deep learning-based change detection in high-resolution remote sensing images is provided in [66].
It is important to mention that some methods operate on high-resolution inputs, yet do not make any effort to address the aforementioned challenges. For instance, multi-column (also known as multi-scale) networks [45, 116] incorporate multiple columns of layers in their architecture, where each column is responsible for processing a specific scale, as shown in Figure 4. However, since the columns process the same resolution as the original input, most of these methods in fact require even more memory and computation than processing the original scale alone. The primary goal of these methods is instead to increase accuracy by taking into account the scale variations that occur in high-resolution images, although there are some multi-scale methods that improve both accuracy and efficiency [15, 138, 164]. Therefore, these methods do not fall within the scope of this survey, unless they explicitly address the efficiency aspect for high-resolution inputs. ZoomCount [109], Locality-Aware Crowd Counting [167], RAZ-Net [86] and Learn to Scale [149] are all examples of multi-scale methods in crowd counting; DMMN [57] and KGZNet [139] are examples in medical image processing.
The primary purpose of this survey is to collect and describe methods in the deep learning literature that can be used in situations where the high resolution of input images and videos creates the aforementioned technical challenges regarding memory, computation and time. The rest of this paper is organized as follows: Section 2 lists applications where high-resolution images and videos are processed using deep learning. Section 3 categorizes efficient methods for high-resolution deep learning into five general categories and provides several examples for each category. This section also briefly discusses alternative approaches for solving the memory and processing time issues caused by high-resolution inputs. Section 4 lists existing high-resolution datasets for various deep learning problems and provides details for each of them. Section 5 discusses the advantages and disadvantages of using efficient high-resolution methods belonging to different categories and provides recommendations about which method to use in different situations. Finally, Section 6 concludes the paper by summarizing the current state and trends in high-resolution deep learning, as well as suggestions for future research. The code for the experiments conducted in this survey is available at https://gitlab.au.dk/maleci/high-resolution-deep-learning.
5 Discussion and Open Issues
Each of the approaches introduced in Section 3 has its advantages and disadvantages and is useful in certain situations, which are summarized in Table 5. NUD (Section 3.1) works well in cases where the salient area is small compared to the entire image, and thus it is possible to sample many pixels from such areas. This requirement is satisfied in gaze estimation or object detection problems. Our conjecture is that it would also work well in problems such as hand gesture detection and non-cropped facial expression recognition, although these tasks have not yet been explored in the literature in combination with NUD. However, when the salient area is large, for instance, densely populated scenes in crowd counting or a scene fully covered with objects in object detection, the quality gain obtained by sampling from salient areas will be negligible, and the result of NUD will be similar to that of uniform downsampling [8].
Similarly, SZS methods (Section 3.2) require the salient area to be small, otherwise they zoom everywhere and save little time and computation. This also means that the effectiveness of NUD and SZS methods may vary based on the specific input. For instance, the more people there are in an image processed for crowd counting, or the more tumors there are in cancer detection, the less efficient such methods will be, unless there are specific safeguards that prevent them from performing an enormous number of computations, such as in GigaDet [18], which processes at most K patch candidates.
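A sketch of such a safeguard is shown below: regardless of how crowded the scene is, only the K highest-scoring patch candidates are processed at high resolution. The scoring setup and names are illustrative assumptions, not GigaDet's actual implementation.

```python
import torch

def select_topk_candidates(patch_scores, k):
    # patch_scores: (N,) saliency/objectness score per coarse patch.
    # Return the indices of at most k patches to process at high resolution,
    # capping the cost even when the whole scene is salient.
    k = min(k, patch_scores.numel())
    return torch.topk(patch_scores, k=k).indices
```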
Furthermore, NUD methods are not effective when the target resolution is much smaller than the input resolution, for instance, when gigapixel inputs need to be resized down to HD, as this would result in highly distorted images, which makes it difficult for the task DNN to perform well. Even when the gap between the two resolutions is not extremely large, NUD can lead to severe distortions in some cases; for instance, it may completely distort the shape of the edges of a gastrointestinal lesion, making it difficult for the task network to detect useful features. This may reduce accuracy despite the fact that more pixels are sampled from salient areas. As explained in Section 3.1, some methods try to mitigate the distortion by using structured grids. However, this may limit the benefits obtained by NUD.
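To make the idea concrete, the following sketch implements a separable, saliency-driven sampler using PyTorch's grid_sample; the precomputed saliency map, the inverse-CDF sampling scheme and all parameter choices are illustrative assumptions rather than any specific published method.

```python
import torch
import torch.nn.functional as F

def nud_downsample(image, saliency, out_h, out_w, eps=0.1):
    # image: (1, C, H, W); saliency: (1, 1, H, W), non-negative.
    sal = saliency + eps                 # keep some sampling mass everywhere
    row_w = sal.sum(dim=3).flatten()     # marginal saliency per row, (H,)
    col_w = sal.sum(dim=2).flatten()     # marginal saliency per column, (W,)

    def inverse_cdf(weights, n):
        # Place n sample coordinates so their density follows `weights`.
        cdf = torch.cumsum(weights, 0)
        cdf = cdf / cdf[-1]
        targets = torch.linspace(0.0, 1.0, n)
        idx = torch.searchsorted(cdf, targets)
        idx = idx.clamp(max=weights.numel() - 1)
        return idx.float() / (weights.numel() - 1) * 2.0 - 1.0  # to [-1, 1]

    ys, xs = inverse_cdf(row_w, out_h), inverse_cdf(col_w, out_w)
    grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
    grid = torch.stack([grid_x, grid_y], dim=-1).unsqueeze(0)  # (1,H',W',2)
    return F.grid_sample(image, grid, align_corners=True)
```

Rows and columns with higher marginal saliency receive proportionally more output samples, while the separable structure keeps the grid axis-aligned and limits distortion, mirroring the structured-grid mitigation mentioned above.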
In addition, since NUD enlarges some parts of the image compared to uniform downsampling, other areas of the resulting image will be smaller than they would be with uniform downsampling. Thus, if the saliency map is not of high quality, unimportant areas will be enlarged and the ones important for the final task will shrink, resulting in accuracy loss. This is directly at odds with the requirement that the saliency detection method should be low-overhead, creating another trade-off that needs to be carefully balanced. Moreover, as explained in Section 3.1, some variations of NUD require an external supervision signal or regularization term to train the saliency detection network, which can be difficult to design. In NUD or SZS methods that detect saliency in videos based on the results obtained from previous frames, such as SALISA [8] and REMIX [67], when the difference between subsequent frames is high, the method needs to be reset to processing the entire high-resolution image. When this occurs frequently, the obtained benefits are diminished.
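This reuse-and-reset pattern can be summarized by a sketch along the following lines; the frame-difference threshold and the `saliency_net` and `task_net` placeholders are assumptions, not the actual SALISA or REMIX pipelines, and `nud_downsample` refers to the sketch above.

```python
def process_video(frames, saliency_net, task_net, out_hw, threshold=0.05):
    # Reuse the saliency map across frames and recompute it (reset) only
    # when consecutive frames differ too much; frequent resets erode the
    # savings, as discussed above.
    prev, saliency, outputs = None, None, []
    for frame in frames:  # each frame: (1, C, H, W), values in [0, 1]
        changed = prev is None or (frame - prev).abs().mean() > threshold
        if changed:
            saliency = saliency_net(frame)  # full-resolution saliency pass
        small = nud_downsample(frame, saliency, *out_hw)
        outputs.append(task_net(small))
        prev = frame
    return outputs
```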
As mentioned in Section 3.3, LSNs need to be designed, trained and well optimized for the specific problem at hand, which is not an easy task. Furthermore, since LSNs produce an output for each scanned area of the input, they are suitable for tasks where the output has the form of a map, such as dense classification or dense regression problems. Moreover, the scanning nature of LSNs means that all areas of the image are treated similarly; therefore, they are better suited for situations where there is no perspective and objects of the same type have the same size regardless of their location, such as WSIs and remote sensing, as opposed to surveillance and crowd counting, where people close to the camera appear larger than people far away.
Since TOIC methods extract representations that are both compressed and suitable for the task at hand, they often need to be tailored to the specific problem, which requires substantial domain knowledge. Both Slide Graph [89] and MCAT [20], presented in Section 3.4, are based on domain knowledge about the cellular structure of tissues and the biological function of genes, respectively. Almost all frequency-domain DNNs try to preserve the architecture of the CNNs they are based on. However, since the interpretation of features in the frequency domain is different, and such features have certain properties, such as being non-negative, it might be better to customize the architectural elements for the frequency domain, as CS-Fnet [90] does.
Most high-resolution Vision Transformer methods try to reduce the quadratic cost of self-attention to linear, and then compensate for the accuracy loss by learning data transformations using convolutions. To keep the overhead of the convolutions low, depth-wise convolution is typically used. Additionally, most high-resolution ViTs utilize a multi-scale architecture in order to capture features of various scales. High-resolution ViTs are more general-purpose than other high-resolution deep learning methods and are often used for a large variety of tasks.
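A typical instance of this pattern, sketched under assumed shapes, applies a depth-wise convolution over the spatial token grid to reintroduce local inductive bias at low cost; this is an illustrative module, not the block design of any particular method.

```python
import torch.nn as nn

class DepthwiseTokenMixer(nn.Module):
    # Cheap local mixing over the token grid: groups=dim makes the
    # convolution depth-wise, so its cost grows linearly with dim.
    def __init__(self, dim):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=3, padding=1,
                                groups=dim)

    def forward(self, tokens, h, w):
        # tokens: (B, N, dim) with N = h * w
        b, n, d = tokens.shape
        x = tokens.transpose(1, 2).reshape(b, d, h, w)
        x = self.dwconv(x)
        return x.flatten(2).transpose(1, 2)  # back to (B, N, dim)
```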
Quantitative comparison of the various methods is a serious challenge in efficient high-resolution deep learning. As methods available in the literature rarely provide code, comparing them against the same benchmark requires reproducing them from scratch, which demands massive effort. The next best approach is to compare these methods based on results reported on the same benchmark. However, methods rarely use the same datasets and metrics in their experiments. To shed some light on these challenges, consider Table 6 as an example. Although a single common benchmark among these methods does not exist, several pairs include experiments on the same dataset. However, upon further inspection, it is not possible to make fair comparisons. GigaDet and REMIX both use the PANDA dataset, and ViT and GG-Transformer both use COCO; however, both pairs belong to the same category of methods, so there is little benefit in comparing them. SALISA and MMNet both use ImageNet VID, and they do not belong to the same category of methods. However, SALISA uses GFLOPs as its efficiency metric, which is hardware-agnostic, while MMNet evaluates efficiency using frames-per-second (FPS), which is hardware-dependent. Slide Graph, MCAT and HIPT all use TCGA-BRCA; however, neither MCAT nor HIPT reports any efficiency metrics. Finally, Fast ScanNet and [123] both use CAMELYON16; however, Fast ScanNet reports performance using the AUC and FROC metrics, while [123] reports performance in terms of c-Index and does not measure efficiency. Due to the trade-off between efficiency and performance, both metrics must be taken into account to properly compare methods and draw meaningful conclusions.
6 Conclusion and Outlook
Processing high-resolution images and videos with deep learning is crucial in various domains of science and technology, yet few methods exist that address the associated computational challenges. Among existing methods, the trend of designing solutions specifically for the problem at hand is clearly visible. This can be an issue in tasks for which high-resolution datasets are not available. As with model compression, both modifying existing methods and designing an efficient high-resolution method from scratch are viable approaches.
Efficient high-resolution deep learning is in its infancy and there is a lot of room for improvement. For instance, a number of attention-free MLP-based methods have recently been proposed as lightweight alternatives to Transformers [51], which try to mimic the global receptive field of Transformers without the self-attention mechanism. Exploiting such architectures for efficient processing of high-resolution inputs would be an interesting research direction. Furthermore, the multimodal co-attention in MCAT [20] can be applied to many other multimodal tasks, especially ones with audio, vision and language modalities. Moreover, frequency-domain representations can be explored as inputs to ViTs, which can lead to greater efficiency than frequency-domain CNNs. For instance, ViTs can take separate patches from the DCT-Cb, DCT-Cr and DCT-Y components, bypassing the need to upsample DCT-Cb and DCT-Cr to match the dimensions of DCT-Y.
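A sketch of this direction, under assumed JPEG-style 8 \(\times\) 8 coefficient blocks and 4:2:0 chroma subsampling, could look as follows; the projection layout is hypothetical and merely illustrates how the chroma upsampling step can be bypassed.

```python
import torch
import torch.nn as nn

class DCTTokenizer(nn.Module):
    # Embed DCT-Y blocks and paired DCT-Cb/Cr blocks as separate token
    # streams instead of upsampling chroma to match luma dimensions.
    def __init__(self, dim=768):
        super().__init__()
        self.proj_y = nn.Linear(64, dim)   # one 8x8 luma block -> one token
        self.proj_c = nn.Linear(128, dim)  # paired Cb+Cr blocks -> one token

    def forward(self, dct_y, dct_cb, dct_cr):
        # dct_y: (B, Ny, 64); dct_cb/dct_cr: (B, Nc, 64) with Nc = Ny / 4
        tokens_y = self.proj_y(dct_y)
        tokens_c = self.proj_c(torch.cat([dct_cb, dct_cr], dim=-1))
        return torch.cat([tokens_y, tokens_c], dim=1)  # (B, Ny + Nc, dim)
```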
The combination of efficient high-resolution deep learning with other efficient deep learning methods, such as model compression [23], dynamic inference [53], collaborative inference [16] and continual inference [56], is an unexplored area of research. For instance, if the saliency detection network is a lightweight version of the task network, NUD can be combined with early exiting, where the output of the saliency detection network serves as a fast, but less accurate, early result. This is simple to implement in dense regression problems such as depth estimation and crowd counting, where the output of the task can be interpreted as a form of saliency.
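The following sketch illustrates this hypothetical combination for crowd counting; all names are assumed placeholders, and `nud_downsample` refers to the earlier sketch.

```python
def count_with_early_exit(image, saliency_net, task_net, out_hw,
                          deadline_hit):
    # The lightweight saliency network doubles as a coarse density
    # estimator, so its output can serve as a fast early exit when the
    # latency budget is spent.
    coarse = saliency_net(image)      # cheap pass; also the saliency map
    if deadline_hit():
        return coarse.sum()           # early, less accurate count
    small = nud_downsample(image, coarse, *out_hw)
    return task_net(small).sum()      # refined count on the NUD input
```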
Moreover, with the adoption of edge and cloud computing, transmission of high-resolution inputs to servers for processing is a real challenge. As a solution, efficient high-resolution deep learning methods can be combined with edge computing paradigms. For instance, the downsampled images in NUD and the compressed representations in TOIC can be transmitted instead of the original inputs. This would be a form of split computing (also known as collaborative intelligence) [6, 94], where the initial portion of the computation is performed on a resource-constrained end-device, and the compact intermediate representation is then transmitted to a server where the rest of the computation is carried out. A study using this idea for high-resolution images captured by drones is reported in [10].
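A minimal sketch of such a split, with all transport details abstracted away and tensor shapes assumed for illustration, could look like this (`nud_downsample` again refers to the earlier sketch):

```python
import torch

def device_side(image, saliency_net, out_hw=(768, 1024)):
    # On-device: lightweight saliency plus non-uniform downsampling,
    # then serialize the compact result for transmission.
    saliency = saliency_net(image)
    small = nud_downsample(image, saliency, *out_hw)
    return small.detach().cpu().half().numpy().tobytes()  # compact payload

def server_side(payload, task_net, shape=(1, 3, 768, 1024)):
    # Server: deserialize the payload and run the heavy task network.
    small = torch.frombuffer(bytearray(payload), dtype=torch.float16)
    return task_net(small.reshape(shape).float())
```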
Finally, we strongly recommend that future research on high-resolution deep learning methods begin by examining the datasets employed in previous approaches and incorporate relevant datasets into their experimental evaluation, as this facilitates more accurate comparisons among different methods. Furthermore, it is essential to employ evaluation metrics consistent with the relevant literature. Additionally, to facilitate a thorough comparison of methods and determine their positions on the accuracy-efficiency spectrum, it is crucial to report both efficiency and performance metrics. Moreover, metrics that are independent of hardware, such as FLOPs, are preferred for the evaluation of efficiency, whereas efficiency metrics tied to specific hardware, such as FPS, are difficult to reproduce consistently.