Full Paper, MIDL 2024 submission (Under Review), 2024

Mohammed Adnan¹ (adnan.ahmad@ucalgary.ca), work done at Roche
Qinle Ba² (qinle.ba@roche.com), corresponding author
Nazim Shaikh² (nazim.shaikh@roche.com)
Shivam Kalra² (shivam.kalra@roche.com)
Satarupa Mukherjee² (satarupa.mukherjee@roche.com)
Auranuch Lorsakul² (auranuch.lorsakul@roche.com)

¹ University of Calgary, Canada
² Roche Sequencing Solutions, Santa Clara, CA, USA

Structured Model Pruning for Efficient Inference in Computational Pathology

Abstract

Recent years have seen significant efforts to adopt Artificial Intelligence (AI) in healthcare for various use cases, from computer-aided diagnosis to ICU triage. However, the size of AI models has been growing rapidly due to scaling laws and the success of foundation models, which makes it increasingly challenging to leverage advanced models in practical applications. It is thus imperative to develop efficient models, especially for deploying AI solutions under resource constraints or with time sensitivity. One potential solution is model compression, a set of techniques that remove less important model components or reduce parameter precision in order to reduce the computational demand of a model. In this work, we demonstrate that model pruning, as a model compression technique, can effectively reduce inference cost for computational and digital pathology based analysis with a negligible loss of analysis performance. To this end, we develop a methodology for pruning the U-Net-style architectures widely used in biomedical imaging, with which we evaluate multiple pruning heuristics on nuclei instance segmentation and classification, and empirically demonstrate that pruning can compress models by at least 70% with a negligible drop in performance.

keywords:

Model Pruning, Digital and Computational Pathology, Nuclei Segmentation, Nuclei Classification

1 Introduction

With the recent trend to increase the capacity and thus performance of deep learning models in biomedical imaging [Hörst et al.(2023)Hörst, Rempe, Heine, Seibold, Keyl, Baldini, Ugurel, Siveke, Grünwald, Egger, and Kleesiek, Van der Sluijs et al.(2024)Van der Sluijs, Bhaskhar, Rubin, Langlotz, and Chaudhari, Deng et al.(2023)Deng, Cui, Liu, Yao, Remedios, Bao, Landman, Wheless, Coburn, Wilson, et al.], it is increasingly challenging to fully leverage these effective yet computationally expensive models in real-world applications such as clinical settings. Training advanced models and running inference at scale require dedicated computing resources, especially high-end GPUs. However, the available GPU resources are unevenly distributed globally and are especially rare in underdeveloped regions (https://www.top500.org/). In addition, increasing computing needs require regular upgrades of IT infrastructure, imposing hurdles for AI adoption in healthcare institutions [Zhang et al.(2022)Zhang, Xing, Zou, and Wu]. High computing demand is especially challenging for digital and computational pathology (referred to as "DP") based applications: due to the large size of each whole-slide image (mega pixels), analysis is typically performed by first dividing an image into thousands of small image patches and then processing each patch with one or more models, followed by assembling patch-wise results into whole-slide readouts [Janowczyk and Madabhushi(2016)]. As such, model inference can be time-consuming for a single image, let alone the large volumes of images in large medical centers [Ardon et al.(2023)Ardon, Klein, Manzo, Corsale, England, Mazzella, Geneslaw, Philip, Ntiamoah, Wright, et al.].

Model compression techniques reported in computer vision have shown remarkable success in reducing computation costs. We classify them into two broad categories: (1) techniques that require a predefined target architecture that is more efficient than the original model, such as knowledge distillation (KD) [Hinton et al.(2015)Hinton, Vinyals, and Dean] and neural architecture search (NAS) [Ren et al.(2021)Ren, Xiao, Chang, Huang, Li, Chen, and Wang, Baymurzina et al.(2022)Baymurzina, Golikov, and Burtsev]; (2) techniques that do not require such predefined architectures, such as model pruning [Cheng et al.(2023)Cheng, Zhang, and Shi] and quantization [Gholami et al.(2021)Gholami, Kim, Dong, Yao, Mahoney, and Keutzer]. Many of these techniques are complementary and thus multiple techniques can be combined to achieve the best compression performance. For example, KD or NAS can be performed first, followed by pruning and then by quantization. In this study, we investigate pruning because of its versatility. (1) Unlike KD or NAS, which require designing or selecting a smaller network from a larger network, pruning does not rely on such a network dependency and is thus flexibly applicable, including after KD or NAS, without manual architectural modifications. (2) Unlike quantization, which reduces parameter precision but keeps model architectures, pruning provides an opportunity to shrink architectures. (3) Pruning achieves higher compression rates than typical KD methods [Javed et al.(2023)Javed, Mahmood, Qaiser, and Werghi, Cho and Hariharan(2019)] (see Appendix A for KD in DP) and thus applying pruning on top of KD may further improve compression performance. In DP, [Choudhary et al.(2021)Choudhary, Mishra, Goswami, and Sarangapani] first studied structured filter pruning for breast cancer classification, but they assessed only one pruning approach, the L1-norm pruner. [Mahbod et al.(2022)Mahbod, Entezari, Ellinger, and Saukh] applied unstructured magnitude pruning for nuclei instance segmentation, which nominally "eliminates" a proportion of weights; however, all the weight matrices still go through forward and backward passes, so the actual computation is not reduced. To the best of our knowledge, no earlier works have systematically evaluated model pruning in DP for reducing computation costs. Moreover, no studies have investigated how to effectively prune the U-Net style encoder-decoder architectures [Ronneberger et al.(2015)Ronneberger, Fischer, and Brox] widely used in biomedical imaging. Such architectures can be challenging to prune: because of the shortcut connections between encoder and decoder layers, and between residual blocks in residual U-Nets [Alom et al.(2019)Alom, Yakopcic, Hasan, Taha, and Asari], pruning one layer requires manipulating the shortcut-connected layers as well, potentially impacting many other layers. In this study, we propose a framework to handle such complex scenarios. Our contributions are as follows:

1. We assess structured filter pruning with recently proposed heuristics and strategies for compressing deep models for DP based analysis across two vision tasks, (1) nuclei instance segmentation and classification and (2) tile-level tissue classification.
2. We propose a pruning approach for U-Net style architectures, a prevalent model design in biomedical imaging with plain convolution network or residual networks as the encoder.
3. Using the proposed method, we compare different pruning heuristics on state-of-the-art models and show that pruning can effectively compress model size and reduce the latency of model inference with no or a minor decrease in model performance.

2 Background

Model Compression.

Numerous strategies have been explored in the literature for model compression. The four major categories are pruning, knowledge distillation (KD), quantization and neural architecture search (NAS). Pruning refers to removing relatively 'less important' model components based on certain heuristics to measure importance [Blalock et al.(2020)Blalock, Ortiz, Frankle, and Guttag]. Quantization refers to adopting a lower-precision representation for model parameters, which reduces the memory footprint, for example by reducing the weight/gradient data type from float32 to int8. KD [Gou et al.(2021)Gou, Yu, Maybank, and Tao] involves a larger model (teacher) and a smaller model (student). The teacher model is trained first and later used to guide the training of the smaller student model via a distillation loss. NAS [Ren et al.(2021)Ren, Xiao, Chang, Huang, Li, Chen, and Wang, Baymurzina et al.(2022)Baymurzina, Golikov, and Burtsev] aims at automatically finding an efficient model architecture by searching through a design space. Many of these techniques, including pruning, are complementary and can be combined. Here we focus on model pruning [Blalock et al.(2020)Blalock, Ortiz, Frankle, and Guttag], which is further categorized into unstructured, semi-structured (see notes in Appendix B) and structured pruning, based on the sparsity patterns that a model pruning strategy targets.

Unstructured Pruning.

Unstructured pruning determines the importance of each weight parameter and prunes the relatively unimportant ones by replacing them with zero values. It is straightforward to implement and usually does not harm model performance. However, current hardware is not optimized for unstructured sparse matrices, and thus inference speedup cannot be achieved without customized software/hardware systems. Therefore, we do not focus on such pruning methods.

Structured Pruning.

Structured pruning follows a fixed pattern. For example, an entire filter of a convolution layer or an entire layer can be removed. Contrary to unstructured pruning, since the entire layer [Ding et al.(2021)Ding, Zhang, Ma, Han, Ding, and Sun, Fu et al.(2022a)Fu, Yang, Yuan, Li, Wan, Krishnamoorthi, Chandra, and Lin] or filter [Li et al.(2016)Li, Kadav, Durdanovic, Samet, and Graf, Liu et al.(2017)Liu, Li, Shen, Huang, Yan, and Zhang] is removed, model speedup is achieved. Various pruning heuristics have been proposed. Here, we focus on the highly cited L1/L2 pruner [Li et al.(2016)Li, Kadav, Durdanovic, Samet, and Graf] and network slimmer [Liu et al.(2017)Liu, Li, Shen, Huang, Yan, and Zhang].
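The difference between the two sparsity patterns can be illustrated with a minimal NumPy sketch. The weight tensor, its shape and the 50% pruning ratio below are illustrative assumptions and are not taken from the models in this study.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical conv weight with layout (out_channels, in_channels, kH, kW).
weight = rng.normal(size=(8, 4, 3, 3))

# Unstructured pruning: zero the 50% of weights with the smallest magnitude.
threshold = np.quantile(np.abs(weight), 0.5)
unstructured = np.where(np.abs(weight) >= threshold, weight, 0.0)

# Structured pruning: remove the 50% of filters with the smallest L1 norm.
filter_l1 = np.abs(weight).sum(axis=(1, 2, 3))
keep = np.sort(np.argsort(filter_l1)[weight.shape[0] // 2:])
structured = weight[keep]

print(unstructured.shape)  # unchanged: a dense kernel still processes every weight
print(structured.shape)    # fewer filters: the layer itself is smaller
```

The unstructured result keeps the original tensor shape, so a dense convolution performs the same amount of work, whereas the structured result is a genuinely smaller layer.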

Pruning U-Net like Model Architectures.

Prior work focused on classification models [Liu et al.(2017)Liu, Li, Shen, Huang, Yan, and Zhang] for vision tasks. However, U-Net-based architectures are widely used in biomedical imaging due to their optimized design to encode information at multiple scales. Pruning U-Net style architectures is challenging due to shortcut connections between the encoder and decoder. In this work, we address this challenge by applying our pruning approach to HoverNet [Graham et al.(2018)Graham, Vu, Ahmed Raza, Azam, Tsang, Kwak, and Rajpoot], a widely used architecture for nuclei segmentation and classification.

HoverNet.

[Graham et al.(2018)Graham, Vu, Ahmed Raza, Azam, Tsang, Kwak, and Rajpoot] proposed HoverNet for nuclei instance segmentation and classification, which leverages the instance-aware information encoded by the vertical and horizontal distances of nuclear pixels to their centres of mass. The corresponding distance maps are used to separate clustered nuclei, enabling accurate instance segmentation. For each segmented instance, the network predicts its nucleus type via a dedicated decoder branch. HoverNet has three branches, predicting nuclei segmentation, horizontal/vertical distance and nuclei classification.

3 Methodology

Figure 1: Overview of HoverNet and the proposed model pruning schema. Top: HoverNet design with three identical decoder branches. Middle: dependencies between the last convolution (conv) layer of each residual block and the skip-connected decoder layer. Horizontal red bars denote pruned conv filters (output dimension) that are pruned with the matching channel indices (vertical bars) in interdependent layers. Bottom: a 2D view of pruning consecutive conv layers. Here, each grid cell represents a 2D kernel (e.g. 3x3). The rows denote the output dimension (number of filters) and the columns denote the input dimension, which is equal to the number of feature maps of the layer input. Pruning conv filters from one layer (i.e., red filters in the first block) requires removing the kernels from the input dimension in the consecutive conv layer (i.e., equivalent red filters in the second block). Similarly, green filters illustrate pruning of the next two conv layers.

Pruning Heuristics and Strategies.

Pruning heuristics refer to any approaches and/or metrics that determine the relative importance of target model components subjected to model pruning. By "pruning strategies" we refer to additional pruning approaches applied on top of heuristics.

1. L1 Pruner: [Li et al.(2016)Li, Kadav, Durdanovic, Samet, and Graf] removes the less important filters in a convolution layer based on the L1-norm of each filter, optionally followed by fine-tuning to recover performance.
2. L2 Pruner: Similar to the L1 pruner, except that it uses the L2-norm as the pruning heuristic to select less important filters to remove. [Li et al.(2016)Li, Kadav, Durdanovic, Samet, and Graf] found L1 and L2 norms to behave similarly.
3. Network Slimmer: Proposed by [Liu et al.(2017)Liu, Li, Shen, Huang, Yan, and Zhang], it applies L1-regularization on the scaling factors of batch normalization layers [Ioffe and Szegedy(2015)] during baseline model training to induce sparsity of the scaling factors. Such a sparsity-inducing regularization pushes the scaling factors corresponding to the less important feature maps close to zero. After training, filters with smaller scaling factors are selected for removal. It is therefore important to note that the baseline for network slimmer is trained differently from that used for the L1-norm and L2-norm pruners.
4. Iterative Magnitude Pruning (IMP): A pruning strategy [Frankle et al.(2019)Frankle, Dziugaite, Roy, and Carbin] to stabilize the originally proposed Lottery Ticket Hypothesis [Frankle and Carbin(2018)]. IMP prunes a given network in multiple iterations instead of pruning at once.
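The filter-ranking heuristics above can be sketched in a few lines of NumPy. This is an illustrative sketch and not the implementation used in our experiments: the function names `filter_scores` and `keep_indices`, the toy weights and the batch-normalization scaling factors are hypothetical.

```python
import numpy as np

def filter_scores(weight, norm="l1"):
    """Per-filter importance for a conv weight of shape (out, in, kH, kW)."""
    flat = np.abs(weight.reshape(weight.shape[0], -1))
    if norm == "l1":
        return flat.sum(axis=1)
    if norm == "l2":
        return np.sqrt((flat ** 2).sum(axis=1))
    raise ValueError(f"unknown norm: {norm}")

def keep_indices(scores, sparsity):
    """Sorted indices of the filters that survive when a `sparsity`
    fraction of the lowest-scoring filters is removed."""
    n_prune = int(round(sparsity * len(scores)))
    order = np.argsort(scores, kind="stable")  # least important first
    return np.sort(order[n_prune:])

# Toy layer with 4 filters whose magnitudes grow with the filter index.
weight = np.stack([np.full((2, 3, 3), 0.1 * (i + 1)) for i in range(4)])
keep_l1 = keep_indices(filter_scores(weight, "l1"), sparsity=0.5)
keep_l2 = keep_indices(filter_scores(weight, "l2"), sparsity=0.5)

# Network-slimmer-style ranking uses |gamma| of the associated batch-norm
# layer (trained with an L1 penalty on gamma) instead of the conv weights.
bn_gamma = np.array([0.9, 0.01, 0.5, 0.02])  # hypothetical trained values
keep_slim = keep_indices(np.abs(bn_gamma), sparsity=0.5)

print(keep_l1, keep_l2, keep_slim)
```

All three heuristics reduce to the same operation, ranking filters by a score and keeping the top fraction. They differ only in where the score comes from, which is why network slimmer needs a separately trained baseline.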

Pruning HoverNet Model.

HoverNet has a modified pre-activation ResNet50 [He et al.(2015)He, Zhang, Ren, and Sun] as encoder and a U-Net style design for connecting the encoder and decoders. Two types of shortcut connection in HoverNet include (1) residual connections between the layers within each residual block in the encoder and (2) skip connections connecting outputs of the encoder layers to decoder layers. The main challenge in pruning a U-Net style architecture is to ensure the dimensions of these layers with shortcut connection match after pruning. We thus propose the following approach for pruning. Since the ’residual’ feature maps are presumably more important than the convolved ones within a residual block, we prioritize the former when applying heuristics [Li et al.(2016)Li, Kadav, Durdanovic, Samet, and Graf].

1. Handling Shortcut Connections between Encoder and Decoder Layers: Use identical indices for pruning interconnected layers to maintain filter dimension compatibility post-pruning. This approach ensures that any interconnected layers are pruned in unison, preserving the structural integrity of the shortcut connections.
2. Pruning the Last Convolution Layers (Conv3) in Residual Blocks: There are three convolution layers in each residual block in ResNet50 style encoders. The last convolution layer among the three, referred to as the conv3 layer, is connected to the conv3 layer of the subsequent residual block through the identity mapping or 1x1 convolution of the residual connection. The last conv3 of each residual block is also skip-connected to the decoder layers. To ensure that the filter dimensions match for concatenation/addition in the shortcut connections, we prune conv3 from the encoder blocks and the corresponding skip-connected layer in the decoder with the same filter indexing, as depicted in the second panel of Figure 1.
3. Incorporating Non-Uniform Sparsity: We found that pruning conv3 is challenging, as pruning one conv3 layer triggers the pruning of conv3 layers in interconnected residual blocks and of the skip-connected decoder layers. This may lead to over-pruning of important filters, as the actual filter importance of these interconnected and skip-connected layers is not assessed but rather follows the heuristic applied to the aforementioned conv3 layers. As such, we assessed non-uniform sparsity to limit the maximum sparsity level of the interconnected layers, aiming for flexible pruning rates across different layers of the network.
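The three steps above can be sketched as follows. This is a simplified illustration under stated assumptions: the group consists of two conv3 layers and one skip-connected decoder conv whose output is added to the encoder features, the first conv3 layer serves as the reference for ranking, and all names and shapes are hypothetical. The sketch shows only the shared indexing and the sparsity cap.

```python
import numpy as np

def shared_keep_indices(reference_weight, sparsity, max_sparsity=None):
    """Rank the filters of one reference layer by L2 norm and return the
    indices to keep for every layer in the interconnected group.
    `max_sparsity` optionally caps the pruning ratio (non-uniform sparsity)."""
    if max_sparsity is not None:
        sparsity = min(sparsity, max_sparsity)
    flat = reference_weight.reshape(reference_weight.shape[0], -1)
    scores = np.sqrt((flat ** 2).sum(axis=1))
    n_prune = int(round(sparsity * scores.size))
    return np.sort(np.argsort(scores, kind="stable")[n_prune:])

rng = np.random.default_rng(0)
# Hypothetical interconnected group with 16 output channels each.
conv3_block_a = rng.normal(size=(16, 4, 1, 1))
conv3_block_b = rng.normal(size=(16, 4, 1, 1))
decoder_conv = rng.normal(size=(16, 8, 3, 3))  # output is added to the skip features

# Target sparsity 0.75 elsewhere in the network, capped at 0.4 for this group.
keep = shared_keep_indices(conv3_block_a, sparsity=0.75, max_sparsity=0.4)
pruned_group = [w[keep] for w in (conv3_block_a, conv3_block_b, decoder_conv)]

for w in pruned_group:
    print(w.shape)  # all layers keep the same 10 output channels
```

Because every layer in the group is indexed with the same `keep` array, the addition in the shortcut connection still sees matching channel dimensions after pruning.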

Pruning Image Classification Models.

Pruning of classification encoders is identical to pruning the encoder part of a U-Net like encoder-decoder model for dense prediction, except that there are no skip connections to decoders. See Appendix D for a detailed example.

1. Independent Pruning within Residual Block: Prune layers within each residual block or convolution block independently. In this option, only certain layers within each residual/convolution block that do not impact the subsequent blocks are pruned, such as the first two convolution layers in each residual block of ResNet50. When pruning, it is crucial to consider the output dimension of the first convolution layer to be pruned, because it directly determines the input dimension of the following layer, as depicted in the third panel of Figure 1. With this pruning setting, one can choose the same or different pruning ratios for each block of layers (a residual block, or a block composed of multiple convolution layers followed by other types of layers such as batch-normalization layers), since the pruning of each block is independent.
2. Pruning Interconnected Layers across Residual Blocks: Pruning the last convolution layer in a residual block alters the output channel/filter dimension. As such, for the subsequent block, one must match the spatial and channel dimensions of the input feature maps, because the addition operation connects them through the identity mapping. To maintain such compatibility, we keep the pruning ratio and indexing of interconnected convolution layers consistent. The channel number of the preceding batch-normalization layer should also be adjusted in line with the pruning performed on its subsequent convolution layer.
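The bookkeeping described in the two options above can be sketched in NumPy. The sketch assumes a pre-activation ordering in which a batch-normalization layer sits between two convolutions and normalizes the output channels of the first. The function name `prune_between`, the parameter dictionary and the shapes are hypothetical.

```python
import numpy as np

def prune_between(conv_w, bn_params, next_conv_w, keep):
    """Remove filters from `conv_w` and keep its neighbours consistent.

    conv_w:      (out, in, kH, kW) weights of the layer being pruned
    bn_params:   dict of per-channel arrays of shape (out,) for the
                 batch-norm layer that consumes this layer's output
    next_conv_w: (out2, out, kH, kW) weights of the following conv layer
    keep:        sorted indices of the filters to retain
    """
    pruned_conv = conv_w[keep]                              # drop output filters
    pruned_bn = {k: v[keep] for k, v in bn_params.items()}  # drop matching BN channels
    pruned_next = next_conv_w[:, keep]                      # drop matching input channels
    return pruned_conv, pruned_bn, pruned_next

rng = np.random.default_rng(0)
conv1 = rng.normal(size=(8, 4, 3, 3))
bn = {name: rng.normal(size=8) for name in
      ("gamma", "beta", "running_mean", "running_var")}
conv2 = rng.normal(size=(16, 8, 3, 3))

keep = np.array([0, 2, 5])  # e.g. chosen by an L1/L2-norm heuristic
conv1_p, bn_p, conv2_p = prune_between(conv1, bn, conv2, keep)
print(conv1_p.shape, bn_p["gamma"].shape, conv2_p.shape)
```

Note that `conv2` keeps all 16 of its own filters here. Only its input dimension shrinks, which is why layers inside a block can be pruned without affecting the block output.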

4 Experiments and Results

4.1 Nuclei Instance Segmentation and Classification

Dataset and Implementation. The PanNuke dataset [Gamper et al.(2020)Gamper, Koohbanani, Benes, Graham, Jahanifar, Khurram, Azam, Hewitt, and Rajpoot] contains image patches of 256x256 pixels selected from 20x or 40x H&E stained whole-slide images of 19 tumor tissues, with 189,744 segmented nucleus instances and 5 clinically important classes. We followed a 3-fold cross-validation [Gamper et al.(2020)Gamper, Koohbanani, Benes, Graham, Jahanifar, Khurram, Azam, Hewitt, and Rajpoot] and trained HoverNet [Graham et al.(2018)Graham, Vu, Ahmed Raza, Azam, Tsang, Kwak, and Rajpoot] following the implementation, general training strategies and a pre-activation ResNet-50 based [He et al.(2016)He, Zhang, Ren, and Sun] encoder pre-trained on ImageNet [Deng et al.(2009)Deng, Dong, Socher, Li, Li, and Fei-Fei] provided by the HoverNet authors. Specifically, the Adam optimizer [Kingma and Ba(2014)] was adopted with a learning rate of 0.0001 and weight decay of 0.0001. HoverNet training details and trained weights from the three folds for PanNuke were not provided by the authors [Gamper et al.(2020)Gamper, Koohbanani, Benes, Graham, Jahanifar, Khurram, Azam, Hewitt, and Rajpoot], but, without applying an extensive list of tweaks, our model achieved a mean Panoptic Quality (mPQ) of 0.344 over all 5 classes (0.397 as reported by the authors). We assessed the following pruning strategies. (1) One-shot uniform pruning with L1-norm, L2-norm and network slimmer as filter importance heuristics for pruning 10% to 90% of filters in target layers. (2) One-shot non-uniform pruning with the L2-norm heuristic: the same setting as (1), but keeping an upper limit of 40% filter pruning for interdependent layers across the residual blocks and across the encoder and decoder. (3) Iterative pruning by removing a small percentage of filters at each pruning step: 5% filter pruning for 19 rounds (5% to 95%). Each pruning step was followed by fine-tuning for 10 epochs with the same training settings. Model performance is evaluated with mean Panoptic Quality (mPQ), mean Segmentation Quality (mSQ) and mean Detection Quality (mDQ) over all classes (see notes in Appendix C.1). Model efficiency is evaluated with parameter number and latency.
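The iterative strategy (3) can be outlined as a simple loop. The sketch below is schematic: `prune_to` and `fine_tune` are hypothetical callbacks standing in for the actual pruning and training routines, and each sparsity level is assumed to be measured relative to the original filter count, as suggested by the 5% to 95% range.

```python
def iterative_schedule(step=0.05, rounds=19):
    """Cumulative sparsity targets: 0.05, 0.10, ..., 0.95."""
    return [round(step * (i + 1), 4) for i in range(rounds)]

def iterative_prune(model, prune_to, fine_tune, schedule, epochs=10):
    """Alternate pruning and fine-tuning over the sparsity schedule.

    prune_to(model, sparsity) and fine_tune(model, epochs) are
    user-supplied callbacks (hypothetical); each returns the updated model.
    """
    history = []
    for sparsity in schedule:
        model = prune_to(model, sparsity)
        model = fine_tune(model, epochs)
        history.append(sparsity)
    return model, history

# Dry run with stand-in callbacks that only record what would happen.
schedule = iterative_schedule()
log = []
final_model, history = iterative_prune(
    model="baseline",
    prune_to=lambda m, s: log.append(("prune", s)) or m,
    fine_tune=lambda m, e: log.append(("fine_tune", e)) or m,
    schedule=schedule,
)
print(schedule[0], schedule[-1], len(schedule), len(log))
```

One-shot pruning corresponds to calling the same loop with a single-element schedule, for example `[0.75]`.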

One-shot Uniform Pruning. One-shot uniform pruning followed by fine-tuning (\autoref{fig:pq-latency-combined}, left) led to an initial decrease in PQ of around 0.02 for the L1/L2-norm pruners. However, pruning to higher sparsity levels did not lead to performance crashes with the L1-norm and L2-norm pruners. Surprisingly, pruning up to the 70% sparsity level with the L2-norm pruner achieved performance similar to that at the 10% sparsity level. On the other hand, model performance with network slimmer degraded drastically beyond the 80% sparsity level. See Appendix E for all results from 3-fold testing.

One-shot Non-uniform Pruning. Intuitively, pruning performance might be improved by limiting the maximum sparsity level of interconnected layers with widespread impact across residual blocks and encoder-decoder layers. However, we did not observe superior model performance with such a strategy compared to one-shot uniform pruning. In addition, such a maximum sparsity constraint limited the total reduction of model parameters from the interconnected layers and thus resulted in a smaller reduction in latency. See Appendix D.1 for notes on the evaluation and efficiency metrics.

Iterative Pruning vs One-shot Pruning. Iterative pruning of 5% of filters per step with the L2-norm heuristic consistently outperformed one-shot pruning (\autoref{fig:pq-latency-combined}) in terms of maintaining model performance. Surprisingly, pruning up to 85% - 90% sparsity levels resulted not only in model performance nearly identical to that at low sparsity levels such as 5%, but also in performance very similar to that of the baseline model. Critically, the latency of the model was reduced by 80% (5-6 times faster) and the memory footprint was reduced by an order of magnitude, as reflected by the drastically reduced parameter numbers (Table 1; see Appendix F for notes on efficiency improvement). Due to the large number of experiments (5 strategies/heuristics $\times$ 19 runs $\times$ 3 folds), we kept the same hyperparameters, but one may further optimize performance by finer-grained hyperparameter tuning.

Table 1: Performance of iterative pruning of HoverNet (averaged over 3-fold).

\begin{tblr}{column{even} = {c}, column{3} = {c}, column{5} = {c}, column{7} = {c}, hline{1-2,9} = {-}{0.08em}}
Sparsity & mPQ & mSQ & mDQ & Latency (ms) & FLOPs ($10^{10}$) & Params ($10^{7}$) \\
0.05 & 0.3239 & 0.8075 & 0.3968 & 676.76 & 27.57 & 3.47 \\
0.15 & 0.3343 & 0.8089 & 0.4079 & 638.06 & 23.41 & 2.92 \\
0.25 & 0.3225 & 0.7999 & 0.3953 & 540.09 & 19.61 & 2.41 \\
0.50 & 0.3204 & 0.8094 & 0.3915 & 225.85 & 11.55 & 1.37 \\
0.75 & 0.3274 & 0.8122 & 0.3986 & 148.90 & 5.56 & 0.63 \\
0.85 & 0.3304 & 0.8074 & 0.4024 & 120.65 & 3.74 & 0.42 \\
0.90 & 0.3209 & 0.8027 & 0.3934 & 105.43 & 2.95 & 0.33 \\
\end{tblr}
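As a worked example, the efficiency gains implied by Table 1 can be computed directly from three of its rows. The ratios below are taken relative to the 0.05 sparsity row, which is an assumption made here because the unpruned baseline is not listed in the table. They are therefore not identical to the baseline-relative figures quoted in the text.

```python
# Selected rows of Table 1: latency (ms), FLOPs (1e10), parameters (1e7).
table1 = {
    0.05: {"latency_ms": 676.76, "flops_e10": 27.57, "params_e7": 3.47},
    0.50: {"latency_ms": 225.85, "flops_e10": 11.55, "params_e7": 1.37},
    0.90: {"latency_ms": 105.43, "flops_e10": 2.95, "params_e7": 0.33},
}

def improvement(reference, pruned):
    """Reduction factor for each efficiency metric (reference / pruned)."""
    return {key: reference[key] / pruned[key] for key in reference}

gain_50 = improvement(table1[0.05], table1[0.50])
gain_90 = improvement(table1[0.05], table1[0.90])

for sparsity, gain in ((0.50, gain_50), (0.90, gain_90)):
    summary = ", ".join(f"{k}: x{v:.2f}" for k, v in gain.items())
    print(f"sparsity {sparsity:.2f} vs 0.05 -> {summary}")
```

Latency shrinks less than FLOPs and parameter count at high sparsity, which is consistent with fixed per-layer overheads that pruning does not remove.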

4.2 Colorectal Cancer (CRC) Tissue Classification

We further assess model pruning in a second vision task, tissue classification, to investigate (1) whether a smaller network can be effectively pruned, (2) performance with different filter importance heuristics and (3) the impact of sparsity increments in iterative pruning on performance.

Dataset and Implementation. We leveraged a CRC dataset [Kather et al.(2019)Kather, Krisam, Charoentong, Luedde, Herpel, Weis, Gaiser, Marx, Valous, Ferber, et al.], which contains 100,000 stain-normalized 224x224 patches at 20x sampled from H&E whole-slide images (136 patients) with patch-wise labels of 9 tissue types. To establish a class-balanced dataset, we randomly selected 8700 image patches from each class for training (7000 train; 2700 validation) and ran testing on the entire test set (7150 patches). We first trained ResNet18 [He et al.(2015)He, Zhang, Ren, and Sun] with the AdamW optimizer at a learning rate of $5\times 10^{-6}$ and a weight decay of $1\times 10^{-5}$ for at most 50 epochs, with early stopping when the validation loss did not improve for 5 consecutive epochs. We applied one-shot pruning and iterative pruning at two different sparsity level increments (0.0625 and 0.25), followed by fine-tuning for at most 30 epochs with the same early stopping criteria.

Pruning Tissue Classification Model. Similar to the much more complex HoverNet, ResNet18 can be effectively pruned: at the 0.75 sparsity level, performance scores dropped by $<$ 3% in one-shot pruning (Table 2) and with L2-norm iterative pruning (Table 3), while latency was reduced to about 1/3 and the parameter number to about 1/14 (Table 3). With one-shot pruning, the L1-norm heuristic achieved slightly better performance than L2-norm and about 2% better scores than Network Slimmer. Iterative pruning achieved about 1% better performance scores than one-shot pruning. Iterative pruning with sparsity increments of 0.0625 and 0.25 achieved similar performance, suggesting that pruning performance is not very sensitive to this hyperparameter. ResNet18 is much more prunable in our case than reported for natural scene image classification models [Li et al.(2016)Li, Kadav, Durdanovic, Samet, and Graf]. Such a difference may be due to the different levels of information redundancy in these datasets.

Table 2: Performance of one-shot uniform pruning of ResNet18 CRC tissue classification model.

\begin{tblr}{cell{1}{2} = {c=2}{}, cell{1}{4} = {c=2}{}, cell{1}{6} = {c=2}{}, hline{1,3,7} = {-}{}}
Sparsity & L1 Pruner & & L2 Pruner & & Network Slimmer & \\
 & Accuracy & Weighted F1 & Accuracy & Weighted F1 & Accuracy & Weighted F1 \\
0 & 0.954 $\pm$ 0.005 & 0.953 $\pm$ 0.006 & 0.954 $\pm$ 0.005 & 0.953 $\pm$ 0.006 & 0.951 $\pm$ 0.003 & 0.949 $\pm$ 0.004 \\
0.25 & 0.956 $\pm$ 0.001 & 0.955 $\pm$ 0.002 & 0.959 $\pm$ 0.004 & 0.958 $\pm$ 0.004 & 0.955 $\pm$ 0.004 & 0.954 $\pm$ 0.004 \\
0.50 & 0.960 $\pm$ 0.003 & 0.959 $\pm$ 0.003 & 0.946 $\pm$ 0.004 & 0.944 $\pm$ 0.005 & 0.939 $\pm$ 0.003 & 0.939 $\pm$ 0.004 \\
0.75 & 0.943 $\pm$ 0.003 & 0.942 $\pm$ 0.003 & 0.937 $\pm$ 0.003 & 0.936 $\pm$ 0.002 & 0.924 $\pm$ 0.008 & 0.921 $\pm$ 0.009 \\
\end{tblr}

Table 3: Comparison of pruning performance for CRC tissue classification with one-shot and iterative pruning at different sparsity increments using the L2-norm pruner (averaged over 3 runs).

\begin{tblr}{cell{1}{2} = {c=2}{}, cell{1}{4} = {c=2}{}, cell{1}{6} = {c=2}{}, cell{1}{8} = {c=2}{}, hline{1,3,7} = {-}{}}
Sparsity & One shot & & Iterative 0.25 & & Iterative 0.0625 & & Efficiency & \\
 & Accuracy & Weighted F1 & Accuracy & Weighted F1 & Accuracy & Weighted F1 & Latency (ms) & Params ($10^{6}$) \\
0 & 0.954 & 0.953 & 0.954 & 0.953 & 0.954 & 0.953 & 6.82 & 11.70 \\
0.25 & 0.959 & 0.958 & 0.959 & 0.958 & 0.949 & 0.948 & 5.65 & 6.67 \\
0.50 & 0.946 & 0.944 & 0.948 & 0.946 & 0.944 & 0.943 & 3.57 & 3.05 \\
0.75 & 0.937 & 0.936 & 0.948 & 0.948 & 0.949 & 0.948 & 2.17 & 0.83 \\
\end{tblr}
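The parameter counts in Table 3 can be compared with a rough back-of-the-envelope estimate. The estimate rests on an assumption not stated in the text: if a fraction s of filters is removed uniformly from every convolution layer, both the output and the input dimension of most layers shrink by (1 - s), so the parameter count scales approximately with (1 - s)^2. The sketch checks this approximation against the measured values.

```python
# Measured values from Table 3 (parameters in millions, latency in ms).
measured_params = {0.0: 11.70, 0.50: 3.05, 0.75: 0.83}
latency_ms = {0.0: 6.82, 0.50: 3.57, 0.75: 2.17}

def quadratic_estimate(base_params, sparsity):
    """Rough estimate assuming every layer loses the same fraction of
    both its output filters and its input channels (an approximation)."""
    return base_params * (1.0 - sparsity) ** 2

base = measured_params[0.0]
for s in sorted(measured_params):
    est = quadratic_estimate(base, s)
    print(f"sparsity {s:.2f}: estimated {est:.2f}M, measured {measured_params[s]:.2f}M")

param_ratio = base / measured_params[0.75]
latency_ratio = latency_ms[0.0] / latency_ms[0.75]
print(f"params reduced x{param_ratio:.1f}, latency reduced x{latency_ratio:.1f}")
```

The measured counts sit slightly above the quadratic estimate, which is plausible because some dimensions, such as the 3 input channels of the first layer and the 9 output classes of the classifier, cannot be pruned.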

5 Conclusion

In this study, we investigated model pruning in digital pathology applications. We proposed an effective pruning strategy to reduce the computation budget of HoverNet, a widely adopted architecture for nuclei instance segmentation and classification. We further demonstrated effective model pruning in a much smaller model, ResNet18, for tissue classification. Our observations suggest that large models are not necessarily a hard requirement for reliable and effective inference in digital and computational pathology applications. The pruned models, compact and efficient, may enable the deployment of AI in resource-constrained clinical sites and on edge devices, for example, enabling AI on whole-slide scanners. For future work, we plan to (1) assess the robustness of pruned models compared with original models [Holste et al.(2023)Holste, Jiang, Jaiswal, Hanna, Minkowitz, Legasto, Escalon, Steinberger, Bittman, Shen, et al.] (more discussion in Appendix G), (2) combine pruning with quantization and deploy the pruned model on edge devices, and (3) investigate pruning for Vision Transformer models such as CellViT [Hörst et al.(2023)Hörst, Rempe, Heine, Seibold, Keyl, Baldini, Ugurel, Siveke, Grünwald, Egger, and Kleesiek].

References

[Aghli and Ribeiro(2021)] Nima Aghli and Eraldo Ribeiro. Combining weight pruning and knowledge distillation for cnn compression. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 3185–3192, 2021. 10.1109/CVPRW53098.2021.00356.
[Alom et al.(2019)Alom, Yakopcic, Hasan, Taha, and Asari] Md Zahangir Alom, Chris Yakopcic, Mahmudul Hasan, Tarek M Taha, and Vijayan K Asari. Recurrent residual u-net for medical image segmentation. Journal of medical imaging, 6(1):014006–014006, 2019.
[Ardon et al.(2023)Ardon, Klein, Manzo, Corsale, England, Mazzella, Geneslaw, Philip, Ntiamoah, Wright, et al.] Orly Ardon, Eric Klein, Allyne Manzo, Lorraine Corsale, Christine England, Allix Mazzella, Luke Geneslaw, John Philip, Peter Ntiamoah, Jeninne Wright, et al. Digital pathology operations at a tertiary cancer center: Infrastructure requirements and operational cost. Journal of Pathology Informatics, 14:100318, 2023.
[Baymurzina et al.(2022)Baymurzina, Golikov, and Burtsev] Dilyara Baymurzina, Eugene Golikov, and Mikhail Burtsev. A review of neural architecture search. Neurocomputing, 474:82–93, 2022.
[Blalock et al.(2020)Blalock, Ortiz, Frankle, and Guttag] Davis Blalock, Jose Javier Gonzalez Ortiz, Jonathan Frankle, and John Guttag. What is the state of neural network pruning?, 2020.
[Cheng et al.(2023)Cheng, Zhang, and Shi] Hongrong Cheng, Miao Zhang, and Javen Qinfeng Shi. A survey on deep neural network Pruning-Taxonomy, comparison, analysis, and recommendations. August 2023.
[Cho and Hariharan(2019)] Jang Hyun Cho and Bharath Hariharan. On the efficacy of knowledge distillation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4794–4802, 2019.
[Choudhary et al.(2021)Choudhary, Mishra, Goswami, and Sarangapani] Tejalal Choudhary, Vipul Mishra, Anurag Goswami, and Jagannathan Sarangapani. A transfer learning with structured filter pruning approach for improved breast cancer classification on point-of-care devices. Computers in Biology and Medicine, 134:104432, 2021. ISSN 0010-4825. https://doi.org/10.1016/j.compbiomed.2021.104432. URL https://www.sciencedirect.com/science/article/pii/S0010482521002262.
[Deng et al.(2009)Deng, Dong, Socher, Li, Li, and Fei-Fei] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.
[Deng et al.(2023)Deng, Cui, Liu, Yao, Remedios, Bao, Landman, Wheless, Coburn, Wilson, et al.] Ruining Deng, Can Cui, Quan Liu, Tianyuan Yao, Lucas W Remedios, Shunxing Bao, Bennett A Landman, Lee E Wheless, Lori A Coburn, Keith T Wilson, et al. Segment anything model (sam) for digital pathology: Assess zero-shot segmentation on whole slide imaging. arXiv preprint arXiv:2304.04155, 2023.
[Ding et al.(2021)Ding, Zhang, Ma, Han, Ding, and Sun] Xiaohan Ding, Xiangyu Zhang, Ningning Ma, Jungong Han, Guiguang Ding, and Jian Sun. RepVGG: Making VGG-style ConvNets great again. January 2021.
[DiPalma et al.(2021)DiPalma, Suriawinata, Tafe, Torresani, and Hassanpour] Joseph DiPalma, Arief A. Suriawinata, Laura J. Tafe, Lorenzo Torresani, and Saeed Hassanpour. Resolution-based distillation for efficient histology image classification, 2021.
[Frankle and Carbin(2018)] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. March 2018.
[Frankle et al.(2019)Frankle, Dziugaite, Roy, and Carbin] Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M Roy, and Michael Carbin. Stabilizing the lottery ticket hypothesis. March 2019.
[Fu et al.(2022a)Fu, Yang, Yuan, Li, Wan, Krishnamoorthi, Chandra, and Lin] Yonggan Fu, Haichuan Yang, Jiayi Yuan, Meng Li, Cheng Wan, Raghuraman Krishnamoorthi, Vikas Chandra, and Yingyan Lin. DepthShrinker: A new compression paradigm towards boosting Real-Hardware efficiency of compact neural networks. June 2022a.
[Fu et al.(2022b)Fu, Yang, Yuan, Li, Wan, Krishnamoorthi, Chandra, and Lin] Yonggan Fu, Haichuan Yang, Jiayi Yuan, Meng Li, Cheng Wan, Raghuraman Krishnamoorthi, Vikas Chandra, and Yingyan Lin. Depthshrinker: a new compression paradigm towards boosting real-hardware efficiency of compact neural networks. In International Conference on Machine Learning, pages 6849–6862. PMLR, 2022b.
[Gamper et al.(2020)Gamper, Koohbanani, Benes, Graham, Jahanifar, Khurram, Azam, Hewitt, and Rajpoot] Jevgenij Gamper, Navid Alemi Koohbanani, Ksenija Benes, Simon Graham, Mostafa Jahanifar, Syed Ali Khurram, Ayesha Azam, Katherine Hewitt, and Nasir Rajpoot. PanNuke dataset extension, insights and baselines. March 2020.
[Gholami et al.(2021)Gholami, Kim, Dong, Yao, Mahoney, and Keutzer] Amir Gholami, Sehoon Kim, Zhen Dong, Zhewei Yao, Michael W Mahoney, and Kurt Keutzer. A survey of quantization methods for efficient neural network inference. March 2021.
[Gou et al.(2021)Gou, Yu, Maybank, and Tao] Jianping Gou, Baosheng Yu, Stephen J. Maybank, and Dacheng Tao. Knowledge distillation: A survey. International Journal of Computer Vision, 129(6):1789–1819, March 2021. ISSN 1573-1405. 10.1007/s11263-021-01453-z. URL http://dx.doi.org/10.1007/s11263-021-01453-z.
[Graham et al.(2018)Graham, Vu, Ahmed Raza, Azam, Tsang, Kwak, and Rajpoot] Simon Graham, Quoc Dang Vu, Shan E Ahmed Raza, Ayesha Azam, Yee Wah Tsang, Jin Tae Kwak, and Nasir Rajpoot. HoVer-Net: Simultaneous segmentation and classification of nuclei in Multi-Tissue histology images. December 2018.
[He et al.(2015)He, Zhang, Ren, and Sun] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. December 2015.
[He et al.(2016)He, Zhang, Ren, and Sun] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14, pages 630–645. Springer, 2016.
[Hinton et al.(2015)Hinton, Vinyals, and Dean] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. March 2015.
[Holste et al.(2023)Holste, Jiang, Jaiswal, Hanna, Minkowitz, Legasto, Escalon, Steinberger, Bittman, Shen, et al.] Gregory Holste, Ziyu Jiang, Ajay Jaiswal, Maria Hanna, Shlomo Minkowitz, Alan C Legasto, Joanna G Escalon, Sharon Steinberger, Mark Bittman, Thomas C Shen, et al. How does pruning impact long-tailed multi-label medical image classifiers? In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 663–673. Springer, 2023.
[Hooker et al.(2020)Hooker, Moorosi, Clark, Bengio, and Denton] Sara Hooker, Nyalleng Moorosi, Gregory Clark, Samy Bengio, and Emily Denton. Characterising bias in compressed models, 2020.
[Hörst et al.(2023)Hörst, Rempe, Heine, Seibold, Keyl, Baldini, Ugurel, Siveke, Grünwald, Egger, and Kleesiek] Fabian Hörst, Moritz Rempe, Lukas Heine, Constantin Seibold, Julius Keyl, Giulia Baldini, Selma Ugurel, Jens Siveke, Barbara Grünwald, Jan Egger, and Jens Kleesiek. Cellvit: Vision transformers for precise cell segmentation and classification, 2023.
[Ioffe and Szegedy(2015)] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. February 2015.
[Janowczyk and Madabhushi(2016)] Andrew Janowczyk and Anant Madabhushi. Deep learning for digital pathology image analysis: A comprehensive tutorial with selected use cases. Journal of pathology informatics, 7(1):29, 2016.
[Javed et al.(2023)Javed, Mahmood, Qaiser, and Werghi] Sajid Javed, Arif Mahmood, Talha Qaiser, and Naoufel Werghi. Knowledge distillation in histology landscape by multi-layer features supervision. IEEE Journal of Biomedical and Health Informatics, 27(4):2037–2046, 2023.
[Jian et al.(2022)Jian, Wang, Wang, Dy, and Ioannidis] Tong Jian, Zifeng Wang, Yanzhi Wang, Jennifer Dy, and Stratis Ioannidis. Pruning adversarially robust neural networks without adversarial examples, 2022.
[Kather et al.(2019)Kather, Krisam, Charoentong, Luedde, Herpel, Weis, Gaiser, Marx, Valous, Ferber, et al.] Jakob Nikolas Kather, Johannes Krisam, Pornpimol Charoentong, Tom Luedde, Esther Herpel, Cleo-Aron Weis, Timo Gaiser, Alexander Marx, Nektarios A Valous, Dyke Ferber, et al. Predicting survival from colorectal cancer histology slides using deep learning: A retrospective multicenter study. PLoS medicine, 16(1):e1002730, 2019.
[Kingma and Ba(2014)] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. December 2014.
[Li et al.(2016)Li, Kadav, Durdanovic, Samet, and Graf] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient ConvNets. August 2016.
[Liu et al.(2017)Liu, Li, Shen, Huang, Yan, and Zhang] Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient convolutional networks through network slimming. August 2017.
[Luo et al.(2024)Luo, Qu, Guo, Song, and Wang] Xiaoyuan Luo, Linhao Qu, Qinhao Guo, Zhijian Song, and Manning Wang. Negative instance guided self-distillation framework for whole slide image analysis. IEEE Journal of Biomedical and Health Informatics, 28(2):964–975, 2024. 10.1109/JBHI.2023.3298798.
[Ma et al.(2018)Ma, Zhang, Zheng, and Sun] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In Proceedings of the European conference on computer vision (ECCV), pages 116–131, 2018.
[Mahbod et al.(2022)Mahbod, Entezari, Ellinger, and Saukh] Amirreza Mahbod, Rahim Entezari, Isabella Ellinger, and Olga Saukh. Deep neural network pruning for nuclei instance segmentation in hematoxylin and eosin-stained histological images. In Applications of Medical Artificial Intelligence: First International Workshop, AMAI 2022, Held in Conjunction with MICCAI 2022, Singapore, September 18, 2022, Proceedings, pages 108–117, Berlin, Heidelberg, 2022. Springer-Verlag. ISBN 978-3-031-17720-0. 10.1007/978-3-031-17721-7_12. URL https://doi.org/10.1007/978-3-031-17721-7_12.
[Ren et al.(2021)Ren, Xiao, Chang, Huang, Li, Chen, and Wang] Pengzhen Ren, Yun Xiao, Xiaojun Chang, Po-Yao Huang, Zhihui Li, Xiaojiang Chen, and Xin Wang. A comprehensive survey of neural architecture search: Challenges and solutions, 2021.
[Ronneberger et al.(2015)Ronneberger, Fischer, and Brox] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation, 2015.
[Van der Sluijs et al.(2024)Van der Sluijs, Bhaskhar, Rubin, Langlotz, and Chaudhari] Rogier Van der Sluijs, Nandita Bhaskhar, Daniel Rubin, Curtis Langlotz, and Akshay S Chaudhari. Exploring image augmentations for siamese representation learning with chest x-rays. In Medical Imaging with Deep Learning, pages 444–467. PMLR, 2024.
[Yu et al.(2023)Yu, Lin, and Xu] Zhimiao Yu, Tiancheng Lin, and Yi Xu. Slpd: Slide-level prototypical distillation for wsis, 2023.
[Zhang et al.(2022)Zhang, Xing, Zou, and Wu] Angela Zhang, Lei Xing, James Zou, and Joseph C Wu. Shifting machine learning for healthcare from development to deployment and from models to data. Nature Biomedical Engineering, 6(12):1330–1345, 2022.

Appendix A Notes on Knowledge Distillation

Knowledge Distillation (KD) is the process of transferring knowledge from a large model (the teacher) to a relatively smaller model (the student) [Hinton et al.(2015)Hinton, Vinyals, and Dean]. KD has been shown to be effective in training a smaller model (e.g. ResNet-50) to the same level of performance as a larger teacher model (e.g. ResNet-152). However, the level of compression achieved by KD is generally lower than that of techniques such as pruning, because the selected student model is usually an existing, well-designed architecture. KD has also been explored in digital pathology, both for model compression and for improving the performance of the student model. For example, Javed et al. proposed a knowledge distillation-based method to improve the performance of shallow networks for tissue phenotyping in histology images [Javed et al.(2023)Javed, Mahmood, Qaiser, and Werghi], using ResNet-18 as the student model and ResNet-50 as the teacher model. The student model can be further pruned or quantized as needed [Aghli and Ribeiro(2021)]. DiPalma et al. proposed a method for improving the computational efficiency of histology image classification [DiPalma et al.(2021)DiPalma, Suriawinata, Tafe, Torresani, and Hassanpour], in which the teacher network is trained on high-resolution images and the student network on low-resolution images.

It is important to note that KD, as a generic method, has found its way into many applications beyond model compression. For example, in the following two studies on self-supervised learning, the teacher and student models share the same architecture. Yu et al. proposed SLPD, which encodes the intra- and inter-slide semantic structures by modelling the mutual-region/slide relations using knowledge distillation [Yu et al.(2023)Yu, Lin, and Xu]; Luo et al. proposed a negative instance-guided self-distillation framework to directly train an instance-level classifier end-to-end as an alternative to Multiple Instance Learning methods [Luo et al.(2024)Luo, Qu, Guo, Song, and Wang].

Appendix B Semi-Structured Pruning

In contrast to structured pruning, where entire filters are removed, semi-structured pruning explores sparsity patterns that lie between unstructured and structured pruning, such as block sparsity or n:m sparsity. Unlike unstructured sparsity, n:m sparsity can be exploited by some recent hardware (e.g. NVIDIA A100) to speed up inference.
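As an illustration of what an n:m pattern looks like, the following minimal sketch builds a 2:4 mask over a flat list of weights. It assumes the convention that n is the number of weights kept in every group of m, and it selects them by magnitude; the function name `nm_sparsity_mask` and the selection rule are hypothetical choices for this example, not the procedure used in our experiments.

```python
from typing import List


def nm_sparsity_mask(weights: List[float], n: int = 2, m: int = 4) -> List[int]:
    """Return a 0/1 mask keeping the n largest-magnitude weights in each group of m.

    Assumption for illustration: n counts the weights that are KEPT per group.
    """
    if len(weights) % m != 0:
        raise ValueError("number of weights must be a multiple of m")
    if not 0 <= n <= m:
        raise ValueError("n must be between 0 and m")
    mask = [0] * len(weights)
    for start in range(0, len(weights), m):
        group = range(start, start + m)
        # Rank the positions of this group by absolute weight, largest first.
        keep = sorted(group, key=lambda idx: abs(weights[idx]), reverse=True)[:n]
        for idx in keep:
            mask[idx] = 1
    return mask


weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]
mask = nm_sparsity_mask(weights)  # 2:4 pattern
pruned = [w * k for w, k in zip(weights, mask)]
```

Every group of four consecutive weights ends up with exactly two non-zero entries, which is the regularity that distinguishes this pattern from unstructured sparsity.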

Appendix C Efficiency Metrics and Performance Metrics

C.1 Performance Metrics

Following [Gamper et al.(2020)Gamper, Koohbanani, Benes, Graham, Jahanifar, Khurram, Azam, Hewitt, and Rajpoot], we used the metrics (PQ, DQ, SQ) detailed below for evaluating the performance of the pruned model.

We report mean PQ, DQ and SQ over all classes by first pooling all positive instances from all test images, then calculating these metrics for each class, and finally averaging over classes. Note that in the PanNuke dataset paper [Gamper et al.(2020)Gamper, Koohbanani, Benes, Graham, Jahanifar, Khurram, Azam, Hewitt, and Rajpoot], the authors calculated mean PQ by first calculating PQ for each class in each image and then averaging over the images that contain positive instances of the corresponding class. We found that PQ and DQ are similar under the two ways of calculation, whereas SQ is about 0.2 higher with our approach.

Detection Quality (DQ): DQ is a widely used metric for evaluating segmentation models. DQ is the F1 score that measures the quality of instance detection (Graham et al. 2018). Mathematically, it can be described as:

DQ=\frac{|TP|}{|TP|+\frac{1}{2}|FP|+\frac{1}{2}|FN|},

where TP, FP and FN denote True Positive, False Positive and False Negative respectively.

Segmentation Quality (SQ): SQ measures how close each correctly detected instance is to its ground truth mask. Mathematically, it can be described as:

SQ=\frac{\Sigma_{(x,y)\in TP}IoU(x,y)}{|TP|},

where IoU is the intersection over union. We only considered matched instances with IoU $\geq$ 0.5, a threshold proven to yield unique matches between ground-truth and predicted instances (Kirillov et al. 2018).

Panoptic Quality (PQ): PQ, introduced by Kirillov et al. (Kirillov et al. 2018), is a unified score for comparing both segmentation and classification between the predicted labels and ground truth. Mathematically, PQ is the product of DQ and SQ:

PQ=\frac{|TP|}{|TP|+\frac{1}{2}|FP|+\frac{1}{2}|FN|}\times\frac{\Sigma_{(x,y)\in TP}IoU(x,y)}{|TP|}.
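The three definitions can be sketched for a single class as follows. This is a minimal illustration that assumes the matching between predicted and ground-truth instances has already been done; the function name `panoptic_metrics` and the convention of returning 0.0 when a class has no instances are assumptions for this example, not part of our evaluation code.

```python
from typing import List, Tuple


def panoptic_metrics(matched_ious: List[float], num_fp: int, num_fn: int) -> Tuple[float, float, float]:
    """Compute (DQ, SQ, PQ) for one class.

    matched_ious: IoU of every matched (prediction, ground truth) pair, i.e. the TPs.
    num_fp: predicted instances without a match.
    num_fn: ground-truth instances without a match.
    """
    if any(iou < 0.5 for iou in matched_ious):
        raise ValueError("matched pairs must have IoU >= 0.5")
    tp = len(matched_ious)
    denom = tp + 0.5 * num_fp + 0.5 * num_fn
    # Assumed convention: metrics are 0.0 when the class has no instances at all.
    dq = tp / denom if denom > 0 else 0.0
    sq = sum(matched_ious) / tp if tp > 0 else 0.0
    return dq, sq, dq * sq


# Three matched pairs, one unmatched prediction, one unmatched ground-truth instance.
dq, sq, pq = panoptic_metrics([0.9, 0.8, 0.7], num_fp=1, num_fn=1)
```

To reproduce the pooled averaging described at the start of this section, one would accumulate `matched_ious`, `num_fp` and `num_fn` per class over all test images, call the function once per class, and average the results over classes.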

Appendix D Intra-residual block and inter-residual block pruning

Example illustrations of intra-residual block pruning (\figureref{fig:intrablock}) and inter-residual block pruning (\figureref{fig:interblock}).

D.1 Efficiency Metrics

Most existing literature adopts theoretical floating-point operations (FLOPs) as the metric for assessing the speed-up or computation saving after pruning. FLOPs refers to the theoretical number of arithmetic operations required to perform a given computation. FLOPs are generally proportional to the model size, although this depends on the exact layers used in a particular model.

However, theoretical FLOPs do not directly reflect model speedup or memory savings and are thus at most a supplementary metric when evaluating changes in model efficiency before and after model pruning/compression [Ma et al.(2018)Ma, Zhang, Zheng, and Sun]. One main reason is that, for different types of layers and even for the same type of layer with different design choices, the same reduction in floating-point operations does not yield the same time/memory savings. One example is a convolution layer with a filter size of 3x3 versus 5x5. Most modern scientific computing libraries include optimized implementations for 3x3 but not for 5x5 convolution operations, and thus 3x3 is faster than 5x5. Another example is that plain convolution networks like VGGs, which do not include residual blocks, are generally more efficient than residual networks, so that even when the same number of parameters is removed, the former will accelerate more than the latter. See [Ma et al.(2018)Ma, Zhang, Zheng, and Sun] and [Fu et al.(2022b)Fu, Yang, Yuan, Li, Wan, Krishnamoorthi, Chandra, and Lin] for more in-depth assessment and explanations.
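To make the theoretical count concrete, the sketch below computes the multiply-accumulate operations (MACs) of a single convolution layer for 3x3 and 5x5 kernels. It assumes stride 1, an output of fixed spatial size and no bias term; the function name `conv2d_macs` and the layer dimensions are hypothetical. FLOPs are often reported as roughly twice the MAC count.

```python
def conv2d_macs(in_ch: int, out_ch: int, kernel: int, out_h: int, out_w: int) -> int:
    """Theoretical multiply-accumulate count of one convolution layer (bias ignored)."""
    # Each output element needs in_ch * kernel * kernel multiply-accumulates.
    return out_h * out_w * out_ch * in_ch * kernel * kernel


macs_3x3 = conv2d_macs(in_ch=64, out_ch=64, kernel=3, out_h=256, out_w=256)
macs_5x5 = conv2d_macs(in_ch=64, out_ch=64, kernel=5, out_h=256, out_w=256)
ratio = macs_5x5 / macs_3x3  # 25 / 9, independent of channels and resolution
```

The ratio of about 2.8 is only the theoretical cost difference. As argued above, the measured difference in run time depends on how each kernel size is implemented in the underlying libraries, which this count cannot capture.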

We thus turn to the following two complementary metrics for our assessment.

Latency: Latency refers to the time taken for a model to process a single input or a batch of inputs, where an input is an image with a certain height, width and number of channels (such as the R, G, B channels). To obtain accurate latency measurements, we (1) fix the GPU device type when comparing different experiment settings, (2) fix the image batch size and image dimensions, (3) perform GPU warm-up before the actual latency computation for our models, and (4) since GPU computing is asynchronous (different threads execute computation at slightly different timing), wait until all computing threads have finished before stopping the timer.

Model parameter number: The number of model parameters, together with their data type, is directly related to the model size. For example, a model with 1 million parameters stored as float32 occupies roughly 4 megabytes of disk/dynamic memory.

Throughput refers to the number of input images processed by a model within a given time. Although multi-processing can in general increase latency, so that throughput may not directly reflect changes in model efficiency, one can fix the number of GPUs used for the throughput calculation and use this metric alongside latency; both are similar metrics that measure the time-related improvement from model pruning/compression.
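The latency protocol above (warm-up, then timing with synchronization) and the model size estimate can be sketched as follows. This is a framework-agnostic illustration rather than our benchmarking code: `measure_latency`, `run_once`, `synchronize` and `float32_size_megabytes` are hypothetical names, and the workload is a stand-in. With PyTorch on a GPU, one could pass `torch.cuda.synchronize` as the `synchronize` argument.

```python
import time
from statistics import mean
from typing import Callable, List


def measure_latency(run_once: Callable[[], object],
                    synchronize: Callable[[], None] = lambda: None,
                    warmup: int = 10,
                    repeats: int = 50) -> float:
    """Return the mean wall-clock seconds per call of run_once."""
    # Warm-up calls are executed but not timed.
    for _ in range(warmup):
        run_once()
    synchronize()  # make sure warm-up work has finished before timing starts

    timings: List[float] = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_once()
        synchronize()  # wait for asynchronous work before stopping the clock
        timings.append(time.perf_counter() - start)
    return mean(timings)


def float32_size_megabytes(num_params: int) -> float:
    """Approximate storage of float32 parameters (4 bytes each, 1 MB = 10**6 bytes)."""
    return num_params * 4 / 1e6


# Stand-in workload; in practice run_once would run the model on a fixed-size batch.
latency = measure_latency(lambda: sum(range(1000)), warmup=2, repeats=5)
size_mb = float32_size_megabytes(1_000_000)
```

Keeping the batch size, image dimensions and device fixed across calls is left to the caller, since those settings live inside `run_once`.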

Appendix E Model performance across folds for various pruning heuristics and sparsity ratios

\figureref{fig:pqfolds} shows the detailed performance of the model from each of the 3 folds across sparsity ratios from 0.1 to 0.95, in increments of 0.05, for various pruning heuristics.

Appendix F Notes on reduction of latency and model parameter numbers in model pruning

Reduction of model parameter number: The number of model parameters does not decrease linearly with the sparsity level, as can be seen in \autoref{tab:crc}. Consider a convolution layer with $i$ input channels, $o$ output filters, and kernel height and width of $k$; its number of parameters is $i\times o\times k\times k$. In our pruning setting, we pruned all layers. For this convolution layer, when the input and output dimensions are both pruned with, for example, a sparsity level of 0.25, the remaining input and output channel numbers become $0.75\times i$ and $0.75\times o$, resulting in $0.75\times i\times 0.75\times o\times k\times k$ parameters after pruning, i.e. about 56% of the original count. As such, pruning this convolution layer removes more than 25% of its parameters. Since most of the parameters in ResNets are in the convolution layers, it is not surprising that the number of model parameters falls faster than linearly with the sparsity level.
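The arithmetic in this paragraph can be checked with a short sketch. The layer size (64 input channels, 64 output filters, 3x3 kernel) is a hypothetical example, and the helper names `conv_params` and `remaining_fraction` are introduced only for illustration.

```python
def conv_params(in_ch: int, out_ch: int, kernel: int) -> int:
    """Number of weights in a convolution layer without bias: i * o * k * k."""
    return in_ch * out_ch * kernel * kernel


def remaining_fraction(sparsity: float) -> float:
    """Fraction of weights left when input and output channels are both pruned at `sparsity`."""
    keep = 1.0 - sparsity
    return keep * keep


dense = conv_params(64, 64, 3)
pruned = conv_params(48, 48, 3)  # 0.75 * 64 = 48 channels kept on each side
fraction = pruned / dense        # matches remaining_fraction(0.25)
```

Because the kept fraction enters once for the input channels and once for the output channels, a sparsity level of 0.9 would leave only about 1% of the weights of such a layer under the same assumptions.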

Reduction of latency: In \autoref{tab:crc}, model latency was reduced to 1/3 of that of the original model, which is disproportionate to the reduction in model parameters (1/10). First, such behaviour is common in the literature, for example in the original works on L1/L2-norm pruning and network slimming. Second, although filter pruning removed a large proportion of filters, the residual connections were kept, which are known to be less efficient than plain convolution layers [Fu et al.(2022b)Fu, Yang, Yuan, Li, Wan, Krishnamoorthi, Chandra, and Lin]. Third, the exact reduction of latency for each particular layer is highly dependent on the implementation of the model training framework (PyTorch) and on the hardware-level implementation (e.g. cuDNN). Fourth, imbalanced convolution channel input/output (convolution layers whose input and output channel sizes differ substantially) was shown to slow down model inference [Ma et al.(2018)Ma, Zhang, Zheng, and Sun]. ResNets containing such layers may not speed up as readily even after pruning.

Appendix G Discussion on Model Robustness

Healthcare is a sensitive domain, and it is therefore important to discuss the effect of pruning on model bias and robustness. Hooker et al. studied model compression in detail and observed that compression techniques such as pruning exacerbate algorithmic bias [Hooker et al.(2020)Hooker, Moorosi, Clark, Bengio, and Denton]. They showed empirically that, while the overall accuracy remains unchanged, some classes bear a disproportionately high portion of the error. There has also been work in the literature on preserving adversarial robustness during pruning [Jian et al.(2022)Jian, Wang, Wang, Dy, and Ioannidis].