Abstract
Deep learning algorithms have achieved tremendous success in many medical imaging problems, leading to multiple commercial healthcare applications. To sustain the performance of these algorithms post-deployment, it is necessary to overcome catastrophic forgetting and continually evolve with data. While catastrophic forgetting could be managed using historical data, a fundamental challenge in healthcare is data privacy, where regulations restrict data sharing. In this paper, we present a single, unified mathematical framework - feature transformers - for handling the myriad variants of lifelong learning to overcome catastrophic forgetting without compromising data privacy. We report state-of-the-art results for lifelong learning on the iCIFAR100 dataset and also demonstrate lifelong learning on medical imaging applications - X-ray pneumothorax classification and ultrasound cardiac view classification.
1 Introduction
Two major manifestations of lifelong learning are domain adaptation and new task learning. Despite the positive impact of deep learning on multiple healthcare applications, the highly diverse nature of medical imaging data, owing to factors like imaging hardware, demographics, acquisition protocols, inter-subject variability, and pathologies, necessitates frequent domain adaptation. In [3], the authors report that the performance of MRI segmentation algorithms suffers when tested on data from a different protocol/vendor. New task learning - the ability to augment existing neural networks with newer capabilities - is critical to enable expert clinicians like radiologists to add new capabilities to existing applications and to control the explosion in the number of models for different applications. For example, in [1] the authors enable a model capable of performing brain structure segmentation to also perform white matter lesion segmentation.
Lifelong learning is trivial if the entire dataset is available during every episode. However, as access to past data is not practical and often impossible due to regulatory constraints, these algorithms are trained sequentially using only the latest data. This leads to neural networks exhibiting a phenomenon known as catastrophic forgetting [4] - the inability of models to retain past knowledge while learning from new data. To overcome this problem, researchers have attempted to retain past knowledge through proxy information, a technique dubbed pseudo-rehearsal. Popular methods of pseudo-rehearsal include the use of a knowledge distillation loss [5], rehearsal using exemplars [8, 9], and the use of generative models for retaining knowledge from previous episodes [2, 10, 12]. Regularization techniques [4, 6] are also used to constrain the parts of the network which are critical for performance on previous tasks.
All these methods suffer from one of two drawbacks rendering them unusable in medical imaging: (1) data privacy concerns arising from the storage of exemplar images, or (2) the ability to work on only one of the variants of lifelong learning. In this paper, we propose one approach for all manifestations of lifelong learning that handles both data privacy and catastrophic forgetting. We utilize an external memory to store only the features representing past data and learn richer and newer representations incrementally through transformation neural networks - feature transformers. Our major contributions are as follows:
- Formulate a generic mathematical framework - feature transformers - to cater to domain adaptation and new task learning.
- Ensure data privacy by storing only features from previous episodes while successfully combating catastrophic forgetting.
- Demonstrate exemplary results on two challenging problems of cardiac ultrasound view classification and pneumothorax detection from X-ray images.
Section 2 describes our approach. Section 3 contains experiments and results.
2 Life-Long Learning via Feature Transformations
A neural network classifier, parameterized by \((\varvec{\theta }, \varvec{\kappa })\), is a composition of a feature extractor \(\varvec{\varPhi }_{\varvec{\theta }}: \varvec{{X}}\rightarrow \varvec{F}\) and a classifier \(\varvec{\varPsi }_{\varvec{\kappa }}\), so that \(\varvec{\varPsi }_{\varvec{\kappa }}\circ \varvec{\varPhi }_{\varvec{\theta }} :\varvec{{X}}\rightarrow [{C}]\), where \(\varvec{{X}}\) is the space of input data and \(\varvec{F}\) is a space of low-dimensional feature vectors.
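This decomposition can be made concrete with a minimal NumPy sketch (illustrative only; the weights here are random stand-ins, not a trained network): a dense-ReLU feature extractor followed by a linear classifier, composed to map inputs to class labels.

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(x, W):
    """Feature extractor Phi_theta: X -> F (a single dense layer with ReLU)."""
    return np.maximum(x @ W, 0.0)

def psi(f, V):
    """Classifier Psi_kappa: F -> class scores; argmax gives a label in [C]."""
    return f @ V

W = rng.normal(size=(8, 4))   # theta: maps 8-dim inputs to 4-dim features
V = rng.normal(size=(4, 3))   # kappa: maps features to 3 class scores

x = rng.normal(size=(2, 8))                # a batch of 2 inputs
features = phi(x, W)                       # low-dimensional features in F
labels = psi(features, V).argmax(axis=1)   # Psi o Phi : X -> [C]
```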
In the lifelong learning setup, at any time \(t-1\), the model optimally classifies all the seen data \(\cup _{t'=0}^{t-1}{X}^{(t')}\) into the classes \([{C}^{(t-1)}]\) and the corresponding features \({\mathcal {F}}^{(t-1)}\) are well separated. At \(t\), when new training data \({D}^{(t)}=({X}^{(t)}, {Y}^{(t)})\) is encountered, features extracted using the old feature extractor \(\varvec{\varPhi }_{\varvec{\theta }^{(t-1)}}\) are not guaranteed to be optimized for classifying the new data and new classes. To alleviate this, we propose to change the feature representation at time \(t\), just before the classification stage. We achieve this by defining a feature transformer \(\varvec{\varPhi }_{\varDelta \varvec{\theta }^{(t)}}\), parameterized by \(\varDelta \varvec{\theta }^{(t)}\), which maps any feature extracted by \(\varvec{\varPhi }_{\varvec{\theta }^{(t-1)}}\) to a new representation. The new feature extractor is now given by \(\varvec{\varPhi }_{\varvec{\theta }^{(t)}}\triangleq \varvec{\varPhi }_{\varDelta \varvec{\theta }^{(t)}} \circ \varvec{\varPhi }_{\varvec{\theta }^{(t-1)}}\), where \(\varvec{\theta }^{(t)}\triangleq \varvec{\theta }^{(t-1)}\cup \varDelta \varvec{\theta }^{(t)}\). Practically, this is realized by augmenting the capacity of the feature extractor using dense layers (see Note 1).
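The key property of the composition \(\varvec{\varPhi }_{\varDelta \varvec{\theta }^{(t)}} \circ \varvec{\varPhi }_{\varvec{\theta }^{(t-1)}}\) is that the transformer can be applied directly to stored features, without ever revisiting raw images. A toy sketch with random dense-ReLU layers (an illustration of the composition, not the authors' architecture):

```python
import numpy as np

rng = np.random.default_rng(1)

def dense_relu(x, W):
    return np.maximum(x @ W, 0.0)

W_old = rng.normal(size=(8, 4))    # frozen extractor weights from episode t-1
W_delta = rng.normal(size=(4, 4))  # new transformer parameters Delta_theta_t

def phi_old(x):
    return dense_relu(x, W_old)

def phi_new(x):
    # Phi_theta_t = Phi_Delta_theta_t o Phi_theta_(t-1)
    return dense_relu(phi_old(x), W_delta)

x_imgs = rng.normal(size=(5, 8))
stored = phi_old(x_imgs)                 # what memory holds from episode t-1
remapped = dense_relu(stored, W_delta)   # transformer applied to stored features
# `remapped` equals phi_new(x_imgs): no raw images needed to update features.
```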
Remembering History via Memory: The set of all extracted features \({\mathcal {F}}^{(t-1)}\) serves as a good abstraction of the model for all the tasks and data encountered till \(t-1\). We realize our pseudo-rehearsal strategy through the use of a finite memory module \(\mathcal {M}\), equipped with \(\texttt {READ}()\), \(\texttt {WRITE}()\) and \(\texttt {ERASE}()\) procedures, that can store a subset of \({\mathcal {F}}^{(t-1)}\) and retrieve the same at \(t\).
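The paper only names the three memory procedures, so the following class layout is an assumption; it shows one minimal way \(\mathcal {M}\) could be realized as a store of feature/label pairs (never raw images):

```python
import numpy as np

class FeatureMemory:
    """Finite memory M holding feature/label pairs, never raw images."""

    def __init__(self):
        self._feats, self._labels = [], []

    def write(self, features, labels):
        # WRITE(): append a batch of features and their labels
        self._feats.append(np.asarray(features))
        self._labels.append(np.asarray(labels))

    def read(self):
        # READ(): retrieve everything stored so far as flat arrays
        return np.concatenate(self._feats), np.concatenate(self._labels)

    def erase(self):
        # ERASE(): drop all stored history (e.g. after re-writing F^(t))
        self._feats, self._labels = [], []

mem = FeatureMemory()
mem.write(np.ones((3, 4)), np.zeros(3, dtype=int))
mem.write(np.zeros((2, 4)), np.ones(2, dtype=int))
feats, labels = mem.read()
```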
Ensuring Class-Separation via Composite Loss: In all our training procedures, we augment the classification loss with center-loss (described in [14]): \(J(\theta , \kappa ) = {classification\, loss}_{(\varvec{\theta }, \varvec{\kappa })} + \lambda \cdot \,{center\text {-}loss}_{(\varvec{\theta })}\). This composite loss explicitly forces the transformed features to have class-wise separation, in addition to classification performance. We train the feature transformer at any \(t>0\) by invoking \(\texttt {TRAIN}(\varDelta \varvec{\theta }^{(t)}, \varvec{\kappa }^{(t)}; \;{D}^{(t)})\) with the combined set of features and obtain transformed features \({\mathcal {F}}^{(t)}\), which replace \({\mathcal {F}}^{(t-1)}\) in memory.
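A minimal NumPy sketch of the composite objective, assuming cross-entropy as the classification loss and the center-loss of Wen et al. [14] (squared distance of each feature to its class centroid); the toy arrays and function names are illustrative:

```python
import numpy as np

def cross_entropy(scores, y):
    """Mean cross-entropy of raw class scores against integer labels y."""
    s = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(s) / np.exp(s).sum(axis=1, keepdims=True)
    return -np.log(p[np.arange(len(y)), y]).mean()

def center_loss(features, y, centers):
    """Center-loss [14]: mean squared distance to the class centroid."""
    return 0.5 * ((features - centers[y]) ** 2).sum(axis=1).mean()

def composite_loss(scores, features, y, centers, lam=0.2):
    """J = classification loss + lambda * center-loss (lambda = 0.2, Sect. 3.1)."""
    return cross_entropy(scores, y) + lam * center_loss(features, y, centers)
```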
Figure 1 shows feature transformers in action using pneumothorax classification (Sect. 3.2). A binary classifier is trained on 6000 images at time index \((t-1)\). As shown by the t-SNE plot (A), the feature extractor \(\varvec{\varPhi }_{\varvec{\theta }^{(t-1)}}\) produces features which are well separated, and these get stored in memory \(\mathcal {M}\). However, at time \(t\), when a set of 2000 new images is encountered, \(\varvec{\varPhi }_{\varvec{\theta }^{(t-1)}}\) produces features that are scattered (t-SNE plot (B)). The feature transformer learns a new representation using the (well-separated) features in \(\mathcal {M}\) as well as the poorly separated features from the new data. This ensures good class separation for all images encountered until time \(t\) (t-SNE plots (C) and (D)). This is repeated for all time indices \(t\).
To enable practicality of the proposed approach, we ensure that the memory footprint is limited (Sect. 4.1) by storing only a subset of history selected by a sampling operator \(\mathcal S\). Algorithm 1 presents the pseudocode for our framework.
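One episode of this procedure can be sketched end-to-end as follows (a runnable illustration of our reading of the algorithm; the actual `TRAIN` call is left as a comment, and all sizes are toy values):

```python
import numpy as np

rng = np.random.default_rng(2)
W_prev = rng.normal(size=(8, 4))

def phi_prev(x):
    """Frozen feature extractor from episode t-1."""
    return np.maximum(x @ W_prev, 0.0)

def sample_subset(feats, labels, keep=0.25):
    """Sampling operator S: retain only a fraction of history (Sect. 4.1)."""
    k = max(1, int(keep * len(feats)))
    idx = rng.choice(len(feats), size=k, replace=False)
    return feats[idx], labels[idx]

# History: features (not images) remembered from earlier episodes.
hist_feats = rng.normal(size=(12, 4))
hist_labels = rng.integers(0, 2, size=12)

# Episode t: new images arrive; old images are no longer accessible.
new_x = rng.normal(size=(6, 8))
new_labels = rng.integers(0, 2, size=6)

kept_f, kept_y = sample_subset(hist_feats, hist_labels)
train_f = np.concatenate([kept_f, phi_prev(new_x)])
train_y = np.concatenate([kept_y, new_labels])
# TRAIN(Delta_theta_t, kappa_t; train_f, train_y) would be invoked here,
# and the transformed features F^(t) written back to memory.
```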

3 Experiments and Results
We first demonstrate our algorithm as a generic lifelong learner by conducting experiments on the iCIFAR100 dataset. Then, we proceed to demonstrate the efficacy of feature transformers on two real-world problems in medical imaging.
3.1 Conventional Continual Learning Paradigms on iCIFAR100
The iCIFAR100 dataset contains images from 100 different classes and is a popular choice for comparing lifelong learning approaches [8]. The two conventional lifelong learning evaluation settings on iCIFAR100 are multi-task (MT) and single incremental task (SIT) [7], where the 100-class dataset is fed in a sequence of 20 tasks comprising 5 classes each. In the MT setting, the algorithm is evaluated on learning an isolated set of new tasks, while SIT entails learning all the classes incrementally. In our experiments, we used a basic CNN architecture inspired by the popular CIFAR architecture. At every episode, our feature transformer network transforms features from the dense layers using 2 additional dense layers of feature length 256 and is optimized for the composite loss with \(\lambda = 0.2\).
Multi Task Setting: We validate our algorithm against state-of-the-art methods [8] using 2 different metrics: accuracy and backward transfer (BWT). BWT is a quantitative metric that measures catastrophic forgetting on older tasks after learning new tasks. As shown in Fig. 2(a), feature transformers provide an improvement of \(15\%\) while showing negligible catastrophic forgetting, outperforming all the methods reported in [8] by a large margin.
Single Incremental Task Setting: Here, we compare our results to the two obvious lifelong learners - naive and cumulative learners - which serve as the lower and upper bounds of performance, respectively. The naive learner finetunes the entire network on the latest episode's data, while the cumulative learner retrains the model on data accumulated over all the episodes seen so far. As seen in Fig. 2(b), after 20 episodes, the naive learner performs very poorly, while the feature transformer displays an impressive validation accuracy of \(40\%\), compared to the cumulative learner at \(50\%\). Our method significantly outperforms AR1 [7] and iCaRL [9], achieving best-in-class performance close to the gold-standard cumulative learner. Notably, this is despite iCaRL storing exemplar images while we store only features from previous episodes.
3.2 Lifelong Learning for Pneumothorax Classification
Our first medical imaging application for lifelong learning is pneumothorax identification from X-ray images, a relevant as well as challenging problem owing to the diversity of disease manifestation. We utilize a subset of the ChestXRay [13] dataset, which consists of chest X-rays labeled with corresponding diagnoses. We simulated incremental learning by providing the 8K training images in incremental batches of 2K and measured the performance on a held-out validation set of \(\sim \)2K images, mimicking a practical scenario of a model deployed to detect pneumothorax in a hospital with data arriving incrementally. Figure 3(a) establishes the baselines for the experiment. As in the previous experiment, the naive learner - which finetunes only on the latest episode - and the cumulative learner - which retrains on the entire data seen so far - define the performance bounds.
Experimental Details and Results: We used a pre-trained VGG network [11] as the base network and explored the use of features from different layers of the VGG network: after the two pooling layers and the fully connected layers. The feature transformer network essentially had one additional dense layer per step.
Figure 3(a) captures the performance of the feature transformer with the base features extracted from the first pooling layer - block3_pool. After the fourth batch of data, the performance of feature transformers almost matches that of cumulative training. This performance is achieved despite not having access to the full images but only the stored features. Figure 3(b) also presents the performance of the feature transformer depending upon the base features used. It can be noted that performance is lowest for the layer closest to the classification layer - fc_2. This is intuitively satisfying because the later layers of a deep neural network are more finely tuned to the specific task, depriving the feature transformer of generic features.
3.3 Lifelong Learning for Cardiac View Classification
Cardiac ultrasound (U/S) images exhibit a high degree of variability owing to the skill of the sonographer, acquisition parameters, types of probe, patient echogenicity and age. In this paper, we pursue classification of the four common views of a standard cardiac study - 4-chamber (4ch), 2-chamber (2ch), parasternal long axis (PLAX) and parasternal short axis (PSAX). We set up a series of lifelong episodes to investigate the proposed approach for domain adaptation and new task learning.
In the first episode, we train a classifier model to classify 2 views from adult subjects - 4ch and PLAX. To study domain adaptation, we evaluate its performance on pediatric data for the same classes. As seen in Fig. 4, images from pediatric subjects have higher contrast and smoother speckles compared to adult subjects, primarily because of the use of high-frequency probes and lower depth of penetration. In the next episode, we simulate new task learning by adding 2 additional views (2ch and PSAX), followed by the final episode where pediatric images of the new views are presented. The number of images per view was 5K, and testing used a mutually exclusive set of 2K images per view. Table 1 shows the comparison of the proposed approach with naive and cumulative learners. It should be noted that fine-tuning on pediatric data in episode 2 results in a performance drop of 7%, while the cumulative learner and feature transformer do not suffer from effects of the domain change. The naive learner fails completely when presented with two new views in episode 3, while the feature transformer achieves a respectable 87%, demonstrating new-task learning capabilities without catastrophic forgetting. This is further demonstrated in episode 4, where domain adaptation on pediatric views is achieved with promising results.
4 Discussion
In this final section, we examine two important facets concerning the practicality of the proposed approach: (1) memory management and (2) network capacity. We examine the amount of history necessary for successful lifelong learning and obtain satisfactory results by retaining only \(25\%\) of past samples in memory.
4.1 Memory Management
As discussed in Sect. 2, at every episode of lifelong learning we selectively retain only a small portion of history. Two strategies can be pursued for this: (1) random sampling, where we randomly retain a percentage of the memory, and (2) importance sampling, where we retain samples that are farther from the cluster centroids, given that we optimize for center loss at every episode. For the pneumothorax classification problem, Table 2 shows the effect of random sampling of features on the performance. We would also like to point out that storing low-dimensional features is more economical than storing entire images.
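The importance-sampling strategy can be sketched as follows (our illustration of the idea, not the authors' implementation): per class, keep the fraction of stored features farthest from the class centroid, since under a center loss these are the least redundant samples.

```python
import numpy as np

def importance_retain(feats, labels, keep=0.25):
    """Per class, keep the `keep` fraction of features farthest from the centroid."""
    kept = []
    for c in np.unique(labels):
        cls = feats[labels == c]
        centroid = cls.mean(axis=0)
        dist = np.linalg.norm(cls - centroid, axis=1)
        k = max(1, int(keep * len(cls)))
        kept.append(cls[np.argsort(dist)[-k:]])   # farthest-from-centroid samples
    return np.concatenate(kept)
```

Random sampling, by contrast, would simply draw the same fraction of indices uniformly without replacement.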
4.2 Controlling the Growth of Network Capacity
Our framework can be formulated as a base feature extractor plus feature transformer layers adapting the features for new tasks. To check the growth of feature transformer layers, the base feature extractor remains fixed, and only the base features are stored rather than the latest updated features. This makes existing feature transformer layers reusable for future episodes. We varied the size of the feature transformers and observed the difference in performance. Table 3 shows that halving the additional capacity retains the performance on the pneumothorax dataset. These experiments (along with Sect. 4.1) clearly demonstrate the power of learning separable representations continually and make our proposed approach practically feasible.
4.3 Information Loss, Incremental Capacity, Data Privacy
As shown in Fig. 3(b), the feature transformer becomes less effective if the base features do not contain enough relevant information. This also means that the additional capacity that every feature transformer adds may not help, or may in fact be counter-productive. If the base features are extracted from layers close to the input image, there is a risk of traceability, which violates the data privacy requirement we want to accomplish. We see this as a potential trade-off between performance and data privacy which we will investigate in future work.
5 Conclusion
In this work, we present feature transformers - a privacy-preserving lifelong learning framework that paves the way for practical lifelong learning for healthcare applications. In addition to achieving state-of-the-art performance on the iCIFAR dataset, we demonstrate exemplary performance on two challenging medical imaging applications. Another major contribution is a single framework to perform and switch between domain adaptation and new task learning. Future work will involve optimizing for a limited memory budget, controlling model growth, and extending to applications like segmentation, regression, etc.
Notes
1. There is no restriction on the kind of layers to be used, but in this work we use only fully connected layers.
References
1. Baweja, C., Glocker, B., Kamnitsas, K.: Towards continual learning in medical imaging. arXiv preprint arXiv:1811.02496 (2018)
2. He, C., Wang, R., Shan, S., Chen, X.: Exemplar-supported generative reproduction for class incremental learning. In: 29th BMVC, 3–6 September 2018
3. Karani, N., Chaitanya, K., Baumgartner, C., Konukoglu, E.: A lifelong learning approach to brain MR segmentation across scanners and protocols. In: Frangi, A.F., Schnabel, J.A., Davatzikos, C., Alberola-López, C., Fichtinger, G. (eds.) MICCAI 2018. LNCS, vol. 11070, pp. 476–484. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00928-1_54
4. Kirkpatrick, J., et al.: Overcoming catastrophic forgetting in neural networks. In: Proceedings of the National Academy of Sciences (2017)
5. Li, Z., Hoiem, D.: Learning without forgetting. IEEE Trans. Pattern Anal. Mach. Intell. 40, 2935–2947 (2017)
6. Liu, X., et al.: Rotate your networks: better weight consolidation and less catastrophic forgetting. arXiv preprint arXiv:1802.02950 (2018)
7. Lomonaco, V., Maltoni, D.: Core50: a new dataset and benchmark for continuous object recognition. arXiv preprint arXiv:1705.03550 (2017)
8. Lopez-Paz, D., et al.: Gradient episodic memory for continual learning. In: Advances in Neural Information Processing Systems, pp. 6467–6476 (2017)
9. Rebuffi, S.A., Kolesnikov, A., Lampert, C.H.: iCaRL: incremental classifier and representation learning. In: 2017 IEEE CVPR, pp. 5533–5542 (2017)
10. Shin, H., Lee, J.K., Kim, J., Kim, J.: Continual learning with deep generative replay. In: Advances in Neural Information Processing Systems, pp. 2990–2999 (2017)
11. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
12. Venkatesan, R., Venkateswara, H., Panchanathan, S., Li, B.: A strategy for an uncompromising incremental learner. arXiv preprint arXiv:1705.00744 (2017)
13. Wang, X., et al.: ChestX-ray8: hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In: 2017 IEEE CVPR, pp. 3462–3471, July 2017
14. Wen, Y., Zhang, K., Li, Z., Qiao, Y.: A discriminative feature learning approach for deep face recognition. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9911, pp. 499–515. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46478-7_31
© 2019 Springer Nature Switzerland AG
Ravishankar, H., Venkataramani, R., Anamandra, S., Sudhakar, P., Annangi, P. (2019). Feature Transformers: Privacy Preserving Lifelong Learners for Medical Imaging. In: Shen, D., et al. Medical Image Computing and Computer Assisted Intervention – MICCAI 2019. MICCAI 2019. Lecture Notes in Computer Science(), vol 11767. Springer, Cham. https://doi.org/10.1007/978-3-030-32251-9_38