DOI: 10.1145/3664647.3681313
research-article

A Medical Data-Effective Learning Benchmark for Highly Efficient Pre-training of Foundation Models

Published: 28 October 2024

Abstract

Foundation models, pre-trained on massive datasets, have achieved unprecedented generalizability. However, is it truly necessary to involve such vast amounts of data in pre-training, consuming extensive computational resources? This paper introduces data-effective learning, which aims to use data in the most impactful way to pre-train foundation models. This involves strategies that focus on data quality rather than quantity, ensuring that the data used for training has high informational value. Data-effective learning plays a profound role in accelerating foundation model training, reducing computational costs, and saving data storage, which matters because the volume of medical data has grown faster than many expected in recent years. However, owing to the lack of standards and a comprehensive benchmark, medical data-effective learning remains poorly studied. To address this gap, our paper introduces a comprehensive benchmark specifically for evaluating data-effective learning in the medical field. This benchmark includes a dataset with millions of data samples from 31 medical centers (DataDEL), a baseline method for comparison (MedDEL), and a new evaluation metric (NormDEL) to objectively measure data-effective learning performance. Our extensive experimental results show that the baseline MedDEL, using only 5% of the data, can achieve performance comparable to that obtained with the original large dataset. Establishing such an open data-effective learning benchmark is crucial for the medical foundation model research community because it facilitates efficient data use, promotes collaborative breakthroughs, and fosters the development of cost-effective, scalable, and impactful healthcare solutions. The benchmark is available on GitHub: https://github.com/shadow2469/Data-Effective-Learning-A-Comprehensive-Medical-Benchmark.git

    Published In

    MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
    October 2024
    11719 pages
    ISBN: 9798400706868
    DOI: 10.1145/3664647
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. data-effective learning
    2. endoscopic image processing
    3. foundation model
    4. medical benchmark

    Qualifiers

    • Research-article

    Conference

    MM '24
    MM '24: The 32nd ACM International Conference on Multimedia
    October 28 - November 1, 2024
    Melbourne VIC, Australia

    Acceptance Rates

    MM '24 Paper Acceptance Rate 1,150 of 4,385 submissions, 26%;
    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
