A Medical Data-Effective Learning Benchmark for Highly Efficient Pre-training of Foundation Models

Authors: Wenxuan Yang, Weimin Tan, Yuqi Sun, Bo Yan

MM '24: Proceedings of the 32nd ACM International Conference on Multimedia

Pages 3499 - 3508

https://doi.org/10.1145/3664647.3681313

Published: 28 October 2024

Abstract

Foundation models, pre-trained on massive datasets, have achieved unprecedented generalizability. However, is it truly necessary to involve such vast amounts of data in pre-training, consuming extensive computational resources? This paper introduces data-effective learning, which aims to use data in the most impactful way to pre-train foundation models. This involves strategies that focus on data quality rather than quantity, ensuring that the data used for training has high informational value. Data-effective learning plays a profound role in accelerating foundation model training, reducing computational costs, and saving data storage, which is very important as the volume of medical data in recent years has grown beyond many people's expectations. However, due to the lack of standards and of a comprehensive benchmark, medical data-effective learning remains poorly studied. To address this gap, our paper introduces a comprehensive benchmark specifically for evaluating data-effective learning in the medical field. This benchmark includes a dataset with millions of data samples from 31 medical centers (DataDEL), a baseline method for comparison (MedDEL), and a new evaluation metric (NormDEL) to objectively measure data-effective learning performance. Our extensive experimental results show that the baseline MedDEL can achieve performance comparable to the original large dataset with only 5% of the data. Establishing such an open data-effective learning benchmark is crucial for the medical foundation model research community because it facilitates efficient data use, promotes collaborative breakthroughs, and fosters the development of cost-effective, scalable, and impactful healthcare solutions. The benchmark can be accessed at https://github.com/shadow2469/Data-Effective-Learning-A-Comprehensive-Medical-Benchmark.git.
