DOI: 10.1145/3637528.3672194
keynote

Empower an End-to-end Scalable and Interpretable Data Science Ecosystem using Statistics, AI and Domain Science

Published: 24 August 2024

Abstract

The data science ecosystem encompasses data fairness, statistical, ML and AI methods and tools, interpretable data analysis and results, and trustworthy decision-making. Rapid advancements in AI have revolutionized data utilization and enabled machines to learn from data more effectively. Statistics, as the science of learning from data while accounting for uncertainty, plays a pivotal role in addressing complex real-world problems and facilitating trustworthy decision-making. In this talk, I will discuss the challenges and opportunities in building an end-to-end, scalable, and interpretable data science ecosystem, using as an example the analysis of whole genome sequencing (WGS) studies and biobanks, which integrates statistics, ML/AI, and genomic and health science. Biobanks collect whole genome data, electronic health records, and epidemiological data. I will illustrate key points from the analysis of multi-ancestry whole genome sequencing studies and biobanks by discussing a few scalable and interpretable statistical and ML/AI methods, tools, and data science resources.
Specifically, first, data fairness and diversity are a critical pillar of a trustworthy data science ecosystem. More than 85% of genome-wide association study (GWAS) samples in the last 15 years are of European ancestry, resulting in disparities in genetic research. I will discuss the community effort to improve diversity in genetic studies over the last 10 years. I will present trans-ancestry polygenic risk scores (PRS) that use millions of common genetic variants across the genome to predict disease risk, leveraging the large GWAS sample sizes of European populations and the smaller sample sizes of under-represented populations through transfer learning and genetic association summary statistics. The performance of deep learning methods for PRS will also be discussed. Second, scalability on cloud platforms is critical for large-scale, affordable analysis of multi-ancestry biobanks and whole genome studies. I will discuss improving scalability in cloud computing using interpretable sparsity via FastSparseGRM.
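A polygenic risk score is, at its simplest, a weighted sum of a person's allele dosages, with weights taken from GWAS effect-size estimates. The following minimal Python sketch illustrates only that basic form. It is not the trans-ancestry transfer-learning method discussed in the talk, and the function name, effect sizes, and genotypes are hypothetical toy values.

```python
from typing import Sequence


def polygenic_risk_score(dosages: Sequence[float],
                         effect_sizes: Sequence[float]) -> float:
    """Return the weighted sum of allele dosages for one individual.

    dosages: copies of the effect allele at each variant (0, 1, or 2,
             or a fractional imputed dosage).
    effect_sizes: per-variant weights, e.g. log odds ratios taken from
                  GWAS summary statistics (assumed, not from the talk).
    """
    if len(dosages) != len(effect_sizes):
        raise ValueError("dosages and effect_sizes must have equal length")
    return sum(d * b for d, b in zip(dosages, effect_sizes))


# Hypothetical toy data: three variants, two individuals.
effect_sizes = [0.5, -0.25, 1.0]
person_a = [0, 1, 2]
person_b = [2, 2, 0]

score_a = polygenic_risk_score(person_a, effect_sizes)
score_b = polygenic_risk_score(person_b, effect_sizes)
print(score_a, score_b)
```

In practice the sum runs over millions of variants, and the weights are usually shrunk or re-estimated rather than taken directly from single-variant GWAS estimates.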
To build an interpretable and powerful end-to-end ecosystem for rare variant analysis of large-scale whole genome sequencing studies and biobanks, I will first introduce FAVOR, a multi-faceted variant functional annotation database and portal covering all 9 billion possible variants across the whole genome. I will discuss FAVOR-GPT, an LLM interface to the FAVOR functional annotation database that improves the user experience of navigating FAVOR, querying variant functional annotations, and calculating variant functional summary statistics. I will also discuss FAVORannotator, which can be used to functionally annotate any whole genome sequencing study, and STAAR and STAARpipeline, the WGS rare variant analysis pipeline that boosts the power of WGS rare variant association analysis by dynamically incorporating multi-faceted variant functional annotations. Extensions that incorporate single-cell data in WGS analysis will also be discussed, as will ensemble methods that improve the power of rare variant association tests.
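One common way to pool evidence from several association tests, for example tests weighted by different functional annotations, is to combine their p-values. The sketch below uses the Cauchy combination rule as an illustration of that general idea. It is an assumption chosen for illustration, not a description of the specific ensemble methods or pipeline internals presented in the talk, and the function name and inputs are hypothetical.

```python
import math
from typing import Optional, Sequence


def cauchy_combination(p_values: Sequence[float],
                       weights: Optional[Sequence[float]] = None) -> float:
    """Combine p-values with the Cauchy combination rule.

    Each p-value is mapped to a standard Cauchy variate via
    tan((0.5 - p) * pi); the weighted average of these variates is
    mapped back to a single p-value through the Cauchy tail.
    """
    if len(p_values) == 0:
        raise ValueError("at least one p-value is required")
    if any(not 0.0 < p < 1.0 for p in p_values):
        raise ValueError("p-values must lie strictly between 0 and 1")
    if weights is None:
        weights = [1.0] * len(p_values)  # equal weights by default
    if len(weights) != len(p_values) or any(w <= 0 for w in weights):
        raise ValueError("weights must be positive, one per p-value")

    total = sum(weights)
    statistic = sum((w / total) * math.tan((0.5 - p) * math.pi)
                    for w, p in zip(weights, p_values))
    return 0.5 - math.atan(statistic) / math.pi


# Hypothetical p-values from three tests of the same variant set.
combined = cauchy_combination([0.01, 0.5, 0.5])
print(combined)
```

A single small p-value dominates the combined result, which is why this kind of rule is attractive when only one of several annotation-weighted tests is expected to carry signal.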
Cloud deployment of these resources and tools in several ecosystems will be presented, such as RAP for the UK Biobank, AnVIL for the NHGRI Genome Sequencing Program and All of Us, and BioData Catalyst for the NHLBI Trans-omics Precision Medicine Program (TOPMed). This talk aims to ignite proactive and thought-provoking discussions, foster collaboration, and cultivate open-minded approaches to advance scientific discovery.



Published In

KDD '24: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
August 2024
6901 pages
ISBN: 9798400704901
DOI: 10.1145/3637528
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery
New York, NY, United States

Author Tags

1. ai
2. annotation
3. biobanks
4. electronic health records
5. ensemble methods
6. gpt
7. integrative analysis
8. interpretability
9. machine learning
10. scalability
11. sparsity
12. statistics
13. summary statistics
14. whole genome sequencing studies

Acceptance Rates

Overall acceptance rate: 1,133 of 8,635 submissions (13%)
