keynote

Empower an End-to-end Scalable and Interpretable Data Science Ecosystem using Statistics, AI and Domain Science

Author: Xihong Lin

KDD '24: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

Pages 3 - 4

https://doi.org/10.1145/3637528.3672194

Published: 24 August 2024

Abstract

The data science ecosystem encompasses data fairness; statistical, ML, and AI methods and tools; interpretable data analysis and results; and trustworthy decision-making. Rapid advancements in AI have revolutionized data utilization and enabled machines to learn from data more effectively. Statistics, as the science of learning from data while accounting for uncertainty, plays a pivotal role in addressing complex real-world problems and facilitating trustworthy decision-making. In this talk, I will discuss the challenges and opportunities involved in building an end-to-end scalable and interpretable data science ecosystem that integrates statistics, ML/AI, and genomic and health science, using the analysis of whole genome sequencing studies and biobanks as an example. Biobanks collect whole genome data, electronic health records, and epidemiological data. I will illustrate key points through the analysis of multi-ancestry whole genome sequencing studies and biobanks, discussing a few scalable and interpretable statistical and ML/AI methods, tools, and data science resources.

Specifically, first, data fairness and diversity form a critical pillar of a trustworthy data science ecosystem. More than 85% of genome-wide association study (GWAS) samples collected in the last 15 years are of European ancestry, resulting in disparities in genetic research. I will discuss the community effort to improve diversity in genetic studies over the last 10 years. I will present trans-ancestry polygenic risk scores (PRS) for predicting disease risk, built from millions of common genetic variants across the genome using transfer learning and genetic association summary statistics, which leverage the large GWAS sample sizes of European populations together with the smaller sample sizes of under-represented populations. The performance of deep learning methods for PRS will also be discussed. Second, scalability on cloud platforms is critical for large-scale, affordable analysis of multi-ancestry biobanks and whole genome studies. I will discuss improving scalability in cloud computing using interpretable sparsity via FastSparseGRM.
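For readers unfamiliar with polygenic risk scores, the following minimal Python sketch shows only the basic quantity involved: a weighted sum of an individual's allele dosages, with per-variant weights taken from association summary statistics. It does not implement the trans-ancestry or transfer learning methods of the talk. The function name, variant weights, and dosages are hypothetical values chosen for illustration.

```python
def polygenic_risk_score(dosages, weights):
    """Return the weighted sum of allele dosages for one individual.

    dosages: effect-allele counts per variant (0, 1, or 2; imputed values may be fractional)
    weights: per-variant effect sizes, e.g. estimated from GWAS summary statistics
    """
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must have the same length")
    return sum(w * g for w, g in zip(weights, dosages))


# Hypothetical example: three variants with made-up effect sizes.
example_weights = [0.12, -0.05, 0.30]
example_dosages = [2, 1, 0]
example_score = polygenic_risk_score(example_dosages, example_weights)
```

In practice such scores span millions of variants and are usually standardized within a reference population before being interpreted.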

To build an interpretable and powerful end-to-end ecosystem for rare variant analysis of large-scale whole genome sequencing (WGS) studies and biobanks, I will first introduce FAVOR, a multi-faceted variant functional annotation database and portal covering all 9 billion possible variants across the whole genome. I will discuss FAVOR-GPT, an LLM interface to the FAVOR functional annotation database that improves the user experience of navigating FAVOR, querying variant functional annotations, and calculating variant functional summary statistics. I will also discuss FAVORannotator, which can be used to functionally annotate any whole genome sequencing study, and STAAR and STAARpipeline, the WGS rare variant analysis pipeline that boosts the power of WGS rare variant association analysis by dynamically incorporating multi-faceted variant functional annotations. An extension that incorporates single-cell data into WGS analysis will also be discussed, as will ensemble methods that improve the power of rare variant association tests.
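As one illustration of how evidence from several rare variant tests can be combined into a single p-value, the sketch below implements the Cauchy combination rule on which ACAT is based: each p-value is mapped to a Cauchy-distributed statistic, the statistics are averaged with weights, and the average is mapped back to a p-value. This is a minimal sketch, not code from STAAR or from the ensemble tests discussed in the talk; the function name and the equal default weights are assumptions.

```python
import math


def cauchy_combination(p_values, weights=None):
    """Combine p-values with the Cauchy combination rule.

    T = sum_i w_i * tan((0.5 - p_i) * pi) / sum_i w_i
    combined p = 0.5 - atan(T) / pi
    """
    if not p_values:
        raise ValueError("at least one p-value is required")
    if weights is None:
        # Assumption for this sketch: equal weights when none are supplied.
        weights = [1.0] * len(p_values)
    if len(weights) != len(p_values):
        raise ValueError("weights and p_values must have the same length")
    if any(w < 0 for w in weights) or sum(weights) <= 0:
        raise ValueError("weights must be non-negative with a positive sum")
    if any(not (0.0 < p < 1.0) for p in p_values):
        raise ValueError("p-values must lie strictly between 0 and 1")

    total_weight = sum(weights)
    statistic = sum(
        w * math.tan((0.5 - p) * math.pi) for w, p in zip(weights, p_values)
    ) / total_weight
    return 0.5 - math.atan(statistic) / math.pi


# Hypothetical p-values from three tests of the same variant set.
combined_p = cauchy_combination([0.01, 0.20, 0.60])
```

Because the heavy tail of the Cauchy distribution is dominated by the smallest p-values, the combined result stays sensitive to a single strong signal, and the closed-form mapping keeps the computation cheap at whole-genome scale.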

Cloud deployment of these resources and tools in several ecosystems will be presented, such as RAP for the UK Biobank, AnVIL for the NHGRI Genome Sequencing Program and All of Us, and BioData Catalyst for the NHLBI Trans-Omics for Precision Medicine (TOPMed) Program. This talk aims to ignite proactive and thought-provoking discussions, foster collaboration, and cultivate open-minded approaches to advance scientific discovery.

References

[1] Wu, M. C., Lee, S., Cai, T., Li, Y., Boehnke, M., and Lin, X. (2011). Rare variant association testing for sequencing data using the Sequence Kernel Association Test (SKAT). American Journal of Human Genetics, 89(1), 82-93. PMCID: PMC313581

[2] Liu, Y., Chen, S., Li, Z., Morrison, A. C., Boerwinkle, E., and Lin, X. (2019). ACAT: Fast and Powerful P-value Combination Method for Rare-variant Analysis in Sequencing Studies. American Journal of Human Genetics, 104(3), 410-421. PMCID: PMC6407498

[3] Li, X., Li, Z., Zhou, H., Gaynor, S., ..., Rotter, J., Willer, C. J., Peloso, G. M., Natarajan, P., and Lin, X. (2020). Dynamic incorporation of multiple in-silico functional annotations empowers rare variant association analysis of large whole genome sequencing studies at scale. Nature Genetics, 52, 969-983. PMID: 32839606. PMCID: PMC7483769

[4] Li, Z., Li, X., Zhou, H., Gaynor, S., Rotter, J., Natarajan, P., Peloso, G. M., and Lin, X. (2022). A framework for detecting noncoding rare variant associations of large-scale whole-genome sequencing studies. Nature Methods, 19, 1599-1611.

[5] Li, X., Quick, C., Zhou, H., Gaynor, S. M., Liu, Y., Rotter, J. I., Natarajan, P., Peloso, G. M., Li, Z., and Lin, X. (2023). Powerful, scalable, and resource-efficient meta-analysis of rare variant associations in large whole-genome sequencing studies. Nature Genetics, 55, 154-164. PMCID: PMC10084891

[6] Liu, Y., Liu, Z., and Lin, X. (2024). Ensemble testing for global null hypothesis. Journal of the Royal Statistical Society, Series B, 86(2), 461-486.

[7] McCaw, Z. R., Gao, J., Lin, X., and Gronsbell, J. (2024). Leveraging a machine learning derived surrogate phenotype to improve power for genome-wide association studies of partially missing phenotypes in population biobanks. Nature Genetics. https://doi.org/10.1038/s41588-024-01793-9

Index Terms

Empower an End-to-end Scalable and Interpretable Data Science Ecosystem using Statistics, AI and Domain Science
1. Applied computing
  1. Life and medical sciences
2. Computing methodologies
  1. Machine learning
    1. Machine learning algorithms
      1. Ensemble methods

Index terms have been assigned to the content through auto-classification.


Published In

KDD '24: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

August 2024

6901 pages

ISBN:9798400704901

DOI:10.1145/3637528

General Chairs: Ricardo Baeza-Yates (Northeastern University, USA), Francesco Bonchi (CENTAI / Eurecat, Italy)

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States


Funding Sources

US National Institutes of Health (NIH)

Conference

KDD '24: The 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

August 25 - 29, 2024

Barcelona, Spain

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
