
Showing 1–50 of 141 results for author: Cai, T

Searching in archive stat. Search in all archives.
  1. arXiv:2407.20073  [pdf, other]

    stat.ME

    Transfer Learning Targeting Mixed Population: A Distributional Robust Perspective

    Authors: Keyao Zhan, Xin Xiong, Zijian Guo, Tianxi Cai, Molei Liu

    Abstract: Despite recent advances in transfer learning with multiple source data sets, there is still a lack of developments for mixture target populations that could be approximated through a composite of the sources due to certain key factors like ethnicity in practice. To address this open problem under distributional shifts of covariates and outcome models as well as the absence of accurate labels on target, w…

    Submitted 29 July, 2024; originally announced July 2024.

  2. arXiv:2406.20088  [pdf, other]

    math.ST stat.ME stat.ML

    Minimax And Adaptive Transfer Learning for Nonparametric Classification under Distributed Differential Privacy Constraints

    Authors: Arnab Auddy, T. Tony Cai, Abhinav Chakraborty

    Abstract: This paper considers minimax and adaptive transfer learning for nonparametric classification under the posterior drift model with distributed differential privacy constraints. Our study is conducted within a heterogeneous framework, encompassing diverse sample sizes, varying privacy parameters, and data heterogeneity across different servers. We first establish the minimax misclassification rate,…

    Submitted 28 June, 2024; originally announced June 2024.

    MSC Class: 62G08; 62G20

  3. arXiv:2406.06755  [pdf, other]

    math.ST cs.LG stat.ML

    Optimal Federated Learning for Nonparametric Regression with Heterogeneous Distributed Differential Privacy Constraints

    Authors: T. Tony Cai, Abhinav Chakraborty, Lasse Vuursteen

    Abstract: This paper studies federated learning for nonparametric regression in the context of distributed samples across different servers, each adhering to distinct differential privacy constraints. The setting we consider is heterogeneous, encompassing both varying sample sizes and differential privacy constraints across servers. Within this framework, both global and pointwise estimation are considered,…

    Submitted 10 June, 2024; originally announced June 2024.

    Comments: 49 pages total, consisting of an article (24 pages) and a supplement (25 pages)

    MSC Class: 62G08; 62C20; 68P27; 62F30;

  4. arXiv:2406.06749  [pdf, other]

    math.ST cs.LG stat.ML

    Federated Nonparametric Hypothesis Testing with Differential Privacy Constraints: Optimal Rates and Adaptive Tests

    Authors: T. Tony Cai, Abhinav Chakraborty, Lasse Vuursteen

    Abstract: Federated learning has attracted significant recent attention due to its applicability across a wide range of settings where data is collected and analyzed across disparate locations. In this paper, we study federated nonparametric goodness-of-fit testing in the white-noise-with-drift model under distributed differential privacy (DP) constraints. We first establish matching lower and upper bound…

    Submitted 10 June, 2024; originally announced June 2024.

    Comments: 77 pages total; consisting of a main article (28 pages) and supplement (49 pages)

    MSC Class: 62G10; 62C20; 68P27; 62F30

  5. arXiv:2405.09493  [pdf, ps, other]

    stat.ML cs.LG

    C-Learner: Constrained Learning for Causal Inference and Semiparametric Statistics

    Authors: Tiffany Tianhui Cai, Yuri Fonseca, Kaiwen Hou, Hongseok Namkoong

    Abstract: Causal estimation (e.g. of the average treatment effect) requires estimating complex nuisance parameters (e.g. outcome models). To adjust for errors in nuisance parameter estimation, we present a novel correction method that solves for the best plug-in estimator under the constraint that the first-order error of the estimator with respect to the nuisance parameter estimate is zero. Our constrained…

    Submitted 22 May, 2024; v1 submitted 15 May, 2024; originally announced May 2024.

  6. arXiv:2405.06107  [pdf, other]

    cs.LG cs.SC hep-ph hep-th stat.ML

    Transforming the Bootstrap: Using Transformers to Compute Scattering Amplitudes in Planar N = 4 Super Yang-Mills Theory

    Authors: Tianji Cai, Garrett W. Merz, François Charton, Niklas Nolte, Matthias Wilhelm, Kyle Cranmer, Lance J. Dixon

    Abstract: We pursue the use of deep learning methods to improve state-of-the-art computations in theoretical high-energy physics. Planar N = 4 Super Yang-Mills theory is a close cousin to the theory that describes Higgs boson production at the Large Hadron Collider; its scattering amplitudes are large mathematical expressions containing integer coefficients. In this paper, we apply Transformers to predict t…

    Submitted 9 May, 2024; originally announced May 2024.

    Comments: 26+10 pages, 9 figures, 7 tables, application of machine learning aimed at physics and machine learning audience

    Report number: SLAC-PUB-17774

  7. arXiv:2404.06676  [pdf]

    cs.LG eess.SP stat.AP

    Topological Feature Search Method for Multichannel EEG: Application in ADHD classification

    Authors: Tianming Cai, Guoying Zhao, Junbin Zang, Chen Zong, Zhidong Zhang, Chenyang Xue

    Abstract: In recent years, the preliminary diagnosis of Attention Deficit Hyperactivity Disorder (ADHD) using electroencephalography (EEG) has garnered attention from researchers. EEG, known for its expediency and efficiency, plays a pivotal role in the diagnosis and treatment of ADHD. However, the non-stationarity of EEG signals and inter-subject variability pose challenges to the diagnostic and classifica…

    Submitted 9 April, 2024; originally announced April 2024.

  8. arXiv:2403.14926  [pdf, other]

    stat.ML cs.LG

    Contrastive Learning on Multimodal Analysis of Electronic Health Records

    Authors: Tianxi Cai, Feiqing Huang, Ryumei Nakada, Linjun Zhang, Doudou Zhou

    Abstract: Electronic health record (EHR) systems contain a wealth of multimodal clinical data including structured data like clinical codes and unstructured data such as clinical notes. However, many existing EHR-focused studies have traditionally either concentrated on an individual modality or merged different modalities in a rather rudimentary fashion. This approach often results in the perception of stru…

    Submitted 21 March, 2024; originally announced March 2024.

    Comments: 34 pages

  9. arXiv:2401.12272  [pdf, other]

    stat.ML cs.LG

    Transfer Learning for Nonparametric Regression: Non-asymptotic Minimax Analysis and Adaptive Procedure

    Authors: T. Tony Cai, Hongming Pu

    Abstract: Transfer learning for nonparametric regression is considered. We first study the non-asymptotic minimax risk for this problem and develop a novel estimator called the confidence thresholding estimator, which is shown to achieve the minimax optimal risk up to a logarithmic factor. Our results demonstrate two unique phenomena in transfer learning: auto-smoothing and super-acceleration, which differe…

    Submitted 22 January, 2024; originally announced January 2024.

  10. arXiv:2401.03820  [pdf, other]

    math.ST cs.IT stat.ME stat.ML

    Optimal Differentially Private PCA and Estimation for Spiked Covariance Matrices

    Authors: T. Tony Cai, Dong Xia, Mengyue Zha

    Abstract: Estimating a covariance matrix and its associated principal components is a fundamental problem in contemporary statistics. While optimal estimation procedures have been developed with well-understood properties, the increasing demand for privacy preservation introduces new complexities to this classical problem. In this paper, we study optimal differentially private Principal Component Analysis (…

    Submitted 8 January, 2024; originally announced January 2024.

  11. arXiv:2312.15611  [pdf, other]

    stat.ME stat.ML

    Inference of Dependency Knowledge Graph for Electronic Health Records

    Authors: Zhiwei Xu, Ziming Gan, Doudou Zhou, Shuting Shen, Junwei Lu, Tianxi Cai

    Abstract: The effective analysis of high-dimensional Electronic Health Record (EHR) data, with substantial potential for healthcare research, presents notable methodological challenges. Employing predictive modeling guided by a knowledge graph (KG), which enables efficient feature selection, can enhance both statistical efficiency and interpretability. While various methods have emerged for constructing KGs…

    Submitted 24 December, 2023; originally announced December 2023.

  12. arXiv:2311.02574  [pdf, other]

    stat.ME

    Semi-supervised Estimation of Event Rate with Doubly-censored Survival Data

    Authors: Yang Wang, Qingning Zhou, Tianxi Cai, Xuan Wang

    Abstract: Electronic Health Record (EHR) has emerged as a valuable source of data for translational research. To leverage EHR data for risk prediction and subsequently clinical decision support, clinical endpoints are often time to onset of a clinical condition of interest. Precise information on clinical event times is often not directly available and requires labor-intensive manual chart review to ascerta…

    Submitted 5 November, 2023; originally announced November 2023.

    Comments: 44 pages, 9 figures

  13. arXiv:2309.06534  [pdf, other]

    cs.LG stat.ME

    Distributionally Robust Transfer Learning

    Authors: Xin Xiong, Zijian Guo, Tianxi Cai

    Abstract: Many existing transfer learning methods rely on leveraging information from source data that closely resembles the target data. However, this approach often overlooks valuable knowledge that may be present in different yet potentially related auxiliary samples. When dealing with a limited amount of target data and a diverse range of source models, our paper introduces a novel approach, Distributio…

    Submitted 12 September, 2023; originally announced September 2023.

  14. arXiv:2305.19997  [pdf, other]

    stat.ML math.ST

    Knowledge Graph Embedding with Electronic Health Records Data via Latent Graphical Block Model

    Authors: Junwei Lu, Jin Yin, Tianxi Cai

    Abstract: Due to the increasing adoption of electronic health records (EHR), large scale EHRs have become another rich data source for translational clinical research. Despite its potential, deriving generalizable knowledge from EHR data remains challenging. First, EHR data are generated as part of clinical care with data elements too detailed and fragmented for research. Despite recent progress in mapping…

    Submitted 31 May, 2023; originally announced May 2023.

  15. arXiv:2305.17608  [pdf, other]

    cs.LG cs.AI cs.CL math.OC stat.ML

    Reward Collapse in Aligning Large Language Models

    Authors: Ziang Song, Tianle Cai, Jason D. Lee, Weijie J. Su

    Abstract: The extraordinary capabilities of large language models (LLMs) such as ChatGPT and GPT-4 are in part unleashed by aligning them with reward models that are trained on human preferences, which are often represented as rankings of responses to prompts. In this paper, we document the phenomenon of reward collapse, an empirical observation where the prevailing ranking-based approach results i…

    Submitted 27 May, 2023; originally announced May 2023.

  16. arXiv:2305.17126  [pdf, other]

    cs.LG cs.AI cs.CL stat.ML

    Large Language Models as Tool Makers

    Authors: Tianle Cai, Xuezhi Wang, Tengyu Ma, Xinyun Chen, Denny Zhou

    Abstract: Recent research has highlighted the potential of large language models (LLMs) to improve their problem-solving capabilities with the aid of suitable external tools. In our work, we further advance this concept by introducing a closed-loop framework, referred to as LLMs As Tool Makers (LATM), where LLMs create their own reusable tools for problem-solving. Our approach consists of two phases: 1) to…

    Submitted 10 March, 2024; v1 submitted 26 May, 2023; originally announced May 2023.

    Comments: Code available at https://github.com/ctlllll/LLM-ToolMaker

  17. arXiv:2305.02334  [pdf, other]

    hep-th cond-mat.dis-nn cs.LG hep-ph stat.ML

    Structures of Neural Network Effective Theories

    Authors: Ian Banta, Tianji Cai, Nathaniel Craig, Zhengkang Zhang

    Abstract: We develop a diagrammatic approach to effective field theories (EFTs) corresponding to deep neural networks at initialization, which dramatically simplifies computations of finite-width corrections to neuron statistics. The structures of EFT calculations make it transparent that a single condition governs criticality of all connected correlators of neuron preactivations. Understanding of such EFTs…

    Submitted 3 May, 2023; originally announced May 2023.

    Comments: 7+13 pages, 5 figures

  18. arXiv:2305.00164  [pdf, other]

    math.ST stat.ME

    Estimation and inference for minimizer and minimum of convex functions: optimality, adaptivity and uncertainty principles

    Authors: T. Tony Cai, Ran Chen, Yuancheng Zhu

    Abstract: Optimal estimation and inference for both the minimizer and minimum of a convex regression function under the white noise and nonparametric regression models are studied in a nonasymptotic local minimax framework, where the performance of a procedure is evaluated at individual functions. Fully adaptive and computationally efficient algorithms are proposed and sharp minimax lower bounds are given f…

    Submitted 9 March, 2024; v1 submitted 29 April, 2023; originally announced May 2023.

    Journal ref: Ann. Statist. 52(1): 392-411 (February 2024)

  19. arXiv:2304.06808  [pdf, other]

    cs.LG stat.ML

    Active Cost-aware Labeling of Streaming Data

    Authors: Ting Cai, Kirthevasan Kandasamy

    Abstract: We study actively labeling streaming data, where an active learner is faced with a stream of data points and must carefully choose which of these points to label via an expensive experiment. Such problems frequently arise in applications such as healthcare and astronomy. We first study a setting when the data's inputs belong to one of $K$ discrete distributions and formalize this problem via a los…

    Submitted 4 July, 2023; v1 submitted 13 April, 2023; originally announced April 2023.

    Comments: Accepted by AISTATS 2023. 20 pages, 11 figures

  20. arXiv:2303.07152  [pdf, ps, other]

    math.ST cs.CR cs.LG stat.ME stat.ML

    Score Attack: A Lower Bound Technique for Optimal Differentially Private Learning

    Authors: T. Tony Cai, Yichen Wang, Linjun Zhang

    Abstract: Achieving optimal statistical performance while ensuring the privacy of personal data is a challenging yet crucial objective in modern data analysis. However, characterizing the optimality, particularly the minimax lower bound, under privacy constraints is technically difficult. To address this issue, we propose a novel approach called the score attack, which provides a lower bound on the differ…

    Submitted 13 March, 2023; originally announced March 2023.

    Comments: arXiv admin note: substantial text overlap with arXiv:2011.03900

    MSC Class: 62F30; 62J12; 62G05

  21. arXiv:2303.02011  [pdf, other]

    stat.ML cs.LG

    Diagnosing Model Performance Under Distribution Shift

    Authors: Tiffany Tianhui Cai, Hongseok Namkoong, Steve Yadlowsky

    Abstract: Prediction models can perform poorly when deployed to target distributions different from the training distribution. To understand these operational failure modes, we develop a method, called DIstribution Shift DEcomposition (DISDE), to attribute a drop in performance to different types of distribution shifts. Our approach decomposes the performance drop into terms for 1) an increase in harder but…

    Submitted 10 July, 2023; v1 submitted 3 March, 2023; originally announced March 2023.

  22. arXiv:2302.04970  [pdf, other]

    stat.ME

    Efficient Modeling of Surrogates to Improve Multi-source High-dimensional Biobank Studies

    Authors: Yue Liu, Molei Liu, Zijian Guo, Tianxi Cai

    Abstract: Surrogate variables in electronic health records (EHR) and biobank data play an important role in biomedical studies due to the scarcity or absence of chart-reviewed gold standard labels. We develop a novel approach named SASH for Surrogate-Assisted and data-Shielding High-dimensional integrative regression. It is a semi-supervised approach that efficiently leverages sizabl…

    Submitted 1 September, 2023; v1 submitted 9 February, 2023; originally announced February 2023.

  23. arXiv:2301.10392  [pdf, other]

    stat.ME math.ST

    Statistical Inference and Large-scale Multiple Testing for High-dimensional Regression Models

    Authors: T. Tony Cai, Zijian Guo, Yin Xia

    Abstract: This paper presents a selective survey of recent developments in statistical inference and multiple testing for high-dimensional regression models, including linear and logistic regression. We examine the construction of confidence intervals and hypothesis tests for various low-dimensional objectives such as regression coefficients and linear and quadratic functionals. The key technique is to gene…

    Submitted 24 January, 2023; originally announced January 2023.

  24. arXiv:2301.01381  [pdf, other]

    stat.ME math.ST stat.ML

    Testing High-dimensional Multinomials with Applications to Text Analysis

    Authors: T. Tony Cai, Zheng Tracy Ke, Paxton Turner

    Abstract: Motivated by applications in text mining and discrete distribution inference, we investigate the testing for equality of probability mass functions of $K$ groups of high-dimensional multinomial distributions. A test statistic, which is shown to have an asymptotic standard normal distribution under the null, is proposed. The optimal detection boundary is established, and the proposed test is shown…

    Submitted 24 November, 2023; v1 submitted 3 January, 2023; originally announced January 2023.

  25. arXiv:2301.00718  [pdf, other]

    stat.ME

    Robust Inference for Federated Meta-Learning

    Authors: Zijian Guo, Xiudi Li, Larry Han, Tianxi Cai

    Abstract: Synthesizing information from multiple data sources is critical to ensure knowledge generalizability. Integrative analysis of multi-source data is challenging due to the heterogeneity across sources and data-sharing constraints due to privacy concerns. In this paper, we consider a general robust inference framework for federated meta-learning of data from multiple sites, enabling statistical infer…

    Submitted 2 January, 2023; originally announced January 2023.

  26. arXiv:2211.16609  [pdf]

    stat.AP

    Harnessing electronic health records for real-world evidence

    Authors: Jue Hou, Rachel Zhao, Jessica Gronsbell, Brett K. Beaulieu-Jones, Griffin Webber, Thomas Jemielita, Shuyan Wan, Chuan Hong, Yucong Lin, Tianrun Cai, Jun Wen, Vidul A. Panickan, Clara-Lea Bonzel, Kai-Li Liaw, Katherine P. Liao, Tianxi Cai

    Abstract: While randomized controlled trials (RCTs) are the gold-standard for establishing the efficacy and safety of a medical treatment, real-world evidence (RWE) generated from real-world data (RWD) has been vital in post-approval monitoring and is being promoted for the regulatory process of experimental therapies. An emerging source of RWD is electronic health records (EHRs), which contain detailed inf…

    Submitted 29 November, 2022; originally announced November 2022.

    Comments: 39 pages, 1 figure, 1 table

  27. arXiv:2211.12612  [pdf, ps, other]

    stat.ML cs.LG math.ST

    Transfer Learning for Contextual Multi-armed Bandits

    Authors: Changxiao Cai, T. Tony Cai, Hongzhe Li

    Abstract: Motivated by a range of applications, we study in this paper the problem of transfer learning for nonparametric contextual multi-armed bandits under the covariate shift model, where we have data collected on source bandits before the start of the target bandit learning. The minimax rate of convergence for the cumulative regret is established and a novel transfer learning algorithm that attains the…

    Submitted 24 January, 2024; v1 submitted 22 November, 2022; originally announced November 2022.

    Comments: Accepted to the Annals of Statistics

  28. arXiv:2210.09298  [pdf, other]

    cs.LG cs.AI cs.CV stat.ML

    What Makes Convolutional Models Great on Long Sequence Modeling?

    Authors: Yuhong Li, Tianle Cai, Yi Zhang, Deming Chen, Debadeepta Dey

    Abstract: Convolutional models have been widely used in multiple domains. However, most existing models only use local convolution, making the model unable to handle long-range dependency efficiently. Attention overcomes this problem by aggregating global information but also makes the computational complexity quadratic to the sequence length. Recently, Gu et al. [2021] proposed a model called S4 inspired b…

    Submitted 17 October, 2022; originally announced October 2022.

    Comments: The code is available at https://github.com/ctlllll/SGConv

  29. arXiv:2209.13762  [pdf, other]

    stat.ML cs.LG

    Consensus Knowledge Graph Learning via Multi-view Sparse Low Rank Block Model

    Authors: Tianxi Cai, Dong Xia, Luwan Zhang, Doudou Zhou

    Abstract: Network analysis has been a powerful tool to unveil relationships and interactions among a large number of objects. Yet its effectiveness in accurately identifying important node-node interactions is challenged by the rapidly growing network size, with data being collected at an unprecedented granularity and scale. Common wisdom to overcome such high dimensionality is collapsing nodes into smaller…

    Submitted 27 September, 2022; originally announced September 2022.

  30. arXiv:2209.08414  [pdf, ps, other]

    stat.ME

    Towards Optimal Use of Surrogate Markers to Improve Power

    Authors: Xuan Wang, Layla Parast, Lu Tian, Tianxi Cai

    Abstract: Motivated by increasing pressure for decision makers to shorten the time required to evaluate the efficacy of a treatment such that treatments deemed safe and effective can be made publicly available, there has been substantial recent interest in using an earlier or easier to measure surrogate marker, $S$, in place of the primary outcome, $Y$. To validate the utility of a surrogate marker in these…

    Submitted 17 September, 2022; originally announced September 2022.

  31. arXiv:2209.08315  [pdf, other]

    stat.ME

    Using a Surrogate with Heterogeneous Utility to Test for a Treatment Effect

    Authors: Layla Parast, Tianxi Cai, Lu Tian

    Abstract: The primary benefit of identifying a valid surrogate marker is the ability to use it in a future trial to test for a treatment effect with shorter follow-up time or less cost. However, previous work has demonstrated potential heterogeneity in the utility of a surrogate marker. When such heterogeneity exists, existing methods that use the surrogate to test for a treatment effect while ignoring this…

    Submitted 17 September, 2022; originally announced September 2022.

  32. arXiv:2209.04977  [pdf, other]

    stat.ME

    Semi-supervised Triply Robust Inductive Transfer Learning

    Authors: Tianxi Cai, Mengyan Li, Molei Liu

    Abstract: In this work, we propose a semi-supervised triply robust inductive transfer learning (STRIFLE) approach, which integrates heterogeneous data from label rich source population and label scarce target population to improve the learning accuracy in the target population. Specifically, we consider a high dimensional covariate shift setting and employ two nuisance models, a density ratio model and an i…

    Submitted 11 September, 2022; originally announced September 2022.

  33. arXiv:2208.07927  [pdf, other]

    stat.ME

    Semi-supervised Transfer Learning for Evaluation of Model Classification Performance

    Authors: Linshanshan Wang, Xuan Wang, Katherine P. Liao, Tianxi Cai

    Abstract: In modern machine learning applications, frequent encounters of covariate shift and label scarcity have posed challenges to robust model training and evaluation. Numerous transfer learning methods have been developed to robustly adapt the model itself to some unlabeled target populations using existing labeled data in a source population. However, there is a paucity of literature on transferring p…

    Submitted 18 November, 2022; v1 submitted 16 August, 2022; originally announced August 2022.

    Comments: 3 figures, 2 tables

  34. arXiv:2208.05134  [pdf, other]

    stat.ME

    Doubly Robust Augmented Model Accuracy Transfer Inference with High Dimensional Features

    Authors: Doudou Zhou, Molei Liu, Mengyan Li, Tianxi Cai

    Abstract: Due to label scarcity and covariate shift happening frequently in real-world studies, transfer learning has become an essential technique to train models generalizable to some target populations using existing labeled source data. Most existing transfer learning research has been focused on model estimation, while there is a paucity of literature on transfer inference for model accuracy despite it…

    Submitted 8 November, 2022; v1 submitted 10 August, 2022; originally announced August 2022.

  35. arXiv:2206.05581  [pdf, other]

    stat.ML cs.LG stat.ME

    Federated Offline Reinforcement Learning

    Authors: Doudou Zhou, Yufeng Zhang, Aaron Sonabend-W, Zhaoran Wang, Junwei Lu, Tianxi Cai

    Abstract: Evidence-based or data-driven dynamic treatment regimes are essential for personalized medicine, which can benefit from offline reinforcement learning (RL). Although massive healthcare data are available across medical institutions, they are prohibited from sharing due to privacy constraints. Besides, heterogeneity exists in different sites. As a result, federated offline RL algorithms are necessa…

    Submitted 27 January, 2024; v1 submitted 11 June, 2022; originally announced June 2022.

  36. arXiv:2205.06960  [pdf, other]

    stat.AP stat.ME

    Assessing the Most Vulnerable Subgroup to Type II Diabetes Associated with Statin Usage: Evidence from Electronic Health Record Data

    Authors: Xinzhou Guo, Waverly Wei, Molei Liu, Tianxi Cai, Chong Wu, Jingshen Wang

    Abstract: There have been increased concerns that the use of statins, one of the most commonly prescribed drugs for treating coronary artery disease, is potentially associated with the increased risk of new-onset type II diabetes (T2D). Nevertheless, to date, there is no robust evidence as to whether and what kind of populations are indeed vulnerable to developing T2D after taking statins. In th…

    Submitted 21 October, 2022; v1 submitted 13 May, 2022; originally announced May 2022.

    Comments: 25 pages, 2 figures, 5 tables

  37. arXiv:2203.11461  [pdf, other]

    stat.ME stat.ML

    Locally Adaptive Algorithms for Multiple Testing with Network Structure, with Application to Genome-Wide Association Studies

    Authors: Ziyi Liang, T. Tony Cai, Wenguang Sun, Yin Xia

    Abstract: Linkage analysis has provided valuable insights to the GWAS studies, particularly in revealing that SNPs in linkage disequilibrium (LD) can jointly influence disease phenotypes. However, the potential of LD network data has often been overlooked or underutilized in the literature. In this paper, we propose a locally adaptive structure learning algorithm (LASLA) that provides a principled and gener…

    Submitted 16 August, 2023; v1 submitted 22 March, 2022; originally announced March 2022.

    Comments: 33 pages, 7 figures

  38. arXiv:2202.10007  [pdf, other]

    stat.ME math.ST stat.AP

    Statistical Inference for Genetic Relatedness Based on High-Dimensional Logistic Regression

    Authors: Rong Ma, Zijian Guo, T. Tony Cai, Hongzhe Li

    Abstract: This paper studies the problem of statistical inference for genetic relatedness between binary traits based on individual-level genome-wide association data. Specifically, under the high-dimensional logistic regression models, we define parameters characterizing the cross-trait genetic correlation, the genetic covariance and the trait-specific genetic variance. A novel weighted debiasing method is…

    Submitted 5 October, 2022; v1 submitted 21 February, 2022; originally announced February 2022.

  39. arXiv:2201.06438  [pdf, other]

    math.ST stat.ME stat.ML

    Matrix Reordering for Noisy Disordered Matrices: Optimality and Computationally Efficient Algorithms

    Authors: T. Tony Cai, Rong Ma

    Abstract: Motivated by applications in single-cell biology and metagenomics, we investigate the problem of matrix reordering based on a noisy disordered monotone Toeplitz matrix model. We establish the fundamental statistical limit for this problem in a decision-theoretic framework and demonstrate that a constrained least squares estimator achieves the optimal rate. However, due to its computational complex…

    Submitted 13 August, 2023; v1 submitted 17 January, 2022; originally announced January 2022.

    Comments: accepted by IEEE Transactions on Information Theory

  40. arXiv:2201.03727  [pdf, ps, other]

    stat.ME math.ST

    Estimation and Inference with Proxy Data and its Genetic Applications

    Authors: Sai Li, T. Tony Cai, Hongzhe Li

    Abstract: Existing high-dimensional statistical methods are largely established for analyzing individual-level data. In this work, we study estimation and inference for high-dimensional linear models where we only observe "proxy data", which include the marginal statistics and sample covariance matrix that are computed based on different sets of individuals. We develop a rate optimal method for estimation a…

    Submitted 10 January, 2022; originally announced January 2022.

  41. arXiv:2112.09313  [pdf, other]

    stat.ME math.ST stat.AP

    Federated Adaptive Causal Estimation (FACE) of Target Treatment Effects

    Authors: Larry Han, Jue Hou, Kelly Cho, Rui Duan, Tianxi Cai

    Abstract: Federated learning of causal estimands may greatly improve estimation efficiency by leveraging data from multiple study sites, but robustness to heterogeneity and model misspecifications is vital for ensuring validity. We develop a Federated Adaptive Causal Estimation (FACE) framework to incorporate heterogeneous data from multiple sites to provide treatment effect estimation and inference for a f…

    Submitted 5 October, 2023; v1 submitted 16 December, 2021; originally announced December 2021.

    Comments: 59 pages

  42. arXiv:2111.15012  [pdf, other]

    stat.ME

    Adaptive Combination of Randomized and Observational Data

    Authors: David Cheng, Tianxi Cai

    Abstract: Data from both a randomized trial and an observational study are sometimes simultaneously available for evaluating the effect of an intervention. The randomized data typically allows for reliable estimation of average treatment effects but may be limited in sample size and patient heterogeneity for estimating conditional average treatment effects for a broad range of patients. Estimates from the o…

    Submitted 29 November, 2021; originally announced November 2021.

    Comments: 23 pages, 2 Figures

  43. arXiv:2111.02826  [pdf, other]

    math.ST stat.ME

    Finding the Optimal Dynamic Treatment Regime Using Smooth Fisher Consistent Surrogate Loss

    Authors: Nilanjana Laha, Aaron Sonabend-W, Rajarshi Mukherjee, Tianxi Cai

    Abstract: Large health care data repositories such as electronic health records (EHR) open new opportunities to derive individualized treatment strategies for complicated diseases such as sepsis. In this paper, we consider the problem of estimating sequential treatment rules tailored to a patient's individual characteristics, often referred to as dynamic treatment regimes (DTRs). Our main objective is to fi…

    Submitted 30 September, 2023; v1 submitted 3 November, 2021; originally announced November 2021.

    MSC Class: 62G20; ACM Class: G.3

  44. arXiv:2110.12336  [pdf, other]

    stat.ME math.ST

    Efficient and Robust Semi-supervised Estimation of ATE with Partially Annotated Treatment and Response

    Authors: Jue Hou, Rajarshi Mukherjee, Tianxi Cai

    Abstract: A notable challenge of leveraging Electronic Health Records (EHR) for treatment effect assessment is the lack of precise information on important clinical variables, including the treatment received and the response. Both treatment information and response often cannot be accurately captured by readily available EHR features and require labor intensive manual chart review to precisely annotate, wh…

    Submitted 23 October, 2021; originally announced October 2021.

  45. arXiv:2110.09612  [pdf, ps, other]

    stat.ME

    Semi-supervised Approach to Event Time Annotation Using Longitudinal Electronic Health Records

    Authors: Liang Liang, Jue Hou, Hajime Uno, Kelly Cho, Yanyuan Ma, Tianxi Cai

    Abstract: Large clinical datasets derived from insurance claims and electronic health record (EHR) systems are valuable sources for precision medicine research. These datasets can be used to develop models for personalized prediction of risk or treatment response. Efficiently deriving prediction models using real world data, however, faces practical and methodological challenges. Precise information on impo…

    Submitted 18 October, 2021; originally announced October 2021.

  46. arXiv:2109.03365  [pdf, other]

    stat.CO stat.ME stat.OT

    SIHR: Statistical Inference in High-Dimensional Linear and Logistic Regression Models

    Authors: Prabrisha Rakshit, Zhenyu Wang, T. Tony Cai, Zijian Guo

    Abstract: We introduce the R package SIHR for statistical inference in high-dimensional generalized linear models with continuous and binary outcomes. The package provides functionalities for constructing confidence intervals and performing hypothesis tests for low-dimensional objectives in both one-sample and two-sample regression settings. We illustrate the usage of SIHR through numeri…

    Submitted 1 May, 2023; v1 submitted 7 September, 2021; originally announced September 2021.

  47. arXiv:2108.12112  [pdf, other]

    stat.ML cs.CY cs.LG

    Targeting Underrepresented Populations in Precision Medicine: A Federated Transfer Learning Approach

    Authors: Sai Li, Tianxi Cai, Rui Duan

    Abstract: The limited representation of minorities and disadvantaged populations in large-scale clinical and genomics research has become a barrier to translating precision medicine research into practice. Due to heterogeneity across populations, risk prediction models are often found to be underperformed in these underrepresented populations, and therefore may further exacerbate known health disparities. I…

    Submitted 27 August, 2021; originally announced August 2021.

  48. arXiv:2107.14203  [pdf, other]

    stat.ML cs.AI cs.LG stat.AP

    Did the Model Change? Efficiently Assessing Machine Learning API Shifts

    Authors: Lingjiao Chen, Tracy Cai, Matei Zaharia, James Zou

    Abstract: Machine learning (ML) prediction APIs are increasingly widely used. An ML API can change over time due to model updates or retraining. This presents a key challenge in the usage of the API because it is often not clear to the user if and how the ML model has changed. Model shifts can affect downstream application performance and also create oversight issues (e.g. if consistency is desired). In thi…

    Submitted 29 July, 2021; originally announced July 2021.

  49. arXiv:2107.00179  [pdf]

    math.ST cs.DC cs.LG stat.ML

    Distributed Nonparametric Function Estimation: Optimal Rate of Convergence and Cost of Adaptation

    Authors: T. Tony Cai, Hongji Wei

    Abstract: Distributed minimax estimation and distributed adaptive estimation under communication constraints for Gaussian sequence model and white noise model are studied. The minimax rate of convergence for distributed estimation over a given Besov class, which serves as a benchmark for the cost of adaptation, is established. We then quantify the exact communication cost for adaptation and construct an opt…

    Submitted 30 June, 2021; originally announced July 2021.

    MSC Class: 62F30

  50. arXiv:2106.12566  [pdf, other]

    cs.LG cs.CL stat.ML

    Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding

    Authors: Shengjie Luo, Shanda Li, Tianle Cai, Di He, Dinglan Peng, Shuxin Zheng, Guolin Ke, Liwei Wang, Tie-Yan Liu

    Abstract: The attention module, which is a crucial component in Transformer, cannot scale efficiently to long sequences due to its quadratic complexity. Many works focus on approximating the dot-then-exponentiate softmax function in the original attention, leading to sub-quadratic or even linear-complexity Transformer architectures. However, we show that these methods cannot be applied to more powerful atte…

    Submitted 2 November, 2021; v1 submitted 23 June, 2021; originally announced June 2021.

    Comments: NeurIPS 2021, camera ready version