-
F-FOMAML: GNN-Enhanced Meta-Learning for Peak Period Demand Forecasting with Proxy Data
Authors:
Zexing Xu,
Linjun Zhang,
Sitan Yang,
Rasoul Etesami,
Hanghang Tong,
Huan Zhang,
Jiawei Han
Abstract:
Demand prediction is a crucial task for e-commerce and physical retail businesses, especially during high-stakes sales events. However, the limited availability of historical data from these peak periods poses a significant challenge for traditional forecasting methods. In this paper, we propose a novel approach that leverages strategically chosen proxy data reflective of potential sales patterns from similar entities during non-peak periods, enriched by features learned from a graph neural network (GNN)-based forecasting model, to predict demand during peak events. We formulate the demand prediction as a meta-learning problem and develop the Feature-based First-Order Model-Agnostic Meta-Learning (F-FOMAML) algorithm that leverages proxy data from non-peak periods and GNN-generated relational metadata to learn feature-specific layer parameters, thereby adapting to demand forecasts for peak events. Theoretically, we show that by considering domain similarities through task-specific metadata, our model achieves improved generalization, where the excess risk decreases as the number of training tasks increases. Empirical evaluations on large-scale industrial datasets demonstrate the superiority of our approach. Compared to existing state-of-the-art models, our method demonstrates a notable improvement in demand prediction accuracy, reducing the Mean Absolute Error by 26.24% on an internal vending machine dataset and by 1.04% on the publicly accessible JD.com dataset.
Submitted 23 June, 2024;
originally announced June 2024.
-
Bayesian Bandit Algorithms with Approximate Inference in Stochastic Linear Bandits
Authors:
Ziyi Huang,
Henry Lam,
Haofeng Zhang
Abstract:
Bayesian bandit algorithms with approximate Bayesian inference have been widely used in real-world applications. Nevertheless, their theoretical justification is less investigated in the literature, especially for contextual bandit problems. To fill this gap, we propose a general theoretical framework to analyze stochastic linear bandits in the presence of approximate inference and conduct regret analysis on two Bayesian bandit algorithms, Linear Thompson sampling (LinTS) and the extension of Bayesian Upper Confidence Bound, namely Linear Bayesian Upper Confidence Bound (LinBUCB). We demonstrate that both LinTS and LinBUCB can preserve their original rates of regret upper bound but with a sacrifice of larger constant terms when applied with approximate inference. These results hold for general Bayesian inference approaches, under the assumption that the inference error measured by two different $α$-divergences is bounded. Additionally, by introducing a new definition of well-behaved distributions, we show that LinBUCB improves the regret rate of LinTS from $\tilde{O}(d^{3/2}\sqrt{T})$ to $\tilde{O}(d\sqrt{T})$, matching the minimax optimal rate. To our knowledge, this work provides the first regret bounds in the setting of stochastic linear bandits with bounded approximate inference errors.
Submitted 20 June, 2024;
originally announced June 2024.
-
Identifying Genetic Variants for Obesity Incorporating Prior Insights: Quantile Regression with Insight Fusion for Ultra-high Dimensional Data
Authors:
Jiantong Wang,
Heng Lian,
Yan Yu,
Heping Zhang
Abstract:
Obesity is widely recognized as a critical and pervasive health concern. We strive to identify important genetic risk factors from hundreds of thousands of single nucleotide polymorphisms (SNPs) for obesity. We propose and apply a novel Quantile Regression with Insight Fusion (QRIF) approach that can integrate insights from established studies or domain knowledge to simultaneously perform variable selection and modeling for ultra-high dimensional genetic data, focusing on high conditional quantiles of body mass index (BMI) that are of most interest. We discover interesting new SNPs and shed new light on a comprehensive view of the underlying genetic risk factors for different levels of BMI. This may potentially pave the way for more precise and targeted treatment strategies. The QRIF approach intends to balance the trade-off between the prior insights and the observed data while being robust to potential false information. We further establish the desirable asymptotic properties under the challenging non-differentiable check loss functions via Huber loss approximation and nonconvex SCAD penalty via local linear approximation. Finally, we develop an efficient algorithm for the QRIF approach. Our simulation studies further demonstrate its effectiveness.
Submitted 17 June, 2024;
originally announced June 2024.
-
DistPred: A Distribution-Free Probabilistic Inference Method for Regression and Forecasting
Authors:
Daojun Liang,
Haixia Zhang,
Dongfeng Yuan
Abstract:
Traditional regression and prediction tasks often only provide deterministic point estimates. To estimate the uncertainty or distribution information of the response variable, methods such as Bayesian inference, model ensembling, or MC Dropout are typically used. These methods either assume that the posterior distribution of samples follows a Gaussian process or require thousands of forward passes for sample generation. We propose a novel approach called DistPred for regression and forecasting tasks, which overcomes the limitations of existing methods while remaining simple and powerful. Specifically, we transform proper scoring rules that measure the discrepancy between the predicted distribution and the target distribution into a differentiable discrete form and use it as a loss function to train the model end-to-end. This allows the model to draw numerous samples in a single forward pass to estimate the potential distribution of the response variable. We have compared our method with several existing approaches on multiple datasets and achieved state-of-the-art performance. Additionally, our method significantly improves computational efficiency. For example, compared to state-of-the-art models, DistPred has a 90x faster inference speed. Experimental results can be reproduced through https://github.com/Anoise/DistPred.
Submitted 17 June, 2024;
originally announced June 2024.
-
Federated Q-Learning with Reference-Advantage Decomposition: Almost Optimal Regret and Logarithmic Communication Cost
Authors:
Zhong Zheng,
Haochen Zhang,
Lingzhou Xue
Abstract:
In this paper, we consider model-free federated reinforcement learning for tabular episodic Markov decision processes. Under the coordination of a central server, multiple agents collaboratively explore the environment and learn an optimal policy without sharing their raw data. Despite recent advances in federated Q-learning algorithms achieving near-linear regret speedup with low communication cost, existing algorithms only attain suboptimal regrets compared to the information bound. We propose a novel model-free federated Q-learning algorithm, termed FedQ-Advantage. Our algorithm leverages reference-advantage decomposition for variance reduction and operates under two distinct mechanisms: synchronization between the agents and the server, and policy update, both triggered by events. We prove that our algorithm not only requires a lower logarithmic communication cost but also achieves an almost optimal regret, reaching the information bound up to a logarithmic factor and near-linear regret speedup compared to its single-agent counterpart when the time horizon is sufficiently large.
Submitted 29 May, 2024;
originally announced May 2024.
-
On the Saturation Effect of Kernel Ridge Regression
Authors:
Yicheng Li,
Haobo Zhang,
Qian Lin
Abstract:
The saturation effect refers to the phenomenon that kernel ridge regression (KRR) fails to achieve the information-theoretic lower bound when the smoothness of the underlying truth function exceeds a certain level. The saturation effect has been widely observed in practice, and a saturation lower bound of KRR has been conjectured for decades. In this paper, we provide a proof of this long-standing conjecture.
Submitted 28 May, 2024; v1 submitted 15 May, 2024;
originally announced May 2024.
-
Efficient and Adaptive Posterior Sampling Algorithms for Bandits
Authors:
Bingshan Hu,
Zhiming Huang,
Tianyue H. Zhang,
Mathias Lécuyer,
Nidhi Hegde
Abstract:
We study Thompson Sampling-based algorithms for stochastic bandits with bounded rewards. As the existing problem-dependent regret bound for Thompson Sampling with Gaussian priors [Agrawal and Goyal, 2017] is vacuous when $T \le 288 e^{64}$, we derive a more practical bound that tightens the coefficient of the leading term from $288 e^{64}$ to $1270$. Additionally, motivated by large-scale real-world applications that require scalability, adaptive computational resource allocation, and a balance in utility and computation, we propose two parameterized Thompson Sampling-based algorithms: Thompson Sampling with Model Aggregation (TS-MA-$α$) and Thompson Sampling with Timestamp Duelling (TS-TD-$α$), where $α\in [0,1]$ controls the trade-off between utility and computation. Both algorithms achieve $O \left(K\ln^{α+1}(T)/Δ\right)$ regret bound, where $K$ is the number of arms, $T$ is the finite learning horizon, and $Δ$ denotes the single round performance loss when pulling a sub-optimal arm.
Submitted 2 May, 2024;
originally announced May 2024.
-
The Impact of COVID-19 on Co-authorship and Economics Scholars' Productivity
Authors:
Hanqiao Zhang,
Joy D. Xiuyao Yang
Abstract:
The COVID-19 pandemic has disrupted traditional academic collaboration patterns, prompting a unique opportunity to analyze the influence of peer effects and coauthorship dynamics on research output. Using a novel dataset, this paper endeavors to make a first cut at investigating the role of peer effects on the productivity of economics scholars, measured by the number of publications, in both pre-pandemic and pandemic times. Results show that the peer effect is significant in the pre-pandemic period but not during the pandemic. The findings contribute to our understanding of how research collaboration influences knowledge production and may help guide policies aimed at fostering collaboration and enhancing research productivity in the academic community.
Submitted 29 April, 2024;
originally announced April 2024.
-
Analysis of Proximity Informed User Behavior in a Global Online Social Network
Authors:
Nils Breitmar,
Matthew C. Harding,
Hanqiao Zhang
Abstract:
Despite the earlier claim of "Death of Distance", recent studies revealed that geographical proximity still greatly influences link formation in online social networks. However, it is unclear how physical distances are intertwined with users' online behaviors in a virtual world. We study the role of spatial dependence on a global online social network with a dyadic Logit model. Results show country-specific patterns for distance effect on probabilities to build connections. Effects are stronger when the possibility for two people to meet in person exists. Relative to weak ties, dependence on proximity is looser for strong social ties.
Submitted 29 April, 2024;
originally announced April 2024.
-
Exit Spillovers of Foreign-invested Enterprises in Shenzhen's Electronics Manufacturing Industry
Authors:
Hanqiao Zhang
Abstract:
Neighborhood characteristics have been broadly studied with different firm behaviors, e.g. birth, entry, expansion, and survival, except for firm exit. Using a novel dataset of foreign-invested enterprises operating in Shenzhen's electronics manufacturing industry from 2017 to 2021, I investigate the spillover effects of firm exits on other firms in the vicinity, from both the industry group and the industry class level. Significant neighborhood effects are identified for the industry group level, but not the industry class level.
Submitted 27 April, 2024;
originally announced April 2024.
-
Prior Effective Sample Size When Borrowing on the Treatment Effect Scale
Authors:
Hongtao Zhang,
Keaven M Anderson,
Zachary Zimmer,
Gregory Golm,
Aditi Sapre,
Joseph G Ibrahim
Abstract:
With the robust uptick in the applications of Bayesian external data borrowing, eliciting a prior distribution with the proper amount of information becomes increasingly critical. The prior effective sample size (ESS) is an intuitive and efficient measure for this purpose. The majority of ESS definitions have been proposed in the context of borrowing control information. While many Bayesian models can be naturally extended to leveraging external information on the treatment effect scale, very little attention has been directed to computing the prior ESS in this setting. In this research, we bridge this methodological gap by extending the popular ELIR ESS definition. We lay out the general framework, and derive the prior ESS for various types of endpoints and treatment effect measures. The posterior distribution and the predictive consistency property of ESS are also examined. The methods are implemented in R programs available on GitHub: https://github.com/squallteo/TrtEffESS.
Submitted 20 April, 2024;
originally announced April 2024.
-
A Fourier Approach to the Parameter Estimation Problem for One-dimensional Gaussian Mixture Models
Authors:
Xinyu Liu,
Hai Zhang
Abstract:
The purpose of this paper is twofold. First, we propose a novel algorithm for estimating parameters in one-dimensional Gaussian mixture models (GMMs). The algorithm takes advantage of the Hankel structure inherent in the Fourier data obtained from independent and identically distributed (i.i.d.) samples of the mixture. For GMMs with a unified variance, a singular value ratio functional using the Fourier data is introduced and used to resolve the variance and component number simultaneously. The consistency of the estimator is derived. Compared to classic algorithms such as the method of moments and the maximum likelihood method, the proposed algorithm does not require prior knowledge of the number of Gaussian components or good initial guesses. Numerical experiments demonstrate its superior performance in estimation accuracy and computational cost. Second, we reveal that there exists a fundamental limit to the problem of estimating the number of Gaussian components or model order in the mixture model if the number of i.i.d. samples is finite. For the case of a single variance, we show that the model order can be successfully estimated only if the minimum separation distance between the component means exceeds a certain threshold value and can fail if below. We derive a lower bound for this threshold value, referred to as the computational resolution limit, in terms of the number of i.i.d. samples, the variance, and the number of Gaussian components. Numerical experiments confirm this phase transition phenomenon in estimating the model order. Moreover, we demonstrate that our algorithm achieves better scores in likelihood, AIC, and BIC when compared to the EM algorithm.
Submitted 18 April, 2024;
originally announced April 2024.
-
The phase diagram of kernel interpolation in large dimensions
Authors:
Haobo Zhang,
Weihao Lu,
Qian Lin
Abstract:
The generalization ability of kernel interpolation in large dimensions (i.e., $n \asymp d^γ$ for some $γ>0$) might be one of the most interesting problems in the recent renaissance of kernel regression, since it may help us understand the 'benign overfitting phenomenon' reported in the neural networks literature. Focusing on the inner product kernel on the sphere, we fully characterized the exact order of both the variance and bias of large-dimensional kernel interpolation under various source conditions $s\geq 0$. Consequently, we obtained the $(s,γ)$-phase diagram of large-dimensional kernel interpolation, i.e., we determined the regions in $(s,γ)$-plane where the kernel interpolation is minimax optimal, sub-optimal and inconsistent.
Submitted 18 April, 2024;
originally announced April 2024.
-
A Strategy Transfer and Decision Support Approach for Epidemic Control in Experience Shortage Scenarios
Authors:
X. Xiao,
P. Chen,
X. Cao,
K. Liu,
L. Deng,
D. Zhao,
Z. Chen,
Q. Deng,
F. Yu,
H. Zhang
Abstract:
Epidemic outbreaks can cause critical health concerns and severe global economic crises. For countries or regions with new infectious disease outbreaks, it is essential to generate preventive strategies by learning lessons from others with similar risk profiles. A Strategy Transfer and Decision Support Approach (STDSA) is proposed based on the profile similarity evaluation. There are four steps in this method: (1) The similarity evaluation indicators are determined from three dimensions, i.e., the Basis of National Epidemic Prevention & Control, Social Resilience, and Infection Situation. (2) The data related to the indicators are collected and preprocessed. (3) The first round of screening on the preprocessed dataset is conducted through an improved collaborative filtering algorithm to calculate the preliminary similarity result from the perspective of the infection situation. (4) Finally, the K-Means model is used for the second round of screening to obtain the final similarity values. The approach will be applied to decision-making support in the context of COVID-19. Our results demonstrate that the recommendations generated by the STDSA model are more accurate and aligned better with the actual situation than those produced by pure K-means models. This study will provide new insights into preventing and controlling epidemics in regions that lack experience.
Submitted 9 April, 2024;
originally announced April 2024.
-
Using Explainable AI and Transfer Learning to understand and predict the maintenance of Atlantic blocking with limited observational data
Authors:
Huan Zhang,
Justin Finkel,
Dorian S. Abbot,
Edwin P. Gerber,
Jonathan Weare
Abstract:
Blocking events are an important cause of extreme weather, especially long-lasting blocking events that trap weather systems in place. The duration of blocking events is, however, underestimated in climate models. Explainable Artificial Intelligence is a class of data analysis methods that can help identify physical causes of prolonged blocking events and diagnose model deficiencies. We demonstrate this approach on an idealized quasigeostrophic model developed by Marshall and Molteni (1993). We train a convolutional neural network (CNN), and subsequently, build a sparse predictive model for the persistence of Atlantic blocking, conditioned on an initial high-pressure anomaly. Shapley Additive ExPlanation (SHAP) analysis reveals that high-pressure anomalies in the American Southeast and North Atlantic, separated by a trough over Atlantic Canada, contribute significantly to prediction of sustained blocking events in the Atlantic region. This agrees with previous work that identified precursors in the same regions via wave train analysis. When we apply the same CNN to blockings in the ERA5 atmospheric reanalysis, there is insufficient data to accurately predict persistent blocks. We partially overcome this limitation by pre-training the CNN on the plentiful data of the Marshall-Molteni model, and then using Transfer Learning to achieve better predictions than direct training. SHAP analysis before and after transfer learning allows a comparison between the predictive features in the reanalysis and the quasigeostrophic model, quantifying dynamical biases in the idealized model. This work demonstrates the potential for machine learning methods to extract meaningful precursors of extreme weather events and achieve better prediction using limited observational data.
Submitted 12 April, 2024;
originally announced April 2024.
-
Language Model Prompt Selection via Simulation Optimization
Authors:
Haoting Zhang,
Jinghai He,
Rhonda Righter,
Zeyu Zheng
Abstract:
With the advancement in generative language models, the selection of prompts has gained significant attention in recent years. A prompt is an instruction or description provided by the user, serving as a guide for the generative language model in content generation. Despite existing methods for prompt selection that are based on human labor, we consider facilitating this selection through simulation optimization, aiming to maximize a pre-defined score for the selected prompt. Specifically, we propose a two-stage framework. In the first stage, we determine a feasible set of prompts in sufficient numbers, where each prompt is represented by a moderate-dimensional vector. In the subsequent stage for evaluation and selection, we construct a surrogate model of the score regarding the moderate-dimensional vectors that represent the prompts. We propose sequentially selecting the prompt for evaluation based on this constructed surrogate model. We prove the consistency of the sequential evaluation procedure in our framework. We also conduct numerical experiments to demonstrate the efficacy of our proposed framework, providing practical instructions for implementation.
Submitted 19 May, 2024; v1 submitted 11 April, 2024;
originally announced April 2024.
-
Semisupervised score based matching algorithm to evaluate the effect of public health interventions
Authors:
Hongzhe Zhang,
Jiasheng Shi,
Jing Huang
Abstract:
Multivariate matching algorithms "pair" similar study units in an observational study to remove potential bias and confounding effects caused by the absence of randomizations. In one-to-one multivariate matching algorithms, a large number of "pairs" to be matched could mean both the information from a large sample and a large number of tasks, and therefore, to best match the pairs, such a matching algorithm with efficiency and comparatively limited auxiliary matching knowledge provided through a "training" set of paired units by domain experts, is practically intriguing.
We propose a novel one-to-one matching algorithm based on a quadratic score function $S_β(x_i,x_j)= β^T (x_i-x_j)(x_i-x_j)^T β$. The weights $β$, which can be interpreted as a variable importance measure, are designed to minimize the score difference between paired training units while maximizing the score difference between unpaired training units. Further, in the typical but intricate case where the training set is much smaller than the unpaired set, we propose a semisupervised companion one-to-one matching algorithm (SCOTOMA) that makes the best use of the unpaired units. The proposed weight estimator is proved to be consistent when the true matching criterion is indeed the quadratic score function. When the model assumptions are violated, we demonstrate that the proposed algorithm still outperforms some popular competing matching algorithms through a series of simulations. We apply the proposed algorithm to a real-world study to investigate the effect of in-person schooling on community Covid-19 transmission rate for policy-making purposes.
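As an illustration only, the quadratic score quoted in this abstract can be evaluated directly, since $β^T (x_i-x_j)(x_i-x_j)^T β = (β^T (x_i-x_j))^2$. The sketch below is not the authors' SCOTOMA implementation: the function name, the weights, and the covariate values are all hypothetical, and the weights are simply assumed to be given rather than estimated.

```python
from typing import Sequence


def quadratic_score(beta: Sequence[float],
                    x_i: Sequence[float],
                    x_j: Sequence[float]) -> float:
    """Evaluate S_beta(x_i, x_j) = (beta^T (x_i - x_j))^2 for one pair of units."""
    if not (len(beta) == len(x_i) == len(x_j)):
        raise ValueError("beta, x_i and x_j must have the same length")
    # Project the covariate difference onto the weight vector, then square it.
    projection = sum(b * (a - c) for b, a, c in zip(beta, x_i, x_j))
    return projection ** 2


# Hypothetical weights and covariates, chosen only to show the computation.
beta = [0.5, 2.0]
unit = [1.0, 3.0]
candidates = {"u1": [1.0, 2.5], "u2": [4.0, 3.0]}

# A smaller score means the candidate is closer to `unit` under these weights.
scores = {name: quadratic_score(beta, unit, x) for name, x in candidates.items()}
best = min(scores, key=scores.get)
print(scores, best)
```

Under these assumed weights the second covariate counts four times as much as the first, so a small gap in it can outweigh a larger gap in the first covariate; this is the sense in which the weights act as a variable importance measure.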
Submitted 18 March, 2024;
originally announced March 2024.
-
Estimating Factor-Based Spot Volatility Matrices with Noisy and Asynchronous High-Frequency Data
Authors:
Degui Li,
Oliver Linton,
Haoxuan Zhang
Abstract:
We propose a new estimator of high-dimensional spot volatility matrices satisfying a low-rank plus sparse structure from noisy and asynchronous high-frequency data collected for an ultra-large number of assets. The noise processes are allowed to be temporally correlated, heteroskedastic, asymptotically vanishing and dependent on the efficient prices. We define a kernel-weighted pre-averaging method to jointly tackle the microstructure noise and asynchronicity issues, and we obtain uniformly consistent estimates for latent prices. We impose a continuous-time factor model with time-varying factor loadings on the price processes, and estimate the common factors and loadings via a local principal component analysis. Assuming a uniform sparsity condition on the idiosyncratic volatility structure, we combine the POET and kernel-smoothing techniques to estimate the spot volatility matrices for both the latent prices and idiosyncratic errors. Under some mild restrictions, the estimated spot volatility matrices are shown to be uniformly consistent under various matrix norms. We provide Monte-Carlo simulation and empirical studies to examine the numerical performance of the developed estimation methodology.
Submitted 10 March, 2024;
originally announced March 2024.
-
Zeroth-Order primal-dual Alternating Projection Gradient Algorithms for Nonconvex Minimax Problems with Coupled linear Constraints
Authors:
Huiling Zhang,
Zi Xu,
Yuhong Dai
Abstract:
In this paper, we study zeroth-order algorithms for nonconvex minimax problems with coupled linear constraints under the deterministic and stochastic settings, which have attracted wide attention in machine learning, signal processing and many other fields in recent years, e.g., adversarial attacks in resource allocation problems and network flow problems. We propose two single-loop algorithms, namely the zero-order primal-dual alternating projected gradient (ZO-PDAPG) algorithm and the zero-order regularized momentum primal-dual projected gradient algorithm (ZO-RMPDPG), for solving deterministic and stochastic nonconvex-(strongly) concave minimax problems with coupled linear constraints. The iteration complexities of the two proposed algorithms to obtain an $\varepsilon$-stationary point are proved to be $\mathcal{O}(\varepsilon ^{-2})$ (resp. $\mathcal{O}(\varepsilon ^{-4})$) for solving nonconvex-strongly concave (resp. nonconvex-concave) minimax problems with coupled linear constraints under deterministic settings and $\tilde{\mathcal{O}}(\varepsilon ^{-3})$ (resp. $\tilde{\mathcal{O}}(\varepsilon ^{-6.5})$) under stochastic settings. To the best of our knowledge, they are the first two zeroth-order algorithms with iteration complexity guarantees for solving nonconvex-(strongly) concave minimax problems with coupled linear constraints under the deterministic and stochastic settings.
Submitted 26 January, 2024;
originally announced February 2024.
-
High-dimensional Bayesian Optimization via Covariance Matrix Adaptation Strategy
Authors:
Lam Ngo,
Huong Ha,
Jeffrey Chan,
Vu Nguyen,
Hongyu Zhang
Abstract:
Bayesian Optimization (BO) is an effective method for finding the global optimum of expensive black-box functions. However, it is well known that applying BO to high-dimensional optimization problems is challenging. To address this issue, a promising solution is to use a local search strategy that partitions the search domain into local regions with high likelihood of containing the global optimum, and then use BO to optimize the objective function within these regions. In this paper, we propose a novel technique for defining the local regions using the Covariance Matrix Adaptation (CMA) strategy. Specifically, we use CMA to learn a search distribution that can estimate the probabilities of data points being the global optimum of the objective function. Based on this search distribution, we then define the local regions consisting of data points with high probabilities of being the global optimum. Our approach serves as a meta-algorithm as it can incorporate existing black-box BO optimizers, such as BO, TuRBO, and BAxUS, to find the global optimum of the objective function within our derived local regions. We evaluate our proposed method on various benchmark synthetic and real-world problems. The results demonstrate that our method outperforms existing state-of-the-art techniques.
Submitted 5 February, 2024;
originally announced February 2024.
-
Timer: Generative Pre-trained Transformers Are Large Time Series Models
Authors:
Yong Liu,
Haoran Zhang,
Chenyu Li,
Xiangdong Huang,
Jianmin Wang,
Mingsheng Long
Abstract:
Deep learning has contributed remarkably to the advancement of time series analysis. Still, deep models can encounter performance bottlenecks in real-world data-scarce scenarios, which can be concealed due to the performance saturation with small models on current benchmarks. Meanwhile, large models have demonstrated great powers in these scenarios through large-scale pre-training. Continuous progress has been achieved with the emergence of large language models, exhibiting unprecedented abilities such as few-shot generalization, scalability, and task generality, which are however absent in small deep models. To change the status quo of training scenario-specific small models from scratch, this paper aims at the early development of large time series models (LTSM). During pre-training, we curate large-scale datasets with up to 1 billion time points, unify heterogeneous time series into single-series sequence (S3) format, and develop the GPT-style architecture toward LTSMs. To meet diverse application needs, we convert forecasting, imputation, and anomaly detection of time series into a unified generative task. The outcome of this study is a Time Series Transformer (Timer), which is generative pre-trained by next token prediction and adapted to various downstream tasks with promising capabilities as an LTSM. Code and datasets are available at: https://github.com/thuml/Large-Time-Series-Model.
Submitted 4 June, 2024; v1 submitted 4 February, 2024;
originally announced February 2024.
-
A Closer Look at AUROC and AUPRC under Class Imbalance
Authors:
Matthew B. A. McDermott,
Lasse Hyldig Hansen,
Haoran Zhang,
Giovanni Angelotti,
Jack Gallifant
Abstract:
In machine learning (ML), a widespread adage is that the area under the precision-recall curve (AUPRC) is a superior metric for model comparison to the area under the receiver operating characteristic (AUROC) for binary classification tasks with class imbalance. This paper challenges this notion through novel mathematical analysis, illustrating that AUROC and AUPRC can be concisely related in probabilistic terms. We demonstrate that AUPRC, contrary to popular belief, is not superior in cases of class imbalance and might even be a harmful metric, given its inclination to unduly favor model improvements in subpopulations with more frequent positive labels. This bias can inadvertently heighten algorithmic disparities. Prompted by these insights, a thorough review of existing ML literature was conducted, utilizing large language models to analyze over 1.5 million papers from arXiv. Our investigation focused on the prevalence and substantiation of the purported AUPRC superiority. The results expose a significant deficit in empirical backing and a trend of misattributions that have fuelled the widespread acceptance of AUPRC's supposed advantages. Our findings represent a dual contribution: a significant technical advancement in understanding metric behaviors and a stark warning about unchecked assumptions in the ML community. All experiments are accessible at https://github.com/mmcdermott/AUC_is_all_you_need.
Submitted 18 April, 2024; v1 submitted 11 January, 2024;
originally announced January 2024.
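For readers comparing the two metrics discussed in the abstract above, here is a minimal pure-Python sketch of how each can be computed from labels and scores. The function names `auroc` and `average_precision` and the toy data are illustrative assumptions; AUPRC is approximated by average precision, a common estimator, and nothing here reproduces the paper's analysis.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the probability that a random positive outranks a random
    negative (ties count as half a win)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Average precision: mean of precision@k over the ranks k at which a
    positive appears. Tied scores are not treated specially in this sketch."""
    n_pos = sum(1 for y in labels if y == 1)
    if n_pos == 0:
        raise ValueError("need at least one positive")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / n_pos


# Toy example (hypothetical data): two positives, two negatives.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.1]
toy_auroc = auroc(toy_labels, toy_scores)
toy_ap = average_precision(toy_labels, toy_scores)
```

AUROC weights every positive-negative pair equally, whereas average precision rewards correct ordering near the top of the ranking more heavily, which is one informal way to see why the two can disagree.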
-
A Tidy Framework and Infrastructure to Systematically Assemble Spatio-temporal Indexes from Multivariate Data
Authors:
H. Sherry Zhang,
Dianne Cook,
Ursula Laa,
Nicolas Langrené,
Patricia Menéndez
Abstract:
Indexes are useful for summarizing multivariate information into single metrics for monitoring, communicating, and decision-making. While most work has focused on defining new indexes for specific purposes, more attention needs to be directed towards making it possible to understand index behavior in different data conditions, and to determine how their structure affects their values and variation in values. Here we discuss a modular data pipeline recommendation to assemble indexes. It is universally applicable to index computation and allows investigation of index behavior as part of the development procedure. One can compute indexes with different parameter choices, adjust steps in the index definition by adding, removing, and swapping them to experiment with various index designs, calculate uncertainty measures, and assess index robustness. The paper presents three examples to illustrate the pipeline framework usage: a comparison of two different indexes designed to monitor the spatio-temporal distribution of drought in Queensland, Australia; the effect of dimension reduction choices in the Global Gender Gap Index (GGGI) on country rankings; and how to calculate bootstrap confidence intervals for the Standardized Precipitation Index (SPI). The methods are supported by a new R package, called tidyindex.
Submitted 13 May, 2024; v1 submitted 11 January, 2024;
originally announced January 2024.
-
Study Duration Prediction for Clinical Trials with Time-to-Event Endpoints Using Mixture Distributions Accounting for Heterogeneous Population
Authors:
Hong Zhang,
Jie Pu,
Shibing Deng,
Satrajit Roychoudhury,
Haitao Chu,
Douglas Robinson
Abstract:
In the era of precision medicine, more and more clinical trials are now driven or guided by biomarkers, which are patient characteristics objectively measured and evaluated as indicators of normal biological processes, pathogenic processes, or pharmacologic responses to therapeutic interventions. With the overarching objective to optimize and personalize disease management, biomarker-guided clinical trials increase efficiency by appropriately utilizing prognostic or predictive biomarkers in the design. However, the efficiency gain is often not quantitatively compared to the traditional all-comers design, in which a faster enrollment rate is expected (e.g. due to no restriction to biomarker-positive patients), potentially leading to a shorter duration. To accurately predict biomarker-guided trial duration, we propose a general framework using mixture distributions accounting for a heterogeneous population. Extensive simulations are performed to evaluate the impact of a heterogeneous population and the dynamics of biomarker characteristics and disease on the study duration. Several influential parameters, including median survival time, enrollment rate, biomarker prevalence and effect size, are identified. Re-assessments of two publicly available trials are conducted to empirically validate the prediction accuracy and to demonstrate the practical utility. The R package \emph{detest} is developed to implement the proposed method and is publicly available on CRAN.
Submitted 31 December, 2023;
originally announced January 2024.
-
Superpixel-based and Spatially-regularized Diffusion Learning for Unsupervised Hyperspectral Image Clustering
Authors:
Kangning Cui,
Ruoning Li,
Sam L. Polk,
Yinyi Lin,
Hongsheng Zhang,
James M. Murphy,
Robert J. Plemmons,
Raymond H. Chan
Abstract:
Hyperspectral images (HSIs) provide exceptional spatial and spectral resolution of a scene, crucial for various remote sensing applications. However, the high dimensionality, presence of noise and outliers, and the need for precise labels of HSIs present significant challenges to HSIs analysis, motivating the development of performant HSI clustering algorithms. This paper introduces a novel unsupervised HSI clustering algorithm, Superpixel-based and Spatially-regularized Diffusion Learning (S2DL), which addresses these challenges by incorporating rich spatial information encoded in HSIs into diffusion geometry-based clustering. S2DL employs the Entropy Rate Superpixel (ERS) segmentation technique to partition an image into superpixels, then constructs a spatially-regularized diffusion graph using the most representative high-density pixels. This approach reduces computational burden while preserving accuracy. Cluster modes, serving as exemplars for underlying cluster structure, are identified as the highest-density pixels farthest in diffusion distance from other highest-density pixels. These modes guide the labeling of the remaining representative pixels from ERS superpixels. Finally, majority voting is applied to the labels assigned within each superpixel to propagate labels to the rest of the image. This spatial-spectral approach simultaneously simplifies graph construction, reduces computational cost, and improves clustering performance. S2DL's performance is illustrated with extensive experiments on three publicly available, real-world HSIs: Indian Pines, Salinas, and Salinas A. Additionally, we apply S2DL to landscape-scale, unsupervised mangrove species mapping in the Mai Po Nature Reserve, Hong Kong, using a Gaofen-5 HSI. The success of S2DL in these diverse numerical experiments indicates its efficacy on a wide range of important unsupervised remote sensing analysis tasks.
Submitted 24 December, 2023;
originally announced December 2023.
-
Sparse Learning and Class Probability Estimation with Weighted Support Vector Machines
Authors:
Liyun Zeng,
Hao Helen Zhang
Abstract:
Classification and probability estimation have broad applications in modern machine learning and data science, including biology, medicine, engineering, and computer science. The recent development of a class of weighted Support Vector Machines (wSVMs) has shown great value in robustly predicting the class probability and classification for various problems with high accuracy. The current framework is based on the $\ell^2$-norm regularized binary wSVMs optimization problem, which only works with dense features and performs poorly on sparse features with redundant noise in most real applications. The sparse learning process requires a prescreening of the important variables for each binary wSVM to accurately estimate pairwise conditional probability. In this paper, we proposed novel wSVMs frameworks that incorporate automatic variable selection with accurate probability estimation for sparse learning problems. We developed efficient algorithms for effective variable selection for solving either the $\ell^1$-norm or elastic net regularized binary wSVMs optimization problems. The binary class probability is then estimated either by the $\ell^2$-norm regularized wSVMs framework with selected variables or by elastic net regularized wSVMs directly. The two-step approach of $\ell^1$-norm followed by $\ell^2$-norm wSVMs shows a great advantage in both automatic variable selection and reliable probability estimation with the most efficient time. The elastic net regularized wSVMs offer the best performance in terms of variable selection and probability estimation, with the additional advantage of variable grouping, at the cost of more computation time for high dimensional problems. The proposed wSVMs-based sparse learning methods have wide applications and can be further extended to $K$-class problems through ensemble learning.
Submitted 17 December, 2023;
originally announced December 2023.
-
Temporal-Spatial Entropy Balancing for Causal Continuous Treatment-Effect Estimation
Authors:
Tao Hu,
Honglong Zhang,
Fan Zeng,
Min Du,
XiangKun Du,
Yue Zheng,
Quanqi Li,
Mengran Zhang,
Dan Yang,
Jihao Wu
Abstract:
In the field of intracity freight transportation, changes in order volume are significantly influenced by temporal and spatial factors. When building subsidy and pricing strategies, predicting the causal effects of these strategies on order volume is crucial. In the process of calculating causal effects, confounding variables can have an impact. Traditional methods to control confounding variables handle data from a holistic perspective, which cannot ensure the precision of causal effects in specific temporal and spatial dimensions. However, temporal and spatial dimensions are extremely critical in the logistics field, and this limitation may directly affect the precision of subsidy and pricing strategies. To address these issues, this study proposes a technique based on flexible temporal-spatial grid partitioning. Building on this partitioning technique, we further propose a continuous entropy balancing method in the temporal-spatial domain, which we name TS-EBCT (Temporal-Spatial Entropy Balancing for Causal Continue Treatments). The proposed method has been tested on two simulation datasets and two real datasets, and achieved excellent performance on all of them. After applying the TS-EBCT method to the intracity freight transportation field, the prediction accuracy of the causal effect has been significantly improved, which brings good business benefits to the company's subsidy and pricing strategies.
Submitted 18 December, 2023; v1 submitted 14 December, 2023;
originally announced December 2023.
-
AI in Pharma for Personalized Sequential Decision-Making: Methods, Applications and Opportunities
Authors:
Yuhan Li,
Hongtao Zhang,
Keaven Anderson,
Songzi Li,
Ruoqing Zhu
Abstract:
In the pharmaceutical industry, the use of artificial intelligence (AI) has seen consistent growth over the past decade. This rise is attributed to major advancements in statistical machine learning methodologies, computational capabilities and the increased availability of large datasets. AI techniques are applied throughout different stages of drug development, ranging from drug discovery to post-marketing benefit-risk assessment. Kolluri et al. provided a review of several case studies that span these stages, featuring key applications such as protein structure prediction, success probability estimation, subgroup identification, and AI-assisted clinical trial monitoring. From a regulatory standpoint, there was a notable uptick in submissions incorporating AI components in 2021. The most prevalent therapeutic areas leveraging AI were oncology (27%), psychiatry (15%), gastroenterology (12%), and neurology (11%). The paradigm of personalized or precision medicine has gained significant traction in recent research, partly due to advancements in AI techniques \cite{hamburg2010path}. This shift has had a transformative impact on the pharmaceutical industry. Departing from the traditional "one-size-fits-all" model, personalized medicine incorporates various individual factors, such as environmental conditions, lifestyle choices, and health histories, to formulate customized treatment plans. By utilizing sophisticated machine learning algorithms, clinicians and researchers are better equipped to make informed decisions in areas such as disease prevention, diagnosis, and treatment selection, thereby optimizing health outcomes for each individual.
Submitted 30 November, 2023;
originally announced November 2023.
-
Gradient Descent with Polyak's Momentum Finds Flatter Minima via Large Catapults
Authors:
Prin Phunyaphibarn,
Junghyun Lee,
Bohan Wang,
Huishuai Zhang,
Chulhee Yun
Abstract:
Although gradient descent with Polyak's momentum is widely used in modern machine and deep learning, a concrete understanding of its effects on the training trajectory remains elusive. In this work, we empirically show that for linear diagonal networks and nonlinear neural networks, momentum gradient descent with a large learning rate displays large catapults, driving the iterates towards much flatter minima than those found by gradient descent. We hypothesize that the large catapult is caused by momentum "prolonging" the self-stabilization effect (Damian et al., 2023). We provide theoretical and empirical support for our hypothesis in a simple toy example and empirical evidence supporting our hypothesis for linear diagonal networks.
Submitted 29 May, 2024; v1 submitted 25 November, 2023;
originally announced November 2023.
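As background for the abstract above, the standard Polyak (heavy-ball) momentum update is $x_{t+1} = x_t - \eta \nabla f(x_t) + \beta (x_t - x_{t-1})$. The sketch below applies it to a one-dimensional quadratic; the function name `heavy_ball`, the objective, and the hyperparameters are illustrative assumptions, and the example shows only the update rule, not the catapult behavior studied in the paper.

```python
from typing import Callable


def heavy_ball(grad: Callable[[float], float],
               x0: float,
               lr: float,
               momentum: float,
               steps: int) -> float:
    """Run gradient descent with Polyak's momentum on a scalar objective.

    Update: x_{t+1} = x_t - lr * grad(x_t) + momentum * (x_t - x_{t-1}),
    starting from x_{-1} = x_0 (zero initial velocity).
    """
    x_prev, x = x0, x0
    for _ in range(steps):
        x_next = x - lr * grad(x) + momentum * (x - x_prev)
        x_prev, x = x, x_next
    return x


# Hypothetical objective f(x) = 0.5 * x**2, whose gradient is x.
final_x = heavy_ball(grad=lambda x: x, x0=1.0, lr=0.1, momentum=0.9, steps=500)
```

Setting `momentum=0.0` recovers plain gradient descent, which makes the two methods easy to compare on the same objective.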
-
Provable Representation with Efficient Planning for Partial Observable Reinforcement Learning
Authors:
Hongming Zhang,
Tongzheng Ren,
Chenjun Xiao,
Dale Schuurmans,
Bo Dai
Abstract:
In most real-world reinforcement learning applications, state information is only partially observable, which breaks the Markov decision process assumption and leads to inferior performance for algorithms that conflate observations with state. Partially Observable Markov Decision Processes (POMDPs), on the other hand, provide a general framework that allows for partial observability to be accounted for in learning, exploration and planning, but presents significant computational and statistical challenges. To address these difficulties, we develop a representation-based perspective that leads to a coherent framework and tractable algorithmic approach for practical reinforcement learning from partial observations. We provide a theoretical analysis for justifying the statistical efficiency of the proposed algorithm, and also empirically demonstrate the proposed algorithm can surpass state-of-the-art performance with partial observations across various benchmarks, advancing reliable reinforcement learning towards more practical applications.
Submitted 10 June, 2024; v1 submitted 20 November, 2023;
originally announced November 2023.
-
Robust Brain MRI Image Classification with SIBOW-SVM
Authors:
Liyun Zeng,
Hao Helen Zhang
Abstract:
The majority of primary Central Nervous System (CNS) tumors in the brain are among the most aggressive diseases affecting humans. Early detection of brain tumor types, whether benign or malignant, glial or non-glial, is critical for cancer prevention and treatment, ultimately improving human life expectancy. Magnetic Resonance Imaging (MRI) stands as the most effective technique to detect brain tumors by generating comprehensive brain images through scans. However, human examination can be error-prone and inefficient due to the complexity, size, and location variability of brain tumors. Recently, automated classification techniques using machine learning (ML) methods, such as Convolutional Neural Networks (CNNs), have demonstrated significantly higher accuracy than manual screening, while maintaining low computational costs. Nonetheless, deep learning-based image classification methods, including CNNs, face challenges in estimating class probabilities without proper model calibration. In this paper, we propose a novel brain tumor image classification method, called SIBOW-SVM, which integrates the Bag-of-Features (BoF) model with SIFT feature extraction and weighted Support Vector Machines (wSVMs). This new approach effectively captures hidden image features, enabling the differentiation of various tumor types and accurate label predictions. Additionally, the SIBOW-SVM is able to estimate the probabilities of images belonging to each class, thereby providing high-confidence classification decisions. We have also developed scalable and parallelizable algorithms to facilitate the practical implementation of SIBOW-SVM for massive image sets. As a benchmark, we apply the SIBOW-SVM to a public data set of brain tumor MRI images containing four classes: glioma, meningioma, pituitary, and normal. Our results show that the new method outperforms state-of-the-art methods, including CNNs.
Submitted 15 November, 2023;
originally announced November 2023.
-
Manifold learning: what, how, and why
Authors:
Marina Meilă,
Hanyu Zhang
Abstract:
Manifold learning (ML), known also as non-linear dimension reduction, is a set of methods to find the low dimensional structure of data. Dimension reduction for large, high dimensional data is not merely a way to reduce the data; the new representations and descriptors obtained by ML reveal the geometric shape of high dimensional point clouds, and allow one to visualize, de-noise and interpret them. This survey presents the principles underlying ML, the representative methods, as well as their statistical foundations from a practicing statistician's perspective. It describes the trade-offs, and what theory tells us about the parameter and algorithmic choices we make in order to obtain reliable conclusions.
Submitted 7 November, 2023;
originally announced November 2023.
-
Spatial Process Approximations: Assessing Their Necessity
Authors:
Hao Zhang
Abstract:
In spatial statistics and machine learning, the kernel matrix plays a pivotal role in prediction, classification, and maximum likelihood estimation. A thorough examination reveals that for large sample sizes, the kernel matrix becomes ill-conditioned, provided the sampling locations are fairly evenly distributed. This condition poses significant challenges to numerical algorithms used in prediction and estimation computations and necessitates an approximation to prediction and the Gaussian likelihood. A review of current methodologies for managing large spatial data indicates that some fail to address this ill-conditioning problem. Such ill-conditioning often results in low-rank approximations of the stochastic processes. This paper introduces various optimality criteria and provides solutions for each.
Submitted 6 November, 2023;
originally announced November 2023.
-
Regionalization of China's PM2.5 through Robust Spatio temporal Functional Clustering Method
Authors:
Tingyin Wang,
Xueqin Wang,
Xiaobo Guo,
Heping Zhang
Abstract:
The patterns of particulate matter with diameters that are generally 2.5 micrometers and smaller (PM2.5) are heterogeneous in China nationwide but can be homogeneous region-wide. To reduce the adverse effects from PM2.5, policymakers need to develop location-specific regulations based on nationwide clustering analysis of PM2.5 concentrations. However, such an analysis is challenging because the data have complex structures and are usually noisy. In this study, we propose a robust clustering framework using a novel concept of depth, which can handle both magnitude and shape outliers effectively and incorporate spatial information. We apply this framework to a PM2.5 dataset and reveal ten regions in China that have distinct PM2.5 patterns, and the homogeneity within each cluster is also confirmed. The clusters have clearly visible boundaries and enable policymakers to develop local emission control policies and establish regional collaborative systems to control air pollution in China.
Submitted 5 November, 2023;
originally announced November 2023.
-
On the Generalization Properties of Diffusion Models
Authors:
Puheng Li,
Zhong Li,
Huishuai Zhang,
Jiang Bian
Abstract:
Diffusion models are a class of generative models that serve to establish a stochastic transport map between an empirically observed, yet unknown, target distribution and a known prior. Despite their remarkable success in real-world applications, a theoretical understanding of their generalization capabilities remains underdeveloped. This work embarks on a comprehensive theoretical exploration of the generalization attributes of diffusion models. We establish theoretical estimates of the generalization gap that evolves in tandem with the training dynamics of score-based diffusion models, suggesting a polynomially small generalization error ($O(n^{-2/5}+m^{-4/5})$) on both the sample size $n$ and the model capacity $m$, evading the curse of dimensionality (i.e., not exponentially large in the data dimension) when early-stopped. Furthermore, we extend our quantitative analysis to a data-dependent scenario, wherein target distributions are portrayed as a succession of densities with progressively increasing distances between modes. This precisely elucidates the adverse effect of "modes shift" in ground truths on the model generalization. Moreover, these estimates are not solely theoretical constructs but have also been confirmed through numerical simulations. Our findings contribute to the rigorous understanding of diffusion models' generalization properties and provide insights that may guide practical applications.
Submitted 12 January, 2024; v1 submitted 3 November, 2023;
originally announced November 2023.
-
The Phase Transition Phenomenon of Shuffled Regression
Authors:
Hang Zhang,
Ping Li
Abstract:
We study the phase transition phenomenon inherent in the shuffled (permuted) regression problem, which has found numerous applications in databases, privacy, data analysis, etc. In this study, we aim to precisely identify the locations of the phase transition points by leveraging techniques from message passing (MP). In our analysis, we first transform the permutation recovery problem into a probabilistic graphical model. We then leverage the analytical tools rooted in the message passing (MP) algorithm and derive an equation to track the convergence of the MP algorithm. By linking this equation to the branching random walk process, we are able to characterize the impact of the signal-to-noise-ratio ($\mathsf{snr}$) on the permutation recovery. Depending on whether the signal is given or not, we separately investigate the oracle case and the non-oracle case. The bottleneck in identifying the phase transition regimes lies in deriving closed-form formulas for the corresponding critical points, but only in rare scenarios can one obtain such precise expressions. To tackle this technical challenge, this study proposes the Gaussian approximation method, which allows us to obtain the closed-form formulas in almost all scenarios. In the oracle case, our method can fairly accurately predict the phase transition $\mathsf{snr}$. In the non-oracle case, our algorithm can predict the maximum allowed number of permuted rows and uncover its dependency on the sample number.
Submitted 31 October, 2023;
originally announced October 2023.
-
A Survey of Methods for Estimating Hurst Exponent of Time Sequence
Authors:
Hong-Yan Zhang,
Zhi-Qiang Feng,
Si-Yu Feng,
Yu Zhou
Abstract:
The Hurst exponent is a significant indicator for characterizing the self-similarity and long-term memory properties of time sequences. It has wide applications in physics, technologies, engineering, mathematics, statistics, economics, psychology and so on. Currently, available methods for estimating the Hurst exponent of time sequences can be divided into different categories: time-domain methods and spectrum-domain methods based on the representation of time sequence, linear regression methods and Bayesian methods based on parameter estimation methods. Although various methods are discussed in literature, there are still some deficiencies: the descriptions of the estimation algorithms are just mathematics-oriented and the pseudo-codes are missing; the effectiveness and accuracy of the estimation algorithms are not clear; the classification of estimation methods is not considered and there is a lack of guidance for selecting the estimation methods. In this work, the emphasis is put on thirteen dominant methods for estimating the Hurst exponent. For the purpose of decreasing the difficulty of implementing the estimation methods with computer programs, the mathematical principles are discussed briefly and the pseudo-codes of algorithms are presented with necessary details. It is expected that the survey could help the researchers to select, implement and apply the estimation algorithms of interest in practical situations in an easy way.
Submitted 29 October, 2023;
originally announced October 2023.
-
An accelerated first-order regularized momentum descent ascent algorithm for stochastic nonconvex-concave minimax problems
Authors:
Huiling Zhang,
Zi Xu
Abstract:
Stochastic nonconvex minimax problems have attracted wide attention in machine learning, signal processing and many other fields in recent years. In this paper, we propose an accelerated first-order regularized momentum descent ascent algorithm (FORMDA) for solving stochastic nonconvex-concave minimax problems. The iteration complexity of the algorithm is proved to be $\tilde{\mathcal{O}}(\varepsilon ^{-6.5})$ to obtain an $\varepsilon$-stationary point, which achieves the best-known complexity bound for single-loop algorithms to solve the stochastic nonconvex-concave minimax problems under the stationarity of the objective function.
Submitted 23 October, 2023;
originally announced October 2023.
-
A SIMPLE Approach to Provably Reconstruct Ising Model with Global Optimality
Authors:
Junxian Zhu,
Xuanyu Chen,
Jin Zhu,
Xueqin Wang,
Heping Zhang
Abstract:
Reconstruction of the interaction network between random events is a critical problem arising from statistical physics and politics to sociology, biology, and psychology, and beyond. The Ising model lays the foundation for this reconstruction process, but finding the underlying Ising model from the least amount of observed samples in a computationally efficient manner has been historically challenging for half a century. By using the idea of sparsity learning, we present an approach named SIMPLE that has a dominant sample complexity from theoretical limit. Furthermore, a tuning-free algorithm is developed to give a statistically consistent solution of SIMPLE in polynomial time with high probability. On extensive benchmarked cases, the SIMPLE approach provably reconstructs underlying Ising models with global optimality. The application on the U.S. senators voting in the last six congresses reveals that both the Republicans and Democrats noticeably assemble in each congress; interestingly, the assembling of Democrats is particularly pronounced in the latest congress.
Submitted 13 October, 2023;
originally announced October 2023.
-
Distributed Estimation for Large-Scale Cox Regression with Poisson Subsampling
Authors:
Haixiang Zhang,
Yang Li,
HaiYing Wang
Abstract:
To ensure privacy protection and alleviate computational burden, we propose a Poisson-subsampling based distributed estimation procedure for the Cox model with massive survival datasets from multi-centered, decentralized sources. The proposed estimator is computed based on optimal subsampling probabilities that we derived and enables transmission of subsample-based summary level statistics between different storage sites with only one round of communication. For inference, the asymptotic properties of the proposed estimator were rigorously established. An extensive simulation study demonstrated that the proposed approach is effective. The methodology was applied to analyze a large dataset from the U.S. airlines.
Submitted 12 October, 2023;
originally announced October 2023.
-
Optimal Estimator for Linear Regression with Shuffled Labels
Authors:
Hang Zhang,
Ping Li
Abstract:
This paper considers the task of linear regression with shuffled labels, i.e., $\mathbf Y = \mathbf Π\mathbf X \mathbf B + \mathbf W$, where $\mathbf Y \in \mathbb R^{n\times m}, \mathbf Π \in \mathbb R^{n\times n}, \mathbf X\in \mathbb R^{n\times p}, \mathbf B \in \mathbb R^{p\times m}$, and $\mathbf W\in \mathbb R^{n\times m}$, respectively, represent the sensing results, (unknown or missing) corresponding information, sensing matrix, signal of interest, and additive sensing noise. Given the observation $\mathbf Y$ and sensing matrix $\mathbf X$, we propose a one-step estimator to reconstruct $(\mathbf Π, \mathbf B)$. From the computational perspective, our estimator's complexity is $O(n^3 + np^2m)$, which is no greater than the maximum complexity of a linear assignment algorithm (e.g., $O(n^3)$) and a least square algorithm (e.g., $O(np^2 m)$). From the statistical perspective, we divide the minimum $snr$ requirement into four regimes, namely the unknown, hard, medium, and easy regimes; and present sufficient conditions for the correct permutation recovery under each regime: $(i)$ $snr \geq Ω(1)$ in the easy regime; $(ii)$ $snr \geq Ω(\log n)$ in the medium regime; and $(iii)$ $snr \geq Ω((\log n)^{c_0}\cdot n^{{c_1}/{srank(\mathbf B)}})$ in the hard regime ($c_0, c_1$ are some positive constants and $srank(\mathbf B)$ denotes the stable rank of $\mathbf B$). In the end, we also provide numerical experiments to confirm the above claims.
Submitted 2 October, 2023;
originally announced October 2023.
-
Overcoming the Barrier of Orbital-Free Density Functional Theory for Molecular Systems Using Deep Learning
Authors:
He Zhang,
Siyuan Liu,
Jiacheng You,
Chang Liu,
Shuxin Zheng,
Ziheng Lu,
Tong Wang,
Nanning Zheng,
Bin Shao
Abstract:
Orbital-free density functional theory (OFDFT) is a quantum chemistry formulation that has a lower cost scaling than the prevailing Kohn-Sham DFT, which is increasingly desired for contemporary molecular research. However, its accuracy is limited by the kinetic energy density functional, which is notoriously hard to approximate for non-periodic molecular systems. Here we propose M-OFDFT, an OFDFT approach capable of solving molecular systems using a deep learning functional model. We build the essential non-locality into the model, which is made affordable by the concise density representation as expansion coefficients under an atomic basis. With techniques to address unconventional learning challenges therein, M-OFDFT achieves a comparable accuracy with Kohn-Sham DFT on a wide range of molecules untouched by OFDFT before. More attractively, M-OFDFT extrapolates well to molecules much larger than those seen in training, which unleashes the appealing scaling of OFDFT for studying large molecules including proteins, representing an advancement of the accuracy-efficiency trade-off frontier in quantum chemistry.
Submitted 9 March, 2024; v1 submitted 28 September, 2023;
originally announced September 2023.
-
Asset Bundling for Wind Power Forecasting
Authors:
Hanyu Zhang,
Mathieu Tanneau,
Chaofan Huang,
V. Roshan Joseph,
Shangkun Wang,
Pascal Van Hentenryck
Abstract:
The growing penetration of intermittent, renewable generation in US power grids, especially wind and solar generation, results in increased operational uncertainty. In that context, accurate forecasts are critical, especially for wind generation, which exhibits large variability and is historically harder to predict. To overcome this challenge, this work proposes a novel Bundle-Predict-Reconcile (BPR) framework that integrates asset bundling, machine learning, and forecast reconciliation techniques. The BPR framework first learns an intermediate hierarchy level (the bundles), then predicts wind power at the asset, bundle, and fleet level, and finally reconciles all forecasts to ensure consistency. This approach effectively introduces an auxiliary learning task (predicting the bundle-level time series) to help the main learning tasks. The paper also introduces new asset-bundling criteria that capture the spatio-temporal dynamics of wind power time series. Extensive numerical experiments are conducted on an industry-size dataset of 283 wind farms in the MISO footprint. The experiments consider short-term and day-ahead forecasts, and evaluate a large variety of forecasting models that include weather predictions as covariates. The results demonstrate the benefits of BPR, which consistently and significantly improves forecast accuracy over baselines, especially at the fleet level.
Submitted 28 September, 2023;
originally announced September 2023.
-
Causal inference with outcome dependent sampling and mismeasured outcome
Authors:
Min Zeng,
Zeyang Jia,
Zijian Sui,
Jinfeng Xu,
Hong Zhang
Abstract:
Outcome-dependent sampling designs are extensively utilized in various scientific disciplines, including epidemiology, ecology, and economics, with retrospective case-control studies being specific examples of such designs. Additionally, if the outcome used for sample selection is also mismeasured, then it is even more challenging to estimate the average treatment effect (ATE) accurately. To our knowledge, no existing method can address these two issues simultaneously. In this paper, we establish the identifiability of ATE and propose a novel method for estimating ATE in the context of generalized linear model. The estimator is shown to be consistent under some regularity conditions. To relax the model assumption, we also consider generalized additive model. We propose to estimate ATE using penalized B-splines and establish asymptotic properties for the proposed estimator. Our methods are evaluated through extensive simulation studies and the application to a dataset from the UK Biobank, with alcohol intake as the treatment and gout as the outcome.
Submitted 20 September, 2023;
originally announced September 2023.
-
A Stochastic Online Forecast-and-Optimize Framework for Real-Time Energy Dispatch in Virtual Power Plants under Uncertainty
Authors:
Wei Jiang,
Zhongkai Yi,
Li Wang,
Hanwei Zhang,
Jihai Zhang,
Fangquan Lin,
Cheng Yang
Abstract:
Aggregating distributed energy resources in power systems significantly increases uncertainties, in particular caused by the fluctuation of renewable energy generation. This issue has driven the necessity of widely exploiting advanced predictive control techniques under uncertainty to ensure long-term economics and decarbonization. In this paper, we propose a real-time uncertainty-aware energy dispatch framework, which is composed of two key elements: (i) A hybrid forecast-and-optimize sequential task, integrating deep learning-based forecasting and stochastic optimization, where these two stages are connected by the uncertainty estimation at multiple temporal resolutions; (ii) An efficient online data augmentation scheme, jointly involving model pre-training and online fine-tuning stages. In this way, the proposed framework is capable of rapidly adapting to the real-time data distribution, targeting uncertainties caused by data drift, model discrepancy and environment perturbations in the control process, and finally realizing an optimal and robust dispatch solution. The proposed framework won the championship in CityLearn Challenge 2022, which provided an influential opportunity to investigate the potential of AI application in the energy domain. In addition, comprehensive experiments are conducted to interpret its effectiveness in the real-life scenario of smart building energy management.
Submitted 14 September, 2023;
originally announced September 2023.
-
A Consistent and Scalable Algorithm for Best Subset Selection in Single Index Models
Authors:
Borui Tang,
Jin Zhu,
Junxian Zhu,
Xueqin Wang,
Heping Zhang
Abstract:
Analysis of high-dimensional data has led to increased interest in both single index models (SIMs) and best subset selection. SIMs provide an interpretable and flexible modeling framework for high-dimensional data, while best subset selection aims to find a sparse model from a large set of predictors. However, best subset selection in high-dimensional models is known to be computationally intractable. Existing methods tend to relax the selection, but do not yield the best subset solution. In this paper, we directly tackle the intractability by proposing the first provably scalable algorithm for best subset selection in high-dimensional SIMs. Our algorithmic solution enjoys the subset selection consistency and has the oracle property with a high probability. The algorithm comprises a generalized information criterion to determine the support size of the regression coefficients, eliminating the model selection tuning. Moreover, our method does not assume an error distribution or a specific link function and hence is flexible to apply. Extensive simulation results demonstrate that our method is not only computationally efficient but also able to exactly recover the best subset in various settings (e.g., linear regression, Poisson regression, heteroscedastic models).
Submitted 12 September, 2023;
originally announced September 2023.
-
Optimal Rate of Kernel Regression in Large Dimensions
Authors:
Weihao Lu,
Haobo Zhang,
Yicheng Li,
Manyun Xu,
Qian Lin
Abstract:
We perform a study on kernel regression for large-dimensional data (where the sample size $n$ depends polynomially on the dimension $d$ of the samples, i.e., $n\asymp d^γ$ for some $γ>0$). We first build a general tool to characterize the upper bound and the minimax lower bound of kernel regression for large-dimensional data through the Mendelson complexity $\varepsilon_{n}^{2}$ and the metric entropy $\bar{\varepsilon}_{n}^{2}$ respectively. When the target function falls into the RKHS associated with a (general) inner product model defined on $\mathbb{S}^{d}$, we utilize the new tool to show that the minimax rate of the excess risk of kernel regression is $n^{-1/2}$ when $n\asymp d^γ$ for $γ=2, 4, 6, 8, \cdots$. We then further determine the optimal rate of the excess risk of kernel regression for all the $γ>0$ and find that the curve of optimal rate varying along $γ$ exhibits several new phenomena including the {\it multiple descent behavior} and the {\it periodic plateau behavior}. As an application, for the neural tangent kernel (NTK), we also provide a similar explicit description of the curve of optimal rate. As a direct corollary, we know these claims hold for wide neural networks as well.
Submitted 8 September, 2023;
originally announced September 2023.
-
Towards more scientific meta-analyses
Authors:
Lily H. Zhang,
Menelaos Konstantinidis,
Marie-Abèle Bind,
Donald B. Rubin
Abstract:
Meta-analysis can be a critical part of the research process, often serving as the primary analysis on which the practitioners, policymakers, and individuals base their decisions. However, current literature synthesis approaches to meta-analysis typically estimate a different quantity than what is implicitly intended; concretely, standard approaches estimate the average effect of a treatment for a population of imperfect studies, rather than the true scientific effect that would be measured in a population of hypothetical perfect studies. We advocate for an alternative method, called response-surface meta-analysis, which models the relationship between the quality of the study design as predictor variables and its reported estimated effect size as the outcome variable in order to estimate the effect size obtained by the hypothetical ideal study. The idea was first introduced by Rubin several decades ago, and here we provide a practical implementation. First, we reintroduce the idea of response-surface meta-analysis, highlighting its focus on a scientifically-motivated estimand while proposing a straightforward implementation. Then we compare the approach to traditional meta-analysis techniques used in practice. We then implement response-surface meta-analysis and contrast its results with existing literature-synthesis approaches on both simulated data and a real-world example published by the Cochrane Collaboration. We conclude by detailing the primary challenges in the implementation of response-surface meta-analysis and offer some suggestions to tackle these challenges.
Submitted 25 August, 2023;
originally announced August 2023.
-
Weighting Based Approaches to Borrowing Historical Controls for Indirect comparison for Time-to-Event Data with a Cure Fraction
Authors:
Jixian Wang,
Hongtao Zhang,
Ram Tiwari
Abstract:
To use historical controls for indirect comparison with single-arm trials, the population difference between data sources should be adjusted to reduce confounding bias. The adjustment is more difficult for time-to-event data with a cure fraction. We propose different adjustment approaches based on pseudo observations and calibration weighting by entropy balancing. We show a simple way to obtain the pseudo observations for the cure rate and propose a simple weighted estimator based on them. Estimation of the survival function in presence of a cure fraction is also considered. Simulations are conducted to examine the proposed approaches. An application to a breast cancer study is presented.
Submitted 22 August, 2023;
originally announced August 2023.
-
Fixed-Point Algorithms for Solving the Critical Value and Upper Tail Quantile of Kuiper's Statistics
Authors:
Hong-Yan Zhang,
Wei Sun,
Xiao Chen,
Rui-Jia Lin,
Yu Zhou
Abstract:
Kuiper's statistic is a good measure for the difference of ideal distribution and empirical distribution in the goodness-of-fit test. However, it is a challenging problem to solve the critical value and upper tail quantile, or simply Kuiper pair, of Kuiper's statistics due to the difficulties of solving the nonlinear equation and reasonable approximation of infinite series. In this work, the contributions lie in three perspectives: firstly, the second order approximation for the infinite series of the cumulative distribution of the critical value is used to achieve higher precision; secondly, the principles and fixed-point algorithms for solving the Kuiper pair are presented with details; finally, a mistake about the critical value $c^α_n$ for $(α, n)=(0.01,30)$ in Kuiper's distribution table has been labeled and corrected where $n$ is the sample capacity and $α$ is the upper tail quantile. The algorithms are verified and validated by comparing with the table provided by Kuiper. The methods and algorithms proposed are enlightening and worth introducing to college students, computer programmers, engineers, experimental psychologists and so on.
Submitted 23 March, 2024; v1 submitted 18 August, 2023;
originally announced August 2023.