Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
License: CC BY 4.0
arXiv:2403.11163v1 [stat.ME] 17 Mar 2024

A Selective Review on Statistical Methods for Massive Data Computation: Distributed Computing, Subsampling, and Minibatch Techniques

Xuetong Lia𝑎{}^{a}start_FLOATSUPERSCRIPT italic_a end_FLOATSUPERSCRIPT, Yuan Gaoa𝑎{}^{a\ast}start_FLOATSUPERSCRIPT italic_a ∗ end_FLOATSUPERSCRIPT, Hong Changa𝑎{}^{a}start_FLOATSUPERSCRIPT italic_a end_FLOATSUPERSCRIPT, Danyang Huangb𝑏{}^{b}start_FLOATSUPERSCRIPT italic_b end_FLOATSUPERSCRIPT, Yingying Mac𝑐{}^{c}start_FLOATSUPERSCRIPT italic_c end_FLOATSUPERSCRIPT,

Rui Pand𝑑{}^{d}start_FLOATSUPERSCRIPT italic_d end_FLOATSUPERSCRIPT, Haobo Qie𝑒{}^{e}start_FLOATSUPERSCRIPT italic_e end_FLOATSUPERSCRIPT, Feifei Wangb𝑏{}^{b}start_FLOATSUPERSCRIPT italic_b end_FLOATSUPERSCRIPT, Shuyuan Wuf𝑓{}^{f}start_FLOATSUPERSCRIPT italic_f end_FLOATSUPERSCRIPT, Ke Xug𝑔{}^{g}start_FLOATSUPERSCRIPT italic_g end_FLOATSUPERSCRIPT, Jing Zhoub𝑏{}^{b}start_FLOATSUPERSCRIPT italic_b end_FLOATSUPERSCRIPT,

Xuening Zhuh{}^{h}start_FLOATSUPERSCRIPT italic_h end_FLOATSUPERSCRIPT, Yingqiu Zhug𝑔{}^{g}start_FLOATSUPERSCRIPT italic_g end_FLOATSUPERSCRIPT, Hansheng Wanga𝑎{}^{a}start_FLOATSUPERSCRIPT italic_a end_FLOATSUPERSCRIPT

a𝑎{}^{a}start_FLOATSUPERSCRIPT italic_a end_FLOATSUPERSCRIPTGuanghua School of Management, Peking University, Beijing, China; b𝑏{}^{b}start_FLOATSUPERSCRIPT italic_b end_FLOATSUPERSCRIPTCenter for Applied Statistics and School of Statistics, Renmin University of China, Beijing, China; c𝑐{}^{c}start_FLOATSUPERSCRIPT italic_c end_FLOATSUPERSCRIPTSchool of Economics and Management, Beihang University, Beijing, China; d𝑑{}^{d}start_FLOATSUPERSCRIPT italic_d end_FLOATSUPERSCRIPTSchool of Statistics and Mathematics, Central University of Finance and Economics, Beijing, China; e𝑒{}^{e}start_FLOATSUPERSCRIPT italic_e end_FLOATSUPERSCRIPTSchool of Statistics, Beijing Normal University, Beijing, China; f𝑓{}^{f}start_FLOATSUPERSCRIPT italic_f end_FLOATSUPERSCRIPTSchool of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, China; g𝑔{}^{g}start_FLOATSUPERSCRIPT italic_g end_FLOATSUPERSCRIPTSchool of Statistics, University of International Business and Economics, Beijing, China; h{}^{h}start_FLOATSUPERSCRIPT italic_h end_FLOATSUPERSCRIPTSchool of Data Science and MOE-Laboratory for National Development and Intelligent Governance, Fudan University, Shanghai, China


Abstract

This paper presents a selective review of statistical computation methods for massive data analysis. A huge amount of statistical methods for massive data computation have been rapidly developed in the past decades. In this work, we focus on three categories of statistical computation methods: (1) distributed computing, (2) subsampling methods, and (3) minibatch gradient techniques. The first class of literature is about distributed computing and focuses on the situation, where the dataset size is too huge to be comfortably handled by one single computer. In this case, a distributed computation system with multiple computers has to be utilized. The second class of literature is about subsampling methods and concerns about the situation, where the sample size of dataset is small enough to be placed on one single computer but too large to be easily processed by its memory as a whole. The last class of literature studies those minibatch gradient related optimization techniques, which have been extensively used for optimizing various deep learning models.

KEYWORDS: Distributed Computing, Massive Data Analysis, Minibatch Techniques, Stochastic Optimization, Subsampling Methods


11footnotetext: Yuan Gao (yuan_gao@pku.edu.cn) is the corresponding author.

1. INTRODUCTION

Modern statistical analysis often involves datasets of massive size (Fan et al.,, 2020), for which effective computation methods are indispensable. On one side, the huge demand for computation methods for massive data analysis places serious challenges on the traditional statistical methods, which have been developed for datasets of regular size. On the other side, it also stimulates new research efforts, which try to conquer computation challenges by statistical wisdom. The research along this direction not only benefits the real practices with massive datasets but also inspires new statistical theory. The objective of this work is to provide a selective review about this exciting research area, which has been rapidly developed during the past many years. Then the objective here is not to provide a complete list for all the research works related to massive data computation. This is obviously a mission impossible. Instead, we should focus on three categories of statistical computing methods. They are, respectively, distributed computing methods, subsampling methods, and minibatch gradient descent methods. We try to organize the most important and relevant ones associated with those three categories in a structured way, so that follow-up researchers might benefit. Due to our limited understanding about the past literature and also the space constraint of a regular manuscript, we might miss some important references therein. If that happens, our sincere apology in advance and we should be more than happy to hear the feedback.

As the title suggests, this review is a selective review about statistical methods for massive data analysis. Then, the meaning of “massive data” needs to be precisely defined. We argue that whether a dataset is massive or not is relative to the computation resource. In the most ideal situation, if one is given a super computer with an unlimited amount of hard drive and memory (both CPU and GPU memory) together with a super powerful CPU, then no dataset can be considered as massive. In this ideal situation, any dataset of any size can be easily placed on one hard drive, loaded into the memory as a whole, and then processed in no time. Unfortunately, such an ideal situation never happens in reality. In real practice, most researchers are given only a limited amount of computation resources, which put various constraints on the computation power. The constraints could be the hard drive. If the size of the data exceeds one single computer’s hard drive capacity, then a distributed system has to be used to place the data. A natural question arises immediately: should we also compute it in a distributed way? This inspires a large amount of research for distributed computing, if a powerful distributed computation system is indeed available. With such a system, we find that it might remain to be practically appealing to distribute a large dataset on a powerful distributed system, even if the data size is not strictly larger than one single computer’s hard drive. This makes the subsequent computation more convenient. This constitutes the first part of the selective literature to be reviewed in this work.

Distributed computing is a powerful solution for problems with extremely large scale datasets. However, this seems not the most typical situation. The most typical one in real practice is an embarrassing situation, where the dataset sizes are not extremely large but large enough to cause a lot of computation challenges. To fix the idea, consider for example a dataset of a size (for example) 100GB. Note that this is a size substantially smaller than that of a hard drive of a modern computer (e.g., 2TB), but much larger than the size of the typical memory (e.g., 32GB). For a dataset of this embarrassing size, one straightforward solution remains to be distributed computing, as mentioned in the previous paragraph. However, this straightforward solution seems not the only best one for at least two reasons. First, to implement a distributed computing algorithm, one needs a powerful distributed computation system. Depending on its size, the distributed computer system could be very expensive, if not completely not affordable. Second, to utilize a distributed computing system, appropriate programming techniques are necessarily needed. Popularly used programming frameworks (e.g., Spark, Hadoop) need to be learned. This is unfortunately a painful learning process for most field practitioners (e.g., a medical scientist), who are not professional statistical programmers. Therefore, it seems that there is a practical need for a handy computation method, which can deal with datasets of any size on a given hard drive and can be easily implemented on one single computer. The key challenge here is how to accomplish a massive data computation task with limited memory (both CPU and GPU memories) constraints. This leads to a huge body of literature about subsampling, and/or streaming data analysis. This constitutes the second part of the literature to be reviewed this work.

Both the problems of distributed computing and computing with memory constraint concern about the computation problems of datasets with massive sizes. However, we often encounter situations where not only the dataset sizes are large, but also the model sizes are extremely large. The most typical example in this regard is various deep learning methods. To fix the idea, consider for example the famous ImageNet dataset of Deng et al., (2009), which contains a total of over 1.3 million color images belonging to 1,000 classes. The total amount of images is about 150GB in size. Next, consider for example a classical deep learning model VGG16 (Simonyan and Zisserman,, 2015). This is a convolution neural network (CNN) model with a total of over 130 million parameters. To train this VGG16 model on the ImageNet dataset, all the model parameters need to be fully placed in the GPU memory for fast tensor computation. Unfortunately, once this sophisticated VGG16 model is fully loaded into the GPU memory, the space left for data processing is inevitably significantly reduced. Consequently, the ImageNet dataset has to be processed in a minibatch-by-minibatch manner. Here, a minibatch refers to a small or even tiny subset of the whole sample. Depending on the way this subsample is generated, we might have streaming data based minibatches (Chen et al., 2020a, ), subsampling based stochastic minibatches (Gower et al.,, 2019), and random partition based minibatches (Qi et al., 2023b, ; Gao et al.,, 2023). This constitutes the third part of the literature to be reviewed in this work.

The rest of this article is organized as follows. Section 2 reviews the literature about distributed computing. Section 3 studies various subsampling methods. Section 4 discusses minibatch gradient related techniques. The article is then concluded with a brief discussion in Section 5.

2. DISTRIBUTED COMPUTING

2.1. Theoretical Framework of Distributed Computing

Consider a standard statistical learning problem. Assume a total of N𝑁Nitalic_N observations, where N𝑁Nitalic_N is notably large. For each observation i𝑖iitalic_i, we collect a response variable Yi1subscript𝑌𝑖superscript1Y_{i}\in\mathbb{R}^{1}italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT and its corresponding feature vector Xipsubscript𝑋𝑖superscript𝑝X_{i}\in\mathbb{R}^{p}italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_p end_POSTSUPERSCRIPT. The primary goal here is to accurately estimate an unknown parameter θ0psubscript𝜃0superscript𝑝\theta_{0}\in\mathbb{R}^{p}italic_θ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_p end_POSTSUPERSCRIPT through a suitably defined loss function. To be specific, define the empirical loss function as (θ)=i=1N(Xi,Yi;θ)𝜃superscriptsubscript𝑖1𝑁subscript𝑋𝑖subscript𝑌𝑖𝜃\mathcal{L}(\theta)=\sum_{i=1}^{N}\ell(X_{i},Y_{i};\theta)caligraphic_L ( italic_θ ) = ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT roman_ℓ ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ; italic_θ ). Here, (Xi,Yi;θ)subscript𝑋𝑖subscript𝑌𝑖𝜃\ell(X_{i},Y_{i};\theta)roman_ℓ ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ; italic_θ ) represents the loss function for the i𝑖iitalic_i-th observation.

In conventional scenarios with relatively small N𝑁Nitalic_N, this learning problem could be easily solved using various standard optimization algorithms (e.g., Newton-Raphson method and gradient descent method). Nevertheless, for datasets of massive size, implementation of these standard algorithms becomes practically challenging or even infeasible. Consider, for example, the classical Newton-Raphson algorithm. Let θ^(t)superscript^𝜃𝑡\widehat{\theta}^{(t)}over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT be the estimator derived in the t𝑡titalic_t-th iteration. Then, the (t+1𝑡1t+1italic_t + 1)-th step estimator is updated as follows:

θ^(t+1)=θ^(t){¨(θ^(t))}1˙(θ^(t)),superscript^𝜃𝑡1superscript^𝜃𝑡superscript¨superscript^𝜃𝑡1˙superscript^𝜃𝑡\widehat{\theta}^{(t+1)}=\widehat{\theta}^{(t)}-\big{\{}\ddot{\mathcal{L}}\big% {(}\widehat{\theta}^{(t)}\big{)}\big{\}}^{-1}\dot{\mathcal{L}}\big{(}\widehat{% \theta}^{(t)}\big{)},over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_t + 1 ) end_POSTSUPERSCRIPT = over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT - { over¨ start_ARG caligraphic_L end_ARG ( over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT ) } start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT over˙ start_ARG caligraphic_L end_ARG ( over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT ) , (2.1)

where ˙(θ)˙𝜃\dot{\mathcal{L}}(\theta)over˙ start_ARG caligraphic_L end_ARG ( italic_θ ) and ¨(θ)¨𝜃\ddot{\mathcal{L}}(\theta)over¨ start_ARG caligraphic_L end_ARG ( italic_θ ) represent the 1st- and 2nd-order derivatives of the loss function ()\mathcal{L}(\cdot)caligraphic_L ( ⋅ ) with respect to θ.𝜃\theta.italic_θ . With a fixed feature dimension p𝑝pitalic_p, the computational complexity of the Newton-Raphson algorithm is at least of the order O(N)𝑂𝑁O(N)italic_O ( italic_N ) in each iteration. In the case of large datasets with an exceedingly large N𝑁Nitalic_N, such computation costs could be practically challenging or even infeasible. To address this issue, various distributed computing methods have been developed. The key idea of distributed computing is to divide a massive dataset into smaller pieces, which can be processed simultaneously across many multiple computer machines.

Assume the N𝑁Nitalic_N samples are distributed across a total of M𝑀Mitalic_M different local machines and each machine is assigned nmsubscript𝑛𝑚n_{m}italic_n start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT observations for 1mM1𝑚𝑀1\leq m\leq M1 ≤ italic_m ≤ italic_M. It follows that m=1Mnm=N.superscriptsubscript𝑚1𝑀subscript𝑛𝑚𝑁\sum_{m=1}^{M}n_{m}=N.∑ start_POSTSUBSCRIPT italic_m = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT italic_n start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT = italic_N . We then denote the whole sample as 𝒮F={1,2,,N}subscript𝒮𝐹12𝑁\mathcal{S}_{F}=\{1,2,...,N\}caligraphic_S start_POSTSUBSCRIPT italic_F end_POSTSUBSCRIPT = { 1 , 2 , … , italic_N } and the sample assigned to the m𝑚mitalic_m-th local computer as 𝒮(m)𝒮Fsubscript𝒮𝑚subscript𝒮𝐹\mathcal{S}_{(m)}\subset\mathcal{S}_{F}caligraphic_S start_POSTSUBSCRIPT ( italic_m ) end_POSTSUBSCRIPT ⊂ caligraphic_S start_POSTSUBSCRIPT italic_F end_POSTSUBSCRIPT. Thus, we have m=1M𝒮(m)=𝒮Fsuperscriptsubscript𝑚1𝑀subscript𝒮𝑚subscript𝒮𝐹\cup_{m=1}^{M}\mathcal{S}_{(m)}=\mathcal{S}_{F}∪ start_POSTSUBSCRIPT italic_m = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT caligraphic_S start_POSTSUBSCRIPT ( italic_m ) end_POSTSUBSCRIPT = caligraphic_S start_POSTSUBSCRIPT italic_F end_POSTSUBSCRIPT, and 𝒮(m1)𝒮(m2)=subscript𝒮subscript𝑚1subscript𝒮subscript𝑚2\mathcal{S}_{(m_{1})}\cap\mathcal{S}_{(m_{2})}=\emptysetcaligraphic_S start_POSTSUBSCRIPT ( italic_m start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) end_POSTSUBSCRIPT ∩ caligraphic_S start_POSTSUBSCRIPT ( italic_m start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) end_POSTSUBSCRIPT = ∅ for any m1m2subscript𝑚1subscript𝑚2{m_{1}}\not={m_{2}}italic_m start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ≠ italic_m start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT, and |𝒮m|=nmsubscript𝒮𝑚subscript𝑛𝑚|\mathcal{S}_{m}|=n_{m}| caligraphic_S start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT | = italic_n start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT. Recall that the global loss function is defined as (θ)=N1i=1N(Xi,Yi;θ)𝜃superscript𝑁1superscriptsubscript𝑖1𝑁subscript𝑋𝑖subscript𝑌𝑖𝜃\mathcal{L}(\theta)=N^{-1}\sum_{i=1}^{N}\ell(X_{i},Y_{i};\theta)caligraphic_L ( italic_θ ) = italic_N start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT roman_ℓ ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ; italic_θ ). The averaged local sample size is denoted as n=M1m=1Mnm𝑛superscript𝑀1superscriptsubscript𝑚1𝑀subscript𝑛𝑚n=M^{-1}\sum_{m=1}^{M}n_{m}italic_n = italic_M start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_m = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT italic_n start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT. Define θ^=argminθ(θ)^𝜃subscriptargmin𝜃𝜃\widehat{\theta}=\operatorname*{argmin}_{\theta}\mathcal{L}(\theta)over^ start_ARG italic_θ end_ARG = roman_argmin start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT caligraphic_L ( italic_θ ) and θ0=argminθE{(Xi,Yi;θ)}subscript𝜃0subscriptargmin𝜃𝐸subscript𝑋𝑖subscript𝑌𝑖𝜃\theta_{0}=\operatorname*{argmin}_{\theta}E\big{\{}\ell(X_{i},Y_{i};\theta)% \big{\}}italic_θ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = roman_argmin start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT italic_E { roman_ℓ ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ; italic_θ ) } as the global estimator and true parameter, respectively. Subsequently, define the local loss function on the m𝑚mitalic_m-th local computer as (m)(θ)=nm1i𝒮m(Xi,Yi;θ)subscript𝑚𝜃superscriptsubscript𝑛𝑚1subscript𝑖subscript𝒮𝑚subscript𝑋𝑖subscript𝑌𝑖𝜃\mathcal{L}_{(m)}(\theta)={\color[rgb]{0,0,0}\definecolor[named]{% pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill% {0}n_{m}^{-1}}\sum_{i\in\mathcal{S}_{m}}\ell(X_{i},Y_{i};\theta)caligraphic_L start_POSTSUBSCRIPT ( italic_m ) end_POSTSUBSCRIPT ( italic_θ ) = italic_n start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_i ∈ caligraphic_S start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUBSCRIPT roman_ℓ ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ; italic_θ ). Let θ^(m)=argminθ(m)(θ)subscript^𝜃𝑚subscriptargmin𝜃subscript𝑚𝜃\widehat{\theta}_{(m)}=\operatorname*{argmin}_{\theta}\mathcal{L}_{(m)}(\theta)over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT ( italic_m ) end_POSTSUBSCRIPT = roman_argmin start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT caligraphic_L start_POSTSUBSCRIPT ( italic_m ) end_POSTSUBSCRIPT ( italic_θ ) be the estimator locally obtained on the m𝑚mitalic_m-th local computer. Moreover, denote ˙(Xi,Yi;θ)=(Xi,Yi;θ)/θp˙subscript𝑋𝑖subscript𝑌𝑖𝜃subscript𝑋𝑖subscript𝑌𝑖𝜃𝜃superscript𝑝\dot{\ell}(X_{i},Y_{i};\theta)=\partial\ell(X_{i},Y_{i};\theta)/\partial\theta% \in\mathbb{R}^{p}over˙ start_ARG roman_ℓ end_ARG ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ; italic_θ ) = ∂ roman_ℓ ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ; italic_θ ) / ∂ italic_θ ∈ blackboard_R start_POSTSUPERSCRIPT italic_p end_POSTSUPERSCRIPT and ¨(Xi,Yi;θ)=(Xi,Yi;θ)/θθp×p¨subscript𝑋𝑖subscript𝑌𝑖𝜃subscript𝑋𝑖subscript𝑌𝑖𝜃𝜃superscript𝜃topsuperscript𝑝𝑝\ddot{\ell}(X_{i},Y_{i};\theta)=\partial\ell(X_{i},Y_{i};\theta)/\partial% \theta\theta^{\top}\in\mathbb{R}^{p\times p}over¨ start_ARG roman_ℓ end_ARG ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ; italic_θ ) = ∂ roman_ℓ ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ; italic_θ ) / ∂ italic_θ italic_θ start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_p × italic_p end_POSTSUPERSCRIPT as the 1st- and 2nd-order derivatives of (Xi,Yi;θ)subscript𝑋𝑖subscript𝑌𝑖𝜃\ell(X_{i},Y_{i};\theta)roman_ℓ ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ; italic_θ ) with respect to θ𝜃\thetaitalic_θ, respectively.

2.2. One-Shot Methods

For distributed statistical learning, various one-shot (OS) methods have been developed (Mcdonald et al.,, 2009; Zinkevich et al.,, 2010; Zhang et al.,, 2012; Rosenblatt and Nadler,, 2016; Lee et al.,, 2017; Hector and Song,, 2020). The basic idea is to calculate some important statistics on each local machine based on the data stored in each local machine in a fully parallel way. Subsequently, they are sent to the central machine, where these statistics are then assembled into one final estimator. Specifically, each local machine 1mM1𝑚𝑀1\leq m\leq M1 ≤ italic_m ≤ italic_M uses local sample 𝒮(m)subscript𝒮𝑚\mathcal{S}_{(m)}caligraphic_S start_POSTSUBSCRIPT ( italic_m ) end_POSTSUBSCRIPT to compute the local estimator θ^(m)=argminθ(m)(θ)subscript^𝜃𝑚subscriptargmin𝜃subscript𝑚𝜃\widehat{\theta}_{(m)}=\operatorname*{argmin}_{\theta}\mathcal{L}_{(m)}(\theta)over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT ( italic_m ) end_POSTSUBSCRIPT = roman_argmin start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT caligraphic_L start_POSTSUBSCRIPT ( italic_m ) end_POSTSUBSCRIPT ( italic_θ ). Subsequently, the central server collects these local estimates and aggregates them to obtain the final estimator M1m=1Mθ^(m)superscript𝑀1superscriptsubscript𝑚1𝑀subscript^𝜃𝑚M^{-1}\sum_{m=1}^{M}\widehat{\theta}_{(m)}italic_M start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_m = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT ( italic_m ) end_POSTSUBSCRIPT, which is denoted as the OS estimator θ^ossubscript^𝜃os\widehat{\theta}_{\operatorname{os}}over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT roman_os end_POSTSUBSCRIPT.

Extensive research has been proposed in this field. For example, Chen and Xie, (2014) studied the properties of one-shot estimator based on penalized generalized linear regression with smoothly clipped absolute deviation (SCAD) penalty. Battey et al., (2018) proposed high-dimensional one-shot estimators based on Wald tests and Rao’s score tests. They extended the classical one-shot estimator from low-dimensional generalized linear regression to high-dimensional sparse scenarios. Lian and Fan, (2018) developed a debiased form of one-shot estimator for support vector machines for ultra-high-dimensional data. Tang et al., (2020) used the confidence distribution approach to combine bias-corrected lasso-type estimates computed in each local machine in the generalized linear model setting. The one-shot strategy for correlated outcomes is also discussed in the previous literature. One notable work is the distributed and integrated method of moments (DIMM) proposed by Hector and Song, (2021), which addresses the estimation problem in a regression setting with high-dimensional correlated outcomes. The key idea is to split all outcomes into blocks of low-dimensional response subvectors, then analyze these blocks in a distributed scheme, and finally combine the block-specific results using a closed-form meta-estimator. By this way, the computational challenges associated with high-dimensional correlated outcomes are alleviated. A generalization of DIMM is further developed in Hector and Song, (2020), which doubly divides the data at both the outcome and subject levels to speed up computation. Recently, a distributed empirical likelihood (DEL) method has been proposed to solve the estimation problem for imbalanced local datasets in the framework of integrative analysis (Zhou et al.,, 2023).

As one can see, the OS method is easy to implement. It is also communicationally efficient because it requires only one round of communication between the local computers and the central computer (i.e., transferring the local estimates θ^(m)subscript^𝜃𝑚\widehat{\theta}_{(m)}over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT ( italic_m ) end_POSTSUBSCRIPTs). However, many researchers (Zhang et al.,, 2012; Wang et al.,, 2021; Wu et al., 2023c, ) have pointed out that a number of critical conditions are necessarily needed by various OS methods to achieve the same asymptotic efficiency as the global estimator θ^^𝜃\widehat{\theta}over^ start_ARG italic_θ end_ARG. The first condition is uniformity, implying that the massive data should be distributed across local computers in a relatively uniform manner so that the local sample sizes across different local machines should be approximately equal. The second condition is randomness, indicating that the massive data should be distributed across local computers as randomly as possible. The third condition is sufficiency, signifying that the sample size on each local machine should not be too small. To be more precise, it typically requires n2/Nsuperscript𝑛2𝑁n^{2}/N\to\inftyitalic_n start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT / italic_N → ∞ unless some important biased reduction techniques (e.g., jackknifing) have been used (Wu et al., 2023c, ).

However, in real practice, these conditions are often violated to some extent. For example, practitioners rarely distribute large datasets in a completely uniform and random manner. Consequently, understanding how the violation of these conditions affects the statistical performance of the OS estimators becomes a topic of significant interest. Intuitively, when the uniformity condition is violated, at least one local computer ends up with a relatively tiny sample size. Consequently, the local estimates generated by these local computers may exhibit significantly larger variability or bias than others. When the randomness condition is violated, the local estimates from different local computers could be seriously biased. When the sufficiency condition is violated, the bias of each local estimator θ^(m)subscript^𝜃𝑚\widehat{\theta}_{(m)}over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT ( italic_m ) end_POSTSUBSCRIPT with an order O(1/n)𝑂1𝑛O(1/n)italic_O ( 1 / italic_n ) becomes non-negligible as compared with O(1/N)𝑂1𝑁O(1/\sqrt{N})italic_O ( 1 / square-root start_ARG italic_N end_ARG ), leading to a noticeable bias in the resulting OS estimator θ^ossubscript^𝜃os\widehat{\theta}_{\operatorname{os}}over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT roman_os end_POSTSUBSCRIPT. In each scenario, θ^ossubscript^𝜃os\widehat{\theta}_{\operatorname{os}}over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT roman_os end_POSTSUBSCRIPT becomes statistically inefficient or even inconsistent, as rigorously demonstrated by Wang et al., (2021).

To address the challenges posed by the lack of distribution uniformity and randomness, a one-step upgraded pilot (OSUP) estimator was proposed by Wang et al., (2021). The OSUP method comprises several steps and is well-suited for a broad class of models with a likelihood specification. Specifically, to compute the OSUP estimator, a number of n0subscript𝑛0n_{0}italic_n start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT pilot samples should be randomly selected from different local computers and then transferred to the central computer. Then the central computer computes a pilot estimator θ^psubscript^𝜃𝑝\widehat{\theta}_{p}over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT by maximizing the log-likelihood function (Xi,Yi;θ)subscript𝑋𝑖subscript𝑌𝑖𝜃\ell(X_{i},Y_{i};\theta)roman_ℓ ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ; italic_θ ) of all the pilot samples, i.e., θ^p=argmaxθi𝒫(Xi,Yi;θ)subscript^𝜃𝑝subscriptargmax𝜃subscript𝑖𝒫subscript𝑋𝑖subscript𝑌𝑖𝜃\widehat{\theta}_{p}=\operatorname*{argmax}_{\theta}\sum_{i\in\mathcal{P}}\ell% (X_{i},Y_{i};\theta)over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT = roman_argmax start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT italic_i ∈ caligraphic_P end_POSTSUBSCRIPT roman_ℓ ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ; italic_θ ), where 𝒫𝒫\mathcal{P}caligraphic_P is the set of pilot samples. The pilot estimator θ^psubscript^𝜃𝑝\widehat{\theta}_{p}over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT is n0subscript𝑛0\sqrt{n_{0}}square-root start_ARG italic_n start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_ARG-consistent for the target parameter θ𝜃\thetaitalic_θ. However, due to its smaller sample size (i.e., n0Nmuch-less-thansubscript𝑛0𝑁n_{0}\ll Nitalic_n start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ≪ italic_N), θ^psubscript^𝜃𝑝\widehat{\theta}_{p}over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT is not statistically as efficient as the global estimator. To further improve the statistical efficiency, the central computer broadcasts θ^psubscript^𝜃𝑝\widehat{\theta}_{p}over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT back to all local computers. Then each local computer considers θ^psubscript^𝜃𝑝\widehat{\theta}_{p}over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT as an initial point and computes the 1st and 2nd order derivatives for its local log-likelihood function as ˙m(θ^p)=i𝒮(m)˙(Xi,Yi;θ^p)subscript˙𝑚subscript^𝜃𝑝subscript𝑖subscript𝒮𝑚˙subscript𝑋𝑖subscript𝑌𝑖subscript^𝜃𝑝\dot{\ell}_{m}(\widehat{\theta}_{p})=\sum_{i\in\mathcal{S}_{(m)}}\dot{\ell}(X_% {i},Y_{i};\widehat{\theta}_{p})over˙ start_ARG roman_ℓ end_ARG start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ( over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT ) = ∑ start_POSTSUBSCRIPT italic_i ∈ caligraphic_S start_POSTSUBSCRIPT ( italic_m ) end_POSTSUBSCRIPT end_POSTSUBSCRIPT over˙ start_ARG roman_ℓ end_ARG ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ; over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT ) and ¨m(θ^p)=i𝒮(m)¨(Xi,Yi;θ^p)subscript¨𝑚subscript^𝜃𝑝subscript𝑖subscript𝒮𝑚¨subscript𝑋𝑖subscript𝑌𝑖subscript^𝜃𝑝\ddot{\ell}_{m}(\widehat{\theta}_{p})=\sum_{i\in\mathcal{S}_{(m)}}\ddot{\ell}(% X_{i},Y_{i};\widehat{\theta}_{p})over¨ start_ARG roman_ℓ end_ARG start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ( over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT ) = ∑ start_POSTSUBSCRIPT italic_i ∈ caligraphic_S start_POSTSUBSCRIPT ( italic_m ) end_POSTSUBSCRIPT end_POSTSUBSCRIPT over¨ start_ARG roman_ℓ end_ARG ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ; over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT ). These derivatives are then communicated to the central computer for summation, i.e., ˙(θ^p)=m˙m(θ^p)˙subscript^𝜃𝑝subscript𝑚subscript˙𝑚subscript^𝜃𝑝\dot{\ell}(\widehat{\theta}_{p})=\sum_{m}\dot{\ell}_{m}(\widehat{\theta}_{p})over˙ start_ARG roman_ℓ end_ARG ( over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT ) = ∑ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT over˙ start_ARG roman_ℓ end_ARG start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ( over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT ) and ¨(θ^p)=m¨m(θ^p)¨subscript^𝜃𝑝subscript𝑚subscript¨𝑚subscript^𝜃𝑝\ddot{\ell}(\widehat{\theta}_{p})=\sum_{m}\ddot{\ell}_{m}(\widehat{\theta}_{p})over¨ start_ARG roman_ℓ end_ARG ( over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT ) = ∑ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT over¨ start_ARG roman_ℓ end_ARG start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ( over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT ). Based on the summarized derivative information along with the pilot estimate, a novel one-step upgrading is performed by the central computer. This leads to the final OSUP estimator as follows,

θ^OSUP=θ^p{1N¨(θ^p)}1{1N˙(θ^p)}.subscript^𝜃OSUPsubscript^𝜃𝑝superscript1𝑁¨subscript^𝜃𝑝11𝑁˙subscript^𝜃𝑝\widehat{\theta}_{\text{OSUP}}=\widehat{\theta}_{p}-\Big{\{}\frac{1}{N}\ddot{% \ell}(\widehat{\theta}_{p})\Big{\}}^{-1}\Big{\{}\frac{1}{N}\dot{\ell}(\widehat% {\theta}_{p})\Big{\}}.over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT OSUP end_POSTSUBSCRIPT = over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT - { divide start_ARG 1 end_ARG start_ARG italic_N end_ARG over¨ start_ARG roman_ℓ end_ARG ( over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT ) } start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT { divide start_ARG 1 end_ARG start_ARG italic_N end_ARG over˙ start_ARG roman_ℓ end_ARG ( over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT ) } .

Compared with a standard OS estimator, the OSUP estimator incurs an extra computational cost for obtaining the pilot sample. However, the benefits derived from the OSUP method are significant, leading to an estimator with the same statistical efficiency as the global estimator under very mild conditions (Wang et al.,, 2021). A similar one-step estimator is also studied by Huang and Huo, (2019). However, the key difference is that the initial estimator used in Huang and Huo, (2019) is the simple average of all local estimators.

To address the challenges posed by the lack of local sufficient sample size, Wu et al., 2023c developed a jackknife debiased (JDS) estimator to reduce the estimation bias based on the moment estimator. It should be noted that this method was originally proposed for subsampling. However, the key idea is also readily applicable to distributed computing. To be specific, they first define a jackknife estimator θ^j(m)subscriptsuperscript^𝜃𝑚𝑗\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}% \pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}\widehat{\theta}^{(m)}_{% -j}over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_m ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT - italic_j end_POSTSUBSCRIPT for the m𝑚mitalic_m-th machine as

θ^j(m)=argminθ1n1i𝒮(m)ij(Xi,Yi;θ).subscriptsuperscript^𝜃𝑚𝑗subscriptargmin𝜃1𝑛1superscriptsubscript𝑖subscript𝒮𝑚𝑖𝑗subscript𝑋𝑖subscript𝑌𝑖𝜃\widehat{\theta}^{(m)}_{-j}=\operatorname*{argmin}_{\theta}\frac{1}{n-1}\sum_{% i\in\mathcal{S}_{(m)}}^{i\neq j}\ell(X_{i},Y_{i};\theta).over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_m ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT - italic_j end_POSTSUBSCRIPT = roman_argmin start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT divide start_ARG 1 end_ARG start_ARG italic_n - 1 end_ARG ∑ start_POSTSUBSCRIPT italic_i ∈ caligraphic_S start_POSTSUBSCRIPT ( italic_m ) end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i ≠ italic_j end_POSTSUPERSCRIPT roman_ℓ ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ; italic_θ ) .

It could be verified that Bias(θ^j(m))Biassubscriptsuperscript^𝜃𝑚𝑗\text{Bias}\big{(}\widehat{\theta}^{(m)}_{-j}\big{)}Bias ( over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_m ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT - italic_j end_POSTSUBSCRIPT ) approximately equals τ/(n1)𝜏𝑛1\tau/(n-1)italic_τ / ( italic_n - 1 ) for some constant τ𝜏\tauitalic_τ (Shao and Tu,, 1995). Then, n1j𝒮mBias(θ^j(m))τ/(n1)superscript𝑛1subscript𝑗subscript𝒮𝑚Biassubscriptsuperscript^𝜃𝑚𝑗𝜏𝑛1n^{-1}\sum_{j\in\mathcal{S}_{m}}\text{Bias}\big{(}\widehat{\theta}^{(m)}_{-j}% \big{)}\approx\tau/(n-1)italic_n start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_j ∈ caligraphic_S start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUBSCRIPT Bias ( over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_m ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT - italic_j end_POSTSUBSCRIPT ) ≈ italic_τ / ( italic_n - 1 ) and E(n1j𝒮mθ^j(m)E\big{(}n^{-1}\sum_{j\in\mathcal{S}_{m}}\widehat{\theta}^{(m)}_{-j}-italic_E ( italic_n start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_j ∈ caligraphic_S start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUBSCRIPT over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_m ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT - italic_j end_POSTSUBSCRIPT - θ^(m))τ/{n(n1)}\widehat{\theta}^{(m)}\big{)}\approx\tau/\{n(n-1)\}over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_m ) end_POSTSUPERSCRIPT ) ≈ italic_τ / { italic_n ( italic_n - 1 ) }. This inspires an estimator for the bias, which is given by Bias^(m)=(n1)n1j𝒮mθ^j(m)(n1)θ^(m)superscript^Bias𝑚𝑛1superscript𝑛1subscript𝑗subscript𝒮𝑚subscriptsuperscript^𝜃𝑚𝑗𝑛1superscript^𝜃𝑚\widehat{\text{Bias}}^{(m)}=(n-1)n^{-1}\sum_{j\in\mathcal{S}_{m}}\widehat{% \theta}^{(m)}_{-j}-(n-1)\widehat{\theta}^{(m)}over^ start_ARG Bias end_ARG start_POSTSUPERSCRIPT ( italic_m ) end_POSTSUPERSCRIPT = ( italic_n - 1 ) italic_n start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_j ∈ caligraphic_S start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUBSCRIPT over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_m ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT - italic_j end_POSTSUBSCRIPT - ( italic_n - 1 ) over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_m ) end_POSTSUPERSCRIPT. Accordingly, a bias-corrected estimator for the m𝑚mitalic_m-th machine could be proposed as θ^JDS(m)=θ^(m)Bias^(m).superscriptsubscript^𝜃JDS𝑚superscript^𝜃𝑚superscript^Bias𝑚\widehat{\theta}_{\textup{JDS}}^{(m)}=\widehat{\theta}^{(m)}-\widehat{\text{% Bias}}^{(m)}.over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT JDS end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_m ) end_POSTSUPERSCRIPT = over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_m ) end_POSTSUPERSCRIPT - over^ start_ARG Bias end_ARG start_POSTSUPERSCRIPT ( italic_m ) end_POSTSUPERSCRIPT . Thereafter, θ^JDS(m)superscriptsubscript^𝜃JDS𝑚\widehat{\theta}_{\textup{JDS}}^{(m)}over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT JDS end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_m ) end_POSTSUPERSCRIPTs can be further averaged across different m𝑚mitalic_m. As a consequence, the final JDS estimator could be obtained as θ^JDS=M1m=1Mθ^JDS(m)subscript^𝜃JDSsuperscript𝑀1subscriptsuperscript𝑀𝑚1superscriptsubscript^𝜃JDS𝑚\widehat{\theta}_{\textup{JDS}}=M^{-1}\sum^{M}_{m=1}\widehat{\theta}_{\textup{% JDS}}^{(m)}over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT JDS end_POSTSUBSCRIPT = italic_M start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ∑ start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_m = 1 end_POSTSUBSCRIPT over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT JDS end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_m ) end_POSTSUPERSCRIPT. Subsequently, Wu et al., 2023c rigorously verified that Bias(θ^JDS)=O(1/n2)+O(1/N)Biassubscript^𝜃JDS𝑂1superscript𝑛2𝑂1𝑁\text{Bias}(\widehat{\theta}_{\textup{JDS}})=O(1/n^{2})+O(1/N)Bias ( over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT JDS end_POSTSUBSCRIPT ) = italic_O ( 1 / italic_n start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) + italic_O ( 1 / italic_N ) and the asymptotic variance of θ^JDSsubscript^𝜃JDS\widehat{\theta}_{\textup{JDS}}over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT JDS end_POSTSUBSCRIPT remains the same as that of θ^^𝜃\widehat{\theta}over^ start_ARG italic_θ end_ARG. As a consequence, excellent statistical efficiency can be achieved by θ^JDSsubscript^𝜃JDS\widehat{\theta}_{\textup{JDS}}over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT JDS end_POSTSUBSCRIPT with a very small size n𝑛nitalic_n. This bias correction method has been theoretically studied for moment estimator; however, the theoretical properties for the estimator computed from a general loss function remain unknown.

2.3. Efficient Iterative Approach

To improve the statistical efficiency of one-shot estimators, various distributed iterative methods can be considered. Since distributed computing requires passing messages among multiple computers, naively applying traditional iterative methods to a distributed system often incurs expensive communication costs. Therefore, how to achieve excellent statistical efficiency with well-controlled communication costs becomes the key issue (Jordan et al.,, 2019).

To illustrate this point, consider for example extending the iterative Newton-Raphson algorithm (2.1) to the distributed scenario. Recall that 𝒮(m)subscript𝒮𝑚\mathcal{S}_{(m)}caligraphic_S start_POSTSUBSCRIPT ( italic_m ) end_POSTSUBSCRIPT collects the indices of samples allocated to the m𝑚mitalic_m-th local computer. Given the current estimator θ^(t)superscript^𝜃𝑡\widehat{\theta}^{(t)}over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT, we can compute the 1st and 2nd order derivatives of the loss function as

˙(θ^(t))=M1m=1M˙(m)(θ^(t))and¨(θ^(t))=M1m=1M¨(m)(θ^(t)),formulae-sequence˙superscript^𝜃𝑡superscript𝑀1superscriptsubscript𝑚1𝑀subscript˙𝑚superscript^𝜃𝑡and¨superscript^𝜃𝑡superscript𝑀1superscriptsubscript𝑚1𝑀subscript¨𝑚superscript^𝜃𝑡\dot{\mathcal{L}}\big{(}\widehat{\theta}^{(t)}\big{)}=M^{-1}\sum_{m=1}^{M}\dot% {\mathcal{L}}_{(m)}\big{(}\widehat{\theta}^{(t)}\big{)}\quad\textup{and}\quad% \ddot{\mathcal{L}}\big{(}\widehat{\theta}^{(t)}\big{)}=M^{-1}\sum_{m=1}^{M}% \ddot{\mathcal{L}}_{(m)}\big{(}\widehat{\theta}^{(t)}\big{)},over˙ start_ARG caligraphic_L end_ARG ( over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT ) = italic_M start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_m = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT over˙ start_ARG caligraphic_L end_ARG start_POSTSUBSCRIPT ( italic_m ) end_POSTSUBSCRIPT ( over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT ) and over¨ start_ARG caligraphic_L end_ARG ( over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT ) = italic_M start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_m = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT over¨ start_ARG caligraphic_L end_ARG start_POSTSUBSCRIPT ( italic_m ) end_POSTSUBSCRIPT ( over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT ) ,

where ˙(m)(θ^(t))=i𝒮(m)˙(Xi,Yi;θ^(t))subscript˙𝑚superscript^𝜃𝑡subscript𝑖subscript𝒮𝑚˙subscript𝑋𝑖subscript𝑌𝑖superscript^𝜃𝑡\dot{\mathcal{L}}_{(m)}(\widehat{\theta}^{(t)})=\sum_{i\in\mathcal{S}_{(m)}}% \dot{\ell}(X_{i},Y_{i};\widehat{\theta}^{(t)})over˙ start_ARG caligraphic_L end_ARG start_POSTSUBSCRIPT ( italic_m ) end_POSTSUBSCRIPT ( over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT ) = ∑ start_POSTSUBSCRIPT italic_i ∈ caligraphic_S start_POSTSUBSCRIPT ( italic_m ) end_POSTSUBSCRIPT end_POSTSUBSCRIPT over˙ start_ARG roman_ℓ end_ARG ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ; over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT ) and ¨(m)(θ^(t))=i𝒮(m)¨(Xi,Yi;θ^(t))subscript¨𝑚superscript^𝜃𝑡subscript𝑖subscript𝒮𝑚¨subscript𝑋𝑖subscript𝑌𝑖superscript^𝜃𝑡\ddot{\mathcal{L}}_{(m)}(\widehat{\theta}^{(t)})=\sum_{i\in\mathcal{S}_{(m)}}% \ddot{\ell}(X_{i},Y_{i};\widehat{\theta}^{(t)})over¨ start_ARG caligraphic_L end_ARG start_POSTSUBSCRIPT ( italic_m ) end_POSTSUBSCRIPT ( over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT ) = ∑ start_POSTSUBSCRIPT italic_i ∈ caligraphic_S start_POSTSUBSCRIPT ( italic_m ) end_POSTSUBSCRIPT end_POSTSUBSCRIPT over¨ start_ARG roman_ℓ end_ARG ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ; over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT ). Note that ˙(m)(θ^(t))subscript˙𝑚superscript^𝜃𝑡\dot{\mathcal{L}}_{(m)}(\widehat{\theta}^{(t)})over˙ start_ARG caligraphic_L end_ARG start_POSTSUBSCRIPT ( italic_m ) end_POSTSUBSCRIPT ( over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT ) and ¨(m)(θ^(t))subscript¨𝑚superscript^𝜃𝑡\ddot{\mathcal{L}}_{(m)}(\widehat{\theta}^{(t)})over¨ start_ARG caligraphic_L end_ARG start_POSTSUBSCRIPT ( italic_m ) end_POSTSUBSCRIPT ( over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT ) are computed on the m𝑚mitalic_m-th local computer. They are then transferred to the central computer to update θ^(t+1)superscript^𝜃𝑡1\widehat{\theta}^{(t+1)}over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_t + 1 ) end_POSTSUPERSCRIPT according to (2.1). As one can see, this is a solution easy to implement but suffers several serious limitations. First, inverting the (p×p)𝑝𝑝(p\times p)( italic_p × italic_p )-dimensional Hessian matrix ¨(θ^(t))¨superscript^𝜃𝑡\ddot{\mathcal{L}}(\widehat{\theta}^{(t)})over¨ start_ARG caligraphic_L end_ARG ( over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT ) in the central computer incurs a computation cost with the order O(p3)𝑂superscript𝑝3O(p^{3})italic_O ( italic_p start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT ) for each iteration. Second, transferring the local Hessian matrix ¨(m)(θ^(t))subscript¨𝑚superscript^𝜃𝑡\ddot{\mathcal{L}}_{(m)}(\widehat{\theta}^{(t)})over¨ start_ARG caligraphic_L end_ARG start_POSTSUBSCRIPT ( italic_m ) end_POSTSUBSCRIPT ( over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT ) from each local computer to the central computer incurs a communication cost of order O(p2)𝑂superscript𝑝2O(p^{2})italic_O ( italic_p start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) for each local computer in each iteration. Thus, this approach leads to high computation and communication costs for high-dimensional data.

To address this issue, various communication-efficient Newton-type methods have been proposed to alleviate high communication costs. One of the underlying key ideas is to avoid Hessian matrix transmission (Shamir et al.,, 2014; Zhang and Lin,, 2015; Wang et al.,, 2017; Wang et al., 2018b, ; Crane and Roosta,, 2019; Jordan et al.,, 2019; Luo and Song,, 2020). For example, the entire sample Hessian matrix can be approximated using some appropriate local estimators, which are computed on one single computer (e.g., the central computer). Consequently, the communication cost due to transferring the whole sample Hessian matrix between computers can be avoided. One notable work in this regard is Jordan et al., (2019). Motivated from the Taylor series expansion of (θ)𝜃\mathcal{L}(\theta)caligraphic_L ( italic_θ ), Jordan et al., (2019) defined a surrogate loss function as ~(θ)=(1)(θ)θ{˙(1)(θ¯)˙(θ¯)}~𝜃subscript1𝜃superscript𝜃topsubscript˙1¯𝜃˙¯𝜃\widetilde{\mathcal{L}}(\theta)=\mathcal{L}_{(1)}(\theta)-\theta^{\top}\left\{% \dot{\mathcal{L}}_{(1)}(\overline{\theta})-\dot{\mathcal{L}}(\overline{\theta}% )\right\}over~ start_ARG caligraphic_L end_ARG ( italic_θ ) = caligraphic_L start_POSTSUBSCRIPT ( 1 ) end_POSTSUBSCRIPT ( italic_θ ) - italic_θ start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT { over˙ start_ARG caligraphic_L end_ARG start_POSTSUBSCRIPT ( 1 ) end_POSTSUBSCRIPT ( over¯ start_ARG italic_θ end_ARG ) - over˙ start_ARG caligraphic_L end_ARG ( over¯ start_ARG italic_θ end_ARG ) }, where θ¯¯𝜃\overline{\theta}over¯ start_ARG italic_θ end_ARG denotes any initial estimator of θ𝜃\thetaitalic_θ. Then, based on the surrogate loss function, the updating formula of Newton’s method is modified as θ^(t+1)=θ^(t){¨(1)(θ^(t))}1˙(θ^(t))superscript^𝜃𝑡1superscript^𝜃𝑡superscriptsubscript¨1superscript^𝜃𝑡1˙superscript^𝜃𝑡\widehat{\theta}^{(t+1)}=\widehat{\theta}^{(t)}-\{\ddot{\mathcal{L}}_{(1)}(% \widehat{\theta}^{(t)})\}^{-1}\dot{\mathcal{L}}(\widehat{\theta}^{(t)})over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_t + 1 ) end_POSTSUPERSCRIPT = over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT - { over¨ start_ARG caligraphic_L end_ARG start_POSTSUBSCRIPT ( 1 ) end_POSTSUBSCRIPT ( over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT ) } start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT over˙ start_ARG caligraphic_L end_ARG ( over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT ), where ¨(1)(θ^(t))subscript¨1superscript^𝜃𝑡\ddot{\mathcal{L}}_{(1)}(\widehat{\theta}^{(t)})over¨ start_ARG caligraphic_L end_ARG start_POSTSUBSCRIPT ( 1 ) end_POSTSUBSCRIPT ( over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT ) is the local Hessian matrix computed on the 1st local computer. Thus no Hessian matrix communication is needed for each iteration.

The implementation of communication-efficient Newton-type methods significantly reduces communication costs. However, when dealing with high-dimensional data, computing the inverse of the Hessian matrix remains to be a computationally expensive problem. To further improve computation efficiency, various methods avoiding matrix inverse calculation have been proposed. The first type is distributed (stochastic) gradient descent algorithms (Goyal et al.,, 2017; Lin and Zhou,, 2018; Qu and Li,, 2019; Su and Xu,, 2019; Li et al.,, 2022; Chen et al., 2022b, ), which compute only the 1st order derivatives of the loss function (i.e., gradients). To be specific, in distributed gradient descent methods, each local computer first receives the current parameter estimator from the central computer. Once the current parameter estimator is received, each local computer calculates its own gradient and sends it back to the central computer. Lastly, the central computer aggregates the local gradients and updates the parameter estimator as

θ^(t+1)=θ^(t)αtM1m=1M˙(m)(θ^(t)),superscript^𝜃𝑡1superscript^𝜃𝑡subscript𝛼𝑡superscript𝑀1superscriptsubscript𝑚1𝑀subscript˙𝑚superscript^𝜃𝑡\widehat{\theta}^{(t+1)}=\widehat{\theta}^{(t)}-\alpha_{t}M^{-1}\sum_{m=1}^{M}% \dot{\mathcal{L}}_{(m)}(\widehat{\theta}^{(t)}),over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_t + 1 ) end_POSTSUPERSCRIPT = over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT - italic_α start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT italic_M start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_m = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT over˙ start_ARG caligraphic_L end_ARG start_POSTSUBSCRIPT ( italic_m ) end_POSTSUBSCRIPT ( over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT ) , (2.2)

where αt>0subscript𝛼𝑡0\alpha_{t}>0italic_α start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT > 0 represents the learning rate. To further reduce the communication cost in (2.2), a local (stochastic) gradient descent algorithm is proposed (Stich,, 2019; Woodworth et al.,, 2020). The key idea is to run the (stochastic) gradient descent algorithm independently and locally on different local computers in a fully parallel way. Subsequently, the local estimators are transferred to the central computer and then iteratively updated to form the final estimator.

However, for those distributed gradient descent algorithms, a large number of iterations are typically required for numerical convergence, and the choice of hyperparameters (e.g., αtsubscript𝛼𝑡\alpha_{t}italic_α start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT) is critically important and also difficult (Zhu et al., 2021b, ). To address this problem, various quasi-Newton methods in a distributed manner have been developed (Chen et al.,, 2014; Eisen et al.,, 2017; Lee et al.,, 2018; Soori et al.,, 2020; Wu et al., 2023b, ). The key idea of distributed quasi-Newton methods is to approximate the Hessian inverse in each iteration without actually inverting the matrix (Davidon,, 1991; Goldfarb,, 1970). The communication cost of these methods could have orders as low as O(p)𝑂𝑝O(p)italic_O ( italic_p ) in each iteration. In the meanwhile, the convergence rate of distributed quasi-Newton methods is superlinear, surpassing the linear convergence of distributed gradient descent methods (Broyden et al.,, 1973).

As an important method along this direction, Wu et al., 2023b developed a K𝐾Kitalic_K-stage distributed quasi-Newton method. Specifically, Wu et al., 2023b started with the following one-stage distributed quasi-Newton method

θ^stage,1=θ^stage,0M1m=1M{H(m,0)˙(θ^stage,0)}.subscript^𝜃stage1subscript^𝜃stage0superscript𝑀1superscriptsubscript𝑚1𝑀subscript𝐻𝑚0˙subscript^𝜃stage0\widehat{\theta}_{\operatorname{stage,1}}=\widehat{\theta}_{\operatorname{% stage,0}}-M^{-1}\sum_{m=1}^{M}\Big{\{}H_{(m,0)}\dot{\mathcal{L}}(\widehat{% \theta}_{\operatorname{stage,0}})\Big{\}}.over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT roman_stage , 1 end_POSTSUBSCRIPT = over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT roman_stage , 0 end_POSTSUBSCRIPT - italic_M start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_m = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT { italic_H start_POSTSUBSCRIPT ( italic_m , 0 ) end_POSTSUBSCRIPT over˙ start_ARG caligraphic_L end_ARG ( over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT roman_stage , 0 end_POSTSUBSCRIPT ) } . (2.3)

Here θ^stage,0subscript^𝜃stage0\widehat{\theta}_{\operatorname{stage,0}}over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT roman_stage , 0 end_POSTSUBSCRIPT is a suitable initial estimator, such as the OS estimator θ^ossubscript^𝜃os\widehat{\theta}_{\operatorname{os}}over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT roman_os end_POSTSUBSCRIPT. H(m,0)subscript𝐻𝑚0H_{(m,0)}italic_H start_POSTSUBSCRIPT ( italic_m , 0 ) end_POSTSUBSCRIPT represents the local inverse Hessian estimator obtained by each local computer after implementing the quasi-Newton algorithm. Next, the local computer sends H(m,0)˙(θ^stage,0)subscript𝐻𝑚0˙subscript^𝜃stage0H_{(m,0)}\dot{\mathcal{L}}(\widehat{\theta}_{\operatorname{stage,0}})italic_H start_POSTSUBSCRIPT ( italic_m , 0 ) end_POSTSUBSCRIPT over˙ start_ARG caligraphic_L end_ARG ( over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT roman_stage , 0 end_POSTSUBSCRIPT ) as a whole to the central computer. Therefore, the communication cost is of the order O(p)𝑂𝑝O(p)italic_O ( italic_p ). Wu et al., 2023b has verified that the optimal statistical efficiency can be achieved by the one-stage distributed quasi-Newton estimator as long as N(logp)4/n40𝑁superscript𝑝4superscript𝑛40N(\log p)^{4}/n^{4}\to 0italic_N ( roman_log italic_p ) start_POSTSUPERSCRIPT 4 end_POSTSUPERSCRIPT / italic_n start_POSTSUPERSCRIPT 4 end_POSTSUPERSCRIPT → 0. This condition can be further relaxed by allowing for multi-stage updating. This leads to the K𝐾Kitalic_K-stage distributed quasi-Newton estimator, which is extremely efficient both computationally and communicationally.

2.4. Distributed Quantile Regression

Quantile regression is an important class of regression methods for its robustness against heavy-tailed distributed responses and outliers (Koenker and Bassett,, 1978; Koenker,, 2005). Its applications span various disciplines, including agriculture (Kostov and Davidova,, 2013), climate change (Reich et al.,, 2012), health studies (Alhamzawi and Ali,, 2018), house pricing (Chen et al.,, 2013), and others (Xu et al.,, 2017; Zhong et al.,, 2022). With the availability of large-scale datasets, extensive research has been dedicated to distributed estimation and inference for quantile regression. For instance, Yang et al., (2013) proposed a subspace preserving sampling technique for quantile regression on massive datasets. However, their method suffers from the problem of statistical inefficiency. Later, Xu et al., (2020) developed a block average approach, employing the one-shot strategy by averaging estimators derived from each local computer.

To guarantee statistical efficiency, researchers address the challenges of distributed quantile regression by proposing various loss functions and iterative algorithms. For instance, Volgushev et al., (2019) proposed a two-step quantile projection algorithm, incorporating valid statistical inference. In the initial step, conditional quantile functions are estimated at different levels. Subsequently, a quantile regression process is constructed through projection. Chen et al., (2019) presented a computationally efficient method, which involves multiple rounds of aggregations. After limited q𝑞qitalic_q iterations, the authors show that the statistical efficiency of the final estimator becomes the same as the one computed on the whole data. In a related work, Chen et al., 2020b studied the linear regression problem with heavy-tailed noises. Since the quantile regression loss function is a non-smooth function, the authors established a connection between quantile regression and ordinary linear regression by transforming the response. This results in a distributed estimator that is efficient in both computation and communication. Instead of dealing with conventional smoothing functions, Hu et al., (2021) extended the communication-efficient surrogate likelihood method proposed by Jordan et al., (2019). The authors constructed a surrogate loss function and established the consistency and asymptotic normality of the proposed methodology. In a recent development, Tan et al., (2022) developed a double-smoothing approach to the local and global objective functions of quantile regression.

In a recent work of Pan et al., (2022), the authors proposed a one-step approach that is efficient both communicationally and statistically. Notably, the derived estimator is robust against data distribution heterogeneity across local computers. Specifically, assume there are N𝑁Nitalic_N observations indexed by i=1,,N𝑖1𝑁i=1,\dots,Nitalic_i = 1 , … , italic_N. The response Yisubscript𝑌𝑖Y_{i}italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and the p𝑝pitalic_p-dimensional predictor Xisubscript𝑋𝑖X_{i}italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT follow the standard τ𝜏\tauitalic_τ-th quantile regression model Yi=Xiβτ+εisubscript𝑌𝑖superscriptsubscript𝑋𝑖topsubscript𝛽𝜏subscript𝜀𝑖Y_{i}=X_{i}^{\top}\beta_{\tau}+\varepsilon_{i}italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_β start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT + italic_ε start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, where βτsubscript𝛽𝜏\beta_{\tau}italic_β start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT is the associated regression coefficient vector and τ(0,1)𝜏01\tau\in(0,1)italic_τ ∈ ( 0 , 1 ). Additionally, εisubscript𝜀𝑖\varepsilon_{i}italic_ε start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is the error term satisfying P(εi0|Xi)=τ𝑃subscript𝜀𝑖conditional0subscript𝑋𝑖𝜏P(\varepsilon_{i}\leq 0|X_{i})=\tauitalic_P ( italic_ε start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ≤ 0 | italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) = italic_τ. The standard check loss function can be constructed as (𝜷)=i=1Nρτ(YiXi𝜷)𝜷superscriptsubscript𝑖1𝑁subscript𝜌𝜏subscript𝑌𝑖superscriptsubscript𝑋𝑖top𝜷\mathcal{L}(\bm{\beta})=\sum_{i=1}^{N}\rho_{\tau}(Y_{i}-X_{i}^{\top}\bm{\beta})caligraphic_L ( bold_italic_β ) = ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_ρ start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT ( italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_italic_β ), where ρτ(μ)=μ{τI(μ0)}subscript𝜌𝜏𝜇𝜇𝜏𝐼𝜇0\rho_{\tau}(\mu)=\mu\{\tau-I(\mu\leq 0)\}italic_ρ start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT ( italic_μ ) = italic_μ { italic_τ - italic_I ( italic_μ ≤ 0 ) } represents the check function and I()𝐼I(\cdot)italic_I ( ⋅ ) the indicator function. Consequently, the standard estimator of βτsubscript𝛽𝜏\beta_{\tau}italic_β start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT can be obtained by β^τ=argminβ(β)subscript^𝛽𝜏subscript𝛽𝛽\widehat{\beta}_{\tau}=\mathop{\arg\min}_{\beta}\mathcal{L}(\beta)over^ start_ARG italic_β end_ARG start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT = start_BIGOP roman_arg roman_min end_BIGOP start_POSTSUBSCRIPT italic_β end_POSTSUBSCRIPT caligraphic_L ( italic_β ). Assume that a pilot sample (i.e., indexed by 𝒬𝒬\mathcal{Q}caligraphic_Q) is derived across K𝐾Kitalic_K local computers. The sample size of the pilot sample is n𝑛nitalic_n satisfying n/N0𝑛𝑁0n/N\rightarrow 0italic_n / italic_N → 0. As a result, a pilot estimator can be obtained by β^τ𝒬=argminβi𝒬ρτ(YiXiβ)subscriptsuperscript^𝛽𝒬𝜏subscript𝛽subscript𝑖𝒬subscript𝜌𝜏subscript𝑌𝑖superscriptsubscript𝑋𝑖top𝛽\widehat{\beta}^{\mathcal{Q}}_{\tau}=\mathop{\arg\min}_{\beta}\sum_{i\in% \mathcal{Q}}\rho_{\tau}(Y_{i}-X_{i}^{\top}\beta)over^ start_ARG italic_β end_ARG start_POSTSUPERSCRIPT caligraphic_Q end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT = start_BIGOP roman_arg roman_min end_BIGOP start_POSTSUBSCRIPT italic_β end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT italic_i ∈ caligraphic_Q end_POSTSUBSCRIPT italic_ρ start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT ( italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_β ), which is n𝑛\sqrt{n}square-root start_ARG italic_n end_ARG-consistent. The one-step estimator proposed by Pan et al., (2022) is derived as

β^τ(1)=β^τ𝒬+1f^(0)(i=1NXiXi)1[i=1NXi{τI(ε^i0)}],subscriptsuperscript^𝛽1𝜏subscriptsuperscript^𝛽𝒬𝜏1^𝑓0superscriptsuperscriptsubscript𝑖1𝑁subscript𝑋𝑖superscriptsubscript𝑋𝑖top1delimited-[]superscriptsubscript𝑖1𝑁subscript𝑋𝑖𝜏𝐼subscript^𝜀𝑖0\widehat{\beta}^{(1)}_{\tau}=\widehat{\beta}^{\mathcal{Q}}_{\tau}+\frac{1}{% \widehat{f}(0)}\left(\sum_{i=1}^{N}X_{i}X_{i}^{\top}\right)^{-1}\left[\sum_{i=% 1}^{N}X_{i}\Big{\{}\tau-I(\widehat{\varepsilon}_{i}\leq 0)\Big{\}}\right],over^ start_ARG italic_β end_ARG start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT = over^ start_ARG italic_β end_ARG start_POSTSUPERSCRIPT caligraphic_Q end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT + divide start_ARG 1 end_ARG start_ARG over^ start_ARG italic_f end_ARG ( 0 ) end_ARG ( ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT [ ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT { italic_τ - italic_I ( over^ start_ARG italic_ε end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ≤ 0 ) } ] , (2.4)

where f^()^𝑓\widehat{f}(\cdot)over^ start_ARG italic_f end_ARG ( ⋅ ) is a kernel density estimator and ε^isubscript^𝜀𝑖\widehat{\varepsilon}_{i}over^ start_ARG italic_ε end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is the residual. It can be proved that β^τ(1)subscriptsuperscript^𝛽1𝜏\widehat{\beta}^{(1)}_{\tau}over^ start_ARG italic_β end_ARG start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT is N𝑁\sqrt{N}square-root start_ARG italic_N end_ARG-consistent and asymptotically normal, regardless of how the raw data are distributed across local computers.

2.5. Distributed Logistic Regression with Rare Events Data

The rare events problem in this subsection refers to a binary data classification problem, where one class (often assumed to be the negative class) has a much greater number of instances than the other class (often assumed to be the positive class). Rare events data are prevalent in scientific fields and applications. The rare events data examples include but are not limited to drug discovery (Zhu et al.,, 2006; Korkmaz,, 2020), software defects (Richardson and Lidbury,, 2013), and rare disease diagnosis (Zhao et al.,, 2018; Zhuang et al.,, 2019). In traditional statistical theory, one often assumes that the probability of any type of event to happen is fixed. However, for rare events, it is more appropriate to assume that the positive class probability should decay towards zero at an appropriate rate as the total sample size increases (Wang,, 2020). In this regard, Wang, (2020) constructed a novel theoretical framework to accurately describe rare events data. Under this theoretical framework, it was demonstrated that the convergence rate of the global maximum likelihood estimator (MLE) is mainly determined by the sample size of the positive class instead of the total sample size. This implies a considerably slower convergence rate than that of the classical cases. It seems that limited attempts have been made for the rigorous asymptotic theory for distributed classification problems with rare event data. This motivates Li et al., (2023) to develop a novel distributed logistic regression method with solid statistical theory support for massive rare event data.

More specifically, assume there are a total of N𝑁Nitalic_N observations indexed by 1iN1𝑖𝑁1\leq i\leq N1 ≤ italic_i ≤ italic_N. The i𝑖iitalic_i-th observation is denoted as (Xi,Yi)subscript𝑋𝑖subscript𝑌𝑖\big{(}X_{i},Y_{i}\big{)}( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ), where Xipsubscript𝑋𝑖superscript𝑝X_{i}\in\mathbb{R}^{p}italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_p end_POSTSUPERSCRIPT is a p𝑝pitalic_p-dimensional covariate and Yi{0,1}subscript𝑌𝑖01Y_{i}\in\{0,1\}italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ { 0 , 1 } is the binary response. Let N1=i=1NYisubscript𝑁1superscriptsubscript𝑖1𝑁subscript𝑌𝑖N_{1}=\sum_{i=1}^{N}Y_{i}italic_N start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT be the total number of positive instances. To model their regression relationship, the following logistic regression model is considered

P(Yi=1Xi)=pi(α,β)=exp(α+Xiβ)1+exp(α+Xiβ),𝑃subscript𝑌𝑖conditional1subscript𝑋𝑖subscript𝑝𝑖𝛼𝛽𝛼superscriptsubscript𝑋𝑖top𝛽1𝛼superscriptsubscript𝑋𝑖top𝛽\displaystyle P\big{(}Y_{i}=1\mid X_{i}\big{)}=p_{i}(\alpha,\beta)=\frac{\exp(% \alpha+X_{i}^{\top}\beta)}{1+\exp(\alpha+X_{i}^{\top}\beta)},italic_P ( italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = 1 ∣ italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) = italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_α , italic_β ) = divide start_ARG roman_exp ( italic_α + italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_β ) end_ARG start_ARG 1 + roman_exp ( italic_α + italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_β ) end_ARG ,

where α𝛼\alpha\in\mathbb{R}italic_α ∈ blackboard_R is the intercept and βp𝛽superscript𝑝\beta\in\mathbb{R}^{p}italic_β ∈ blackboard_R start_POSTSUPERSCRIPT italic_p end_POSTSUPERSCRIPT is the slope parameter. To reflect the asymptotic behavior, two important assumptions should be imposed (Wang,, 2020). First, the percentage of positive instances should be extremely small. Statistically, the positive response probability is specified to converge towards 0 as the total sample size N𝑁N\to\inftyitalic_N → ∞. Rewrite α𝛼\alphaitalic_α as αNsubscript𝛼𝑁\alpha_{N}italic_α start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT. This leads to αNsubscript𝛼𝑁\alpha_{N}\to-\inftyitalic_α start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT → - ∞ as N𝑁N\to\inftyitalic_N → ∞. Second, the total number of positive instances should diverge to infinity. Otherwise, the parameters of interest cannot be estimated consistently. Mathematically, it follows that E(N1)Nexp(αN)E{exp(Xiβ)}𝐸subscript𝑁1𝑁subscript𝛼𝑁𝐸superscriptsubscript𝑋𝑖top𝛽E(N_{1})\approx N\exp(\alpha_{N})E\{\exp(X_{i}^{\top}\beta)\}italic_E ( italic_N start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) ≈ italic_N roman_exp ( italic_α start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT ) italic_E { roman_exp ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_β ) } as N𝑁N\to\inftyitalic_N → ∞. This suggests that αNsubscript𝛼𝑁\alpha_{N}\to-\inftyitalic_α start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT → - ∞ and αN+logNsubscript𝛼𝑁𝑁\alpha_{N}+\log N\to\inftyitalic_α start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT + roman_log italic_N → ∞ as N𝑁N\to\inftyitalic_N → ∞ (Wang,, 2020; Li et al.,, 2023).

Assume a distributed computation system with one central computer and a total of K𝐾Kitalic_K local computers indexed by 1kK1𝑘𝐾1\leq k\leq K1 ≤ italic_k ≤ italic_K. Li et al., (2023) first discussed two different data distribution strategies. They are RANDOM and COPY strategies, respectively. Specifically, the RANDOM strategy is used to randomly distribute the full data to each local computer with approximately equal sizes. For the COPY strategy, all the positive instances are copied to every local computer. In contrast, the negative instances are randomly distributed on different local computers. Next, Li et al., (2023) studied three types of objective functions. They are, respectively,

R,k(θ)subscriptR𝑘𝜃\displaystyle\mathcal{L}_{{\rm R},k}\big{(}\theta\big{)}caligraphic_L start_POSTSUBSCRIPT roman_R , italic_k end_POSTSUBSCRIPT ( italic_θ ) =\displaystyle== i=1Nai(k){Yilogpi(αN,β)+(1Yi)log(1pi(αN,β))},superscriptsubscript𝑖1𝑁superscriptsubscript𝑎𝑖𝑘subscript𝑌𝑖subscript𝑝𝑖subscript𝛼𝑁𝛽1subscript𝑌𝑖1subscript𝑝𝑖subscript𝛼𝑁𝛽\displaystyle\sum_{i=1}^{N}a_{i}^{(k)}\Big{\{}Y_{i}\log p_{i}\big{(}\alpha_{N}% ,\beta\big{)}+\big{(}1-Y_{i}\big{)}\log\big{(}1-p_{i}(\alpha_{N},\beta)\big{)}% \Big{\}},∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT { italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT roman_log italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_α start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT , italic_β ) + ( 1 - italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) roman_log ( 1 - italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_α start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT , italic_β ) ) } ,
US,k(θ)subscriptUS𝑘𝜃\displaystyle\mathcal{L}_{{\rm US},k}(\theta)caligraphic_L start_POSTSUBSCRIPT roman_US , italic_k end_POSTSUBSCRIPT ( italic_θ ) =\displaystyle== i=1N{Yilogpi(αN,β)+(1Yi)ai(k)log(1pi(αN,β))},superscriptsubscript𝑖1𝑁subscript𝑌𝑖subscript𝑝𝑖subscript𝛼𝑁𝛽1subscript𝑌𝑖superscriptsubscript𝑎𝑖𝑘1subscript𝑝𝑖subscript𝛼𝑁𝛽\displaystyle\sum_{i=1}^{N}\Big{\{}Y_{i}\log p_{i}\big{(}\alpha_{N},\beta\big{% )}+\big{(}1-Y_{i}\big{)}a_{i}^{(k)}\log\big{(}1-p_{i}(\alpha_{N},\beta)\big{)}% \Big{\}},∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT { italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT roman_log italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_α start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT , italic_β ) + ( 1 - italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT roman_log ( 1 - italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_α start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT , italic_β ) ) } ,
IPW,k(θ)subscriptIPW𝑘𝜃\displaystyle\mathcal{L}_{{\rm IPW},k}(\theta)caligraphic_L start_POSTSUBSCRIPT roman_IPW , italic_k end_POSTSUBSCRIPT ( italic_θ ) =\displaystyle== i=1N{Yilogpi(αN,β)+K(1Yi)ai(k)log(1pi(αN,β))}.superscriptsubscript𝑖1𝑁subscript𝑌𝑖subscript𝑝𝑖subscript𝛼𝑁𝛽𝐾1subscript𝑌𝑖superscriptsubscript𝑎𝑖𝑘1subscript𝑝𝑖subscript𝛼𝑁𝛽\displaystyle\sum_{i=1}^{N}\Big{\{}Y_{i}\log p_{i}\big{(}\alpha_{N},\beta\big{% )}+K\big{(}1-Y_{i}\big{)}a_{i}^{(k)}\log\big{(}1-p_{i}(\alpha_{N},\beta)\big{)% }\Big{\}}.∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT { italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT roman_log italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_α start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT , italic_β ) + italic_K ( 1 - italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT roman_log ( 1 - italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_α start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT , italic_β ) ) } .

Here P(ai(k)=1)=1/K𝑃superscriptsubscript𝑎𝑖𝑘11𝐾P(a_{i}^{(k)}=1)=1/Kitalic_P ( italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT = 1 ) = 1 / italic_K and ai(k)=1superscriptsubscript𝑎𝑖𝑘1a_{i}^{(k)}=1italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT = 1 if the i𝑖iitalic_i-th observation is randomly distributed to k𝑘kitalic_k-th local computer. Local estimators can be computed based on the local data according to different loss functions. The local estimators are then averaged by the central computer in the last step (Zhang et al.,, 2012; Chang et al.,, 2017). This leads to three distributed estimators (i.e., θ^RMLEsubscript^𝜃RMLE\widehat{\theta}_{\rm RMLE}over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT roman_RMLE end_POSTSUBSCRIPT, θ^USsubscript^𝜃US\widehat{\theta}_{\rm US}over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT roman_US end_POSTSUBSCRIPT and θ^IPWsubscript^𝜃IPW\widehat{\theta}_{\rm IPW}over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT roman_IPW end_POSTSUBSCRIPT). Under some regularity conditions, Li et al., (2023) found that the COPY strategy together with the inverse probability weighted objective function IPW,k(θ)subscriptIPW𝑘𝜃\mathcal{L}_{{\rm IPW},k}(\theta)caligraphic_L start_POSTSUBSCRIPT roman_IPW , italic_k end_POSTSUBSCRIPT ( italic_θ ) seems to be the best choice.

2.6. Decentralized Distributed Computing

The distributed computing methods outlined earlier share a common characteristic. That is the requirement of a central computer, which is responsible for communicating with every local computer. Such type of architecture is easy to implement. However, it suffers from several serious limitations. First, this centralized network structure is extremely fragile. If the central computer stops working, the entire network stops. Second, there exists the issue of privacy disclosure for the centralized structure. This is because if the central machine is attacked, the attacker is given the chance to communicate with every local computer. Third, a centralized network structure has a high requirement for network bandwidth, since the central machine should communicate with numerous local computers (Bellet et al.,, 2018; Li et al.,, 2021).

To fix these problems, a number of researchers advocate the idea of fully decentralized distributed computing, which is also called decentralized federated learning (DFL) (Yuan et al.,, 2016). The key feature of DFL is that there is no central computer involved for model training and communication. Different local computers are directly connected through a sophisticated communication network. All computation-related communications should occur only between network-connected individual local computers. To be specific, define θ^(t,m)superscript^𝜃𝑡𝑚\widehat{\theta}^{(t,m)}over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_t , italic_m ) end_POSTSUPERSCRIPT to be the t𝑡titalic_t-th estimator obtained on the m𝑚mitalic_m-th local computer, and the updating formula of the decentralized federated learning algorithm is given by

θ^(t+1,m)=θ~(t,m)α˙(m)(θ~(t,m)).superscript^𝜃𝑡1𝑚superscript~𝜃𝑡𝑚𝛼subscript˙𝑚superscript~𝜃𝑡𝑚\widehat{\theta}^{(t+1,m)}=\widetilde{\theta}^{(t,m)}-\alpha\dot{{\mathcal{L}}% }_{(m)}\left(\widetilde{\theta}^{(t,m)}\right).over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_t + 1 , italic_m ) end_POSTSUPERSCRIPT = over~ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_t , italic_m ) end_POSTSUPERSCRIPT - italic_α over˙ start_ARG caligraphic_L end_ARG start_POSTSUBSCRIPT ( italic_m ) end_POSTSUBSCRIPT ( over~ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_t , italic_m ) end_POSTSUPERSCRIPT ) . (2.5)

Here, θ~(t,m)superscript~𝜃𝑡𝑚\widetilde{\theta}^{(t,m)}over~ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_t , italic_m ) end_POSTSUPERSCRIPT is the neighborhood-averaged estimator obtained in the t𝑡titalic_t-th iteration for the m𝑚mitalic_m-th local computer. Numerous studies have investigated the numerical convergence properties of the method (Blot et al.,, 2016; Nedic et al.,, 2017; Lian et al.,, 2017; Vanhaesebrouck et al.,, 2017; Tang et al.,, 2018; Lalitha et al.,, 2018). It has been demonstrated that the algorithm can achieve a linear convergence rate even with data heterogeneity (Richards et al.,, 2020; Savazzi et al.,, 2020). To achieve this nice theoretical property, several stringent conditions have been assumed about the network structure in the past literature. The most typical assumption is that the transition matrix determined by the network structure should be doubly stochastic (Yuan et al.,, 2016; Tang et al.,, 2018). Unfortunately, this is an assumption that can hardly be satisfied in real practice.

To relax this stringent condition, Wu et al., 2023a developed a novel methodology for DFL, which only requires the network structure to be weakly balanced. To be specific, take a linear regression as an example. Assume there are a total of M𝑀Mitalic_M local computers, which are connected by a communication network. The adjacency matrix of the network is defined as A=(aij)M×M𝐴subscript𝑎𝑖𝑗superscript𝑀𝑀A=(a_{ij})\in\mathbb{R}^{M\times M}italic_A = ( italic_a start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT ) ∈ blackboard_R start_POSTSUPERSCRIPT italic_M × italic_M end_POSTSUPERSCRIPT, and the corresponding weighting matrix is defined as W=(wij)M×M𝑊subscript𝑤𝑖𝑗superscript𝑀𝑀W=(w_{ij})\in\mathbb{R}^{M\times M}italic_W = ( italic_w start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT ) ∈ blackboard_R start_POSTSUPERSCRIPT italic_M × italic_M end_POSTSUPERSCRIPT with wij=aij/dmsubscript𝑤𝑖𝑗subscript𝑎𝑖𝑗subscript𝑑𝑚w_{ij}=a_{ij}/d_{m}italic_w start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT = italic_a start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT / italic_d start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT, where dm=jaijsubscript𝑑𝑚subscript𝑗subscript𝑎𝑖𝑗d_{m}=\sum_{j}a_{ij}italic_d start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT = ∑ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT italic_a start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT represents the in-degree of A𝐴Aitalic_A. Then algorithm 2.5 could be rewritten as

θ^(t+1,m)=(IpαΣ^xx(m))(k=1Mwmkθ^(t,k))+αΣ^xy(m),superscript^𝜃𝑡1𝑚subscript𝐼𝑝𝛼superscriptsubscript^Σ𝑥𝑥𝑚superscriptsubscript𝑘1𝑀subscript𝑤𝑚𝑘superscript^𝜃𝑡𝑘𝛼superscriptsubscript^Σ𝑥𝑦𝑚\widehat{\theta}^{(t+1,m)}=\left(I_{p}-\alpha\widehat{\Sigma}_{xx}^{(m)}\right% )\Big{(}\sum_{k=1}^{M}w_{mk}\widehat{\theta}^{(t,k)}\Big{)}+\alpha\widehat{% \Sigma}_{xy}^{(m)},over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_t + 1 , italic_m ) end_POSTSUPERSCRIPT = ( italic_I start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT - italic_α over^ start_ARG roman_Σ end_ARG start_POSTSUBSCRIPT italic_x italic_x end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_m ) end_POSTSUPERSCRIPT ) ( ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT italic_w start_POSTSUBSCRIPT italic_m italic_k end_POSTSUBSCRIPT over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_t , italic_k ) end_POSTSUPERSCRIPT ) + italic_α over^ start_ARG roman_Σ end_ARG start_POSTSUBSCRIPT italic_x italic_y end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_m ) end_POSTSUPERSCRIPT , (2.6)

where Ipp×psubscript𝐼𝑝superscript𝑝𝑝I_{p}\in\mathbb{R}^{p\times p}italic_I start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_p × italic_p end_POSTSUPERSCRIPT is an identity matrix, Σ^xx(m)=i𝒮(m)XiXi/np×psuperscriptsubscript^Σ𝑥𝑥𝑚subscript𝑖subscript𝒮𝑚subscript𝑋𝑖superscriptsubscript𝑋𝑖top𝑛superscript𝑝𝑝\widehat{\Sigma}_{xx}^{(m)}=\sum_{i\in\mathcal{S}_{(m)}}X_{i}X_{i}^{\top}/n\in% \mathbb{R}^{p\times p}over^ start_ARG roman_Σ end_ARG start_POSTSUBSCRIPT italic_x italic_x end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_m ) end_POSTSUPERSCRIPT = ∑ start_POSTSUBSCRIPT italic_i ∈ caligraphic_S start_POSTSUBSCRIPT ( italic_m ) end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT / italic_n ∈ blackboard_R start_POSTSUPERSCRIPT italic_p × italic_p end_POSTSUPERSCRIPT, and Σ^xy(m)=i𝒮(m)XiYi/npsuperscriptsubscript^Σ𝑥𝑦𝑚subscript𝑖subscript𝒮𝑚subscript𝑋𝑖subscript𝑌𝑖𝑛superscript𝑝\widehat{\Sigma}_{xy}^{(m)}=\sum_{i\in\mathcal{S}_{(m)}}X_{i}Y_{i}/n\in\mathbb% {R}^{p}over^ start_ARG roman_Σ end_ARG start_POSTSUBSCRIPT italic_x italic_y end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_m ) end_POSTSUPERSCRIPT = ∑ start_POSTSUBSCRIPT italic_i ∈ caligraphic_S start_POSTSUBSCRIPT ( italic_m ) end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT / italic_n ∈ blackboard_R start_POSTSUPERSCRIPT italic_p end_POSTSUPERSCRIPT. Next, define θ^*(t)=(θ^(t,1),,θ^(t,M))Mp,Σ^xy*=(Σ^xy(1),\widehat{\theta}^{*(t)}=\big{(}\widehat{\theta}^{(t,1)\top},\dots,\widehat{% \theta}^{(t,M)\top}\big{)}^{\top}\in\mathbb{R}^{Mp},\ \widehat{\Sigma}_{xy}^{*% }=\big{(}\widehat{\Sigma}_{xy}^{(1)\top},over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT * ( italic_t ) end_POSTSUPERSCRIPT = ( over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_t , 1 ) ⊤ end_POSTSUPERSCRIPT , … , over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_t , italic_M ) ⊤ end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_M italic_p end_POSTSUPERSCRIPT , over^ start_ARG roman_Σ end_ARG start_POSTSUBSCRIPT italic_x italic_y end_POSTSUBSCRIPT start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT = ( over^ start_ARG roman_Σ end_ARG start_POSTSUBSCRIPT italic_x italic_y end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( 1 ) ⊤ end_POSTSUPERSCRIPT ,,Σ^xy(M))Mp,\dots,\widehat{\Sigma}_{xy}^{(M)\top}\big{)}^{\top}\in\mathbb{R}^{Mp},… , over^ start_ARG roman_Σ end_ARG start_POSTSUBSCRIPT italic_x italic_y end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_M ) ⊤ end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_M italic_p end_POSTSUPERSCRIPT , and Δ*=diag{IpαΣ^xx(1),,IpαΣ^xx(M)}superscriptΔdiagsubscript𝐼𝑝𝛼subscriptsuperscript^Σ1𝑥𝑥subscript𝐼𝑝𝛼subscriptsuperscript^Σ𝑀𝑥𝑥\Delta^{*}=\mbox{diag}\big{\{}I_{p}-\alpha\widehat{\Sigma}^{(1)}_{xx},\dots,I_% {p}-\alpha\widehat{\Sigma}^{(M)}_{xx}\big{\}}roman_Δ start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT = diag { italic_I start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT - italic_α over^ start_ARG roman_Σ end_ARG start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_x italic_x end_POSTSUBSCRIPT , … , italic_I start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT - italic_α over^ start_ARG roman_Σ end_ARG start_POSTSUPERSCRIPT ( italic_M ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_x italic_x end_POSTSUBSCRIPT }. This leads to a matrix form of (2.6) as

θ^*(t+1)=Δ*(WIp)θ^*(t)+αΣ^xy*,superscript^𝜃absent𝑡1superscriptΔtensor-product𝑊subscript𝐼𝑝superscript^𝜃absent𝑡𝛼subscriptsuperscript^Σ𝑥𝑦\widehat{\theta}^{*(t+1)}=\Delta^{*}(W\otimes I_{p})\widehat{\theta}^{*(t)}+% \alpha\widehat{\Sigma}^{*}_{xy},over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT * ( italic_t + 1 ) end_POSTSUPERSCRIPT = roman_Δ start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_W ⊗ italic_I start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT ) over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT * ( italic_t ) end_POSTSUPERSCRIPT + italic_α over^ start_ARG roman_Σ end_ARG start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_x italic_y end_POSTSUBSCRIPT ,

where tensor-product\otimes stands for the Kronecker product. Assume the stable solution of this system (denoted by θ^*superscript^𝜃\widehat{\theta}^{*}over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT) exists, it follows then θ^*=α{IpΔ*(WIp)}1Σ^xy*.superscript^𝜃𝛼superscriptsubscript𝐼𝑝superscriptΔtensor-product𝑊subscript𝐼𝑝1subscriptsuperscript^Σ𝑥𝑦\widehat{\theta}^{*}=\alpha\Big{\{}I_{p}-\Delta^{*}(W\otimes I_{p})\Big{\}}^{-% 1}\widehat{\Sigma}^{*}_{xy}.over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT = italic_α { italic_I start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT - roman_Δ start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_W ⊗ italic_I start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT ) } start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT over^ start_ARG roman_Σ end_ARG start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_x italic_y end_POSTSUBSCRIPT . Theoretically, Wu et al., 2023a proved that the statistical efficiency of the DFL is determined by three factors: (1) the learning rate, (2) the network structure, and (3) the data distribution pattern. The optimal statistical efficiency can be guaranteed if the learning rate is relatively small and the network structure is relatively balanced, even if data are distributed heterogeneously.

There has been extensive research to further extend the classical DFL algorithm from different perspectives. To address the issue of insufficient labels, Gao et al., (2019) developed a heterogeneous horizontal federated learning framework. To enhance network security, Chen et al., 2022a introduced a decentralized federated learning algorithm that meets differential privacy requirements. The key idea is that each client adds random noise to their parameter estimators before communication. The classical DFL algorithm suffers from high communication costs and low convergence rates in non-convex situations. To fix this problem, Nadiradze et al., (2021) proposed an asynchronous decentralized federated learning method. A novel DFL framework was developed by Liu et al., 2022a , which optimally balances communication efficiency and statistical efficiency. Liu et al., 2022b further proposed a decentralized surrogate median regression method for non-smooth sparse problems. For valid statistical inference in DFL, Gu and Chen, (2023) studied communication-efficient M𝑀Mitalic_M-estimation. To achieve optimal efficiency, they proposed a one-step DFL estimation method that allows a relatively large number of clients.

2.7. Distributed Statistical Inference

In addition to estimation, other statistical inference tools, such as hypothesis testing and confidence intervals, also play a crucial role in scientific research and data analysis. These inference tools allow researchers to quantify the uncertainty of the estimators obtained from the data, and then help practitioners interpret the results appropriately (Casella and Berger,, 2002). In the above discussion of this section, various distributed estimation methods have been introduced. Most of them have proven that the distributed estimator can be statistically as efficient as the global estimator under certain conditions (Wang et al.,, 2017; Jordan et al.,, 2019; Volgushev et al.,, 2019; Wang et al., 2019b, ; Zhu et al., 2021a, ; Fan et al.,, 2023; Pan et al.,, 2022). Consequently, the distributed estimators generally share the same asymptotic distribution as the global estimator. This means that if one can consistently estimate the asymptotic covariance matrix, then asymptotically valid statistical inference can be directly conducted. Then the key issue becomes how to estimate the asymptotic covariance matrix consistently and distributedly. To this end, various plug-in approaches have been widely adopted for various models. For example, general M𝑀Mitalic_M-estimation problems with smoothed loss (Jordan et al.,, 2019), quantile regression models (Pan et al.,, 2022), support vector machine (Wang et al., 2019b, ), and various debiased estimators for high-dimensional models (Fan et al.,, 2019; Tu et al.,, 2023).

However, if the asymptotic covariance of an estimator is too complex, it can be very challenging to construct the corresponding estimator analytically. In this case, bootstrap provides a more directed inference approach (Shao and Tu,, 1995; Efron and Stein,, 1981). Nevertheless, bootstrap usually requires multiple resampling procedures over the whole dataset. This is computationally too expensive to be acceptable, especially for massive datasets. To solve this problem, researchers developed some more computationally feasible bootstrap methods. Among them, Kleiner et al., (2014) introduced a method called the bag of little bootstraps (BLB). The BLB method first divides the whole sample into multiple subsets. It then computes multiple repeated estimates of the estimator (or related statistics) based on the inflated resamples. Finally, an averaging step is applied to aggregate these estimates. Taking a similar approach, Sengupta et al., (2016) further proposed the method of subsampled double bootstrap (SDB). Instead of directly dividing the whole sample into the disjoint subsets, the SDB method selects multiple subsets from the whole dataset by sampling with replacement. The computational efficiency and inferential reliability of these subsample-based bootstrap methods depend on various hyperparameters, such as the subsample size and the number of replicates. To address this issue, Ma et al., (2024) developed an interesting approach for selecting the optimal hyperparameters for the subsample-based bootstrap methods, including the BLB and the SDB.

However, the above bootstrap variants generally require that the involved estimator (or statistic) has a weighted subsample representation. This may not be true for some general statistics. For example, the class of symmetric statistics considered in Chen and Peng, (2021), which includes the U𝑈Uitalic_U-statistics as an important example. Chen and Peng, (2021) investigated the theoretical properties of the one-shot averaging type distributed statistics for both the degenerate and non-degenerate cases. For inference purposes, they developed a distributed bootstrap procedure, where no technique such as inflated resampling technique in the BLB method of Kleiner et al., (2014) is required. To further ease the computational burden, Chen and Peng, (2021) proposed a pseudo-distributed bootstrap (PDB) procedure. The consistency of the PDB procedure has also been theoretically proved and numerically validated.

3. SUBSAMPLING MODELS

3.1. Sequential Addressing Subsampling

Note that subsampling technique is closely related to the idea of bootstrap (Efron,, 1979; Bickel and Freedman,, 1981). However, the classical full size bootstrap is often computationally too expensive for massive data analysis. A practical solution in this regard is to repeatedly generate subsamples of small sizes for parameter estimation and statistical inference. Consequently, various subsampling methods are proposed. These methods include, but are not limited to, the m𝑚mitalic_m out of n𝑛nitalic_n bootstrap (Bickel et al.,, 1997), the bag of little bootstrap (Kleiner et al.,, 2014), the subsampling double bootstrap (Sengupta et al.,, 2016), the distributed bootstrap (Chen and Peng,, 2021), the optimal subsampling bootstrap (Ma et al.,, 2024), and possibly others. These methods are particularly useful for the situation, where the data is small enough to be comfortably placed on one single hard drive but large enough so that it cannot be fully loaded into the computer memory as a whole.

When computational resources are limited, an alternative way of subsampling is to comprehensively utilize both the computer memory and the hard drive. Some early literature has proposed out-of-core sampling methods, which obtain samples by randomly accessing data points on the hard drive without loading the whole data file in advance (Vitter,, 1985; Li,, 1994). However, due to the hardware limitations at that time, such out-of-core sampling methods have not been tested on massive datasets. When dealing with massive datasets, the time cost is of the most critical concern. The time required to sample a single data point from the hard drive often exceeds that of in-memory sampling (Suwandarathna and Koggalage,, 2007). This time cost comprises two main components. They are, respectively, the addressing cost associated with identifying the target data point on the hard drive and the I/O cost associated with reading the target data point into memory. It is then referred to as the hard drive sampling cost (HDSC), representing the time needed to fetch a specific data point from the hard disk into computer memory (Pan et al.,, 2023). To reduce the HDSC for massive data, Pan et al., (2023) developed a computationally efficient method known as sequential addressing subsampling (SAS). This method involves a two-step process. They are, respectively, a random shuffling operation aiming at randomly sorting the raw data and a sequential sampling step for obtaining the desired subsamples. It is noteworthy that the random addressing operation, a crucial component of obtaining a subsample, is only performed once. The subsample obtained through this process is referred to as the SAS subsample, and various statistics can be constructed and theoretically studied using these SAS subsamples.

The subsampling method in Pan et al., (2023) provides a promising solution to accelerate the subsampling process on the hard drive. Moreover, their SAS method has been proven to be a robust tool for making statistical inference on massive datasets. Consider the sample mean as a concrete example to illustrate the theoretical findings based on the SAS method. Assume a set of N𝑁Nitalic_N samples represented by X1,,XNsubscript𝑋1subscript𝑋𝑁X_{1},\dots,X_{N}italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_X start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT with mean a μ𝜇\muitalic_μ and variance σ2superscript𝜎2\sigma^{2}italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT. Additionally, assume that E(Xiμ)4=γσ4𝐸superscriptsubscript𝑋𝑖𝜇4𝛾superscript𝜎4E(X_{i}-\mu)^{4}=\gamma\sigma^{4}italic_E ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - italic_μ ) start_POSTSUPERSCRIPT 4 end_POSTSUPERSCRIPT = italic_γ italic_σ start_POSTSUPERSCRIPT 4 end_POSTSUPERSCRIPT. Let {Xk,Xk+1,,Xk+n1}subscript𝑋𝑘subscript𝑋𝑘1subscript𝑋𝑘𝑛1\{X_{k},X_{k+1},\dots,X_{k+n-1}\}{ italic_X start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , italic_X start_POSTSUBSCRIPT italic_k + 1 end_POSTSUBSCRIPT , … , italic_X start_POSTSUBSCRIPT italic_k + italic_n - 1 end_POSTSUBSCRIPT } be the k𝑘kitalic_k-th subsample with a sample size of n𝑛nitalic_n, where the sample mean is defined as X¯k=n1i=kk+n1Xisubscript¯𝑋𝑘superscript𝑛1superscriptsubscript𝑖𝑘𝑘𝑛1subscript𝑋𝑖\overline{X}_{k}=n^{-1}\sum_{i=k}^{k+n-1}X_{i}over¯ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = italic_n start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_i = italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k + italic_n - 1 end_POSTSUPERSCRIPT italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. In practice, assume that B𝐵Bitalic_B sequential subsamples are obtained using the SAS method. Denote the corresponding sample means of these B𝐵Bitalic_B sequential subsample as {X¯(1),,X¯(b),,X¯(B)}subscript¯𝑋1subscript¯𝑋𝑏subscript¯𝑋𝐵\{\overline{X}_{(1)},\dots,\overline{X}_{(b)},\dots,\overline{X}_{(B)}\}{ over¯ start_ARG italic_X end_ARG start_POSTSUBSCRIPT ( 1 ) end_POSTSUBSCRIPT , … , over¯ start_ARG italic_X end_ARG start_POSTSUBSCRIPT ( italic_b ) end_POSTSUBSCRIPT , … , over¯ start_ARG italic_X end_ARG start_POSTSUBSCRIPT ( italic_B ) end_POSTSUBSCRIPT }. Define the sample mean of interest as X¯¯B=B1b=1BX¯(b)subscript¯¯𝑋𝐵superscript𝐵1superscriptsubscript𝑏1𝐵subscript¯𝑋𝑏\overline{\overline{X}}_{B}=B^{-1}\sum_{b=1}^{B}\overline{X}_{(b)}over¯ start_ARG over¯ start_ARG italic_X end_ARG end_ARG start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT = italic_B start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_b = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT over¯ start_ARG italic_X end_ARG start_POSTSUBSCRIPT ( italic_b ) end_POSTSUBSCRIPT. It is easily to see that E(X¯¯B)=μ𝐸subscript¯¯𝑋𝐵𝜇E(\overline{\overline{X}}_{B})=\muitalic_E ( over¯ start_ARG over¯ start_ARG italic_X end_ARG end_ARG start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT ) = italic_μ. Given the assumptions that (1) n𝑛n\rightarrow\inftyitalic_n → ∞ and n/N0𝑛𝑁0n/N\rightarrow 0italic_n / italic_N → 0 as N𝑁N\rightarrow\inftyitalic_N → ∞; (2) B/N0𝐵𝑁0B/N\rightarrow 0italic_B / italic_N → 0 and nB=O(N)𝑛𝐵𝑂𝑁nB=O(N)italic_n italic_B = italic_O ( italic_N ) as N𝑁N\to\inftyitalic_N → ∞, the variance of X¯¯Bsubscript¯¯𝑋𝐵\overline{\overline{X}}_{B}over¯ start_ARG over¯ start_ARG italic_X end_ARG end_ARG start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT can be presented as

var(X¯¯B)=σ2(1nB+1N){1+o(1)}.varsubscript¯¯𝑋𝐵superscript𝜎21𝑛𝐵1𝑁1𝑜1\mbox{var}(\overline{\overline{X}}_{B})=\sigma^{2}\bigg{(}\frac{1}{nB}+\frac{1% }{N}\bigg{)}\big{\{}1+o(1)\big{\}}.var ( over¯ start_ARG over¯ start_ARG italic_X end_ARG end_ARG start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT ) = italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ( divide start_ARG 1 end_ARG start_ARG italic_n italic_B end_ARG + divide start_ARG 1 end_ARG start_ARG italic_N end_ARG ) { 1 + italic_o ( 1 ) } .

On one hand, the term σ2/Nsuperscript𝜎2𝑁\sigma^{2}/Nitalic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT / italic_N is associated with the overall sample and cannot be eliminated by subsampling. On the other hand, the term σ2/(nB)superscript𝜎2𝑛𝐵\sigma^{2}/(nB)italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT / ( italic_n italic_B ) can be reduced by increasing the subsample size n𝑛nitalic_n or the number of subsamples B𝐵Bitalic_B. To perform automatic inference, the standard error of X¯¯Bsubscript¯¯𝑋𝐵\overline{\overline{X}}_{B}over¯ start_ARG over¯ start_ARG italic_X end_ARG end_ARG start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT is proposed as

SE^2(X¯¯B)=nB1(1nB+1N)b=1B(X¯(b)X¯¯B)2,superscript^SE2subscript¯¯𝑋𝐵𝑛𝐵11𝑛𝐵1𝑁superscriptsubscript𝑏1𝐵superscriptsubscript¯𝑋𝑏subscript¯¯𝑋𝐵2\widehat{\mbox{SE}}^{2}(\overline{\overline{X}}_{B})=\frac{n}{B-1}\Big{(}\frac% {1}{nB}+\frac{1}{N}\Big{)}\sum_{b=1}^{B}\Big{(}\overline{X}_{(b)}-\overline{% \overline{X}}_{B}\Big{)}^{2},over^ start_ARG SE end_ARG start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ( over¯ start_ARG over¯ start_ARG italic_X end_ARG end_ARG start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT ) = divide start_ARG italic_n end_ARG start_ARG italic_B - 1 end_ARG ( divide start_ARG 1 end_ARG start_ARG italic_n italic_B end_ARG + divide start_ARG 1 end_ARG start_ARG italic_N end_ARG ) ∑ start_POSTSUBSCRIPT italic_b = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT ( over¯ start_ARG italic_X end_ARG start_POSTSUBSCRIPT ( italic_b ) end_POSTSUBSCRIPT - over¯ start_ARG over¯ start_ARG italic_X end_ARG end_ARG start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ,

where the theoretical results can be found in Pan et al., (2023). In summary, the SAS method is time-saving in terms of HDSC as well as useful for automatic statistical inferences.

3.2. Subsampling-based Estimation Methods

In the previous subsection, we have thoroughly reviewed various subsampling methods and discussed them from the perspective of computational efficiency. In the meanwhile, how to estimate the parameters of interest with the best statistical efficiency is also a crucial concern. To formulate this problem, let ={i:1iN}conditional-set𝑖1𝑖𝑁\mathcal{F}=\{i:1\leq i\leq N\}caligraphic_F = { italic_i : 1 ≤ italic_i ≤ italic_N } represent an index set for an extremely large dataset with sample size N𝑁Nitalic_N. Let Yisubscript𝑌𝑖Y_{i}italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT be the response associated with the i𝑖iitalic_i-th subject and Xi=(Xij)psubscript𝑋𝑖subscript𝑋𝑖𝑗superscript𝑝X_{i}=(X_{ij})\in\mathbb{R}^{p}italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = ( italic_X start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT ) ∈ blackboard_R start_POSTSUPERSCRIPT italic_p end_POSTSUPERSCRIPT be the corresponding p𝑝pitalic_p-dimensional feature vector. Define πisubscript𝜋𝑖\pi_{i}italic_π start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT to be the sampling probability for each sample 1iN1𝑖𝑁1\leq i\leq N1 ≤ italic_i ≤ italic_N. A random subsample of size n𝑛nitalic_n can be drawn (with replacement) with π=(π1,,πN)N𝜋superscriptsubscript𝜋1subscript𝜋𝑁topsuperscript𝑁\pi=(\pi_{1},...,\pi_{N})^{\top}\in\mathbb{R}^{N}italic_π = ( italic_π start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_π start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT. Then the key research question here is how to define πisubscript𝜋𝑖\pi_{i}italic_π start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPTs appropriately, so that a small but representative subsample can be obtained for downstream statistical models.

Various subsampling methods have been proposed to address this concern. For the linear regression model, the algorithm leveraging methods have been extensively discussed, utilizing the empirical statistical leverage scores of the input covariate matrix to define the sampling probabilities (Drineas et al.,, 2006, 2011; Mahoney,, 2011). Ma et al., (2014) further provided an effective framework to evaluate the statistical properties of parameter estimation in these algorithmic leveraging methods. Wang et al., 2019a introduced the information-based optimal subdata selection (IBOSS) method, which can deterministically identify a subsample with the maximum information matrix under the D𝐷Ditalic_D-optimality criterion. The IBOSS approach is further extended to a divide-and-conquer setting by Wang, 2019a . For binary logistic regression, Wang et al., 2018a proposed the optimal subsampling method motivated by the A𝐴Aitalic_A-optimality criterion (OSMAC) by minimizing the asymptotic mean squared error (MSE) of the subsample estimator to design the subsampling probability. The OSMAC method can be enhanced by incorporating unweighted objective functions and Poisson subsampling, resulting in improved efficiency (Wang, 2019b, ). Furthermore, the applicability of OSMAC is extended to various classes of models, including multi-class logistic regression (Yao and Wang,, 2019), generalized linear models (Ai et al.,, 2021), quantile regression (Wang and Ma,, 2021) and quasi-likelihood (Yu et al.,, 2022).

Although the above methods are demonstrated to be statistically efficient, computing πisubscript𝜋𝑖\pi_{i}italic_π start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT for the entire dataset poses a substantial computational challenge, especially when the data size N𝑁Nitalic_N is extremely large. Take the OSMAC method (Wang et al., 2018a, ) as an example. For the OSMAC method, determining the optimal subsampling probabilities involves a computational complexity of O(Np)𝑂𝑁𝑝O(Np)italic_O ( italic_N italic_p ). As a result, this optimal subsampling algorithm becomes computationally expensive when dealing with a very large sample size N𝑁Nitalic_N. To address this challenge, the repeated subsampling method can be adopted. The key idea of repeated subsampling is to draw a subsample with uniform probability (i.e., πi=1/Nsubscript𝜋𝑖1𝑁\pi_{i}=1/Nitalic_π start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = 1 / italic_N) while operating the subsampling step repeatedly. Using the uniform probability in the subsampling process eliminates the need to compute probabilities for the entire dataset in advance. Therefore the computation cost is significantly reduced. Through repeated subsampling, the selected data approximates the whole dataset. Note that the cost associated with subsampling cannot be negligible, particularly true for the repeated subsampling methods. Nevertheless, thanks to the SAS method of Pan et al., (2023), the hard drive sampling cost can be significantly reduced.

Based on repeated subsampling, a variety of statistical models have been developed. Here we introduce the sequential one-step (SOS) estimator for generalized linear models (Wang et al.,, 2022) for example. Assume the whole dataset has been randomly distributed on the hard drive and the SAS method is used to obtain each subdata. Assume the subsampling is repeated for K𝐾Kitalic_K times. In the k𝑘kitalic_k-th subsampling with 1kK1𝑘𝐾1\leq k\leq K1 ≤ italic_k ≤ italic_K, denote 𝒮ksubscript𝒮𝑘\mathcal{S}_{k}caligraphic_S start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT to be the indices of selected observations in the whole dataset. Based on 𝒮ksubscript𝒮𝑘\mathcal{S}_{k}caligraphic_S start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT, the SOS estimator is computed as follows. First, we need to calculate an initial estimator β¯1subscript¯𝛽1\overline{\beta}_{1}over¯ start_ARG italic_β end_ARG start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT based on 𝒮1subscript𝒮1\mathcal{S}_{1}caligraphic_S start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT. This initial estimator could, for instance, be a maximum likelihood estimator (MLE) of the generalized linear regression models. Assume β¯ksubscript¯𝛽𝑘\overline{\beta}_{k}over¯ start_ARG italic_β end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT to be the current estimator in the k𝑘kitalic_k-th step. Subsequently, in the (k+1)𝑘1(k+1)( italic_k + 1 )-th subsampling step, a new SAS subsample 𝒮k+1subscript𝒮𝑘1\mathcal{S}_{k+1}caligraphic_S start_POSTSUBSCRIPT italic_k + 1 end_POSTSUBSCRIPT is obtained. Then a one-step update is performed based on β¯ksubscript¯𝛽𝑘\overline{\beta}_{k}over¯ start_ARG italic_β end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT to obtain the one-step updated estimator, i.e., β^k+1=β¯k{¨𝒮k+1(β¯k)}1˙𝒮k+1(β¯k)subscript^𝛽𝑘1subscript¯𝛽𝑘superscriptsubscript¨subscript𝒮𝑘1subscript¯𝛽𝑘1subscript˙subscript𝒮𝑘1subscript¯𝛽𝑘\widehat{\beta}_{k+1}=\overline{\beta}_{k}-\big{\{}\ddot{\ell}_{\mathcal{S}_{k% +1}}(\overline{\beta}_{k})\big{\}}^{-1}\dot{\ell}_{\mathcal{S}_{k+1}}(% \overline{\beta}_{k})over^ start_ARG italic_β end_ARG start_POSTSUBSCRIPT italic_k + 1 end_POSTSUBSCRIPT = over¯ start_ARG italic_β end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT - { over¨ start_ARG roman_ℓ end_ARG start_POSTSUBSCRIPT caligraphic_S start_POSTSUBSCRIPT italic_k + 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( over¯ start_ARG italic_β end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) } start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT over˙ start_ARG roman_ℓ end_ARG start_POSTSUBSCRIPT caligraphic_S start_POSTSUBSCRIPT italic_k + 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( over¯ start_ARG italic_β end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ), where ˙𝒮k+1(β¯k)subscript˙subscript𝒮𝑘1subscript¯𝛽𝑘\dot{\ell}_{\mathcal{S}_{k+1}}(\overline{\beta}_{k})over˙ start_ARG roman_ℓ end_ARG start_POSTSUBSCRIPT caligraphic_S start_POSTSUBSCRIPT italic_k + 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( over¯ start_ARG italic_β end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) and ¨𝒮k+1(β¯k)subscript¨subscript𝒮𝑘1subscript¯𝛽𝑘\ddot{\ell}_{\mathcal{S}_{k+1}}(\overline{\beta}_{k})over¨ start_ARG roman_ℓ end_ARG start_POSTSUBSCRIPT caligraphic_S start_POSTSUBSCRIPT italic_k + 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( over¯ start_ARG italic_β end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) denote the 1st and 2nd order derivatives of the likelihood function based on the (k+1)𝑘1(k+1)( italic_k + 1 )-th subsample, respectively. Next, the SOS subsampling estimator for the (k+1)𝑘1(k+1)( italic_k + 1 )-th step is computed as β¯k+1={kβ¯k+β^k+1}/(k+1)=l=1k+1β^l/(k+1)subscript¯𝛽𝑘1𝑘subscript¯𝛽𝑘subscript^𝛽𝑘1𝑘1superscriptsubscript𝑙1𝑘1subscript^𝛽𝑙𝑘1\overline{\beta}_{k+1}=\big{\{}k\overline{\beta}_{k}+\widehat{\beta}_{k+1}\big% {\}}/(k+1)=\sum_{l=1}^{k+1}\widehat{\beta}_{l}/(k+1)over¯ start_ARG italic_β end_ARG start_POSTSUBSCRIPT italic_k + 1 end_POSTSUBSCRIPT = { italic_k over¯ start_ARG italic_β end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT + over^ start_ARG italic_β end_ARG start_POSTSUBSCRIPT italic_k + 1 end_POSTSUBSCRIPT } / ( italic_k + 1 ) = ∑ start_POSTSUBSCRIPT italic_l = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k + 1 end_POSTSUPERSCRIPT over^ start_ARG italic_β end_ARG start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT / ( italic_k + 1 ). The corresponding estimator obtained in the last K𝐾Kitalic_K-th step is the final SOS estimator, i.e., β^SOS=β¯Ksuperscript^𝛽SOSsubscript¯𝛽𝐾\widehat{\beta}^{\text{SOS}}=\overline{\beta}_{K}over^ start_ARG italic_β end_ARG start_POSTSUPERSCRIPT SOS end_POSTSUPERSCRIPT = over¯ start_ARG italic_β end_ARG start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT. It is noteworthy that the SOS method represents an extension of the classical one-step estimator (Shao,, 2003; Zou and Li,, 2008) but within the context of subsampling. The theoretical properties of the SOS estimator are also established in Wang et al., (2022), demonstrating that both the bias and variance of the SOS estimator decrease as the number of sampling iterations K𝐾Kitalic_K increases.

3.3. Subsample Feature Screening

Feature screening plays a critically important role for ultrahigh dimensional data analysis. Extensive literature has been developed along this direction. Since Fan and Lv, (2008) introduced the seminal work of sure independence screening (SIS), a large amount of follow-up research has been inspired. The key idea of SIS is to rank and then select important features by certain appropriately defined correlation measures. For example, under a linear regression model setup and assuming some appropriate regularity conditions, Fan and Lv, (2008) showed that the top features selected according to marginal sample correlation coefficients are screening consistent. In other words, the selected top features are assured to asymptotically cover the underlying low dimensional true model with probability tending to one. Wang, (2009) further improved SIS by the method of forward regression for a significantly improved finite-sample performance. Li et al., (2012) proposed a distance correlation based independent screening method (DC-SIS) so that variable screening can be conducted in a model free manner. Recently, a distributed feature selection method has been developed by Li et al., (2020) for massive data analysis.

Zhu et al., (2022) proposed a novel subsampling-based feature selection method for large datasets with ultrahigh dimensional features. They consider a classical linear regression model as Yi=Xiβ+εisubscript𝑌𝑖superscriptsubscript𝑋𝑖top𝛽subscript𝜀𝑖Y_{i}=X_{i}^{\top}\beta+\varepsilon_{i}italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_β + italic_ε start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT (Fan and Lv,, 2008), where βp𝛽superscript𝑝\beta\in\mathbb{R}^{p}italic_β ∈ blackboard_R start_POSTSUPERSCRIPT italic_p end_POSTSUPERSCRIPT is regression parameters, εisubscript𝜀𝑖\varepsilon_{i}italic_ε start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is the independent noise term with var(εi)=σ2varsubscript𝜀𝑖superscript𝜎2\operatorname{var}\left(\varepsilon_{i}\right)=\sigma^{2}roman_var ( italic_ε start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) = italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT. Assume a total of B𝐵Bitalic_B subsamples with size n𝑛nitalic_n, which are denoted by 𝒮(b)𝒮F={1,2,,N}subscript𝒮𝑏subscript𝒮𝐹12𝑁\mathcal{S}_{(b)}\subset\mathcal{S}_{F}=\big{\{}1,2,\dots,N\big{\}}caligraphic_S start_POSTSUBSCRIPT ( italic_b ) end_POSTSUBSCRIPT ⊂ caligraphic_S start_POSTSUBSCRIPT italic_F end_POSTSUBSCRIPT = { 1 , 2 , … , italic_N } with |𝒮(b)|=nsubscript𝒮𝑏𝑛|\mathcal{S}_{(b)}|=n| caligraphic_S start_POSTSUBSCRIPT ( italic_b ) end_POSTSUBSCRIPT | = italic_n. Define 𝕏(b)=(Xi:i𝕊(b))n×p\mathbb{X}_{(b)}=(X_{i}:i\in\mathbb{S}_{(b)})\in\mathbb{R}^{n\times p}blackboard_X start_POSTSUBSCRIPT ( italic_b ) end_POSTSUBSCRIPT = ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT : italic_i ∈ blackboard_S start_POSTSUBSCRIPT ( italic_b ) end_POSTSUBSCRIPT ) ∈ blackboard_R start_POSTSUPERSCRIPT italic_n × italic_p end_POSTSUPERSCRIPT as the subsampled design matrix and 𝕐(b)=(Yi:i𝒮(b))n\mathbb{Y}_{(b)}=(Y_{i}:i\in\mathcal{S}_{(b)})\in\mathbb{R}^{n}blackboard_Y start_POSTSUBSCRIPT ( italic_b ) end_POSTSUBSCRIPT = ( italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT : italic_i ∈ caligraphic_S start_POSTSUBSCRIPT ( italic_b ) end_POSTSUBSCRIPT ) ∈ blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT as the associate response vector. Define a candidate model as ={j1,,jm}subscript𝑗1subscript𝑗𝑚\mathcal{M}=\{j_{1},\ldots,j_{m}\}caligraphic_M = { italic_j start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_j start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT } with 1jkp1subscript𝑗𝑘𝑝1\leq j_{k}\leq p1 ≤ italic_j start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ≤ italic_p for every 1kK1𝑘𝐾1\leq k\leq K1 ≤ italic_k ≤ italic_K. Define the design matrix associate with \mathcal{M}caligraphic_M as 𝕏(b)()n×||superscriptsubscript𝕏𝑏superscript𝑛\mathbb{X}_{(b)}^{(\mathcal{M})}\in\mathbb{R}^{n\times|\mathcal{M}|}blackboard_X start_POSTSUBSCRIPT ( italic_b ) end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( caligraphic_M ) end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_n × | caligraphic_M | end_POSTSUPERSCRIPT. Then, the R𝑅Ritalic_R-Squared statistic to be computed from the b𝑏bitalic_b-th subsample for the model \mathcal{M}caligraphic_M is given by

R(b)2()=(𝕏(b)𝕐(b))(𝕏(b)𝕏(b))1(𝕏(b)𝕐(b))𝕐(b)𝕐¯(b)2,subscriptsuperscript𝑅2𝑏superscriptsuperscriptsubscript𝕏𝑏limit-fromtopsubscript𝕐𝑏topsuperscriptsuperscriptsubscript𝕏𝑏limit-fromtopsuperscriptsubscript𝕏𝑏1superscriptsubscript𝕏𝑏limit-fromtopsubscript𝕐𝑏superscriptnormsubscript𝕐𝑏subscript¯𝕐𝑏2R^{2}_{(b)}\left(\mathcal{M}\right)=\Big{(}\mathbb{X}_{(b)}^{\mathcal{M}\top}% \mathbb{Y}_{(b)}\Big{)}^{\top}\Big{(}\mathbb{X}_{(b)}^{\mathcal{M}\top}\mathbb% {X}_{(b)}^{\mathcal{M}}\Big{)}^{-1}\Big{(}\mathbb{X}_{(b)}^{\mathcal{M}\top}% \mathbb{Y}_{(b)}\Big{)}\Big{\|}\mathbb{Y}_{(b)}-\overline{\mathbb{Y}}_{(b)}% \Big{\|}^{-2},italic_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT ( italic_b ) end_POSTSUBSCRIPT ( caligraphic_M ) = ( blackboard_X start_POSTSUBSCRIPT ( italic_b ) end_POSTSUBSCRIPT start_POSTSUPERSCRIPT caligraphic_M ⊤ end_POSTSUPERSCRIPT blackboard_Y start_POSTSUBSCRIPT ( italic_b ) end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ( blackboard_X start_POSTSUBSCRIPT ( italic_b ) end_POSTSUBSCRIPT start_POSTSUPERSCRIPT caligraphic_M ⊤ end_POSTSUPERSCRIPT blackboard_X start_POSTSUBSCRIPT ( italic_b ) end_POSTSUBSCRIPT start_POSTSUPERSCRIPT caligraphic_M end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ( blackboard_X start_POSTSUBSCRIPT ( italic_b ) end_POSTSUBSCRIPT start_POSTSUPERSCRIPT caligraphic_M ⊤ end_POSTSUPERSCRIPT blackboard_Y start_POSTSUBSCRIPT ( italic_b ) end_POSTSUBSCRIPT ) ∥ blackboard_Y start_POSTSUBSCRIPT ( italic_b ) end_POSTSUBSCRIPT - over¯ start_ARG blackboard_Y end_ARG start_POSTSUBSCRIPT ( italic_b ) end_POSTSUBSCRIPT ∥ start_POSTSUPERSCRIPT - 2 end_POSTSUPERSCRIPT ,

where 𝕐¯(b)=n1𝟏𝕐(b)subscript¯𝕐𝑏superscript𝑛1superscript1topsubscript𝕐𝑏\overline{\mathbb{Y}}_{(b)}=n^{-1}\mathbf{1}^{\top}\mathbb{Y}_{(b)}over¯ start_ARG blackboard_Y end_ARG start_POSTSUBSCRIPT ( italic_b ) end_POSTSUBSCRIPT = italic_n start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT bold_1 start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT blackboard_Y start_POSTSUBSCRIPT ( italic_b ) end_POSTSUBSCRIPT. Thereafter, a one-shot type statistic can be assembled as ROS2()=B1b=1BR(b)2().subscriptsuperscript𝑅2OSsuperscript𝐵1superscriptsubscript𝑏1𝐵subscriptsuperscript𝑅2𝑏R^{2}_{\mathrm{OS}}\left(\mathcal{M}\right)=B^{-1}\sum_{b=1}^{B}R^{2}_{(b)}(% \mathcal{M}).italic_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT roman_OS end_POSTSUBSCRIPT ( caligraphic_M ) = italic_B start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_b = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT italic_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT ( italic_b ) end_POSTSUBSCRIPT ( caligraphic_M ) .

The one-shot estimator ROS2()subscriptsuperscript𝑅2OSR^{2}_{\mathrm{OS}}\left(\mathcal{M}\right)italic_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT roman_OS end_POSTSUBSCRIPT ( caligraphic_M ) is easy to compute. However, the drawback is that it might suffer from non-ignborable estimation bias if the subsample size is relatively small. To address this issue, Zhu et al., (2022) developed two improved methods for bias reduction. The first method is a jackknife-based bias-correction method. To be specific, define a delete-one estimator as R(b)i2()subscriptsuperscript𝑅2𝑏𝑖R^{2}_{(b)i}(\mathcal{M})italic_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT ( italic_b ) italic_i end_POSTSUBSCRIPT ( caligraphic_M ), which is the R(b)2()subscriptsuperscript𝑅2𝑏R^{2}_{(b)}(\mathcal{M})italic_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT ( italic_b ) end_POSTSUBSCRIPT ( caligraphic_M ) statistic computed without the i𝑖iitalic_i-th observation. This leads to the jackknife estimator for the bias as Δ^(b)=n1(n1)iR(b)2()(n1)R(b)i2()subscript^Δ𝑏superscript𝑛1𝑛1subscript𝑖subscriptsuperscript𝑅2𝑏𝑛1subscriptsuperscript𝑅2𝑏𝑖\widehat{\Delta}_{(b)}=n^{-1}(n-1)\sum_{i}R^{2}_{(b)}(\mathcal{M})-(n-1)R^{2}_% {(b)i}(\mathcal{M})over^ start_ARG roman_Δ end_ARG start_POSTSUBSCRIPT ( italic_b ) end_POSTSUBSCRIPT = italic_n start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ( italic_n - 1 ) ∑ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT ( italic_b ) end_POSTSUBSCRIPT ( caligraphic_M ) - ( italic_n - 1 ) italic_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT ( italic_b ) italic_i end_POSTSUBSCRIPT ( caligraphic_M ). Then the jackknife-based bias-correction metric is defined as RJBC2()=B1k=1B{R(b)2()Δ^(b)}.superscriptsubscript𝑅JBC2superscript𝐵1superscriptsubscript𝑘1𝐵subscriptsuperscript𝑅2𝑏subscript^Δ𝑏R_{\mathrm{JBC}}^{2}\left(\mathcal{M}\right)=B^{-1}\sum_{k=1}^{B}\{R^{2}_{(b)}% (\mathcal{M})-\widehat{\Delta}_{(b)}\}.italic_R start_POSTSUBSCRIPT roman_JBC end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ( caligraphic_M ) = italic_B start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT { italic_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT ( italic_b ) end_POSTSUBSCRIPT ( caligraphic_M ) - over^ start_ARG roman_Δ end_ARG start_POSTSUBSCRIPT ( italic_b ) end_POSTSUBSCRIPT } . The second method is an aggregated moment method. This method first decomposes R(b)2()subscriptsuperscript𝑅2𝑏R^{2}_{(b)}\left(\mathcal{M}\right)italic_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT ( italic_b ) end_POSTSUBSCRIPT ( caligraphic_M ) into several moment components, and then aggregates the components separately. To be more precise, note that R𝑅Ritalic_R-Squared statistic is composed of three components, namely Σ^𝕏(b)=n1(𝕏(b))𝕏(b),Σ^𝕏𝕐(b)=n1(𝕏(b))𝕐(b)formulae-sequencesuperscriptsubscript^Σ𝕏𝑏superscript𝑛1superscriptsuperscriptsubscript𝕏𝑏topsuperscriptsubscript𝕏𝑏superscriptsubscript^Σ𝕏𝕐𝑏superscript𝑛1superscriptsuperscriptsubscript𝕏𝑏topsubscript𝕐𝑏\widehat{\Sigma}_{\mathbb{X}(b)}^{\mathcal{M}}=n^{-1}(\mathbb{X}_{(b)}^{% \mathcal{M}})^{\top}\mathbb{X}_{(b)}^{\mathcal{M}},\widehat{\Sigma}_{\mathbb{X% }\mathbb{Y}(b)}^{\mathcal{M}}=n^{-1}(\mathbb{X}_{(b)}^{\mathcal{M}})^{\top}% \mathbb{Y}_{(b)}over^ start_ARG roman_Σ end_ARG start_POSTSUBSCRIPT blackboard_X ( italic_b ) end_POSTSUBSCRIPT start_POSTSUPERSCRIPT caligraphic_M end_POSTSUPERSCRIPT = italic_n start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ( blackboard_X start_POSTSUBSCRIPT ( italic_b ) end_POSTSUBSCRIPT start_POSTSUPERSCRIPT caligraphic_M end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT blackboard_X start_POSTSUBSCRIPT ( italic_b ) end_POSTSUBSCRIPT start_POSTSUPERSCRIPT caligraphic_M end_POSTSUPERSCRIPT , over^ start_ARG roman_Σ end_ARG start_POSTSUBSCRIPT blackboard_X blackboard_Y ( italic_b ) end_POSTSUBSCRIPT start_POSTSUPERSCRIPT caligraphic_M end_POSTSUPERSCRIPT = italic_n start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ( blackboard_X start_POSTSUBSCRIPT ( italic_b ) end_POSTSUBSCRIPT start_POSTSUPERSCRIPT caligraphic_M end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT blackboard_Y start_POSTSUBSCRIPT ( italic_b ) end_POSTSUBSCRIPT, and σ^y(b)2=n1𝕐(b)𝕐¯(b)2superscriptsubscript^𝜎𝑦𝑏2superscript𝑛1superscriptnormsubscript𝕐𝑏subscript¯𝕐𝑏2\widehat{\sigma}_{y(b)}^{2}=n^{-1}\|\mathbb{Y}_{(b)}-\overline{\mathbb{Y}}_{(b% )}\|^{2}over^ start_ARG italic_σ end_ARG start_POSTSUBSCRIPT italic_y ( italic_b ) end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = italic_n start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ∥ blackboard_Y start_POSTSUBSCRIPT ( italic_b ) end_POSTSUBSCRIPT - over¯ start_ARG blackboard_Y end_ARG start_POSTSUBSCRIPT ( italic_b ) end_POSTSUBSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT. Averaging over all subsamples leads to Σ^𝕏=B1b=1BΣ^𝕏(k),Σ^𝕏𝕐=B1k=1BΣ^𝕏𝕐(k)formulae-sequencesuperscriptsubscript^Σ𝕏superscript𝐵1superscriptsubscript𝑏1𝐵superscriptsubscript^Σ𝕏𝑘superscriptsubscript^Σ𝕏𝕐superscript𝐵1superscriptsubscript𝑘1𝐵superscriptsubscript^Σ𝕏𝕐𝑘\widehat{\Sigma}_{\mathbb{X}}^{\mathcal{M}}=B^{-1}\sum_{b=1}^{B}\widehat{% \Sigma}_{\mathbb{X}(k)}^{\mathcal{M}},\widehat{\Sigma}_{\mathbb{XY}}^{\mathcal% {M}}=B^{-1}\sum_{k=1}^{B}\widehat{\Sigma}_{\mathbb{XY}(k)}^{\mathcal{M}}over^ start_ARG roman_Σ end_ARG start_POSTSUBSCRIPT blackboard_X end_POSTSUBSCRIPT start_POSTSUPERSCRIPT caligraphic_M end_POSTSUPERSCRIPT = italic_B start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_b = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT over^ start_ARG roman_Σ end_ARG start_POSTSUBSCRIPT blackboard_X ( italic_k ) end_POSTSUBSCRIPT start_POSTSUPERSCRIPT caligraphic_M end_POSTSUPERSCRIPT , over^ start_ARG roman_Σ end_ARG start_POSTSUBSCRIPT blackboard_X blackboard_Y end_POSTSUBSCRIPT start_POSTSUPERSCRIPT caligraphic_M end_POSTSUPERSCRIPT = italic_B start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT over^ start_ARG roman_Σ end_ARG start_POSTSUBSCRIPT blackboard_X blackboard_Y ( italic_k ) end_POSTSUBSCRIPT start_POSTSUPERSCRIPT caligraphic_M end_POSTSUPERSCRIPT and σ^𝕐2=B1b=1Bσ^y(b)2superscriptsubscript^𝜎𝕐2superscript𝐵1superscriptsubscript𝑏1𝐵superscriptsubscript^𝜎𝑦𝑏2\widehat{\sigma}_{\mathbb{Y}}^{2}=B^{-1}\sum_{b=1}^{B}\widehat{\sigma}_{y(b)}^% {2}over^ start_ARG italic_σ end_ARG start_POSTSUBSCRIPT blackboard_Y end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = italic_B start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_b = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT over^ start_ARG italic_σ end_ARG start_POSTSUBSCRIPT italic_y ( italic_b ) end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT. Then, an aggregated moment estimator is defined as RAM2()=σ^𝕐2(Σ^𝕏𝕐)(Σ^𝕏)1(Σ^𝕏𝕐).superscriptsubscript𝑅AM2superscriptsubscript^𝜎𝕐2superscriptsuperscriptsubscript^Σ𝕏𝕐topsuperscriptsuperscriptsubscript^Σ𝕏1superscriptsubscript^Σ𝕏𝕐R_{\mathrm{AM}}^{2}\left(\mathcal{M}\right)={\widehat{\sigma}_{\mathbb{Y}}^{-2% }}(\widehat{\Sigma}_{\mathbb{XY}}^{\mathcal{M}})^{\top}(\widehat{\Sigma}_{% \mathbb{X}}^{\mathcal{M}})^{-1}(\widehat{\Sigma}_{\mathbb{XY}}^{\mathcal{M}}).italic_R start_POSTSUBSCRIPT roman_AM end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ( caligraphic_M ) = over^ start_ARG italic_σ end_ARG start_POSTSUBSCRIPT blackboard_Y end_POSTSUBSCRIPT start_POSTSUPERSCRIPT - 2 end_POSTSUPERSCRIPT ( over^ start_ARG roman_Σ end_ARG start_POSTSUBSCRIPT blackboard_X blackboard_Y end_POSTSUBSCRIPT start_POSTSUPERSCRIPT caligraphic_M end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ( over^ start_ARG roman_Σ end_ARG start_POSTSUBSCRIPT blackboard_X end_POSTSUBSCRIPT start_POSTSUPERSCRIPT caligraphic_M end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ( over^ start_ARG roman_Σ end_ARG start_POSTSUBSCRIPT blackboard_X blackboard_Y end_POSTSUBSCRIPT start_POSTSUPERSCRIPT caligraphic_M end_POSTSUPERSCRIPT ) . Both two methods can reduce the bias of ROS2()subscriptsuperscript𝑅2OSR^{2}_{\mathrm{OS}}(\mathcal{M})italic_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT roman_OS end_POSTSUBSCRIPT ( caligraphic_M ) significantly without inflecting the asymptotic variance.

4. MINIBATCH RELATED TECHNIQUES

4.1. A Selective Review on Statistical Optimization

In statistical research, many estimation problems can ultimately be transformed into optimization problems. For example, for the generalized linear models (GLMs), one usually estimates the model parameters by maximizing the likelihood function (Nelder and Wedderburn,, 1972). The likelihood function is generally a sufficiently smooth and strongly convex function. Consequently, the Newton’s method or Fisher’s score method can be easily implemented. Often the numerical convergence can be achieved in a few iterations (Shao,, 2003). Therefore, researchers usually do not concern much about the specific optimization process but directly study the statistical properties of the optimizer (Van der Vaart,, 2000).

However, with the rapid development of information technology, not only datasets are becoming increasingly large, but models are also becoming even more complex (Fan et al.,, 2020). As mentioned before, this poses two challenges to traditional optimization methods. First, the dimension of the model parameters p𝑝pitalic_p can be very high. This would make it difficult to invert the p×p𝑝𝑝p\times pitalic_p × italic_p Hessian matrix for Newton’s type methods. Second, the dataset may be too large to be read into computer memory as a whole. Then the optimization methods based on whole datasets become no longer feasible. For the former challenge, one can consider gradient-based first-order optimization methods, such as gradient descent (GD) method, quasi-Newton method, and conjugate gradient method (Beck,, 2017). For the latter challenge, one can load the data into the memory in a minibatch-wise manner, and then implement the algorithms based on these small minibatches. This leads to various minibatch-based methods. In particular, when the minibatches are generated randomly, they are also referred to as stochastic optimization methods (Lan,, 2020), such as stochastic gradient descent (SGD). Due to the scalability of these minibatch-based first-order optimization methods, they are now widely used in large-scale learning tasks such as deep learning (Bottou et al.,, 2018; Simonyan and Zisserman,, 2015; He et al.,, 2016).

In addition to the popularity in practice, the theoretical properties of minibatch related algorithms have also attracted increasing attention from researchers. The early literature on solving optimization problems using the idea of stochastic approximation can be traced back to Robbins and Monro, (1951) and Kiefer and Wolfowitz, (1952). In order to improve the efficiency of approximation, Polyak and Juditsky, (1992) further proposed to perform averaging over the iterates, also known as PJR-averaging operation. More recently, Moulines and Bach, (2011) and Bach and Moulines, (2013) investigated the SGD algorithm for objective functions with and without strong convexity, establishing the non-asymptotic convergence upper bounds. For objective functions of finite sums, several variance reduction techniques have been found useful in achieving a faster convergence rate compared to the classical SGD (Roux et al.,, 2012; Johnson and Zhang,, 2013; Defazio et al.,, 2014). To further accelerate the convergence rate for ill-conditioned problems, momentum based methods have also attracted great attention (Gitman et al.,, 2019; Assran and Rabbat,, 2020; Liu et al.,, 2020). Another line of research considered how to generate minibatches to improve the minibatch based GD (Needell and Ward,, 2017; Gower et al.,, 2019; Mishchenko et al.,, 2020; Gürbüzbalaban et al.,, 2021). Apart from research on minibatch related methods from an optimization perspective, there are also many studies conducted from a statistical perspective, paying more attention to characterizing the statistical properties of the resulting estimators. These works include but are not limited to Toulis and Airoldi, (2017), Chen et al., 2020a , Luo and Song, (2020), Zhu et al., (2023), and Tang et al., (2023).

4.2. Minibatch Gradient Descent Algorithms

For high dimensional data analysis, various stochastic minibatch gradient descent (SMGD) methods have received increasing attention in recent literature due to their outstanding performances and relatively easier theoretical properties (Duchi et al.,, 2011; Kingma and Ba,, 2014). The SMGD algorithms can be mainly categorized into two groups according to the generation of minibatch data. The first group assumes the minibatches are independently generated from the given sample with replacement (Gao et al.,, 2021). Then the noise introduced by minibatch data can be viewed as conditionally independent. The second category assumes that the noises introduced by minibatch data form a martingale difference sequence. This assumption is particularly true for streaming data analysis (Mou et al.,, 2020; Yu et al.,, 2021; Chen et al., 2022c, ). Taking the SMGD studied in Chen et al., 2020a as a concrete example, they assume that the gradient noise from different minibatch data forms a martingale difference. Following the previous work of Polyak and Juditsky, (1992), the asymptotic distribution of the averaged SMGD estimator can be established. To conduct statistical inference of the averaged SMGD estimator, different inference procedures are proposed for both fixed dimension case and diverged dimension case. In the fixed dimension case, Chen et al., 2020a proposed a novel batch-means covariance estimator which can avoid computing the inverse of the Hessian matrix as compared with the naive plug-in estimator. In the high dimension case, Chen et al., 2020a proposed an online debiased lasso procedure to construct the confidence interval of element-wise regression coefficient.

Different from the SMGD algorithm, another way to generate minibatch data is the random partition, which is arguably the most popularly used minibatch method in offline real practice, since it has been well implemented by many standard deep learning programs such as TensorFlow and PyTorch. Since the minibatches form a partition of the whole sample data, they are no longer independent or conditionally independent with each other. As a result, it does not match the model assumption in the past literature (Bottou et al.,, 2018; Dieuleveut et al.,, 2020; Lan,, 2020; Mou et al.,, 2020; Yu et al.,, 2021; Chen et al., 2022c, ), which calls for theoretical investigation. To fill this theoretical gap, Qi et al., 2023b studied the properties of fixed minibatch gradient descent (FMGD) algorithm and the resulting estimator. Let 𝒮={1,2,,N}𝒮12𝑁\mathcal{S}=\{1,2,\dots,N\}caligraphic_S = { 1 , 2 , … , italic_N } be the index set of the whole sample. Let Yi1subscript𝑌𝑖superscript1Y_{i}\in\mathbb{R}^{1}italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT be the response of interest and Xi=(Xi1,Xi2,,Xip)psubscript𝑋𝑖superscriptsubscript𝑋𝑖1subscript𝑋𝑖2subscript𝑋𝑖𝑝topsuperscript𝑝X_{i}=(X_{i1},X_{i2},\dots,X_{ip})^{\top}\in\mathbb{R}^{p}italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = ( italic_X start_POSTSUBSCRIPT italic_i 1 end_POSTSUBSCRIPT , italic_X start_POSTSUBSCRIPT italic_i 2 end_POSTSUBSCRIPT , … , italic_X start_POSTSUBSCRIPT italic_i italic_p end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_p end_POSTSUPERSCRIPT be the associated p𝑝pitalic_p-dimensional predictor. Define the loss function evaluated at sample i𝑖iitalic_i as (Xi,Yi;θ)subscript𝑋𝑖subscript𝑌𝑖𝜃\ell(X_{i},Y_{i};\theta)roman_ℓ ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ; italic_θ ), where θq𝜃superscript𝑞\theta\in\mathbb{R}^{q}italic_θ ∈ blackboard_R start_POSTSUPERSCRIPT italic_q end_POSTSUPERSCRIPT denotes the parameter. Then the global loss function can be constructed as (θ)=N1i=1N(Xi,Yi;θ)𝜃superscript𝑁1subscriptsuperscript𝑁𝑖1subscript𝑋𝑖subscript𝑌𝑖𝜃\mathcal{L}(\theta)=N^{-1}\sum^{N}_{i=1}\ell(X_{i},Y_{i};\theta)caligraphic_L ( italic_θ ) = italic_N start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ∑ start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT roman_ℓ ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ; italic_θ ). The global estimator can be defined as θ^=argmin(θ)^𝜃argmin𝜃\widehat{\theta}=\operatorname{\operatorname*{argmin}}\mathcal{L}(\theta)over^ start_ARG italic_θ end_ARG = roman_argmin caligraphic_L ( italic_θ ). Denote {𝒮(t,m)}m=1Msubscriptsuperscriptsuperscript𝒮𝑡𝑚𝑀𝑚1\{\mathcal{S}^{(t,m)}\}^{M}_{m=1}{ caligraphic_S start_POSTSUPERSCRIPT ( italic_t , italic_m ) end_POSTSUPERSCRIPT } start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_m = 1 end_POSTSUBSCRIPT as the minibatch index sets in the t𝑡titalic_t-th epoch. Then one should have 𝒮=m𝒮(t,m)𝒮subscript𝑚superscript𝒮𝑡𝑚\mathcal{S}=\bigcup_{m}\mathcal{S}^{(t,m)}caligraphic_S = ⋃ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT caligraphic_S start_POSTSUPERSCRIPT ( italic_t , italic_m ) end_POSTSUPERSCRIPT and 𝒮(t,m1)𝒮(t,m2)=superscript𝒮𝑡subscript𝑚1superscript𝒮𝑡subscript𝑚2\mathcal{S}^{(t,m_{1})}\bigcap\mathcal{S}^{(t,m_{2})}=\emptysetcaligraphic_S start_POSTSUPERSCRIPT ( italic_t , italic_m start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) end_POSTSUPERSCRIPT ⋂ caligraphic_S start_POSTSUPERSCRIPT ( italic_t , italic_m start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) end_POSTSUPERSCRIPT = ∅ for any t1𝑡1t\geq 1italic_t ≥ 1 and m1m2subscript𝑚1subscript𝑚2m_{1}\neq m_{2}italic_m start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ≠ italic_m start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT. For convenience, assume N𝑁Nitalic_N and M𝑀Mitalic_M are particularly designed so that n=N/M𝑛𝑁𝑀n=N/Mitalic_n = italic_N / italic_M is an integer and all minibatches have the same sample size as |𝒮(t,m)|=nsuperscript𝒮𝑡𝑚𝑛|\mathcal{S}^{(t,m)}|=n| caligraphic_S start_POSTSUPERSCRIPT ( italic_t , italic_m ) end_POSTSUPERSCRIPT | = italic_n. Then the updating formula of MGD can be expressed as

θ^(t,1)superscript^𝜃𝑡1\displaystyle\widehat{\theta}^{(t,1)}over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_t , 1 ) end_POSTSUPERSCRIPT =\displaystyle== θ^(t1,M)α˙(t,1)(θ^(t1,M)),superscript^𝜃𝑡1𝑀𝛼superscript˙𝑡1superscript^𝜃𝑡1𝑀\displaystyle\widehat{\theta}^{(t-1,M)}-\alpha\dot{\mathcal{L}}^{(t,1)}\Big{(}% \widehat{\theta}^{(t-1,M)}\Big{)},over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_t - 1 , italic_M ) end_POSTSUPERSCRIPT - italic_α over˙ start_ARG caligraphic_L end_ARG start_POSTSUPERSCRIPT ( italic_t , 1 ) end_POSTSUPERSCRIPT ( over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_t - 1 , italic_M ) end_POSTSUPERSCRIPT ) , (4.1)
θ^(t,m)superscript^𝜃𝑡𝑚\displaystyle\widehat{\theta}^{(t,m)}over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_t , italic_m ) end_POSTSUPERSCRIPT =\displaystyle== θ^(t,m1)α˙(t,m)(θ^(t,m1)) for 2mM,superscript^𝜃𝑡𝑚1𝛼superscript˙𝑡𝑚superscript^𝜃𝑡𝑚1 for 2𝑚𝑀\displaystyle\widehat{\theta}^{(t,m-1)}-\alpha\dot{\mathcal{L}}^{(t,m)}\Big{(}% \widehat{\theta}^{(t,m-1)}\Big{)}\mbox{ for }2\leq m\leq M,over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_t , italic_m - 1 ) end_POSTSUPERSCRIPT - italic_α over˙ start_ARG caligraphic_L end_ARG start_POSTSUPERSCRIPT ( italic_t , italic_m ) end_POSTSUPERSCRIPT ( over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_t , italic_m - 1 ) end_POSTSUPERSCRIPT ) for 2 ≤ italic_m ≤ italic_M ,

where α>0𝛼0\alpha>0italic_α > 0 is the learning rate, (t,m)(θ)=n1i𝒮(t,m)(Xi,Yi;θ)superscript𝑡𝑚𝜃superscript𝑛1subscript𝑖superscript𝒮𝑡𝑚subscript𝑋𝑖subscript𝑌𝑖𝜃\mathcal{L}^{(t,m)}(\theta)=n^{-1}\sum_{i\in\mathcal{S}^{(t,m)}}\ell(X_{i},Y_{% i};\theta)caligraphic_L start_POSTSUPERSCRIPT ( italic_t , italic_m ) end_POSTSUPERSCRIPT ( italic_θ ) = italic_n start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_i ∈ caligraphic_S start_POSTSUPERSCRIPT ( italic_t , italic_m ) end_POSTSUPERSCRIPT end_POSTSUBSCRIPT roman_ℓ ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ; italic_θ ) is the loss function for the m𝑚mitalic_m-th minibatch in the t𝑡titalic_t-th epoch, and ˙(t,m)(θ)superscript˙𝑡𝑚𝜃\dot{\mathcal{L}}^{(t,m)}(\theta)over˙ start_ARG caligraphic_L end_ARG start_POSTSUPERSCRIPT ( italic_t , italic_m ) end_POSTSUPERSCRIPT ( italic_θ ) is the first-order derivatives of (t,m)(θ)superscript𝑡𝑚𝜃\mathcal{L}^{(t,m)}(\theta)caligraphic_L start_POSTSUPERSCRIPT ( italic_t , italic_m ) end_POSTSUPERSCRIPT ( italic_θ ) with respect to θ𝜃\thetaitalic_θ.

To study the algorithm, Qi et al., 2023b first consider FMGD algorithm under the linear regression model with a fixed sample partition. Then the above updating formulas (4.1) naturally form a linear system. Under appropriate technical assumptions, Qi et al., 2023b show that the FMGD estimator converges linearly to the stable solution of the linear system as

θ^(t,m)θ^(m)ρα,Mt1θ^(0,m)θ^(m),normsuperscript^𝜃𝑡𝑚superscript^𝜃𝑚superscriptsubscript𝜌𝛼𝑀𝑡1normsuperscript^𝜃0𝑚superscript^𝜃𝑚\Big{\|}\widehat{\theta}^{(t,m)}-\widehat{\theta}^{(m)}\Big{\|}\leq\rho_{% \alpha,M}^{t-1}\Big{\|}\widehat{\theta}^{(0,m)}-\widehat{\theta}^{(m)}\Big{\|},∥ over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_t , italic_m ) end_POSTSUPERSCRIPT - over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_m ) end_POSTSUPERSCRIPT ∥ ≤ italic_ρ start_POSTSUBSCRIPT italic_α , italic_M end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT ∥ over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( 0 , italic_m ) end_POSTSUPERSCRIPT - over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_m ) end_POSTSUPERSCRIPT ∥ ,

where ρα,M(0,1)subscript𝜌𝛼𝑀01\rho_{\alpha,M}\in(0,1)italic_ρ start_POSTSUBSCRIPT italic_α , italic_M end_POSTSUBSCRIPT ∈ ( 0 , 1 ) is a contraction factor depending on α𝛼\alphaitalic_α and M𝑀Mitalic_M. The asymptotic normality result is also established by Qi et al., 2023b . However, Qi et al., 2023b find that the FMGD estimator is biased for any constant learning α>0𝛼0\alpha>0italic_α > 0. To further reduce the error upper bound, Qi et al., 2023b consider the diminishing learning rate scheduling. As long as t=1αt=subscriptsuperscript𝑡1subscript𝛼𝑡\sum^{\infty}_{t=1}\alpha_{t}=\infty∑ start_POSTSUPERSCRIPT ∞ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t = 1 end_POSTSUBSCRIPT italic_α start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = ∞ and t=1αt2<subscriptsuperscript𝑡1subscriptsuperscript𝛼2𝑡\sum^{\infty}_{t=1}\alpha^{2}_{t}<\infty∑ start_POSTSUPERSCRIPT ∞ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t = 1 end_POSTSUBSCRIPT italic_α start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT < ∞, the FMGD estimator should converge to the OLS estimator as t𝑡t\to\inftyitalic_t → ∞. Qi et al., 2023b then extend their theoretical investigations to random partition with shuffling and general loss function, and similar results are established.

4.3. Minibatch Gradient Descent with Momentum

Despite the practical usefulness of the MGD algorithm, it still can be extremely time-consuming for large-scale statistical analysis with high dimensional parameters, particularly in the research field of deep learning (Kingma and Ba,, 2014; He et al.,, 2016; Goodfellow et al.,, 2016; Devlin et al.,, 2019). To address this issue, various improved algorithms have been proposed. One direction is to investigate the accelerated gradient descent algorithm. As a first-order optimization method, gradient descent (GD) does not require the computation of the second derivative (i.e., the Hessian matrix) of the objective function. However, due to the neglect of the second-order information of the objective function, the numerical convergence rate of the standard GD algorithm is often much slower than that of second-order optimization algorithms such as Newton’s method. This is particularly true when the objective function is severely ill-conditioned. To address this issue, Polyak, (1964) proposed a so-called “heavy-ball” method. This method utilizes not only the gradient of the current step but also the information from the previous step (i.e., momentum). Specifically, it updates the estimates as

θ^(t)=θ^(t1)α˙(θ^(t1))+γ(θ^(t1)θ^(t2)),superscript^𝜃𝑡superscript^𝜃𝑡1𝛼˙superscript^𝜃𝑡1𝛾superscript^𝜃𝑡1superscript^𝜃𝑡2\displaystyle\widehat{\theta}^{(t)}=\widehat{\theta}^{(t-1)}-\alpha\dot{% \mathcal{L}}(\widehat{\theta}^{(t-1)})+\gamma\Big{(}\widehat{\theta}^{(t-1)}-% \widehat{\theta}^{(t-2)}\Big{)},over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT = over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_t - 1 ) end_POSTSUPERSCRIPT - italic_α over˙ start_ARG caligraphic_L end_ARG ( over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_t - 1 ) end_POSTSUPERSCRIPT ) + italic_γ ( over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_t - 1 ) end_POSTSUPERSCRIPT - over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_t - 2 ) end_POSTSUPERSCRIPT ) ,

where θ^(t)superscript^𝜃𝑡\widehat{\theta}^{(t)}over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT is the t𝑡titalic_t-th estimate, ˙(θ)˙𝜃\dot{\mathcal{L}}(\theta)over˙ start_ARG caligraphic_L end_ARG ( italic_θ ) is the gradient, α>0𝛼0\alpha>0italic_α > 0 is the learning rate, and γ>0𝛾0\gamma>0italic_γ > 0 is the momentum parameter. Further theoretical analysis by Polyak, (1964) showed that the convergence rate could be much improved compared to the standard GD algorithm by this modification. This method is now commonly referred to as the gradient descent with momentum (GDM).

The success of the momentum idea has attracted considerable attention from both theoretical and practical perspectives (Sutskever et al.,, 2013; Goodfellow et al.,, 2016; Bottou et al.,, 2018). For example, Nesterov, (1983) proposed a method that uses the momentum and predicted gradient to update the parameters. By a similar idea, Beck and Teboulle, (2009) developed an accelerated optimization algorithm for non-smooth objective functions. Kingma and Ba, (2014) proposed an adaptive momentum method called ADAM, which is widely used in the fields of deep learning. Cyrus et al., (2018) described a robust momentum method that generalizes the triple momentum method proposed in Van Scoy et al., (2017). Ma and Yarats, (2018) developed a more general variant called the quasi-hyperbolic momentum (QHM) algorithm, whose theoretical properties were further investigated by Gitman et al., (2019). In practice, when the whole dataset is too large to be loaded into the memory, one has to process the data in a minibatch-by-minibatch manner. This is particularly true for many deep learning tasks (Krizhevsky et al.,, 2012; Simonyan and Zisserman,, 2015; He et al.,, 2016). This leads to various minibatch-based GDM (MGDM) methods. In fact, the MGDM methods have been incorporated into many important software libraries, including TensorFlow (Abadi et al.,, 2016) and PyTorch (Paszke et al.,, 2019).

In recent years, many efforts have been devoted to the theoretical analysis of the MGDM methods. Most of these studies investigated various MGDM methods from an optimization perspective, and have shown that the momentum term can effectively improve the numerical convergence (Gitman et al.,, 2019; Loizou and Richtárik,, 2020; Liu et al.,, 2020; Assran and Rabbat,, 2020). A recent work of Tang et al., (2023) analyzed a PJR-averaging version of the MGDM method for general statistical optimization problems and established the asymptotic distribution of the resulting estimator. It is worth noting that most of the theoretical studies mentioned above typically require that the minibatches are sampled independently and identically from the whole dataset (or the population distribution). However, in practice, such as in TensorFlow or PyTorch, minibatches are often obtained through random partition. As mentioned before, random partition means that the whole dataset is randomly partitioned into several non-overlapping minibatches. Unfortunately, minibatches generated in this way no longer satisfy the aforementioned requirements. To bridge the gap between theory and practice, Gao et al., (2023) considered the random partition based MGDM algorithm as a linear dynamical system:

θ^(t,m)=θ^(t,m1)α˙(m)(θ^(t,m1))+γ(θ^(t,m1)θ^(t,m2)),for 1mM,formulae-sequencesuperscript^𝜃𝑡𝑚superscript^𝜃𝑡𝑚1𝛼subscript˙𝑚superscript^𝜃𝑡𝑚1𝛾superscript^𝜃𝑡𝑚1superscript^𝜃𝑡𝑚2for 1𝑚𝑀\displaystyle\widehat{\theta}^{(t,m)}=\widehat{\theta}^{(t,m-1)}-\alpha\dot{% \mathcal{L}}_{(m)}(\widehat{\theta}^{(t,m-1)})+\gamma\Big{(}\widehat{\theta}^{% (t,m-1)}-\widehat{\theta}^{(t,m-2)}\Big{)},\ \textup{for }1\leq m\leq M,over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_t , italic_m ) end_POSTSUPERSCRIPT = over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_t , italic_m - 1 ) end_POSTSUPERSCRIPT - italic_α over˙ start_ARG caligraphic_L end_ARG start_POSTSUBSCRIPT ( italic_m ) end_POSTSUBSCRIPT ( over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_t , italic_m - 1 ) end_POSTSUPERSCRIPT ) + italic_γ ( over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_t , italic_m - 1 ) end_POSTSUPERSCRIPT - over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_t , italic_m - 2 ) end_POSTSUPERSCRIPT ) , for 1 ≤ italic_m ≤ italic_M ,

where θ^(t,m)superscript^𝜃𝑡𝑚\widehat{\theta}^{(t,m)}over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_t , italic_m ) end_POSTSUPERSCRIPT is the estimate from the m𝑚mitalic_m-th minibatch in the t𝑡titalic_t-th epoch, and (m)(θ)subscript𝑚𝜃\mathcal{L}_{(m)}(\theta)caligraphic_L start_POSTSUBSCRIPT ( italic_m ) end_POSTSUBSCRIPT ( italic_θ ) is the gradient computed on the m𝑚mitalic_m-th minibatch (1mM)1𝑚𝑀(1\leq m\leq M)( 1 ≤ italic_m ≤ italic_M ).

Based on the linear regression model, a closed form of the stable solution to the above linear dynamical system can be obtained. Specifically, let θ^(m)superscript^𝜃𝑚\widehat{\theta}^{(m)}over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_m ) end_POSTSUPERSCRIPT be the stable solution corresponding to the m𝑚mitalic_m-th minibatch. Under appropriate conditions, one can obtain the following linear convergence result:

θ^(t,m)θ^(m)(ρα,γM+εn,t)t(θ^(0,m)θ^(m)+θ^(0,m1)θ^(m1)),normsuperscript^𝜃𝑡𝑚superscript^𝜃𝑚superscriptsuperscriptsubscript𝜌𝛼𝛾𝑀subscript𝜀𝑛𝑡𝑡normsuperscript^𝜃0𝑚superscript^𝜃𝑚normsuperscript^𝜃0𝑚1superscript^𝜃𝑚1\displaystyle\Big{\|}\widehat{\theta}^{(t,m)}-\widehat{\theta}^{(m)}\Big{\|}% \leq\Big{(}\rho_{\alpha,\gamma}^{M}+{\varepsilon}_{n,t}\Big{)}^{t}\Big{(}\Big{% \|}\widehat{\theta}^{(0,m)}-\widehat{\theta}^{(m)}\Big{\|}+\Big{\|}\widehat{% \theta}^{(0,m-1)}-\widehat{\theta}^{(m-1)}\Big{\|}\Big{)},∥ over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_t , italic_m ) end_POSTSUPERSCRIPT - over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_m ) end_POSTSUPERSCRIPT ∥ ≤ ( italic_ρ start_POSTSUBSCRIPT italic_α , italic_γ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT + italic_ε start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ( ∥ over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( 0 , italic_m ) end_POSTSUPERSCRIPT - over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_m ) end_POSTSUPERSCRIPT ∥ + ∥ over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( 0 , italic_m - 1 ) end_POSTSUPERSCRIPT - over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_m - 1 ) end_POSTSUPERSCRIPT ∥ ) ,

for each 1mM1𝑚𝑀1\leq m\leq M1 ≤ italic_m ≤ italic_M, where ρα,γ(0,1)subscript𝜌𝛼𝛾01\rho_{\alpha,\gamma}\in(0,1)italic_ρ start_POSTSUBSCRIPT italic_α , italic_γ end_POSTSUBSCRIPT ∈ ( 0 , 1 ) is the contraction factor controlling the convergence rate, and εn,tsubscript𝜀𝑛𝑡{\varepsilon}_{n,t}italic_ε start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT is some small number. By choosing appropriate tuning parameters α𝛼\alphaitalic_α and γ𝛾\gammaitalic_γ, one can achieve the minimal (and thus the optimal) contraction factor ρmin=(κ1)/(κ+1)subscript𝜌𝜅1𝜅1\rho_{\min}=(\sqrt{\kappa}-1)/(\sqrt{\kappa}+1)italic_ρ start_POSTSUBSCRIPT roman_min end_POSTSUBSCRIPT = ( square-root start_ARG italic_κ end_ARG - 1 ) / ( square-root start_ARG italic_κ end_ARG + 1 ), where κ𝜅\kappaitalic_κ is the condition number of the least squares problem. In addition, Gao et al., (2023) established the asymptotic normality for the stable solution. The results showed that the stable solution can be statistically as efficient as the whole sample OLS estimator, as long as learning rate α𝛼\alphaitalic_α is sufficiently small. However, this interesting result relies on the least squares loss function. Investigating the MGDM algorithm based on a randomly partition strategy under general loss functions is a problem worth further exploration.

4.4. Communication Reduction for Minibatch Gradient Descent Algorithm

Another research direction focuses on how to reduce communication costs during the training process of MGD algorithms. Massive datasets are usually stored on hard drives due to a limited storage capacity of the computation devices. However, the data need to be transmitted into the system RAM and/or graphical memory before gradient computation. Therefore, the communication cost during the training process of the MGD algorithm might also lead to expensive time consumption (Pumma et al.,, 2017; Chien et al.,, 2018). The communication mechanism of MGD algorithm also causes another problem when computing in a CPU-GPU system, that is, the GPU idling problem. This problem refers to the phenomenon where the GPU waits for the CPU to perform data communication before gradient computation during the training process (Zinkevich et al.,, 2010; Zhang et al.,, 2016; Choi et al.,, 2019). It could become even worse when the number of updates is extremely large. Then how to reduce the communication cost during the training of MGD algorithms for faster computation becomes a problem of great interest (Bauer et al.,, 2011; Chien et al.,, 2018; Zhu et al.,, 2019; Ofeidis et al.,, 2022).

This problem can be solved by either advanced hardware technology or algorithmic innovation. There has been remarkable technological advancement in the first direction. For example, the GPUDirect Storage technology developed by NVIDIA enables direct communication between the hard drive and the GPU graphical memory. Furthermore, their Remote Direct Memory Access (RDMA) technology provides a direct path among different GPUs within a distributed system. As for the second direction, various methods have been proposed to either reduce the idling time or improve the communication efficiency. For example, Choi et al., (2019) proposed a data echoing method, which repeatedly computes the gradient on the currently loaded minibatch data before the next minibatch to be prepared. Hoffer et al., (2019) further considered applying different data augmentation method to the currently loaded minibatch data. To reduce the idling time of GPU and enhance computation efficiency, the idea of pre-loading and buffering has also received a lot of attention; see for example the data processing pipelines (Nitzberg and Lo,, 1997; Ma et al.,, 2003; Bauer et al.,, 2011) and asynchronous data loading (Zhu et al.,, 2019; Ofeidis et al.,, 2022). Despite their usefulness, the existing methods still suffer from several challenges. First, novel hardware technologies often require substantial engineering labors and financial expenses. As a result, these methods have not been extensively adopted by general practitioners and researchers for now. Second, since solving the GPU idling problem is quite engineering-oriented, most existing algorithms and training strategies lack theoretical analysis. Consequently, it is of great importance to investigate the theoretical properties of those methods under certain framework.

To address this issue, Qi et al., 2023a consider a buffered minibatch gradient descent (BMGD) algorithm. The proposed BMGD algorithm consists of two steps, that is the buffering step and the computation step. In the buffering step, a large amount of data are loaded into the CPU memory. To this end, assume that the entire sample can be divided into K𝐾Kitalic_K non-overlapping blocks, which are referred to as the buffered data. Their indices are collected by 𝒮r,ksubscript𝒮𝑟𝑘\mathcal{S}_{r,k}caligraphic_S start_POSTSUBSCRIPT italic_r , italic_k end_POSTSUBSCRIPT (1rR,1kKformulae-sequence1𝑟𝑅1𝑘𝐾1\leq r\leq R,1\leq k\leq K1 ≤ italic_r ≤ italic_R , 1 ≤ italic_k ≤ italic_K). For all 1rR1𝑟𝑅1\leq r\leq R1 ≤ italic_r ≤ italic_R, one can obtain 𝒮=k𝒮r,k𝒮subscript𝑘subscript𝒮𝑟𝑘\mathcal{S}=\bigcup_{k}\mathcal{S}_{r,k}caligraphic_S = ⋃ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT caligraphic_S start_POSTSUBSCRIPT italic_r , italic_k end_POSTSUBSCRIPT and 𝒮r,k1𝒮r,k2=subscript𝒮𝑟subscript𝑘1subscript𝒮𝑟subscript𝑘2\mathcal{S}_{r,k_{1}}\cap\mathcal{S}_{r,k_{2}}=\emptysetcaligraphic_S start_POSTSUBSCRIPT italic_r , italic_k start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∩ caligraphic_S start_POSTSUBSCRIPT italic_r , italic_k start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT = ∅ for k1k2subscript𝑘1subscript𝑘2k_{1}\neq k_{2}italic_k start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ≠ italic_k start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT. Then in the computation step, a standard MGD updating procedure is applied on the buffered data. Assume that each buffered data can be further decomposed into M𝑀Mitalic_M minibatches. Let 𝒮r,k(t,m)superscriptsubscript𝒮𝑟𝑘𝑡𝑚\mathcal{S}_{r,k}^{(t,m)}caligraphic_S start_POSTSUBSCRIPT italic_r , italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_t , italic_m ) end_POSTSUPERSCRIPT be the index set of the sample calculated on the m𝑚mitalic_m-th minibatch in the t𝑡titalic_t-th epoch for the k𝑘kitalic_k-th buffer in the r𝑟ritalic_r-th iteration and assume that |𝒮r,k(t,m)|=nsuperscriptsubscript𝒮𝑟𝑘𝑡𝑚𝑛|\mathcal{S}_{r,k}^{(t,m)}|=n| caligraphic_S start_POSTSUBSCRIPT italic_r , italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_t , italic_m ) end_POSTSUPERSCRIPT | = italic_n for all r,k,t,m𝑟𝑘𝑡𝑚r,k,t,mitalic_r , italic_k , italic_t , italic_m. By inheriting the model assumptions in Section 4.1, the updating formula of the BMGD algorithm can be written as

θ^r,k(t,1)superscriptsubscript^𝜃𝑟𝑘𝑡1\displaystyle\widehat{\theta}_{r,k}^{(t,1)}over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_r , italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_t , 1 ) end_POSTSUPERSCRIPT =\displaystyle== θ^r,k(t1,M)α˙r,k(t,1)(θ^r,k(t1,M)),superscriptsubscript^𝜃𝑟𝑘𝑡1𝑀𝛼superscriptsubscript˙𝑟𝑘𝑡1superscriptsubscript^𝜃𝑟𝑘𝑡1𝑀\displaystyle\widehat{\theta}_{r,k}^{(t-1,M)}-\alpha\dot{\mathcal{L}}_{r,k}^{(% t,1)}\left(\widehat{\theta}_{r,k}^{(t-1,M)}\right),over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_r , italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_t - 1 , italic_M ) end_POSTSUPERSCRIPT - italic_α over˙ start_ARG caligraphic_L end_ARG start_POSTSUBSCRIPT italic_r , italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_t , 1 ) end_POSTSUPERSCRIPT ( over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_r , italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_t - 1 , italic_M ) end_POSTSUPERSCRIPT ) ,
θ^r,k(t,m)superscriptsubscript^𝜃𝑟𝑘𝑡𝑚\displaystyle\widehat{\theta}_{r,k}^{(t,m)}over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_r , italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_t , italic_m ) end_POSTSUPERSCRIPT =\displaystyle== θ^r,k(t,m1)α˙r,k(t,m)(θ^r,k(t,m1)) for 2mM,superscriptsubscript^𝜃𝑟𝑘𝑡𝑚1𝛼superscriptsubscript˙𝑟𝑘𝑡𝑚superscriptsubscript^𝜃𝑟𝑘𝑡𝑚1 for 2𝑚𝑀\displaystyle\widehat{\theta}_{r,k}^{(t,m-1)}-\alpha\dot{\mathcal{L}}_{r,k}^{(% t,m)}\left(\widehat{\theta}_{r,k}^{(t,m-1)}\right)\mbox{\ \ for\ \ }2\leq m% \leq M,over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_r , italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_t , italic_m - 1 ) end_POSTSUPERSCRIPT - italic_α over˙ start_ARG caligraphic_L end_ARG start_POSTSUBSCRIPT italic_r , italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_t , italic_m ) end_POSTSUPERSCRIPT ( over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_r , italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_t , italic_m - 1 ) end_POSTSUPERSCRIPT ) for 2 ≤ italic_m ≤ italic_M , (4.2)

where α>0𝛼0\alpha>0italic_α > 0 is the learning rate, and ˙r,k(t,m)(θ)=n1i𝒮r,k(t,m)˙(Xi,Yi;θ)superscriptsubscript˙𝑟𝑘𝑡𝑚𝜃superscript𝑛1subscript𝑖subscriptsuperscript𝒮𝑡𝑚𝑟𝑘˙subscript𝑋𝑖subscript𝑌𝑖𝜃\dot{\mathcal{L}}_{r,k}^{(t,m)}(\theta)=n^{-1}\sum_{i\in\mathcal{S}^{(t,m)}_{r% ,k}}\dot{\ell}(X_{i},Y_{i};\theta)over˙ start_ARG caligraphic_L end_ARG start_POSTSUBSCRIPT italic_r , italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_t , italic_m ) end_POSTSUPERSCRIPT ( italic_θ ) = italic_n start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_i ∈ caligraphic_S start_POSTSUPERSCRIPT ( italic_t , italic_m ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_r , italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT over˙ start_ARG roman_ℓ end_ARG ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ; italic_θ ) is the gradient computed on minibatch 𝒮r,k(t,m)subscriptsuperscript𝒮𝑡𝑚𝑟𝑘\mathcal{S}^{(t,m)}_{r,k}caligraphic_S start_POSTSUPERSCRIPT ( italic_t , italic_m ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_r , italic_k end_POSTSUBSCRIPT. The buffering idea of the BMGD algorithm can help reduce the GPU idling time and improve communication efficiency. A rigorous asymptotic theory was developed by Qi et al., 2023a to support the BMGD method. Its applicability is also extended to the Polyak-Lojasiewicz (PL) function class, which not only contains a wide range of statistical models (e.g., the generalized linear model) but also some non-convex loss functions.

4.5. Learning Rate Scheduling

One crucial component to the minibatch gradient descent algorithm and its variants is an appropriate learning rate scheduling. Despite previous efforts showing that the algorithm should converge under certain conditions, determining the correct learning rate in real practice largely relies on subjective judgment (Ruder,, 2016). If the learning rate is set inappropriately, it can lead to training failure or slow convergence. To address this issue, various scheduling approaches have been proposed to achieve adaptive adjustment of the learning rate. Examples include rule-based scheduling, such as the step decay scheduler (Ge et al.,, 2019) and the “reduce learning rate on plateau” method (Nakamura et al.,, 2021). These methods automatically reduce the learning rate when optimization encounters bottlenecks. Besides, Duchi et al., (2011) proposed AdaGrad, which iteratively decreases the step size based on a pre-specified function. However, AdaGrad requires setting an additional parameter related to the learning rate, which needs to be subjectively chosen. Extensions of AdaGrad, such as RMSProp (Mukkamala and Hein,, 2017) and AdaDelta (Zeiler,, 2012), have been proposed. RMSProp introduces a decay factor to adjust the weights of previous sample gradients. Moreover, Adam (Kingma and Ba,, 2014) combines RMSProp with a momentum-based method called adaptive moment estimation. In Adam, both the step size and update direction are adjusted during each iteration. However, because the step sizes are adjusted without considering the loss function, the loss reduction obtained for each update step is suboptimal. Therefore, the convergence rate can still be further improved. New techniques have been introduced to adjust step sizes in gradient-based optimization methods. For instance, Baydin et al., (2017) introduced a “learning rate for the learning rate” hyperparameter, which is updated using gradient descent to adjust the learning rate. Shu et al., (2022) have built an additional network to predict learning rate values in different iterations. However, those works bring more hyperparameters and thus result in more parameter tuning as well as more uncertainties.

From the perspective of optimization, the key to improving training algorithms lies in how to appropriately exploit 1st-order information (i.e., the gradient) and the 2nd-order information (i.e., the Hessian matrix). When the scale of parameters to be estimated is relatively small, the most common approach has been to apply the Newton’s method. For training neural networks with a large number of parameters, several generalized optimization methods (Tan et al.,, 2016; Agarwal et al.,, 2017; Bergou et al.,, 2022; Gargiani et al.,, 2020) have been proposed, inspired by the Newton–Raphson iteration. Due to the high computational cost, existing studies have tried to approximate the Hessian matrix, including the Barzilai-Borwein method (Tan et al.,, 2016), subsampling methods (Agarwal et al.,, 2017; Bergou et al.,, 2022), and generalized Gauss-Newton method (Gargiani et al.,, 2020). However, most of those methods still involve the computation and storage of 1st- and 2nd-order derivatives. Thus, they may be less efficient or even not feasible practically, when the dimension of parameter estimation is extremely high (Sutskever,, 2013).

In order to achieve automatic and nearly optimal optimization, Zhu et al., 2021b propose a novel optimization method based on local quadratic approximation (LQA). Viewing the learning rate as a time-varying parameter, they treat the loss reduction as a function of the temporal learning rate. Then, the learning rate is dynamically adjusted through the maximization of the loss reduction. Their aim is to make use of 2nd-order derivative information to accelerate the optimization while avoiding calculating the 2nd-order derivatives directly. To this end, they combine techniques of Taylor expansion and quadratic approximation to propose an improved optimization algorithm with low computational costs.

Denote the time-varying learning rate as αt,ksubscript𝛼𝑡𝑘\alpha_{t,k}italic_α start_POSTSUBSCRIPT italic_t , italic_k end_POSTSUBSCRIPT, where t𝑡titalic_t and k𝑘kitalic_k are indices of iteration and minibatch, respectively. Given a loss function (X;θ)𝑋𝜃\mathcal{L}(X;\theta)caligraphic_L ( italic_X ; italic_θ ) with the model parameter θ𝜃\thetaitalic_θ and input sample X𝑋Xitalic_X, let θ^(t,k)superscript^𝜃𝑡𝑘\widehat{\theta}^{(t,k)}over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_t , italic_k ) end_POSTSUPERSCRIPT and Δ(αt,k)Δsubscript𝛼𝑡𝑘\Delta\mathcal{L}(\alpha_{t,k})roman_Δ caligraphic_L ( italic_α start_POSTSUBSCRIPT italic_t , italic_k end_POSTSUBSCRIPT ) denote the estimate and the loss reduction at the k𝑘kitalic_k-th minibatch of the t𝑡titalic_t-th iteration of the training, respectively. For simplicity, let gt,k=|𝒮k|1i𝒮k˙(θ^(t,k))subscript𝑔𝑡𝑘superscriptsubscript𝒮𝑘1subscript𝑖subscript𝒮𝑘˙superscript^𝜃𝑡𝑘g_{t,k}=|\mathcal{S}_{k}|^{-1}\sum_{i\in\mathcal{S}_{k}}\dot{\mathcal{L}}(% \widehat{\theta}^{(t,k)})italic_g start_POSTSUBSCRIPT italic_t , italic_k end_POSTSUBSCRIPT = | caligraphic_S start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT | start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_i ∈ caligraphic_S start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT over˙ start_ARG caligraphic_L end_ARG ( over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_t , italic_k ) end_POSTSUPERSCRIPT ) denote the current gradient, where 𝒮ksubscript𝒮𝑘\mathcal{S}_{k}caligraphic_S start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT is the index set of the k𝑘kitalic_k-th minibatch. Then, the LQA method adopts Taylor expansion to explore the loss reduction. From the Taylor expansion of (θ)𝜃\mathcal{L}(\theta)caligraphic_L ( italic_θ ) around θ^(t,k)superscript^𝜃𝑡𝑘\widehat{\theta}^{(t,k)}over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_t , italic_k ) end_POSTSUPERSCRIPT, one can obtain the following approximation,

Δ(αt,k)Δsubscript𝛼𝑡𝑘\displaystyle\Delta\mathcal{L}(\alpha_{t,k})roman_Δ caligraphic_L ( italic_α start_POSTSUBSCRIPT italic_t , italic_k end_POSTSUBSCRIPT ) =1|𝒮k|i𝒮k{(Xi;θ^(t,k)αt,kgt,k)(Xi;θ^(t,k))}absent1subscript𝒮𝑘subscript𝑖subscript𝒮𝑘subscript𝑋𝑖superscript^𝜃𝑡𝑘subscript𝛼𝑡𝑘subscript𝑔𝑡𝑘subscript𝑋𝑖superscript^𝜃𝑡𝑘\displaystyle=\frac{1}{|\mathcal{S}_{k}|}\sum_{i\in\mathcal{S}_{k}}\left\{% \mathcal{L}\left(X_{i};\widehat{\theta}^{(t,k)}-\alpha_{t,k}g_{t,k}\right)-% \mathcal{L}\left(X_{i};\widehat{\theta}^{(t,k)}\right)\right\}= divide start_ARG 1 end_ARG start_ARG | caligraphic_S start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT | end_ARG ∑ start_POSTSUBSCRIPT italic_i ∈ caligraphic_S start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT { caligraphic_L ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ; over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_t , italic_k ) end_POSTSUPERSCRIPT - italic_α start_POSTSUBSCRIPT italic_t , italic_k end_POSTSUBSCRIPT italic_g start_POSTSUBSCRIPT italic_t , italic_k end_POSTSUBSCRIPT ) - caligraphic_L ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ; over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_t , italic_k ) end_POSTSUPERSCRIPT ) }
=at,kαt,k+bt,kαt,k+o(αt,k2gt,kgt,k),absentsubscript𝑎𝑡𝑘subscript𝛼𝑡𝑘subscript𝑏𝑡𝑘subscript𝛼𝑡𝑘𝑜subscriptsuperscript𝛼2𝑡𝑘superscriptsubscript𝑔𝑡𝑘topsubscript𝑔𝑡𝑘\displaystyle=-a_{t,k}\alpha_{t,k}+b_{t,k}\alpha_{t,k}+o(\alpha^{2}_{t,k}g_{t,% k}^{\top}g_{t,k}),= - italic_a start_POSTSUBSCRIPT italic_t , italic_k end_POSTSUBSCRIPT italic_α start_POSTSUBSCRIPT italic_t , italic_k end_POSTSUBSCRIPT + italic_b start_POSTSUBSCRIPT italic_t , italic_k end_POSTSUBSCRIPT italic_α start_POSTSUBSCRIPT italic_t , italic_k end_POSTSUBSCRIPT + italic_o ( italic_α start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t , italic_k end_POSTSUBSCRIPT italic_g start_POSTSUBSCRIPT italic_t , italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_g start_POSTSUBSCRIPT italic_t , italic_k end_POSTSUBSCRIPT ) , (4.3)

where the two constants are given by at,k=|𝒮k|1i𝒮k˙(Xi;θ^(t,k))gt,ksubscript𝑎𝑡𝑘superscriptsubscript𝒮𝑘1subscript𝑖subscript𝒮𝑘˙subscript𝑋𝑖superscript^𝜃𝑡𝑘subscript𝑔𝑡𝑘a_{t,k}=|\mathcal{S}_{k}|^{-1}\sum_{i\in\mathcal{S}_{k}}\dot{\mathcal{L}}\left% (X_{i};\widehat{\theta}^{(t,k)}\right)g_{t,k}italic_a start_POSTSUBSCRIPT italic_t , italic_k end_POSTSUBSCRIPT = | caligraphic_S start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT | start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_i ∈ caligraphic_S start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT over˙ start_ARG caligraphic_L end_ARG ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ; over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_t , italic_k ) end_POSTSUPERSCRIPT ) italic_g start_POSTSUBSCRIPT italic_t , italic_k end_POSTSUBSCRIPT and bt,k=(2|𝒮k|)1i𝒮kgt,k¨(Xi;θ^(t,k))gt,ksubscript𝑏𝑡𝑘superscript2subscript𝒮𝑘1subscript𝑖subscript𝒮𝑘superscriptsubscript𝑔𝑡𝑘top¨subscript𝑋𝑖superscript^𝜃𝑡𝑘subscript𝑔𝑡𝑘b_{t,k}=(2|\mathcal{S}_{k}|)^{-1}\sum_{i\in\mathcal{S}_{k}}g_{t,k}^{\top}\ddot% {\mathcal{L}}\left(X_{i};\widehat{\theta}^{(t,k)}\right)g_{t,k}italic_b start_POSTSUBSCRIPT italic_t , italic_k end_POSTSUBSCRIPT = ( 2 | caligraphic_S start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT | ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_i ∈ caligraphic_S start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_g start_POSTSUBSCRIPT italic_t , italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT over¨ start_ARG caligraphic_L end_ARG ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ; over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_t , italic_k ) end_POSTSUPERSCRIPT ) italic_g start_POSTSUBSCRIPT italic_t , italic_k end_POSTSUBSCRIPT with ˙(θ^(t,k))˙superscript^𝜃𝑡𝑘\dot{\mathcal{L}}(\widehat{\theta}^{(t,k)})over˙ start_ARG caligraphic_L end_ARG ( over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_t , italic_k ) end_POSTSUPERSCRIPT ) and ¨(θ^(t,k))¨superscript^𝜃𝑡𝑘\ddot{\mathcal{L}}(\widehat{\theta}^{(t,k)})over¨ start_ARG caligraphic_L end_ARG ( over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_t , italic_k ) end_POSTSUPERSCRIPT ) denoting the 1st- and 2nd-order derivatives of the local loss function, respectively. Since the higher order terms here are negligible, equation (4.3) can be simply written as Δ(αt,k)at,kδt,k+bt,kδt,k2Δsubscript𝛼𝑡𝑘subscript𝑎𝑡𝑘subscript𝛿𝑡𝑘subscript𝑏𝑡𝑘subscriptsuperscript𝛿2𝑡𝑘\Delta\mathcal{L}(\alpha_{t,k})\approx-a_{t,k}\delta_{t,k}+b_{t,k}\delta^{2}_{% t,k}roman_Δ caligraphic_L ( italic_α start_POSTSUBSCRIPT italic_t , italic_k end_POSTSUBSCRIPT ) ≈ - italic_a start_POSTSUBSCRIPT italic_t , italic_k end_POSTSUBSCRIPT italic_δ start_POSTSUBSCRIPT italic_t , italic_k end_POSTSUBSCRIPT + italic_b start_POSTSUBSCRIPT italic_t , italic_k end_POSTSUBSCRIPT italic_δ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t , italic_k end_POSTSUBSCRIPT. To optimize Δ(αt,k)Δsubscript𝛼𝑡𝑘\Delta\mathcal{L}(\alpha_{t,k})roman_Δ caligraphic_L ( italic_α start_POSTSUBSCRIPT italic_t , italic_k end_POSTSUBSCRIPT ) with respect to αt,ksubscript𝛼𝑡𝑘\alpha_{t,k}italic_α start_POSTSUBSCRIPT italic_t , italic_k end_POSTSUBSCRIPT, one can take the corresponding derivative of the loss reduction, which leads to the approximated optimal learning rate αt,k*=(2bt,k)1at,ksubscriptsuperscript𝛼𝑡𝑘superscript2subscript𝑏𝑡𝑘1subscript𝑎𝑡𝑘\alpha^{*}_{t,k}=(2b_{t,k})^{-1}a_{t,k}italic_α start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t , italic_k end_POSTSUBSCRIPT = ( 2 italic_b start_POSTSUBSCRIPT italic_t , italic_k end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT italic_a start_POSTSUBSCRIPT italic_t , italic_k end_POSTSUBSCRIPT. This suggests that once the coefficients at,ksubscript𝑎𝑡𝑘a_{t,k}italic_a start_POSTSUBSCRIPT italic_t , italic_k end_POSTSUBSCRIPT and bt,ksubscript𝑏𝑡𝑘b_{t,k}italic_b start_POSTSUBSCRIPT italic_t , italic_k end_POSTSUBSCRIPT are estimated, a nearly optimal choice for the learning rate can thus be determined. In this way, the reduction in the loss function is nearly optimal for each batch step. As a consequence, the total number of iterations required for convergence can be significantly reduced, allowing for the algorithm to converge much faster than usual.

5. CONCLUDING REMARKS

This paper gives a selective review about three categories of statistical computation methods for massive data analysis. Firstly, we considered the distributed computing methods, which provide a feasible solution when the data size is too large to be comfortably accommodated by one single computer. Secondly, we considered the subsampling methods, which are practically useful, when limited computing resources are available for a massive data analysis task. Finally, we considered the minibatch gradient techniques, which have been widely used for the training of complicated models with a large number of parameters, such as deep neural network models.

To conclude this paper, we discuss here some potential future research directions. Firstly, we remark that all the methods reviewed in this work are developed for independent data. On the other side, datasets with sophisticated dependence structure (e.g., spatial-temporal data) are commonly encountered in real practice. Then how to conduct distributed computation for data with sophisticated dependent structure is an important research direction. Second, all the methods reviewed in this work are all developed for models with relatively limited dimension and simple structure. On the other side, models with extremely high dimension and extremely complicated structure are increasingly available. This is particularly true for various well-celebrated deep learning models. Then how to develop statistical computing theory for models with extremely high dimension and extremely sophisticated structure is another important research direction. Third, all the methods reviewed in this work assume a global model universally for all the data points with a common set of model parameters. However, this seems an obviously unrealistic assumption for dataset with a massive size. Then how to conduct effective statistical learning with more flexible model parameters is also an important research direction. Lastly, we remark that the three categories of methods reviewed in this paper can be combined together for even better performance. See for example Yu et al., (2022) and Chen et al., 2022b . This is the last research direction worth pursuing.

Disclosure Statement

No potential conflict of interest is reported by the authors.

REFERENCES

  • Abadi et al., (2016) Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I. J., Harp, A., Irving, G., Isard, M., Jia, Y., Józefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D. G., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P. A., Vanhoucke, V., Vasudevan, V., Viégas, F. B., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., & Zheng, X. (2016). TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv: 1603.04467.
  • Agarwal et al., (2017) Agarwal, N., Bullins, B., & Hazan, E. (2017). Second-order stochastic optimization for machine learning in linear time. Journal of Machine Learning Research, 18(1), 4148–4187.
  • Ai et al., (2021) Ai, M., Yu, J., Zhang, H., & Wang, H. (2021). Optimal subsampling algorithms for big data regressions. Statistica Sinica, 31(2), 749–772.
  • Alhamzawi and Ali, (2018) Alhamzawi, R. & Ali, H. T. M. (2018). Bayesian quantile regression for ordinal longitudinal data. Journal of Applied Statistics, 45(5), 815–828.
  • Assran and Rabbat, (2020) Assran, M. & Rabbat, M. (2020). On the convergence of Nesterov’s accelerated gradient method in stochastic settings. In International Conference on Machine Learning. PMLR.
  • Bach and Moulines, (2013) Bach, F. & Moulines, E. (2013). Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n)𝑂1𝑛O(1/n)italic_O ( 1 / italic_n ). In Advances in Neural Information Processing Systems. Curran Associates, Inc.
  • Battey et al., (2018) Battey, H., Fan, J., Liu, H., Lu, J., & Zhu, Z. (2018). Distributed testing and estimation under sparse high dimensional models. The Annals of Statistics, 46(3), 1352–1382.
  • Bauer et al., (2011) Bauer, M., Cook, H., & Khailany, B. (2011). CudaDMA: Optimizing GPU memory bandwidth via warp specialization. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis. Association for Computing Machinery.
  • Baydin et al., (2017) Baydin, A. G., Cornish, R., Rubio, D. M., Schmidt, M., & Wood, F. (2017). Online learning rate adaptation with hypergradient descent. arXiv: 1703.04782.
  • Beck, (2017) Beck, A. (2017). First-order methods in optimization. Society for Industrial and Applied Mathematics.
  • Beck and Teboulle, (2009) Beck, A. & Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1), 183–202.
  • Bellet et al., (2018) Bellet, A., Guerraoui, R., Taziki, M., & Tommasi, M. (2018). Personalized and private peer-to-peer machine learning. In Proceedings of the 21st International Conference on Artificial Intelligence and Statistics. PMLR.
  • Bergou et al., (2022) Bergou, E. H., Diouane, Y., Kunc, V., Kungurtsev, V., & Royer, C. W. (2022). A subsampling line-search method with second-order results. INFORMS Journal on Optimization, 4(4), 403–425.
  • Bickel and Freedman, (1981) Bickel, P. J. & Freedman, D. A. (1981). Some asymptotic theory for the bootstrap. The Annals of Statistics, 9(6), 1196–1217.
  • Bickel et al., (1997) Bickel, P. J., Götze, F., & van Zwet, W. R. (1997). Resampling fewer than n𝑛nitalic_n observations: Gains, losses, and remedies for losses. Statistica Sinica, 7(1), 1–31.
  • Blot et al., (2016) Blot, M., Picard, D., Cord, M., & Thome, N. (2016). Gossip training for deep learning. arXiv: 1611.09726.
  • Bottou et al., (2018) Bottou, L., Curtis, F. E., & Nocedal, J. (2018). Optimization methods for large-scale machine learning. SIAM Review, 60(2), 223–311.
  • Broyden et al., (1973) Broyden, C. G., Dennis Jr, J. E., & Moré, J. J. (1973). On the local and superlinear convergence of quasi-Newton methods. IMA Journal of Applied Mathematics, 12(3), 223–245.
  • Casella and Berger, (2002) Casella, G. & Berger, R. L. (2002). Statistical inference. Duxbury Pacific Grove, CA.
  • Chang et al., (2017) Chang, X., Lin, S., & Wang, Y. (2017). Divide and conquer local average regression. Electronic Journal of Statistics, 11(1), 1326–1350.
  • Chen et al., (2013) Chen, C. W., Dunson, D. B., Reed, C., & Yu, K. (2013). Bayesian variable selection in quantile regression. Statistics and its Interface, 6(2), 261–274.
  • (22) Chen, S., Yu, D., Zou, Y., Yu, J., & Cheng, X. (2022). Decentralized wireless federated learning with differential privacy. IEEE Transactions on Industrial Informatics, 18(9), 6273–6282.
  • Chen and Peng, (2021) Chen, S. X. & Peng, L. (2021). Distributed statistical inference for massive data. The Annals of Statistics, 49(5), 2851–2869.
  • Chen et al., (2014) Chen, W., Wang, Z., & Zhou, J. (2014). Large-scale L-BFGS using mapreduce. In Advances in Neural Information Processing Systems. Curran Associates, Inc.
  • (25) Chen, X., Lee, J. D., Tong, X. T., & Zhang, Y. (2020). Statistical inference for model parameters in stochastic gradient descent. The Annals of Statistics, 48(1), 251–273.
  • (26) Chen, X., Liu, W., Mao, X., & Yang, Z. (2020). Distributed high-dimensional regression under a quantile loss function. Journal of Machine Learning Research, 21(1), 7432–7474.
  • Chen et al., (2019) Chen, X., Liu, W., & Zhang, Y. (2019). Quantile regression under memory constraint. The Annals of Statistics, 47(6), 3244–3273.
  • (28) Chen, X., Liu, W., & Zhang, Y. (2022). First-order Newton-type estimator for distributed estimation and inference. Journal of the American Statistical Association, 117(540), 1858–1874.
  • Chen and Xie, (2014) Chen, X. & Xie, M. (2014). A split-and-conquer approach for analysis of extraordinarily large data. Statistica Sinica, 24(4), 1655–1684.
  • (30) Chen, Z., Mou, S., & Maguluri, S. T. (2022). Stationary behavior of constant stepsize SGD type algorithms: An asymptotic characterization. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 6(1), 1–24.
  • Chien et al., (2018) Chien, S. W. D., Markidis, S., Sishtla, C. P., Santos, L., Herman, P., Narasimhamurthy, S., & Laure, E. (2018). Characterizing deep-learning I/O workloads in TensorFlow. In International Workshop on Parallel Data Storage and Data Intensive Scalable Computing Systems.
  • Choi et al., (2019) Choi, D., Passos, A., Shallue, C. J., & Dahl, G. E. (2019). Faster neural network training with data echoing. arXiv: 1907.05550.
  • Crane and Roosta, (2019) Crane, R. & Roosta, F. (2019). DINGO: Distributed Newton-type method for gradient-norm optimization. In Advances in Neural Information Processing Systems. Curran Associates, Inc.
  • Cyrus et al., (2018) Cyrus, S., Hu, B., Van Scoy, B., & Lessard, L. (2018). A robust accelerated optimization algorithm for strongly convex functions. In 2018 Annual American Control Conference.
  • Davidon, (1991) Davidon, W. C. (1991). Variable metric method for minimization. SIAM Journal on Optimization, 1(1), 1–17.
  • Defazio et al., (2014) Defazio, A., Bach, F., & Lacoste Julien, S. (2014). SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems. Curran Associates, Inc.
  • Deng et al., (2009) Deng, J., Dong, W., Socher, R., Li, Li, J., Li, K., & Li, F. (2009). ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition.
  • Devlin et al., (2019) Devlin, J., Chang, M., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics.
  • Dieuleveut et al., (2020) Dieuleveut, A., Durmus, A., & Bach, F. (2020). Bridging the gap between constant step size stochastic gradient descent and Markov chains. The Annals of Statistics, 48(3), 1348–1382.
  • Drineas et al., (2006) Drineas, P., Mahoney, M. W., & Muthukrishnan, S. (2006). Sampling algorithms for 2subscript2\ell_{2}roman_ℓ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT regression and applications. In Proceedings of the 17th Annual ACM-SIAM Symposium on Discrete Algorithm.
  • Drineas et al., (2011) Drineas, P., Mahoney, M. W., Muthukrishnan, S., & Sarlós, T. (2011). Faster least squares approximation. Numerische Mathematik, 117(2), 219–249.
  • Duchi et al., (2011) Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(7), 257–269.
  • Efron, (1979) Efron, B. (1979). Bootstrap methods: Another look at the jackknife. The Annals of Statistics, 7(1), 1–26.
  • Efron and Stein, (1981) Efron, B. & Stein, C. (1981). The jackknife estimate of variance. The Annals of Statistics, 9(3), 586–596.
  • Eisen et al., (2017) Eisen, M., Mokhtari, A., & Ribeiro, A. (2017). Decentralized quasi-Newton methods. IEEE Transactions on Signal Processing, 65(10), 2613–2628.
  • Fan et al., (2023) Fan, J., Guo, Y., & Wang, K. (2023). Communication-efficient accurate statistical estimation. Journal of the American Statistical Association, 118(542), 1000–1010.
  • Fan et al., (2020) Fan, J., Li, R., Zhang, C., & Zou, H. (2020). Statistical foundations of data science. Chapman and Hall/CRC.
  • Fan and Lv, (2008) Fan, J. & Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(5), 849–911.
  • Fan et al., (2019) Fan, J., Wang, D., Wang, K., & Zhu, Z. (2019). Distributed estimation of principal eigenspaces. The Annals of Statistics, 47(6), 3009–3031.
  • Gao et al., (2019) Gao, D., Ju, C., Wei, X., Liu, Y., Chen, T., & Yang, Q. (2019). HHHFL: Hierarchical heterogeneous horizontal federated learning for electroencephalography. arXiv: 1909.05784.
  • Gao et al., (2021) Gao, Y., Li, J., Zhou, Y., Xiao, F., & Liu, H. (2021). Optimization methods for large-scale machine learning. In 2021 18th International Computer Conference on Wavelet Active Media Technology and Information Processing.
  • Gao et al., (2023) Gao, Y., Zhu, X., Qi, H., Li, G., Zhang, R., & Wang, H. (2023). An asymptotic analysis of random partition based minibatch momentum methods for linear regression models. Journal of Computational and Graphical Statistics, 32(3), 1083–1096.
  • Gargiani et al., (2020) Gargiani, M., Zanelli, A., Diehl, M., & Hutter, F. (2020). On the promise of the stochastic generalized Gauss-Newton method for training DNNs. arXiv: 2006.02409.
  • Ge et al., (2019) Ge, R., Kakade, S. M., Kidambi, R., & Netrapalli, P. (2019). The step decay schedule: A near optimal, geometrically decaying learning rate procedure for least squares. In Advances in Neural Information Processing Systems. Curran Associates, Inc.
  • Gitman et al., (2019) Gitman, I., Lang, H., Zhang, P., & Xiao, L. (2019). Understanding the role of momentum in stochastic gradient methods. In Advances in Neural Information Processing Systems. Curran Associates, Inc.
  • Goldfarb, (1970) Goldfarb, D. (1970). A family of variable-metric methods derived by variational means. Mathematics of Computation, 24(109), 23–26.
  • Goodfellow et al., (2016) Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT press.
  • Gower et al., (2019) Gower, R. M., Loizou, N., Qian, X., Sailanbayev, A., Shulgin, E., & Richtárik, P. (2019). SGD: General analysis and improved rates. In International Conference on Machine Learning. PMLR.
  • Goyal et al., (2017) Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., & He, K. (2017). Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv: 1706.02677.
  • Gu and Chen, (2023) Gu, J. & Chen, S. (2023). Statistical inference for decentralized federated learning. working paper.
  • Gürbüzbalaban et al., (2021) Gürbüzbalaban, M., Ozdaglar, A., & Parrilo, P. A. (2021). Why random reshuffling beats stochastic gradient descent. Mathematical Programming, 186, 49–84.
  • He et al., (2016) He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • Hector and Song, (2020) Hector, E. C. & Song, P. X. (2020). Doubly distributed supervised learning and inference with high-dimensional correlated outcomes. Journal of Machine Learning Research, 21(1), 6983–7017.
  • Hector and Song, (2021) Hector, E. C. & Song, P. X. (2021). A distributed and integrated method of moments for high-dimensional correlated data analysis. Journal of the American Statistical Association, 116(534), 805–818.
  • Hoffer et al., (2019) Hoffer, E., Nun, T. B., Hubara, I., Giladi, N., Hoefler, T., & Soudry, D. (2019). Augment your batch: Better training with larger batches. arXiv: 1901.09335.
  • Hu et al., (2021) Hu, A., Jiao, Y., Liu, Y., Shi, Y., & Wu, Y. (2021). Distributed quantile regression for massive heterogeneous data. Neurocomputing, 448, 249–262.
  • Huang and Huo, (2019) Huang, C. & Huo, X. (2019). A distributed one-step estimator. Mathematical Programming, 174(1-2), 41–76.
  • Johnson and Zhang, (2013) Johnson, R. & Zhang, T. (2013). Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems. Curran Associates, Inc.
  • Jordan et al., (2019) Jordan, M. I., Lee, J. D., & Yang, Y. (2019). Communication-efficient distributed statistical inference. Journal of the American Statistical Association, 114(526), 668–681.
  • Kiefer and Wolfowitz, (1952) Kiefer, J. & Wolfowitz, J. (1952). Stochastic estimation of the maximum of a regression function. The Annals of Mathematical Statistics, 23, 462–466.
  • Kingma and Ba, (2014) Kingma, D. P. & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv: 1412.6980.
  • Kleiner et al., (2014) Kleiner, A., Talwalkar, A., Sarkar, P., & Jordan, M. I. (2014). A scalable bootstrap for massive data. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(4), 795–816.
  • Koenker, (2005) Koenker, R. (2005). Quantile regression. Cambridge University Press.
  • Koenker and Bassett, (1978) Koenker, R. & Bassett, G. (1978). Regression quantiles. Econometrica, 46(1), 33–50.
  • Korkmaz, (2020) Korkmaz, S. (2020). Deep learning-based imbalanced data classification for drug discovery. Journal of Chemical Information and Modeling, 60(9), 4180–4190.
  • Kostov and Davidova, (2013) Kostov, P. & Davidova, S. (2013). A quantile regression analysis of the effect of farmers’ attitudes and perceptions on market participation. Journal of Agricultural Economics, 64(1), 112–132.
  • Krizhevsky et al., (2012) Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. Curran Associates, Inc.
  • Lalitha et al., (2018) Lalitha, A., Shekhar, S., Javidi, T., & Koushanfar, F. (2018). Fully decentralized federated learning. In 3rd Workshop on Bayesian Deep Learning (NeurIPS).
  • Lan, (2020) Lan, G. (2020). First-order and stochastic optimization methods for machine learning. Springer.
  • Lee et al., (2018) Lee, C., Lim, C. H., & Wright, S. J. (2018). A distributed quasi-Newton algorithm for empirical risk minimization with nonsmooth regularization. Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
  • Lee et al., (2017) Lee, J. D., Liu, Q., Sun, Y., & Taylor, J. E. (2017). Communication-efficient sparse regression. Journal of Machine Learning Research, 18(1), 115–144.
  • Li, (1994) Li, K. H. (1994). Reservoir-sampling algorithms of time complexity O(n(1+log(N/n)))𝑂𝑛1𝑁𝑛O(n(1+\log(N/n)))italic_O ( italic_n ( 1 + roman_log ( italic_N / italic_n ) ) ). ACM Transactions on Mathematical Software, 20(4), 481–493.
  • Li et al., (2012) Li, R., Zhong, W., & Zhu, L. (2012). Feature screening via distance correlation learning. Journal of the American Statistical Association, 107(499), 1129–1139.
  • Li et al., (2020) Li, X., Li, R., Xia, Z., & Xu, C. (2020). Distributed feature screening via componentwise debiasing. Journal of Machine Learning Research, 21(24), 1–32.
  • Li et al., (2022) Li, X., Liang, J., Chang, X., & Zhang, Z. (2022). Statistical estimation and online inference via local SGD. In Conference on Learning Theory.
  • Li et al., (2023) Li, X., Zhu, X., & Wang, H. (2023). Distributed logistic regression for massive data with rare events. arXiv: 2304.02269.
  • Li et al., (2021) Li, Y., Chen, C., Liu, N., Huang, H., Zheng, Z., & Yan, Q. (2021). A blockchain-based decentralized federated learning framework with committee consensus. IEEE Network, 35(1), 234–241.
  • Lian and Fan, (2018) Lian, H. & Fan, Z. (2018). Divide-and-conquer for debiased 1subscript1\ell_{1}roman_ℓ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT-norm support vector machine in ultra-high dimensions. Journal of Machine Learning Research, 18(182), 1–26.
  • Lian et al., (2017) Lian, X., Zhang, C., Zhang, H., Hsieh, C. J., Zhang, W., & Liu, J. (2017). Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems. PMLR.
  • Lin and Zhou, (2018) Lin, S. & Zhou, D. (2018). Distributed kernel-based gradient descent algorithms. Constructive Approximation, 47(2), 249–276.
  • (91) Liu, W., Chen, L., & Wang, W. (2022). General decentralized federated learning for communication-computation tradeoff. In IEEE INFOCOM 2022-IEEE Conference on Computer Communications Workshops. IEEE.
  • (92) Liu, W., Mao, X., & Zhang, X. (2022). Fast and robust sparsity learning over networks: A decentralized surrogate median regression approach. IEEE Transactions on Signal Processing, 70, 797–809.
  • Liu et al., (2020) Liu, Y., Gao, Y., & Yin, W. (2020). An improved analysis of stochastic gradient descent with momentum. In Advances in Neural Information Processing Systems. Curran Associates, Inc.
  • Loizou and Richtárik, (2020) Loizou, N. & Richtárik, P. (2020). Momentum and stochastic momentum for stochastic gradient, Newton, proximal point and subspace descent methods. Computational Optimization and Applications, 77(3), 653–710.
  • Luo and Song, (2020) Luo, L. & Song, P. X. (2020). Renewable estimation and incremental inference in generalized linear models with streaming data sets. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82(1), 69–97.
  • Ma and Yarats, (2018) Ma, J. & Yarats, D. (2018). Quasi-hyperbolic momentum and Adam for deep learning. arXiv: 1810.06801.
  • Ma et al., (2014) Ma, P., Mahoney, M., & Yu, B. (2014). A statistical perspective on algorithmic leveraging. In International Conference on Machine Learning. PMLR.
  • Ma et al., (2003) Ma, X., Winslett, M., Lee, J., & Yu, S. (2003). Improving MPI-IO output performance with active buffering plus threads. In International Parallel and Distributed Processing Symposium.
  • Ma et al., (2024) Ma, Y., Leng, C., & Wang, H. (2024). Optimal subsampling bootstrap for massive data. Journal of Business and Economic Statistics, 42(1), 174–186.
  • Mahoney, (2011) Mahoney, M. W. (2011). Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning, 3(2), 123–224.
  • Mcdonald et al., (2009) Mcdonald, R., Mohri, M., Silberman, N., Walker, D., & Mann, G. S. (2009). Efficient large-scale distributed training of conditional maximum entropy models. In Advances in Neural Information Processing Systems. Curran Associates, Inc.
  • Mishchenko et al., (2020) Mishchenko, K., Khaled, A., & Richtárik, P. (2020). Random reshuffling: Simple analysis with vast improvements. In Advances in Neural Information Processing Systems. Curran Associates, Inc.
  • Mou et al., (2020) Mou, W., Li, C. J., Wainwright, M. J., Bartlett, P. L., & Jordan, M. I. (2020). On linear stochastic approximation: Fine-grained Polyak-Ruppert and non-asymptotic concentration. In Conference on Learning Theory. PMLR.
  • Moulines and Bach, (2011) Moulines, E. & Bach, F. (2011). Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems. Curran Associates, Inc.
  • Mukkamala and Hein, (2017) Mukkamala, M. C. & Hein, M. (2017). Variants of RMSProp and adagrad with logarithmic regret bounds. In International Conference on Machine Learning. PMLR.
  • Nadiradze et al., (2021) Nadiradze, G., Sabour, A., Davies, P., Li, S., & Alistarh, D. (2021). Asynchronous decentralized SGD with quantized and local updates. In Advances in Neural Information Processing Systems. Curran Associates, Inc.
  • Nakamura et al., (2021) Nakamura, K., Derbel, B., Won, K. J., & Hong, B. W. (2021). Learning-rate annealing methods for deep neural networks. Electronics, 10(16), 2029.
  • Nedic et al., (2017) Nedic, A., Olshevsky, A., & Shi, W. (2017). Achieving geometric convergence for distributed optimization over time-varying graphs. SIAM Journal on Optimization, 27(4), 2597–2633.
  • Needell and Ward, (2017) Needell, D. & Ward, R. (2017). Batched stochastic gradient descent with weighted sampling. In Approximation Theory XV: San Antonio 2016 15. Springer.
  • Nelder and Wedderburn, (1972) Nelder, J. A. & Wedderburn, R. W. (1972). Generalized linear models. Journal of the Royal Statistical Society: Series A (General), 135(3), 370–384.
  • Nesterov, (1983) Nesterov, Y. E. (1983). A method for solving the convex programming problem with convergence rate O(1/k2)𝑂1superscript𝑘2O(1/k^{2})italic_O ( 1 / italic_k start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ). In Doklady Akademii Nauk. Russian Academy of Sciences.
  • Nitzberg and Lo, (1997) Nitzberg, B. & Lo, V. (1997). Collective buffering: Improving parallel I/O performance. In IEEE International Symposium on High Performance Distributed Computing. IEEE.
  • Ofeidis et al., (2022) Ofeidis, I., Kiedanski, D., & Tassiulas, L. (2022). An overview of the data-loader landscape: Comparative performance analysis. arXiv: 2209.13705.
  • Pan et al., (2022) Pan, R., Ren, T., Guo, B., Li, F., Li, G., & Wang, H. (2022). A note on distributed quantile regression by pilot sampling and one-step updating. Journal of Business and Economic Statistics, 40(4), 1691–1700.
  • Pan et al., (2023) Pan, R., Zhu, Y., Guo, B., Zhu, X., & Wang, H. (2023). A sequential addressing subsampling method for massive data analysis under memory constraint. IEEE Transactions on Knowledge and Data Engineering, 35(9), 9502–9513.
  • Paszke et al., (2019) Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., & Chintala, S. (2019). PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems. Curran Associates, Inc.
  • Polyak, (1964) Polyak, B. T. (1964). Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5), 1–17.
  • Polyak and Juditsky, (1992) Polyak, B. T. & Juditsky, A. B. (1992). Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4), 838–855.
  • Pumma et al., (2017) Pumma, S., Si, M., Feng, W., & Balaji, P. (2017). Parallel I/O optimizations for scalable deep learning. In IEEE International Conference on Parallel and Distributed Systems.
  • (120) Qi, H., Huang, D., Zhu, Y., Huang, D., & Wang, H. (2023). Mini-batch gradient descent with buffer. arXiv: 2312.08728.
  • (121) Qi, H., Wang, F., & Wang, H. (2023). Statistical analysis of fixed mini-batch gradient descent estimator. Journal of Computational and Graphical Statistics, 32(4), 1348–1360.
  • Qu and Li, (2019) Qu, G. & Li, N. (2019). Accelerated distributed Nesterov gradient descent. IEEE Transactions on Automatic Control, 65(6), 2566–2581.
  • Reich et al., (2012) Reich, B. J., Fuentes, M., & Dunson, D. B. (2012). Bayesian spatial quantile regression. Journal of the American Statistical Association, 106(493), 6–20.
  • Richards et al., (2020) Richards, D., Rebeschini, P., & Rosasco, L. (2020). Decentralised learning with random features and distributed gradient descent. In International Conference on Machine Learning. PMLR.
  • Richardson and Lidbury, (2013) Richardson, A. M. & Lidbury, B. A. (2013). Infection status outcome, machine learning method and virus type interact to affect the optimised prediction of hepatitis virus immunoassay results from routine pathology laboratory assays in unbalanced data. BMC Bioinformatics, 14, 206.
  • Robbins and Monro, (1951) Robbins, H. & Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, 22(3), 400–407.
  • Rosenblatt and Nadler, (2016) Rosenblatt, J. D. & Nadler, B. (2016). On the optimality of averaging in distributed statistical learning. Journal of the IMA, 5(4), 379–404.
  • Roux et al., (2012) Roux, N., Schmidt, M., & Bach, F. (2012). A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems. Curran Associates, Inc.
  • Ruder, (2016) Ruder, S. (2016). An overview of gradient descent optimization algorithms. arXiv: 1609.04747.
  • Savazzi et al., (2020) Savazzi, S., Nicoli, M., & Rampa, V. (2020). Federated learning with cooperating devices: A consensus approach for massive lot networks. IEEE Internet of Things Journal, 7(5), 4641–4654.
  • Sengupta et al., (2016) Sengupta, S., Volgushev, S., & Shao, X. (2016). A subsampled double bootstrap for massive data. Journal of the American Statistical Association, 111(515), 1222–1232.
  • Shamir et al., (2014) Shamir, O., Srebro, N., & Zhang, T. (2014). Communication-efficient distributed optimization using an approximate Newton-type method. In International Conference on Machine Learning. PMLR.
  • Shao, (2003) Shao, J. (2003). Mathematical statistics. Springer New York, NY.
  • Shao and Tu, (1995) Shao, J. & Tu, D. (1995). The jackknife and bootstrap. Springer New York, NY.
  • Shu et al., (2022) Shu, J., Zhu, Y., Zhao, Q., Meng, D., & Xu, Z. (2022). Mlr-snet: transferable lr schedules for heterogeneous tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(3), 3505–3521.
  • Simonyan and Zisserman, (2015) Simonyan, K. & Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations.
  • Soori et al., (2020) Soori, S., Mishchenko, K., Mokhtari, A., Dehnavi, M. M., & Gurbuzbalaban, M. (2020). DAve-QN: A distributed averaged quasi-Newton method with local superlinear convergence rate. In Proceedings of the 21st International Conference on Artificial Intelligence and Statistics. PMLR.
  • Stich, (2019) Stich, S. U. (2019). Local SGD converges fast and communicates little. In 2019 International Conference on Learning Representations.
  • Su and Xu, (2019) Su, L. & Xu, J. (2019). Securing distributed gradient descent in high dimensional statistical learning. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 3(1), 1–41.
  • Sutskever, (2013) Sutskever, I. (2013). Training recurrent neural networks. University of Toronto Toronto, ON, Canada.
  • Sutskever et al., (2013) Sutskever, I., Martens, J., Dahl, G., & Hinton, G. (2013). On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning. PMLR.
  • Suwandarathna and Koggalage, (2007) Suwandarathna, S. & Koggalage, R. (2007). Increasing hard drive performance — from a thermal perspective. In International Conference on Industrial and Information Systems.
  • Tan et al., (2016) Tan, C., Ma, S., Dai, Y., & Qian, Y. (2016). Barzilai-Borwein step size for stochastic gradient descent. In Advances in Neural Information Processing Systems. Curran Associates, Inc.
  • Tan et al., (2022) Tan, K. M., Battey, H., & Zhou, W. (2022). Communication-constrained distributed quantile regression with optimal statistical guarantees. Journal of Machine Learning Research, 23(1), 12456–12516.
  • Tang et al., (2018) Tang, H., Lian, X., Yan, M., Zhang, C., & Liu, J. (2018). Decentralized training over decentralized data. In International Conference on Machine Learning. PMLR.
  • Tang et al., (2023) Tang, K., Liu, W., & Zhang, Y. (2023). Acceleration of stochastic gradient descent with momentum by averaging: Finite-sample rates and asymptotic normality. arXiv: 2305.17665.
  • Tang et al., (2020) Tang, L., Zhou, L., & Song, P. X. K. (2020). Distributed simultaneous inference in generalized linear models via confidence distribution. Journal of Multivariate Analysis, 176, 104567.
  • Toulis and Airoldi, (2017) Toulis, P. & Airoldi, E. M. (2017). Asymptotic and finite-sample properties of estimators based on stochastic gradients. The Annals of Statistics, 45(4), 1694–1727.
  • Tu et al., (2023) Tu, J., Liu, W., Mao, X., & Xu, M. (2023). Distributed semi-supervised sparse statistical inference. arXiv: 2306.10395.
  • Van der Vaart, (2000) Van der Vaart, A. W. (2000). Asymptotic statistics. Cambridge University Press.
  • Van Scoy et al., (2017) Van Scoy, B., Freeman, R. A., & Lynch, K. M. (2017). The fastest known globally convergent first-order method for minimizing strongly convex functions. IEEE Control Systems Letters, 2(1), 49–54.
  • Vanhaesebrouck et al., (2017) Vanhaesebrouck, P., Bellet, A., & Tommasi, M. (2017). Decentralized collaborative learning of personalized models over networks. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics. PMLR.
  • Vitter, (1985) Vitter, J. S. (1985). Random sampling with a reservoir. ACM Transactions on Mathematical Software, 1(1), 37–57.
  • Volgushev et al., (2019) Volgushev, S., Chao, S., & Cheng, G. (2019). Distributed inference for quantile regression processes. The Annals of Statistics, 47(3), 1634–1662.
  • Wang et al., (2022) Wang, F., Huang, D., Gao, T., Wu, S., & Wang, H. (2022). Sequential one‐step estimator by sub‐sampling for customer churn analysis with massive data sets. Journal of the Royal Statistical Society: Series C (Applied Statistics), 71(5), 1753–1786.
  • Wang et al., (2021) Wang, F., Zhu, Y., Huang, D., Qi, H., & Wang, H. (2021). Distributed one-step upgraded estimation for non-uniformly and non-randomly distributed data. Computational Statistics and Data Analysis, 162, 107265.
  • Wang, (2009) Wang, H. (2009). Forward regression for ultra-high dimensional variable screening. Journal of the American Statistical Association, 104(488), 1512–1524.
  • (158) Wang, H. (2019). Divide-and-conquer information-based optimal subdata selection algorithm. Journal of Statistical Theory and Practice, 13(3), 46.
  • (159) Wang, H. (2019). More efficient estimation for logistic regression with optimal subsamples. Journal of Machine Learning Research, 20(132), 1–59.
  • Wang, (2020) Wang, H. (2020). Logistic regression for massive data with rare events. In International Conference on Machine Learning. PMLR.
  • Wang and Ma, (2021) Wang, H. & Ma, Y. (2021). Optimal subsampling for quantile regression in big data. Biometrika, 108(1), 99–112.
  • (162) Wang, H., Yang, M., & Stufken, J. (2019). Information-based optimal subdata selection for big data linear regression. Journal of the American Statistical Association, 114(525), 393–405.
  • (163) Wang, H., Zhu, R., & Ma, P. (2018). Optimal subsampling for large sample logistic regression. Journal of the American Statistical Association, 113(522), 829–844.
  • Wang et al., (2017) Wang, J., Kolar, M., Srebro, N., & Zhang, T. (2017). Efficient distributed learning with sparsity. In International Conference on Machine Learning. PMLR.
  • (165) Wang, S., Roosta, F., Xu, P., & Mahoney, M. W. (2018). Giant: Globally improved approximate Newton method for distributed optimization. In Advances in Neural Information Processing Systems. Curran Associates, Inc.
  • (166) Wang, X., Yang, Z., Chen, X., & Liu, W. (2019). Distributed inference for linear support vector machine. Journal of Machine Learning Research, 20(113), 1–41.
  • Woodworth et al., (2020) Woodworth, B., Patel, K. K., Stich, S., Dai, Z., Bullins, B., Mcmahan, B., Shamir, O., & Srebro, N. (2020). Is local SGD better than minibatch SGD? In International Conference on Machine Learning. PMLR.
  • (168) Wu, S., Huang, D., & Wang, H. (2023). Network gradient descent algorithm for decentralized federated learning. Journal of Business and Economic Statistics, 41(3), 806–818.
  • (169) Wu, S., Huang, D., & Wang, H. (2023). Quasi-Newton updating for large-scale distributed learning. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 85(4), 1326–1354.
  • (170) Wu, S., Zhu, X., & Wang, H. (2023). Subsampling and jackknifing: A practically convenient solution for large data analysis with limited computational resources. Statistica Sinica, 33(3), 2041–2064.
  • Xu et al., (2017) Xu, G., Sit, T., Wang, L., & Huang, C. Y. (2017). Estimation and inference of quantile regression for survival data under biased sampling. Journal of the American Statistical Association, 112(520), 1571–1586.
  • Xu et al., (2020) Xu, Q., Cai, C., Jiang, C., Sun, F., & Huang, X. (2020). Block average quantile regression for massive dataset. Statistical Papers, 61, 141–165.
  • Yang et al., (2013) Yang, J., Meng, X., & Mahoney, M. (2013). Quantile regression for large-scale applications. In International Conference on Machine Learning. PMLR.
  • Yao and Wang, (2019) Yao, Y. & Wang, H. (2019). Optimal subsampling for softmax regression. Statistical Papers, 60(2), 585–599.
  • Yu et al., (2022) Yu, J., Wang, H., Ai, M., & Zhang, H. (2022). Optimal distributed subsampling for maximum quasi-likelihood estimators with massive data. Journal of the American Statistical Association, 117(537), 265–276.
  • Yu et al., (2021) Yu, L., Balasubramanian, K., Volgushev, S., & Erdogdu, M. (2021). An analysis of constant step size SGD in the non-convex regime: Asymptotic normality and bias. In Advanced in Neural Information Processing Systems. Curran Associates, Inc.
  • Yuan et al., (2016) Yuan, K., Ling, Q., & Yin, W. (2016). On the convergence of decentralized gradient descent. SIAM Journal on Optimization, 26(3), 1835–1854.
  • Zeiler, (2012) Zeiler, M. D. (2012). AdaDelta: An adaptive learning rate method. arXiv: 1212.5701.
  • Zhang et al., (2016) Zhang, J., C., D. S., Mitliagkas, I., & Ré, C. (2016). ParallelSGD: When does averaging help? In International Conference on Machine Learning Workshop on Optimization in Machine Learning.
  • Zhang and Lin, (2015) Zhang, Y. & Lin, X. (2015). DiSCO: Distributed optimization for self-concordant empirical loss. In International Conference on Machine Learning. PMLR.
  • Zhang et al., (2012) Zhang, Y., Wainwright, M. J., & Duchi, J. C. (2012). Communication-efficient algorithms for statistical optimization. In Advances in Neural Information Processing Systems. Curran Associates, Inc.
  • Zhao et al., (2018) Zhao, Y., Wong, Z. S. Y., & Tsui, K. L. (2018). A framework of rebalancing imbalanced healthcare data for rare events’ classification: A case of look-alike sound-alike mix-up incident detection. Journal of Healthcare Engineering, 2018.
  • Zhong et al., (2022) Zhong, W., Wan, C., & Zhang, W. (2022). Estimation and inference for multi-kink quantile regression. Journal of Business and Economic Statistics, 40(3), 1123–1139.
  • Zhou et al., (2023) Zhou, L., She, X., & Song, P. X. (2023). Distributed empirical likelihood approach to integrating unbalanced datasets. Statistica Sinica, 33(3), 2209–2231.
  • Zhu et al., (2006) Zhu, M., Su, W., & Chipman, H. A. (2006). LAGO: A computationally efficient approach for statistical detection. Technometrics, 48(2), 193–205.
  • Zhu et al., (2023) Zhu, W., Chen, X., & Wu, W. B. (2023). Online covariance matrix estimation in stochastic gradient descent. Journal of the American Statistical Association, 118(541), 393–404.
  • (187) Zhu, X., Li, F., & Wang, H. (2021). Least-square approximation for a distributed system. Journal of Computational and Graphical Statistics, 30(4), 1004–1018.
  • Zhu et al., (2022) Zhu, X., Pan, R., Wu, S., & Wang, H. (2022). Feature screening for massive data analysis by subsampling. Journal of Business and Economic Statistics, 40(4), 1892–1903.
  • (189) Zhu, Y., Huang, D., Gao, Y., Wu, R., Chen, Y., Zhang, B., & Wang, H. (2021). Automatic, dynamic, and nearly optimal learning rate specification via local quadratic approximation. Neural Networks, 141, 11–29.
  • Zhu et al., (2019) Zhu, Y., Yu, W., Jiao, B., Mohror, K., Moody, A., & Chowdhury, F. (2019). Efficient user-level storage disaggregation for deep learning. In 2019 IEEE International Conference on Cluster Computing.
  • Zhuang et al., (2019) Zhuang, J., Cai, J., Wang, R., Zhang, J., & Zheng, W. (2019). CARE: Class attention to regions of lesion for classification on imbalanced data. In International Conference on Medical Imaging with Deep Learning. PMLR.
  • Zinkevich et al., (2010) Zinkevich, M., Weimer, M., Li, L., & Smola, A. (2010). Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems. Curran Associates, Inc.
  • Zou and Li, (2008) Zou, H. & Li, R. (2008). One-step sparse estimates in nonconcave penalized likelihood models. The Annals of Statistics, 36(4), 1509–1533.