Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                

2501.07879v1

Download as pdf or txt
Download as pdf or txt
You are on page 1of 37

1

Distributed Nonparametric Estimation: from


Sparse to Dense Samples per Terminal
Deheng Yuan, Tao Guo and Zhongyi Huang
arXiv:2501.07879v1 [cs.LG] 14 Jan 2025

Abstract

Consider the communication-constrained problem of nonparametric function estimation, in which each distributed
terminal holds multiple i.i.d. samples. Under certain regularity assumptions, we characterize the minimax optimal
rates for all regimes, and identify phase transitions of the optimal rates as the samples per terminal vary from sparse
to dense. This fully solves the problem left open by previous works, whose scopes are limited to regimes with either
dense samples or a single sample per terminal. To achieve the optimal rates, we design a layered estimation protocol
by exploiting protocols for the parametric density estimation problem. We show the optimality of the protocol using
information-theoretic methods and strong data processing inequalities, and incorporating the classic balls and bins
model. The optimal rates are immediate for various special cases such as density estimation, Gaussian, binary, Poisson
and heteroskedastic regression models.

I. I NTRODUCTION

Distributed nonparametric estimation problems have attracted wide attention, and related theoretical studies can
shed light on the understanding of modern applications such as federated learning [1], [2], [3]. In this setting,
multiple distributed terminals cooperate to estimate a nonparametric function, while each of them can only observe
part of samples and use limited number of bits to describe the observation. The limitation of communication
resources often leads to an increase in estimation error compared with the classic centralized settings where all the
samples are accessed directly.
In this work, we investigate the nonparametric function estimation problem, where each of the m terminals
observes n i.i.d. samples and has a communication budget of l bits. We consider all parameter regimes from sparse
to dense samples per terminal, i.e., there is no restriction on the relative value of n compared to m. Inspired by [4],
a unified nonparametric estimation model describing many ways of sample generation is adopted in this work, so
that many specific settings are subsumed.

This work was partially supported by the NSFC Projects No. 12025104 and 62301144. (Corresponding authors: Tao Guo and Zhongyi Huang).
Deheng Yuan and Zhongyi Huang are with the Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China (Emails:
ydh22@mails.tsinghua.edu.cn, zhongyih@tsinghua.edu.cn).
Tao Guo is with the School of Cyber Science and Engineering, Southeast University, Nanjing 211189, China (Email: taoguo@seu.edu.cn).
2

A. Our Contributions

The main contribution of this work is that we obtain the minimax optimal rates (up to logarithmic factors) for the
aforementioned estimation problem, under a few regularity assumptions. The assumptions are satisfied in several key
estimation settings, including density estimation, Gaussian, binary, Poisson and heteroskedastic regression models.
Hence, the optimal rates for these models follow directly as corollaries of the main results.
Previous works focused on either the case m < nγ , γ < 2r (r is the Sobolev regularity parameter) with dense
samples [5], [4] or the specific density estimation problem with extremely sparse n = 1 sample [6], [7]. We fully
solve the problem by characterizing the optimal rates for all regimes from sparse to dense samples per terminal
classified by the relative size of n and m. As a result, We see that the dependence of the optimal rates on the
communication budget l can be qualitatively different for sparse and dense settings, where phase transitions are
clearly characterized.
To establish our results, we need to prove both the upper and lower bounds for the minimax rate. For the upper
bound, we design a two-layer estimation protocol. The outer layer transforms the original problem into a parametric
density estimation problem, and the inner layer solves it by exploiting the protocol developed in [8]. The design of
the outer layer employs wavelet-based estimator with sparsity properties, incorporating appropriate truncation and
quantization. Moreover, parameters linking two layers are tuned carefully to achieve optimality. For the lower bound,
we take advantage of information-theoretic methods [6], [9], [10]. We prove and apply a generalized tensorization
of the strong data processing inequality, in which the key step is to bound the strong data processing constant. To
establish this bound, we interpret the likelihood ratio by drawing connections to the classic balls and bins model.

B. Comparisons with Related Works

As a theoretical framework for federated learning, distributed nonparametric estimation problems under commu-
nication constraints received wide attention [11], [6], [5], [12], [13], [4], [7], as well as related problems under
differential privacy constraints [14], [15], [16], [17], [18]. For the problems under communication constraints,
optimal rates for certain special cases are obtained such as the nonparametric density estimate [6], [7], nonparametric
Gaussian regression [5], Gaussian sequence model with white noise [11], [12], [13] and two-party joint distribution
estimation [19], [20]. In [4], a general framework was developed and optimal rates were derived for several special
cases under specific assumptions.
In the setting where each terminal has n i.i.d. samples, previous works focused on either the regime m <
γ
n , γ < 2r (r is the Sobolev regularity parameter) with dense samples [5], [4] or the density estimation problem
with extremely sparse n = 1 sample [6], [7]. However, the problem for other choices of n is more difficult, since
the methods in [6], [7], [5], [4] cannot be applied without substantial development.
In the current work we set no restrictions on the relative size of n and m and obtain the optimal rates for all
regimes, which fully solves the problem. To handle this general case, our framework imposes assumptions which
are strengthened slightly from that in [4]. All the special cases therein are still subsumed in our setting. Moreover,
different from [4], our assumptions are imposed on single random variables, making them easier to verify.
3

B1
X1n Encoder 1
B2
(B1 , X2n ) Encoder 2
Decoder fˆ
...
n Bm
(B1:m−1 , Xm ) Encoder m

Fig. 1. Distributed interactive nonparametric estimation

Specialized to regression problems like the nonparametric Gaussian regression, our problem leads to a setting
with random design where the explanatory variable is randomly generated. The resulting problem is substantially
different and more difficult than the Gaussian sequence model studied by [11], [12], [13], and merits separate
investigation. The difference is reflected in the optimal rates, which depend exponentially in l for some regimes of
our problem, but always polynomially for the Gaussian sequence model.

C. Problem Formulation

We denote a discrete random variable by a capital letter and use the superscript n to denote an n-sequence, e.g.,
X n = (Xi )ni=1 . For any positive a and b, we say a  b if a ≤ c · b for some constant c > 0 independent of
parameters we are concerned. The notation  is defined similarly. Then we denote by a ≍ b if both a  b and
a  b hold.
Let pf be an unknown distribution (assume pf is the pmf or pdf) parameterized by a function f belonging to
1
a subset F of the standard Sobolev ball H r ([0, 1], L), r > 2, L > 0. See Section III-A for a brief introduction
of Sobolev spaces and wavelets. Assume that samples are generated at random according to pf . To be precise,
let Xij ∼ pf (x), i = 1, 2, · · · , m, j = 1, 2, · · · , n be i.i.d. random variables distributed over X . Denote the total
sample size by N = mn.
Consider the distributed nonparametric minimax estimation problem with communication constraints depicted in
Figure 1. Assume there are m encoders. For i = 1, ..., m, the i-th encoder observes the source message Xin =
(Xij )nj=1 and the first i − 1 coded messages (Bi′ )i−1
i′ =1 . Then the coded binary message Bi of length l is transmitted

to the decoder and the remaining m − i encoders. Upon receiving messages B m = (Bi )m
i=1 , the decoder needs to

establish a reconstruction fˆ ∈ H r ([0, 1], L) of f . We assume that l ≥ 4 in this work, which is reasonable for most
cases.
An (m, n, l) sequentially interactive protocol P is defined by a series of random encoding functions

Enci : X n × {0, 1}(i−1)l → {0, 1}l, ∀i = 1, ..., m,

and a random decoding function


Dec : {0, 1}ml → H r ([0, 1], L).

Then we have Bi = Enci (X n , B1:i−1 ) and the reconstruction of the function is fˆP = Dec(B1 , B2 , ..., Bm ).
4

Define the minimax convergence rate as

R(m, n, l, r) = inf sup E[kfˆP − f k22 ]. (1)


(m, n, l)-protocol P f ∈F

The parameter L is omitted in (1), since we assume that L is a constant and are only interested in the order of the
convergence rate R in this work.

II. M AIN R ESULTS

For the case l = ∞, i.e. there are no communication constraints, it is well known that (cf. Section 6.3.3 in [21]
and Section 15.3 in [22])
2r
R(m, n, ∞, r) ≍ N − 2r+1

for specific settings like density estimation and Gaussian regression. Recall that N = mn is the total sample size.
To see the effect of communication constraints, first define the effective sample size Ness as
nh 2r+1
i h 2r+1
io
(2l mn) 2r+2 ∧lm ∨ (lm)2r+1 ∧(lmn) 2r+2 ∧ mn. (2)

It is easy to see that for l = ∞, we have Ness = mn. Further denote by Poly(log N ) a polynomial of log N . In
the following, we show that the optimal rate for the problem in Section I-C with total sample size mn is roughly
the same as that for the problem without communication constraints with total sample size Ness , which is the main
theorem of this work.
2r
Theorem 1. Under Assumptions 1, 2 and 3, we have R(m, n, l, r)  (Ness )− 2r+1 /Poly(log N ) and R(m, n, l, r) 
2r
(Ness )− 2r+1 Poly(log N ).
2r
Theorem 1 shows that the minimax optimal rate R(m, n, l, r) is approximately (Ness )− 2r+1 under a few necessary
assumptions 1, 2 and 3 on the sample distribution pf . Details of assumptions are omitted here and will be formulated
in Sections IV and V. Assumption 1 is for the upper bound, which requires the existence of good estimators for
f based on each sample X ∼ pf . The remaining assumptions 2 and 3 are for the lower bound, which reveals
the difficulty of estimating f based on samples generated from pf . These assumptions are reasonable, and can be
verified for many common examples such as those presented in Section II-C.
Proof of Theorem 1: We collect different cases in (2) based on the exact formula of Ness . Logarithmic factors
in some boundaries that do not affect the conclusion in Theorem 1 are omitted. The proof of both lower and the
upper bounds needs to handle all the following cases respectively.
2r+1
1 m
1. Ness = (2l mn) 2r+2 for m ≥ n2r+1 and 1 ≤ l ≤ 2r+1 log n2r+1 . In this case n2r+1 ≤ Ness ≤ m.
1 m n2r+1
2. Ness = lm for m > n2r and 2r+1 log n2r+1 ∨ m ≤ l ≤ n. In this case Ness ≥ n2r+1 ∨ m.
1
n 2r+1
3. Ness = (lm)2r+1 for n > m2r+1 and 1 ≤ l ≤ m . In this case Ness ≤ n.
1
2r+1
n 2r+1 n2r+1 1
4. Ness = (lmn) 2r+2 for m < n2r+1 and m ∨1≤l≤ m ∧ (mn) 2r+1 . In this case n ≤ Ness < n2r+1 .
1
5. Ness = mn for l ≥ n ∧ (mn) 2r+1 .
We formulate the assumptions and give the detailed proof for upper and lower bounds in Sections IV and V
respectively. Combining the results of Theorems 4 and 5 directly implies Theorem 1.
5

A. Phase Transitions from Spase to Dense Samples per Terminal

Within a logarithmic gap, the optimal rate R(m, n, l, r) is fully characterized by Ness , which can be written in
the following equivalent form.

2r+1


 (2l mn) 2r+2 ∧ lm ∧ mn, if m ≥ n2r+1 ,



 h 2r+1
i
(lmn) 2r+2 ∨ lm ∧ mn,







if n2r < m < n2r+1 and n ≤ m2r+1 ,



Ness = (3)
2r+1
(lmn) 2r+2 ∧ mn, if m ≤ n2r and n ≤ m2r+1 ,






 2r+1
(lm)2r+1 ∧ (lmn) 2r+2

 ∧ mn,





if m ≤ n2r and n > m2r+1 .

First, we see that the dependence of R(m, n, l, r) on the communication budget l is different for sparse and dense
regimes. For the sparsest regime, i.e. m ≥ n2r+1 , R(m, n, l, r) first decays exponentially as l increases, and then
the rate slows to a polynomial order. For other regimes, R(m, n, l, r) always depends polynomially on l.
Second, in order for the distributed system to achieve roughly the same performance as the centralized one, the
1
minimum communication budget should be n for m > n2r and (mn) 2r+1 for m ≤ n2r .

B. Comparisons with Related Previous Works

Remark 1. The work [4], [5] only considered the case m ≤ nγ for γ < 2r with an extra assumption Ness > n.
2r+1
It is a special case of the last two cases in (3) by imposing that Ness > n. In this case, Ness = (lmn) 2r+2 ∧ mn
and Theorem 1 recovers the main results Theorems 1-2 of [4] under similar assumptions. Furthermore, our results
extend those in [4] in two directions. First, we give the optimal rates for regimes m < n2r , revealing more
complicated but interesting structures of the problem. Second, we do not require Ness > n. Although restricting the
scope to Ness > n suffices in certain situations, it is generally impractical in most others. Moreover, investigating
the problem without the restriction can give more insights from a theoretical view.

Remark 2. The work [7] considered the density estimation problem with extremely sparse samples (i.e. n = 1),
the Ls norm and the Besov space B(p, q, r). By letting p = q = s = 2, the Besov space reduces to the Sobolev
space H r ([0, 1], L) and the Ls norm becomes the L2 norm. In this way, their problem reduces to the special case
2r+1
of Example 1 with n = 1, in which their conclusion coincides with the first case in (3), that is Ness = (2l m) 2r+1 ∧m.
It is an interesting direction to generalize our results to the Besov space and Ls norm, though it is not the main
goal of this work due to space limitation.
1
Remark 3. We find the exponential dependence of R(m, n, l, r) in l for m ≥ n2r+1 and 1 ≤ l ≤ 2r+1
m
log n2r+1 . It
is different from the Gaussian sequence model considered in [11], [12], [13] where R(m, n, l, r) depends on l only
polynomially, although there are phase transitions in the polynomial order. This difference in minimax optimal rates
is because the ways of generating samples are different, between the Gaussian sequence model in [11], [12], [13]
and the random design problem in Section I-C. For the former, the noisy versions of all the Fourier coefficients are
known, instead of the random samples themselves for the latter.
6

Remark 4. In [23] and [8], the parametric distribution estimation problem was considered. The optimal rates
were shown to be exponential in l for some regimes, while polynomial for other regimes, which is similar to the
phenomena revealed in this work. Deeper connections of these two problems are found and discussed in Remark 6.

C. Examples

In this subsection, consider several important estimation settings such as density estimation, Gaussian, binary,
Poisson and heteroskedastic regression models. By directly verifying Assumption 1, 2 and 3 and specializing The-
orem 1, the minimax optimal rates for all these settings are obtained in a unified manner.
1) Nonparametric Density Estimation:

Example 1 (Density Estimation). Each of m distributed terminals observes n i.i.d. random samples Xin , and each
sample Xij is generated from a density function f ∈ H r ([0, 1], L). We want to estimate f at the decoder side. The
problem is a special case of the framework in Section I-C by letting

pf (x) = f (x), x ∈ [0, 1]. (4)

The optimal rate for the above density estimation problem is shown as follows and proved in Appendix D-A.
2r

Theorem 2. For the nonparametric density estimation problem, we have R(m, n, l, r)  Ness2r+1 Poly(log N ) and
2r
R(m, n, l, r)  (Ness )− 2r+1 /Poly(log N ).

2) Nonparametric Regression Problems: Consider nonparametric regression problems in the distributed setting.
Each of m distributed terminals observes n i.i.d. pairs of random variables (Tij , Yij )nj=1 , and each pair (Tij , Yij ) is
sampled under random design, where we assume that the explanatory variable Tij ∼ Unif([0, 1]) and the response
Yij follows some distribution parameterized by f (Tij ). We want to estimate f at the central decoder. Next we
specify the conditional distribution Yij |Tij and consider the resulting regression problems.

Example 2 (Nonparametric Gaussian Regression). Let Yij |Tij ∼ N (f (Tij ), 1), and then the model is a special
case of that in Section I-C by letting
1 (y−f (t))2
pf (t, y) = √ e− 2 , x = (t, y) ∈ [0, 1] × R. (5)

Example 3 (Nonparametric Binary Regression (Classification)). Let f (t) ∈ [0, 1] for any t ∈ [0, 1] and Yij |Tij ∼
Bern(f (Tij )). Then the model is a special case of that in Section I-C by letting

pf (t, y) = f (t)1y=1 + (1 − f (t))1y=0 ,


(6)
x = (t, y) ∈ [0, 1] × {0, 1}.

Example 4 (Nonparametric Poisson Regression). Let f (t) > 0 for any t ∈ [0, 1] and Yij |Tij ∼ Poisson(f (Tij )).
Then the model is a special case of that in Section I-C by letting
(f (t))y
pf (t, y) = e−f (t) , x = (t, y) ∈ [0, 1] × N. (7)
y!
7

Example 5 (Nonparametric Heteroskedastic Regression). Let f (t) > 0 for any t ∈ [0, 1] and Yij |Tij ∼ N (0, f (Tij )).
Then the model is a special case of that in Section I-C by letting
1 y2
pf (t, y) = p e− 2f (t) , x = (t, y) ∈ [0, 1] × R. (8)
2πf (t)
For each of the above regression problems, the optimal rate is shown as follows and proved in Appendix D-B.

Theorem 3. For the nonparametric Gaussion, binary, Poisson and heteroskedastic regression problems, we have
2r
− 2r
R(m, n, l, r)  Ness2r+1 Poly(log N ) and R(m, n, l, r)  (Ness )− 2r+1 /Poly(log N ).

III. P RELIMINARY R ESULTS FOR THE P ROOF

In this section, we present preliminary results that are essential for the proof of both upper and lower bounds.

A. Preliminary Results on Sobolev Spaces and Wavelets

We want to find good approximation formulas for the Sobolev space H r ([0, 1], L), r, L > 0 in order to simplify
the L2 estimation problem in the space. It turns out that the wavelet construction is suitable, since it induces sparse
representations of randomly generated data, which is fully exploited in the proof of the upper bound.
We follow the construction by [24], for more details see [21], [25]. Start with two continuous father and mother
wavelet functions φ, ψ with S vanishing moments and bounded support on [0, 2S − 1] and [−S + 1, S] respectively,
where S > r. By linear scaling of φ and ψ and correcting them near the boundary, an orthonormal basis for L2 [0, 1]
is obtained.
We can construct an approximation f H ∈ H r ([0, 1], L) to f in the L2 norm with resolution H ∈ N by
H
2
X
H
f = fHs φHs , (9)
s=1
H
where {φHs }2s=1 is an orthonormal system and fHs = (f, φHs ). Furthermore, for any function f in the smaller
space H r ([0, 1], L) ⊆ L2 [0, 1], useful properties of the approximation f H are summarized in the following lemma.
See Section 4.3 in [21] and Corollary 26 in [25] for the proof.

Lemma 1. Let f ∈ H r ([0, 1], L) Then the approximation f H by (9) satisfies the following.
H
1) kφHs k∞  2 2 , for any s = 1, ..., 2H .
2) Let NHs = s′ : [ s−1 s H

2H , 2H ] ∩ supp(φHs ) 6= ∅ for s = 1, ..., 2 . Then |NHs | ≤ 2S + 2.

3) The L2 convergence rate satisfies


kf − f H k22  2−2Hr . (10)

4) If r > 21 , then H r ([0, 1], L) ⊆ L∞ ([0, 1], L′ ) for some L′ (L, r) > 0, where L∞ ([0, 1], L′ ) = {f ∈ L∞ ([0, 1]) :
kf kL∞ ≤ L′ }.

Moreover, the wavelet construction is useful for the proof of the lower bound in Section V. Let ψhs (t) =
h
2 ψ(2h t − s) for h ∈ N and s ∈ Z. By the construction of ψ in [24] we have kψhs k22 = 1. Let h0 = ⌈log2 (2S +
2

2)⌉ and h ≥ h0 . For s = 1, · · · , 2h−h0 , there exists s′ ∈ Z such that the support of ψh−h0 ,s′ is contained in
8

n h−h o2h−h0
s−1 s 2h−h0 ′ 0
ψs2
 
, 2h−h
2h−h0 0
. Let ψ s = ψ h−h0 ,s′ for such s . Then is a subset of {ψh−h0 ,s′ }s′ ∈Z such that
s=1
h−h0
the support of ψs2 s−1 s
 
is contained in 2h−h 0
, 2h−h 0
.

B. Protocols for Estimating a Parametric Distribution

Suppose that we want to estimate a parametric distribution pW over a finite set W with size k = |W|. The
setting is the same as Section I-C, except that the task is different. To be precise, there are m encoders and the i-th
encoder holds n i.i.d. samples (Wij )nj=1 . Each encoder can send a length l message to help the decoder establish
an estimate p̂W (w), such that the L2 loss E[kp̂W − pW k22 ] is minimized.
The following lemma is essential, which characterizes the optimal error rates (up to logarithmic factors, cf. [8])
for the above distribution estimation problem.

Lemma 2. For the above distribution estimation problem, there exists an interactive protocol ASR(m, n, l, k) such
that for any pW ∈ ∆W , the protocol outputs an estimate p̂W satisfying,
1) if k ≤ n, m(l ∧ k) > 1000k log2 N , then E[kp̂W − pW k22 ] = O k 1

mnl ∨ mn ; 
log n
1’) if k ≤ n, l ≥ log n and m⌊ ⌈logl n⌉ ⌋ ≥ k, then E[kp̂W − pW k22 ] = O kmnl 1
∨ mn ;
k
 
2 2 log( n +1) 1
2) if n < k ≤ (2l − 1) · n, l ≥ 2 and m(l ∧ n) > 2000n log N , then E[kp̂W − pW k2 ] = O ml ∨ mn ;
3) if k > (2l − 1) · n, l ≥ 4 and m(l ∧ n) > 4000n log2 N , then E[kp̂W − pW k22 ] = O k

2l mn .

The cases in 1-3) are achieved by [8], see Theorem 1 therein for the proof. The proof of 1’) can be found in
Appendix A-C.

IV. U PPER B OUNDS

The wavelet-based approximation formula (9) in Section III-A is useful for estimating the function f . Throughout
this section, let K = 2H to simplify the notations. In order to exploit (9), we make the following assumption
regarding the existence of a good sample-wise estimator of each wavelet coefficient fHs .

Assumption 1. Assume X = (T, Y ), where T ∈ [0, 1] and Y ∈ Y. For any H ∈ N and s = 1, ..., K, there
exists an estimator fˆHs (X) = h(Y )φHs (T ) of fHs that is unbiased (E[fˆHs (X)] = fHs ) and sub-exponential with
√ √
parameters ( c1 K, c2 K).

The definitions and related properties of sub-exponential random variables can be found in Appendix A-A. In
many estimation problems, such as the density estimation and regression problems in Section II-C, the construction
of the estimator fˆHs (X) in Assumption 1 is easily seen.
Then we can obtain the upper bound in the following theorem, which is the main goal of this section. The
theorem is proved by the estimation protocol and its error analysis in the following two subsections respectively.
2r
Theorem 4. Under Assumption 1, we have R(m, n, l, r)  (Ness )− 2r+1 Poly(log N ).
9

A. The Layered Estimation Protocol


1
It suffices to let l ≤ n ∧ (mn) 2r+1 , otherwise we can simply discard the additional bits. That is, we consider
Cases 1-4 in the proof of Theorem 1.
The main characteristic of our protocol is that it consists of two layers. The outer layer converts the original
nonparametric distributed estimation problem into a distribution estimation problem. The inner layer estimates the
parametric distribution. This can be achieved by invoking the protocol in Lemma 2 for the distribution estimation
problem. The resolution parameter H of the wavelet approximation is carefully determined, so that the error induced
by inner and outer layers is balanced.
Preparation: Choose the resolution parameters (H, K = 2H ) based on the parameters (m, n, l, r), where H is
the smallest integer such that  1


(2l mn) 2r+2 , for Case 1,


 1
(lm) 2r+1 , for Case 2,





22S+2 · K ≥ lm (11)
, for Case 3,
2000 log2 N





 1
 (lmn) 2r+2


 , for Case 4.
2000 log2 N

1
Let K0 = c3 K 2 log N , where c3 is much larger than c2 (e.g. c3 > 400(r + 1)c2 , cf. Assumption 1). Define the
truncation function
TruncK0 (w) = (w ∧ K0 ) ∨ (−K0 ). (12)

The truncation function and the sample-wise estimation function fˆHs are known to all the encoders and the decoder.
Quantization: For i = 1, ..., m, upon observing Xin = (Tij , Yij )nj=1 , the i-th encoder first computes Sij =
⌊KTij ⌋ and determines NHSij based on Sij . Then for each j = 1, ..., n and s ∈ NHSij , the encoder computes
f˜Hs (Xij )+K0
fˆHs (Xij ), f˜Hs (Xij ) = TruncK0 (fˆHs (Xij )), and QHs (Xij ) = 2K0 . It is easily seen that QHs (Xij ) ∈
[0, 1]. Then it generates an i.i.d. random bit sequence (Vij (s))s∈NHSij of length NHSij = 2S + 2, and each bit
follows the distribution Bern(QHs (Xij )). The sequence Vij is the quantization of (f˜Hs (Xij ))s∈NHSij .
Next, the i-th encoder computes Wij = (Sij , Vij ) ∈ {1, ..., K} × {0, 1}2S+2. Denote the alphabet of Wij by
W = {1, ..., K} × {0, 1}2S+2, then |W| = 22S+2 K. Note that (Wij )i∈[1:m],j∈[1:n] are i.i.d. random variables, and
we denote the distribution of each Wij by pW (w). The i-th encoder holds n i.i.d. samples Win = (Wij )nj=1 .
Estimation of the Parametric Distribution pW : Then the encoders send messages to the decoder for estimation
of the parametric distribution pW following the protocol ASR(m, n, l, |W|) introduced in Lemma 2 and [8]. Let
the estimate of the distribution be p̂W (w).
Decoding and Reconstruction: Based on the estimate of the parametric distribution of Wij , the decoder
reconstructs an estimate f¯H of f as follows.
10

For any s, s′ = 1, ..., H, the decoder computes


  
 X 1
2K  p̂W (s′ , v) − p̂W (s′ ) ,


 0

 2
v:v(s)=1


¯(s′ )
fHs = (13)


 if s ∈ NHs′ ,




0, if s ∈/ NHs′ .

Then it computes

(s )
f¯Hs = f¯Hs ,
X
(14)
s′
and finally the estimate
K
f¯H = f¯Hs φHs .
X
(15)
s=1
Remark 5. The idea of estimating sparse wavelet coefficients through estimating a parametric distribution in
nonparametric estimation problem is inspired by [7], while only the density estimation problem with n = 1 was
considered therein. The general estimation problem defined in Section I-C with n > 1 is substantially more difficult
and require much more effort. There are two major differences.
First, the estimator fˆHs (X) is bounded for the density estimation problem. However, this is not the case for
regression problems and the general estimation problem in Section I-C. To overcome the difficulty, we truncate
fˆHs (X) in the protocol and show the resulting error is negligible, under a sub-exponential condition of fˆHs (X)
satisfied by most common problems.
Second, for the estimation of the parametric distribution, the work [7] takes advantage of the simulate-and-infer
protocol in [26], which is optimal for n = 1 but not for n > 1. Unlike the special case n = 1, we can see from [8]
that the optimal protocols for the distribution estimation problem in Lemma 2 with n > 1 vary across different
parameter regimes. Hence there are many possible choices of the parameter K and the protocol in the outer and
inner layers respectively, and finding the optimal one can be obscure. The main work in this section is to determine
the optimal parameter and protocol for cases 1-4 in the proof of Theorem 1, so that the optimal rate is always
achieved.

B. Error Analysis

The following lemma describes the overall error bound in terms of the inner layer error. See Appendix B-A for
the proof.

Lemma 3. For the estimate f¯H obtained by the protocol in Section IV-A,

E[kf¯H − f k22 ]  K −2r + K log2 N E kp̂W − pW k22 .


 
(16)

Next we bound the inner layer error for cases 1-4 by Lemma 2 and (11). Then the overall error bounds are
evaluated accordingly. We present the sketch here, and detailed verification can be found in Appendix B-B.
22S+2 ·K
For Case 1, the condition of 3) in Lemma 2 is satisfied and we have E[kp̂W − pW k22 ]  2l mn
. Then
r
E[kf¯H − f k22 ]  (2l mn)− r+1 log2 N.
11

2S+2 ·K
log( 2 +1) 1
For Case 2, the condition of 2) in Lemma 2 is satisfied and E[kp̂W − pW k22 ]  n
ml ∨ mn . Then
2r
E[kf¯H − f k22 ]  (lm)− 2r+1 log3 N.

22S+2 ·K 1
For Case 3, the condition of 1) or 1’) in Lemma 2 is satisfied and roughly E[kp̂W − pW k22 ]  mnl ∨ mn .

Then

E[kf¯H − f k22 ]  (lm)−2r log4r N.

22S+2 ·K 1
For Case 4, the condition of 1) in Lemma 2 is satisfied, and we have E[kp̂W − pW k22 ]  mnl ∨ mn . Then
r
E[kf¯H − f k22 ]  (lmn)− r+1 log4r+2 N.

Combining all cases completes the proof of Theorem 4.

Remark 6. From the construction of the protocol and its error analysis, we find a clear correspondence between
the distribution estimation problem in Section III-B and the nonparametric estimation problem considered in this
work (cf. Section I-C). With the help of the outer layer in our protocol, the protocol for the parametric distribution
estimation can be used as an “oracle” prepared for the inner layer. The optimal rate for each case of the nonparamtric
problem is implicitly achieved by a protocol for the distribution estimation problem.
From a high level, we can imagine a “homomorphism” from the distribution estimation problem to the nonpara-
metric estimation problem, and each case of the former is mapped to one case of the latter. We hope this observation
can give more insights to investigate various distributed statistical problems as a whole.

V. L OWER B OUNDS

In this section, let k = 2h−h0 , {ψsk }ks=1 be a subset of {ψh−h0 ,s′ }s′ ∈Z , where the support of ψsk is contained in
[ s−1 s
k , k ], as discussed in Section III-A. To construct multiple hypotheses that are useful for the proof, consider a

finite sieve F (k, C0 , ǫ) defined as


( k
)
−(r+ 21 )
X
fzk : fzk = C0 + ǫk zs ψsk , z k ∈ {−1, 1}k . (17)
s=1

The constant C0 depends on the problem. Specifically, it is 1 for the density estimation problem, 0 for Gaussian
1
regression, 2 for classification and a positive real number for Poisson and heteroskedastic regression in Section II-C.
Let ǫ ∈ (0, 1) be small enough such that, i) F (k, C0 , ǫ) ⊆ H r ([0, 1], L); ii) if C0 > 0, then fzk (t) ∈ [ C20 , 3C2 0 ],
h
∀t ∈ [0, 1]. This can be achieved since we have kψsk kH r  2hr and kψsk k∞  2 2 . To simplify the notation, let
pzk = pfzk for any function fzk ∈ F (k, C0 , ǫ) and correspondingly, Pzk = PX n ∼pnk and Ezk = EX n ∼pnk , where
z z

the meaning will be clear in the context.


Then we make a few assumptions on the sample distribution pzk , which are essential to prove the lower bounds.

Assumption 2. Assume that X = (T, Y ), where T ∈ [0, 1] and Y ∈ Y. For any z k ∈ {±1}k and s = 1, ..., k,
1
pzk (x ∈ [ s−1 s
k , k ]×Y) = k and the conditional distribution pzk (x|x ∈ [ s−1 s
k , k ]×Y) only depends on zs . Specifically,

the distribution pzk admits a decomposition


1
pzk (x) = ps,zs (x). (18)
k
12

for any x = (t, y) with t ∈ [ s−1 s s−1 s


k , k ], where ps,zs (x) is a distribution on [ k , k ] × Y for any s = 1, ..., k and

zs = ±1.

Assumption 3. Let the sample-wise log-likelihood ratio be


 
ps,−zs (x)
Ls,zs (x) , log . (19)
ps,zs (x)
for any x = (t, y) with t ∈ [ s−1 s
k , k ], s = 1, ..., k and zs = ±1. We assume that for X ∼ ps,zs , Ls,zs (X) is

sub-exponential with parameters (ν = C1 · k −2r , β = C2 k −r ) and |E[Ls,zs (X)]| ≤ C3 · k −2r .

Then we have the main theorem of this section, focusing on the lower bound. It is proved in the rest of this
section.
2r
Theorem 5. Under Assumptions 2 and 3, we have R(m, n, l, r)  (Ness · Poly(log N ))− 2r+1 .

A. Information-Theoretic Lower Bounding Methods

We define a prior distribution on H r ([0, 1], L) to be the uniform distribution on the sieve F (k, C0 , ǫ). Let
Z k = {Zs }ks=1 be a sequence of i.i.d. Rademacher random variables with mean 0. Then under the prior distribution,
the function and the sample distribution are f = fZ k and pf = pZ k , respectively. Then we have the following
lemma proved in Appendix C-A.

1
Pk 1
Lemma 4. If k s=1 I(Zs ; B m ) ≤ 2 for some k ∈ N, then R(m, n, l, r)  k −2r .

With the help of Lemma 4, the proof of Theorem 5 is reduced to choosing suitable k and showing the information
Pk
inequality k1 s=1 I(Zs ; B m ) ≤ 21 , and then we obtain that R(m, n, l, r)  k −2r . Methods to prove the inequality
are different for Cases 1-5. The bounds for Case 5 and Case 3 are easy and shown in Appendices C-B and C-C.
The proof for the other three cases need much more efforts, which is the goal of the remaining parts of this section.

B. Proof for the Remaining Cases

First we define the terminal-wise likelihood ratio to be


pnzk ⊙es (xn )
Ls,zk (xn ) , , (20)
pnzk (xn )
where z k ⊙ z ′k = (zs′ · zs′ ′ )ks′ =1 and es = ((−1)1s′ =s )ks′ =1 . It plays a central role in the rest of the proof, since two
Pk
of its properties lead to different kinds of bounds for k1 s=1 I(Zs ; B m ). The following two lemmas describe the
bounds respectively, whose detailed proof and discussions can be found in Section C-D.

Lemma 5. If Ezk [(Ls,zk (X n ) − 1)2 ] ≤ α2 for any s = 1, ..., k and z k ∈ {−1, 1}k , then we have
k
1X 2l mα2
I(Zs ; B m ) ≤ . (21)
k s=1 2k

Lemma 6. If there exists a Boolean function E(xn ) such that


1
i) Pzk [E(X n ) = 0] ≤ δ1 < 2 for any z k ∈ {−1, 1}k ;
13

ii) |Ls,zk (xn ) − 1| ≤ δ2 for any z k ∈ {−1, 1}k , s = 1, ..., k and xn with E(xn ) = 1.
Then we have
k
1X 1 8ml(δ12 + δ22 )
I(Zs ; B m ) ≤ m((log 2)δ12 + δ1 ) + . (22)
k s=1 k

Then we specialize Lemmas 5 and 6 to Cases 1, 2 and 4 and derive the corresponding bounds. The goal is to
bound Ls,zk (X n ) − 1 itself or its second moment. To achieve the goal, the underlying intuition is described as
follows. By Assumption 2, the terminal-wise likelihood ratio Ls,zk (xn ) is related to the sample-wise one in (19)
by
Y
Ls,zk (xn ) = exp (Ls,zs (xj )) , (23)
j:tj ∈[ s−1 s
k ,k]

Hence Ls,zk (X n ) is a product of many independent factors exp(Ls,zs (Xj )), if (Ti )ni=1 is given and Xi = (Ti , Yi ).
By Assumption 3, each of these factors has a small amplitude. If the number of these factors is bounded, then both
Ls,zk (X n ) − 1 and its second moment can be bounded. The number has a clear meaning by Assumption 2. It is
the number of balls in the s-th bin (denoted by Vs ) in the classic balls and bins model where n balls are thrown
into k bins at random (see Section A-D for details). By bounding Vs , the goal can be achieved and the whole proof
is completed. We sketch the proof for each case in the following, and details can be found in Appendices C-E, C-F
and C-G respectively.
r
1) Proof for Case 1: The goal is to show R(m, n, l, r)  (2l mn)− r+1 for Case 1. Note that the expectation of
n
each exp(2Ls,zs (X)) is roughly exp(2ν 2 ) ≈ 1 + 2ν 2 by Assumption 3 and the number Vs is k on average. Then
1
by (23), for k ≍ (mn2l ) 2r+2 , we can show that
n 
Ezk [(Ls,zk (X n ) − 1)2 ] = O ν2 .
k
Hence Lemma 5 implies that
k  l 
1X 2 mn
I(Zs ; B m ) = O .
k s=1 k 2r+2
Finally, by Lemma 4 we complete the proof.
2r
2) Proof for Case 2: We need to show R(m, n, l, r)  (lm log2 m)− 2r+1 for Case 2. With the choice k ≍
1
(ml log2 m) 2r+1 , we want to obtain a bound for Ls,zk (X n ) − 1 with a large probability. Instead, we turn to bound
log Ls,zk (X n ). By Lemma 8, we obtain a uniform bound for all these numbers Vs ,

max Vs  log m
1≤s≤k

with probability 1 − m−100 . Based on this event, by (23) and Assumption 3 we can further show that

| log Ls,zk (X n )|  β log m

with probability 1 − m−100. Thus we can choose δ1 = O(m−100 ) and δ2 = O(β log m) in Lemma 6, which implies
that
k
ml log2 m
 
1X m
I(Zs ; B ) = O .
k s=1 k 2r+1
Finally, by Lemma 4 we complete the proof.
14

r
3) Proof for Case 4: We need to show R(m, n, l, r)  (lmn)− r+1 log−(2r+3) n for Case 4. With the choice
1 2r+3
k ≍ (mnl) 2r+2 log 2r+1 n, we aim at bounding log Ls,zk (X n ) similar to the previous case. By Lemma 8, we have
n log3 n
max Vs 
1≤s≤k k
with probability 1 − n−100 . Based on this event, by (23) and Assumption 3 we can further show that
s
ν 2 n log4 n
| log Ls,zk (X n )| 
k
q 
−100 −100(r+1) ν 2 n log4 n
with probability 1 − n . Thus we can choose δ1 = O(n ) and δ2 = O k in Lemma 6,
which implies that
k
mnl log4 m
 
1X
I(Zs ; B m ) = O .
k s=1 k 2r+2
Finally, by Lemma 4 we complete the proof.

A PPENDIX A
P RELIMINARY D EFINITIONS AND R ESULTS

A. Sub-Gaussian and Sub-Exponential Random Variables

We give definitions and properties of sub-Gaussian and sub-exponential random variables that are useful for this
work. More details can be found in [22].
A random variable X with mean µ = E[X] is sub-exponential with parameters (ν, β) if
ν 2 λ2 1
E[eλ(X−µ) ] ≤ e 2 , ∀|λ| < .
β
Note that a sub-exponemtial random has finite moments of any order.
1
A random variable X is sub-Gaussian with parameter σ if it is subexponential with parameter (σ, 0), where 0

is interpreted as ∞. Then a random vector X n is called sub-Gaussian with parameter σ if for any unit vector v n ,
Pn
j=1 vj Xj is sub-Gaussian with parameter σ.
b−a
If X ∈ [a, b], then it is sub-Gaussian with parameter 2 . See the discussion after Proposition 2.5 in [22] for the
n
proof. Let X be a random vector, if Xj for j = 1, ..., n are independent and each Xj is a sub-Gaussian random
variable with parameter σ, then it is easy to verify that X n is a sub-Gaussian random vector with parameter σ.
Let Xj , j = 1, ..., n be i.i.d. random variables and Xj is sub-exponential with parameters (νj , bj ). Then we can
qP
verify that nj=1 aj Xj is sub-exponential with parameters ( n 2 2
P
j=1 aj νj , max1≤j≤n bj ). The following lemma

characterizes the tail bound for a sub-exponential random variable, which is by Proposition 2.9 in [22].

Lemma 7. Let X be a sub-exponential random variable with mean µ and parameters (ν, β). Then

t2 ν2
 
2 exp − , 0 ≤ t ≤ ,


2ν 2

 β
P[|X − µ| ≥ t] ≤ (24)
ν2
 
2 exp − t ,

t> .

2β β

15

B. Preliminaries on divergences between distributions

Let p1 (u) and p2 (u) be two distributions over U, the KL divergence is defined by
 
p1 (U )
DU (p1 (u)||p2 (u)) = Ep1 log .
p2 (U )
The χ2 divergence is defined by
" 2 #
p1 (U )
χ2U (p1 (u)||p2 (u)) = Ep2 −1 .
p2 (U )
By the convexity of the logarithm function, it is easy to see that

DU (p1 (u)||p2 (u)) ≤ χ2U (p1 (u)||p2 (u)). (25)

C. Proof of Lemma 2

It remains to show the item 1’).


The Estimation Protocol

Let l′ = ⌊ ⌈logl n⌉ ⌋ ∧ k and m′ = ⌊ ml ′
k ⌋. Each encoder divides its l bits into l frames, and there are ⌈log n⌉ bits

in each frame. Then for each w ∈ W, each frame is sufficient for encoding the number of w among the n samples
at each encoder. We can allocate m′ k frames to all the w ∈ W, such that the frames held by the same encoder are
allocated to different w, and exactly m′ frames are allocated to each w ∈ W.
Each encoder then encodes the number of w among its n samples to each frame, where the frame is allocated
to w ∈ W. Then it connects all its frames and sends them to the decoder. For each w ∈ W, the decoder computes
num(w) by summing up the number of w, where each number is encoded in one of the m′ frames allocated to
num(w)
w. Then it computes p̂W (w) = m′ n and outputs the estimate p̂W .
Error Analysis
For each w ∈ W it is easy to see that num(w) is the sum of m′ n i.i.d. random variables, and each of them
follows the distribution Bern(pW (w)). So we have E[|p̂W (w) − pW (w)|2 ] = O( pW (w)(1−p
m′ n
W (w))
) and E[kp̂W (w) −
log n
pW (w)k22 ] = O( m1′ n ) = O( kmnl ∨ 1
mn ), completing the proof.

D. Analysis of the Balls and Bins Model

Suppose there are n balls and k bins and each ball is independently put into a bin at random. Let Vs be the
number of balls in the s-th bin. Then it is a sum of n i.i.d. Bern( k1 ) random variables. We are interested in the
maximal number of balls over all k bins, namely max1≤s≤k Vs , which is related to the strong data processing
constant in Section V. We can obtain the following inequalities characterizing cases k ≥ n and n ≥ k respectively.

Lemma 8. 1) Let k ≥ n and c ∈ N be sufficiently large, then


   c
P max Vs ≥ c + 1 ≤ k exp − . (26)
1≤s≤k 2
2) Let n ≥ k and c ∈ N be sufficiently large, then
 
cn  cn 
P max Vs ≥ , ∀s = 1, ..., k ≤ k exp − . (27)
1≤s≤k k 8k
16

Proof: By the Chernoff’s bound, we have


c′2 n
 
h ni
P Vs ≥ (1 + c′ ) ≤ exp − , ∀c′ > 0. (28)
k (2 + c′ )k
Then we can derive the desired inequalities as follows.
Proof of 1: By letting c′ = k
n (1 + c) − 1 in (28), we have
k 2 !
n (1 + c) − 1 n
P [Vs ≥ c + 1] ≤ exp −  k 
n (1 + c) + 1 k
k
2 !
nc  n
 c
≤ exp − 2k = exp − .
n c k
2
By applying the union bound, we complete the proof.
Proof of 2: By letting c′ = c − 1 in (28), we have
(c − 1)2 n
 
h cn i
P Vs ≥ ≤ exp −
k (c + 1)k
 c 2 
( ) n  cn 
≤ exp − 2 = exp − .
2ck 8k
The conclusion is then implied by the union bound.

A PPENDIX B
E RROR A NALYSIS FOR THE P ROTOCOL IN S ECTION IV

A. Proof of Lemma 3

The overall error can be bounded as follows.

E[kf¯H − f k22 ] ≤2 E[kf¯H − f H k22 ] + E[kf H − f k22 ]




K
E[|fHs − f¯Hs |2 ] + 2−2Hr
X

s=1
K
E[|fHs − f¯Hs |2 ] + K −2r ,
X
=
s=1

where the second inequality is because (φHs )K


is an orthonormal system and (10) in Lemma 1 holds, and the
s=1

last equality is by K = 2 . Then it suffices to bound the first term s=1 E[|fHs − f¯Hs |2 ]. By the bias-variance
H
PK

decomposition, we have
K K h K
i2 X  h i 2
E[|fHs − f¯Hs |2 ] = ˜ ¯ ˜
X X
fHs − E fHs (X) + E fHs − E fHs (X) . (29)
s=1 s=1 s=1
It suffices to bound two terms on the right hand side of (29) respectively.
For the first term, we have
h i h i
fHs − E f˜Hs (X) = E fˆHs (X) − f˜Hs (X)
h  i
≤E 1|fˆHs (X)|>K0 fˆHs (X) − K0
Z ∞ h i
= P |fˆHs (X)| > K0 + a da
0
17

√ √
By Assumption 1, fˆHs (X) is sub-exponential with parameters ( c1 K, c2 K). Hence fˆHs (X) has finite second
moment, i.e., for some Σ > 0,
E[|fˆHs (X)|2 ] ≤ Σ2 . (30)

Note that
1 1 1 c1 K
K0 = c3 K 2 log N ≍ K 2 log N  K 2 ≍ √
c2 K
and by (30),
q
E[|fˆHs (X)|] ≤ E[|fˆHs (X)|2 ] ≤ Σ = O(1).

Then by Lemma 7 we have


Z ∞ h i
P |fˆHs (X)| > K0 + a da
0
Z ∞    

K0 + a 1 K0
≤ 2 exp − 1 da ≍ K 2 exp − 1  KN −100 ,
0 4c2 K 2 4c2 K 2
1
where the last step is by the choice of c3 with c3 > 400(r + 1)c2 , and then K0 = c3 K 2 log N ≥ 100(r + 1) log N ·
1
(4c2 K 2 ). Then we have
K h i 2 3
fHs − E f˜Hs (X)
X
 K 2 N −100(r+1)  K −2r . (31)
s=1

since we have K  N by the choice of K in (11).


˜
Now consider the second term. Since V (s)|X ∼ Bern(QHs (X)) and QHs (X) = fHs (X)+K 2K0
0
, then
     
X X 1 X X 1
2K0  pW (s′ , v) − pW (s′ ) = 2K0 E 1S=s′  P[V = v|X] − 
2 2
s′ :s∈NHs′ v:v(s)=1 s′ :s∈NHs′ v:v(s)=1
   
X 1
= 2K0 E 1S=s′ P[V (s) = 1|X] −
2
s′ :s∈NHs′
   
X 1
= E 1S=s′ · 2K0 QHs (X) −
2
s′ :s∈NHs′
X  h i
= E 1S=s′ f˜Hs (X)
s′ :s∈NHs′
h i
=E 1s∈NHS f˜Hs (X)
h i
=E f˜Hs (X) .
18

where the last equality is because f˜Hs (X) = fˆHs (X) = 0 for s ∈
/ NHS . Then by (13) and (14), we have
h i2
f¯Hs − E f˜Hs (X)
    2
X X 1 X X 1
= 2K0  p̂W (s′ , v) − p̂W (s′ ) − 2K0  pW (s′ , v) − pW (s′ )
2 2
′s :s∈NHs′ v:v(s)=1 ′ s :s∈NHs′ v:v(s)=1
  2
X X X
′ ′ ′ ′
=K02  (p̂W (s , v) − pW (s , v)) + (pW (s , v) − p̂W (s , v))
s′ :s∈NHs′ v:v(s)=1 v:v(s)=0

2
X X
≤K02 (2S + 2)22S+2 |p̂W (s′ , v) − pW (s′ , v)| ,
s′ :s∈NHs′ v

where the last step is by the Cauchy-Schwarz inequality and |NHs′ | ≤ 2S + 2. By taking the summation, we have
K h i 2
f¯Hs − E f˜Hs (X)
X

s=1
K
2
X X X
≤K02 (2S + 2)22S+2 |p̂W (s′ , v) − pW (s′ , v)|
s′ =1 s:s∈NHs′ v
K
2
X X
≤K02 (2S + 2)2 22S+2 |p̂W (s′ , v) − pW (s′ , v)| .
s′ =1 v
Taking the expectation, we have
K  h i 2
f¯Hs − E f˜Hs (X)
X
E  K log2 N kp̂W − pW k22 . (32)
s=1
Combining (29), (31) and (32), we have
K
E[|fHs − f¯Hs |2 ]  K −2r + K log2 N E kp̂W − pW k22 ,
X  
s=1
completing the proof.

B. Detailed Error Analysis for Theorem 4


1 1
(2l mn) 2r+2 (2l mn) 2r+2
1) Error Analysis for Case 1: Recall that 22S+2 ≤ K < 22S+1 , m ≥ n2r+1 and 1 ≤ l ≤
1
1 m
2r+1 log n2r+1 . Then we can verify that |W| = 22S+2 · K ≥ (2l mn) 2r+2 > (2l − 1) · n and m(l ∧ n) ≥ m >
22S+2 ·K
4000n log2 N . Hence the condition of 3) in Lemma 2 is satisfied, and we have E[kp̂W − pW k22 ]  2l mn .

Then by (16) in Lemma 3 we have


22S+2 · K r
E[kf¯H − f k22 ]  K −2r + K log2 N l
 (2l mn)− r+1 log2 N.
mn2
1 1
(lm) 2r+1 (lm) 2r+1
2) Error Analysis for Case 2: Recall that 22S+2
≤K < 22S+1
, and it suffices to consider the case for
2 m n2r+1 1

2r+1 log n2r+1 ∨ m ≤ l ≤ n. Then we can verify that (2 − 1) · n ≥ |W| = 22S+2 · K ≥ (lm) 2r+1 ≥ n and
l

m(l ∧n) = ml > 2000n log2 N . Hence the condition of 2) in Lemma 2 is satisfied, and we have E[kp̂W − pW k22 ] 
2S+2 ·K
log( 2 +1) 1
n
ml ∨ mn .

Then by (16) in Lemma 3 we have


2S+2
!
·K
log( 2 + 1) 1 2r
E[kf¯H − f k22 ]  K −2r + K log2 N n
∨  (lm)− 2r+1 log3 N.
ml mn
19

3) Error Analysis for Case 3: First let m > 22S+2 · 2000 log2 N . Recall that lm
22S+2 ·2000 log2 N
≤ K <
1
lm n 2r+1
22S+2 ·1000 log2 N
, n > m2r+1 and 1 ≤ l ≤ m . Then we can verify that |W| = 22S+2 · K ≤ lm ≤ n
and m(l ∧ |W|) = ml > 1000|W| log2 N . Hence the condition of 1) in Lemma 2 is satisfied, and we have
22S+2 ·K 1
E[kp̂W − pW k22 ]  mnl ∨ mn .

By (16) in Lemma 3 and n ≥ (ml)2r+1 , then we have


 2S+2 
¯ 2 2 ·K 1
H 2
E[kf − f k2 ]  K −2r
+ K log N ∨  (lm)−2r log4r N.
mnl mn
Then let l > 22S+2 · 2000 log2 N . Recall that lm
22S+2 ·2000 log2 N
≤K< lm
22S+2 ·1000 log2 N
, n > m2r+1 and 1 ≤ l ≤
1
n 2r+1
m . In this case we can verify that l > log n, |W| = 22S+2 · K ≤ lm ≤ n and m⌊ ⌈logl n⌉ ⌋ ≥ ml
2 log n ≥ |W|.
22S+2 ·K log(22S+2 ·K) 1
Hence the condition of 1’) in Lemma 2 is satisfied, and we have E[kp̂W − pW k22 ]  mnl ∨ mn .

By (16) in Lemma 3 and n ≥ (ml)2r+1 , then we have


 2S+2
· K log(22S+2 · K)

¯ 2 2 1
H 2
E[kf − f k2 ]  K −2r
+ K log N ∨  (lm)−2r log4r N.
mnl mn
If m ≤ 22S+2 · 2000 log2 N and l ≤ 22S+2 · 2000 log2 N , then ml  log4 N and the desired bound is vacuous.
1 1
(lmn) 2r+2 (lmn) 2r+2
4) Error Analysis for Case 4: Recall that 22S+2 ·2000 log2 N
≤ K < 22S+2 ·1000 log2 N
, and it suffices to let
1
2r+1 1 1
n 2r+1 n
m ∨1 ≤ l ≤ m ∧ ( 2000mn
log2 N
) 2r+1 . Then we can verify that |W| = 22S+2 · K ≤ (lmn) 2r+2 ≤ n and
1
m(l ∧ |W|) = ml ≥ (lmn) 2r+2 > 1000|W| log2 N . Hence the condition of 1) in Lemma 2 is satisfied, and we
22S+2 ·K 1
have E[kp̂W − pW k22 ]  mnl ∨ mn .

By (16) in Lemma 3, then we have


22S+2 · K
 
1 r
E[kf¯H − f k22 ]  K −2r + K log2 N ∨  (lmn)− r+1 log4r+2 N.
mnl mn
This completes the proof of Theorem 4.

A PPENDIX C
P ROOF OF THE L OWER B OUNDS IN S ECTION V

A. Proof of Lemma 4

Let P be an (m, n, l)-protocol defined in Section I-C, then we have


X 1
sup E[kfˆP − f k22 ] ≥ E[kfˆP − fzk k22 ]
f ∈F k
2k
z

=E[kfˆP − fZ k k22 ].

First, we convert the estimation problem into a testing problem by the following procedure. Let

Ẑ k = arg min kfzk − fˆP k2 .


zk

Then we have

kfẐ k − fZ k k2 ≤kfˆP − fẐ k k2 + kfˆP − fZ k k2

≤2kfˆP − fZ k k2 .
20

Hence we have 4E[kfˆP − fZ k k22 ] ≥ E[kfẐ k − fZ k k22 ].


Next we establish lower bound for the average testing error. for any z k , z ′k ∈ {−1, 1}k , since (ψsk )ks=1 have
pairwise disjoint supports, we have
k
X
kfzk − fz′k k22 =ǫ2 k −(2r+1) kψsk k22 · 1zs 6=zs′
s=1
k
X
=ǫ2 k −(2r+1) 1zs 6=zs′ .
s=1

Hence we have
1
E[kfˆP − fZ k k22 ] ≥ E[kfẐ k − fZ k k22 ]
4
k (33)
1 X
= ǫ2 k −(2r+1) P[Ẑs 6= Zs ].
4 s=1

Since Z k − X mn − B m − Ẑ k is a Markov chain, similar to Lemma 10 in [9], we have


k k
!
1X m 1X
I(Zs ; B ) ≥ 1 − h P[Ẑs 6= Zs ] , (34)
k s=1 k s=1
1 Pk
where h(p) = −p log2 p−(1−p) log2 (1−p) is the binary entropy function. By the assumption that k s=1 I(Zs ; B m ) ≤
1
2, then by (34) we have
k
1X 1
P[Ẑs 6= Zs ] ≥ .
k s=1 10
Thus by (33),
E[kfˆP − fZ k k22 ]  k −2r .

Then we have R(m, n, l, r)  k −2r , completing the proof.

B. Proof of the Centralized Bound for Case 5


2r
Consider the centralized bound R(m, n, l, r)  (mn)− 2r+1 for Case 5. Note that the bound and the following
proof is still valid for all the other cases, though it is not tight except in Case 5. It can be immediately obtained
1
by letting k = (100mnC3 ) 2r+1 in the following lemma.

1 Pk mnC3
Lemma 9. Under Assumption 2 and 3, we have k s=1 I(Zs , B m ) ≤ 2k2r+1 .

Lemma 9 is derived from Assumption 2 and 3 as follows.


Proof: Let es = {(−1)δs′ s }ks′ =1 , s = 1, ..., k, where δs′ s is equal to 1 if s′ = s and 0 otherwise. Let ⊙ be
the element-wise product of sequences, i.e., {zs }ks=1 ⊙ {zs′ }ks=1 = {zs zs′ }ks=1 . Then z k ⊙ es is obtained by only
flipping the sign of zs in z k .
21

By the Markov chain Z − X mn − B m and the data processing inequality, we have

I(Zs ; B m ) ≤I(Zs ; X mn )
  
mn 1 mn 1 mn
≤E DX mn pZ k (x )|| pZ k (x ) + pZ k ⊙es (x )
2 2
1
≤ E [DX mn (pZ k (xmn )||pZ k ⊙es (xmn ))]
2
1
= E [DX mn (pZ k ⊙es (xmn )||pZ k (xmn ))]
2
mn
= E [DX (pZ k ⊙es (x)||pZ k (x))]
2
mn h i
= E D[ s−1 , s ]×Y (ps,−Zs (x)||ps,Zs (x))
2k k k

mn  
= E Eps,−Zs [−Ls,−Zs (X)|Zs ]
2k
mnC3
≤ 2r+1 ,
2k
where the second and the third inequality is due to the convexity of KL divergence, the last equality is due
Pk
to Assumption 2 and the last inequality is due to Assumption 3. Then we have k1 s=1 I(Zs , B m ) ≤ 2k mnC3
2r+1 ,

completing the proof.

C. Proof for Case 3

We show that R(m, n, l, r)  (lm)−2r for the case 3. Since Zs , s = 1, ..., k are i.i.d. random variables, we have
I(Zs ; Z1:s−1 ) = 0. Then by the chain rule, we have
k
X k
X
I(Zs ; B m ) ≤ I(Zs ; B m , Z1:s−1 )
s=1 s=1
k
X
= I(Zs ; B m |Z1:s−1 )
s=1

=I(Z k ; B m )

≤H(B m ) ≤ ml.

1 Pk
By letting k = 100ml, we have k s=1 I(Zs ; B m ) ≤ 21 . This combined with Lemma 4 completes the proof.

D. Technical Bounds by the Terminal-Wise Likelihood Ratio

We establish two technical lemmas to prove the remaining bounds. Recall that the terminal-wise likelihood ratio
(for testing zs ) is defined to be
pnzk ⊙es (xn )
Ls,zk (xn ) = . (35)
pnzk (xn )
Then by Assumption 2, the terminal-wise likelihood ratio Ls,zk (xn ) can be written as
n
Y pzk ⊙es (xj ) Y ps,−zs (xj )
Ls,zk (xn ) = = ,
j=1
pzk (xj ) ps,zs (xj )
j:tj ∈[ s−1 s
k ,k]
22

and hence
X
log(Ls,zk (xn )) = Ls,zs (xj ). (36)
j:tj ∈[ s−1 s
k ,k]

In the following two sections, we recall Lemmas 5 and 6 and give their proof respectively.
1
Pk
1) Exponential Bound: If Ezk [(Ls,zk (X n )−1)2 ] has an upper bound, then we can get an bound for k s=1 I(Zs ; B m )
that is exponential in l as follows.

Lemma 10. If Ezk [(Ls,zk (X n ) − 1)2 ] ≤ α2 for any s = 1, ..., k and z k ∈ {−1, 1}k , then we have
k
1X 2l mα2
I(Zs ; B m ) ≤ . (37)
k s=1 2k

Proof: Similar to the proof of the centralized bound in Section C-B, we start by observing that

I(Zs ; B m )
  
m 1 m 1 m
≤E D{0,1}ml pZ k (b )|| pZ k (b ) + pZ k ⊙es (b )
2 2
1  (38)
≤ E D{0,1}ml (pZ k ⊙es (bm )||pZ k (bm ))

2
m
1X  
= E D{0,1}l (pZ k ⊙es (bi |B1:i−1 )||pZ k (bi |B1:i−1 ))
2 i=1
Note that

pZ k (bi |B1:i−1 ) = EZ k [p(bi |Xin , B1:i−1 )] = EZ k [p(bi |X n , B1:i−1 )]

and

pZ k ⊙es (bi |B1:i−1 ) = EZ k ⊙es [p(bi |X n , B1:i−1 )] = EZ k [Ls,zk (X n )p(bi |X n , B1:i−1 )].

Then by (25) in Appendix A-B, we have

D{0,1}l (pZ k ⊙es (bi |B1:i−1 )||pZ k (bi |B1:i−1 ))

≤χ2{0,1}l (pZ k ⊙es (bi |B1:i−1 )||pZ k (bi |B1:i−1 ))


(39)
n n
2
X EZ k [(Ls,zk (X ) − 1) · p(bi |X , B1:i−1 )]
= .
EZ k [p(bi |X n , B1:i−1 )]
bi ∈{0,1}l

Note that Ls,zk (X n ), s = 1, ..., k are independent, and

Ezk [Ls,zk (X n ) − 1] = 0,
(40)
Ezk [(Ls,zk (X n ) − 1)(Ls′ ,zk (X n ) − 1)] = 0, ∀s 6= s′ .
By (40), {1, L1,zk (X n ) − 1, L2,zk (X n ) − 1, ..., Lk,zk (X n ) − 1} is an orthogonal system. Then by the Bessel
inequality and Ezk [(Ls,zk (X n ) − 1)2 ] ≤ α2 for any s = 1, ..., k,
k
X 2
EZ k [(Ls,zk (X n ) − 1) · p(bi |X n , B1:i−1 )]
s=1 (41)
≤α2 EZ k [p(bi |X n , B1:i−1 )2 ] ≤ α2 EZ k [p(bi |X n , B1:i−1 )],
23

where the last inequality is since p(bi |X n , B1:i−1 ) ∈ [0, 1].


Combining Equations (38), (39) and (41), we have
k
1X 2l mα2
I(Zs ; B m ) ≤ ,
k s=1 2k
completing the proof.
2) Polynomial Bound: if Ls,zk (X n ) is bounded with large probability, then we can get a bound which is a
polynomial of l, summarized as the following lemma.

Lemma 11. If there exists some Boolean function E(xn ) such that,
1
1) Pzk [E(X n ) = 0] ≤ δ1 < 2 for any z k ∈ {−1, 1}k ;
2) |Ls,zk (xn ) − 1| ≤ δ2 for any z k ∈ {−1, 1}k , s = 1, ..., k and xn with E(xn ) = 1.
Then we have
k
1X 1 8ml(δ12 + δ22 )
I(Zs ; B m ) ≤ m((log 2)δ12 + δ1 ) + . (42)
k s=1 k

The nature of Lemma 6 is a strong data processing inequality with the strong data processing constant 8(δ12 + δ22 ).
To give a tight bound of the constant sufficient for deriving the bound for k1 ks=1 I(Zs ; B m ), we need to bound
P

δ1 and δ2 simultaneously. The task can be paraphrased as finding a tight bound of Ls,zk (X n ) − 1 that holds with
sufficiently large probability. Such a problem is solved by relating it to the balls and bins model, detailed in the
next subsection.

Remark 7. The idea behind Lemma 6 and its proof is similar to Theorem 2 in [10], that is to obtain a tight strong data
processing constant. The major improvements of Lemma 6 is that it admits an irregular event {E(X n ) = 0} with
small probability, while such an event is not included in Theorem 2 in [10]. This makes Lemma 6 more flexible to
use especially for various regression problems (e.g. the nonparametric Gaussian regression problem in Example 2),
since in these problems sub-Gaussian or boundedness condition of the terminal-wise likelihood ratio Ls,zk (X n ) is
not satisfied. To overcome the difficulty, we establish the bound of Ls,zk (X n ) on the regular event {E(X n ) = 1}
and then argue that the effect of the irregular event {E(X n ) = 1} with small probability is limited in Lemma 5.

Remark 8. In the work [13], similar bound (18) therein for information measures is simpler and makes further
analysis easier. However, the proof therein depends heavily on the symmetry of the Gaussian random variable,
which clearly does not hold for the general estimation problems in this work. Hence we have to handle the small
error probability more carefully. This also explains the reason that the work [13] obtains a tight bound for the
Gaussian sequence model, but it is much harder to eliminate the polynomial factor of log N for the problem here.

Proof: By the chain rule of the mutual information, we have


m
X
I(Zs ; B m ) = I(Zs ; Bi |B1:i−1 ).
i=1
24

For each term, by the assumption 1), we have

I(Zs ; Bi |B1:i−1 ) ≤ I(Zs ; Bi , E(Xin )|B1:i−1 )

=I(Zs ; E(Xin )|B1:i−1 ) + I(Zs ; Bi |B1:i−1 , E(Xin ))

≤H(E(Xin )) + I(Zs ; Bi |B1:i−1 , E(Xin ))

=H(E(Xin )) + P[E(Xin ) = 0] · I(Zs ; Bi |B1:i−1 , E(Xin ) = 0) + P[E(Xin ) = 1] · I(Zs ; Bi |B1:i−1 , E(Xin ) = 1)

≤h(δ1 ) + δ1 + I(Zs ; Bi |B1:i−1 , E(Xin ) = 1).

It suffices to bound the last term above. Before doing that, we first define the conditional terminal-wise likelihood
ratio for any xn with E(xn ) = 1 to be
pnzk ⊙es (xn |E(xn ) = 1)
Ls,zk (xn |E(xn ) = 1) = , (43)
pnzk (xn |E(xn ) = 1)
Then we can derive the following useful bound for Ls,zk (xn |E(xn ). By the assumptions 1) and 2),

Pzk [E(X n ) = 1] pnzk ⊙es (xn )


Ls,zk (xn |E(xn ) = 1) − 1 = · −1
Pzk ⊙es [E(X n ) = 1] pnzk (xn )
Pzk [E(X n ) = 1] Pzk [E(X n ) = 1]
≤ − 1 + · Ls,zk (xn ) − 1
Pzk ⊙es [E(X n ) = 1] Pzk ⊙es [E(X n ) = 1] (44)
1 1
≤ −1 + · δ2
1 − δ1 1 − δ1
≤2(δ1 + δ2 ).
Now consider the term I(Zs ; Bi |B1:i−1 , E(Xin ), note that

I(Zs ; Bi |B1:i−1 , E(Xin ) = 1)


1 
≤ E D{0,1}l (pZ k ⊙es (bi |B1:i−1 , E(Xin ) = 1)||pZ k (bi |B1:i−1 , E(Xin ) = 1))

2
1 
= E D{0,1}l (pZ k ⊙es (bi |B1:i−1 , E(X n ) = 1)||pZ k (bi |B1:i−1 , E(X n ) = 1)) .

2
In the above equation, further note that

pZ k (bi |B1:i−1 , E(X n ) = 1) = EpZ k (·|E(xn)=1) [p(bi |X n , B1:i−1 )]

and

pZ k ⊙es (bi |B1:i−1 , E(X n ) = 1) = EpZ k (·|E(xn)=1) [Ls,zk (X n |E(X n ) = 1)p(bi |X n , B1:i−1 )].

Then by (25) in Appendix Section A-B, we have

D{0,1}l (pZ k ⊙es (bi |B1:i−1 , E(X n ) = 1)||pZ k (bi |B1:i−1 , E(X n ) = 1))

≤χ2{0,1}l (pZ k ⊙es (bi |B1:i−1 , E(X n ) = 1)||pZ k (bi |B1:i−1 , E(X n ) = 1))
 2
X EpZ k (·|E(xn )=1) [(Ls,zk (X n |E(X n ) = 1) − 1)p(bi |X n , B1:i−1 )]
=
l
EpZ k (·|E(xn)=1) [p(bi |X n , B1:i−1 )]
bi ∈{0,1}
25

Combining these results, we can obtain that


k
1X
I(Zs ; B m )
k s=1
  2 
1 Xm
 X Xk EpZ k (·|E(xn )=1) [(Ls,zk (X n |E(X n ) = 1) − 1)p(bi |X n , B1:i−1 )] 
≤m(h(δ1 ) + δ1 ) + E .
2k i=1 l s=1
EpZ k (·|E(xn)=1) [p(bi |X n , B1:i−1 )]
bi ∈{0,1}
(45)
The following transportation lemma is useful for bounding the last term. It can be proved by well-known arguments
(cf. Lemma 3 in [10]). The definition and related properties of sub-Gaussian random variables can be found in
Appendix A-A.

Lemma 12 (Transportation Lemma). Let P and Q be two measures on a probability space U and P ≪ Q. U is
a sub-Gaussian random vector with parameter σ under the measure P . Then we have

kEQ [U ]k22 ≤ 2σ 2 D(Q||P ). (46)

Now let $X^n$ consist of i.i.d. random variables, where $X^n$ is generated following the distribution $p_{Z^k}(\cdot \mid E(x^n) = 1)$. Then $L_{s,z^k}(X^n \mid E(X^n) = 1)$, $s = 1,\dots,k$, are independent. By the discussion in Appendix A-A, since each $L_{s,z^k}(X^n \mid E(X^n) = 1) - 1$ is bounded by $\sigma = 2(\delta_1 + \delta_2)$ by (44), it is sub-Gaussian with parameter $\sigma$. Hence $(L_{s,z^k}(X^n \mid E(X^n) = 1) - 1)_{s=1}^k$ is a sub-Gaussian random vector with parameter $\sigma$. Applying the transportation lemma to $U = (L_{s,z^k}(X^n \mid E(X^n) = 1) - 1)_{s=1}^k$, $P = p_{Z^k}(\cdot \mid E(x^n) = 1)$ and $\frac{dQ}{dP} = \frac{p(b_i \mid \cdot, B_{1:i-1})}{\mathbb{E}_P[p(b_i \mid \cdot, B_{1:i-1})]}$, we have
$$\sum_{s=1}^k \frac{\Big(\mathbb{E}_{p_{Z^k}(\cdot\mid E(x^n)=1)}\big[(L_{s,z^k}(X^n\mid E(X^n)=1) - 1)\, p(b_i\mid X^n, B_{1:i-1})\big]\Big)^2}{\mathbb{E}_{p_{Z^k}(\cdot\mid E(x^n)=1)}\big[p(b_i\mid X^n, B_{1:i-1})\big]} \le 8(\delta_1 + \delta_2)^2\, \mathbb{E}_{p_{Z^k}(\cdot\mid E(x^n)=1)}\Bigg[p(b_i\mid X^n, B_{1:i-1})\log\Bigg(\frac{p(b_i\mid X^n, B_{1:i-1})}{\mathbb{E}_{p_{Z^k}(\cdot\mid E(x^n)=1)}[p(b_i\mid X^n, B_{1:i-1})]}\Bigg)\Bigg].$$
This combined with (45) implies that
\begin{align*}
\frac{1}{k}\sum_{s=1}^k I(Z_s; B^m) &\le m\big(h(\delta_1) + \delta_1\big) + \frac{4(\delta_1 + \delta_2)^2}{k}\sum_{i=1}^m I(B_i; X_i^n \mid B_{1:i-1}, E(X^n) = 1) \\
&\le m\big((\log 2)\delta_1^2 + \delta_1\big) + \frac{8(\delta_1^2 + \delta_2^2)}{k}\sum_{i=1}^m H(B_i) \\
&\le m\big((\log 2)\delta_1^2 + \delta_1\big) + \frac{8ml(\delta_1^2 + \delta_2^2)}{k},
\end{align*}
completing the proof.

E. Detailed Proof for Case 1

By Assumption 3, since $\beta = O(k^{-r})$ so that $\frac{1}{\beta} \gg 2$, we have
$$\mathbb{E}_{p_{s,z_s}}[\exp(2L_{s,z_s}(X))] \le \exp(2\nu^2).$$

Then by (23) and $1 + x \le e^x \le 1 + 2x$ for $0 \le x \le 1$,
\begin{align*}
\mathbb{E}_{z^k}\big[(L_{s,z^k}(X^n))^2\big] &= \mathbb{E}_{z^k}\Bigg[\exp\Bigg(\sum_{j: T_j \in [\frac{s-1}{k},\frac{s}{k}]} 2L_{s,z_s}(X_j)\Bigg)\Bigg] \\
&= \mathbb{E}\Big[\big(\mathbb{E}_{p_{s,z_s}}[\exp(2L_{s,z_s}(X))]\big)^{\sum_{j=1}^n \mathbf{1}_{T_j \in [\frac{s-1}{k},\frac{s}{k}]}}\Big] \\
&\le \mathbb{E}\Bigg[\exp\Bigg(2\nu^2\sum_{j=1}^n \mathbf{1}_{T_j \in [\frac{s-1}{k},\frac{s}{k}]}\Bigg)\Bigg] \\
&= \sum_{r'=0}^n \binom{n}{r'}\Big(\frac{1}{k}\Big)^{r'}\Big(1-\frac{1}{k}\Big)^{n-r'}\exp(2\nu^2 r') \\
&= \Big(1 + \frac{1}{k}\big(\exp(2\nu^2) - 1\big)\Big)^n \\
&\le \exp\Big(\frac{n}{k}\big(\exp(2\nu^2) - 1\big)\Big) \\
&\le \exp\Big(\frac{4n\nu^2}{k}\Big).
\end{align*}
Then by (40), we have
$$\mathbb{E}_{z^k}\big[(L_{s,z^k}(X^n) - 1)^2\big] = \mathbb{E}_{z^k}\big[(L_{s,z^k}(X^n))^2\big] - 1 \le \exp\Big(\frac{4n\nu^2}{k}\Big) - 1 \le \frac{8n\nu^2}{k},$$
as long as $\frac{4n\nu^2}{k} = \frac{4C_1 n}{k^{2r+1}} < 1$. Let $k = (C_4\, m n 2^l)^{\frac{1}{2r+2}}$ and $\alpha^2 = \frac{8n\nu^2}{k} = \frac{8C_1 n}{k^{2r+1}}$, where $C_4$ is a large constant. Then by $m > n^{2r+1}$, we have $k > (C_4 mn)^{\frac{1}{2r+2}} > (C_4)^{\frac{1}{2r+2}} n$, and hence $\frac{4n\nu^2}{k} < 1$ is satisfied. By Lemma 5, we have
$$\frac{1}{k}\sum_{s=1}^k I(Z_s; B^m) \le \frac{2^{l+2} mn}{k^{2r+2}} < \frac{1}{2}.$$
Then by Lemma 4, we obtain that $R(m,n,l,r) \gtrsim (mn2^l)^{-\frac{r}{r+1}}$.
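As a numerical sanity check of the two computations above (the binomial moment-generating-function bound and the choice of $k$), the following Python sketch plugs in illustrative values of $m, n, l, r, C_1, C_4$ (assumed here, not prescribed by the proof) and verifies that $(1 + (e^{2\nu^2} - 1)/k)^n \le \exp(4n\nu^2/k)$ when $2\nu^2 \le 1$, and that $2^{l+2}mn/k^{2r+2} = 4/C_4 < 1/2$ for the stated choice of $k$.

import math

# Illustrative parameters (assumed values for a sanity check only).
r, C1, C4 = 1.0, 2.0, 16.0
n, l = 50, 8
m = int(n ** (2 * r + 1)) * 10            # regime m > n^{2r+1} of Case 1

k = (C4 * m * n * 2 ** l) ** (1 / (2 * r + 2))
nu2 = C1 * k ** (-2 * r)                   # nu^2 = C1 * k^{-2r}

assert 2 * nu2 <= 1                        # needed for e^x <= 1 + 2x on [0, 1]
lhs = (1 + (math.exp(2 * nu2) - 1) / k) ** n        # exact binomial MGF value
mid = math.exp(n * (math.exp(2 * nu2) - 1) / k)     # after 1 + x <= e^x
rhs = math.exp(4 * n * nu2 / k)                     # after e^x <= 1 + 2x
print(lhs <= mid <= rhs)                   # True

info = 2 ** (l + 2) * m * n / k ** (2 * r + 2)      # bound from Lemma 5
print(info, info < 0.5)                    # equals 4 / C4 = 0.25 < 1/2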

F. Detailed Proof for Case 2

Recall that
$$1 \vee \frac{n^{2r+1}}{m} \le l \le n. \tag{47}$$
Let $k = (1000 C_5^2 C_2^2\, m l \log^2 m)^{\frac{1}{2r+1}}$, where $C_5$ is a large constant to be determined; then
$$n \vee m^{\frac{1}{2r+1}} \le k \le m^{\frac{1}{r}}. \tag{48}$$
Let
$$E(x^n) = \begin{cases} 1, & \text{if } |L_{s,z^k}(x^n) - 1| \le 2C_5\beta\log m, \ \forall z^k \in \{-1,1\}^k,\ s = 1,\dots,k, \\ 0, & \text{otherwise.} \end{cases}$$

Then we have $\delta_2 \le 2C_5\beta\log m$ immediately. To use Lemma 6, we need to bound the quantity $\mathbb{P}_{z^k}[E(X^n) = 0]$ for any $z^k$. Note that by (23) and $2C_5\beta\log m \asymp k^{-r}\log m \to 0$, we have
\begin{align}
\{E(X^n) = 0\} &= \big\{|L_{s,z^k}(X^n) - 1| \le 2C_5\beta\log m,\ \forall z^k \in \{-1,1\}^k,\ s = 1,\dots,k\big\}^{\complement} \notag\\
&\subseteq \big\{|\log L_{s,z^k}(X^n)| < C_5\beta\log m,\ \forall z^k \in \{-1,1\}^k,\ s = 1,\dots,k\big\}^{\complement} \notag\\
&= \Bigg(\bigcap_{s=1}^k \bigcap_{z_s = \pm 1} \bigg\{\Big|\sum_{j: T_j \in [\frac{s-1}{k},\frac{s}{k}]} L_{s,z_s}(X_j)\Big| < C_5\beta\log m\bigg\}\Bigg)^{\complement} \tag{49}\\
&= \bigcup_{s=1}^k \bigcup_{z_s = \pm 1} \bigg\{\Big|\sum_{j: T_j \in [\frac{s-1}{k},\frac{s}{k}]} L_{s,z_s}(X_j)\Big| \ge C_5\beta\log m\bigg\}. \notag
\end{align}
Define the event $E_{s,z_s} = \big\{\big|\sum_{j: T_j \in [\frac{s-1}{k},\frac{s}{k}]} L_{s,z_s}(X_j)\big| \ge C_5\beta\log m\big\}$. For any $x^n$ with $x_j = (t_j, y_j)$, define the empirical count of $(\lfloor k t_j\rfloor)_{j=1}^n$ in the $s$-th bin to be
$$V_s(x^n) = \sum_{j=1}^n \mathbf{1}_{t_j \in [\frac{s-1}{k}, \frac{s}{k}]}, \tag{50}$$
for any $s = 1,\dots,k$. Then define $E'(x^n)$ based on $(V_s(x^n))_{s=1}^k$ by
$$E'(x^n) = \begin{cases} 1, & \text{if } V_s(x^n) \le C_6\log m, \ \forall s = 1,\dots,k, \\ 0, & \text{otherwise,} \end{cases}$$
where $C_6$ is a large constant to be determined.
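The event $E'$ is a classical balls-and-bins statement: when $n$ balls are thrown into $k \ge n$ bins, every bin count stays below $C_6\log m$ except with probability at most $km^{-C_6/2}$. The following Monte Carlo sketch compares the empirical exceedance probability with this target; the values of $n$, $k$, $m$ and $C_6$ below are assumed, illustrative choices and are not taken from the proof.

import numpy as np

rng = np.random.default_rng(1)

# Illustrative sizes only (the proof has k >= n and m much larger).
m, n = 10_000, 200
k = 400                        # number of bins, k >= n as in (48)
C6 = 4.0
threshold = C6 * np.log(m)     # ~ 36.8

trials, exceed = 2_000, 0
for _ in range(trials):
    bins = rng.integers(0, k, size=n)           # throw n balls into k bins
    counts = np.bincount(bins, minlength=k)     # V_1, ..., V_k
    exceed += counts.max() > threshold

print(f"empirical P[max_s V_s > C6 log m] ~ {exceed / trials:.4f}")
print(f"union-bound-style target k * m^(-C6/2) = {k * m ** (-C6 / 2):.2e}")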

Note that by the union bound and (49),
$$\mathbb{P}_{z^k}[E(X^n) = 0] \le \mathbb{P}_{z^k}[E'(X^n) = 0] + \sum_{s=1}^k \sum_{z_s = \pm 1} \mathbb{P}_{z^k}[E_{s,z_s} \mid E'(X^n) = 1]. \tag{51}$$
The first term can be bounded by letting $c = C_6\log m$ in (26), which yields
$$\mathbb{P}_{z^k}[E'(X^n) = 0] \le k\exp\Big(-\frac{C_6}{2}\log m\Big) = k m^{-\frac{C_6}{2}}. \tag{52}$$
2
Now consider each term in the latter summation. Conditioned on $\lfloor kT_j\rfloor, j = 1,\dots,n$, the samples $(X_j)_{j=1}^n$ are independent. Furthermore, on the event $\{E'(X^n) = 1\}$, $V_s(X^n) \le C_6\log m$ for any $s = 1,\dots,k$. So we have
$$\mathbb{P}_{z^k}\big[E_{s,z_s} \,\big|\, \lfloor kT_j\rfloor, j = 1,\dots,n,\ E'(X^n) = 1\big] = \mathbb{P}\Bigg[\Big|\sum_{j=1}^{V_s(X^n)} L_{s,z_s}(X_j')\Big| \ge C_5\beta\log m\Bigg],$$
where $(X_j')_{j=1}^n$ are i.i.d. random variables generated from the distribution $p_{s,z_s}$. By Assumption 3, $\sum_{j=1}^{V_s(X^n)} L_{s,z_s}(X_j')$ is sub-exponential with parameters $\big(\sqrt{C_6\nu^2\log m}, \beta\big)$ and $\mathbb{E}\big[\sum_{j=1}^{V_s(X^n)} L_{s,z_s}(X_j')\big] \le \frac{C_6 C_3\log m}{k^{2r}}$. By (48), we have
$$\frac{C_6\nu^2\log m}{\beta} = \frac{C_6 C_1 k^{-2r}\log m}{C_2 k^{-r}} \le C_5 C_2 k^{-r}\log m = C_5\beta\log m$$
as long as $C_5 > \frac{C_6 C_1}{C_2^2}$, and
$$C_5\beta\log m \asymp k^{-r}\log m \gg k^{-2r}\log m \asymp \frac{C_6 C_3\log m}{k^{2r}}.$$

Then Hoeffding's inequality (Lemma 7 in Appendix A-A) implies that
$$\mathbb{P}_{z^k}\big[E_{s,z_s} \,\big|\, \lfloor kT_j\rfloor, j = 1,\dots,n,\ E'(X^n) = 1\big] \le 2\exp\Bigg(-\frac{\frac{1}{2}C_5\beta\log m}{2\beta}\Bigg) = 2m^{-\frac{C_5}{4}}.$$
Thus we have
$$\mathbb{P}_{z^k}[E_{s,z_s} \mid E'(X^n) = 1] = \mathbb{E}_{z^k}\Big[\mathbb{P}_{z^k}\big[E_{s,z_s} \,\big|\, \lfloor kT_j\rfloor, j = 1,\dots,n,\ E'(X^n) = 1\big]\Big] \le 2m^{-\frac{C_5}{4}}. \tag{53}$$
Combining the bounds (51), (52) and (53), we have
$$\mathbb{P}_{z^k}[E(X^n) = 0] \le k m^{-\frac{C_6}{2}} + 4k m^{-\frac{C_5}{4}} \le m^{-100},$$
where the last inequality is by (48), as long as $C_5$ and $C_6$ are both large enough. Then we can choose $\delta_1 = m^{-100}$. By Lemma 6, the inequality (47) and (48), we finally have
\begin{align*}
\frac{1}{k}\sum_{s=1}^k I(Z_s; B^m) &\le m\big((\log 2)\delta_1^2 + \delta_1\big) + \frac{8ml(\delta_1^2 + \delta_2^2)}{k} \\
&\le 4m^{-50+1} + \frac{8m^{-200+1} l}{k} + \frac{32 C_5^2\, m l \beta^2\log^2 m}{k} \\
&\le 4m^{-49} + \frac{8m^{-199} n}{n} + \frac{32 C_5^2 C_2^2\, m l\log^2 m}{k^{2r+1}} \\
&\le \frac{1}{2},
\end{align*}
where we use $\delta_2 \le 2C_5\beta\log m$, $\beta = C_2 k^{-r}$ and $l \le n \le k$ (by (47) and (48)), and the last inequality is by the choice of $k = (1000 C_5^2 C_2^2\, m l\log^2 m)^{\frac{1}{2r+1}}$. Then by Lemma 4, we complete the proof.
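The final chain of inequalities can also be checked numerically. The sketch below uses assumed illustrative constants and parameter values consistent with (47) (not prescribed by the proof) and verifies that, for the stated choice of $k$, the right-hand side of the Lemma 6 bound is indeed below $1/2$.

import math

# Illustrative parameters for the Case 2 regime (assumed, for a sanity check only).
r, C2, C5 = 1.0, 1.0, 40.0
m = 10 ** 8
n = 20
l = max(1, math.ceil(n ** (2 * r + 1) / m))   # regime (47): 1 v n^{2r+1}/m <= l <= n

k = (1000 * C5 ** 2 * C2 ** 2 * m * l * math.log(m) ** 2) ** (1 / (2 * r + 1))
beta = C2 * k ** (-r)
delta1 = m ** (-100.0)                         # underflows to 0.0, which is fine here
delta2 = 2 * C5 * beta * math.log(m)

bound = m * (math.log(2) * delta1 ** 2 + delta1) + 8 * m * l * (delta1 ** 2 + delta2 ** 2) / k
print(f"k = {k:.1f}, bound = {bound:.3f} <= 1/2: {bound <= 0.5}")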

G. Detailed Proof for Case 4

Recall that $r > \frac{1}{2}$ and
$$\frac{n^{\frac{1}{2r+1}}}{m} \vee 1 \le l \le \frac{n^{2r+1}}{m} \wedge (mn)^{\frac{1}{2r+1}} \le n. \tag{54}$$
Let $k = (1000 C_1 C_7\, mnl)^{\frac{1}{2r+2}}\log^{\frac{2r+3}{2r+1}} n$, where $C_7$ is a large constant to be determined; then
$$k \ll n\log^2 n \le n^2 \tag{55}$$
and
$$k^{2r+1} \gtrsim (mnl)^{\frac{2r+1}{2r+2}}\log^{2r+3} n \gtrsim n\log^{2r+3} n \gg n\log^4 n. \tag{56}$$
Let
$$E(x^n) = \begin{cases} 1, & \text{if } |L_{s,z^k}(x^n) - 1| \le 2\sqrt{\dfrac{C_7\nu^2 n\log^4 n}{k}}, \ \forall z^k \in \{-1,1\}^k,\ s = 1,\dots,k, \\ 0, & \text{otherwise.} \end{cases}$$

Then we have $\delta_2 \le 2\sqrt{\frac{C_7\nu^2 n\log^4 n}{k}}$ immediately. To use Lemma 6, we first bound the quantity $\mathbb{P}_{z^k}[E(X^n) = 0]$ for any $z^k$. Note that by (56), $\sqrt{\frac{C_7\nu^2 n\log^4 n}{k}} \asymp \sqrt{\frac{n\log^4 n}{k^{2r+1}}} \to 0$. Then by (23) and similarly to (49), we have
$$\{E(X^n) = 0\} \subseteq \bigcup_{s=1}^k \bigcup_{z_s = \pm 1} \bigg\{\Big|\sum_{j: T_j \in [\frac{s-1}{k},\frac{s}{k}]} L_{s,z_s}(X_j)\Big| \ge \sqrt{\frac{C_7\nu^2 n\log^4 n}{k}}\bigg\}. \tag{57}$$
Let $E_{s,z_s} = \big\{\big|\sum_{j: T_j \in [\frac{s-1}{k},\frac{s}{k}]} L_{s,z_s}(X_j)\big| \ge \sqrt{\frac{C_7\nu^2 n\log^4 n}{k}}\big\}$. Define $E'(x^n)$ based on $(V_s(x^n))_{s=1}^k$ (cf. (50)) by
$$E'(x^n) = \begin{cases} 1, & \text{if } V_s(x^n) < \dfrac{C_8 n\log^3 n}{k}, \ \forall s = 1,\dots,k, \\ 0, & \text{otherwise,} \end{cases}$$
where $C_8$ is a large constant to be determined.

Then by Equation (27), we have
$$\mathbb{P}_{z^k}[E'(X^n) = 0] \le k\exp\Big(-\frac{C_8 n\log^3 n}{8k}\Big) \le k\exp\Big(-\frac{C_8\log n}{8}\Big) = k n^{-\frac{C_8}{8}}. \tag{58}$$
The samples $(X_j)_{j=1}^n$ are independent given $\lfloor kT_j\rfloor, j = 1,\dots,n$, and on the event $\{E'(X^n) = 1\}$, $V_s(X^n) \le \frac{C_8 n\log^3 n}{k}$ for any $s = 1,\dots,k$. So we have
$$\mathbb{P}_{z^k}\big[E_{s,z_s} \,\big|\, \lfloor kT_j\rfloor, j = 1,\dots,n,\ E'(X^n) = 1\big] = \mathbb{P}\Bigg[\Big|\sum_{j=1}^{V_s(X^n)} L_{s,z_s}(X_j')\Big| \ge \sqrt{\frac{C_7\nu^2 n\log^4 n}{k}}\Bigg],$$
where $(X_j')_{j=1}^n$ are i.i.d. random variables generated from the distribution $p_{s,z_s}$. By Assumption 3, $\sum_{j=1}^{V_s(X^n)} L_{s,z_s}(X_j')$ is sub-exponential with parameters $\big(\sqrt{\frac{C_8\nu^2 n\log^3 n}{k}}, \beta\big)$ and $\mathbb{E}\big[\sum_{j=1}^{V_s(X^n)} L_{s,z_s}(X_j')\big] \le \frac{C_3 C_8 n\log^3 n}{k^{2r+1}}$. By (55), we have
$$\frac{C_8\nu^2 n\log^3 n}{k\beta} \asymp \frac{n\log^3 n}{k^{r+1}} \gg \sqrt{\frac{n\log^4 n}{k^{2r+1}}} \asymp \sqrt{\frac{C_7\nu^2 n\log^4 n}{k}},$$
and by (56), we have
$$\sqrt{\frac{C_7\nu^2 n\log^4 n}{k}} \asymp \sqrt{\frac{n\log^4 n}{k^{2r+1}}} \gg \frac{n\log^3 n}{k^{2r+1}} \asymp \frac{C_3 C_8 n\log^3 n}{k^{2r+1}}.$$
Then Lemma 7 in Appendix A-A implies that
$$\mathbb{P}_{z^k}\big[E_{s,z_s} \,\big|\, \lfloor kT_j\rfloor, j = 1,\dots,n,\ E'(X^n) = 1\big] \le 2\exp\Bigg(-\frac{\frac{C_7\nu^2 n\log^4 n}{4k}}{\frac{2C_8\nu^2 n\log^3 n}{k}}\Bigg) = 2\exp\Big(-\frac{C_7\log n}{8C_8}\Big) = 2n^{-\frac{C_7}{8C_8}}.$$
Thus we have
$$\mathbb{P}_{z^k}[E_{s,z_s} \mid E'(X^n) = 1] \le 2n^{-\frac{C_7}{8C_8}}. \tag{59}$$

Combining the bounds (57), (58) and (59), we have
\begin{align*}
\mathbb{P}_{z^k}[E(X^n) = 0] &\le \mathbb{P}_{z^k}[E'(X^n) = 0] + \sum_{s=1}^k \sum_{z_s = \pm 1} \mathbb{P}_{z^k}[E_{s,z_s} \mid E'(X^n) = 1] \\
&\le k n^{-\frac{C_8}{8}} + 4k n^{-\frac{C_7}{8C_8}} \le n^{-100(r+1)},
\end{align*}
where the last inequality is by (55), as long as $C_8$ and $C_7$ are both large enough. Then we can choose $\delta_1 = n^{-100(r+1)}$.

By Lemma 6, (54) and (56), we finally have
\begin{align*}
\frac{1}{k}\sum_{s=1}^k I(Z_s; B^m) &\le n^{2r+1}\big((\log 2)\delta_1^2 + \delta_1\big) + \frac{8ml(\delta_1^2 + \delta_2^2)}{k} \\
&\le 4n^{-50+1} + \frac{8n^{2r+1} n^{-200(r+1)} l}{k} + \frac{32 C_7\, mnl\,\nu^2\log^4 n}{k^2} \\
&\le 4n^{-49} + \frac{8n^{-199} n}{n^{\frac{1}{2r+1}}} + \frac{32 C_1 C_7\, mnl\log^4 n}{k^{2r+2}} \\
&\le \frac{1}{2},
\end{align*}
where we use $m \le n^{2r+1}$ (implied by (54)), $\delta_2^2 \le \frac{4C_7\nu^2 n\log^4 n}{k}$, $l \le n$, $k \gtrsim n^{\frac{1}{2r+1}}$ (by (56)) and $\nu^2 = C_1 k^{-2r}$, and the last inequality is by the choice of $k = (1000 C_1 C_7\, mnl)^{\frac{1}{2r+2}}\log^{\frac{2r+3}{2r+1}} n$ and $r > \frac{1}{2}$. By Lemma 4, we complete the proof.
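As in Case 2, the choice of $k$ and the final bound can be sanity-checked numerically. The following sketch uses assumed illustrative parameters satisfying (54) (they are not prescribed by the proof) and checks (55), (56) and the final bound.

import math

# Illustrative parameters for the Case 4 regime (assumed, sanity check only); r > 1/2.
r, C1, C7 = 1.0, 2.0, 50.0
n = 10 ** 4
m = 10 ** 6                                 # m <= n^{2r+1}
l = 10                                      # within (54) for these values

k = (1000 * C1 * C7 * m * n * l) ** (1 / (2 * r + 2)) \
    * math.log(n) ** ((2 * r + 3) / (2 * r + 1))
nu2 = C1 * k ** (-2 * r)
delta1 = n ** (-100.0 * (r + 1))            # underflows to 0.0, which is fine here
delta2 = 2 * math.sqrt(C7 * nu2 * n * math.log(n) ** 4 / k)

# Conditions (55)-(56) and the final information bound.
print(k < n * math.log(n) ** 2)                          # k << n log^2 n
print(k ** (2 * r + 1) > n * math.log(n) ** 4)           # k^{2r+1} >> n log^4 n
bound = n ** (2 * r + 1) * (math.log(2) * delta1 ** 2 + delta1) \
    + 8 * m * l * (delta1 ** 2 + delta2 ** 2) / k
print(f"bound = {bound:.6f} <= 1/2: {bound <= 0.5}")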

APPENDIX D
PROOF FOR THE SPECIFIC CASES IN SECTION II-C

A. Proof of Corollary 2

Let $T = X$ and $Y = 1$; then $X = (T, Y)$.

1) Verification of Assumption 1: Let $h(y) = y$; then $\hat f_{Hs}(X) = \phi_{Hs}(X)$. By 1) in Lemma 1, we have $|\hat f_{Hs}(X)| \le C\cdot 2^{\frac{H}{2}} = C\sqrt{K}$ for some $C > 0$, hence $\hat f_{Hs}(X)$ is sub-exponential with parameters $(\sqrt{C^2 K}, 0)$. Moreover, we have
$$\mathbb{E}[\hat f_{Hs}(X)] = \int_0^1 f(x)\phi_{Hs}(x)\,dx = f_{Hs},$$
hence $\hat f_{Hs}(X)$ is an unbiased estimator of $f_{Hs}$.


2) Verification of Assumption 2: For any $s = 1,\dots,k$, $z_s = \pm 1$ and $x \in [\frac{s-1}{k}, \frac{s}{k}]$, we have
$$p_{z^k}(x) = f_{z^k}(x) = 1 + \epsilon k^{-(r+\frac{1}{2})}\sum_{s'=1}^k z_{s'}\psi_{s'}^k(x) = \frac{1}{k}\cdot k\Big(1 + \epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(x)\Big).$$
Note that
$$\int_{\frac{s-1}{k}}^{\frac{s}{k}} k\Big(1 + \epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(x)\Big)\,dx = 1,$$
and by 1) in Lemma 1, $1 + \epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(x) \ge 1 - \|\epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k\|_\infty \gtrsim 1 - k^{-r} > 0$. Hence $p_{s,z_s}(x) = k\big(1 + \epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(x)\big)$ is a probability density on $[\frac{s-1}{k}, \frac{s}{k}]$.
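The perturbed densities above can be illustrated concretely. The following Python sketch uses a Haar-type bump as a stand-in for $\psi_s^k$ (the actual $\psi_s^k$ comes from Lemma 1; the bump, $\epsilon$, $r$ and $k$ below are assumptions made only for illustration) and numerically checks that $p_{s,z_s}$ stays positive and integrates to one over its bin.

import numpy as np

# Illustrative stand-ins: a Haar-type bump for psi_s^k and arbitrary epsilon, r, k.
r, eps, k = 1.0, 0.5, 64
s, z_s = 10, +1                        # bin index (1-based) and sign

def psi_sk(x):
    # Supported on [(s-1)/k, s/k], sup-norm ~ sqrt(k), integrates to 0.
    left, right = (s - 1) / k, s / k
    mid = (left + right) / 2
    out = np.where((x >= left) & (x < right), np.sqrt(k), 0.0)
    return out * np.where(x < mid, -1.0, 1.0)

def p_s(x):
    # Conditional density on bin s: k * (1 + eps * k^{-(r+1/2)} * z_s * psi_s^k(x)).
    return k * (1.0 + eps * k ** (-(r + 0.5)) * z_s * psi_sk(x))

x = np.linspace((s - 1) / k, s / k, 200_001)
vals = p_s(x)
dx = x[1] - x[0]
mass = np.sum((vals[:-1] + vals[1:]) / 2) * dx          # trapezoidal rule over the bin
print(f"min density = {vals.min():.3f} > 0, mass on bin = {mass:.6f} (should be ~1)")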
3) Verification of Assumption 3: For any $s = 1,\dots,k$, $z_s = \pm 1$ and $x \in [\frac{s-1}{k}, \frac{s}{k}]$,
$$L_{s,z_s}(x) = \log\Bigg(\frac{1 - \epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(x)}{1 + \epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(x)}\Bigg) = \log\Bigg(1 - \frac{2\epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(x)}{1 + \epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(x)}\Bigg).$$
Then for $X \sim p_{s,z_s}$, by 1) in Lemma 1, $|L_{s,z_s}(X)| \le C' k^{-r}$ for some $C' > 0$, hence $L_{s,z_s}(X)$ is sub-exponential with parameters $(\sqrt{C'^2 k^{-2r}}, 0)$. Moreover, by the inequality $-x - x^2 \le \log(1-x) \le -x$ for $|x| < \frac{1}{2}$,
\begin{align*}
|\mathbb{E}[L_{s,z_s}(X)]| &\le \Bigg|\mathbb{E}\Bigg[\frac{2\epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(X)}{1 + \epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(X)}\Bigg]\Bigg| + \mathbb{E}\Bigg[\Bigg(\frac{2\epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(X)}{1 + \epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(X)}\Bigg)^2\Bigg] \\
&\lesssim 0 + k^{-(2r+1)}\cdot 2^h \asymp k^{-2r},
\end{align*}
completing the proof.

B. Proof of Corollary 3

In all these cases, $X = (T, Y)$ with $T \in [0, 1]$.

1. Nonparametric Gaussian Regression Problem in Example 2

1) Verification of Assumption 1: Let $h(y) = y$; then $\hat f_{Hs}(X) = \phi_{Hs}(T) Y$. Note that
$$\mathbb{E}[\hat f_{Hs}(X)] = \mathbb{E}[\phi_{Hs}(T)\mathbb{E}[Y\mid T]] = \mathbb{E}[\phi_{Hs}(T) f(T)] = \int_0^1 f(t)\phi_{Hs}(t)\,dt = f_{Hs},$$
hence $\hat f_{Hs}(X)$ is an unbiased estimator of $f_{Hs}$.


By 1) in Lemma 1, we have $|\phi_{Hs}(T)| \le C\cdot 2^{\frac{H}{2}} = C\sqrt{K}$ for some $C > 0$. Then by 4) in Lemma 1, $|\phi_{Hs}(T) f(T)| \le C L'\sqrt{K}$ almost surely, hence $\phi_{Hs}(T) f(T)$ is sub-Gaussian with parameter $C L'\sqrt{K}$. Since $Y \mid T \sim \mathcal{N}(f(T), 1)$, we have $\mathbb{E}\big[e^{\lambda\phi_{Hs}(T)(Y - f(T))} \mid T\big] = e^{\frac{1}{2}(\lambda\phi_{Hs}(T))^2}$ for all $\lambda \in \mathbb{R}$. Then for any $\lambda \in \mathbb{R}$, we have
\begin{align*}
\mathbb{E}\big[e^{\lambda(\hat f_{Hs}(X) - f_{Hs})}\big] &= \mathbb{E}\Big[\mathbb{E}\big[e^{\lambda\phi_{Hs}(T)(Y - f(T))} \mid T\big]\, e^{\lambda(\phi_{Hs}(T) f(T) - f_{Hs})}\Big] \\
&= \mathbb{E}\Big[e^{\frac{1}{2}(\lambda\phi_{Hs}(T))^2}\, e^{\lambda(\phi_{Hs}(T) f(T) - f_{Hs})}\Big] \\
&\le e^{\frac{1}{2}C^2 K\lambda^2}\, \mathbb{E}\big[e^{\lambda(\phi_{Hs}(T) f(T) - f_{Hs})}\big] \\
&\le e^{\frac{1}{2}C^2 K(1 + L'^2)\lambda^2},
\end{align*}
implying that $\hat f_{Hs}(X)$ is sub-exponential with parameters $(\sqrt{C^2 K(1 + L'^2)}, 0)$.

2) Verification of Assumption 2: For any $s = 1,\dots,k$, $z_s = \pm 1$ and $x = (t,y)$ with $t \in [\frac{s-1}{k}, \frac{s}{k}]$, we have
$$p_{z^k}(x) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{(y - f_{z^k}(t))^2}{2}} = \frac{1}{k}\cdot\frac{k}{\sqrt{2\pi}}\, e^{-\frac{(y - \epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(t))^2}{2}}.$$
Note that
$$\int_{\frac{s-1}{k}}^{\frac{s}{k}}\int_{\mathbb{R}} \frac{k}{\sqrt{2\pi}}\, e^{-\frac{(y - \epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(t))^2}{2}}\,dy\,dt = 1.$$
Hence $p_{s,z_s}(x) = \frac{k}{\sqrt{2\pi}}\, e^{-\frac{(y - \epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(t))^2}{2}} > 0$ is a probability density on $[\frac{s-1}{k}, \frac{s}{k}]\times\mathbb{R}$.

3) Verification of Assumption 3: For any $s = 1,\dots,k$, $z_s = \pm 1$ and $x = (t,y)$ with $t \in [\frac{s-1}{k}, \frac{s}{k}]$,
$$L_{s,z_s}(x) = \Big(y + \epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(t)\Big)^2 - \Big(y - \epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(t)\Big)^2 = 4\epsilon k^{-(r+\frac{1}{2})} z_s\, y\,\psi_s^k(t).$$
Then for $X = (T, Y) \sim p_{s,z_s}$, $L_{s,z_s}(X) = 4\epsilon k^{-(r+\frac{1}{2})} z_s Y\psi_s^k(T)$. Similarly to the verification of Assumption 1, $Y\psi_s^k(T)$ is sub-exponential with parameters $(\sqrt{C^2 k(1 + L'^2)}, 0)$ for some $C > 0$. Thus $L_{s,z_s}(X)$ is sub-exponential with parameters $(\sqrt{C^2 k^{-2r}(1 + L'^2)}, 0)$.
Moreover, by 3) in Lemma 1 we have
\begin{align*}
|\mathbb{E}[L_{s,z_s}(X)]| &= 4\epsilon k^{-(r+\frac{1}{2})}\big|\mathbb{E}\big[\mathbb{E}[Y\mid T]\psi_s^k(T)\big]\big| \\
&= 4\epsilon k^{-(r+\frac{1}{2})}\big|\mathbb{E}\big[f(T)\psi_s^k(T)\big]\big| \\
&= 4\epsilon k^{-(r+\frac{1}{2})}\,|f_{h-h_0,s}| \\
&\le 4\epsilon k^{-(r+\frac{1}{2})}\,\|f - f^{h-h_0-1}\|_2 \\
&\lesssim k^{-(r+\frac{1}{2})}\sqrt{k^{-2r}} \lesssim k^{-2r},
\end{align*}
completing the proof.
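The mean bound above can also be checked by simulation. The sketch below draws $T$ uniformly on the $s$-th bin and $Y \mid T \sim \mathcal{N}(f_{z^k}(T), 1)$, using the same Haar-type stand-in for $\psi_s^k$ as before (an assumption for illustration only), and compares the Monte Carlo estimate of $|\mathbb{E}[L_{s,z_s}(X)]|$ with $4\epsilon^2 k^{-2r}$, its exact value for this stand-in.

import numpy as np

rng = np.random.default_rng(2)

# Illustrative stand-ins (not from the paper): Haar-type psi_s^k, arbitrary eps, r, k.
r, eps, k = 1.0, 0.5, 64
s, z_s = 10, +1
N = 2_000_000

left, right = (s - 1) / k, s / k
T = rng.uniform(left, right, size=N)                   # T restricted to bin s is uniform
psi = np.sqrt(k) * np.where(T < (left + right) / 2, -1.0, 1.0)
mean_Y = eps * k ** (-(r + 0.5)) * z_s * psi           # f_{z^k}(T) on bin s
Y = mean_Y + rng.standard_normal(N)                    # Y | T ~ N(f(T), 1)

L = 4 * eps * k ** (-(r + 0.5)) * z_s * Y * psi
print(f"|E[L]| ~ {abs(L.mean()):.2e}, predicted 4*eps^2*k^(-2r) = {4 * eps**2 * k**(-2*r):.2e}")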

2. Nonparametric Binary Regression Problem in Example 3

1) Verification of Assumption 1: Let $h(y) = y$; then $\hat f_{Hs}(X) = \phi_{Hs}(T) Y$. Note that
$$\mathbb{E}[\hat f_{Hs}(X)] = \mathbb{E}[\phi_{Hs}(T)\mathbb{E}[Y\mid T]] = \mathbb{E}[\phi_{Hs}(T) f(T)] = \int_0^1 f(t)\phi_{Hs}(t)\,dt = f_{Hs},$$
hence $\hat f_{Hs}(X)$ is an unbiased estimator of $f_{Hs}$. Moreover, by 1) in Lemma 1 and $Y \in \{0,1\}$, we have $|\hat f_{Hs}(X)| \le C\cdot 2^{\frac{H}{2}} = C\sqrt{K}$ for some $C > 0$, hence $\hat f_{Hs}(X)$ is sub-exponential with parameters $(\sqrt{C^2 K}, 0)$.

2) Verification of Assumption 2: For any $s = 1,\dots,k$, $z_s = \pm 1$ and $x = (t,y)$ with $t \in [\frac{s-1}{k}, \frac{s}{k}]$, we have
$$p_{z^k}(x) = \frac{1}{k}\cdot k\Big[\Big(\tfrac{1}{2} + \epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(t)\Big)\mathbf{1}_{y=1} + \Big(\tfrac{1}{2} - \epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(t)\Big)\mathbf{1}_{y=0}\Big].$$
Note that
$$\int_{\frac{s-1}{k}}^{\frac{s}{k}} \sum_{y\in\{0,1\}} k\Big[\Big(\tfrac{1}{2} + \epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(t)\Big)\mathbf{1}_{y=1} + \Big(\tfrac{1}{2} - \epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(t)\Big)\mathbf{1}_{y=0}\Big]\,dt = 1.$$
Hence $p_{s,z_s}(x) = k\big[(\tfrac{1}{2} + \epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(t))\mathbf{1}_{y=1} + (\tfrac{1}{2} - \epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(t))\mathbf{1}_{y=0}\big] > 0$ is a probability density on $[\frac{s-1}{k}, \frac{s}{k}]\times\{0,1\}$.

3) Verification of Assumption 3: For any $s = 1,\dots,k$, $z_s = \pm 1$ and $x = (t,y)$ with $t \in [\frac{s-1}{k}, \frac{s}{k}]$,
\begin{align*}
L_{s,z_s}(x) &= \log\Bigg(\frac{(\tfrac{1}{2} - \epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(t))\mathbf{1}_{y=1} + (\tfrac{1}{2} + \epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(t))\mathbf{1}_{y=0}}{(\tfrac{1}{2} + \epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(t))\mathbf{1}_{y=1} + (\tfrac{1}{2} - \epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(t))\mathbf{1}_{y=0}}\Bigg) \\
&= \log\Bigg(1 + \frac{2\epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(t)\,(1 - 2\cdot\mathbf{1}_{y=1})}{(\tfrac{1}{2} + \epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(t))\mathbf{1}_{y=1} + (\tfrac{1}{2} - \epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(t))\mathbf{1}_{y=0}}\Bigg).
\end{align*}
Then for $X = (T, Y) \sim p_{s,z_s}$, by 1) in Lemma 1, $|L_{s,z_s}(X)| \le C' k^{-r}$ for some $C' > 0$, hence $L_{s,z_s}(X)$ is sub-exponential with parameters $(\sqrt{C'^2 k^{-2r}}, 0)$. Moreover, by the inequality $-x - x^2 \le \log(1-x) \le -x$ for $|x| < \frac{1}{2}$,
\begin{align*}
|\mathbb{E}[L_{s,z_s}(X)]| &\le \Bigg|\mathbb{E}\Bigg[\frac{2\epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(T)\,(1 - 2\cdot\mathbf{1}_{Y=1})}{(\tfrac{1}{2} + \epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(T))\mathbf{1}_{Y=1} + (\tfrac{1}{2} - \epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(T))\mathbf{1}_{Y=0}}\Bigg]\Bigg| \\
&\quad + \mathbb{E}\Bigg[\Bigg(\frac{2\epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(T)\,(1 - 2\cdot\mathbf{1}_{Y=1})}{(\tfrac{1}{2} + \epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(T))\mathbf{1}_{Y=1} + (\tfrac{1}{2} - \epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(T))\mathbf{1}_{Y=0}}\Bigg)^2\Bigg] \\
&\lesssim 0 + k^{-(2r+1)}\cdot 2^h \asymp k^{-2r},
\end{align*}
completing the proof.

3. Nonparametric Poisson Regression Problem in Example 4

1) Verification of Assumption 1: Let $h(y) = y$; then $\hat f_{Hs}(X) = \phi_{Hs}(T) Y$. Note that
$$\mathbb{E}[\hat f_{Hs}(X)] = \mathbb{E}[\phi_{Hs}(T)\mathbb{E}[Y\mid T]] = \mathbb{E}[\phi_{Hs}(T) f(T)] = \int_0^1 f(t)\phi_{Hs}(t)\,dt = f_{Hs},$$
hence $\hat f_{Hs}(X)$ is an unbiased estimator of $f_{Hs}$.

By 1) in Lemma 1, we have $|\phi_{Hs}(T)| \le C\cdot 2^{\frac{H}{2}} = C\sqrt{K}$ for some $C > 0$. Then by 4) in Lemma 1, $|\phi_{Hs}(T) f(T)| \le C L'\sqrt{K}$ almost surely, hence $\phi_{Hs}(T) f(T)$ is sub-Gaussian with parameter $C L'\sqrt{K}$. Since $Y \mid T \sim \mathrm{Poisson}(f(T))$, we have $\mathbb{E}\big[e^{\lambda\phi_{Hs}(T)(Y - f(T))} \mid T\big] = e^{f(T)(e^{\lambda\phi_{Hs}(T)} - \lambda\phi_{Hs}(T) - 1)}$ for all $\lambda \in \mathbb{R}$. For any $\lambda$ with $|\lambda| < \frac{1}{C\sqrt{K}}$, we have $|\lambda\phi_{Hs}(T)| \le 1$ and then $e^{\lambda\phi_{Hs}(T)} - \lambda\phi_{Hs}(T) - 1 \le (\lambda\phi_{Hs}(T))^2$. Then
\begin{align*}
\mathbb{E}\big[e^{\lambda(\hat f_{Hs}(X) - f_{Hs})}\big] &= \mathbb{E}\Big[\mathbb{E}\big[e^{\lambda\phi_{Hs}(T)(Y - f(T))} \mid T\big]\, e^{\lambda(\phi_{Hs}(T) f(T) - f_{Hs})}\Big] \\
&\le \mathbb{E}\Big[e^{(\lambda\phi_{Hs}(T))^2}\, e^{\lambda(\phi_{Hs}(T) f(T) - f_{Hs})}\Big] \\
&\le e^{C^2 K\lambda^2}\, \mathbb{E}\big[e^{\lambda(\phi_{Hs}(T) f(T) - f_{Hs})}\big] \\
&\le e^{\frac{1}{2}C^2 K(2 + L'^2)\lambda^2},
\end{align*}
implying that $\hat f_{Hs}(X)$ is sub-exponential with parameters $(\sqrt{C^2 K(2 + L'^2)}, C\sqrt{K})$.

2) Verification of Assumption 2: For any $s = 1,\dots,k$, $z_s = \pm 1$ and $x = (t,y)$ with $t \in [\frac{s-1}{k}, \frac{s}{k}]$, we have
$$p_{z^k}(x) = \frac{1}{k}\cdot k\, e^{-(C_0 + \epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(t))}\,\frac{\big(C_0 + \epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(t)\big)^y}{y!}.$$
Note that
$$\int_{\frac{s-1}{k}}^{\frac{s}{k}} \sum_{y\in\mathbb{N}} k\, e^{-(C_0 + \epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(t))}\,\frac{\big(C_0 + \epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(t)\big)^y}{y!}\,dt = 1.$$
Hence $p_{s,z_s}(x) = k\, e^{-(C_0 + \epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(t))}\frac{(C_0 + \epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(t))^y}{y!} > 0$ is a probability density on $[\frac{s-1}{k}, \frac{s}{k}]\times\mathbb{N}$.

3) Verification of Assumption 3: For any $s = 1,\dots,k$, $z_s = \pm 1$ and $x = (t,y)$ with $t \in [\frac{s-1}{k}, \frac{s}{k}]$,
$$L_{s,z_s}(x) = 2\epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(t) + y\log\Bigg(\frac{C_0 - \epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(t)}{C_0 + \epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(t)}\Bigg).$$
Now let $X = (T, Y) \sim p_{s,z_s}$; then
$$L_{s,z_s}(X) = 2\epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(T) + Y\log\Bigg(1 - \frac{2\epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(T)}{C_0 + \epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(T)}\Bigg).$$
By 1) in Lemma 1, we have
$$\Bigg|\log\Bigg(1 - \frac{2\epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(T)}{C_0 + \epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(T)}\Bigg)\Bigg| \le \Bigg|\frac{4\epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(T)}{C_0 + \epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(T)}\Bigg| \le C_1 k^{-r}$$
for some $C_1 > 0$. Then, similarly to the verification of Assumption 1, $Y\log\big(1 - \frac{2\epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(T)}{C_0 + \epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(T)}\big)$ is sub-exponential with parameters $(\sqrt{C_1^2 k^{-2r}(2 + L'^2)}, C_1 k^{-r})$. By 1) in Lemma 1, $|2\epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(T)| \le C_2 k^{-r}$ for some $C_2 > 0$, hence $2\epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(T)$ is sub-exponential with parameters $(\sqrt{C_2^2 k^{-2r}}, 0)$. Therefore $L_{s,z_s}(X)$ is sub-exponential with parameters $(\sqrt{[C_1^2(2 + L'^2) + C_2^2]\, k^{-2r}}, C_1 k^{-r})$ (by the discussion in Appendix A-A).
Moreover, note that
\begin{align*}
|\mathbb{E}[L_{s,z_s}(X)]| &= \Bigg|2\epsilon k^{-(r+\frac{1}{2})} z_s\,\mathbb{E}\big[\psi_s^k(T)\big] + \mathbb{E}\Bigg[\mathbb{E}[Y\mid T]\log\Bigg(\frac{C_0 - \epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(T)}{C_0 + \epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(T)}\Bigg)\Bigg]\Bigg| \\
&= \Bigg|\mathbb{E}\Bigg[f(T)\log\Bigg(\frac{1 - C_0^{-1}\epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(T)}{1 + C_0^{-1}\epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(T)}\Bigg)\Bigg]\Bigg|.
\end{align*}
Since $f(T) > 0$, by the inequality $-2x - 4x^2 \le \log\frac{1-x}{1+x} \le -2x + 4x^2$ for $|x| < \frac{1}{2}$,
\begin{align*}
|\mathbb{E}[L_{s,z_s}(X)]| &\le 2\Big|\mathbb{E}\Big[f(T)\, C_0^{-1}\epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(T)\Big]\Big| + 4\,\mathbb{E}\Big[f(T)\big(C_0^{-1}\epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(T)\big)^2\Big] \\
&\le 2 C_0^{-1}\epsilon k^{-(r+\frac{1}{2})}\,|f_{h-h_0,s}| + 4 L'\big(C_0^{-1}\epsilon k^{-(r+\frac{1}{2})}\big)^2\cdot k \\
&\lesssim k^{-(r+\frac{1}{2})}\|f - f^{h-h_0-1}\|_2 + k^{-2r} \lesssim k^{-2r},
\end{align*}
where the second inequality is by 1) in Lemma 1 and the last inequality is by 3) in Lemma 1. This completes the proof.

4. Nonparametric Heteroskedastic Regression Problem in Example 5

1) Verification of Assumption 1: Let $h(y) = y^2$; then $\hat f_{Hs}(X) = \phi_{Hs}(T) Y^2$. Note that
$$\mathbb{E}[\hat f_{Hs}(X)] = \mathbb{E}[\phi_{Hs}(T)\mathbb{E}[Y^2\mid T]] = \mathbb{E}[\phi_{Hs}(T) f(T)] = \int_0^1 f(t)\phi_{Hs}(t)\,dt = f_{Hs},$$
hence $\hat f_{Hs}(X)$ is an unbiased estimator of $f_{Hs}$.
By 1) in Lemma 1, we have $|\phi_{Hs}(T)| \le C\cdot 2^{\frac{H}{2}} = C\sqrt{K}$ for some $C > 0$. Then by 4) in Lemma 1, $|\phi_{Hs}(T) f(T)| \le C L'\sqrt{K}$ almost surely, hence $\phi_{Hs}(T) f(T)$ is sub-Gaussian with parameter $C L'\sqrt{K}$. Since $Y^2 \mid T \sim f(T)\chi_1^2$, we have
$$\mathbb{E}\big[e^{\lambda\phi_{Hs}(T)(Y^2 - f(T))} \mid T\big] = \frac{e^{-\lambda\phi_{Hs}(T) f(T)}}{\sqrt{1 - 2\lambda\phi_{Hs}(T) f(T)}} \le e^{2(\phi_{Hs}(T) f(T)\lambda)^2}, \quad \forall\, |\lambda| < \frac{1}{4CL'\sqrt{K}},$$
where the last inequality is by $e^{-x - 2x^2} \le \sqrt{1 - 2x}$ for $|x| < \frac{1}{4}$. For any $\lambda$ with $|\lambda| < \frac{1}{4CL'\sqrt{K}}$,
\begin{align*}
\mathbb{E}\big[e^{\lambda(\hat f_{Hs}(X) - f_{Hs})}\big] &= \mathbb{E}\Big[\mathbb{E}\big[e^{\lambda\phi_{Hs}(T)(Y^2 - f(T))} \mid T\big]\, e^{\lambda(\phi_{Hs}(T) f(T) - f_{Hs})}\Big] \\
&\le \mathbb{E}\Big[e^{2(\lambda\phi_{Hs}(T) f(T))^2}\, e^{\lambda(\phi_{Hs}(T) f(T) - f_{Hs})}\Big] \\
&\le e^{2C^2 K L'^2\lambda^2}\, \mathbb{E}\big[e^{\lambda(\phi_{Hs}(T) f(T) - f_{Hs})}\big] \\
&\le e^{\frac{1}{2}\cdot 5C^2 K L'^2\lambda^2},
\end{align*}
implying that $\hat f_{Hs}(X)$ is sub-exponential with parameters $(\sqrt{5C^2 K L'^2}, 4CL'\sqrt{K})$.
2) Verification of Assumption 2: For any $s = 1,\dots,k$, $z_s = \pm 1$ and $x = (t,y)$ with $t \in [\frac{s-1}{k}, \frac{s}{k}]$, we have
$$p_{z^k}(x) = \frac{1}{k}\cdot\frac{k}{\sqrt{2\pi\big(C_0 + \epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(t)\big)}}\,\exp\Bigg(-\frac{y^2}{2\big(C_0 + \epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(t)\big)}\Bigg).$$
Note that
$$\int_{\frac{s-1}{k}}^{\frac{s}{k}}\int_{\mathbb{R}} \frac{k}{\sqrt{2\pi\big(C_0 + \epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(t)\big)}}\,\exp\Bigg(-\frac{y^2}{2\big(C_0 + \epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(t)\big)}\Bigg)\,dy\,dt = 1.$$
Hence $p_{s,z_s}(x) = \frac{k}{\sqrt{2\pi(C_0 + \epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(t))}}\exp\big(-\frac{y^2}{2(C_0 + \epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(t))}\big) > 0$ is a probability density on $[\frac{s-1}{k}, \frac{s}{k}]\times\mathbb{R}$.

3) Verification of Assumption 3: For any $s = 1,\dots,k$, $z_s = \pm 1$ and $x = (t,y)$ with $t \in [\frac{s-1}{k}, \frac{s}{k}]$,
$$L_{s,z_s}(x) = \frac{1}{2}\log\Bigg(\frac{C_0 - \epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(t)}{C_0 + \epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(t)}\Bigg) + \frac{y^2}{2}\Bigg(\frac{1}{C_0 - \epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(t)} - \frac{1}{C_0 + \epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(t)}\Bigg).$$
Now let $X = (T, Y) \sim p_{s,z_s}$; then
$$L_{s,z_s}(X) = \frac{1}{2}\log\Bigg(1 - \frac{2\epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(T)}{C_0 + \epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(T)}\Bigg) + Y^2\cdot\frac{\epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(T)}{C_0^2 - \big(\epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(T)\big)^2}.$$
By 1) in Lemma 1, we have
$$\Bigg|\frac{1}{2}\log\Bigg(1 - \frac{2\epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(T)}{C_0 + \epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(T)}\Bigg)\Bigg| \le \Bigg|\frac{2\epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(T)}{C_0 + \epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(T)}\Bigg| \le C_1 k^{-r}$$
for some $C_1 > 0$. Hence $\frac{1}{2}\log\big(1 - \frac{2\epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(T)}{C_0 + \epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(T)}\big)$ is sub-exponential with parameters $(\sqrt{C_1^2 k^{-2r}}, 0)$. By 1) in Lemma 1, we have
$$\Bigg|\frac{\epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(T)}{C_0^2 - \big(\epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(T)\big)^2}\Bigg| \le C_2 k^{-r}.$$
Then, similarly to the verification of Assumption 1, $Y^2\cdot\frac{\epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(T)}{C_0^2 - (\epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(T))^2}$ is sub-exponential with parameters $(\sqrt{5C_2^2 k^{-2r} L'^2}, 4C_2 L' k^{-r})$. Therefore $L_{s,z_s}(X)$ is sub-exponential with parameters $(\sqrt{(C_1^2 + 5C_2^2 L'^2)\, k^{-2r}}, 4C_2 L' k^{-r})$ (by the discussion in Appendix A-A).
Moreover, note that
\begin{align*}
|\mathbb{E}[L_{s,z_s}(X)]| &= \Bigg|\mathbb{E}\Bigg[\frac{1}{2}\log\Bigg(\frac{C_0 - \epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(T)}{C_0 + \epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(T)}\Bigg)\Bigg] + \mathbb{E}\Bigg[\mathbb{E}[Y^2\mid T]\cdot\frac{\epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(T)}{C_0^2 - (\epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(T))^2}\Bigg]\Bigg| \\
&\le \Bigg|\mathbb{E}\Bigg[\frac{1}{2}\log\Bigg(\frac{1 - C_0^{-1}\epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(T)}{1 + C_0^{-1}\epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(T)}\Bigg)\Bigg]\Bigg| + \Bigg|\mathbb{E}\Bigg[f(T)\cdot\frac{\epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(T)}{C_0^2 - (\epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(T))^2}\Bigg]\Bigg|.
\end{align*}
For the first term, by the inequality $-2x - 4x^2 \le \log\frac{1-x}{1+x} \le -2x + 4x^2$ for $|x| < \frac{1}{2}$,
\begin{align*}
\Bigg|\mathbb{E}\Bigg[\frac{1}{2}\log\Bigg(\frac{1 - C_0^{-1}\epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(T)}{1 + C_0^{-1}\epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(T)}\Bigg)\Bigg]\Bigg| &\le 2\Big|\mathbb{E}\Big[C_0^{-1}\epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(T)\Big]\Big| + 4\,\mathbb{E}\Big[\big(C_0^{-1}\epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(T)\big)^2\Big] \\
&\lesssim 0 + k^{-2r} = k^{-2r}.
\end{align*}
For the second term, since $f(T) > 0$, by the inequality $x - x^2 \le \frac{x}{1-x^2} \le x + x^2$ for $|x| < \frac{1}{2}$,
\begin{align*}
\Bigg|\mathbb{E}\Bigg[f(T)\cdot\frac{\epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(T)}{C_0^2 - (\epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(T))^2}\Bigg]\Bigg| &\le C_0^{-1}\Big|\mathbb{E}\Big[f(T)\, C_0^{-1}\epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(T)\Big]\Big| + C_0^{-1}\mathbb{E}\Big[f(T)\big(C_0^{-1}\epsilon k^{-(r+\frac{1}{2})} z_s\psi_s^k(T)\big)^2\Big] \\
&\lesssim k^{-(r+\frac{1}{2})}\|f - f^{h-h_0-1}\|_2 + k^{-2r} \lesssim k^{-2r},
\end{align*}
where the second inequality is by 1) in Lemma 1 and the last inequality is by 3) in Lemma 1. Combining these two parts yields $|\mathbb{E}[L_{s,z_s}(X)]| \lesssim k^{-2r}$, which completes the proof.

REFERENCES

[1] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y. Arcas, “Communication-efficient learning of deep networks from decentralized data,” in International Conference on Artificial Intelligence and Statistics, vol. 54, Fort Lauderdale, FL, USA, Apr. 2017, pp. 1273–1282.
[2] T. Li, A. K. Sahu, A. Talwalkar, and V. Smith, “Federated learning: Challenges, methods, and future directions,” IEEE Signal Processing Magazine, vol. 37, no. 3, pp. 50–60, May 2020.
[3] P. Kairouz et al., “Advances and open problems in federated learning,” Foundations and Trends in Machine Learning, vol. 14, no. 1–2, pp. 1–210, Jun. 2021.
[4] A. Zaman and B. Szabó, “Distributed nonparametric estimation under communication constraints,” 2022. [Online]. Available: https://arxiv.org/abs/2204.10373
[5] B. Szabó and H. van Zanten, “Adaptive distributed methods under communication constraints,” The Annals of Statistics, vol. 48, no. 4, pp. 2347–2380, Aug. 2020.
[6] L. P. Barnes, Y. Han, and A. Ozgur, “Lower bounds for learning distributions under communication constraints via Fisher information,” Journal of Machine Learning Research, vol. 21, no. 236, pp. 1–30, Feb. 2020.
[7] J. Acharya, C. L. Canonne, A. V. Singh, and H. Tyagi, “Optimal rates for nonparametric density estimation under communication constraints,” IEEE Transactions on Information Theory, vol. 70, no. 3, pp. 1939–1961, Mar. 2024.
[8] D. Yuan, T. Guo, and Z. Huang, “Adaptive refinement protocols for distributed distribution estimation under ℓp-losses,” 2024. [Online]. Available: https://arxiv.org/abs/2410.06884
[9] J. Acharya, C. L. Canonne, Y. Liu, Z. Sun, and H. Tyagi, “Interactive inference under information constraints,” IEEE Transactions on Information Theory, vol. 68, no. 1, pp. 502–516, Jan. 2022.
[10] J. Acharya, C. L. Canonne, Z. Sun, and H. Tyagi, “Unified lower bounds for interactive high-dimensional estimation under information constraints,” in International Conference on Neural Information Processing Systems, vol. 36, New Orleans, LA, USA, Dec. 2023, pp. 51133–51165.
[11] Y. Zhu and J. Lafferty, “Distributed nonparametric regression under communication constraints,” in International Conference on Machine Learning, vol. 80, Stockholm, Sweden, Jul. 2018, pp. 6009–6017.
[12] B. Szabó and H. van Zanten, “Distributed function estimation: Adaptation using minimal communication,” Mathematical Statistics and Learning, vol. 5, no. 3, pp. 159–199, Dec. 2022.
[13] T. T. Cai and H. Wei, “Distributed nonparametric function estimation: Optimal rate of convergence and cost of adaptation,” The Annals of Statistics, vol. 50, no. 2, pp. 698–725, Apr. 2022.
[14] C. Butucea, A. Dubois, M. Kroll, and A. Saumard, “Local differential privacy: Elbow effect in optimal density estimation and adaptation over Besov ellipsoids,” Bernoulli, vol. 26, no. 3, pp. 1727–1764, Aug. 2020.
[15] M. Kroll, “On density estimation at a fixed point under local differential privacy,” Electronic Journal of Statistics, vol. 15, no. 1, pp. 1783–1813, Jan. 2021.
[16] M. Sart, “Density estimation under local differential privacy and Hellinger loss,” Bernoulli, vol. 29, no. 3, pp. 2318–2341, Aug. 2023.
[17] C. Lalanne, A. Garivier, and R. Gribonval, “About the cost of central privacy in density estimation,” Transactions on Machine Learning Research, Aug. 2023.
[18] T. T. Cai, A. Chakraborty, and L. Vuursteen, “Optimal federated learning for nonparametric regression with heterogeneous distributed differential privacy constraints,” 2024. [Online]. Available: https://arxiv.org/abs/2406.06755
[19] J. Liu, “Communication complexity of two-party nonparametric global density estimation,” in Annual Conference on Information Sciences and Systems, Princeton, NJ, USA, Mar. 2022, pp. 292–297.
[20] ——, “A few interactions improve distributed nonparametric estimation, optimally,” IEEE Transactions on Information Theory, vol. 69, no. 12, pp. 7867–7886, Dec. 2023.
[21] E. Giné and R. Nickl, Mathematical Foundations of Infinite-Dimensional Statistical Models. New York: Cambridge University Press, 2015.
[22] M. J. Wainwright, High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press, 2019.
[23] J. Acharya, C. Canonne, Y. Liu, Z. Sun, and H. Tyagi, “Distributed estimation with multiple samples per user: Sharp rates and phase transition,” in International Conference on Neural Information Processing Systems, vol. 34, Dec. 2021, pp. 18920–18931.
[24] I. Daubechies, Ten Lectures on Wavelets. Philadelphia: Society for Industrial and Applied Mathematics, 1992.
[25] J. Simon, “Sobolev, Besov and Nikolskii fractional spaces: Imbeddings and comparisons for vector valued spaces on an interval,” Annali di Matematica Pura ed Applicata, vol. 157, pp. 117–148, Dec. 1990.
[26] J. Acharya, C. L. Canonne, and H. Tyagi, “Inference under information constraints II: Communication constraints and shared randomness,” IEEE Transactions on Information Theory, vol. 66, no. 12, pp. 7856–7877, Dec. 2020.
