Gradient Learning under Tilted Empirical Risk Minimization

Liu, Liyuan; Song, Biqin; Pan, Zhibin; Yang, Chuanwu; Xiao, Chi; Li, Weifu

doi:10.3390/e24070956

by Liyuan Liu ^1,†, Biqin Song ^1,†, Zhibin Pan ^1,2, Chuanwu Yang ^3, Chi Xiao ^4,* and Weifu Li ^1,2,*

^1 College of Science, Huazhong Agricultural University, Wuhan 430062, China
^2 Hubei Key Laboratory of Applied Mathematics, Hubei University, Wuhan 430062, China
^3 School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan 430074, China
^4 Key Laboratory of Biomedical Engineering of Hainan Province, School of Biomedical Engineering, Hainan University, Haikou 570228, China
^* Authors to whom correspondence should be addressed.
^† These authors contributed equally to this work.

Entropy 2022, 24(7), 956; https://doi.org/10.3390/e24070956

Submission received: 16 June 2022 / Revised: 6 July 2022 / Accepted: 7 July 2022 / Published: 9 July 2022

Abstract

Gradient Learning (GL), which aims to estimate the gradient of a target function, has attracted much attention in variable selection problems due to its mild structural requirements and wide applicability. Despite rapid progress, the majority of existing GL works are based on the empirical risk minimization (ERM) principle, which may suffer degraded performance in complex data environments, e.g., under non-Gaussian noise. To alleviate this sensitivity, we propose a new GL model based on the tilted ERM criterion and establish its theoretical support from the function approximation viewpoint. Specifically, the operator approximation technique plays a crucial role in our analysis. To solve the proposed learning objective, a gradient descent method is proposed, and its convergence analysis is provided. Finally, simulated experimental results validate the effectiveness of our approach when the input variables are correlated.

Keywords: gradient learning; operator approximation; reproducing kernel Hilbert spaces; tilted empirical risk minimization

1. Introduction

Data-driven variable selection aims to select informative features related to the response in high-dimensional statistics and plays a critical role in many areas. For example, if the milk production of dairy cows can be predicted from blood biochemical indexes, then doctors are eager to know which indexes drive milk production, because each index is measured independently at additional cost. Therefore, an explainable and interpretable system for selecting the effective variables is critical to convince domain experts. Currently, variable selection methods can be roughly divided into three categories: linear models [1,2,3], nonlinear additive models [4,5,6], and partial linear models [7,8,9]. Although they achieve promising performance in some applications, these methods still suffer from two main limitations. Firstly, the target function is restricted by the assumption of a specific structure. Secondly, these methods cannot reveal how the coordinates vary with respect to each other. As an alternative, Mukherjee and Zhou [10] proposed the gradient learning (GL) model, which aims to learn the gradient functions and enjoys the model-free property.

Despite its empirical success [11,12,13], the GL model still has some limitations, such as high computational cost, a lack of sparsity in high-dimensional data, and a lack of robustness to complex noise. To this end, several variants of the GL model have been developed for specific purposes. For example, Dong and Zhou [14] proposed a stochastic gradient descent algorithm for learning the gradient and demonstrated that the gradient estimated by the algorithm converges to the true gradient. Mukherjee et al. [15] provided an algorithm to reduce dimension on manifolds for high-dimensional data with few observations. They obtained generalization error bounds for the gradient estimates and revealed that the convergence rate depends on the intrinsic dimension of the manifold. Borkar et al. [16] combined ideas from Spall’s Simultaneous Perturbation Stochastic Approximation with compressive sensing and proposed to learn the gradient with few function evaluations. Ye et al. [17] proposed a sparse GL model that enforces sparsity of the estimated gradients for high-dimensional variable selection. He et al. [18] developed a three-step sparse GL method which allows for efficient computation, admits general predictor effects, and attains desirable asymptotic sparsistency. Following the research direction of robustness, Guinney et al. [19] provided a multi-task model which is efficient and robust for high-dimensional data. In addition, Feng et al. [20] provided a robust gradient learning (RGL) framework by introducing a robust regression loss function. They also provided a simple computational algorithm based on gradient descent and analyzed the convergence of the proposed method.

Despite rapid progress, the GL model and its extensions mentioned above are established under the framework of empirical risk minimization (ERM). While enjoying nice statistical properties, ERM usually performs poorly in situations where average performance is not an appropriate surrogate for the problem of interest [21]. Recently, a novel framework named tilted empirical risk minimization (TERM) was proposed to flexibly address the deficiencies of ERM [21]. By using a new loss named the t-tilted loss, it has been shown that TERM (1) can increase or decrease the influence of outliers to enable fairness or robustness, respectively; (2) has variance reduction properties that can benefit generalization; and (3) can be viewed as a smooth approximation to a superquantile method. Considering these strengths, we propose to investigate GL under the framework of TERM. The main contributions of this paper can be summarized as follows:

New learning objective. We propose to learn the gradient function under the framework of TERM. Specifically, the t-tilted loss is embedded into the GL model. To the best of our knowledge, this may be the first endeavor on this topic.
Theoretical guarantees. For the new learning objective, we estimate the generalization bound by error decomposition and the operator approximation technique, and further provide the theoretical consistency and the convergence rate. To be specific, the convergence rate recovers the result of traditional GL as t tends to 0 [10].
Efficient computation. A gradient descent method is provided to solve the proposed learning objective. By showing the smoothness and strong convexity of the learning objective, the convergence to the optimal solution is proved.

The rest of this paper is organized as follows: Section 2 proposes GL with the t-tilted loss (TGL) and states the main theoretical results on the asymptotic estimation. Section 3 provides the computational algorithm and its convergence analysis. Numerical experiments on synthetic data sets are reported in Section 4. Finally, Section 5 closes this paper with some conclusions.

2. Learning Objective

In this section, we introduce TGL and provide the main theoretical results on the asymptotic estimation.

2.1. Gradient Learning with t-Tilted Loss

Let X be a compact subset of

R^{n}

and

Y \subseteq R

. Assume that

ρ

is a probability measure on

Z : = X \times Y

. It induces the marginal distribution

ρ_{X}

on X and conditional distributions

ρ (\cdot | x)

at

x \in X

. Denote

L_{ρ_{X}}^{2}

as the

L^{2}

space with the metric

{∥ f ∥}_{ρ} = (\int_{X} {| f (x) |}^{2} d ρ_{X})^{1 / 2}

. In addition, the regression function

f_{ρ} : X \to Y

associated with

ρ

is defined as

f_{ρ} (x) = \int_{Y} y d ρ (y | x), x \in X .

For

x = {(x^{1}, x^{2}, \dots, x^{n})}^{T} \in X

, the gradient of

f_{ρ}

is the vector of functions (if the partial derivatives exist)

\nabla f_{ρ} = {(\frac{\partial f_{ρ}}{\partial x^{1}}, \frac{\partial f_{ρ}}{\partial x^{2}}, \dots, \frac{\partial f_{ρ}}{\partial x^{n}})}^{T} .

The relevance between the l-th coordinate and

f_{ρ}

can be evaluated via the norm of its partial derivative

∥ \frac{\partial f_{ρ}}{\partial x^{l}} ∥

, where a large value implies a large change in the function

f_{ρ}

with respect to a sensitive change in the l-th coordinate. This fact gives an intuitive motivation for the GL. In terms of Taylor series expansion, the following equation holds:

f_{ρ} (x) \approx f_{ρ} (\tilde{x}) + \nabla f_{ρ} (\tilde{x}) \cdot (x - \tilde{x}),

(1)

for

x \approx \tilde{x}

and

x, \tilde{x} \in X

. Inspired by (1), we denote the weighted square loss of

\vec{f}

as

V (\vec{f}, z, \tilde{z}) = ω (x, \tilde{x}) {(\tilde{y} - y + \vec{f} {(\tilde{x})}^{T} (x - \tilde{x}))}^{2}, \vec{f} \in {(L_{ρ_{X}}^{2})}^{n}, z, \tilde{z} \in Z,

(2)

where the restriction

x \approx \tilde{x}

will be enforced by weights

ω (x, \tilde{x})

given by

\frac{1}{s^{n + 2}} e^{- | x - \tilde{x} |^{2} / 2 s^{2}}

with a constant

0 < s \leq 1

, see, e.g., [10,11,19]. Then, the expected risk of

\vec{f}

can be given by

E (\vec{f}) = \int_{Z} \int_{Z} V (\vec{f}, z, \tilde{z}) d ρ (z) d ρ (\tilde{z}) .

(3)
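
As an illustration of the weighted loss (2) and of a sample analogue of the risk (3), the following Python sketch may be helpful. It rests on assumptions that are not part of the paper: the function names (`omega`, `pair_loss`, `empirical_risk`), the toy linear target, and the bandwidth s = 0.5 are hypothetical, and the double integral in (3) is replaced by an average over all ordered sample pairs.

```python
import math

def omega(x, x_tilde, s):
    """Gaussian weight from the text: exp(-|x - x~|^2 / (2 s^2)) / s^(n + 2)."""
    n = len(x)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, x_tilde))
    return math.exp(-sq_dist / (2.0 * s * s)) / s ** (n + 2)

def pair_loss(f_vec, z, z_tilde, s):
    """Weighted square loss V(f, z, z~) of Equation (2) for one pair of samples."""
    (x, y), (x_tilde, y_tilde) = z, z_tilde
    g = f_vec(x_tilde)  # candidate gradient evaluated at x~
    linear = sum(gi * (a - b) for gi, a, b in zip(g, x, x_tilde))
    return omega(x, x_tilde, s) * (y_tilde - y + linear) ** 2

def empirical_risk(f_vec, sample, s):
    """Average of V over all ordered pairs: a sample stand-in for the risk (3)."""
    m = len(sample)
    return sum(pair_loss(f_vec, z, zt, s) for z in sample for zt in sample) / m ** 2

# Hypothetical noise-free toy data with target f(x) = 2 x^1 - x^2.
points = [(0.0, 0.0), (0.5, 0.1), (0.2, 0.9), (1.0, 0.4), (0.7, 0.7)]
sample = [(p, 2.0 * p[0] - p[1]) for p in points]

risk_true = empirical_risk(lambda x: (2.0, -1.0), sample, s=0.5)   # true gradient
risk_wrong = empirical_risk(lambda x: (0.0, 0.0), sample, s=0.5)   # ignores both inputs
```

For this linear, noise-free target the true gradient makes every pairwise residual vanish (up to rounding), whereas the zero vector field incurs a strictly positive risk.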

As mentioned in [21], the

\vec{f}

defined in (3) usually performs poorly in situations where average performance is not an appropriate surrogate. Inspired by [21], for

t \in R ∖ \{0\}

, we address the deficiencies by introducing the t-tilted loss and define the expected risk of

\vec{f}

with t-tilted loss as

E (\vec{f}, t) = \frac{1}{t} log \int_{Z} \int_{Z} e^{t V (\vec{f}, z, \tilde{z})} d ρ (z) d ρ (\tilde{z}) .

(4)

Remark 1.

Note that

t \in R ∖ \{0\}

is a real-valued hyperparameter, and it can encompass a family of objectives which can address the fairness (

t > 0

) or robustness (

t < 0

) by different choices. In particular, it recovers the expected risk (3) as

t \to 0

.
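
The effect of the tilt parameter described in Remark 1 can be seen on a finite list of losses. The sketch below is a minimal illustration, assuming an empirical average in place of the integrals in (4); the function name `tilted_risk` and the loss values are hypothetical, and the log-sum-exp shift is only a standard numerical safeguard.

```python
import math

def tilted_risk(losses, t):
    """(1/t) * log(mean(exp(t * loss))), computed with a log-sum-exp shift."""
    if t == 0:
        return sum(losses) / len(losses)  # limiting case: the ordinary average
    shift = max(t * v for v in losses)
    total = sum(math.exp(t * v - shift) for v in losses)
    return (shift + math.log(total / len(losses))) / t

losses = [0.10, 0.20, 0.15, 5.00]   # hypothetical losses; the last one is an outlier
mean_loss = sum(losses) / len(losses)
risk_fair = tilted_risk(losses, t=5.0)       # t > 0: pulled toward the largest loss
risk_robust = tilted_risk(losses, t=-5.0)    # t < 0: pulled toward the small losses
risk_near_zero = tilted_risk(losses, t=1e-6)  # close to the ordinary average
```

With a negative t the outlier contributes almost nothing, which is the robustness regime; with a positive t it dominates, which is the fairness regime.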

On this basis, the GL with t-tilted loss is formulated as the following regularization scheme:

{\vec{f}}_{λ, t} = \underset{\vec{f} \in H_{K}^{n}}{arg min} \{E (\vec{f}, t) + λ ∥ \vec{f} ∥_{K}^{2}\},

(5)

where

λ > 0

is a regularization parameter. Here,

K : X \times X \to R

is a Mercer kernel, that is, continuous, symmetric, and positive semidefinite [22,23], and

H_{K}

is the RKHS induced by K, defined as the closure of the linear span of the set of functions

\{K_{x} : = K (x, \cdot) : x \in X\}

with the inner product

{〈 \cdot, \cdot 〉}_{K}

satisfying

{〈 K_{x}, K_{\tilde{x}} 〉}_{K} = K (x, \tilde{x})

. The reproducing property takes the form

{〈 K_{x}, f 〉}_{K} = f (x), \forall x \in X, \forall f \in H_{K} .

Then, we denote

H_{K}^{n}

as an n-fold RKHS with the inner product

\begin{matrix} {〈 \vec{f}, \vec{h} 〉}_{K} = \sum_{l = 1}^{n} {〈 f^{l}, h^{l} 〉}_{K}, \vec{f} = {(f^{1}, f^{2}, \dots, f^{n})}^{T}, \vec{h} = {(h^{1}, h^{2}, \dots, h^{n})}^{T} \in H_{K}^{n}, \end{matrix}

and norm

∥ \vec{f} ∥_{K}^{2} = {〈 \vec{f}, \vec{f} 〉}_{K}

.

2.2. Main Results

This subsection states our main theoretical results on the asymptotic estimation of

∥ {\vec{f}}_{λ, t} - \nabla f_{ρ} ∥_{ρ}

on the space

{(L_{ρ_{X}}^{2})}^{n}

with norm

∥ \vec{f} ∥_{ρ} = {(\sum_{l = 1}^{n} ∥ f^{l} ∥_{ρ}^{2})}^{1 / 2}

. Before proceeding, we provide some necessary assumptions which have been used extensively in machine learning literature, e.g., [24,25].

Assumption 1.

Supposing that

\nabla f_{ρ} \in H_{K}^{n}

and the kernel K is

C^{3}

, there exists a constant

c_{υ} > 0

such that

| f_{ρ} (x) - f_{ρ} (\tilde{x}) - \nabla f_{ρ} {(\tilde{x})}^{T} (x - \tilde{x}) | \leq c_{υ} {|x - \tilde{x}|}^{2}, \forall x, \tilde{x} \in X .

(6)
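
Condition (6) says that the first-order Taylor remainder is controlled by the squared distance. A small numerical check, under the assumption of a hypothetical smooth test function `f` standing in for the regression function, is sketched below; the ratio of the remainder to the squared distance should stay bounded as the two points approach each other.

```python
import math

def f(x):
    # Hypothetical smooth test function standing in for f_rho.
    return math.sin(x[0]) + x[0] * x[1] ** 2

def grad_f(x):
    # Exact gradient of the test function above.
    return [math.cos(x[0]) + x[1] ** 2, 2.0 * x[0] * x[1]]

def remainder(x, x_tilde):
    """|f(x) - f(x~) - grad f(x~)^T (x - x~)|, the left-hand side of (6)."""
    g = grad_f(x_tilde)
    linear = sum(gi * (a - b) for gi, a, b in zip(g, x, x_tilde))
    return abs(f(x) - f(x_tilde) - linear)

x_tilde = [0.3, -0.7]
direction = [0.6, 0.8]            # unit vector, so |x - x~| = h below
steps = [1e-1, 1e-2, 1e-3]
ratios = []                       # remainder / |x - x~|^2 for each step
for h in steps:
    x = [b + h * d for b, d in zip(x_tilde, direction)]
    ratios.append(remainder(x, x_tilde) / h ** 2)
```

Any upper bound on these ratios over the whole domain plays the role of the constant in (6) for this particular test function.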

Assumption 2.

Assume

| y | \leq M

,

| x | \leq M_{X}

almost surely. Suppose that, for some

ς \in (0, \frac{2}{3})

,

c_{l}, c_{h} > 0

, the marginal distribution

ρ_{X}

satisfies

ρ_{X} (\{x \in X : \inf_{\tilde{x} \in R^{n} ∖ X} |x - \tilde{x}| \leq s\}) \leq c_{h}^{2} s^{4 ς}, \forall s > 0,

(7)

and the density

p (z)

of

d ρ (z)

exists and satisfies

c_{l} \leq p (z) \leq c_{h}, |p (z) - p (\tilde{z})| \leq c_{h} {|z - \tilde{z}|}^{ς}, \forall z, \tilde{z} \in Z .

(8)

Taking the functional derivatives of (5), we know that

{\vec{f}}_{λ, t}

can be expressed in terms of the following integral operator on the space

{(L_{ρ_{X}}^{2})}^{n}

.

Definition 1.

Let integral operator

L_{K, s} : {(L_{ρ_{X}}^{2})}^{n} \to {(L_{ρ_{X}}^{2})}^{n}

be defined by

\begin{matrix} L_{K, s} \vec{f} = \int_{Z} \int_{Z} ϕ (z, \tilde{z}) ω (x, \tilde{x}) (\vec{f} {(\tilde{x})}^{T} (x - \tilde{x})) K_{\tilde{x}} (x - \tilde{x}) d ρ (\tilde{z}) d ρ (z), \end{matrix}

(9)

where

ϕ (z, \tilde{z}) = {(\int_{Z} \int_{Z} e^{t V ({\vec{f}}_{λ, t}, u, v)} d ρ (u) d ρ (v))}^{- 1} e^{t V ({\vec{f}}_{λ, t}, z, \tilde{z})} .

The operator

L_{K, s}

has its range in

H_{K}^{n}

. It can also be regarded as a positive operator on

H_{K}^{n}

. We shall use the same notion for the operators on these two different domains. Given the definition of integral operator

L_{K, s}

, we can write

{\vec{f}}_{λ, t}

in the following equation.

Theorem 1.

Given the integral operator

L_{K, s}

, we have the following relationship:

\begin{matrix} {\vec{f}}_{λ, t} = {(L_{K, s} + λ I)}^{- 1} {\vec{f}}_{ρ, s}, \end{matrix}

(10)

where

{\vec{f}}_{ρ, s} = \int_{Z} \int_{Z} ϕ (z, \tilde{z}) ω (x, \tilde{x}) (f_{ρ} (x) - f_{ρ} (\tilde{x})) K_{\tilde{x}} (x - \tilde{x}) d ρ (\tilde{z}) d ρ (z)

, and I is the identity operator.

Proof of Theorem 1.

To solve the scheme (5), we take the functional derivative with respect to

\vec{f}

, apply it to an element

δ \vec{f}

of

H_{K}^{n}

and set it equal to 0. We obtain

\begin{matrix} \int_{Z} \int_{Z} ϕ (z, \tilde{z}) ω (x, \tilde{x}) (\tilde{y} - y + {\vec{f}}_{λ, t} {(\tilde{x})}^{T} (x - \tilde{x})) δ \vec{f} {(\tilde{x})}^{T} (x - \tilde{x}) d ρ (\tilde{z}) d ρ (z) + λ {〈 {\vec{f}}_{λ, t}, δ \vec{f} 〉}_{K} = 0 . \end{matrix}

Since it holds for any

δ \vec{f} \in H_{K}^{n}

, it is trivial to obtain

\begin{matrix} \int_{Z} \int_{Z} ϕ (z, \tilde{z}) ω (x, \tilde{x}) (\tilde{y} - y + {\vec{f}}_{λ, t} {(\tilde{x})}^{T} (x - \tilde{x})) K_{\tilde{x}} (x - \tilde{x}) d ρ (\tilde{z}) d ρ (z) + λ {\vec{f}}_{λ, t} = 0 \end{matrix}

and

\begin{matrix} λ {\vec{f}}_{λ, t} + L_{K, s} {\vec{f}}_{λ, t} = {\vec{f}}_{ρ, s} . \end{matrix}

The desired result follows by shifting items. □

On this basis, we propose to bound the error

∥ {\vec{f}}_{λ, t} - \nabla f_{ρ} ∥_{ρ}

by a functional analysis approach and present the error decomposition in the following proposition. The proof is straightforward and omitted for brevity.

Proposition 1.

For the

{\vec{f}}_{λ, t}

defined in (5), it holds that

∥ {\vec{f}}_{λ, t} - \nabla f_{ρ} ∥_{ρ} \leq ∥ {\vec{f}}_{λ, t} - \nabla f_{ρ} + λ {(L_{K, s} + λ I)}^{- 1} \nabla f_{ρ} ∥_{ρ} + {∥ λ {(L_{K, s} + λ I)}^{- 1} \nabla f_{ρ} ∥}_{ρ} .

(11)

In the sequel, we focus on bounding

∥ {\vec{f}}_{λ, t} - \nabla f_{ρ} + λ {(L_{K, s} + λ I)}^{- 1} \nabla f_{ρ} ∥_{ρ}

and

∥ λ {(L_{K, s} + λ I)}^{- 1} \nabla f_{ρ} ∥_{ρ}

, respectively. Before we embark on the proof, we single out an important property regarding

ϕ (z, \tilde{z})

that will be useful in later proofs.

Lemma 1.

Under Assumptions 1 and 2, there exist

B_{t}

and

A_{t}

dependent on t satisfying

B_{t} = e^{- 8 | t | (M^{2} + C_{K} M_{X})} \leq ϕ (z, \tilde{z}) \leq A_{t} = e^{8 | t | (M^{2} + C_{K} M_{X})} .

(12)

Proof of Lemma 1.

Since the kernel K is

C^{3}

and

{\vec{f}}_{λ, t} \in H_{K}^{n}

, we know from Zhou [26] that

f_{λ, t}^{l}

is

C^{1}

for each l. There exists a constant

C_{K}

satisfying

{| {\vec{f}}_{λ, t} (x) |}^{2} \leq C_{K}, \forall x \in X

. Hence, using Cauchy inequality, we have

\begin{matrix} V ({\vec{f}}_{λ, t}, z, \tilde{z}) & = ω (\tilde{x}, x) {(\tilde{y} - y + {\vec{f}}_{λ, t} {(\tilde{x})}^{T} (x - \tilde{x}))}^{2} \\ \leq 2 (4 M^{2} + | {\vec{f}}_{λ, t} (\tilde{x}) |^{2} | x - \tilde{x} |^{2}) \\ \leq 8 (M^{2} + C_{K} M_{X}) . \end{matrix}

By a direct computation, we obtain

\begin{matrix} e^{- 8 | t | (M^{2} + C_{K} M_{X})} \leq {(\int_{Z} \int_{Z} e^{t V ({\vec{f}}_{λ, t}, u, v)} d ρ (u) d ρ (v))}^{- 1} e^{t V ({\vec{f}}_{λ, t}, z, \tilde{z})} \leq e^{8 | t | (M^{2} + C_{K} M_{X})} . \end{matrix}

The desired result follows. □

Denoting

κ = {sup}_{x \in X} K (x, x)

and the moments of the Gaussian as

J_{p} = \int_{R^{n}} e^{- \frac{{| x |}^{2}}{2}} {|x|}^{p} d x

,

p = 1, 2, 3, \dots

, we establish the following Lemma.

Lemma 2.

Under Assumptions 1 and 2, we have

\begin{matrix} ∥ {\vec{f}}_{λ, t} - \nabla f_{ρ} + λ {(L_{K, s} + λ I)}^{- 1} \nabla f_{ρ} ∥_{K} \leq 2 M \frac{s}{λ} κ c_{υ} c_{h} J_{3} A_{t} . \end{matrix}

(13)

Proof of Lemma 2.

Taking notice of (10), it follows that

\begin{matrix} {\vec{f}}_{λ, t} - \nabla f_{ρ} + λ {(L_{K, s} + λ I)}^{- 1} \nabla f_{ρ} = {(L_{K, s} + λ I)}^{- 1} ({\vec{f}}_{ρ, s} - L_{K, s} \nabla f_{ρ}) . \end{matrix}

Then, we have

\begin{matrix} ∥ {\vec{f}}_{λ, t} - \nabla f_{ρ} + λ {(L_{K, s} + λ I)}^{- 1} \nabla f_{ρ} ∥_{K} \leq & ∥ {(L_{K, s} + λ I)}^{- 1} ∥_{K} {∥ {\vec{f}}_{ρ, s} - L_{K, s} \nabla f_{ρ} ∥}_{K} \\ \leq & \frac{1}{λ} {∥ {\vec{f}}_{ρ, s} - L_{K, s} \nabla f_{ρ} ∥}_{K} . \end{matrix}

We note that

\begin{matrix} J_{p} s^{p - 2} = \int_{R^{n}} ω (x, \tilde{x}) | x - \tilde{x} |^{p} d \tilde{x} = \int_{R^{n}} \frac{1}{s^{n + 2}} e^{\frac{- {|x - \tilde{x}|}^{2}}{2 s^{2}}} {| x - \tilde{x} |}^{p} d \tilde{x}, p = 2, 3, \dots . \end{matrix}
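
For completeness, the identity above can be verified by the change of variables u = (\tilde{x} - x) / s, for which d \tilde{x} = s^{n} d u and | x - \tilde{x} | = s | u |. The short derivation below is a sketch written out under the definitions of ω and J_{p} given in the text:

```latex
\int_{R^{n}} ω (x, \tilde{x}) {| x - \tilde{x} |}^{p} d \tilde{x}
  = \frac{1}{s^{n + 2}} \int_{R^{n}} e^{- \frac{{| s u |}^{2}}{2 s^{2}}} {| s u |}^{p} s^{n} d u
  = s^{p - 2} \int_{R^{n}} e^{- \frac{{| u |}^{2}}{2}} {| u |}^{p} d u
  = J_{p} s^{p - 2} .
```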

From Assumptions 1 and 2, we have

∥ {\vec{f}}_{ρ, s} - L_{K, s} \nabla f_{ρ} ∥_{K} \leq \int_{Z} \int_{Z} ω (x, \tilde{x}) {|x - \tilde{x}|}^{3} ϕ (z, \tilde{z}) {∥ K_{\tilde{x}} ∥}_{K} c_{υ} d ρ (z) d ρ (\tilde{z}) \leq 2 M s κ c_{υ} c_{h} J_{3} A_{t} .

The desired result follows. □

As for

∥ λ {(L_{K, s} + λ I)}^{- 1} \nabla f_{ρ} ∥_{ρ}

, the multivariate mean value theorem ensures that there exists

R_{t} (\tilde{z}) = ϕ (\tilde{z}, η_{z}), η_{z} \in R^{n} \times Y

, such that

\begin{matrix} \int_{Z} \int_{R^{n} \times Y} e^{- \frac{{|x - \tilde{x}|}^{2}}{2 s^{2}}} \frac{| x - \tilde{x} |^{2}}{s^{2 + n}} ϕ (z, \tilde{z}) K_{\tilde{x}} \vec{f} (\tilde{x}) p (\tilde{z}) d z d ρ (\tilde{z}) \\ = & \int_{Z} \int_{R^{n} \times Y} e^{- \frac{{|x - \tilde{x}|}^{2}}{2 s^{2}}} \frac{| x - \tilde{x} |^{2}}{s^{2 + n}} R_{t} (\tilde{z}) K_{\tilde{x}} \vec{f} (\tilde{x}) p (\tilde{z}) d z d ρ (\tilde{z}) . \end{matrix}

(14)

From (14), we can define the integral operator associated with the Mercer kernel K which is related to

L_{K, s}

. Using Lemma 16 and Lemma 18 in [10], we establish the following Lemma.

Lemma 3.

Under Assumption 2, denote

c_{ρ} = {(2 M A_{t} κ^{2} c_{h} (2 J_{2 + ς} + J_{4} + c_{h} J_{2}))}^{\frac{1}{ς}}

and

V_{p} = \int_{Z} {(p (z))}^{2} R_{t} (z) d z

. For any

0 < s \leq \min \{c_{ρ} λ^{\frac{1}{ς}}, 1\}

, we have

\begin{matrix} ∥ λ {(L_{K, s} + λ I)}^{- 1} \nabla f_{ρ} ∥_{ρ} \leq 2 \sqrt{λ} {(V_{p} n {(2 π)}^{\frac{n}{2}} M)}^{- \frac{1}{2}} {∥ L_{K}^{- \frac{1}{2}} \nabla f_{ρ} ∥}_{ρ}, \end{matrix}

(15)

where

L_{K}

is a positive operator on

{(L_{ρ_{X}}^{2})}^{n}

defined by

L_{K} \vec{f} = \int_{Z} K_{x} \vec{f} (x) \frac{p (z) R_{t} (z)}{V_{p}} d ρ (z), \vec{f} \in {(L_{ρ}^{2})}^{n} .

Proof of Lemma 3.

To estimate (15), we need to consider the convergence of

L_{K, s}

as

s \to 0

. Using the stepping stone

\begin{matrix} \vec{g} = \int_{Z} \int_{Z} ω (x, \tilde{x}) (x - \tilde{x}) R_{t} (\tilde{z}) K_{\tilde{x}} {(x - \tilde{x})}^{T} \vec{f} (\tilde{x}) p (\tilde{z}) d z d ρ (\tilde{z}), \end{matrix}

we deduce that

\begin{matrix} ∥ L_{K, s} \vec{f} - 2 M V_{p} n {(2 π)}^{\frac{n}{2}} L_{K} \vec{f} ∥_{K} & \leq ∥ L_{K, s} \vec{f} - \vec{g} + \vec{g} - 2 M V_{p} n {(2 π)}^{\frac{n}{2}} L_{K} \vec{f} ∥_{K} \\ \leq ∥ L_{K, s} \vec{f} - \vec{g} ∥_{K} + {∥ \vec{g} - 2 M V_{p} n {(2 π)}^{\frac{n}{2}} L_{K} \vec{f} ∥}_{K} . \end{matrix}

Using the multivariate mean value theorem, there exist

z_{ζ}, z_{σ} \in R^{n} \times Y

, such that

\begin{matrix} ∥ L_{K, s} \vec{f} - \vec{g} ∥_{K} = & ∥ p (z_{ζ}) \int_{Z} \int_{R^{n} \times Y} R_{t} (\tilde{z}) ω (x, \tilde{x}) (\vec{f} {(\tilde{x})}^{T} (x - \tilde{x})) K_{\tilde{x}} (x - \tilde{x}) d z d ρ (\tilde{z}) \\ & - \int_{Z} \int_{Z} ω (x, \tilde{x}) (x - \tilde{x}) R_{t} (\tilde{z}) K_{\tilde{x}} {(x - \tilde{x})}^{T} \vec{f} (\tilde{x}) p (\tilde{z}) d z d ρ (\tilde{z}) ∥_{K} \\ \leq & ∥ p (z_{ζ}) \int_{Z} \int_{R^{n} \times Y} R_{t} (\tilde{z}) ω (x, \tilde{x}) (\vec{f} {(\tilde{x})}^{T} (x - \tilde{x})) K_{\tilde{x}} (x - \tilde{x}) d z d ρ (\tilde{z}) \\ & - \int_{Z} \int_{R^{n} \times Y} R_{t} (\tilde{z}) ω (x, \tilde{x}) (\vec{f} {(\tilde{x})}^{T} (x - \tilde{x})) K_{\tilde{x}} (x - \tilde{x}) p (z) d z d ρ (\tilde{z}) ∥_{K} \\ & + ∥ \int_{Z} \int_{Z} R_{t} (\tilde{z}) ω (x, \tilde{x}) (\vec{f} {(\tilde{x})}^{T} (x - \tilde{x})) K_{\tilde{x}} (x - \tilde{x}) (p (z) - p (\tilde{z})) d z d ρ (\tilde{z}) ∥_{K} \\ \leq & ∥ (p (z_{ζ}) - p (z_{σ})) \int_{Z} \int_{R^{n} \times Y} R_{t} (\tilde{z}) ω (x, \tilde{x}) (\vec{f} {(\tilde{x})}^{T} (x - \tilde{x})) K_{\tilde{x}} (x - \tilde{x}) d z d ρ (\tilde{z}) ∥_{K} \\ & + ∥ \int_{Z} \int_{Z} R_{t} (\tilde{z}) ω (x, \tilde{x}) (\vec{f} {(\tilde{x})}^{T} (x - \tilde{x})) K_{\tilde{x}} (x - \tilde{x}) (p (z) - p (\tilde{z})) d z d ρ (\tilde{z}) ∥_{K} \\ \leq & 4 M s^{ς} κ c_{h} J_{2 + ς} {∥ \vec{f} ∥}_{ρ} A_{t} . \end{matrix}

Noticing

n {(2 π)}^{\frac{n}{2}} = J_{2}

, we have

\begin{matrix} 2 V_{p} n {(2 π)}^{\frac{n}{2}} M L_{K} \vec{f} = & \int_{Z} \int_{R^{n} \times Y} ω (x, \tilde{x}) R_{t} (\tilde{z}) K_{\tilde{x}} \vec{f} (\tilde{x}) {(x - \tilde{x})}^{T} (x - \tilde{x}) p (\tilde{z}) d z d ρ (\tilde{z}) . \end{matrix}

Then, by (7), we can obtain the following conclusion from Lemma 16 in [10] when

0 \leq s \leq 1

,

\begin{matrix} ∥ \vec{g} - 2 M V_{p} n {(2 π)}^{\frac{n}{2}} L_{K} \vec{f} ∥_{K} & \leq ∥ \int_{Z} \int_{(R^{n} \times Y) ∖ Z} ω (x, \tilde{x}) R_{t} (\tilde{z}) K_{\tilde{x}} \vec{f} (\tilde{x}) p (\tilde{z}) {| x - \tilde{x} |}^{2} d z d ρ (\tilde{z}) ∥_{K} \\ \leq 2 M c_{ρ} A_{t} \int_{X} \int_{R^{n} ∖ X} ω (x, \tilde{x}) K_{\tilde{x}} | \vec{f} (\tilde{x}) | | (x - \tilde{x}) |^{2} d x d ρ_{X} (\tilde{x}) \\ \leq 2 M s^{ς} κ c_{h} (J_{4} + c_{h} J_{2}) {∥ \vec{f} ∥}_{ρ} A_{t} . \end{matrix}

Combining the above two estimates, there holds for any

0 \leq s \leq 1

,

\begin{matrix} ∥ L_{K, s} - 2 M V_{p} n {(2 π)}^{\frac{n}{2}} L_{K} ∥_{K} \leq 2 M A_{t} κ^{2} c_{h} s^{ς} (2 J_{2 + ς} + J_{4} + c_{h} J_{2}) . \end{matrix}

(16)

Using Lemma 18 in [10] and (16), the desired result follows. □

Since the measure

d \tilde{ρ} = \int_{Y} \frac{p (z) R_{t} (z)}{V_{p}} d ρ

is a probability measure on X, we know that the operator

L_{K}

can be used to define the reproducing kernel Hilbert space [22]. Let

L_{K}^{1 / 2}

be the

\frac{1}{2}

-th power of the positive operator

L_{K}

on

{(L_{\tilde{ρ}}^{2})}^{n}

with norm

∥ \vec{f} ∥_{\tilde{ρ}} = {(\sum_{l = 1}^{n} ∥ f^{l} ∥_{\tilde{ρ}}^{2})}^{1 / 2}

having a range in

H_{K}^{n}

, where

∥ f^{l} ∥_{\tilde{ρ}} = {(\int_{X} {| f^{l} (x) |}^{2} d \tilde{ρ})}^{1 / 2}

. Then,

H_{K}^{n}

is the range of

L_{K}^{1 / 2} :

\begin{matrix} ∥ \vec{f} ∥_{\tilde{ρ}} = {∥ L_{K}^{1 / 2} \vec{f} ∥}_{K}, \vec{f} \in {(L_{\tilde{ρ}}^{2})}^{n} . \end{matrix}

(17)

The assumption we shall use is

∥ L_{K}^{- 1 / 2} \nabla f_{ρ} ∥_{\tilde{ρ}} < \infty

. It means that

\nabla f_{ρ}

lies in the range of

L_{K}^{1 / 2}

. Finally, we can give the upper bound of the error

∥ {\vec{f}}_{λ, t} - \nabla f_{ρ} ∥_{ρ}

.

Theorem 2.

Under Assumptions 1 and 2, choose

λ = m^{- \frac{ς}{n + 2 + 3 ς}}

and

s = {(κ c_{h})}^{\frac{2}{ς}} m^{- \frac{1}{n + 2 + 3 ς}}

. For any

m \geq {(κ c_{h})}^{2 (n + 2 + 3 ς) / ς}

, there exists a constant

C_{ρ, K}

such that

∥ {\vec{f}}_{λ, t} - \nabla f_{ρ} ∥_{ρ} \leq C_{ρ, K} \frac{A_{t}}{\sqrt{B_{t}}} {(\frac{1}{m})}^{\frac{ς}{2 n + 4 + 6 ς}} .

(18)

Proof of Theorem 2.

Using Cauchy inequality, for

\vec{f} = {(f^{1}, f^{2}, \dots, f^{n})}^{T} \in {(L_{ρ_{X}}^{2})}^{n}

, we have

\begin{matrix} \int_{X} {(f^{l} (x))}^{2} d ρ_{X} (x) & \leq {(\int_{Z} {(f^{l} (x))}^{2} \frac{p (z) R_{t} (z)}{V_{p}} d ρ (z))}^{\frac{1}{2}} {(\int_{Z} {(f^{l} (x))}^{2} \frac{V_{p}}{p (z) R_{t} (z)} d ρ (z))}^{\frac{1}{2}} \\ \leq \sqrt{\frac{V_{p}}{c_{l} B_{t}}} {(\int_{Z} {(f^{l} (x))}^{2} \frac{p (z) R_{t} (z)}{V_{p}} d ρ (z))}^{\frac{1}{2}} {(\int_{X} {(f^{l} (x))}^{2} d ρ_{X} (x))}^{\frac{1}{2}} . \end{matrix}

It means that

\begin{matrix} {(\int_{X} {(f^{l} (x))}^{2} d ρ_{X} (x))}^{\frac{1}{2}} \leq \sqrt{\frac{V_{p}}{c_{l} B_{t}}} {(\int_{Z} {(f^{l} (x))}^{2} \frac{p (z) R_{t} (z)}{V_{p}} d ρ (z))}^{\frac{1}{2}} . \end{matrix}

According to the definitions of

∥ f^{l} ∥_{ρ}

and

∥ f^{l} ∥_{\tilde{ρ}}

, it is trivial to obtain

\begin{matrix} ∥ \vec{f} ∥_{ρ} \leq \sqrt{\frac{V_{p}}{c_{l} B_{t}}} {∥ \vec{f} ∥}_{\tilde{ρ}} . \end{matrix}

(19)

Since

s = {(κ c_{h})}^{\frac{2}{ς}} λ^{\frac{1}{ς}}, λ = {(\frac{1}{m})}^{\frac{ς}{n + 2 + 3 ς}}

, we see from the fact

J_{2} > 1

that the restriction

0 < s \leq \min \{c_{ρ} λ^{\frac{1}{ς}}, 1\}

in Lemma 3 is satisfied for

m \geq {(κ c_{h})}^{2 (n + 2 + 3 ς) / ς}

. Then, combining Lemmas 2 and 3, Equation (17) and inequality (19), we have

\begin{matrix} ∥ {\vec{f}}_{λ, t} - \nabla f_{ρ} ∥_{ρ} \leq & ∥ {\vec{f}}_{λ, t} - \nabla f_{ρ} + λ {(L_{K, s} + λ I)}^{- 1} \nabla f_{ρ} ∥_{ρ} + {∥ λ {(L_{K, s} + λ I)}^{- 1} \nabla f_{ρ} ∥}_{ρ} \\ \leq & κ ∥ {\vec{f}}_{λ, t} - \nabla f_{ρ} + λ {(L_{K, s} + λ I)}^{- 1} \nabla f_{ρ} ∥_{K} + {∥ λ {(L_{K, s} + λ I)}^{- 1} \nabla f_{ρ} ∥}_{ρ} \\ \leq & 2 M \frac{s}{λ} κ^{2} c_{υ} c_{h} J_{3} A_{t} + 2 \sqrt{\frac{V_{p}}{c_{l} B_{t}}} \sqrt{λ} {(M V_{p} n {(2 π)}^{\frac{n}{2}})}^{- \frac{1}{2}} {∥ \nabla f_{ρ} ∥}_{K} \\ \leq & C_{ρ, K} \frac{A_{t}}{\sqrt{B_{t}}} {(\frac{1}{m})}^{\frac{ς}{2 n + 4 + 6 ς}}, \end{matrix}

where

C_{ρ, K} = ({(2 κ c_{h})}^{\frac{2}{ς}} + 2) \max \{M κ^{2} c_{υ} c_{h} J_{3}, \sqrt{\frac{V_{p}}{c_{l}}} {(M V_{p} n {(2 π)}^{\frac{n}{2}})}^{- \frac{1}{2}} C_{K}\}

. □

Remark 2.

Theorem 2 shows that, as

m \to + \infty

,

∥ {\vec{f}}_{λ, t} - \nabla f_{ρ} ∥_{ρ} \to 0

. This means that the scheme (5) is consistent. In addition, since

A_{t}

and

B_{t}

tend to 1 as t tends to 0, the error bound of scheme (5) decays at the rate

m^{- \frac{ς}{2 n + 4 + 6 ς}}

, which is consistent with the previous result in [10]. In this sense, the proposed method can be regarded as an extension of traditional GL.
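
To see how the quantities in Theorem 2 and Remark 2 interact, the following sketch evaluates the parameter choices and the factor A_{t} / \sqrt{B_{t}} for a few values of t. It is only illustrative: `sigma` is the code name used here for the exponent from Assumption 2, the function names are hypothetical, and all numerical constants are made up for the example.

```python
import math

def theorem2_parameters(m, n, sigma, kappa, c_h):
    """Parameter choices of Theorem 2; sigma denotes the exponent of Assumption 2."""
    lam = m ** (-sigma / (n + 2 + 3 * sigma))
    s = (kappa * c_h) ** (2 / sigma) * m ** (-1 / (n + 2 + 3 * sigma))
    rate = m ** (-sigma / (2 * n + 4 + 6 * sigma))
    return lam, s, rate

def tilt_factor(t, M, C_K, M_X):
    """A_t / sqrt(B_t) with A_t and B_t as in Lemma 1."""
    a_t = math.exp(8 * abs(t) * (M ** 2 + C_K * M_X))
    b_t = math.exp(-8 * abs(t) * (M ** 2 + C_K * M_X))
    return a_t / math.sqrt(b_t)

# Hypothetical constants, chosen only to produce concrete numbers.
lam, s, rate = theorem2_parameters(m=10_000, n=5, sigma=0.5, kappa=1.0, c_h=1.0)
factors = [tilt_factor(t, M=1.0, C_K=1.0, M_X=1.0) for t in (0.0, 0.01, 0.1)]
```

Under these choices the rate term equals the square root of λ, and the tilt factor equals 1 at t = 0 and grows exponentially in |t|, which is why the bound reduces to the untilted case as t tends to 0.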

3. Computing Algorithm

In this section, we present the GL model under TERM and propose to use the gradient descent algorithm to find the minimizer. Finally, the convergence of the proposed algorithm is also guaranteed.

Let

z = {\{z_{i} = (x_{i}, y_{i})\}}_{i = 1}^{m} \in Z^{m}

be a set of observations independently drawn according to

ρ

, and assume that the RKHS is rich enough that the kernel matrix

K = {(K (x_{i}, x_{j}))}_{i, j = 1}^{m}

is strictly positive definite [27]. According to the Representer Theorem of kernel methods [28], the approximation of

{\vec{f}}_{λ, t}

has the following form:

\sum_{i = 1}^{m} c_{i} K_{x_{i}}, c_{i} = {(c_{i}^{1}, \dots, c_{i}^{n})}^{T} \in R^{n} .

Writing

c = {(c_{1}^{T}, \dots, c_{m}^{T})}^{T} \in R^{m n}

, the empirical version of (5) is formulated as follows:

\begin{matrix} c_{z, λ} : = \underset{c \in R^{m n}}{arg min} \{E_{z} (c, t) + λ ∥ \sum_{i = 1}^{m} c_{i} K_{x_{i}} ∥_{K}^{2}\}, \end{matrix}

(20)

where

E_{z} (c, t) = \frac{1}{t} log (\frac{1}{m^{2}} \sum_{i, j = 1}^{m} exp \{t ω (x_{i}, x_{j}) {(y_{i} - y_{j} + \sum_{p = 1}^{m} K (x_{p}, x_{i}) {\hat{x}}_{i j} c_{p})}^{2}\}),

with

{\hat{x}}_{i j} = {(x_{j} - x_{i})}^{T}

. For simplicity, we denote

V_{z} (c, z_{i}, z_{j}) = ω (x_{i}, x_{j}) {(y_{i} - y_{j} + \sum_{p = 1}^{m} K (x_{p}, x_{i}) {\hat{x}}_{i j} c_{p})}^{2}

and

ϕ_{z} (c, z_{i}, z_{j}) = exp \{t (V_{z} (c, z_{i}, z_{j}) - E_{z} (c, t))\} .
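
The quantities V_{z}, E_{z} and ϕ_{z} defined above can be computed directly for a small sample. The sketch below is a minimal illustration, assuming a Gaussian kernel of unit width, s = 1, and hypothetical toy data; none of these choices comes from the paper, and the helper names are invented for the example.

```python
import math

def kernel(a, b):
    # Gaussian kernel with unit width: an illustrative assumption.
    return math.exp(-0.5 * sum((p - q) ** 2 for p, q in zip(a, b)))

def pair_losses(c, xs, ys, s):
    """Matrix of V_z(c, z_i, z_j) for coefficient vectors c[p] in R^n."""
    m, n = len(xs), len(xs[0])
    V = [[0.0] * m for _ in range(m)]
    for i in range(m):
        # Gradient estimate at x_i: sum_p K(x_p, x_i) c_p.
        g = [sum(kernel(xs[p], xs[i]) * c[p][l] for p in range(m)) for l in range(n)]
        for j in range(m):
            diff = [xs[j][l] - xs[i][l] for l in range(n)]  # x_j - x_i
            w = math.exp(-sum(d * d for d in diff) / (2 * s * s)) / s ** (n + 2)
            V[i][j] = w * (ys[i] - ys[j] + sum(gl * d for gl, d in zip(g, diff))) ** 2
    return V

def tilted_empirical_risk(V, t):
    """E_z(c, t) = (1/t) log((1/m^2) sum_ij exp(t V_ij)), with a stabilizing shift."""
    flat = [v for row in V for v in row]
    shift = max(t * v for v in flat)
    return (shift + math.log(sum(math.exp(t * v - shift) for v in flat) / len(flat))) / t

def tilt_weights(V, t):
    """phi_z(c, z_i, z_j) = exp(t (V_ij - E_z(c, t)))."""
    E = tilted_empirical_risk(V, t)
    return [[math.exp(t * (v - E)) for v in row] for row in V]

# Hypothetical toy data; the last response is shifted to act as an outlier.
xs = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
ys = [0.0, 2.0, -1.0, 4.0]
c = [[0.1, -0.1] for _ in xs]
V = pair_losses(c, xs, ys, s=1.0)
phi = tilt_weights(V, t=-1.0)
total_weight = sum(sum(row) for row in phi)
```

By construction the weights sum to m^{2} (here 16), and for negative t the pairs with the smallest loss receive the largest weight.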

The gradients of

E_{z} (c, t)

and

∥ \sum_{i = 1}^{m} c_{i} K_{x_{i}} ∥_{K}^{2}

at c are given by

\begin{matrix} \nabla_{c} E_{z} (c, t) = & \frac{1}{m^{2}} \sum_{i, j = 1}^{m} ϕ_{z} (c, z_{i}, z_{j}) 2 ω (x_{i}, x_{j}) (y_{i} - y_{j} + \sum_{p = 1}^{m} K (x_{p}, x_{i}) {\hat{x}}_{i j} c_{p}) \times \\ {(K (x_{1}, x_{i}) {\hat{x}}_{i j}, \dots, K (x_{m}, x_{i}) {\hat{x}}_{i j})}^{T}, \end{matrix}

and

\nabla_{c} ∥ \sum_{i = 1}^{m} c_{i} K_{x_{i}} ∥_{K}^{2} = 2 \sum_{i = 1}^{m} {(K (x_{i}, x_{1}) c_{i}^{T}, \dots, K (x_{i}, x_{m}) c_{i}^{T})}^{T} .

Correspondingly, scheme (20) can be solved via the following gradient method:

c^{k} = c^{k - 1} - α (\nabla_{c} E_{z} (c^{k - 1}, t) + λ \nabla_{c} ∥ \sum_{i = 1}^{m} c_{i, k - 1} K_{x_{i}} ∥_{K}^{2}),

(21)

where

c^{k} = {(c_{1, k}^{T}, \dots, c_{m, k}^{T})}^{T} \in R^{m n}

is the calculated solution at iteration k, and

α

is the step-size. The detailed gradient descent scheme is stated in Algorithm 1. To prove the convergence, we introduce the following lemma derived from Theorem 1 in [29].

Lemma 4.

When

h (c)

has a γ-Lipschitz continuous gradient (γ-smoothness) and is μ-strongly convex, for the basic unconstrained optimization problem

c^{*} = arg min h (c)

, the gradient descent algorithm

c^{k} = c^{k - 1} - \frac{1}{γ} \nabla h (c^{k - 1})

with a step-size of

1 / γ

has a global linear convergence rate

h (c^{k}) - h (c^{*}) \leq {(1 - \frac{μ}{γ})}^{k} (h (c^{0}) - h (c^{*})) .
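
The linear rate in Lemma 4 can be checked numerically on a simple example. The sketch below assumes a hypothetical two-dimensional quadratic h(c) = 0.5 c^T A c - b^T c, for which μ and γ are the smallest and largest eigenvalues of A; it is not tied to the learning objective of this paper.

```python
import math

# Hypothetical strongly convex quadratic with a symmetric positive definite A.
A = [[3.0, 1.0], [1.0, 2.0]]
b = [1.0, -1.0]

def h(c):
    Ac = [A[0][0] * c[0] + A[0][1] * c[1], A[1][0] * c[0] + A[1][1] * c[1]]
    return 0.5 * (c[0] * Ac[0] + c[1] * Ac[1]) - (b[0] * c[0] + b[1] * c[1])

def grad_h(c):
    return [A[0][0] * c[0] + A[0][1] * c[1] - b[0],
            A[1][0] * c[0] + A[1][1] * c[1] - b[1]]

# Eigenvalues of the 2x2 symmetric matrix A give mu (smallest) and gamma (largest).
half_trace = 0.5 * (A[0][0] + A[1][1])
radius = math.sqrt((0.5 * (A[0][0] - A[1][1])) ** 2 + A[0][1] ** 2)
mu, gamma = half_trace - radius, half_trace + radius

# Exact minimizer c* = A^{-1} b via the 2x2 inverse formula.
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
c_star = [(A[1][1] * b[0] - A[0][1] * b[1]) / det,
          (A[0][0] * b[1] - A[1][0] * b[0]) / det]

c = [5.0, -4.0]                     # arbitrary starting point c^0
gap0 = h(c) - h(c_star)
gaps, bounds = [], []
for k in range(1, 21):
    g = grad_h(c)
    c = [c[0] - g[0] / gamma, c[1] - g[1] / gamma]   # step size 1 / gamma
    gaps.append(h(c) - h(c_star))                    # h(c^k) - h(c*)
    bounds.append((1.0 - mu / gamma) ** k * gap0)    # right-hand side of the lemma
```

Each entry of `gaps` stays below the corresponding entry of `bounds`, as the lemma predicts for this example.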

Algorithm 1 Gradient descent for the Gradient Learning under TERM
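
A minimal sketch of the update (21) is given below. It is not the authors' Algorithm 1 listing, which is not reproduced in this text, but an illustration under explicit assumptions: a Gaussian kernel of unit width, s = 1, a negative tilt t = -0.5, a fixed hypothetical step size of 0.02 instead of the value 1/γ used in the analysis, zero initialization, and made-up toy data with one outlier.

```python
import math

def kernel(a, b):
    # Gaussian kernel with unit width: an illustrative assumption.
    return math.exp(-0.5 * sum((p - q) ** 2 for p, q in zip(a, b)))

def objective_and_gradient(c, xs, ys, t, lam, s):
    """Value and gradient of E_z(c, t) + lam * ||sum_i c_i K_{x_i}||_K^2."""
    m, n = len(xs), len(xs[0])
    K = [[kernel(xs[i], xs[j]) for j in range(m)] for i in range(m)]
    terms = []  # one entry (i, x_j - x_i, omega_ij, residual_ij, V_ij) per pair
    for i in range(m):
        g_i = [sum(K[p][i] * c[p][l] for p in range(m)) for l in range(n)]
        for j in range(m):
            diff = [xs[j][l] - xs[i][l] for l in range(n)]
            w = math.exp(-sum(d * d for d in diff) / (2 * s * s)) / s ** (n + 2)
            r = ys[i] - ys[j] + sum(g * d for g, d in zip(g_i, diff))
            terms.append((i, diff, w, r, w * r * r))
    shift = max(t * term[-1] for term in terms)
    E = (shift + math.log(sum(math.exp(t * term[-1] - shift) for term in terms) / m ** 2)) / t
    penalty = sum(K[i][j] * c[i][l] * c[j][l]
                  for i in range(m) for j in range(m) for l in range(n))
    # Gradient of the penalty with respect to c_p: 2 * lam * sum_i K(x_i, x_p) c_i.
    grad = [[2.0 * lam * sum(K[i][p] * c[i][l] for i in range(m)) for l in range(n)]
            for p in range(m)]
    # Gradient of E_z: tilt-weighted average of the gradients of V_z.
    for i, diff, w, r, v in terms:
        coef = math.exp(t * (v - E)) * 2.0 * w * r / m ** 2
        for p in range(m):
            for l in range(n):
                grad[p][l] += coef * K[p][i] * diff[l]
    return E + lam * penalty, grad

def gradient_descent(xs, ys, t, lam, s, step, iters):
    m, n = len(xs), len(xs[0])
    c = [[0.0] * n for _ in range(m)]   # zero initialization (an assumption)
    history = []
    for _ in range(iters):
        value, grad = objective_and_gradient(c, xs, ys, t, lam, s)
        history.append(value)
        c = [[c[p][l] - step * grad[p][l] for l in range(n)] for p in range(m)]
    history.append(objective_and_gradient(c, xs, ys, t, lam, s)[0])
    return c, history

# Hypothetical data: y = 2 x^1 - x^2, with the last response corrupted by an outlier.
xs = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.2), (0.3, 0.8)]
ys = [2.0 * a - b for a, b in xs]
ys[-1] += 3.0
c_hat, history = gradient_descent(xs, ys, t=-0.5, lam=0.1, s=1.0, step=0.02, iters=200)
```

The list `history` records the objective value before each update and once after the last one, so it can be used to monitor whether the chosen step size yields a monotone decrease on a given data set.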

From Lemma 4, we obtain the following conclusion, which states that the proposed algorithm converges to the minimizer of (20) for a suitable step size

α

.

Theorem 3.

Denote

L (c, t) = E_{z} (c, t) + λ {∥ \sum_{i = 1}^{m} c_{i} K_{x_{i}} ∥}_{K}^{2}

, and let

β_{\max}, β_{\min}

be the maximum and minimum eigenvalues of the kernel matrix

K

, respectively. There exist

μ \in R^{+}

and

γ \in R^{+}

dependent on t such that

L (c^{k}, t)

is γ-smooth and μ-strongly convex for any

t > - \frac{n λ β_{\min}}{64 (M^{2} + C_{K} M_{X}) M_{X}^{2} m κ^{4}}

. In addition, let

c_{z, λ}

be the minimizer defined in scheme (20) and

\{c^{k}\}

be the sequence generated by Algorithm 1 with

α = 1 / γ

, we have

\begin{matrix} L (c^{k}, t) - L (c_{z, λ}, t) \leq {(1 - \frac{μ}{γ})}^{k} (L (c^{0}, t) - L (c_{z, λ}, t)) . \end{matrix}

(22)

Proof of Theorem 3.

Note that the strong convexity and the smoothness are related to the Hessian Matrix, and we provide the proof by dividing the Hessian Matrix into three parts:

\begin{matrix} \nabla_{c c^{T}}^{2} L (c, t) = & \underbrace{\frac{t}{m^{2}} \sum_{i, j = 1}^{m} ϕ_{z} (c, z_{i}, z_{j}) (\nabla_{c} V_{z} (c, z_{i}, z_{j}) - \nabla_{c} E_{z} (c, t)) \nabla_{c} V_{z} {(c, z_{i}, z_{j})}^{T}}_{E_{1}} \\ & + \underbrace{\frac{1}{m^{2}} \sum_{i, j = 1}^{m} ϕ_{z} (c, z_{i}, z_{j}) \nabla_{c c^{T}}^{2} V_{z} (c, z_{i}, z_{j})}_{E_{2}} + \underbrace{λ \nabla_{c c^{T}}^{2} {∥ \sum_{i = 1}^{m} c_{i} K_{x_{i}} ∥}_{K}^{2}}_{E_{3}} . \end{matrix}

(23)

(1) Estimation on

E_{1}

: Note that

m^{2} \nabla_{c} E_{z} (c, t) = \sum_{i, j = 1}^{m} ϕ_{z} (c, z_{i}, z_{j}) \nabla_{c} V_{z} (c, z_{i}, z_{j})

and

\sum_{i, j = 1}^{m} ϕ_{z} (c, z_{i}, z_{j}) = m^{2}

. It follows that

\begin{matrix} \sum_{i, j = 1}^{m} ϕ_{z} (c, z_{i}, z_{j}) (\nabla_{c} V_{z} (c, z_{i}, z_{j}) - \nabla_{c} E_{z} (c, t)) \nabla_{c}^{T} E_{z} (c, t) = 0 . \end{matrix}

Hence, we can get the following equation:

\begin{matrix} E_{1} = \frac{t}{m^{2}} \sum_{i, j = 1}^{m} ϕ_{z} (c, z_{i}, z_{j}) (\nabla_{c} V_{z} (c, z_{i}, z_{j}) - \nabla_{c} E_{z} (c, t)) {(\nabla_{c} V_{z} (c, z_{i}, z_{j}) - \nabla_{c} E_{z} (c, t))}^{T} . \end{matrix}

(24)

Similar to the proof of Lemma 1, for

i, j = 1, \dots, m

, it directly follows that

\begin{matrix} ω (x_{i}, x_{j}) (y_{i} - y_{j} + \sum_{p = 1}^{m} K (x_{p}, x_{i}) {\hat{x}}_{i j} c_{p}) \leq 2 \sqrt{2 (M^{2} + C_{K} M_{X})} . \end{matrix}

Note that, for

i, j = 1, \dots, m

,

\nabla_{c} V_{z} (c, z_{i}, z_{j}) \nabla_{c} V_{z} {(c, z_{i}, z_{j})}^{T}

has rank one and hence a single nonzero eigenvalue, which means

\begin{matrix} \nabla_{c} V_{z} (c, z_{i}, z_{j}) \nabla_{c} V_{z} {(c, z_{i}, z_{j})}^{T} ⪯ & 32 (M^{2} + C_{K} M_{X}) M_{X}^{2} m κ^{4} I_{m n}, \end{matrix}

(25)

and we have

\begin{matrix} {(\nabla_{c} V_{z} (c, z_{i}, z_{j}) - \nabla_{c} E_{z} (c, t))}^{T} (\nabla_{c} V_{z} (c, z_{i}, z_{j}) - \nabla_{c} E_{z} (c, t)) \leq 128 (M^{2} + C_{K} M_{X}) M_{X}^{2} m κ^{4} . \end{matrix}

It follows that the eigenvalues of

E_{1}

are bounded in absolute value by

128 | t | (M^{2} + C_{K} M_{X}) M_{X}^{2} m κ^{4}

. Then, the following inequalities are satisfied:

\begin{matrix} \begin{cases} 0_{m n} ⪯ E_{1} ⪯ 128 t (M^{2} + C_{K} M_{X}) M_{X}^{2} m κ^{4} I_{m n}, & t > 0; \\ 128 t (M^{2} + C_{K} M_{X}) M_{X}^{2} m κ^{4} I_{m n} ⪯ E_{1} ⪯ 0_{m n}, & t < 0, \end{cases} \end{matrix}

(26)

where

0_{m n}

is the

m n \times m n

matrix with all elements zero.

(2) Estimation on

E_{2}

: Note that

\nabla_{c c^{T}}^{2} V_{z} (c, z_{i}, z_{j})

can be rewritten as

\begin{matrix} 2 ω (x_{i}, x_{j}) (K (x_{1}, x_{i}) {\hat{x}}_{i j}, \dots, K (x_{m}, x_{i}) {\hat{x}}_{i j}) {(K (x_{1}, x_{i}) {\hat{x}}_{i j}, \dots, K (x_{m}, x_{i}) {\hat{x}}_{i j})}^{T} . \end{matrix}

Similar to (25), we have

\nabla_{c c^{T}}^{2} V_{z} (c, z_{i}, z_{j}) ⪯ 2 κ^{4} M_{X}^{2} I_{m n}

. It follows that

\begin{matrix} 0_{m n} ⪯ E_{2} ⪯ 2 κ^{4} M_{X}^{2} I_{m n} . \end{matrix}

(27)

(3) Estimation on

E_{3}

: By a direct computation, we have

\begin{matrix} E_{3} = 2 λ [\begin{matrix} I_{n} K (x_{1}, x_{1}) & I_{n} K (x_{1}, x_{2}) & \dots & I_{n} K (x_{1}, x_{m}) \\ I_{n} K (x_{2}, x_{1}) & I_{n} K (x_{2}, x_{2}) & \dots & I_{n} K (x_{2}, x_{m}) \\ ⋮ & ⋮ & ⋱ & ⋮ \\ I_{n} K (x_{m}, x_{1}) & I_{n} K (x_{m}, x_{2}) & \dots & I_{n} K (x_{m}, x_{m}) \end{matrix}] . \end{matrix}

Setting

Q = {(q_{11}, q_{21}, \dots, q_{n 1}, \dots, q_{1 m}, q_{2 m}, \dots, q_{n m})}^{T} \in R^{m n}

, we deduce that

\begin{matrix} Q^{T} E_{3} Q = 2 λ \sum_{l = 1}^{n} \sum_{i = 1}^{m} \sum_{j = 1}^{m} K (x_{i}, x_{j}) q_{l i} q_{l j} . \end{matrix}

Since the matrix of the quadratic form

\sum_{i = 1}^{m} \sum_{j = 1}^{m} K (x_{i}, x_{j}) q_{l i} q_{l j}

is

K

, we obtain

\begin{matrix} 2 λ n β_{m i n} I_{m n} ⪯ E_{3} ⪯ 2 λ n β_{m a x} I_{m n} . \end{matrix}

(28)
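
The block structure of E_3 and the quadratic form above can be checked numerically. The sketch below is illustrative only: the Gaussian kernel, the sample points `X`, and the values of `m`, `n`, and `lam` are assumptions chosen for the demonstration. With the ordering of Q used above, E_3 is 2λ times the Kronecker product of K with I_n, so its eigenvalues are those of 2λK, each repeated n times; the extreme eigenvalues of K therefore control the spectrum of E_3.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, lam = 6, 3, 0.1  # hypothetical sizes and regularization parameter

# Hypothetical kernel matrix: Gaussian kernel on random points.
X = rng.standard_normal((m, n))
sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-sq_dist / 2.0)

# Block (i, j) of E_3 is I_n * K(x_i, x_j), i.e. E_3 = 2 * lam * (K kron I_n).
E3 = 2.0 * lam * np.kron(K, np.eye(n))

# q[l, i] matches q_{l i}; Q stacks (q_{1i}, ..., q_{ni}) for i = 1, ..., m.
q = rng.standard_normal((n, m))
Q = q.T.reshape(-1)

lhs = Q @ E3 @ Q
rhs = 2.0 * lam * np.einsum("li,ij,lj->", q, K, q)

eig_E3 = np.linalg.eigvalsh(E3)
eig_K = np.linalg.eigvalsh(K)
spectrum_matches = np.allclose(eig_E3, np.repeat(2.0 * lam * eig_K, n))
print(np.isclose(lhs, rhs), spectrum_matches)
```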

Combining (26), (27) and (28), there exist two constants

μ = min {2 n λ β_{m i n} + 128 t (M^{2} + C_{K} M_{X}) M_{X}^{2} m κ^{4}, 2 n λ β_{m i n}}

and

γ = max {128 t (M^{2} + C_{K} M_{X}) M_{X}^{2} m κ^{4} + 2 n λ β_{m a x}, 2 κ^{4} M_{X}^{2} + 2 n λ β_{m a x}}

satisfying that

\begin{matrix} μ I_{m n} ⪯ \nabla_{c c^{T}}^{2} L (c, t) ⪯ γ I_{m n} . \end{matrix}

Note that

μ > 0

whenever

t > - \frac{n λ β_{m i n}}{64 (M^{2} + C_{K} M_{X}) M_{X}^{2} m κ^{4}}

, which means that

L (c, t)

is γ-smooth and μ-strongly convex. The desired result follows from Lemma 4. □

4. Simulation Experiments

In this section, we carry out simulation studies with the TGL model (

t < 0

for robustness) on a synthetic data set for the robust variable selection problem. Let the observation data set

z = {\{z_{i} = (x_{i}, y_{i})\}}_{i = 1}^{m}

with

x_{i} = (x_{i}^{1}, \dots, x_{i}^{n})

be generated by the following linear equations:

y_{i} = x_{i} \cdot w + ϵ,

where

ϵ

represents the noise or outliers. Specifically, three noise distributions are used: Cauchy noise with location parameter

a = 2

and scale parameter

b = 4

, chi-square noise with 5 degrees of freedom scaled by 0.01, and Gaussian noise

N (0, 0.3)

. In addition, outliers are included at three different proportions,

0 %

,

20 %

, or

40 %

, and are drawn from the Gaussian distribution

N (0, 100)

. Meanwhile, we consider two different cases with

(m, n) = (50, 50), (30, 80)

corresponding to m = n and m < n, respectively. The weight vector

w = (w^{1}, \dots, w^{n})

over different dimensions is constructed as follows:

w^{l} = 2 + 0.5 sin (\frac{2 π l}{10})

, for

l = 1, \dots, N_{n}

and 0 otherwise.

Here,

N_{n} = 30

is the number of effective variables. Two settings are considered for x: uncorrelated variables

x \sim N (0_{n}, I_{n})

and correlated variables

x \sim N (0_{n}, Σ_{n})

, where the covariance matrix

Σ_{n}

has

(l, p)

th entry

0.5^{| l - p |}

.
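
The data-generating process described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code. The function name `make_data` and its arguments are hypothetical, and two readings of the text are assumptions: the second argument of N(·, ·) is taken to be a variance, and outliers are taken to replace the noise term on a randomly chosen subset of samples.

```python
import numpy as np


def make_data(m, n, noise="gaussian", outlier_frac=0.0, correlated=False,
              n_effective=30, seed=0):
    """Generate (X, y, w) following the simulation setup (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    idx = np.arange(1, n + 1)

    # w^l = 2 + 0.5 sin(2 pi l / 10) for l <= N_n, and 0 otherwise.
    w = np.where(idx <= n_effective, 2.0 + 0.5 * np.sin(2 * np.pi * idx / 10), 0.0)

    # Covariance: identity, or (l, p) entry 0.5^{|l - p|} for correlated inputs.
    cov = 0.5 ** np.abs(idx[:, None] - idx[None, :]) if correlated else np.eye(n)
    X = rng.multivariate_normal(np.zeros(n), cov, size=m)

    if noise == "cauchy":
        eps = 2.0 + 4.0 * rng.standard_cauchy(m)   # location 2, scale 4
    elif noise == "chi2":
        eps = 0.01 * rng.chisquare(5, m)           # 5 degrees of freedom, scaled
    elif noise == "gaussian":
        eps = rng.normal(0.0, np.sqrt(0.3), m)     # assumption: 0.3 is a variance
    else:
        raise ValueError(f"unknown noise type: {noise}")

    # Assumption: outliers overwrite the noise on a random subset of samples.
    k = int(round(outlier_frac * m))
    out_idx = rng.choice(m, size=k, replace=False)
    eps[out_idx] = rng.normal(0.0, 10.0, k)        # assumption: 100 is a variance

    y = X @ w + eps
    return X, y, w


X, y, w = make_data(30, 80, noise="cauchy", outlier_frac=0.2, correlated=True)
print(X.shape, y.shape, np.count_nonzero(w))
```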

For variable selection, we run the TGL model with

t = 6 \times 10^{- 6}, - 1, - 10

and compare it with the traditional GL model [10] and the RGL model [20]. For the GL and TGL models,

N_{n}

variables are selected by ranking

\begin{matrix} r_{l} = \frac{∥ f_{z, λ}^{l} ∥_{K}^{2}}{\sum_{p = 1}^{n} {∥ f_{z, λ}^{p} ∥}_{K}^{2}}, l = 1, \dots, n . \end{matrix}

For the RGL model,

N_{n}

variables are selected by ranking

\begin{matrix} r_{l} = \frac{\sum_{i = 1}^{m} {(c_{i}^{l})}^{2}}{\sum_{q = 1}^{n} \sum_{i = 1}^{m} {(c_{i}^{q})}^{2}}, l = 1, \dots, n . \end{matrix}

The more truly effective variables a model selects (at most

N_{n}

), the better the algorithm.
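
The two ranking scores and the evaluation count can be sketched as below. The helper names and the coefficient layout (`C[i, l]` holding c_i^l) are assumptions made for illustration. The sketch uses the fact that a function f^l = Σ_i c_i^l K_{x_i} has squared RKHS norm (c^l)^T K c^l, and it assumes that the N_n variables with the largest scores are selected and that the effective variables are the first N_n coordinates, as in the definition of w.

```python
import numpy as np


def rkhs_scores(C, K):
    """Scores r_l from squared RKHS norms; C[i, l] = c_i^l, K is the kernel matrix."""
    norms = np.einsum("il,ij,jl->l", C, K, C)  # (c^l)^T K c^l for each l
    return norms / norms.sum()


def coef_scores(C):
    """Scores r_l from squared coefficients, as in the RGL ranking formula."""
    sq = (C ** 2).sum(axis=0)
    return sq / sq.sum()


def count_effective(scores, n_effective):
    """Number of truly effective variables among the top-n_effective scores."""
    top = np.argsort(scores)[::-1][:n_effective]
    return int(np.sum(top < n_effective))  # effective variables: indices 0..N_n-1


# Toy check with hypothetical coefficients: the first 3 of 8 variables matter.
rng = np.random.default_rng(3)
m, n, n_eff = 5, 8, 3
C = 0.01 * rng.standard_normal((m, n))
C[:, :n_eff] += 1.0
K = np.eye(m)  # placeholder kernel matrix

print(count_effective(rkhs_scores(C, K), n_eff),
      count_effective(coef_scores(C), n_eff))
```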

We repeat each experiment 30 times, with the observation set

z

generated anew in each circumstance. The average number of effective variables selected in each circumstance is reported in Table 1, and the best results are marked in bold. Several conclusions can be drawn from Table 1.

(1) When the input variables are uncorrelated, the three models have similar performance under different noise conditions and can provide satisfactory variable selection results (approaching

N_{n}

) without outliers. However, as the proportion of outliers increases, the performance degrades severely for GL and slightly for TGL (

t < 0

for robustness), especially in the case

(m, n) = (30, 80)

. In contrast, RGL consistently provides satisfactory performance. This is consistent with the phenomenon previously reported in [20].

(2) When the input variables are correlated, the three models also perform similarly under different noise conditions but can select only part of the effective variables, ranging from

N_{n} / 3

to

2 N_{n} / 3

. In general, they degrade slowly as the proportion of outliers increases and perform better in the case

(m, n) = (50, 50)

than in

(m, n) = (30, 80)

. Specifically, the TGL model with

t = - 1

gives slightly better selection results than GL and RGL in the case

(m, n) = (50, 50)

. This supports, to some extent, the advantage of TGL.

(3) It is worth noting that the TGL model with

t = 6 \times 10^{- 6}

performs similarly to GL. This supports both the theoretical conclusion that TGL recovers GL as

t \to 0

and the effectiveness of the proposed gradient descent method in converging to the minimizer.

(4) Since the variable selection results of the TGL model differ greatly across values of the parameter t, we conduct further simulation studies to investigate its influence. Figure 1 shows the variable selection results for t ranging from

- 100

to

- 0.1

. Satisfactory performance is achieved when t is near

- 1

, whereas performance deteriorates when

| t |

is too large. This coincides with our previous discussion that

L (c, t)

is strongly convex only for a limited range of t.
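
The role of t in observations (3) and (4) can be illustrated on a toy set of losses. The sketch below assumes the standard t-tilted aggregate from the tilted ERM literature [21], namely (1/t) log of the average of exp(t · loss). The function name `tilted_mean` and the loss values are hypothetical. For t close to 0 the aggregate is close to the ordinary mean, and for negative t a single large loss is progressively down-weighted.

```python
import numpy as np


def tilted_mean(losses, t):
    """(1/t) * log(mean(exp(t * losses))), computed stably; plain mean at t = 0."""
    losses = np.asarray(losses, dtype=float)
    if t == 0:
        return losses.mean()
    z = t * losses
    z_max = z.max()  # shift by the maximum to avoid overflow in exp
    return (z_max + np.log(np.mean(np.exp(z - z_max)))) / t


# Hypothetical losses: four small values and one outlier.
losses = np.array([0.1, 0.2, 0.15, 0.3, 50.0])

plain = losses.mean()
near_zero = tilted_mean(losses, 6e-6)
robust_1 = tilted_mean(losses, -1.0)
robust_10 = tilted_mean(losses, -10.0)

print(plain, near_zero, robust_1, robust_10)
```

The aggregate is nondecreasing in t, so moving t further below zero always reduces the influence of the largest losses; the experiments above indicate that this should not be pushed too far.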

5. Conclusions

In this paper, we have proposed a new learning objective TGL by embedding the t-tilted loss into the GL model. On the theoretical side, we have established its consistency and provided the convergence rate with the help of error decomposition and operator approximation technique. On the practical side, we have proposed a gradient descent method to solve the learning objective and provided the convergence analysis. Simulated experiments have verified the theoretical conclusion that TGL recovers the GL as

t \to 0

and the effectiveness of the proposed gradient descent method in converging to the minimizer. They also demonstrate the advantage of TGL when the input variables are correlated. Building on the present work, several open problems deserve further research, for example, using random feature approximation to scale up kernel methods [30] and learning with data-dependent hypothesis spaces to achieve tighter error bounds [31]. We are currently investigating these problems.

Author Contributions

All authors have made a great contribution to the work. Methodology, L.L., C.Y., B.S. and C.X.; formal analysis, L.L. and C.X.; investigation, C.Y., Z.P. and C.X.; writing—original draft preparation, L.L., B.S. and W.L.; writing—review and editing, W.L. and C.X.; visualization, C.Y. and Z.P.; supervision, C.X.; project administration, B.S.; funding acquisition, B.S. and W.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Fundamental Research Funds for the Central Universities of China (2662020LXQD002), the Natural Science Foundation of China (12001217), the Key Laboratory of Biomedical Engineering of Hainan Province (Opening Foundation 2022003), and the Hubei Key Laboratory of Applied Mathematics (HBAM 202004).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Schwarz, G. Estimating the Dimension of a Model. Ann. Stat. 1978, 6, 461–464.
2. Fan, J.; Li, R. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 2001, 96, 1348–1360.
3. Efron, B.; Hastie, T.; Johnstone, I.; Tibshirani, R. Least angle regression. Ann. Stat. 2004, 32, 407–499.
4. Chen, H.; Guo, C.; Xiong, H.; Wang, Y. Sparse additive machine with ramp loss. Anal. Appl. 2021, 19, 509–528.
5. Chen, H.; Wang, Y.; Zheng, F.; Deng, C.; Huang, H. Sparse Modal Additive Model. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 2373–2387.
6. Deng, H.; Chen, J.; Song, B.; Pan, Z. Error bound of mode-based additive models. Entropy 2021, 23, 651.
7. Engle, R.F.; Granger, C.W.J.; Rice, J.; Weiss, A. Semiparametric Estimates of the Relation Between Weather and Electricity Sales. J. Am. Stat. Assoc. 1986, 81, 310–320.
8. Zhang, H.; Cheng, G.; Liu, Y. Linear or Nonlinear? Automatic Structure Discovery for Partially Linear Models. J. Am. Stat. Assoc. 2011, 106, 1099–1112.
9. Huang, J.; Wei, F.; Ma, S. Semiparametric Regression Pursuit. Stat. Sin. 2012, 22, 1403–1426.
10. Mukherjee, S.; Zhou, D. Learning Coordinate Covariances via Gradients. J. Mach. Learn. Res. 2006, 7, 519–549.
11. Mukherjee, S.; Wu, Q. Estimation of Gradients and Coordinate Covariation in Classification. J. Mach. Learn. Res. 2006, 7, 2481–2514.
12. Jia, C.; Wang, H.; Zhou, D. Gradient learning in a classification setting by gradient descent. J. Approx. Theory 2009, 161, 674–692.
13. He, X.; Lv, S.; Wang, J. Variable selection for classification with derivative-induced regularization. Stat. Sin. 2020, 30, 2075–2103.
14. Dong, X.; Zhou, D.X. Learning gradients by a gradient descent algorithm. J. Math. Anal. Appl. 2008, 341, 1018–1027.
15. Mukherjee, S.; Wu, Q.; Zhou, D. Learning gradients on manifolds. Bernoulli 2010, 16, 181–207.
16. Borkar, V.S.; Dwaracherla, V.R.; Sahasrabudhe, N. Gradient Estimation with Simultaneous Perturbation and Compressive Sensing. J. Mach. Learn. Res. 2017, 18, 161:1–161:27.
17. Ye, G.B.; Xie, X. Learning sparse gradients for variable selection and dimension reduction. Mach. Learn. 2012, 87, 303–355.
18. He, X.; Wang, J.; Lv, S. Efficient kernel-based variable selection with sparsistency. arXiv 2018, arXiv:1802.09246.
19. Guinney, J.; Wu, Q.; Mukherjee, S. Estimating variable structure and dependence in multitask learning via gradients. Mach. Learn. 2011, 83, 265–287.
20. Feng, Y.; Yang, Y.; Suykens, J.A.K. Robust Gradient Learning with Applications. IEEE Trans. Neural Netw. Learn. Syst. 2016, 27, 822–835.
21. Li, T.; Beirami, A.; Sanjabi, M.; Smith, V. On tilted losses in machine learning: Theory and applications. arXiv 2021, arXiv:2109.06141.
22. Cucker, F.; Smale, S. On the mathematical foundations of learning. Bull. Am. Math. Soc. 2002, 39, 1–49.
23. Chen, H.; Wang, Y. Kernel-based sparse regression with the correntropy-induced loss. Appl. Comput. Harmon. Anal. 2018, 44, 144–164.
24. Feng, Y.; Fan, J.; Suykens, J.A. A Statistical Learning Approach to Modal Regression. J. Mach. Learn. Res. 2020, 21, 1–35.
25. Yang, L.; Lv, S.; Wang, J. Model-free variable selection in reproducing kernel Hilbert space. J. Mach. Learn. Res. 2016, 17, 2885–2908.
26. Zhou, D.X. Capacity of reproducing kernel spaces in learning theory. IEEE Trans. Inf. Theory 2003, 49, 1743–1752.
27. Belkin, M.; Niyogi, P.; Sindhwani, V. Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples. J. Mach. Learn. Res. 2006, 7, 2399–2434.
28. Schölkopf, B.; Smola, A.J.; Bach, F. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond; MIT Press: Cambridge, MA, USA, 2002.
29. Karimi, H.; Nutini, J.; Schmidt, M. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Riva del Garda, Italy, 19–23 September 2016; Springer: Cham, Switzerland, 2016; pp. 795–811.
30. Dai, B.; Xie, B.; He, N.; Liang, Y.; Raj, A.; Balcan, M.F.F.; Song, L. Scalable Kernel Methods via Doubly Stochastic Gradients. In Advances in Neural Information Processing Systems; Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., Weinberger, K., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2014; Volume 27.
31. Wang, Y.; Chen, H.; Song, B.; Li, H. Regularized modal regression with data-dependent hypothesis spaces. Int. J. Wavelets Multiresolution Inf. Process. 2019, 17, 1950047.

Figure 1. The influence of different t on the variable selection results.

Table 1. Variable selection results for different circumstances.

Noise	(m, n)	Method	Uncorrelated, 0% outliers	Uncorrelated, 20% outliers	Uncorrelated, 40% outliers	Correlated, 0% outliers	Correlated, 20% outliers	Correlated, 40% outliers
Cauchy	(50, 50)	GL	28.70	24.27	19.03	20.27	17.53	16.53
Cauchy	(50, 50)	RGL	29.00	26.57	27.7	20.80	15.40	14.16
Cauchy	(50, 50)	TGL (t = 6 × 10^-6)	29.63	24.06	18.04	20.67	17.00	16.23
Cauchy	(50, 50)	TGL (t = -1)	29.53	26.07	26.00	21.07	17.6	17.13
Cauchy	(50, 50)	TGL (t = -10)	29.53	24.23	24.03	16.93	15.78	15.67
Chi-square	(50, 50)	GL	29.40	24.73	20.37	18.40	17.93	16.03
Chi-square	(50, 50)	RGL	29.63	26.90	27.60	19.90	16.10	14.67
Chi-square	(50, 50)	TGL (t = 6 × 10^-6)	29.84	24.4	20.90	18.20	17.30	17.20
Chi-square	(50, 50)	TGL (t = -1)	29.14	24.56	25.18	21.10	18.77	17.93
Chi-square	(50, 50)	TGL (t = -10)	25.13	24.10	24.93	20.83	17.10	16.60
Gaussian	(50, 50)	GL	28.83	25.16	20.13	18.04	16.70	15.93
Gaussian	(50, 50)	RGL	29.40	26.70	27.20	19.87	16.40	14.36
Gaussian	(50, 50)	TGL (t = 6 × 10^-6)	29.23	25.23	20.20	18.37	17.76	16.3
Gaussian	(50, 50)	TGL (t = -1)	27.63	26.20	25.90	21.06	18.40	17.90
Gaussian	(50, 50)	TGL (t = -10)	22.9	25.23	25.06	21.43	17.13	16.23
Cauchy	(30, 80)	GL	29.60	11.33	12.30	11.93	11.57	10.97
Cauchy	(30, 80)	RGL	29.87	29.97	29.93	16.50	16.97	15.20
Cauchy	(30, 80)	TGL (t = 6 × 10^-6)	28.47	10.67	10.49	11.13	11.03	10.93
Cauchy	(30, 80)	TGL (t = -1)	27.06	20.67	11.3	17.08	14.4	11.56
Cauchy	(30, 80)	TGL (t = -10)	16.66	16.23	15.12	13.97	13.92	13.54
Chi-square	(30, 80)	GL	29.83	11.47	12.57	12.57	11.67	11.33
Chi-square	(30, 80)	RGL	29.93	29.93	29.71	19.87	18.80	17.50
Chi-square	(30, 80)	TGL (t = 6 × 10^-6)	29.03	11.10	12.90	12.50	10.87	11.43
Chi-square	(30, 80)	TGL (t = -1)	29.37	23.60	23.53	16.08	14.4	11.40
Chi-square	(30, 80)	TGL (t = -10)	28.17	23.33	23.23	13.97	13.92	13.54
Gaussian	(30, 80)	GL	29.77	11.83	12.27	12.92	12.44	11.54
Gaussian	(30, 80)	RGL	29.70	29.93	29.93	19.73	13.67	9.83
Gaussian	(30, 80)	TGL (t = 6 × 10^-6)	28.47	10.67	10.49	13.06	9.79	8.73
Gaussian	(30, 80)	TGL (t = -1)	27.06	20.67	11.3	16.08	14.4	11.90
Gaussian	(30, 80)	TGL (t = -10)	16.66	16.23	15.12	13.97	13.92	13.54

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

