1 Introduction

In this paper, we are mainly interested in numerical methods for the weighted low-rank Hankel matrix optimization problem:

$$\begin{aligned} \min \ f(X) := \frac{1}{2} \Vert W \circ (X-A) \Vert ^2, \ \ \text{ s.t. } \ \ X \in {\mathbb {M}_r}\cap \mathbb {H}^{k \times \ell }, \ \ \end{aligned}$$
(1)

where \(\mathbb {H}^{k \times \ell }\) is the space of all \(k \times \ell \) Hankel matrices in the real/complex field with the standard trace inner product, \(\Vert \cdot \Vert \) is the Frobenius norm, \({\mathbb {M}_r}\) is the set of matrices whose ranks are not greater than a given rank r, A is given, and W is a given weight matrix with \(W_{ij} \ge 0\). Here \(A \circ B\) denotes the elementwise (Hadamard) product of A and B. We note that all matrices involved are of size \(k \times \ell \). The difficulties in solving (1) lie in the low-rank constraint and in handling a general weight matrix W effectively, the latter of which is often overlooked in the existing literature. Our main purpose is to develop a novel, globally convergent algorithm for (1); its efficiency will be benchmarked against several state-of-the-art algorithms.

In what follows, we first explain an important application of (1) to time series data, which will be tested in our numerical experiments. We then review recent advances in algorithms related to the alternating projection method. We finish this section by explaining our approach and main contributions.

1.1 Applications in time series

Problem (1) arises from a large number of applications including signal processing, system identification and finding the greatest common divisor between polynomials [23]. To motivate our investigation on (1), let us consider a complex-valued time series \(\mathbf{a}= (a_1, a_2, \ldots , a_n)\) of finite rank [17, Chp. 5]:

$$\begin{aligned} a_t = \sum _{s=1}^m P_s(t) \lambda _s^t, \quad t = 1,2,\ldots , n \end{aligned}$$
(2)

where each \(P_s(t)\) is a complex polynomial of degree \((\nu _s-1)\) (\(\nu _s\) are positive integers) and \(\lambda _s \in \mathbb {C}\setminus \{0\}\) are distinct. Define \(r := \nu _1+\ldots +\nu _m\) ("\(:=\)" means "is defined as"). Then it is known [29, Prop. 2.1] that the rank of the Hankel matrix A generated by \(\mathbf{a}\):

$$\begin{aligned} A = {{\mathcal {H}}}(\mathbf{a}) := \left[ \begin{array}{cccc} a_1 &{} a_2 &{} \cdots &{} a_\ell \\ a_2 &{} a_3 &{} \cdots &{} a_{\ell +1} \\ \vdots &{} \vdots &{} \vdots &{} \vdots \\ a_k &{} a_{k+1} &{} \cdots &{} a_n \end{array} \right] \end{aligned}$$

must be r, where the choice of \((k, \ell )\) satisfies \(n = k+\ell -1\) and \(r\le k \le n-r+1\).
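To make the Hankel lift \({{\mathcal {H}}}(\mathbf{a})\) concrete, the following minimal sketch (our own illustration in Python/NumPy; the function name hankel_matrix is ours) builds the \(k \times \ell \) matrix with entries \(a_{i+j-1}\):

```python
import numpy as np

def hankel_matrix(a, k):
    """Hankel lift H(a): the k x (n-k+1) matrix whose (i, j) entry is a[i+j] (0-based)."""
    a = np.asarray(a)
    ell = a.size - k + 1
    idx = np.arange(k)[:, None] + np.arange(ell)[None, :]
    return a[idx]

# a length-7 series gives a 4 x 4 Hankel matrix when k = 4
a = np.arange(1, 8)
print(hankel_matrix(a, 4))
```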

Suppose now that the time series \(\mathbf{a}\) is contaminated and/or has missing values. To reconstruct \(\mathbf{a}\), a natural approach is to compute its nearest time series \(\mathbf{x}\) by least squares:

$$\begin{aligned} \min \ \sum _{i=1}^n w_i | a_i - x_i|^2, \quad \text{ s.t. } \quad \text{ rank }(X) \le r, \ \ X = {{\mathcal {H}}}(\mathbf{x}), \end{aligned}$$
(3)

where \(\mathbf{w}=(w_1, \ldots , w_n) \ge 0 \) is the corresponding weight vector, emphasizing the importance of each element of \(\mathbf{a}\). The equivalent reformulation of (3) as (1) is obtained by setting

$$\begin{aligned} W := {{\mathcal {H}}}( \sqrt{\mathbf{v}} \circ \sqrt{\mathbf{w}}) \quad \text{ and } \quad v_i = \left\{ \begin{array}{ll} 1/i &{} \ \text{ for } \ i=1, \ldots , k-1 \\ 1/k &{} \ \text{ for } \ i=k, \ldots , n-k+1 \\ 1/(n-i+1) &{} \ \text{ for } \ i = n-k +2, \ldots , n, \end{array} \right. \end{aligned}$$

where \(\mathbf{v}\) is known as the averaging vector of Hankel matrix of size \(k \times \ell \) \((k \le \ell )\) and \(\sqrt{\mathbf{w}}\) is the elementwise square root of \(\mathbf{w}\). We note that the widely studied (1) with \(W_{ij} \equiv 1\) corresponds to \(w_i = 1/v_i\), which is known as the trapezoid weighting. Another popular choice for financial time series is the exponential weights \(w_i = \exp (\alpha i)\) for some \(\alpha >0\). We refer to [15, Sect. 2.4] for more comments on the choice of weights.
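As an illustration of this reformulation, the sketch below (our own Python/NumPy code; hankel_weight and the variable names are ours) builds the averaging vector \(\mathbf{v}\) and the Hankel weight matrix \(W = {{\mathcal {H}}}(\sqrt{\mathbf{v}} \circ \sqrt{\mathbf{w}})\) from a given weight vector \(\mathbf{w}\), and checks the trapezoid-weighting claim above:

```python
import numpy as np

def hankel_weight(w, k):
    """W = H(sqrt(v) o sqrt(w)) for a k x (n-k+1) Hankel shape (assuming k <= n-k+1)."""
    w = np.asarray(w, dtype=float)
    n = w.size
    t = np.arange(1, n + 1)                            # anti-diagonal index, 1-based
    v = 1.0 / np.minimum(np.minimum(t, n - t + 1), k)  # averaging vector of the Hankel shape
    s = np.sqrt(v * w)                                 # constant value on each anti-diagonal
    return s[np.arange(k)[:, None] + np.arange(n - k + 1)[None, :]]

# trapezoid weighting w_i = 1/v_i recovers W_ij = 1 for all (i, j)
n, k = 9, 4
t = np.arange(1, n + 1)
w = np.minimum(np.minimum(t, n - t + 1), k).astype(float)
print(np.allclose(hankel_weight(w, k), 1.0))           # True
```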

A special type of the time series (2) arises from spectral compressed sensing, which has attracted considerable attention lately [4]. In its one-dimensional case, \(a_t\) is often a superposition of a few complex sinusoids:

$$\begin{aligned} a_t = \sum _{s=1}^r d_s \exp \left\{ (2\pi j \omega _s - \tau _s) t \right\} , \end{aligned}$$
(4)

where \(j = \sqrt{-1}\), r is the model order, \(\omega _s\) is the frequency of each sinusoid, \(d_s \not =0\) is the weight of each sinusoid, and \(\tau _s \ge 0\) is a damping factor. We note that (4) is a special case of (2) with \(P_s(t) = d_s\) (hence \(\nu _s = 1\)) and \(\lambda _s = \exp (2\pi j \omega _s - \tau _s)\). If \(a_t\) is sampled at all integer values from 1 to n, we get a sample vector \(\mathbf{a}\in \mathbb {C}^n\). Consequently, the rank of \({{\mathcal {H}}}(\mathbf{a})\) must be r. However, in practice, only a subset \(\varOmega \) of the sampling points \(\{1, \ldots , n\}\) may be observed (possibly contaminated), leading to the question of how to best reconstruct \(a_t\) based on its partial observations \(a_i\) on \(\varOmega \). This has led to the Hankel matrix completion/approximation problem of the form (1), see [4, Sect. II.A] and [2, Sect. 2.1]. A popular choice of W in spectral compressed sensing is \(W_{ij} = 1\) for all \((i,j)\), resulting in the distance between X and A in (1) being measured by the standard Frobenius norm. In this paper, we assume

Assumption 1

W is Hankel and non-negative (i.e., \(W_{ij} \ge 0\) for all \((i,j)\)).
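Before moving on, here is a small numerical illustration (our own sketch; all parameter values below are arbitrary) of the rank property claimed for model (4): the Hankel matrix of a superposition of r damped complex sinusoids with distinct frequencies has rank r.

```python
import numpy as np

n, r, k = 64, 3, 32
omega = np.array([0.11, 0.23, 0.37])     # distinct frequencies
tau   = np.array([0.01, 0.02, 0.00])     # damping factors
d     = np.array([1.0, 2.0, 0.5])        # nonzero weights

t = np.arange(1, n + 1)
a = (d * np.exp((2j * np.pi * omega - tau) * t[:, None])).sum(axis=1)

# Hankel lift H(a) of size k x (n-k+1)
H = a[np.arange(k)[:, None] + np.arange(n - k + 1)[None, :]]
print(np.linalg.matrix_rank(H))          # expected: 3
```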

1.2 On alternating projection methods

Low-rank matrix optimization is an active research area. Our short review can only focus on a small group of the papers that motivated our research. We note that the problem (1) has four basic features: (i) X has to be low rank; (ii) X has Hankel structure; (iii) the objective is weighted; and (iv) X may be complex valued. The first feature is the most difficult one to handle because it causes the nonconvexity of the problem. Many algorithms have been developed proposing different ways to handle this low-rank constraint. One of the most popular choices is to use the truncated singular value decomposition to project X onto its nearest rank-r matrix; we denote this projection by \(\varPi _{{\mathbb {M}_r}}(X)\). This has given rise to the basic Method of Alternating Projections (MAP) (also known as the Cadzow method [1]): Given \(X^0 \in \mathbb {H}\), update

$$\begin{aligned} X^{\nu +1} = \varPi _{\mathbb {H}} \Big ( \varPi _{{\mathbb {M}_r}} (X^\nu ) \Big ) , \quad \nu = 0, 1, 2 ,\ldots \end{aligned}$$
(5)

where \(\varPi _{\mathbb {H}}(\cdot )\) is the orthogonal projection operator onto the Hankel subspace \(\mathbb {H}^{k \times \ell }\). Despite its popularity in the engineering sciences, Cadzow’s method cannot guarantee convergence to an optimal solution. Even when convergence occurs, not much is known about where it converges to. It has also been criticized for completely ignoring the objective function, see [5, 6, 8, 14]. In particular, the weight matrix W does not enter Cadzow’s method at all because the truncated SVD does not admit a closed-form solution under a weighted norm. Gillard and Zhigljavsky [16] proposed to replace \(\varPi _{{\mathbb {M}_r}}\) by its diagonally weighted variants and studied how to best approximate the weight matrix W by a diagonal weight matrix. Qi et al. [25] proposed to use a sequence of diagonal weight matrices aiming to get a better approximation to the original weight matrix. Despite the improved numerical convergence, the methods in [16, 25] still inherit the essential convergence issues of Cadzow’s method. Recently, Lai and Varghese [20] considered a similar method for a matrix completion problem and established the linear convergence of their method under a kind of “transversality” condition provided that the initial point is close enough to a true rank-r completion. We refer to [7] for a more general transversality condition that ensures a local linear convergence rate of MAP onto nonconvex sets.
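For concreteness, a minimal sketch of one MAP/Cadzow sweep (5) in Python/NumPy is given below (our own illustration; proj_rank and proj_hankel are our names). \(\varPi _{{\mathbb {M}_r}}\) is realized by the truncated SVD and \(\varPi _{\mathbb {H}}\) by averaging along anti-diagonals:

```python
import numpy as np

def proj_rank(X, r):
    """Nearest rank-<=r matrix via the truncated SVD (Pi_{M_r})."""
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vh[:r, :]

def proj_hankel(X):
    """Orthogonal projection onto the Hankel subspace: average each anti-diagonal."""
    k, l = X.shape
    sums = np.zeros(k + l - 1, dtype=X.dtype)
    counts = np.zeros(k + l - 1)
    for i in range(k):
        for j in range(l):
            sums[i + j] += X[i, j]
            counts[i + j] += 1
    means = sums / counts
    return means[np.arange(k)[:, None] + np.arange(l)[None, :]]

def map_step(X, r):
    """One Cadzow/MAP iteration (5)."""
    return proj_hankel(proj_rank(X, r))
```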

Alternating projections of \(\varPi _{{\mathbb {M}_r}}(\cdot )\) and \(\varPi _{\mathbb {H}}(\cdot )\) also play an important role in the class of iterative hard thresholding (IHT) algorithms for spectral compressed sensing. For example, Cai et al. [3] established the convergence of IHT in the statistical sense (i.e., with high probability) under a coherence assumption on the initial observation matrix A. Although local convergence results (be it in the sense of monotone decrease [20] or in the statistical sense [3]) may be established for MAP under some conditions, we are not aware of any existing global convergence results, mainly due to the nonconvexity of the rank constraint. For the general weighted problem (1), it appears to be a difficult task to develop a variant of MAP that enjoys both global and local linear convergence properties. We will achieve this through a penalty approach.

Penalty approaches have long been used to develop globally convergent algorithms for problems with rank constraints, see [10, 12, 13, 21, 22, 27, 32, 33]. For example, Gao [12] proposed the penalty function p(X) based on the following observation:

$$\begin{aligned} \text{ rank }(X) \le r \quad \Longleftrightarrow \quad p(X) := \Vert X \Vert _* - \sum _{i=1}^r \sigma _i(X) = 0, \end{aligned}$$

where \(\Vert X\Vert _*\) is the nuclear norm of X and \(\sigma _1(X) \ge \cdots \ge \sigma _n(X)\) are the singular values of X in nonincreasing order. However, the resulting method, as well as those in [21, 22, 27], no longer has anything to do with MAP and its implementation is not trivial.

1.3 Our approach and main contributions

In this paper, we propose a new penalty function and develop a penalized method whose main step is the alternating projections. We call it the penalized MAP (pMAP). The new penalty function is the Euclidean distance function \(d_{{\mathbb {M}_r}}(X)\) from X to \({\mathbb {M}_r}\):

$$\begin{aligned} d_{{\mathbb {M}_r}} (X) := \min \ \{ \Vert X - Z\Vert \ | \ Z \in {\mathbb {M}_r}\} \ \ \text{ and } \text{ define } \ \ g_r(X) := \frac{1}{2} d_{{\mathbb {M}_r}}^2 (X). \end{aligned}$$
(6)

Obviously, the original problem (1) is equivalent to

$$\begin{aligned} \min \ f(X), \quad \text{ s.t. } \ \ d_{{\mathbb {M}_r}} (X) = 0, \ \ X \in \mathbb {H}. \end{aligned}$$

We propose to solve the quadratic penalty problem with \(\rho >0\) being a penalty parameter:

$$\begin{aligned} \min \ F_\rho (X) := f(X) + \rho g_r(X), \ \ \text{ s.t. } \ \ X \in \mathbb {H}. \end{aligned}$$
(7)

By following the standard argument [24, Thm. 17.1] for the classical quadratic penalty method, one can establish that the global solution of (7) converges to that of (1) as \(\rho \) approaches infinity, provided convergence occurs. However, in practice, it is probably as difficult to find a global solution of (7) as of the original problem. It is hence important to establish the correspondence between the first-order stationary points of (7) and those of (1). This is done in Theorem 1 under a generalized linear independence condition.

The remaining task is to efficiently compute a stationary point of (7) for a given \(\rho >0\). The key observation is that \(g_r(X)\) can be represented as the difference of two convex functions, which can be easily majorized (the precise meaning is given later) to obtain a majorization function \(g^{(m)}_r(X, X^\nu )\) of \(g_r(X)\) at the current iterate \(X^\nu \). We then solve the majorized subproblem:

$$\begin{aligned} X^{\nu +1} = \arg \min \ f(X) + \rho g^{(m)}_r(X, X^\nu ), \quad \text{ s.t. } \quad X \in \mathbb {H}. \end{aligned}$$
(8)

We will show that the update takes the following form:

$$\begin{aligned} X^{\nu +1} = \frac{W^{(2)}}{ \rho + W^{(2)} } \circ A + \frac{ \rho }{ \rho + W^{(2)} } \circ \varPi _{\mathbb {H}} ( \varPi _{{\mathbb {M}_r}} (X^\nu ) ), \end{aligned}$$
(9)

where \(W^{(2)} := W \circ W\) and the division \(W^{(2)}/ ( \rho + W^{(2)})\) is taken componentwise. Compared with (5), this update is just a convex combination of the observation matrix A and the MAP iterate in (5). In the special case that \(W \equiv 0\) (which completely ignores the objective in (1)) or \(\rho = \infty \), (9) reduces to MAP. We will analyze the convergence behaviour of pMAP (9). In particular, we will establish the following among others.

  1. (i)

    The objective function sequence \(\{ F_\rho (X^\nu )\}\) will converge and \(\Vert X^{\nu +1} - X^\nu \Vert \) converges to 0. Moreover, any limit point of \(\{ X^\nu \}\) is an approximate KKT point of the original problem (1) provided that the penalty parameter is above a certain threshold (see Theorem 2).

  2. (ii)

    If \(\widehat{X}\) is an accumulation point of the iterate sequence \(\{ X^\nu \}\), then the whole sequence converges to \(\widehat{X}\) at a linear rate provided that \(\sigma _r(\widehat{X}) \gg \sigma _{r+1}(\widehat{X})\) (see Theorem 3).

Our results in (i) and (ii) provide satisfactory justification of pMAP. It is not only globally convergent, but also enjoys a linear convergence rate under reasonable conditions. Furthermore, we can assess the quality of the solution as an approximate KKT point of (1) if we are willing to increase the penalty parameter. Of course, balancing the fidelity term f(X) and the penalty term is an important issue that is beyond the scope of the current paper. The result in (ii) is practically important too. Existing empirical results show that MAP often terminates at a point whose cut-off singular values (\(\sigma _i, i \ge r+1\)) are significantly smaller than the kept singular values (\(\sigma _i, i\le r\)). Such points are often said to have numerical rank r, but their theoretical rank is higher than r. This is exactly the situation addressed in (ii). Those results are stated and proved for real-valued matrices. We will extend them to the complex case, thanks to a technical result (Proposition 2) showing that the subdifferential of \(g_r(X)\) in the complex domain can be computed in a similar fashion as in the real domain. To the best of our knowledge, this is the first variant of MAP that can handle general weights and enjoys both global convergence and a locally linear convergence rate under a reasonable condition (i.e., \(\sigma _r \gg \sigma _{r+1}\)).

The paper is organized as follows. In the next section, we will first set up our standard notation and establish the convergence result for the quadratic penalty approach (7) when \(\rho \rightarrow \infty \). Sect. 3 includes our method of pMAP and its convergence results when \(\rho \) is fixed. In Sect. 4, we will address the issue of extension to the complex-valued matrices, which arise from (2) and (4). The key concept used in this section is the Wirtinger calculus, which allows us to extend our analysis from the real case to the complex case. We report extensive numerical experiments in Sect. 5 and conclude the paper in Sect. 6.

2 Quadratic penalty approach

The main purpose of this section is to establish the convergence of the stationary points of the penalized problems (7) to that of the original problem (1) as the penalty parameter \(\rho \) goes to \(\infty \). For the simplicity of our analysis, we focus on the real case. We will extend our results to the complex case in Sect. 4. We first introduce the notation used in this paper.

2.1 Notation

For a nonnegative matrix such as the weight matrix W, \(\sqrt{W}\) is its componentwise square root matrix \((\sqrt{W_{ij}})\). For a given matrix \(X \in \mathbb {C}^{k \times \ell }\), we often use its singular value decomposition (assume \(k \le \ell \))

$$\begin{aligned} X = U \text{ diag }(\sigma _1(X), \ldots , \sigma _k(X) ) V^T, \end{aligned}$$
(10)

where \(\sigma _1(X) \ge \cdots \ge \sigma _k(X)\) are the singular values of X and the columns of \(U \in \mathbb {C}^{k \times k}\) and \(V \in \mathbb {C}^{ \ell \times k }\) are the left and right singular vectors of X. For a given closed subset \({{\mathcal {C}}}\subset \mathbb {C}^{ k \times \ell }\), we define the set of all projections from X to \({{\mathcal {C}}}\) by

$$\begin{aligned} {{\mathcal {P}}}_{{\mathcal {C}}}(X) := \arg \min \{ \Vert X -Z\Vert : \ Z \in {{\mathcal {C}}}\}. \end{aligned}$$

If \({{\mathcal {C}}}\) is also convex, then \({{\mathcal {P}}}_{{\mathcal {C}}}(X)\) is unique. When \({{\mathcal {C}}}= {\mathbb {M}_r}\), \({{\mathcal {P}}}_{{\mathbb {M}_r}}(X)\) may have multiple elements. We define a particular element in \({{\mathcal {P}}}_{{\mathbb {M}_r}}(X)\) that is based on the SVD (10):

$$\begin{aligned} \varPi _{{\mathbb {M}_r}} (X) = U_r \text{ diag }( \sigma _1(X), \ldots , \sigma _r(X)) V_r^T, \end{aligned}$$

where \(U_r\) and \(V_r\) consist of the first r columns of U and V respectively.

Related to the function \(g_r(X)\) defined in (6), the function

$$\begin{aligned} h_r(X) : = \frac{1}{2} \Vert X \Vert _F^2 - g_r(X) , \end{aligned}$$
(11)

has the following properties by the classical result of Eckart and Young [9]:

$$\begin{aligned} \text{ dist}^2(X, {\mathbb {M}_r})= & {} \Vert X - \varPi _{{\mathbb {M}_r}}(X) \Vert ^2 = \sigma ^2_{r+1}(X) + \cdots + \sigma _k^2(X), \\ h_r(X)= & {} \frac{1}{2} \Big ( \sigma _1^2(X) + \cdots + \sigma _r^2(X) \Big ) = \frac{1}{2} \Vert \varPi _{{\mathbb {M}_r}} (X) \Vert ^2. \end{aligned}$$

It follows from [12, Prop. 2.16] that \(h_r(X)\) is convex and the subdifferentials of \(h_r(X)\) and \(g_r(X)\) in the sense of [26, Def. 8.3] are respectively given by

$$\begin{aligned} \partial h_r(X) = \text{ conv } ({{\mathcal {P}}}_{{\mathbb {M}_r}}(X)) \quad \text{ and } \quad \partial g_r(X) = X - \partial h_r(X), \end{aligned}$$
(12)

where \(\text{ conv }(\varOmega )\) denotes the convex hull of the set \(\varOmega \). Finally, we let \({{\mathcal {B}}}_\epsilon (X)\) denote the \(\epsilon \)-neighbourhood centred at X.
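As a quick numerical sanity check of the Eckart–Young identities displayed after (11) (our own throwaway sketch with arbitrary sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
k, ell, r = 6, 9, 2
X = rng.standard_normal((k, ell))

U, s, Vh = np.linalg.svd(X, full_matrices=False)
Xr = (U[:, :r] * s[:r]) @ Vh[:r, :]                  # Pi_{M_r}(X)

dist2 = np.linalg.norm(X - Xr) ** 2                  # dist^2(X, M_r)
h_r = 0.5 * np.linalg.norm(Xr) ** 2                  # h_r(X)

print(np.isclose(dist2, np.sum(s[r:] ** 2)))         # True
print(np.isclose(h_r, 0.5 * np.sum(s[:r] ** 2)))     # True
```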

2.2 Convergence of quadratic penalty approach

The classical quadratic penalty methods try to solve a sequence of penalty problems:

$$\begin{aligned} X^\nu = \arg \min \ F_{\rho _\nu }(X), \ \ \text{ s.t. } \ \ X\in \mathbb {H}, \end{aligned}$$
(13)

where the sequence \(\rho _\nu >0\) is increasing and goes to \(\infty \). By following the standard argument (e.g., [24, Thm. 17.1]), one can establish that every limit point of \(\{X^\nu \}\) is a global solution of (1). However, in practice, it is probably as difficult to find a global solution of (13) as of the original problem (1). Therefore, only an approximate solution of (13) is possible. To quantify the approximation, we recall the optimality conditions relating to both the original and penalized problems.

Following the optimality theorem [26, Thm. 8.15], we define the first-order optimality conditions of problems (1) and (7):

Definition 1

(First-order optimality condition) \(\widehat{X} \in \mathbb {H}\) satisfies the first-order optimality condition of (1) if

$$\begin{aligned} 0 \in \nabla f(\widehat{X}) + \widehat{\lambda } \partial d_{{\mathbb {M}_r}}(\widehat{X}) + \mathbb {H}^\perp , \end{aligned}$$
(14)

where \(\widehat{\lambda }\) is the Lagrangian multiplier. Similarly, we say \({X^\nu } \in \mathbb {H}\) satisfies the first-order optimality condition of the penalty problem (7) if

$$\begin{aligned} 0 \in \nabla f({X^\nu }) + \rho _\nu \; \partial g_r({X^\nu }) + \mathbb {H}^\perp . \end{aligned}$$
(15)

We generate \(X^\nu \in \mathbb {H}\) such that the condition (15) is approximately satisfied:

$$\begin{aligned} \Vert {{\mathcal {P}}}_{\mathbb {H}} ( \nabla f(X^\nu ) + \rho _\nu (X^\nu - \varPi _{{\mathbb {M}_r}}(X^\nu ) ) ) \Vert \le \epsilon _\nu , \end{aligned}$$
(16)

where \(\epsilon _\nu \downarrow 0\). We can establish the following convergence result.

Theorem 1

We assume the sequence \(\{ \rho _\nu \}\) goes to \(\infty \) and \(\{\epsilon _\nu \}\) decreases to 0. Suppose each approximate solution \(X^\nu \) is generated to satisfy (16).

Let \(\widehat{X}\) be an accumulation point of \(\{ X^\nu \}\) and we assume

$$\begin{aligned} \partial d_{{\mathbb {M}_r}}(\widehat{X}) \cap \mathbb {H}^\perp = \{0\}. \end{aligned}$$
(17)

Then \(\widehat{X}\) satisfies the first-order optimality condition (14).

Proof

Suppose \(\widehat{X}\) is the limiting point of the subsequence \(\{ X^\nu \}_{{{\mathcal {K}}}}\). We consider the following two cases.

Case 1 There exists an infinite subsequence \({{\mathcal {K}}}_1\) of \({{\mathcal {K}}}\) such that \(\text{ rank }(X^\nu ) \le r\) for \(\nu \in {{\mathcal {K}}}_1\). This would imply \(\partial g_r(X^\nu ) = \{ 0\}\), which with (16) implies \(\Vert {{\mathcal {P}}}_{\mathbb {H}}( \nabla f(X^\nu )) \Vert \rightarrow 0\). Hence (14) holds at \(\widehat{X}\) with the choice \(\widehat{\lambda } = 0\).

Case 2 There exists an index \(\nu _0\) such that \(X^\nu \not \in {\mathbb {M}_r}\) for all \(\nu _0 \le \nu \in {{\mathcal {K}}}\). In this case, we may assume (passing to a further subsequence if necessary, since the quotients below have unit length) that there exists an infinite subsequence \({{\mathcal {K}}}_2\) of \({{\mathcal {K}}}\) such that \( \left\{ (X^\nu - \varPi _{\mathbb {M}_r}(X^\nu ) ) / d_{{\mathbb {M}_r}}(X^\nu ) \right\} \) has the limit \(\mathbf{v}\). We note that \((X^\nu - \varPi _{\mathbb {M}_r}(X^\nu ))/{d_{{\mathbb {M}_r}} (X^\nu ) } \in \partial d_{{\mathbb {M}_r}}(X^\nu )\) for \(\nu \ge \nu _0\) by [26, (8.53)]. Therefore, its limit satisfies \(\mathbf{v}\in \partial d_{{\mathbb {M}_r}} (\widehat{X})\) by the upper semicontinuity. By the assumption (17), we have \(\mathbf{v}\not \in \mathbb {H}^\perp \) because \(\mathbf{v}\) has unit length. Since \(\mathbb {H}\) is a subspace, \({{\mathcal {P}}}_{\mathbb {H}}(\cdot )\) is a linear operator. It follows from (16) that

$$\begin{aligned} \rho _\nu \Vert {{\mathcal {P}}}_{\mathbb {H}} (X^\nu - \varPi _{\mathbb {M}_r}(X^\nu )) \Vert- & {} \Vert {{\mathcal {P}}}_{\mathbb {H}} (\nabla f(X^\nu )) \Vert \\&\le \Vert {{\mathcal {P}}}_{\mathbb {H}} \big ( \nabla f(X^\nu ) + \rho _\nu (X^\nu - \varPi _{\mathbb {M}_r}(X^\nu ) ) \big ) \Vert \le \epsilon _\nu . \end{aligned}$$

Hence

$$\begin{aligned} \Vert {{\mathcal {P}}}_{\mathbb {H}} (X^\nu - \varPi _{\mathbb {M}_r}(X^\nu ) ) \Vert \le \frac{1}{\rho _\nu } \Big ( \epsilon _\nu + \Vert {{\mathcal {P}}}_{\mathbb {H}} (\nabla f(X^\nu )) \Vert \Big ) , \end{aligned}$$

which, for \(\nu \ge \nu _0\), is equivalent to

$$\begin{aligned} d_{{\mathbb {M}_r}} (X^\nu ) \Vert {{\mathcal {P}}}_{\mathbb {H}} (X^\nu - \varPi _{\mathbb {M}_r}(X^\nu ))/d_{{\mathbb {M}_r}} (X^\nu ) \Vert \le \frac{1}{\rho _\nu } \Big ( \epsilon _\nu + \Vert {{\mathcal {P}}}_{\mathbb {H}} (\nabla f(X^\nu )) \Vert \Big ) . \end{aligned}$$

Taking limits on \(\{ X^\nu \}_{\nu \in {{\mathcal {K}}}_2}\) and using the fact \(\rho _\nu \rightarrow \infty \) leads to \(d_{{\mathbb {M}_r}} (\widehat{X}) \Vert {{\mathcal {P}}}_{\mathbb {H}}(\mathbf{v}) \Vert = 0 \). Since \(\mathbf{v}\not \in \mathbb {H}^\perp \), we have \(\Vert {{\mathcal {P}}}_{\mathbb {H}}(\mathbf{v}) \Vert > 0\), which implies \(d_{{\mathbb {M}_r}} (\widehat{X}) = 0\). That is, \(\widehat{X}\) is a feasible point of (1). Now let \(\lambda _\nu := \rho _\nu d_{{\mathbb {M}_r}} (X^\nu )\); we then have

$$\begin{aligned} \lambda _\nu \frac{ X^\nu - \varPi _{\mathbb {M}_r}(X^\nu ) }{d_{{\mathbb {M}_r}} (X^\nu ) } = - \nabla f(X^\nu ) + \xi ^\nu , \quad \xi ^\nu := \nabla f(X^\nu ) + \rho _\nu ( X^\nu - \varPi _{\mathbb {M}_r}(X^\nu ) ). \end{aligned}$$

Projecting on both sides to \(\mathbb {H}\) yields

$$\begin{aligned} \lambda _\nu {{\mathcal {P}}}_{\mathbb {H}} \left( \frac{ X^\nu - \varPi _{\mathbb {M}_r}(X^\nu ) }{d_{{\mathbb {M}_r}} (X^\nu ) } \right) = {{\mathcal {P}}}_{\mathbb {H}} (- \nabla f(X^\nu )) + {{\mathcal {P}}}_{\mathbb {H}} (\xi ^\nu ). \end{aligned}$$
(18)

Computing the inner product on both sides with \( {{\mathcal {P}}}_{\mathbb {H}} ( (X^\nu - \varPi _{\mathbb {M}_r}(X^\nu ) )/d_{{\mathbb {M}_r}} (X^\nu ) ) \), taking limits on the sequence indexed by \({{\mathcal {K}}}_2\), and using the fact \( {{\mathcal {P}}}_{\mathbb {H}} (\xi ^\nu ) \rightarrow 0\) due to (16), we obtain

$$\begin{aligned} \lim _{\nu \in {{\mathcal {K}}}_2} \lambda _\nu \Vert \mathbf{v}\Vert ^2 = \langle \mathbf{v}, \; {{\mathcal {P}}}_{\mathbb {H}} (\nabla f(\widehat{X})) \rangle . \end{aligned}$$

We then have

$$\begin{aligned} \widehat{\lambda } = \lim _{\nu \in {{\mathcal {K}}}_2} \lambda _\nu = \frac{1}{\Vert \mathbf{v}\Vert ^2} \langle \mathbf{v}, \; {{\mathcal {P}}}_{\mathbb {H}} (\nabla f(\widehat{X})) \rangle . \end{aligned}$$

Taking limits on both sides of (18) yields

$$\begin{aligned} {{\mathcal {P}}}_{\mathbb {H}} ( \nabla f(\widehat{X}) + \widehat{\lambda } \mathbf{v}) = 0, \end{aligned}$$

which is sufficient for

$$\begin{aligned} 0 \in \nabla f(\widehat{X}) + \widehat{\lambda } \partial d_{{\mathbb {M}_r}} (\widehat{X}) + \mathbb {H}^\perp . \end{aligned}$$

This completes our result. \(\square \)

Remark 1

Condition (17) can be interpreted as saying that any nonzero \(\mathbf{v}\in \partial d_{{\mathbb {M}_r}}(\widehat{X})\) is linearly independent of any basis of \(\mathbb {H}^\perp \). Therefore, (17) can be seen as a generalization of the linear independence assumption required in the classical quadratic penalty method for a similar convergence result with all the functions involved being assumed continuously differentiable, see [24, Thm. 17.2]. In fact, what we really need in our proof is that there exists a subsequence \(\{ (X^\nu - \varPi _{\mathbb {M}_r}(X^\nu )) / d_{{\mathbb {M}_r}} (X^\nu ) \}\) in Case 2 such that its limit \(\mathbf{v}\) does not belong to \(\mathbb {H}^\perp \). That could be much weaker than the sufficient condition (17).

Theorem 1 establishes the global convergence of quadratic penalty method when the penalty parameter approaches infinity, which drives \(g_r(X^\nu )\) smaller and smaller. In practice, however, we often fix \(\rho \) and solve for \(X^\nu \). We are interested in how far \(X^\nu \) is from being a first-order optimal point of the original problem. For this purpose, we introduce the approximate KKT point, which keeps the first-order optimality condition (15) with an additional requirement that \(g_r(X)\) is small enough.

Definition 2

(\(\epsilon \)-approximate KKT point) Consider the penalty problem (7) with a given \(\epsilon >0 \). We say a point \(\widehat{X}\in \mathbb {H}\) is an \(\epsilon \)-approximate KKT point of (1) if

$$\begin{aligned} 0 \in \nabla f({\widehat{X}}) + \rho \; \partial g_r({\widehat{X}}) + \mathbb {H}^\perp \quad \text{ and } \quad g_r(\widehat{X}) \le \epsilon . \end{aligned}$$
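In code, checking whether a Hankel iterate is an \(\epsilon \)-approximate KKT point amounts to evaluating the projected residual of (16) together with \(g_r\). A sketch (our own, reusing the proj_rank and proj_hankel helpers from the sketch in Sect. 1.2) could look as follows:

```python
import numpy as np
# reuses proj_rank and proj_hankel from the MAP sketch in Sect. 1.2

def approx_kkt_measures(X, A, W, r, rho):
    """Residual of the projected inclusion in (16) and the value of g_r(X) (Definition 2)."""
    Xr = proj_rank(X, r)
    grad_f = (W * W) * (X - A)            # gradient of f(X) = 0.5 * ||W o (X - A)||^2
    residual = np.linalg.norm(proj_hankel(grad_f + rho * (X - Xr)))
    g_r = 0.5 * np.linalg.norm(X - Xr) ** 2
    return residual, g_r                  # X is an eps-approximate KKT point when the
                                          # residual is (numerically) zero and g_r <= eps
```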

3 The method of pMAP

This section develops a new algorithm that solves the penalty problem (7), in particular for finding an approximate KKT point of (1). The first part is devoted to the construction of a majorization function for the distance function \(\text{ dist }(X, {\mathbb {M}_r})\). We then describe pMAP based on the majorization introduced and establish its global and local convergence.

3.1 Majorization and DC interpretation

We first recall the essential properties that a majorization function should have. Let \(\theta (\cdot )\) be a real-valued function defined on a finite-dimensional space \({{\mathcal {X}}}\). For a given \(\mathbf{y}\in {{\mathcal {X}}}\), we say a function \(\theta ^{(m)}(\cdot , \mathbf{y}): {{\mathcal {X}}}\mapsto \mathbb {R}\) is a majorization of \(\theta (\cdot )\) at \(\mathbf{y}\) if

$$\begin{aligned} \theta ^{(m)}(\mathbf{x}, \mathbf{y}) \ge \theta (\mathbf{x}), \ \ \forall \ \mathbf{x}\in {{\mathcal {X}}}\quad \text{ and } \quad \theta ^{(m)}(\mathbf{y}, \mathbf{y}) = \theta (\mathbf{y}). \end{aligned}$$
(19)

The motivation for employing the majorization is that the squared distance function \(g_r(X)\) is hard to minimize when coupled with f(X) under the Hankel matrix structure. It is noted that

$$\begin{aligned} g_r(X) = \frac{1}{2} \Vert X - \varPi _{{\mathbb {M}_r}}(X) \Vert ^2 \le \frac{1}{2} \Vert X - \varPi _{{\mathbb {M}_r}} (Z) \Vert ^2 := g_r^{(m)} (X, Z), \ \ \forall \ X, Z \in \mathbb {C}^{k \times \ell }\end{aligned}$$

where the inequality used the fact that \(\varPi _{{\mathbb {M}_r}}(X)\) is a nearest point in \({\mathbb {M}_r}\) to X. It is easy to verify that \(g_r^{(m)}(X, Z)\) is a majorization function of \(g_r(X)\) at Z.
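A quick numerical check of this majorization property (our own sketch; sizes and the rank are arbitrary):

```python
import numpy as np

def proj_rank(X, r):
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vh[:r, :]

def g_r(X, r):
    return 0.5 * np.linalg.norm(X - proj_rank(X, r)) ** 2

def g_r_major(X, Z, r):
    return 0.5 * np.linalg.norm(X - proj_rank(Z, r)) ** 2

rng = np.random.default_rng(1)
X, Z, r = rng.standard_normal((5, 8)), rng.standard_normal((5, 8)), 2
print(g_r(X, r) <= g_r_major(X, Z, r) + 1e-12)    # majorization: True
print(np.isclose(g_r(Z, r), g_r_major(Z, Z, r)))  # tightness at Z: True
```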

The following way of deriving the majorization is crucial to our convergence analysis. We recall

$$\begin{aligned} h_r(X)= & {} \frac{1}{2} \Vert X \Vert ^2 - g_r(X) = \frac{1}{2} \Vert X \Vert ^2 - \frac{1}{2} \min \left\{ \Vert X-Z\Vert ^2: \ Z\in {\mathbb {M}_r}\right\} \\= & {} \max \left\{ \langle X, Z \rangle - \frac{1}{2} \Vert Z \Vert ^2: \ Z \in {\mathbb {M}_r}\right\} . \end{aligned}$$

Being the pointwise maximum of linear functions of X (indexed by \(Z\in {\mathbb {M}_r}\)), \(h_r(X)\) is convex. The convexity of \(h_r(X)\) and (12) yield

$$\begin{aligned} h_r(X) \ge h_r(Z) + \langle M, \; X-Z \rangle , \ \ \forall \ X, Z \in \mathbb {R}^{k \times \ell }, \ \ M \in {{\mathcal {P}}}_{\mathbb {M}_r}(Z) \end{aligned}$$
(20)

which further implies

$$\begin{aligned} g_r(X)= & {} \frac{1}{2} \Vert X\Vert ^2 - h_r(X) \\\le & {} \frac{1}{2} \Vert X\Vert ^2 - h_r(Z) - \langle \varPi _{{\mathbb {M}_r}}(Z), \; X-Z \rangle \\= & {} \frac{1}{2} \Vert X - \varPi _{{\mathbb {M}_r}} (Z) \Vert ^2 \\&- \frac{1}{2} \Big ( \underbrace{\Vert \varPi _{{\mathbb {M}_r}} (Z) \Vert ^2 - \Vert Z\Vert ^2 + \Vert Z - \varPi _{{\mathbb {M}_r}} (Z)\Vert ^2 - 2 \langle \varPi _{{\mathbb {M}_r}}(Z), \; Z \rangle }_{=0} \Big ) \\= & {} \frac{1}{2} \Vert X - \varPi _{{\mathbb {M}_r}} (Z) \Vert ^2 = g_r^{(m)} (X, Z). \end{aligned}$$

In other words, \(g_r(X)\) can be seen as a difference of convex functions, a so-called DC function. Using a subgradient is a common way to majorize DC functions, see [12].

3.2 The pMAP algorithm

We recall that our main problem is (7). Our first step is to construct a majorized function of \(F_\rho (X)\) at the current iterate \(X^\nu \):

$$\begin{aligned}&F_\rho ^{(m)}(X, X^\nu ) = \frac{1}{2} \Vert W \circ (X-A) \Vert ^2 + {\rho } g_r^{(m)} (X, X^\nu ) \\= & {} \frac{1}{2} \Vert W \circ (X-A) \Vert ^2 + \frac{\rho }{2} \Vert X - \varPi _{{\mathbb {M}_r}} (X^\nu ) \Vert ^2 \\= & {} \frac{1}{2} \Vert W \circ X \Vert ^2 + \frac{\rho }{2} \Vert X\Vert ^2 - \langle W^{(2)} \circ A + \rho \varPi _{{\mathbb {M}_r}} (X^\nu ), \ X \rangle + \frac{1}{2} \Vert W \circ A \Vert ^2 \\&+ \frac{\rho }{2} \Vert \varPi _{{\mathbb {M}_r}} (X^\nu ) \Vert ^2 \\= & {} \frac{1}{2} \Vert \sqrt{ \rho + W^{(2)}} \circ (X - X^\nu _\rho ) \Vert ^2 + \frac{1}{2} \Vert W \circ A \Vert ^2 + \frac{\rho }{2} \Vert \varPi _{{\mathbb {M}_r}} (X^\nu ) \Vert ^2 - \frac{1}{2} \Vert \sqrt{ \rho + W^{(2)}} \circ X^\nu _\rho \Vert ^2, \end{aligned}$$

where

$$\begin{aligned} X^\nu _\rho := \frac{ \rho \varPi _{{\mathbb {M}_r}} (X^\nu ) + W^{(2)}\circ A }{ \rho + W^{(2)}}. \end{aligned}$$
(21)

Note that the division is componentwise. The subproblem to be solved at iteration \(\nu \) is

$$\begin{aligned} X^{\nu +1}= & {} \arg \min \ F_\rho ^{(m)} (X, X^\nu ) \quad \text{ s.t. } \ \ X \in \mathbb {H}\nonumber \\= & {} \arg \min \ \frac{1}{2} \Vert \sqrt{ \rho + W^{(2)}} \circ (X - X^\nu _\rho ) \Vert ^2 \quad \text{ s.t. } \ \ X \in \mathbb {H}\nonumber \\= & {} \varPi _{\mathbb {H}} (X^\nu _\rho ), \end{aligned}$$
(22)

where \(X^\nu _\rho \) is defined in (21). The last equation in (22) is due to \(W_\rho := \sqrt{ \rho + W^{(2)}}\) being Hankel, and computing \(X^{\nu +1}\) in (22) is equivalent to averaging \(X^\nu _\rho \) along all its anti-diagonals. Since A, \(\rho /(\rho + W^{(2)})\) and \(W^{(2)}/(\rho + W^{(2)})\) are all Hankel matrices (due to Assumption 1), \(X^{\nu +1}\) can be calculated through (9).

Algorithm 1 (pMAP): given \(X^0 \in \mathbb {H}\) and \(\rho > 0\), repeat the update (22) (equivalently (9)) until a stopping criterion is met. [Algorithm box not reproduced here.]
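Since the algorithm box is not reproduced above, we give a minimal self-contained sketch of the pMAP iteration (9)/(22) in Python/NumPy (our own illustration; the starting point, the fixed \(\rho \) and the stopping test are simplified placeholders, not the exact settings used in the experiments of Sect. 5):

```python
import numpy as np

def proj_rank(X, r):
    """Nearest rank-<=r matrix via the truncated SVD (Pi_{M_r})."""
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vh[:r, :]

def proj_hankel(X):
    """Orthogonal projection onto the Hankel subspace: average each anti-diagonal."""
    k, l = X.shape
    sums = np.zeros(k + l - 1, dtype=X.dtype)
    counts = np.zeros(k + l - 1)
    for i in range(k):
        for j in range(l):
            sums[i + j] += X[i, j]
            counts[i + j] += 1
    means = sums / counts
    return means[np.arange(k)[:, None] + np.arange(l)[None, :]]

def pmap(A, W, r, rho, max_iter=200, tol=1e-5):
    """pMAP update (9): X <- W^(2)/(rho+W^(2)) o A + rho/(rho+W^(2)) o Pi_H(Pi_{M_r}(X))."""
    W2 = W * W
    X = A.copy()                      # start from the (Hankel) observation matrix
    for _ in range(max_iter):
        X_new = (W2 / (rho + W2)) * A + (rho / (rho + W2)) * proj_hankel(proj_rank(X, r))
        if np.linalg.norm(X_new - X) <= tol * max(1.0, np.linalg.norm(X)):
            return X_new
        X = X_new
    return X
```

In the experiments reported in Sect. 5.1.2 the penalty parameter is increased gradually (\(\rho ^{\nu +1} = 1.1\rho ^\nu \)); here it is kept fixed purely for simplicity of the sketch.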

Remark 2

Being a direct consequence of employing the majorization technique, the following decreasing property holds:

$$\begin{aligned} F_{\rho } (X^{\nu + 1})\le & {} F^{(m)}_{\rho } ( X^{\nu + 1}, \; X^\nu ) \quad (\text{ property } \text{ of } \text{ majorization } \text{(19) }) \\\le & {} F^{(m)}_{\rho } ( X^{\nu }, \; X^\nu ) \qquad \quad (\text{ because } \text{ of } \text{(22) }) \\= & {} F_{\rho } (X^\nu ) \qquad \qquad \quad \ \ \ (\text{ property } \text{ of } \text{ majorization } \text{(19) }) . \end{aligned}$$

If \(F_\rho \) is coercive (i.e., \(F_\rho (X) \rightarrow \infty \) if \(\Vert X \Vert \rightarrow \infty \), which would be the case if we require \(W>0\)), the sequence \(\{ X^\nu \}\) will be bounded.

A widely used stopping criterion is \(\Vert X^{\nu +1} - X^\nu \Vert \le \epsilon \) for some small tolerance \(\epsilon >0\). We will see below that \(\Vert X^{\nu +1} - X^\nu \Vert \) approaches zero and hence such a convergence check will eventually be satisfied. For our theoretical analysis, we assume that pMAP generates an infinite sequence (e.g., by letting \(\epsilon =0\)).

3.3 Convergence of pMAP

We have more results on the convergence of pMAP.

Theorem 2

Let \(\{ X^\nu \}\) be the sequence generated by pMAP. The following holds.

  1. (i)

    We have

    $$\begin{aligned} F_\rho (X^{\nu + 1}) - F_\rho (X^{\nu }) \le -\frac{\rho }{2} \Vert X^{\nu + 1} - X^\nu \Vert ^2, \ \ \nu =0,1,2, \ldots . \end{aligned}$$

    Furthermore, \(\Vert X^{\nu + 1} - X^\nu \Vert \rightarrow 0\).

  2. (ii)

    Let \(\widehat{X}\) be an accumulation point of \(\{ X^\nu \}\). We then have

    $$\begin{aligned} \nabla f(\widehat{X}) + \rho ( \widehat{X}- \varPi _{{\mathbb {M}_r}}(\widehat{X}) ) \in \mathbb {H}^\perp . \end{aligned}$$

    Moreover, for a given \(\epsilon >0\), if \(X^0 \in {\mathbb {M}_r}\cap \mathbb {H}\) and

    $$\begin{aligned} \rho \ge \rho _{\epsilon } := \frac{ f(X^0)}{\epsilon }, \end{aligned}$$

    then \(\widehat{X}\) is an \(\epsilon \)-approximate KKT point of (1).

Proof

We will use a number of facts to establish (i). The first fact is due to the convexity of f(X):

$$\begin{aligned} f(X^\nu ) - f(X^{\nu +1}) \ge \langle \nabla f(X^{\nu +1}), X^\nu - X^{\nu +1} \rangle \end{aligned}$$
(23)

The second fact is the identity

$$\begin{aligned} \Vert X^{\nu +1}\Vert ^2 - \Vert X^\nu \Vert ^2 = 2\langle X^{\nu +1} - X^{\nu }, X^{\nu +1} \rangle - \Vert X^{\nu +1} - X^\nu \Vert ^2 \end{aligned}$$
(24)

The third fact is due to the convexity of \(h_r(X)\) defined in (11) and \(\varPi _{{\mathbb {M}_r}}(X) \in \partial h_r(X)\):

$$\begin{aligned} h_r(X^{\nu +1}) - h_r(X^\nu )\ge \langle \varPi _{{\mathbb {M}_r}}(X^\nu ), X^{\nu +1} - X^{\nu } \rangle \end{aligned}$$
(25)

The last fact is the optimality condition of problem (22):

$$\begin{aligned} \nabla f(X^{\nu +1}) + \rho ( X^{\nu +1} - \varPi _{{\mathbb {M}_r}}(X^\nu ) ) \in \mathbb {H}^\perp . \end{aligned}$$
(26)

Combining all the facts above leads to a sufficient decrease in \(F_{\rho }(X^\nu )\):

$$\begin{aligned} \begin{aligned}&F_{\rho }(X^{\nu +1})- F_{\rho }(X^{\nu }) \\&= f(X^{\nu +1}) - f(X^\nu ) + \rho g_r(X^{\nu +1}) - \rho g_r(X^\nu )\\&\overset{(23)}{\le } \langle \nabla f(X^{\nu +1}), X^{\nu +1} - X^\nu \rangle + \rho g_r(X^{\nu +1}) - \rho g_r(X^\nu ) \\&= \langle \nabla f(X^{\nu +1}), X^{\nu +1} - X^\nu \rangle + \frac{\rho }{2}(\Vert X^{\nu +1} \Vert ^2 - \Vert X^\nu \Vert ^2) - \rho (h_r(X^{\nu +1}) - h_r(X^\nu )) \\&\overset{(24)}{=} \langle \nabla f(X^{\nu +1}) + \rho X^{\nu +1}, X^{\nu +1} - X^\nu \rangle - \frac{\rho }{2}(\Vert X^{\nu +1} - X^\nu \Vert ^2) \\&\qquad - \rho (h_r(X^{\nu +1}) - h_r(X^\nu ))\\&\overset{(25)}{\le } \langle \nabla f(X^{\nu +1}) + \rho X^{\nu +1} - \rho \varPi _{{\mathbb {M}_r}}(X^\nu ), X^{\nu +1} - X^\nu \rangle - \frac{\rho }{2} \Vert X^{\nu +1} - X^\nu \Vert ^2 \\&\overset{(26)}{\le } - \frac{\rho }{2} \Vert X^{\nu +1} - X^\nu \Vert ^2 \end{aligned} \end{aligned}$$
(27)

In the above we also used the fact that \(X^{\nu +1} - X^\nu \in \mathbb {H}\). Since the sequence \(\{ F_\rho (X^\nu ) \}\) is non-increasing and is bounded from below by 0, we have \( \Vert X^{\nu +1} - X^\nu \Vert ^2 \rightarrow 0\).

(ii) Suppose \(\widehat{X}\) is the limit of the subsequence \(\{ X^\nu \}_{\nu \in {{\mathcal {K}}}}\). It follows from \(\Vert X^{\nu +1} - X^\nu \Vert \rightarrow 0\) that \(\widehat{X}\) is also the limit of \(\{ X^{\nu +1}\}_{\nu \in {{\mathcal {K}}}}\). Taking limits on both sides of (26) and using the upper semi-continuity of the projections \({{\mathcal {P}}}_{{\mathbb {M}_r}}(\cdot )\) yields

$$\begin{aligned} \nabla f(\widehat{X}) + \rho ( \widehat{X}- \varPi _{{\mathbb {M}_r}}(\widehat{X}) ) \in \mathbb {H}^\perp . \end{aligned}$$

Next, we use the fact that \(\{ F_{\rho }(X^\nu ) \}\) is non-increasing to get

$$\begin{aligned} f(X^0) = f(X^0) + \rho g_r(X^0) = F_\rho (X^0)\ge & {} \lim F_\rho (X^\nu ) \\&= F_\rho (\widehat{X}) = f(\widehat{X}) + \rho g_r(\widehat{X}) \ge \rho g_r(\widehat{X}) . \end{aligned}$$

The first equality holds because \(g_r(X^0) = 0\) when \(X^0 \in {\mathbb {M}_r}\). As a result,

$$\begin{aligned} g_r(\widehat{X}) \le \frac{f(X^0)}{\rho } \le \frac{f(X^0)}{\rho _\epsilon } = \epsilon . \end{aligned}$$
(28)

Therefore, \(\widehat{X}\) is an \(\epsilon \)-approximate KKT point of (1). \(\square \)

We note that the first result (i) in Theorem 2 is standard in a majorization-minimization scheme and can be proved in different ways, see, e.g., [32, Thm. 3.7].

3.4 Final rank and linear convergence

This part reports two results. One concerns the final rank of the output of pMAP: the rank is always bigger than the desired rank r unless A is already an optimal solution of (1). The other concerns the conditions that ensure a linear convergence rate of pMAP. For this purpose, we need the following result.

Proposition 1

[11, Thm. 25] Given an integer \(r>0\), consider \(\widehat{X}\in \mathbb {R}^{k \times \ell }\) of rank \((r+p)\) with \(p \ge 0\). Suppose the SVD of \(\widehat{X}\) is represented as \( \widehat{X}= \sum _{i=1}^{r+p} \sigma _i \mathbf{u}_i \mathbf{v}_i^T, \) where \(\sigma _1(\widehat{X}) \ge \sigma _2(\widehat{X}) \ge \cdots \ge \sigma _{r+p}(\widehat{X})\) are the singular values of \(\widehat{X}\) and \(\mathbf{u}_i, \mathbf{v}_i\), \(i=1, \ldots , r+p\), are the corresponding (normalized) left and right singular vectors. We assume \(\sigma _r(\widehat{X}) > \sigma _{r+1}(\widehat{X})\) so that the projection operator \(\varPi _{{\mathbb {M}_r}}(X)\) is uniquely defined in a neighbourhood of \(\widehat{X}\). Then \(\varPi _{{\mathbb {M}_r}}(X)\) is differentiable at \(\widehat{X}\) and the directional derivative along the direction Y is given by

$$\begin{aligned} \nabla&\varPi _{{\mathbb {M}_r}}(\widehat{X})(Y) = \varPi _{{{\mathcal {T}}}_{{\mathbb {M}_r}}(\widehat{X})}(Y) \\&+\quad \sum _{{\mathop {1 \le j \le p}\limits ^{1 \le i \le r}} } \left[ \frac{\sigma _{r+j} }{ \sigma _i - \sigma _{r+j} } \langle Y, \; \varPhi _{i, r+j}^+ \rangle \varPhi _{i, r+j}^+ - \frac{\sigma _{r+j} }{ \sigma _i + \sigma _{r+j} } \langle Y, \; \varPhi _{i, r+j}^- \rangle \varPhi _{i, r+j}^- \right] \end{aligned}$$

where \({{\mathcal {T}}}_{{\mathbb {M}_r}}(\widehat{X})\) is the tangent subspace of \({\mathbb {M}_r}\) at \(\widehat{X}\) and

$$\begin{aligned} \varPhi _{i, r+j}^\pm = \frac{1}{\sqrt{2}} \left( \mathbf{u}_{r+j} \mathbf{v}_i^T \pm \mathbf{u}_i \mathbf{v}_{r+j}^T \right) . \end{aligned}$$

Theorem 3

Assume that \(W>0\) and let \(\widehat{X}\) be an accumulation point of \(\{ X^\nu \}\). The following hold.

  1. (i)

    \(\text{ rank }(\widehat{X}) > r\) unless A is already the optimal solution of (1).

  2. (ii)

    Suppose \(\widehat{X}\) has rank \((r+p)\) with \(p > 0\). Let \(\sigma _1 \ge \sigma _2 \ge \cdots \ge \sigma _k\) be the singular values of \(\widehat{X}\). Define

    $$\begin{aligned} w_0 := \min \{W_{ij} \} >0, \quad \epsilon _0 := \frac{w_0^2}{\rho }, \quad \epsilon _1 := \frac{\epsilon _0}{4+3 \epsilon _0}, \quad c := \frac{1}{ 1 + \epsilon _1} < 1. \end{aligned}$$

    Under the condition

    $$\begin{aligned} \frac{\sigma _r}{\sigma _{r+1} } \ge \frac{8pr}{\epsilon _0} + 1, \end{aligned}$$

    it holds

    $$\begin{aligned} \Vert X^{\nu +1} - \widehat{X}\Vert \le c \Vert X^\nu - \widehat{X}\Vert \quad \ \ \text{ for } \nu \text{ sufficiently } \text{ large }. \end{aligned}$$

    Consequently, the whole sequence \(\{ X^\nu \}\) converges linearly to \(\widehat{X}\).

Proof

  1. (i)

    Suppose \(\widehat{X}\) is the limit of the subsequence \(\{ X^\nu \}_{\nu \in {{\mathcal {K}}}}\). Assume, for contradiction, that \(\text{ rank }(\widehat{X}) \le r\). It follows from Theorem 2 that

    $$\begin{aligned} \{ X^{\nu + 1} \}_{\nu \in {{\mathcal {K}}}} \rightarrow \widehat{X}\quad \text{ and } \quad \lim _{\nu \in {{\mathcal {K}}}} \varPi _{{\mathbb {M}_r}} (X^\nu ) = \varPi _{{\mathbb {M}_r}}(\widehat{X}) = \widehat{X}. \end{aligned}$$

    Taking limits on both sides of (9) and using the fact that \(\widehat{X}\) is Hankel, we get

    $$\begin{aligned} \widehat{X}= \frac{W^{(2)}}{\rho + W^{(2)}} \circ A + \frac{\rho }{\rho + W^{(2)}} \circ \varPi _{\mathbb {H}} ( \varPi _{{\mathbb {M}_r}} (\widehat{X})) = \frac{W^{(2)}}{\rho + W^{(2)}} \circ A + \frac{\rho }{\rho + W^{(2)}} \circ \widehat{X}. \end{aligned}$$

    Under the assumption \(W>0\), we have \(\widehat{X}= A\). Consequently, \(\text{ rank }(A) \le r\), implying that A is the optimal solution of (1). Therefore, we must have \(\text{ rank }(\widehat{X}) > r\) if the given matrix A is not optimal already.

  2. (ii)

    Let \(\phi (X) := \varPi _{\mathbb {H}} ( \varPi _{{\mathbb {M}_r}}(X) )\). Since \(\varPi _{{\mathbb {M}_r}}(X)\) is differentiable at \(\widehat{X}\), so is \(\phi (X)\). Moreover, the directional derivative of \(\phi (X)\) at \(\widehat{X}\) along the direction Y is given by

    $$\begin{aligned} \nabla \phi (\widehat{X}) Y = \varPi _{\mathbb {H}} ( \nabla \varPi _{{\mathbb {M}_r}}(\widehat{X}) Y) \quad \text{ and } \quad \Vert \nabla \phi (\widehat{X}) Y \Vert \le \Vert \nabla \varPi _{{\mathbb {M}_r}}(\widehat{X}) Y \Vert . \end{aligned}$$
    (29)

The inequality above holds because \(\varPi _{\mathbb {H}}(\cdot )\) is an orthogonal projection to a subspace and its operator norm is 1. The matrices in Proposition 1 have the following bounds.

$$\begin{aligned} \Vert \varPhi _{i, r+j}^\pm \Vert \le \frac{1}{\sqrt{2}} \Big ( \Vert \mathbf{u}_{r+j} \mathbf{v}_i^T\Vert + \Vert \mathbf{u}_i \mathbf{v}_{r+j}^T \Vert \Big ) \le \frac{1}{\sqrt{2}} (1+1) = \sqrt{2}, \\ \Vert \langle Y, \; \varPhi _{i, r+j}^\pm \rangle \varPhi _{i, r+j}^\pm \Vert \le \Vert \varPhi _{i, r+j}^\pm \Vert ^2 \Vert Y\Vert \le 2 \Vert Y\Vert . \end{aligned}$$

Therefore,

$$\begin{aligned}&\left\| \sum _{{\mathop {1 \le j \le p}\limits ^{1 \le i \le r}} } \left[ \frac{\sigma _{r+j} }{ \sigma _i - \sigma _{r+j} } \langle Y, \; \varPhi _{i, r+j}^+ \rangle \varPhi _{i, r+j}^+ - \frac{\sigma _{r+j} }{ \sigma _i + \sigma _{r+j} } \langle Y, \; \varPhi _{i, r+j}^- \rangle \varPhi _{i, r+j}^- \right] \right\| \nonumber \\\le & {} 4 \sum _{{\mathop {1 \le j \le p}\limits ^{1 \le i \le r}} } \frac{\sigma _{r+j} }{ \sigma _i - \sigma _{r+j} } \Vert Y\Vert \le 4pr \frac{\sigma _{r+1} }{ \sigma _r - \sigma _{r+1} }\Vert Y\Vert \le \frac{w_0^2}{2\rho }\Vert Y\Vert = \frac{1}{2} \epsilon _0 \Vert Y\Vert . \end{aligned}$$
(30)

In the above, we used the fact that \(\psi (t) := t/( \sigma _r - t)\) is an increasing function of t for \(t < \sigma _r\). Proposition 1, (29) and (30) imply

$$\begin{aligned} \Vert \nabla \phi (\widehat{X}) Y \Vert \le \Vert \varPi _{{{\mathcal {T}}}_{{\mathbb {M}_r}}(\widehat{X})}(Y) \Vert + \epsilon _0/2 \Vert Y\Vert \le \Vert Y\Vert + \epsilon _0/2 \Vert Y\Vert = (1+ \epsilon _0/2) \Vert Y\Vert . \end{aligned}$$

The second inequality above used the fact that the operator norm of \(\varPi _{{{\mathcal {T}}}_{{\mathbb {M}_r}}(\widehat{X})}\) is not greater than 1 due to \({{\mathcal {T}}}_{{\mathbb {M}_r}}(\widehat{X})\) being a subspace. Since \(\phi (\cdot )\) is differentiable at \(\widehat{X}\), there exists \(\epsilon >0\) such that

$$\begin{aligned} \Vert \phi (X) - \phi (\widehat{X}) - \nabla \phi (\widehat{X}) (X - \widehat{X}) \Vert \le \frac{1}{4} \epsilon _0 \Vert X - \widehat{X}\Vert , \quad \forall \ \ X \in {{\mathcal {B}}}_\epsilon (\widehat{X}). \end{aligned}$$

Therefore,

$$\begin{aligned} \Vert \phi (X) - \phi (\widehat{X}) \Vert\le & {} \Vert \phi (X) - \phi (\widehat{X}) - \nabla \phi (\widehat{X}) (X - \widehat{X}) \Vert + \Vert \nabla \phi (\widehat{X}) (X - \widehat{X}) \Vert \\\le & {} \frac{1}{4} \epsilon _0 \Vert X - \widehat{X}\Vert + (1+\epsilon _0/2) \Vert X - \widehat{X}\Vert = (1+3\epsilon _0/4) \Vert X - \widehat{X}\Vert . \end{aligned}$$

Now we are ready to quantify the error between \(X^\nu \) and \(\widehat{X}\) whenever \(X^\nu \in {{\mathcal {B}}}_\epsilon (\widehat{X})\).

$$\begin{aligned} \Vert X^{\nu +1} - \widehat{X}\Vert= & {} \left\| \frac{\rho }{\rho + W^{(2)}} \circ ( \phi (X^\nu ) - \phi (\widehat{X})) \right\| \le \frac{\rho }{\rho + w_0^2} \Vert \phi (X^\nu ) - \phi (\widehat{X}) \Vert \\\le & {} \frac{1+3 \epsilon _0/4}{1+\epsilon _0} \Vert X^\nu - \widehat{X}\Vert = c \Vert X^\nu - \widehat{X}\Vert . \end{aligned}$$

Consequently, \(X^{\nu +1} \in {{\mathcal {B}}}_\epsilon (\widehat{X})\). Since \(\{X^\nu \}_{\nu \in {{\mathcal {K}}}}\) converges to \(\widehat{X}\), \(X^\nu \) will eventually fall in \({{\mathcal {B}}}_\epsilon (\widehat{X})\), which implies that the whole sequence \(\{X^\nu \}\) converges to \(\widehat{X}\) and eventually does so at a linear rate. \(\square \)

Remark 3

(Implication on MAP) When the weight matrix \(W =0\), pMAP reduces to MAP according to (9). Theorem 2(i) implies

$$\begin{aligned} \Vert X^{\nu +1} - \varPi _{{\mathbb {M}_r}}(X^{\nu +1}) \Vert ^2 - \Vert X^{\nu } - \varPi _{{\mathbb {M}_r}}(X^{\nu }) \Vert ^2 \le - \Vert X^{\nu +1} - X^\nu \Vert ^2, \end{aligned}$$
(31)

which obviously implies

$$\begin{aligned} \Vert X^{\nu +1} - \varPi _{{\mathbb {M}_r}}(X^{\nu +1}) \Vert \le \Vert X^{\nu } - \varPi _{{\mathbb {M}_r}}(X^{\nu }) \Vert . \end{aligned}$$
(32)

The decrease property (32) was known in [5, Eq.(4.1)] and was used there to ascertain that MAP is a descent algorithm. Our improved bound (31) says slightly more: the decrease at each step in the function \(\Vert X - \varPi _{\mathbb {M}_r}(X) \Vert \) is strict unless the iterate becomes unchanged. In this case (\(W=0\)), the penalty parameter is just a scaling factor in the objective, hence the KKT result in Theorem 2(ii) does not apply to MAP. This probably explains why it is difficult to establish similar results for MAP.

Remark 4

(On linear convergence) In the general context of matrix completion, Lai and Varghese [20] established local linear convergence of MAP under the following two assumptions, which we describe in terms of Hankel matrix completion: (i) the partially observed data \(\mathbf{a}\) can be completed to a rank-r Hankel matrix M; (ii) a transversality condition (see [20, Thm. 2]) holds at M. We emphasize that the result of [20] is a local one: it requires the initial point of MAP to be close enough to M, and the rank-r assumption on M is also crucial to their analysis, which motivated our proof. In contrast, our result is a global one and enjoys a linear convergence rate near the limit under the more realistic assumption \(\sigma _r \gg \sigma _{r+1}\). One may have noticed that the convergence rate c, though strictly less than 1, may be close to 1. This is consistent with the frequent numerical observation that MAP converges slowly. But the more important point here is that in such a situation the whole sequence is guaranteed to converge. This global convergence justifies the widely used stopping criterion \(\Vert X^{\nu +1} - X^\nu \Vert \le \epsilon \).

4 Extension to complex-valued matrices

The results obtained in the previous sections are for real-valued matrices and they can be extended to complex-valued matrices by employing what is known as the Wirtinger calculus [30]. We note that not all algorithms for Hankel matrix optimization have a straightforward extension from the real case to the complex case, see [6] for comments on some algorithms. We explain our extension below.

Suppose \(f: \mathbb {C}^{n}\mapsto \mathbb {R}\) is a real-valued function in the complex domain. We write \(\mathbf{z}\in \mathbb {C}^{n}\) as \(\mathbf{z}= \mathbf{x}+ j \mathbf{y}\) with \(\mathbf{x}, \mathbf{y}\in \mathbb {R}^n\), and the conjugate is \(\bar{\mathbf{z}} := \mathbf{x}- j \mathbf{y}\). Then we can write the function \(f(\mathbf{z})\) in terms of its real variables \(\mathbf{x}\) and \(\mathbf{y}\). With a slight abuse of notation, we still denote it as \(f(\mathbf{x}, \mathbf{y})\). In the case where the optimization of \(f(\mathbf{z})\) can be equivalently represented as optimization of f in terms of its real variables, the partial derivatives \( \partial f(\mathbf{x}, \mathbf{y})/\partial \mathbf{x}\) and \( \partial f(\mathbf{x}, \mathbf{y})/\partial \mathbf{y}\) would be sufficient. In other cases, where algorithms are preferably executed in the complex domain, the Wirtinger calculus [30] is more convenient; it is well explained (and derived) in [19]. The \(\mathbb {R}\)-derivative and the conjugate \(\mathbb {R}\)-derivative of f in the complex domain are defined respectively by

$$\begin{aligned} \frac{\partial f}{\partial \mathbf{z}} = \frac{1}{2} \left( \frac{\partial f}{\partial \mathbf{x}} - \mathbf{j}\frac{\partial f}{\partial \mathbf{y}} \right) , \quad \frac{\partial f}{\partial \bar{\mathbf{z}}} = \frac{1}{2} \left( \frac{\partial f}{\partial \mathbf{x}} + \mathbf{j}\frac{\partial f}{\partial \mathbf{y}} \right) . \end{aligned}$$

The \(\mathbb {R}\)-derivatives in the complex domain play the same role as the derivatives in the real domain because the following two first-order expansions are equivalent:

$$\begin{aligned} f(\mathbf{x}+ \varDelta \mathbf{x}, \mathbf{y}+ \varDelta \mathbf{y})= & {} f(\mathbf{x}, \mathbf{y}) + \langle \partial f / \partial \mathbf{x}, \; \varDelta \mathbf{x}\rangle \nonumber \\&\quad + \langle \partial f / \partial \mathbf{y}, \; \varDelta \mathbf{y}\rangle + o( \Vert \varDelta \mathbf{x}\Vert + \Vert \varDelta \mathbf{y}\Vert ) \nonumber \\ f(\mathbf{z}+ \varDelta \mathbf{z})= & {} f(\mathbf{z}) + 2 \mathrm {\mathbf{Re}}(\langle \partial f / \partial \bar{\mathbf{z}}, \; \varDelta \mathbf{z}\rangle ) + o( \Vert \varDelta \mathbf{z}\Vert ). \end{aligned}$$
(33)

Here, we treat the partial derivatives as column vectors and \(\mathrm {\mathbf{Re}}(\mathbf{x})\) is the real part of \(\mathbf{x}\). Note that the first-order expansion of \(f(\mathbf{z}+ \varDelta \mathbf{z})\) uses the conjugate \(\mathbb {R}\)-derivative. Hence, we define the complex gradient to be \( \nabla f(\mathbf{z}) := 2 \partial f / \partial \bar{\mathbf{z}}, \) when it exists. When f is not differentiable, we can extend the subdifferential of f from the real case to the complex case by generalizing (33).
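As a quick sanity check of these definitions (our own worked example, not taken from [19, 30]), consider the scalar case \(f(z) = |z|^2\) with \(z = x + jy\):

$$\begin{aligned} f = x^2 + y^2, \qquad \frac{\partial f}{\partial \bar{z}} = \frac{1}{2} \left( \frac{\partial f}{\partial x} + j \frac{\partial f}{\partial y} \right) = \frac{1}{2} \left( 2x + j\, 2y \right) = z, \end{aligned}$$

so \(\nabla f(z) = 2 \partial f/\partial \bar{z} = 2z\), and the second expansion in (33) reads \(f(z+\varDelta z) = f(z) + 2 \mathrm {\mathbf{Re}}(\bar{z}\, \varDelta z) + o(|\varDelta z|)\), in agreement with the real expansion through \((\partial f/\partial x, \partial f/\partial y) = (2x, 2y)\).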

In order to extend Theorem 1, we need to characterize \(\partial d_{\mathbb {M}_r}(X)\) in the complex domain. We may follow the route of [26] to conduct the extension. For example, we may define the regular subgradient of \(d_{\mathbb {M}_r}(X)\) [26, Def. 8.3] to its complex counterpart by replacing the conjugate-gradient in the first-order expansion in (33) by a regular subgradient. We then define subdifferential through regular subgradients. With this definition in the complex domain, we may extend [26, (8.53)] to derive formulae for \(\partial d_{\mathbb {M}_r}(X)\). What we needed in the proof of Theorem 1 is \((X- \varPi _{{\mathbb {M}_r}}(X) )/d_{{\mathbb {M}_r}}(X) \in \partial d_{{\mathbb {M}_r}}(X)\) when \(X \not \in {\mathbb {M}_r}\). The proof of this result follows a straightforward extension of the corresponding part in [26, (8.53)] and if reproduced here would take up much space. Hence we omit it.

In order to extend the results in Sect. 3, we need the subdifferential of \(h_r(X)\) in order to majorize \(g_r(X)\). Since \(h_r(X)\) is convex, its subdifferential is easy to define. We note that the inner product for the complex matrix space \(\mathbb {C}^{k \times \ell }\) is defined as \(\langle A, \; B \rangle = \text{ Trace }(A^H B)\), where A is the Hermitian conjugate, i.e., \(A^H := \overline{A}^T\).

Definition 3

The subdifferential \(\partial h_r(X)\) is defined as

$$\begin{aligned} \partial h_r(X) = \left\{ S \in \mathbb {C}^{k \times \ell }\ | \ h_r(Z) \ge h_r(X) + \mathrm {\mathbf{Re}}\Big ( \langle {S}, \ Z-X \rangle \Big ) \ \ \forall \ Z \in \mathbb {C}^{k \times \ell }\right\} . \end{aligned}$$

The following result is really what we needed in order to extend the results in Sect. 3 to the complex domain.

Proposition 2

For any \(X \in \mathbb {C}^{k \times \ell }\), we have \( {{\mathcal {P}}}_{{\mathbb {M}_r}} (X) \subset \partial h_r(X). \)

Proof

Let \(\varPi _{{\mathbb {M}_r}}(X)\) stand for any element in \({{\mathcal {P}}}_{{\mathbb {M}_r}}(X)\). It is easy to verify the following identities:

$$\begin{aligned} \Vert X-Z\Vert ^2 = \Vert X\Vert ^2 + \Vert Z \Vert ^2 - 2 \mathrm {\mathbf{Re}}\Big ( \langle {X}, \; Z \rangle \Big ) = \Vert X\Vert ^2 + \Vert Z \Vert ^2 - 2 \mathrm {\mathbf{Re}}\Big ( \langle {Z}, \; X \rangle \Big ). \end{aligned}$$
(34)

We use (11) and (34) to compute

$$\begin{aligned}&h_r(Z) - h_r(X) - \mathrm {\mathbf{Re}}\Big ( \langle { \varPi _{{\mathbb {M}_r}} (X) }, \; Z-X \rangle \Big ) \\= & {} \frac{1}{2} \Vert \varPi _{{\mathbb {M}_r}} (Z) \Vert ^2 - \frac{1}{2} \Vert \varPi _{{\mathbb {M}_r}} (X) \Vert ^2 + \underbrace{\frac{1}{2} \Vert \varPi _{{\mathbb {M}_r}} (X) - Z \Vert ^2 - \frac{1}{2} \Vert \varPi _{{\mathbb {M}_r}} (X) \Vert ^2 - \frac{1}{2} \Vert Z\Vert ^2 }_{ = - \mathrm {\mathbf{Re}}( \langle {\varPi _{{\mathbb {M}_r}} (X)}, \; Z \rangle ) } \\&+ \underbrace{\frac{1}{2} \Vert \varPi _{{\mathbb {M}_r}} (X) \Vert ^2 + \frac{1}{2} \Vert X\Vert ^2 - \frac{1}{2} \Vert \varPi _{{\mathbb {M}_r}} (X) - X \Vert ^2}_{ = \mathrm {\mathbf{Re}}( \langle {\varPi _{{\mathbb {M}_r}} (X)}, \; X \rangle ) } \\= & {} \frac{1}{2} \Vert \varPi _{{\mathbb {M}_r}} (Z) \Vert ^2 - \frac{1}{2} \Vert Z\Vert ^2 + \frac{1}{2} \Vert \varPi _{{\mathbb {M}_r}} (X) - Z \Vert ^2 \\&- \underbrace{ \left( \frac{1}{2} \Vert \varPi _{{\mathbb {M}_r}} (X) \Vert ^2 - \frac{1}{2} \Vert X\Vert ^2 + \frac{1}{2} \Vert \varPi _{{\mathbb {M}_r}} (X) - X \Vert ^2 \right) }_{=0} \\= & {} \frac{1}{2} \Vert \varPi _{{\mathbb {M}_r}} (Z) \Vert ^2 - \frac{1}{2} \Vert Z\Vert ^2 + \frac{1}{2} \Vert \varPi _{{\mathbb {M}_r}} (X) - Z \Vert ^2 \\\ge & {} \frac{1}{2} \Vert \varPi _{{\mathbb {M}_r}} (Z) \Vert ^2 - \frac{1}{2} \Vert Z\Vert ^2 + \frac{1}{2} \Vert \varPi _{{\mathbb {M}_r}} (Z) - Z \Vert ^2 = 0. \end{aligned}$$

This proves the claim. \(\square \)

A direct consequence is that

$$\begin{aligned} \partial g_r(X) = X - \partial h_r(X) \supset X - {{\mathcal {P}}}_{{\mathbb {M}_r}}(X), \end{aligned}$$

and hence the majorization of \(g_r(X)\) through the subdifferential of \(h_r(X)\) remains valid. The rest of the extension is straightforward and we do not repeat it here.

5 Numerical experiments

In this section we test two popular problems (time series denoising in the real domain and incomplete signal completion in the complex domain) to demonstrate the numerical performance of pMAP. The time series denoising problem aims to extract the noiseless data from polluted observations by removing the noise components, while the incomplete signal completion problem tries to approximate the missing data in an incomplete complex-valued signal.

In both numerical experiments, a solver is terminated when any of the following conditions is met

$$\begin{aligned} \frac{|F_\rho (X^{\nu +1}) - F_\rho (X^\nu )|}{\max \{1, F_\rho (X^\nu )\}} \le ftol , \ \ \ g_r(X^{\nu +1}) \le gtol \ \ \ \text{ or } \ \ \ \frac{\Vert X^{\nu +1} - X^\nu \Vert }{\Vert X^\nu \Vert } \le tol. \end{aligned}$$

Here tol, ftol and gtol are set at \(10^{-5}\), \(10^{-7}\) and \(10^{-8}\), respectively. A solver will also be terminated if it reaches the maximum number of iterations, set at 200. All codes used were written in MATLAB (2019a) and run on a laptop equipped with a 7th Generation Intel Core i5-7200U CPU and 8GB of memory.
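In code, these termination tests (our own sketch; the function and variable names are ours, and the experiments themselves were run in MATLAB) could be implemented as:

```python
import numpy as np

def should_stop(F_new, F_old, g_new, X_new, X_old, n_iter,
                ftol=1e-7, gtol=1e-8, tol=1e-5, max_iter=200):
    """Termination tests: relative change of F_rho, the value of g_r at the new iterate,
    relative change of X, or the maximum number of iterations."""
    rel_f = abs(F_new - F_old) / max(1.0, F_old)
    rel_x = np.linalg.norm(X_new - X_old) / np.linalg.norm(X_old)
    return rel_f <= ftol or g_new <= gtol or rel_x <= tol or n_iter >= max_iter
```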

5.1 Time series denoising

5.1.1 Experiment introduction

In the first experiment we compare the proposed pMAP with some leading solvers, including Cadzow’s method (Cadzow, [14]) and Douglas–Rachford iterations (DRI, [6]), for real-valued time series denoising. In this test we randomly generate a noiseless time series \(\mathbf{a} = (a_1, a_2,\ldots ,a_n)\) via the following process:

$$\begin{aligned} a_t = \sum _{s=1}^{r} d_s (1 + \alpha _s)^t \cos ({2\pi t}/{\beta _s} - \tau _s), \quad \text{ for } t = 1,2,\ldots ,n \end{aligned}$$

where \(d_s\), \(\alpha _s\), \(\beta _s\) and \(\tau _s\) follow the uniform distributions \(d_s \sim U[0, 10^3)\), \(\alpha _s \sim U[-10^{-3}, 10^{-3})\), \(\beta _s \sim U[6, 18)\) and \(\tau _s \sim U[-\pi , \pi )\). It is known that for any \(\{l, k\}\) such that \(l + k - 1 = n\), the rank of the Hankel matrix \(A = {{\mathcal {H}}}({\mathbf{a}}) \in \mathbb {R}^{l\times k}\) must be 2r when both l and k are no smaller than 2r. We then construct the noisy time series \(\mathbf{y}\) by adding the noise series \(\varvec{\epsilon }\) to \(\mathbf{a}\) as \(\mathbf{y = a + \varvec{\epsilon }}\), where \(\varvec{\epsilon } = \{ \epsilon _1, \epsilon _2,\ldots , \epsilon _n\}\) is the noise component and \(\epsilon _t = \theta \frac{e_t}{\Vert e\Vert _2}\Vert a\Vert \). Here \(e_t\) is white noise with mean 0 and variance 1. We consider two scenarios: \(\{n, r\} = \{1000, 10\}\) and \(\{2000,20\}\). For each scenario we test three noise levels, \(\theta = 0.1, 0.2\) and 0.5.
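A sketch of this data-generation process in Python/NumPy (our own code; make_series, the random seed and the return convention are ours) is given below:

```python
import numpy as np

def make_series(n, r, theta, rng):
    """Noiseless a_t = sum_s d_s (1+alpha_s)^t cos(2*pi*t/beta_s - tau_s), plus scaled white noise."""
    d     = rng.uniform(0.0, 1e3, r)
    alpha = rng.uniform(-1e-3, 1e-3, r)
    beta  = rng.uniform(6.0, 18.0, r)
    tau   = rng.uniform(-np.pi, np.pi, r)
    t = np.arange(1, n + 1)[:, None]
    a = (d * (1.0 + alpha) ** t * np.cos(2.0 * np.pi * t / beta - tau)).sum(axis=1)
    e = rng.standard_normal(n)
    y = a + theta * (e / np.linalg.norm(e)) * np.linalg.norm(a)
    return a, y

a, y = make_series(n=1000, r=10, theta=0.1, rng=np.random.default_rng(0))
```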

We further consider two weight choices:

  1. \(\{W_1\}_{i, j} = 1 \), for \(i = 1,\ldots ,l\) and \(j = 1,\ldots ,k\);

  2. \(\{ W_2\}_{i,j} = \frac{1}{i+j-1}\), for \(i = 1,\ldots ,l\) and \(j = 1,\ldots ,k\).

Both weights are standardised for comparison purposes (i.e., \(W/\Vert W\Vert \) is used). Note that Cadzow’s method can only handle \(W_1\).
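The two (standardised) weight matrices can be built in MATLAB as follows; the Hankel dimensions l and k below are placeholders satisfying \(l + k - 1 = n\).

```matlab
% The two standardised weight matrices for an l-by-k Hankel matrix.
l = 501; k = 500;                  % example split with l + k - 1 = n = 1000
W1 = ones(l, k);
[I, J] = ndgrid(1:l, 1:k);
W2 = 1 ./ (I + J - 1);             % {W2}_{ij} = 1/(i+j-1)
W1 = W1 / norm(W1, 'fro');         % standardisation W / ||W||
W2 = W2 / norm(W2, 'fro');
```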

5.1.2 Demonstration of convergence

Before coming to the numerical comparison, we first demonstrate the convergence behaviour of Algorithm 1 under different updating strategies for \(\rho \). We plot the sequences of \(F_\rho (X^\nu )\) in Fig. 1a with both \(W_1\) and \(W_2\). It can be observed that in both cases the functional value \(F_\rho (X^\nu )\) decreases and converges. We further plot the sequence of \(\Vert X^{\nu +1} - X^\nu \Vert \) in Fig. 1b using the same example. This sequence is also decreasing and converges to zero, which is consistent with Theorem 2.

Fig. 1 Plot of \(F_\rho (X^\nu )\) and \(\Vert X^{\nu +1}-X^\nu \Vert \) by pMAP with \(\rho \) fixed

The behaviour of \(\frac{\sigma _{r+1}}{\sigma _r}\) is shown in Fig. 2 under different \(\rho \) updating strategies. In this and later experiments, \(\rho \) is initialised as \(\rho ^0 = 10^{-2} \times m/n^2\), where n denotes the length of a time series and m stands for the number of known observations, which equals n in this test. As shown in Fig. 2a, \(\frac{\sigma _{r+1}}{\sigma _r}\) approaches zero as \(\rho \) increases, which means \(g_r(X^\nu )\) goes to zero as well. By contrast, if \(\rho \) is fixed at \(\rho ^\nu = \rho ^0\) for every iteration, Fig. 2b shows that \(\frac{\sigma _{r+1}}{\sigma _r}\) decreases much more slowly than under the first strategy. We therefore update \(\rho \) by \(\rho ^{\nu +1} = 1.1 \rho ^\nu \) at each iteration as long as \(\rho ^\nu \le n \times \min (W)\), where \(\min (W)\) is the minimal weight in W. The convergence behaviour of Algorithm 1 appears consistent for other choices.

Fig. 2 Plot of \(\frac{\sigma _{r+1}}{\sigma _r}\) at each iteration by pMAP. \(\rho \) is updated by \(\rho ^{\nu +1} = 1.1\rho ^\nu \) in (a) and kept fixed in (b)
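A sketch of this initialisation and update rule (the surrounding loop and the weight matrix W are assumed to come from the pMAP implementation):

```matlab
% Penalty parameter: initialisation and geometric update used in the tests.
rho = 1e-2 * m / n^2;            % rho^0; m = number of known observations
% ... inside the pMAP iteration loop:
if rho <= n * min(W(:))
    rho = 1.1 * rho;             % rho^{nu+1} = 1.1 * rho^{nu}
end
```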

5.1.3 Numerical results

The numerical results are reported in Table 1, including the number of iterations (Iter), the CPU time (Time) and the root mean square error (RMSE) for each solver, where the RMSE is calculated as

$$\begin{aligned} \text{ RMSE } := \sqrt{\sum _{i\in {{\mathcal {I}}}} (\hat{x}_i - a_i)^2/|{{\mathcal {I}}}|}. \end{aligned}$$

where \(\hat{\mathbf{x}} = \{\hat{x}_1, \ldots , \hat{x}_n\}\) is obtained as \(\hat{\mathbf{x}} = {{\mathcal {H}}}^{-1}(\widehat{X})\) and \(\widehat{X}\) is the estimate returned by a given solver. \({{\mathcal {I}}}\) denotes the index set of all data to be predicted and \(|{{\mathcal {I}}}|\) stands for its size. A smaller RMSE indicates a better solution quality. Because the Cadzow method does not allow users to select arbitrary weights apart from \(W_1\), we do not report its numerical results under \(W_2\).
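The following MATLAB sketch computes \(\hat{\mathbf{x}}\) and the RMSE; it assumes that \({{\mathcal {H}}}^{-1}\) is realised by averaging over the anti-diagonals of \(\widehat{X}\) (which simply reads off the entries when \(\widehat{X}\) is exactly Hankel), and Iset is a placeholder for the index set \({{\mathcal {I}}}\) while a denotes the ground-truth series.

```matlab
% Recover xhat = H^{-1}(Xhat) by anti-diagonal averaging, then compute RMSE.
[l, k] = size(Xhat);
n = l + k - 1;
xhat = zeros(1, n);
for i = 1:n                                    % i indexes the anti-diagonals
    rows = max(1, i-k+1):min(l, i);
    cols = i - rows + 1;
    xhat(i) = mean(Xhat(sub2ind([l, k], rows, cols)));
end
rmse = sqrt(sum((xhat(Iset) - a(Iset)).^2) / numel(Iset));
```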

We also report the success rate (SR) of each solver in Table 1. An instance is counted as successfully denoised if the relative gap between the RMSE of the approximate solution and the best possible RMSE is smaller than a pre-defined threshold, i.e.,

$$\begin{aligned} \frac{\text{ RMSE}_{\text{ approx }} - \text{ RMSE}_{\text{ best }}}{\text{ RMSE}_{\text{ best }}} \le \text{ threshold } \end{aligned}$$

In this experiment we set the threshold to \(10\%\). For each combination of \(\{n/r/\theta \}\), all data reported in Table 1 are mean values over 50 randomly generated instances.
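In code, the per-instance success indicator is simply the following test (rmse_approx and rmse_best are placeholder names for the two RMSE values in the criterion above):

```matlab
% One instance counts as successfully denoised if the relative RMSE gap
% does not exceed the threshold (10% in this experiment).
threshold = 0.10;
success = (rmse_approx - rmse_best) / rmse_best <= threshold;
```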

Our first observation on Table 1 is that with \(W = W_1\), pMAP reports the best results in 3 of the 6 examples while DRI performs best in the remaining 3. In general, Cadzow, DRI and pMAP have very similar estimation accuracy under \(W_1\) because they are all MAP-based algorithms.

When the weight matrix is set to \(W_2\), a significant improvement in estimation accuracy can be observed for DRI and pMAP compared to the case \(W = W_1\). This matches our expectation: \(W_2\) gives each data point of the time series (roughly) equal importance, whereas \(W_1\), although constant over the matrix entries, effectively weights the data in the middle of the series more heavily than the data at both ends, since the middle data appear in more entries of the Hankel matrix.

Table 1 Experiment results for the Cadzow iteration, DRI (Douglas-Rachford iterations) and our proposed pMAP, including iterations (Iter), CPU time in seconds (Time), root mean square error (RMSE) and success rate (SR). Results in this table are averages over 50 trials

For all \(\{n/r/\theta \}\) combinations, our proposed solver with \(W_2\) always generated the estimates with the lowest RMSEs. It is also worth mentioning that pMAP enjoys the most robust convergence among all candidate solvers. We therefore conclude that the proposed pMAP algorithm is competitive and effective in solving real-valued time series denoising problems.

5.2 Spectral sparse signal recovery

5.2.1 Experiment introduction

In this experiment, we consider the problem of recovering missing values in an incomplete spectral sparse signal. We refer to [2, 4] and the references therein for background on the recovery of spectral sparse signals via off-grid methodologies. We follow the suggestions in [2] to generate the experiment data \(\mathbf{a}= \{a_1, a_2, \ldots , a_n\}\), where

$$\begin{aligned} a_t = \sum _{s=1}^{r} d_se^{2\pi j \omega _st}, \quad \text{ for } \ \ t = 1, 2,\ldots , n \end{aligned}$$

where \(j = \sqrt{-1}\), r is the model order, \(\omega _s\) is the frequency of each sinusoid and \(d_s \not =0\) is the weight of each sinusoid. Both \(\omega _s\) and \(d_s\) are randomly sampled from uniform distributions, \(\omega _s \sim U[0, 1)\) and \(d_s\sim U[0, 2\pi )\). The indices of missing data are sampled uniformly at random. This experiment comprises three sub-tests with different purposes.
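A minimal MATLAB sketch of this generating process together with a random observation set (the specific \(\{n, m, r\}\) values are illustrative):

```matlab
% Generate a spectral sparse signal and a uniformly random observation set.
n = 999; r = 10; m = round(0.3*n);
omega = rand(r, 1);                       % omega_s ~ U[0, 1)
d     = 2*pi*rand(r, 1);                  % d_s     ~ U[0, 2*pi)
t     = 1:n;
a     = (d.' * exp(2i*pi*omega*t)).';     % a_t = sum_s d_s e^{2*pi*j*omega_s*t}
Omega = sort(randperm(n, m));             % indices of the observed entries
```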

Test a: Incomplete signal recovery without noises In this sub-test we assume that only a subset \(\varOmega \) of the sampling points \(\{1,\ldots ,n\}\) is observed and we aim to recover the signal by estimating the missing data. All observed data are noiseless. We use the success rate (SR) to measure the performance of the candidate methods in incomplete signal recovery. We say the signal is successfully recovered if the relative recovery error \(\Vert \hat{\mathbf{x}} - \mathbf{a}\Vert /\Vert \mathbf{a}\Vert \) falls below a prescribed threshold, where \(\hat{\mathbf{x}}\) is the estimated signal.

In this test, the signal length n is set to 499, 999 or 1999. The percentage of known observations \(\frac{m}{n}\) is set to 30% or 60%, where m stands for the number of known observations. All combinations of \(\{n/m/r\}\) used in this test can be found in Table 2. To benchmark our proposed method, we further introduce three state-of-the-art solvers for spectral sparse signal recovery: Atomic Norm Minimization (ANM [28]), Fast Iterative Hard Thresholding (FIHT [3]) and Projected Gradient Descent (PGD [2]). The weight matrix W is defined as

$$\begin{aligned} W_{i,j} = \left\{ \begin{array}{ll} 1/\sqrt{i+j-1} &{} \ \text{ for } \ i+j-1=1, \ldots , k-1 \\ 1/\sqrt{k} &{} \ \text{ for } \ i+j-1=k, \ldots , n-k+1 \\ 1/\sqrt{(n-i-j+2)} &{} \ \text{ for } \ i+j-1 = n-k +2, \ldots , n, \end{array} \right. \end{aligned}$$

if \(a_{i+j-1}\) is observed, and \(W_{i,j} = 0\) if it is missing.
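A MATLAB sketch of this weight matrix, built from the observation index set (Omega as in the sketch above); the Hankel shape l-by-k is a placeholder with \(l + k - 1 = n\) and \(k \le l\).

```matlab
% Test-a weight matrix: zero on missing entries, 1/sqrt(count) otherwise,
% where count is the length of the anti-diagonal containing a_{i+j-1}.
k = floor((n+1)/2); l = n - k + 1;                 % one admissible shape
v = [1./(1:k-1), (1/k)*ones(1, n-2*k+2), 1./(k-1:-1:1)];
obs = zeros(1, n); obs(Omega) = 1;                 % 1 if a_{i+j-1} observed
wgen = sqrt(v) .* obs;                             % generating vector of W
W = hankel(wgen(1:l), wgen(l:n));                  % W_{ij} = wgen(i+j-1)
```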

Test b: Incomplete signal recovery with noises This sub-test still aims to recover an incomplete spectral sparse signal, but some of the observed data are noisy while the others are noiseless. We follow the same signal generating process as in Test a; however, \(\frac{1}{3}m\) of the observations are polluted by random noises, i.e., \(y_i = a_i + \epsilon _i\) with \(\epsilon _i = \theta \frac{e_i}{\Vert \mathbf{e}\Vert _2}\Vert \mathbf{a}\Vert \). Here \(e_i\) is a complex standard normal random variable and the noise level \(\theta \) is set at 0.2. We assume the index set of polluted observations is known in advance.

The weight matrix W is set as follows. We give polluted observations very small weights (say 1), while noiseless observations are assigned much larger weights, such as 100. The missing data in the signal are given zero weight. This weighting scheme is more reasonable than other choices because we place higher confidence in noiseless data, less confidence in noisy data and no confidence in missing values. It is worth noting that this weight matrix setting is for pMAP only because FIHT and PGD do not support flexible weight choices. The remaining settings of this sub-test, such as the definition of \(\{n,m,r\}\), are the same as in Test a.
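A sketch of this weighting scheme on the Hankel generating vector; cleanIdx and noisyIdx are placeholder index sets for the noiseless and polluted observations, and l, k are as before.

```matlab
% Test-b weights: 100 for noiseless data, 1 for noisy data, 0 for missing.
wgen = zeros(1, n);                 % zero weight on missing entries
wgen(cleanIdx) = 100;               % high confidence: noiseless observations
wgen(noisyIdx) = 1;                 % low confidence: polluted observations
W = hankel(wgen(1:l), wgen(l:n));   % map the weights onto the Hankel pattern
```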

Test c: Recover missing data with inaccurately estimated rank In both Test a and Test b, we assume the objective rank r is a known parameter. However, the true rank is rarely known in real applications. In this sub-test we therefore examine the performance of the candidate solvers in recovering incomplete spectral sparse signals when the objective rank r is incorrectly estimated. The signal length n is set to 3999 and we assume that 30% of the data in a signal are observed, at random locations and without noise. The true rank r is 15 but is treated as unknown in this test; for each solver, we try estimated ranks \(\hat{r}\) ranging from 6 to 30. Success rates (SR) over 50 instances are reported to measure the performance of each solver.

5.2.2 Numerical results

Test (a) The numerical results of this test are listed in Table 2, including the total number of iterations (Iter), CPU time in seconds (Time), RMSE and success rate (SR) for each solver. Among all solvers, ANM enjoys the best global convergence guarantee because it is a convex relaxation method. However, its computational cost is much higher than that of the other solvers and it runs out of memory when n is larger than 500. At the same time, it fails to generate better results compared with the other solvers. Hence we do not report its performance in the remainder of this test problem. Although DRI performs slightly better than the Cadzow method in terms of accuracy, neither of these two solvers successfully recovered any incomplete signal.

Table 2 Numerical results for six different solvers on the incomplete signal recovery experiment, including iterations (Iter), computational time (Time), estimation error (RMSE) and successful recovery rate (SR). Results in this table are averages over 50 trials. *Experiment results for ANM when \(n \ge 999\) are not available because it runs out of memory

We further note that in some cases FIHT stops within a few iterations, and this behaviour may lead to inferior solutions. Similar behaviour was also reported in another recent study (Fig. 3, [31]). Also, as r increases, the performance of PGD declines even when the ratio between m and r is kept fixed. This is because PGD relies on assumptions about a lower bound on m with respect to r (Theorem 2.1, [2]), which may not hold in some cases. On the other hand, pMAP performs best in 11 of the 12 cases in terms of SR.

We find that FIHT is more computationally efficient than pMAP. This is the result of the subspace technique used in FIHT (Alg. 2 in [3]), which approximates the matrix projection onto a low-rank subspace and thereby reduces the computational cost from \(O(n^3)\) (SVD) to \(O(nr^2)\). Although this technique could be adapted to our framework to reduce the computational cost, there would be no theoretical guarantee for its convergence. We leave this potential improvement for future research.
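As a point of reference (and not the subspace scheme of [3]), the projection \(\varPi _{{\mathbb {M}_r}}\) inside pMAP can also be computed with MATLAB's partial SVD, which only extracts the r leading singular triplets:

```matlab
% Rank-r projection via a partial SVD (not the subspace technique of [3]).
[U, S, V] = svds(X, r);      % r leading singular triplets of X
PiX = U * S * V';            % Pi_{M_r}(X)
```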

Some applications require estimating the coefficients of a spectral sparse signal from the reconstructed signal, namely the amplitude \(d_s\) and the frequency \(\omega _s\) for all \(s\in \{1,2,\ldots ,r\}\). We use the method in [6, Fig. 2] to reconstruct the coefficients of the recovered signals in this experiment. The reconstruction results are plotted in Fig. 3 for two randomly selected instances. It is easy to observe that when \(r = 20\), FIHT incorrectly estimated most of the coefficients with significant errors, while both PGD and pMAP successfully recovered most of them. On the other hand, when r increases to 40, PGD performed worse than both FIHT and pMAP. Several coefficients were accurately estimated by FIHT and pMAP but failed to be recovered by PGD (the unrecovered coefficients are indicated by arrows in Fig. 3d).

Fig. 3 Spectral sparse signal coefficients reconstruction results by FIHT, PGD and pMAP, setting \(\{n/m/r\} = \{499/150/20\}\) in (a, c, e) and \(\{n/m/r\} = \{499/300/40\}\) in (b, d, f), respectively. Black circles stand for the true locations of coefficients while red stars stand for the estimated locations of coefficients

Test (b) Table 3 lists the numerical results for this test, including Iter, Time, RMSE and SR for the five solvers. Due to the interference of noise, we increase the success threshold to \(10^{-2}\) to keep the numerical results comparable and meaningful.

Table 3 Numerical results for Cadzow, FIHT, PGD, DRI and our proposed pMAP on the noisy signal recovery experiment, including iterations (Iter), CPU time in seconds (Time), root mean square error (RMSE) and success rate (SR). Results in this table are averages over 50 trials

Experimental results show that our proposed pMAP significantly outperforms the other four candidate solvers in all tests. All the competing solvers failed to recover any signal successfully (i.e., \(SR=0\) on average). We observed that those solvers were capable of producing iterates with RMSE at the level of \(10^{-2}\), but had extreme difficulty in driving the RMSE below \(10^{-2}\). In contrast, the SR of pMAP is at least 0.92. This is likely because DRI, FIHT and PGD cannot customise the weight of each observation, which forces them to give equal weights to noisy and noiseless observations. As a result, their estimates are significantly affected by the noisy observations. This test clearly demonstrates the advantage pMAP gains from allowing customised weighting schemes.

Test (c) The numerical results of the experiment on recovering missing data with an inaccurately estimated rank are plotted in Fig. 4 for each solver. When \(\hat{r}\) is smaller than 15, the success rates of all solvers are zero, indicating that none of them can recover the incomplete signal successfully when the rank is underestimated. When \(\hat{r}\) equals 15 exactly, all three methods FIHT, PGD and pMAP achieve a 100% recovery rate. However, when \(\hat{r} > 15\), the three solvers behave differently. The success rates of both PGD and FIHT gradually decline as \(\hat{r}\) increases and finally reach around 40% when \(\hat{r} = 30\). On the other hand, the SR of our proposed pMAP stays at 100% for any \(\hat{r}\) no smaller than 15, which indicates that pMAP is more robust to overestimation of the objective rank than the other two solvers. One may wonder what causes the failure of FIHT and PGD in this case. We explain it below.

Fig. 4 SR when the input rank is misspecified, for FIHT, PGD and pMAP. n is set to 3999 and 30% of the observations are known. The true rank r is 15. Results in this figure are the average of 50 trials

Figure 5 compares PGD, FIHT and pMAP on a random incomplete signal recovery example in terms of the relative error between two consecutive iterates \( \mathbf{x}^{\nu +1}\) and \(\mathbf{x}^{\nu }\). Figure 5a shows that the failures of both FIHT and PGD in recovering the incomplete signal are caused by non-convergence. Figure 5b illustrates the distribution of the singular values at the final iterate of each algorithm. A noticeable feature is that FIHT terminates too early because it cuts off too many singular values that are not negligible, while PGD and pMAP only cut off comparably small singular values. However, PGD suffers from non-monotonic convergence, as shown in Fig. 5a. In contrast, the performance of pMAP fits exactly the situation studied in Theorem 3, which ensures locally linear convergence when the cut-off singular values are negligible compared to the 15 largest ones.

Fig. 5 Performance comparison of the candidate solvers in incomplete signal recovery when the input rank is incorrectly estimated. The true rank is 15 and the input rank is 21. a plots the relative gap between \(x^{\nu }\) and \(x^{\nu +1}\) for each solver at each iteration, while b plots the singular values of the final solution of each solver. Both panels are on a log (base 10) scale

6 Conclusion

In this paper, we studied the problem of approximating a low-rank Hankel matrix from a noisy and/or incomplete observation matrix under arbitrary weights. The problem is very challenging due to its non-convexity. We introduced a majorization-minimization framework so that the problem can be tackled iteratively with non-increasing objective function values, and we showed that each subproblem enjoys a closed-form solution that can be computed efficiently. We established the global convergence of our approach pMAP under the assumption that the penalty parameter goes to infinity, and we also showed that the method converges at least linearly to an \(\epsilon \)-approximate KKT point if the penalty parameter \(\rho \) is above a threshold. The method extends to complex-valued matrices because the majorization of \(g_r(X)\) through the subdifferential of \(h_r(X)\) still holds in the complex-valued case. In the computational experiments on both time series denoising and signal completion problems, pMAP usually outperforms other state-of-the-art solvers in terms of approximation accuracy within reasonable computing times.

One important topic for further investigation is whether the computational cost of pMAP can be reduced. For example, the subspace optimization technique used in [3] can reduce the computing time significantly compared with the partial SVD, which is the major source of computational cost in pMAP. However, this technique in its current form is likely to break the established convergence results of pMAP, simply because it cannot guarantee a closed-form solution of each subproblem. Another interesting direction for future research is extending the proposed pMAP framework to other rank-minimization related problems with similar structures, such as robust matrix completion and robust principal component analysis.

One referee suggested considering an exact penalty based on the distance function itself rather than its squared form, and applying the majorization-minimization approach to the resulting penalized problem. While we recognise the benefits of using the squared distance function, as surveyed in [18], we also think the suggested approach is worth serious investigation. The main difficulty lies in designing an efficient majorization function for the distance function. We leave this to future research.