3.3.1. Time-Dependent Non-Negative Matrix Factorization Using the $L_{2,1}$ Norm
The objective function (1) measures the difference between the original matrix $Y$ and the approximation $WX$ in the form of the squared error. Because anomalous data produce large squared errors, the objective function is easily dominated by such data. RTNMFFM instead applies the $L_{2,1}$ norm to quantify the error between the original time series data and the approximation matrix. The $L_{2,1}$ norm of a matrix $E \in \mathbb{R}^{m \times n}$ [34,36] is defined as

$$\|E\|_{2,1} = \sum_{t=1}^{n} \|e_t\|_2 = \sum_{t=1}^{n} \sqrt{\sum_{i=1}^{m} E_{it}^2}, \qquad (2)$$

where $e_t$ is the $t$-th column of $E$. Note that the $L_{2,1}$ norm satisfies the three norm conditions, including the triangle inequality, and is therefore a legitimate norm [34]. RTNMFFM uses it as a measure of error, giving the objective

$$\min_{W \ge 0,\, X \ge 0} \|Y - WX\|_{2,1} + \lambda_w \|W\|_F^2 + \lambda_x \|X\|_F^2, \qquad (3)$$
where the first term is the robust formulation of the error function [37], the squared Frobenius norm $\|\cdot\|_F^2$ prevents overfitting [38] or guarantees strong convexity [3], and $\lambda_w$ and $\lambda_x$ are the regularization parameters. The robust formulation of the error function is equivalent to

$$\|Y - WX\|_{2,1} = \sum_{t=1}^{n} \|y_t - W x_t\|_2. \qquad (4)$$
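To make the $L_{2,1}$ quantities in Equations (2) and (4) concrete, here is a minimal numpy sketch (our illustration, not code from the paper) contrasting how the $L_{2,1}$ loss and the squared Frobenius loss react to a single outlier column:

```python
import numpy as np

def l21_norm(E):
    # Equation (2): Euclidean norm of each column, then summed.
    return np.linalg.norm(E, axis=0).sum()

rng = np.random.default_rng(0)
E = rng.normal(scale=0.1, size=(5, 10))  # residual matrix Y - WX
E[:, 3] += 50.0                          # one outlier column

# The squared Frobenius loss grows quadratically with the outlier,
# while the L2,1 loss grows only linearly in its norm.
print(f"L2,1 loss:   {l21_norm(E):8.1f}")
print(f"Frobenius^2: {np.linalg.norm(E)**2:8.1f}")
```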
Equation (4) treats the $L_{2,1}$ norm as the loss measure of the reconstruction error. When there are outliers in the data, the data exhibit a fat-tailed distribution, and the Laplace distribution is one of the standard statistical tools for analyzing fat-tailed data: it has low sensitivity to outliers and more robust behavior. Assume that the observed data vector $y_t$ is contaminated by noise $\epsilon_t$ obeying a Laplace distribution with zero mean:

$$y_t = u_t + \epsilon_t, \qquad (5)$$
where $u_t$ is an unobservable true value that can be considered as a point in a $K$-dimensional subspace ($K \ll m$), i.e.,

$$u_t = W x_t, \qquad (6)$$

where $W x_t$ is the projection of $y_t$ onto the subspace defined by the columns of $W$. Assuming that $\epsilon_t$ obeys a Laplace distribution with zero mean and scale parameter $b$, we have $y_t \sim \mathrm{Laplace}(W x_t, b)$, and since in general each column vector $y_t$ of $Y$ is independent, the probability distribution of $Y$ conditional on $W$ and $X$ is

$$P(Y \mid W, X) = \prod_{t=1}^{n} P(y_t \mid W x_t, b). \qquad (7)$$
The $L_{2,1}$ metric loss function has the rotational invariance property, while the pure $L_1$ loss function does not have this desirable property. The literature [39] emphasizes the importance of rotational invariance in the context of learning algorithms. The reason the $L_{2,1}$ norm possesses rotational invariance is that its inner layer first computes the $L_2$ norm of each column vector, and rotational invariance is a fundamental property of Euclidean spaces equipped with the $L_2$ norm [36]. Since the subspace is not uniquely determined before mapping the data to a lower-dimensional space, it is common to model the data with distributions that satisfy rotational invariance [36]. By the rotational invariance of the $L_2$ norm, the literature [36] generalizes the Laplace distribution to the rotationally invariant Laplace distribution (Equation (8)):

$$P(y_t \mid u_t, b) \propto \exp\!\left( -\frac{\|y_t - u_t\|_2}{b} \right). \qquad (8)$$
According to the strategy of maximizing the log-likelihood of the data, it can be obtained that

$$\log P(Y \mid U) = \mathrm{const} - \frac{1}{b} \sum_{t=1}^{n} \|y_t - u_t\|_2. \qquad (9)$$

Maximizing the data likelihood is equivalent to minimizing the summation part of Equation (9), so by replacing $u_t$ with $W x_t$, we obtain Equation (10):

$$\min_{W \ge 0,\, X \ge 0} \sum_{t=1}^{n} \|y_t - W x_t\|_2 \;=\; \min_{W \ge 0,\, X \ge 0} \|Y - W X\|_{2,1}. \qquad (10)$$
In Equation (10), assuming that the fitting error obeys a rotationally invariant Laplace distribution, the maximum likelihood problem is transformed into a non-negative matrix factorization problem with the $L_{2,1}$ norm as the loss function by imposing the constraints $W \ge 0$, $X \ge 0$ [37]. Typically, the $L_0$ norm is ideal for eliminating outliers. The $L_0$ norm counts the number of non-zero elements in a vector, and it can realize a sparsity constraint. However, minimizing the $L_0$ norm is an NP-hard problem that is difficult to optimize. A common method is to solve an approximation problem, replacing the $L_0$ norm with the $L_1$ norm [11]. Both the $L_1$ norm and the $L_{2,1}$ norm make the solution sparse, so we believe that the $L_{2,1}$ norm can achieve the same robustness goal. In contrast to the loss function of the $L_0$ metric, the $L_{2,1}$ metric is convex and can be easily optimized [11,34].
From a computational point of view, the error for each data point in the robust formulation is $\|y_t - W x_t\|_2$, which prevents large errors due to outliers from dominating the objective function in the form of squared errors [34,37]. We use two-dimensional toy data to show the robustness of the $L_{2,1}$ metric loss function, generating 10 two-dimensional data points, two of which are outliers. We fit the original data points using the $L_{2,1}$ metric loss function and standard non-negative matrix factorization, respectively, projecting all data onto a one-dimensional subspace. The residuals corresponding to each data point are shown in Figure 2: the non-negative matrix factorization with the $L_{2,1}$ metric loss function has a much smaller error than standard NMF and is minimally affected by the two outliers. This robust approach keeps large errors caused by outliers from dominating the objective function by compressing the larger residual values. However, the model defined by the loss function in Equation (4) is not yet able to predict the data; next, we describe in detail how RTNMFFM constructs temporal correlations and predicts future data.
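Before turning to the temporal structure, we sketch how an $L_{2,1}$ objective of this kind can be minimized. The paper does not spell out its update rules at this point; the sketch below uses the multiplicative update scheme commonly associated with $L_{2,1}$-norm NMF (in the style of Kong et al., which appears to be reference [34]), where each column is reweighted by the inverse of its residual norm. The toy setup mirrors the Figure 2 experiment in spirit; the iteration count, seed, and the constant `eps` are our own choices:

```python
import numpy as np

def l21_nmf(Y, K, n_iter=200, eps=1e-9, seed=0):
    """Rank-K NMF minimizing ||Y - W X||_{2,1} via multiplicative updates.

    Columns with large residuals get weight d_t = 1 / ||y_t - W x_t||_2,
    so outlying columns contribute less to each update.
    """
    rng = np.random.default_rng(seed)
    m, n = Y.shape
    W = rng.random((m, K))
    X = rng.random((K, n))
    for _ in range(n_iter):
        # Per-column residual weights: the source of robustness.
        d = 1.0 / (np.linalg.norm(Y - W @ X, axis=0) + eps)
        D = np.diag(d)
        W *= (Y @ D @ X.T) / (W @ X @ D @ X.T + eps)
        X *= (W.T @ Y @ D) / (W.T @ W @ X @ D + eps)
    return W, X

# Toy data in the spirit of Figure 2: 10 two-dimensional points, 2 outliers.
rng = np.random.default_rng(1)
Y = np.abs(rng.normal(5.0, 1.0, size=(2, 10)))
Y[:, [2, 7]] += 30.0            # two outlying points
W, X = l21_nmf(Y, K=1)          # project onto a one-dimensional subspace
print(np.linalg.norm(Y - W @ X, axis=0))  # per-point residuals
```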
In general, the latent state of a particular timestamp may be related to the latent states of one or more preceding timestamps. The relationship between the columns of the latent temporal factor matrix can be expressed as follows:

$$x_t \approx \sum_{l \in \mathcal{L}} \theta_l \odot x_{t-l}, \qquad (11)$$

where $\mathcal{L}$ is a lag set storing the timing relationships between the columns of $X$, $\theta_l$ is the vector of timing coefficients for lag $l$, and all $\theta_l$ are combined into a timing coefficient matrix $\Theta$. In the timing structure shown in Figure 3, the state of each timestamp is related to the states of the first and third timestamps preceding it. In this example, $\mathcal{L} = \{1, 3\}$ and $\Theta = [\theta_1, \theta_3]$. Any column $x_t$ of the matrix $X$ is then a combination of its first previous column $x_{t-1}$ and its third previous column $x_{t-3}$ in the following manner:

$$x_t \approx \theta_1 \odot x_{t-1} + \theta_3 \odot x_{t-3}. \qquad (12)$$
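As a toy numerical check of Equation (12) (the values are ours, chosen only for illustration):

```python
import numpy as np

# Equation (12): each column combines its 1st and 3rd preceding columns
# element-wise (the Hadamard product).
x_tm1, x_tm3 = np.array([1.0, 2.0]), np.array([0.5, 4.0])
theta1, theta3 = np.array([0.6, 0.3]), np.array([0.2, 0.5])
x_t = theta1 * x_tm1 + theta3 * x_tm3
print(x_t)   # [0.7, 2.6]
```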
Based on the above temporal relationship between the columns of the latent temporal factor matrix $X$, RTNMFFM introduces an AR temporal regularizer [3] on top of the non-negative matrix factorization approach. It thereby learns, in an autoregressive manner, the timing coefficient matrix $\Theta$, the latent correlation factor matrix $W$, and the latent temporal factor matrix $X$ that best fit the system, without the timing coefficients $\Theta$ having to be set by hand.
The matrix factorization model that incorporates a temporal regularizer establishes a temporal relationship generation mechanism between the columns of $X$ and, at the same time, achieves prediction of high-dimensional time series data. The AR regularizer is

$$R_{AR}(X \mid \mathcal{L}, \Theta, \eta) = \frac{1}{2} \sum_{t=l_{\max}+1}^{n} \Big\| x_t - \sum_{l \in \mathcal{L}} \theta_l \odot x_{t-l} \Big\|_2^2 + \frac{\eta}{2} \|\Theta\|_F^2, \qquad (13)$$

where the first term in Equation (13) measures how well the columns of $X$ follow the autoregressive structure, $\mathcal{L}$ is a time lag set that indicates the temporal correlation topology (the temporal structure indicated by the time lag set is reflected in the latent temporal factor matrix), $l_{\max}$ is the maximum value of the time lag set ($l_{\max} = \max(\mathcal{L})$), ⊙ is the element-wise product, and $\theta_l$ is a $K \times 1$ coefficient vector serving as a learnable parameter representation of the autoregressive coefficients. The last term is the regularization term of $\Theta$; its aim is to prevent overfitting. The set of time lags in the AR regularizer can be discontinuous, so as to flexibly embed seasonality [3].
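As one concrete reading of Equation (13), the sketch below (our notation; `lags` and `Theta` are illustrative names) evaluates the AR regularizer for a given lag set and coefficient matrix:

```python
import numpy as np

def ar_regularizer(X, lags, Theta, eta):
    """AR temporal regularizer of Equation (13).

    X     : K x n latent temporal factor matrix (columns x_t).
    lags  : list of time lags, e.g. [1, 3]; may be discontinuous
            to embed seasonality.
    Theta : K x len(lags) matrix; column j is the coefficient
            vector theta_l for lag lags[j].
    """
    K, n = X.shape
    l_max = max(lags)
    fit = 0.0
    for t in range(l_max, n):  # 0-based; covers t = l_max+1, ..., n in paper notation
        # Element-wise AR forecast of x_t from its lagged columns.
        x_hat = sum(Theta[:, j] * X[:, t - l] for j, l in enumerate(lags))
        fit += 0.5 * np.sum((X[:, t] - x_hat) ** 2)
    return fit + 0.5 * eta * np.linalg.norm(Theta) ** 2

# Example with the lag set of Figure 3: each column depends on the
# first and third preceding columns.
rng = np.random.default_rng(0)
X = rng.random((4, 20))
Theta = rng.random((4, 2))
print(ar_regularizer(X, lags=[1, 3], Theta=Theta, eta=0.1))
```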
Introducing constraint (13) into the loss function (3), we get

$$\min_{W \ge 0,\, X \ge 0,\, \Theta} \|Y - WX\|_{2,1} + \lambda_w \|W\|_F^2 + \lambda_x R_{AR}(X \mid \mathcal{L}, \Theta, \eta). \qquad (14)$$
The illustration of RTNMFFM is shown in Figure 4: RTNMFFM decomposes the data matrix $Y$ on the left side into the dimensional feature matrix $W$ and the temporal feature matrix $X$ on the lower right side, while the upper right side denotes the temporal dependence between $x_t$ and its historical data that RTNMFFM mines through the autoregressive regularizer.
Compared to TRMF, our proposed method is more robust and uses NMF rather than standard MF. The non-negative constraint of NMF on the latent temporal factors favors sparse encodings (more zero values), so fewer elements are involved in learning the autoregressive coefficients, which reduces overfitting. Although more complex temporal regularizers have been studied and applied [29,32], RTNMFFM still chooses to capture the linear relationships of the AR model, because NMF can only extract the linear structure of the data and our study focuses on robustness.
It is important to note that the model proposed in this paper can be viewed as a special kind of dynamic factor model (DFM) of the form

$$y_t = W x_t + \epsilon_t, \qquad x_t = \sum_{l \in \mathcal{L}} A_l\, x_{t-l} + \eta_t, \qquad (15)$$

where $\epsilon_t$ is generally assumed to be serially uncorrelated (i.e., i.i.d. $\epsilon_t \sim \mathcal{N}(0, \Sigma_{\epsilon})$, with $\Sigma_{\epsilon}$ being a diagonal matrix).
The difference between our proposed method and Equation (15) is, first, that our model assumes the noise $\epsilon_t$ obeys a Laplace distribution, an assumption more appropriate for describing the distribution of data in the presence of outliers. Another difference is that most existing studies [40,41,42] use principal component analysis (PCA) to obtain the sequence correlation, while our method uses NMF. PCA represents the data as a linear combination of all eigenbasis vectors and can therefore extract only the linear structure of the data. In contrast, NMF represents data using combinations of varying numbers of basis vectors, so it can extract the multilinear structure of the data and has some capability for handling nonlinear data.
3.3.2. Periodic Smoothing Penalty for Severe Missing Values
The percentage of missing data can be as high as 50% or more in the data collected by real sensors [10]. Severe missing values can result in inaccurate forecast values. A similar phenomenon was reported by Fernandes et al. [43], who refer to it as the "misalignment problem in matrix factorization with missing values". This problem is illustrated in Figure 5: even though the non-negative matrix factorization accurately captures the evolution of the time series in the missing gaps, the estimated range of values is far from the actual range of values. An intuitive observation of this problem is that the variation between consecutive timestamps is normally smooth, whereas this smoothness disappears when data are severely missing.
This unsmooth character carries over into the latent temporal factor matrix obtained by matrix factorization. When observations are missing, the latent temporal factor matrix derived from the temporally regularized matrix factorization algorithm loses its smoothness. Numerically, this manifests as lower values of the latent temporal factor vectors for time slices with missing data relative to time slices without missing data, together with a misalignment of the vectors. The loss of smoothness of the latent temporal factor matrix leads to overfitting, which prevents the autoregressive parameters from correctly representing the temporal relationships and ultimately worsens the prediction accuracy. Based on this motivation, we need a way to improve the generalization ability, and our main goal is a strategy that automatically smooths the latent temporal factor vectors where the misalignment problem occurs.
The matrix factorization-based temporal regularization framework can be considered a specific example of a dynamic factor model (DFM) [44] that searches for linear links between high-dimensional observations and hidden states [30]. Imposing a smoothing penalty between two latent variables makes them similar, and we hope to reduce the effect of missing observations by adding such a smoothing penalty to the objective function, making the latent states of time slices with more missing values smoother.
Assuming a seasonal period of $T$, seasonality makes latent temporal factors $x_t$ and $x_{t-T}$ that are in the same phase similar. To account for this, we need to rewrite the objective function as

$$\min_{W \ge 0,\, X \ge 0,\, \Theta} \|Y - WX\|_{2,1} + \lambda_w \|W\|_F^2 + \lambda_x \Big( \sum_{t=l_{\max}+1}^{n} R_{AR}(x_t) + \frac{\eta}{2} \|\Theta\|_F^2 \Big), \qquad (16)$$

where $R_{AR}(x_t)$ denotes the AR temporal regularizer. The temporal regularization can be formulated as follows:

$$R_{AR}(x_t) = \frac{1}{2}\, \|x_t - \hat{x}_t\|_2^2, \qquad \hat{x}_t = \sum_{l \in \mathcal{L}} \theta_l \odot x_{t-l}, \qquad (17)$$
where the forecast $\hat{x}_t$ takes the historical temporal feature vectors $\{x_{t-l} : l \in \mathcal{L}\}$ as input and makes a prediction of $x_t$. The temporal regularizer can be viewed as smoothing $x_t$ with the predicted value $\hat{x}_t$ of the autoregressive model; this smoothing makes $x_t$ similar to $\hat{x}_t$, and, due to seasonality, $x_t$ will be similar to both $\hat{x}_t$ and $x_{t-T}$. Based on this observation, we ask whether we can apply a smoothing penalty between the latent temporal factor vectors that have lost their smoothness due to severe missing data and the seasonally corresponding historical latent temporal factor vectors, so that these factor vectors regain their smoothness.
Since missing values result in low numerical values of the latent temporal factor vectors, we force smoothing between the low-valued latent temporal factor vectors and high-valued latent temporal factor vectors that are in the same phase, through regularization penalties that rely on the seasonality of the time series. For example, for any $x_t$, the latent temporal factor vectors in the same phase are $x_{t-T}$, $x_{t-2T}$, and so on. The rationale, as mentioned before, is that latent temporal factor vectors in the same phase should be similar due to seasonality. This is done to minimize the misalignment illustrated in Figure 5.
Based on this motivation, we propose to reduce the misalignment induced by missing values by applying a smoothing penalty to latent temporal factor vectors with lower numerical values. This smoothing penalty takes the form

$$\rho\, \|x_t - x_{hc}\|_2^2, \qquad (18)$$

where $\rho$ is the regularization parameter of the periodic smoothing penalty, $x_t$ is a low-valued latent temporal factor vector that needs to be smoothed, and $x_{hc}$ is a latent temporal factor vector with a smoother, higher-valued latent state. This smoothing approach raises two questions: (1) how do we select the vectors that need to be smoothed, and (2) with what kind of vector $x_{hc}$ should these selected low-valued vectors be smoothed?
We propose to decide whether a time slice participates in the periodic smoothing penalty based on the energy of the corresponding vector in the temporal factor matrix $X$ at moment $t$, where the energy of the time slice $x_t$ at moment $t$ is

$$E(x_t) = \|x_t\|_2^2. \qquad (19)$$
We define $\bar{E}_t$ as the average of the energies of all latent temporal factor vectors up to moment $t$ that are in the same phase as $x_t$. The form of $\bar{E}_t$ is

$$\bar{E}_t = \frac{1}{|\mathcal{P}_t|} \sum_{s \in \mathcal{P}_t} E(x_s), \qquad \mathcal{P}_t = \{\, t - kT \;:\; k = 1, 2, \ldots,\ t - kT \ge 1 \,\}, \qquad (20)$$

where $\mathcal{P}_t$ collects the historical time slices in the same phase as $x_t$.
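In code, Equations (19) and (20) reduce to a squared column norm and an average over the same-phase history. A minimal sketch, under our assumption that the energy is the squared Euclidean norm (0-based indexing):

```python
import numpy as np

def energy(x):
    # Equation (19): energy of a time slice.
    return np.linalg.norm(x) ** 2

def phase_avg_energy(X, t, T, t_min=0):
    """Equation (20): average energy of the columns before t that are in
    the same phase (t - T, t - 2T, ...), ignoring indices below t_min."""
    idx = range(t - T, t_min - 1, -T)
    energies = [energy(X[:, s]) for s in idx]
    # No same-phase history: return +inf so the slice is never smoothed.
    return np.mean(energies) if energies else np.inf
```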
To prevent overfitting, the periodic smoothing penalty is not applied to all time slices; it is used only when the energy $E(x_t)$ of the current latent temporal factor vector is smaller than $\bar{E}_t$. This is consistent with the motivation mentioned above: we only need to smooth latent temporal factor vectors with lower values, and the energy of a lower-valued latent temporal factor vector will be smaller. For example, suppose the seasonality is $T = 3$ and the time lag set is $\mathcal{L} = \{1, 3\}$, so the regularized time slices start at $t = l_{\max} + 1 = 4$. To decide whether to smooth $x_{10}$, the steps are to first calculate the energies $E(x_4)$ and $E(x_7)$ of the same-phase vectors $x_4$ and $x_7$, and then find their average value $\bar{E}_{10}$. Finally, when $E(x_{10})$ is smaller than $\bar{E}_{10}$, a periodic smoothing penalty on $x_{10}$ is required. If $E(x_{10})$ is bigger than $\bar{E}_{10}$, we force the regularization parameter $\rho = 0$, indicating that $x_{10}$ does not participate in the periodic smoothing penalty. This judgment prevents higher-energy time slices from participating in the periodic smoothing penalty, so that normal latent temporal factor vectors do not lose features they should retain by overfitting to earlier time vectors. We utilize the property that the seasonality of the time series makes the latent temporal factor vectors similar: if the current latent temporal factor vector $x_t$ has values that are too low because of missing data, its energy will be lower than the average energy of the historical vectors of the same phase, and this is how we pick out the vectors that need to be smoothed.
For the second problem, we define the high-confidence latent temporal vector $x_{hc}$ of $x_t$ to be the latent temporal vector that is historically in the same phase as $x_t$ and has the highest energy. The reason is twofold: first, latent temporal factor vectors with lower values need to be smoothed against latent temporal vectors with higher values, which should have relatively higher energies; second, due to seasonality, latent temporal factors in the same phase should be similar. For example, if $x_{10}$ needs a periodic smoothing penalty, we find the vector with the maximum energy among $x_4$ and $x_7$ and use it as the high-confidence latent temporal vector for $x_{10}$.
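Combining the answers to the two questions, the following sketch (ours; 0-based indexing) decides whether a slice is smoothed and, if so, returns its high-confidence vector:

```python
import numpy as np

def smoothing_target(X, t, T):
    """Return the high-confidence vector x_hc for column t, or None if
    x_t is energetic enough and should not be smoothed (rho forced to 0)."""
    same_phase = list(range(t - T, -1, -T))         # t-T, t-2T, ...
    if not same_phase:
        return None
    energies = [np.linalg.norm(X[:, s]) ** 2 for s in same_phase]
    if np.linalg.norm(X[:, t]) ** 2 >= np.mean(energies):
        return None                                 # higher-energy slice: skip penalty
    # Highest-energy same-phase column is the high-confidence vector.
    return X[:, same_phase[int(np.argmax(energies))]]

# Example mirroring the text: T = 3, slice t = 10, compared against
# its same-phase history.
rng = np.random.default_rng(0)
X = rng.random((4, 12)) + 1.0
X[:, 10] *= 0.1                                     # simulate a low-valued slice
x_hc = smoothing_target(X, t=10, T=3)
print(x_hc is not None)                             # True: x_10 gets the penalty
```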
We give an intuitive illustration, using an example, of the latent temporal factor matrix produced by temporally regularized matrix factorization with our proposed periodic smoothing penalty, compared to the latent temporal factor matrix without it. We ran temporally regularized matrix factorization on the Guangzhou urban traffic speed dataset, of size 214 × 1380, treating all data from t = 360 to t = 1200 as missing, and set the rank of the low-rank factorization to K = 40. In Figure 6, the blue line is the result after adding the periodic smoothing penalty, and the red line is the result without it. The latent temporal factors are visibly smoother with the periodic smoothing penalty, especially from t = 360 to t = 1200. This shows that our proposed periodic smoothing penalty makes the latent temporal factor matrix smoother, which is more conducive to learning the parameters in the temporal regularization term and to the final prediction.
The loss function with the addition of the periodic smoothing penalty is

$$\min_{W \ge 0,\, X \ge 0,\, \Theta} \|Y - WX\|_{2,1} + \lambda_w \|W\|_F^2 + \lambda_x \Big( \sum_{t=l_{\max}+1}^{n} R_{AR}(x_t) + \frac{\eta}{2} \|\Theta\|_F^2 \Big) + \rho \sum_{t \,:\, E(x_t) < \bar{E}_t} \|x_t - x_{hc}\|_2^2. \qquad (21)$$

The objective function in Equation (21) focuses on the severely missing data. For lower-energy time slices, RTNMFFM actively utilizes the high-confidence slices, while for higher-energy time slices no smoothing is performed, to prevent overfitting.
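For completeness, here is a compact sketch that evaluates the full objective of Equation (21) for fixed factors. It is a reading of the formula, not the authors' training loop; the alternating updates that minimize it are outside the scope of this snippet:

```python
import numpy as np

def rtnmffm_loss(Y, W, X, Theta, lags, lam_w, lam_x, eta, rho, T):
    """Objective of Equation (21): robust L2,1 fit + Frobenius penalty on W
    + AR temporal regularizer + periodic smoothing penalty."""
    l21 = np.linalg.norm(Y - W @ X, axis=0).sum()
    reg_w = lam_w * np.linalg.norm(W) ** 2
    l_max = max(lags)
    ar = 0.0
    for t in range(l_max, X.shape[1]):
        # Element-wise AR forecast of x_t (Equation (17)).
        x_hat = sum(Theta[:, j] * X[:, t - l] for j, l in enumerate(lags))
        ar += 0.5 * np.sum((X[:, t] - x_hat) ** 2)
    ar += 0.5 * eta * np.linalg.norm(Theta) ** 2
    smooth = 0.0
    for t in range(X.shape[1]):
        same_phase = list(range(t - T, -1, -T))
        if not same_phase:
            continue
        energies = [np.linalg.norm(X[:, s]) ** 2 for s in same_phase]
        if np.linalg.norm(X[:, t]) ** 2 < np.mean(energies):  # low-energy slice only
            x_hc = X[:, same_phase[int(np.argmax(energies))]]
            smooth += np.sum((X[:, t] - x_hc) ** 2)
    return l21 + reg_w + lam_x * ar + rho * smooth

rng = np.random.default_rng(0)
Y = rng.random((6, 30)); W = rng.random((6, 3)); X = rng.random((3, 30))
Theta = rng.random((3, 2))
print(rtnmffm_loss(Y, W, X, Theta, lags=[1, 3], lam_w=0.1, lam_x=0.1,
                   eta=0.1, rho=0.5, T=7))
```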