1. Introduction
In machine learning and other computer-related areas, the demand for dimensionality reduction methods never vanishes owing to the curse of dimensionality [1,2,3]. Generally, the amount of computation often grows exponentially with the dimensionality, so the efficiency of machine learning algorithms drops markedly if the dimension of the input data is enormous [2,4]. Given the limited computing power available at present, it is essential to devise dimensionality reduction methods to obtain sound and reliable results. On the other hand, in many fields, it is also of great interest to reduce high-dimensional data to two or three dimensions for visualization purposes [5,6,7,8,9].
For decades, a large number of dimensionality reduction methods have been applied to different tasks, among them Principal Component Analysis (PCA) [10,11,12], Multidimensional Scaling (MDS) [13,14], Sammon Mapping [15], Isomap [16], Locally Linear Embedding (LLE) [17], Laplacian Eigenmaps (LE) [18,19,20], t-Distributed Stochastic Neighbor Embedding (t-SNE) [21,22,23,24] and so on. It is well known that the first three algorithms mentioned above are linear dimensionality reduction methods, which usually break the inner structure of real-world datasets and thus yield a poor visualization map. The others are non-linear and can be categorized as manifold learning algorithms.
Manifold learning tends to outperform linear dimensionality reduction methods in data visualization. In particular, t-SNE is among the best-known manifold learning algorithms, as it can not only capture much of the local structure of the high-dimensional data but also reveal the associated global structure, such as the presence of clusters at several different scales [21].
As unprecedented as the performance of t-SNE is, several problems still remain to be addressed. Firstly, owing to the assumption in t-SNE that the high-dimensional data follow a Gaussian distribution, the distribution of the mapped data in low dimensions is always uniform and loose, and the Kullback–Leibler divergence (KLD) often converges to a high value, which prevents the algorithm from generating a sound low-dimensional map. Secondly, the separations between different natural clusters still need to be improved, as some of the data tend to be clustered into the wrong groups due to obscure boundaries. Thirdly, both the computational and space complexity of t-SNE grow quadratically with the number of data points, which severely limits the application of t-SNE to large datasets in practice.
In this paper, we significantly improve the standard t-SNE scheme by developing a preprocessing strategy for it. In our preprocessing strategy, Laplacian eigenmaps (LE) is first applied to the high-dimensional data. Thus, before the data are input into t-SNE, each data cluster can be aggregated first and the data no longer follow a Gaussian distribution.
In addition, aiming to magnify the gaps between different clusters and to enlarge the differences between data of different kinds, the K-nearest-neighbor (KNN) algorithm is also introduced into our preprocessing to shrink the Euclidean distance between each neighboring data pair. Moreover, compared with the standard t-SNE, KNN is also expected to reduce the computational and space complexity, as only the neighboring data pairs are considered in our strategy, which offers a balance between performance and complexity.
We apply our method to the MNIST dataset, which contains 70,000 handwritten digit images of 28 × 28 pixels. The training set contains 60,000 images, while the test set contains 10,000 images. The experimental results show that our strategy can significantly improve the performance of the standard t-SNE and that the recovery of the low-dimensional data structure is also reinforced, while the overall complexity only increases by about 1–2%.
The outline of this paper is as follows: Section 2 gives a quick review of the basic idea of the standard t-SNE. In Section 3, a preprocessing manifold learning strategy based on t-SNE, LE and KNN is proposed. The numerical results on the MNIST dataset are presented in Section 4. Finally, we draw some conclusions in Section 5.
2. Manifold Learning Using t-SNE
Generally speaking, dimensionality reduction methods convert the high-dimensional dataset $X = \{x_1, x_2, \ldots, x_n\}$ into two- or three-dimensional data $Y = \{y_1, y_2, \ldots, y_n\}$ that can be displayed in a scatterplot. It is argued in [25] that a set of similar data is neither randomly nor uniformly distributed in the high-dimensional space, but instead lies on or near a submanifold of much lower dimension. Manifold learning is a class of non-linear techniques that aims to find a non-linear mapping to extract the intrinsic dimensionality of the original data and to realize dimensionality reduction.
For traditional dimensionality reduction methods, such as the Locally Linear Embedding algorithm, the similarity between data is typically modeled by Euclidean distance. Thus, it is difficult for these methods to unfold “many-to-one” mappings, in which a single ambiguous object really belongs in several disparate locations in the low-dimensional space [26]. To solve this problem, the t-SNE algorithm proposed by van der Maaten and Hinton [21] employs a probabilistic model to visualize the structure of complex datasets. Specifically, t-SNE converts high-dimensional Euclidean distances between data points into joint probabilities to characterize the similarities between data.
In t-SNE, the conditional probability $p_{j|i}$ that a data point $x_i$ would pick $x_j$ as its neighbor is given by [27]
\[
p_{j|i} = \frac{\exp\!\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\!\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)},
\]
with $p_{i|i} = 0$. The variance $\sigma_i$ of the Gaussian centered around $x_i$ is determined by a binary search procedure. Since the data density is likely to vary, a smaller value of $\sigma_i$ is more appropriate in dense regions than in sparse regions.
Let $P_i$ be the conditional probability distribution over all other data points given point $x_i$. The entropy of $P_i$ grows as $\sigma_i$ increases. Then, with a fixed perplexity specified by the user, defined as $\mathrm{Perp}(P_i) = 2^{H(P_i)}$ where $H(P_i)$ is the Shannon entropy of $P_i$, a binary search for $\sigma_i$ is performed in t-SNE to produce a probability distribution $P_i$ with that perplexity. In a sense, the perplexity can be interpreted as a smooth measure of the effective number of neighbors. The performance of t-SNE is fairly robust to changes in the perplexity, which is typically between 5 and 50, depending on the size of the dataset. The joint probabilities $p_{ij}$ can then be obtained easily as follows:
\[
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n},
\]
which ensures that $\sum_j p_{ij} > \frac{1}{2n}$ for all data points $x_i$. Thus, each data point $x_i$ makes a significant contribution to the cost function [21].
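To make this construction concrete, the following Python sketch (not the authors' code; the parameter names and the simple bisection loop are our own illustrative choices, and no numerical-stability tricks are applied) computes the conditional probabilities $p_{j|i}$ with a per-point binary search that matches a user-specified perplexity, and then symmetrizes them into the joint probabilities $p_{ij}$:

```python
import numpy as np

def conditional_probabilities(X, perplexity=30.0, tol=1e-5, max_iter=50):
    """Gaussian conditional probabilities p_{j|i}; a per-point binary search
    on beta = 1/(2*sigma_i^2) matches the row entropy to log(perplexity)."""
    n = X.shape[0]
    # Squared Euclidean distances between all pairs of points.
    sq_norms = np.sum(X ** 2, axis=1)
    D = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    P = np.zeros((n, n))
    target_entropy = np.log(perplexity)
    for i in range(n):
        beta_lo, beta_hi, beta = 0.0, np.inf, 1.0
        d_i = np.delete(D[i], i)                       # distances to all j != i
        for _ in range(max_iter):
            p = np.exp(-d_i * beta)
            p_sum = np.sum(p)
            H = np.log(p_sum) + beta * np.sum(d_i * p) / p_sum  # Shannon entropy
            p /= p_sum
            if abs(H - target_entropy) < tol:
                break
            if H > target_entropy:        # too flat -> sharpen (increase beta)
                beta_lo = beta
                beta = beta * 2.0 if beta_hi == np.inf else (beta + beta_hi) / 2.0
            else:                          # too peaked -> flatten (decrease beta)
                beta_hi = beta
                beta = (beta + beta_lo) / 2.0
        P[i, np.arange(n) != i] = p
    # Symmetrize: p_ij = (p_{j|i} + p_{i|j}) / (2n).
    return (P + P.T) / (2.0 * n)
```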
In the low-dimensional map, t-SNE employs a Student t-distribution with one degree of freedom as the heavy-tailed distribution to separate different clusters from each other. The joint probabilities of the map points are given by [27]
\[
q_{ij} = \frac{\left(1 + \|y_i - y_j\|^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \|y_k - y_l\|^2\right)^{-1}}.
\]
The KLD is typically adopted to characterize the mismatch between $p_{ij}$ and $q_{ij}$. t-SNE minimizes the sum of the KLD over all data points using a gradient descent method. The cost function $C$ and the gradient of t-SNE are given by [27]
\[
C = \mathrm{KL}(P \,\|\, Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}
\]
and
\[
\frac{\partial C}{\partial y_i} = 4 \sum_j \left(p_{ij} - q_{ij}\right)\left(y_i - y_j\right)\left(1 + \|y_i - y_j\|^2\right)^{-1},
\]
respectively. Once the KLD decreases to an appropriate value, a faithful low-dimensional map is obtained.
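A corresponding sketch for the low-dimensional side, again only an illustration under the same assumptions, evaluates the Student-t joint probabilities $q_{ij}$, the KLD cost $C$ and its gradient with respect to the map points:

```python
import numpy as np

def low_dim_affinities(Y):
    """Student-t (1 d.o.f.) joint probabilities q_ij of the map points Y."""
    sq_norms = np.sum(Y ** 2, axis=1)
    num = 1.0 / (1.0 + sq_norms[:, None] + sq_norms[None, :] - 2.0 * Y @ Y.T)
    np.fill_diagonal(num, 0.0)           # q_ii is defined to be zero
    return num / np.sum(num), num        # normalized q_ij and the raw kernel

def kl_cost_and_gradient(P, Y):
    """Cost C = KL(P || Q) and its gradient with respect to the map points Y."""
    Q, num = low_dim_affinities(Y)
    Q = np.maximum(Q, 1e-12)
    C = np.sum(P * np.log(np.maximum(P, 1e-12) / Q))
    # dC/dy_i = 4 * sum_j (p_ij - q_ij) (y_i - y_j) (1 + ||y_i - y_j||^2)^{-1}
    PQ = (P - Q) * num
    grad = 4.0 * (np.diag(PQ.sum(axis=1)) - PQ) @ Y
    return C, grad
```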
3. The Proposed Preprocessing Manifold Learning Strategy
We noticed that the drawbacks of t-SNE mentioned above are partly caused by the Gaussian modeling of the high-dimensional data. In other words, due to the algorithm design of t-SNE, the mapped data will be uniformly but loosely distributed in the low dimension given that the high-dimensional data follow a Gaussian distribution. Thus, a natural way to handle this problem is to modify the original data distribution in advance, in a reasonable way, to fit the standard t-SNE. By reasonable, we mean that the updated data distribution tailored to t-SNE also inherits the characteristics of the original data distribution.
Laplacian eigenmaps is another efficient manifold learning method, which maps high-dimensional data into a lower dimension by solving a generalized eigenvector problem [19]. It shares some properties with LLE, e.g., it also employs weights rather than probabilities to realize dimensionality reduction; hence, a tighter map than that of t-SNE can be obtained. Thus, the idea of LE can be introduced here to preprocess the original high-dimensional data. Following the dimensionality reduction strategy of LE [28], we first find the k-nearest neighbors of each data point using KNN and characterize their relations with weights $W_{ij}$ as follows:
\[
W_{ij} =
\begin{cases}
\exp\!\left(-\|x_i - x_j\|^2 / t\right), & \text{if } x_i \text{ and } x_j \text{ are neighbors},\\
0, & \text{otherwise}.
\end{cases}
\]
Then the cost function to be minimized is given by
\[
\sum_{i,j} \|y_i - y_j\|^2 \, W_{ij}.
\]
To minimize this cost function, three matrices $W$, $D$ and $L$ are defined in LE, in which $D$ is a diagonal matrix with $D_{ii} = \sum_j W_{ji}$, and $W$ is composed of the weights $W_{ij}$, where $W_{ij}$ is the element at row $i$ and column $j$. The Laplacian matrix [28] is defined as
\[
L = D - W.
\]
Thus, the first dimensionality reduction can be accomplished by solving the generalized eigenvector problem
\[
L f = \lambda D f, \qquad (8)
\]
and the $m$-dimensional mapped data are given by the eigenvectors corresponding to the smallest $m$ non-zero eigenvalues in Equation (8) [28].
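The LE preprocessing step can be sketched as follows. This is a dense, illustrative implementation assuming heat-kernel weights and hypothetical parameters k (number of neighbors), t (kernel width) and m (target dimensionality); it also assumes the KNN graph is connected, and it is not the authors' implementation:

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.neighbors import NearestNeighbors

def laplacian_eigenmaps(X, k=10, t=1.0, m=2):
    """KNN graph, heat-kernel weights W, degree matrix D, Laplacian L = D - W,
    then the generalized eigenproblem L f = lambda D f; the embedding uses the
    eigenvectors of the m smallest non-zero eigenvalues."""
    n = X.shape[0]
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, idx = nn.kneighbors(X)                  # column 0 is the point itself
    W = np.zeros((n, n))
    for i in range(n):
        for d_ij, j in zip(dist[i, 1:], idx[i, 1:]):
            w = np.exp(-d_ij ** 2 / t)            # heat-kernel weight
            W[i, j] = W[j, i] = max(W[i, j], w)   # symmetrize the KNN graph
    D = np.diag(W.sum(axis=1))
    L = D - W
    # Generalized eigenproblem L f = lambda D f, eigenvalues in ascending order.
    eigvals, eigvecs = eigh(L, D)
    # Skip the first (zero) eigenvalue; keep the next m eigenvectors as coordinates.
    return eigvecs[:, 1:m + 1]
```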
It is easy to see that, by using LE, the data points can be aggregated more tightly with their neighbors and the mapped data no longer follow a Gaussian distribution. Thus, if we take the mapped data preprocessed by LE as the input of t-SNE, the loose-distribution problem associated with t-SNE can be significantly alleviated.
In Figure 1 and Figure 2, we show the visualization of 5000 data points selected randomly from MNIST by using the standard t-SNE and by using t-SNE with LE as the preprocessing step, respectively. The procedure of the latter strategy is as follows: we first use KNN to obtain the nearest neighbors of each data point; then, we reduce the dimensionality of the original data to $m$ using the concept of LE; finally, we take those $m$-dimensional data as the input of t-SNE to further accomplish the dimensionality reduction.
Just as we expect, t-SNE implements the t-distribution to solve the crowding problem, and a relatively obvious gap between different data clusters can be formed. However, the clusters themselves are still loose to some extent, which undermines the ability of t-SNE to form tight clusters. Worse yet, the entropy of each cluster is too high to yield a small KLD. On the other hand, from Figure 2, it is evident that not only can gaps between each of the ten data clusters be formed by using the preprocessing strategy, but each data point also tends to be gathered with its neighboring points.
However, we notice that the gaps between different clusters are still not satisfactory, both for the standard t-SNE and for the one with LE as the preprocessing step, as shown in Figure 1 and Figure 2. Moreover, as both the computational and space complexity of t-SNE grow quadratically with the number of data points, it is of great interest to reduce the number of data pairs involved in the final t-SNE.
Here, we continue to preprocess the data that have already undergone dimensionality reduction by LE before they are processed by t-SNE. Inspired by the sparse approximation strategy proposed in [29], we again use the KNN algorithm to find the neighbors of each data point, then aggregate neighboring similar data points and weaken the relationships between dissimilar data pairs. As in [29], the pairwise similarity between data points, $p_{j|i}$, is redefined by
\[
p_{j|i} =
\begin{cases}
\dfrac{\exp\!\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \in \mathcal{N}_i} \exp\!\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)}, & \text{if } j \in \mathcal{N}_i,\\
0, & \text{otherwise},
\end{cases}
\]
where $\mathcal{N}_i$ denotes the set of nearest neighbors of $x_i$.
In this way, we propose another strategy by which we decrease the distance between neighboring data points in order to increase the probability that a data point picks its real neighbors.
After we implement LE, we perform the KNN algorithm again to find the neighbors of each data point, and then we introduce a coefficient that shrinks the distances between neighboring data points. This makes sense and does little harm to the original data structure, since we have already implemented Laplacian eigenmaps to preprocess the data. With regard to the MNIST dataset, we set this coefficient to a fixed empirical value.
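A hedged sketch of this second preprocessing stage is given below. The shrink coefficient, the fixed Gaussian bandwidth sigma and the neighbor count k are placeholders (the paper's exact values and the precise form of the coefficient are not reproduced here), and a fixed bandwidth is used instead of the perplexity-based binary search for simplicity:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def shrink_and_sparsify(Z, k=20, shrink=0.5, sigma=1.0):
    """Sketch of the second preprocessing stage on the LE output Z:
    (1) keep only KNN pairs, (2) multiply neighbor distances by a shrink
    coefficient < 1, (3) build Gaussian conditional probabilities restricted
    to the neighborhood, with p_{j|i} = 0 for non-neighbors."""
    n = Z.shape[0]
    nn = NearestNeighbors(n_neighbors=k + 1).fit(Z)
    dist, idx = nn.kneighbors(Z)                 # column 0 is the point itself
    P = np.zeros((n, n))
    for i in range(n):
        d = shrink * dist[i, 1:]                 # shrunken neighbor distances
        p = np.exp(-d ** 2 / (2.0 * sigma ** 2))
        P[i, idx[i, 1:]] = p / p.sum()           # probability mass only on neighbors
    return (P + P.T) / (2.0 * n)                 # symmetric joint probabilities
```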
Now, we present our dimensionality reduction strategy with preprocessing as shown in Algorithm 1.
Algorithm 1 A preprocessing manifold learning strategy based on t-SNE.
Input: The high-dimensional dataset X.
Output: The two-dimensional data Y after the dimensionality reduction.
1: Compute the Euclidean distances between all pairs of high-dimensional data points;
2: Apply the KNN algorithm to find the nearest neighbors of each data point;
3: Apply Laplacian eigenmaps to the original data and reduce its dimensionality to $m$;
4: Apply the KNN algorithm again to find the nearest neighbors of each data point, and for any two data points that are not neighbors, set their conditional probability $p_{j|i}$ to 0;
5: Decrease the distances between neighboring data points by applying the shrinking coefficient to the neighboring data pairs;
6: Compute the joint probabilities $p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}$;
7: Initialize the low-dimensional mapped data $Y^{(0)}$;
8: Compute the joint probabilities of the mapped points, $q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \neq l}(1 + \|y_k - y_l\|^2)^{-1}}$;
9: Compute the gradient $\frac{\partial C}{\partial y_i} = 4\sum_j (p_{ij} - q_{ij})(y_i - y_j)(1 + \|y_i - y_j\|^2)^{-1}$;
10: Compute $Y^{(t)} = Y^{(t-1)} + \eta \frac{\partial C}{\partial Y} + \alpha(t)\left(Y^{(t-1)} - Y^{(t-2)}\right)$, where $\eta$ represents the learning rate and $\alpha(t)$ the momentum;
11: Repeat steps 8, 9 and 10 until $t$ reaches the preset number of iterations; this implements t-SNE on the preprocessed data and reduces the dimensionality of the mapped data to 2.
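For readers who want to try the overall pipeline quickly, the following sketch approximates Algorithm 1 with off-the-shelf scikit-learn components: SpectralEmbedding plays the role of the LE preprocessing (steps 1–3), and the library TSNE stands in for steps 4–11, so the neighbor-shrinking coefficient of step 5 is not applied; all parameter values are illustrative assumptions rather than the settings used in the experiments.

```python
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.manifold import SpectralEmbedding, TSNE

# Load MNIST and draw a random subsample of 5000 points, as in the experiments.
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
rng = np.random.default_rng(0)
sample = rng.choice(X.shape[0], size=5000, replace=False)
X_s, y_s = X[sample] / 255.0, y[sample]     # y_s can be used to color the map

# Steps 1-3 (approximated): KNN graph + LE-style preliminary reduction.
le = SpectralEmbedding(n_components=20, n_neighbors=80,
                       affinity="nearest_neighbors")
X_le = le.fit_transform(X_s)

# Steps 4-11 (approximated): t-SNE on the preprocessed data down to 2 dimensions.
Y = TSNE(n_components=2, perplexity=30, init="pca",
         random_state=0).fit_transform(X_le)
```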
In Figure 3, we show the visualization of 5000 data points selected randomly from MNIST with our preprocessing strategy. The procedure is as follows: first, we use KNN to obtain the nearest neighbors of each data point; then, we apply LE to reduce the dimensionality of the original data; after that, we implement the KNN algorithm again to find the nearest neighbors of each data point; and finally, we apply t-SNE to the preprocessed data and reduce their dimensionality to 2.
We have also applied our strategy to the Coil-100 and Fashion-MNIST datasets. The former is a collection of color images of 100 objects taken from different viewing angles, with 72 images per object; the size of each image is uniformly processed to 128 × 128 pixels. The latter is a clothing image dataset from Zalando (Berlin, Germany), containing a training set of 60,000 samples and a test set of 10,000 samples; each sample is a 28 × 28 grayscale image associated with one of 10 category labels. The performance of the t-SNE algorithm and of our strategy on these datasets is shown in Figure 4, Figure 5, Figure 6 and Figure 7. According to the simulation results, our strategy performs better on the MNIST and Fashion-MNIST datasets. For the Coil-100 dataset, although the performance of our strategy is slightly unsatisfactory due to the high dimensionality of the data, it is still better than that of the standard t-SNE algorithm.
4. Discussion
As in the previous simulations, in Figure 3 we again show the visualization of 5000 data points selected randomly from MNIST, but this time obtained by our approach (Algorithm 1). It can be seen that, compared with Figure 2, our strategy shows more pronounced benefits, since data of the same kind are aggregated far more tightly. In addition, the gaps between different clusters are also enlarged. This makes it considerably easier to extract each cluster separately.
Just as we mentioned in the last section, the strategy by which we decrease the distance between neighboring data points destroys the data structure in some cases, leading to an increase in the KLD to some extent. However, the KLD can be significantly reduced by using the idea of LE in step 3 of Algorithm 1. In Figure 8, we show how the KLD evolves as the number of iterations increases. It is evident that, by using our preprocessing strategy, the KLD plateaus at around 1 after 500 iterations, while both the standard t-SNE and the SNE plateau at considerably higher values. In other words, the KLD yielded by our approach is markedly lower than that of the standard t-SNE, and even more so than that of the SNE.
We also compare the computational complexity of the three approaches in terms of running time. As shown in Table 1, the running time of our approach increases by only about 1–2% compared with the standard t-SNE. There are two reasons for this. Firstly, LE is a highly efficient algorithm; when it is introduced into our preprocessing strategy to accomplish the preliminary dimensionality reduction, the extra complexity can almost be ignored. Secondly, although the KNN algorithm is used twice in our approach, the number of data pairs considered in t-SNE is reduced correspondingly. In other words, the extra complexity caused by KNN is offset by the savings it brings.
As for the space complexity, since we have added a preprocessing stage to the standard t-SNE algorithm, some extra space is required, mainly for the Laplacian eigenmaps and the KNN algorithm. Firstly, for the KNN algorithm, each data point is stored as a separate object or array, so its space complexity is $O(nd)$, where $n$ is the number of data points and $d$ is the dimensionality of each data point. Secondly, Laplacian eigenmaps needs to construct the graph Laplacian matrix, which is an $n \times n$ matrix, so its space complexity is $O(n^2)$. As the standard t-SNE needs to compute and store the joint probabilities of the mapped points, its space complexity is also $O(n^2)$. In conclusion, our strategy requires approximately twice as much space as the standard t-SNE algorithm.
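As a back-of-the-envelope illustration of this space-complexity argument (using assumed example values n = 5000 and d = 784, and 64-bit floats; these are not the paper's reported figures), the dominant allocations can be estimated as follows:

```python
n, d = 5000, 784                    # example values: 5000 MNIST samples of 784 pixels
bytes_per_float = 8                 # assuming 64-bit floats

knn_storage   = n * d * bytes_per_float   # O(nd): the stored data points for KNN
laplacian     = n * n * bytes_per_float   # O(n^2): graph Laplacian matrix for LE
tsne_pairwise = n * n * bytes_per_float   # O(n^2): joint probabilities in t-SNE

for name, b in [("KNN / data", knn_storage),
                ("Laplacian", laplacian),
                ("t-SNE P/Q", tsne_pairwise)]:
    print(f"{name:12s} ~ {b / 1e6:8.1f} MB")
```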
To show the effect of the different parameters on the performance of our algorithm, we performed simulations with different parameter settings and recorded the corresponding gradient descent processes. Since the two neighbor numbers used by the KNN algorithm play the same role, we kept them equal throughout. In Figure 9, we show that a suitable choice of the remaining parameter brings good performance to our strategy. In Figure 10, we show that the performance does improve as the neighbor numbers increase, but at the cost of higher complexity, as more neighboring points are involved. On the other hand, we find that when the neighbor numbers are increased to 80, the performance reaches a saturated state.
Moreover, as we can set those parameters according to different scenarios, our approach offers a flexible balance between complexity and performance as required. We can now safely conclude that our strategy significantly improves the performance of the standard t-SNE while the complexity remains almost the same.