2.2.1. Applying the MDL Principle, NML and SNML Codelengths
Word2vec was derived from the distributional hypothesis of Harris [24], which states that words occurring in similar contexts have similar meanings. Therefore, assuming that a true context distribution exists for each given word, it is reasonable to choose the dimensionality that can learn a context distribution closest to this true distribution. The MDL principle [22] is a powerful approach to model selection, and we adopt it here for dimensionality selection.
The MDL principle states that the best hypothesis (i.e., a model and its parameters) for a given set of data is the one that leads to the best compression of the data, namely the minimum codelength [22]. Specifically, we regard each dimensionality $d$ as corresponding to a probability model class $\mathcal{M}_d$.
Assuming that a data series can be encoded as a sequence of only 0s and 1s, the length of this binary sequence is called the codelength of the data series $x^n = x_1 x_2 \cdots x_n$. We write $L(x^n; \mathcal{M})$ for the codelength of the data obtained when it is encoded using the information given by the model class $\mathcal{M}$. The MDL principle states that the closer the model class is to the true distribution that generated $x^n$, the shorter the codelength that can be obtained.
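As a brief reminder of the standard information-theoretic fact behind this statement (general background, not specific to our models): under an ideal code, a data series $x^n$ with probability $P(x^n)$ can be encoded in
$$
L(x^n) = -\log_2 P(x^n) \ \text{bits},
$$
so, for example, a model class assigning $P(x^n) = 2^{-6}$ encodes the series in 6 bits, whereas one assigning $P(x^n) = 2^{-10}$ needs 10 bits; the model class closer to the true generating distribution assigns higher probability to the observed data and therefore compresses it into fewer bits.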
Given a model class, there are several methods for estimating the shortest codelength of a given dataset, such as the two-part codelength, the Bayesian codelength [25], and the NML and SNML codelengths. Among them, the NML codelength is the best-known codelength in the MDL literature, as it achieves the minimax regret [23]. The NML codelength is given by
$$
L_{\mathrm{NML}}(x^n; \mathcal{M}_d) = -\log p\bigl(x^n \mid \hat{\theta}(x^n)\bigr) + \log \sum_{y^n \in \mathcal{X}^n} p\bigl(y^n \mid \hat{\theta}(y^n)\bigr),
$$
where $\hat{\theta}(\cdot)$ denotes the maximum likelihood estimator, the second term is known as the Parametric Complexity (PC), and $\mathcal{X}^n$ denotes the set of all possible data series of length $n$.
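To make the cost of the PC term concrete, the following sketch (an illustration of ours, not part of the proposed method) computes the exact NML codelength for a simple Bernoulli model class; the enumeration of all $2^n$ binary sequences inside parametric_complexity is precisely what makes exact NML impractical for long data series.

```python
import itertools
import math

def bernoulli_max_lik(seq):
    """Maximized likelihood of a binary sequence under a Bernoulli model."""
    n, k = len(seq), sum(seq)
    if k == 0 or k == n:
        return 1.0
    p = k / n
    return (p ** k) * ((1 - p) ** (n - k))

def parametric_complexity(n):
    """PC: sum of maximized likelihoods over all 2**n binary sequences.
    This exhaustive enumeration is what makes exact NML intractable."""
    return sum(bernoulli_max_lik(seq)
               for seq in itertools.product([0, 1], repeat=n))

def nml_codelength(seq):
    """NML codelength (bits) = -log maximized likelihood + log PC."""
    return (-math.log2(bernoulli_max_lik(seq))
            + math.log2(parametric_complexity(len(seq))))

data = [1, 0, 1, 1, 0, 1, 1, 1]
print(f"NML codelength of {data}: {nml_codelength(data):.3f} bits")
```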
However, the PC term involves extensive computation and is not realistic to implement. Instead, we apply the SNML codelength [26] in this study to reduce the computational cost:
$$
L_{\mathrm{SNML}}(x^n; \mathcal{M}_d) = \sum_{t=1}^{n} L_{\mathrm{SNML}}(x_t \mid x^{t-1}; \mathcal{M}_d),
$$
$$
L_{\mathrm{SNML}}(x_t \mid x^{t-1}; \mathcal{M}_d) = -\log \frac{p\bigl(x^{t-1}, x_t \mid \hat{\theta}(x^{t-1}, x_t)\bigr)}{\sum_{y \in \mathcal{X}} p\bigl(x^{t-1}, y \mid \hat{\theta}(x^{t-1}, y)\bigr)},
$$
where $x^t$ denotes the data series $x_1 x_2 \cdots x_t$ (with $x^0 = \emptyset$) and $\mathcal{X}$ is the set of possible values of a single datum. In other words, the SNML codelength is the total of per-datum codelengths, each computed in an NML-like manner at the time the datum is input. The SNML codelength is known to be a good approximation to the NML codelength [27]. Since the SNML codelength is computed sequentially, its computational cost at each step is much lower than that of the NML codelength. Because the data records are assumed to be independent, and word2vec training is performed after the records have been shuffled, the sequentially computed SNML codelength does not depend on the order of the data.
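By contrast with exhaustive NML, the following sketch (again a toy Bernoulli illustration of ours, not the word2vec model) computes the SNML codelength sequentially; at each step only the two possible continuations of the current prefix are evaluated, so no summation over all length-$n$ series is required.

```python
import math

def bernoulli_max_lik(seq):
    """Maximized likelihood of a binary sequence under a Bernoulli model."""
    n, k = len(seq), sum(seq)
    if k == 0 or k == n:
        return 1.0
    p = k / n
    return (p ** k) * ((1 - p) ** (n - k))

def snml_codelength(seq):
    """SNML codelength (bits): at each step, the maximized likelihood of the
    extended prefix is normalized over the two possible next symbols."""
    total = 0.0
    for t in range(1, len(seq) + 1):
        prefix, x_t = list(seq[:t - 1]), seq[t - 1]
        numerator = bernoulli_max_lik(prefix + [x_t])
        denominator = sum(bernoulli_max_lik(prefix + [y]) for y in (0, 1))
        total += -math.log2(numerator / denominator)
    return total

data = [1, 0, 1, 1, 0, 1, 1, 1]
print(f"SNML codelength of {data}: {snml_codelength(data):.3f} bits")
```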
In addition, the SNML codelength function can be applied to oSG and SGNS. Regarding each data record as a word–context pair $x_t = (w_t, c_t)$, the per-record codelength takes the form
$$
L_{\mathrm{SNML}}(c_t \mid w_t, x^{t-1}; \mathcal{M}_d) = -\log \frac{p\bigl(c_t \mid w_t; \hat{\theta}(x^{t-1}, (w_t, c_t))\bigr)}{\sum_{c \in \mathcal{C}} p\bigl(c \mid w_t; \hat{\theta}(x^{t-1}, (w_t, c))\bigr)},
$$
where $p(c \mid w; \theta)$ is the conditional probability of context $c$ given word $w$ under the respective model, and $\mathcal{C}$ is the set of all possible one-hot vectors of $V$ dimensions, $V$ being the size of the context set.
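The sketch below illustrates the structure of this per-record computation for a conditional model in the spirit of oSG. The toy model, its fitting routine, and all names are ours; in particular, a smoothed count table stands in for the word2vec maximum likelihood fit purely to keep the example short and runnable. The point is the inner loop: every candidate context requires its own refit on the extended history, which is what makes the denominator expensive when the context set is large.

```python
import math

V = 6  # toy context-set size (our choice, for illustration only)

def fit(history):
    """Stand-in for maximum likelihood estimation: a smoothed count model
    for p(c | w).  In the actual method this would be a word2vec model
    retrained on the history; a count table keeps the sketch short."""
    counts = [[1.0] * V for _ in range(V)]   # Laplace smoothing avoids zeros
    for w, c in history:
        counts[w][c] += 1.0
    return [[x / sum(row) for x in row] for row in counts]

def snml_record_codelength(history, w_t, c_t):
    """SNML codelength (bits) of one record (w_t, c_t) given the history.
    Every candidate context c needs its own refit, i.e. |C| fits per record."""
    numerator = fit(history + [(w_t, c_t)])[w_t][c_t]
    denominator = sum(fit(history + [(w_t, c)])[w_t][c] for c in range(V))
    return -math.log2(numerator / denominator)

history = [(0, 1), (0, 2), (1, 3), (2, 4), (0, 1)]  # toy (word, context) pairs
print(f"codelength of the next record: "
      f"{snml_record_codelength(history, 0, 1):.3f} bits")
```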
2.2.2. Some Heuristics Associated with SNML Codelength Calculation
The computation of the SNML codelength is still costly: it requires executing the maximum likelihood estimation $|\mathcal{C}|$ times for each data record $x_t$ (once per candidate context), which is also not realistic. We introduce two techniques for reducing the computational cost of SNML: heuristic comparison and importance sampling on the SNML codelength.
A simple observation is that if a model class yields the shortest codelength for the whole dataset, it also tends to yield the shortest codelength on a part of that data. Therefore, instead of computing the codelength for all n records, we can compare model classes using the codelength of a small subset. In fact, our experiments show that focusing on the last several thousand records of the data is sufficient to compare model classes.
Figure 1 shows the differences between the SNML codelengths obtained with various dimensionalities and the dimensionality that achieves the shortest codelength on the data. The vertical axis shows the difference between the data codelengths obtained with the two dimensionalities indicated in the legend (e.g., $d_1$ vs. $d_2$ dim); specifically, it is calculated as $L(x^n; \mathcal{M}_{d_1}) - L(x^n; \mathcal{M}_{d_2})$, where $L$ is the codelength function, $x^n$ is the data, and $d_1$ and $d_2$ are the dimensionalities being compared, while the horizontal axis shows the number of records $n$ of $x^n$.
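The quantity on the vertical axis of Figure 1 can be tracked with simple bookkeeping, as in the sketch below (our illustration; snml_record_codelength(d, history, record) is a hypothetical helper returning the per-record SNML codelength for dimensionality d, e.g., along the lines of the sketch above).

```python
def codelength_difference_curve(records, d1, d2, snml_record_codelength):
    """Running difference L(x^t; d1) - L(x^t; d2) as a function of t, i.e.,
    the quantity shown on the vertical axis of Figure 1.

    `snml_record_codelength(d, history, record)` is assumed to return the
    SNML codelength of one record under dimensionality d (placeholder)."""
    diff, history, curve = 0.0, [], []
    for record in records:
        diff += (snml_record_codelength(d1, history, record)
                 - snml_record_codelength(d2, history, record))
        history.append(record)
        curve.append(diff)   # curve[t-1] = L(x^t; d1) - L(x^t; d2)
    return curve
```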
To facilitate comparisons among dimensionalities that are markedly different from one another (such as 30 dimensions versus 65 dimensions in
Figure 1(1), or 200 dimensions versus 130 dimensions in
Figure 1(2)), it is sufficient to use only 6000 data records to provide information about the best dimensionality to be chosen. Therefore, adding data thereafter simply increases the SNML codelength but does not change our answer substantially. However, for similar dimensionalities, such as 60, 65, and 70 dimensions in
Figure 1(3), the first one million data records are not enough to identify the optimal dimensionality. This ambiguity arises when the codelengths of two model classes are not very different. Furthermore, the number of data records required to determine the best dimensionality depends on the nature of the dataset and the task itself. For example, in the case of word2vec, because SGNS randomly draws only a few negative samples from a large context set, the codelength of each data record varies more than in oSG, which does not sample negative contexts. Therefore, SGNS needs more data records to distinguish between candidate dimensionalities.
To ensure that the correct model is chosen, we need to increase the number of records used to estimate the SNML codelength, so as to allow a better comparison of such similar dimensionalities. However, a small error in the selected dimensionality of word2vec does not affect the final performance considerably. Therefore, the trade-off between computation time and model selection accuracy is governed by how many records are used to estimate the SNML codelength.
Since the size of the context set is large (approximately 30,000–100,000 or more, depending on the training dataset), the computation of the PC term for SNML in oSG is still very expensive. We apply importance sampling to approximately estimate the SNML codelength for each data record. In detail, if a distribution $Q$ on the context set satisfies $Q(c) > 0$ for every context $c \in \mathcal{C}$, the following formula can be applied.
Let $f(c) = p\bigl(c \mid w_t; \hat{\theta}(x^{t-1}, (w_t, c))\bigr)$; then
$$
\sum_{c \in \mathcal{C}} f(c) = \mathbb{E}_{c \sim Q}\!\left[\frac{f(c)}{Q(c)}\right] \approx \frac{1}{m} \sum_{c \in S} \frac{f(c)}{Q(c)},
$$
where $S$ is a set of $m$ samples drawn from distribution $Q$.
This estimate approaches the true value as $m$ (the number of samples) increases, and its error is smaller the more similar the distribution $Q$ is to $f$ (up to normalization). In our experiment, the uniform distribution is the best choice for distribution $Q$, and the sampling size is chosen to be 1/10 of the context set size to balance computation time and sampling error.
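A minimal sketch of this estimate, assuming a hypothetical function f(c) that returns the maximized-likelihood term for candidate context c: with a uniform Q, the weight 1/Q(c) is simply |C|, and the sample size is 1/10 of the context set as described.

```python
import random

def importance_sampled_denominator(f, context_size, sample_fraction=0.1, seed=0):
    """Approximate sum_{c in C} f(c) by importance sampling with uniform Q.

    With Q uniform, f(c)/Q(c) = |C| * f(c), so the estimate is |C| times the
    sample mean of f over S.  `f` is a placeholder for the maximized-likelihood
    term of a candidate context."""
    m = max(1, int(sample_fraction * context_size))
    rng = random.Random(seed)
    samples = [rng.randrange(context_size) for _ in range(m)]  # S drawn from Q
    return context_size * sum(f(c) for c in samples) / m

# toy usage with an arbitrary stand-in for f
approx = importance_sampled_denominator(lambda c: 1.0 / (1 + c), context_size=30000)
print(f"approximate denominator: {approx:.2f}")
```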