In this section, we describe the experimental procedure, including the setup, baselines, results, and discussion.
4.1. Setup
The dataset we used is ASAP, a Kaggle competition dataset sponsored by the William and Flora Hewlett Foundation (Hewlett Foundation) in 2012. Many researchers have conducted AES studies on this dataset, so choosing it allows a direct comparison with previous experimental results. It contains eight prompts, each belonging to a different genre, and is described in Table 2.
We take Stanford's publicly available 50-dimensional GloVe embedding [40] as the pre-trained word embedding instead of training our own, because we believe that using a third-party pre-trained embedding makes the model more general and more open. The data is tokenized with the Natural Language Toolkit (NLTK, http://www.nltk.org/) tokenizer. Words that cannot be found in the pre-trained embedding are replaced with the UNKNOW token. In addition, we adopt the QWK metric introduced in Section 3.2 to measure the output results and use 5-fold cross-validation to evaluate our model.
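This preprocessing and evaluation pipeline can be illustrated with a minimal sketch. The snippet below loads the GloVe vectors, tokenizes with NLTK, maps out-of-vocabulary words to UNKNOW, and computes QWK; the GloVe file name, the zero vector assigned to UNKNOW, and the use of scikit-learn for the Kappa computation are our assumptions, not details specified in the paper.

```python
# Minimal sketch of the preprocessing and evaluation described above.
import numpy as np
import nltk  # requires the "punkt" tokenizer data: nltk.download("punkt")
from sklearn.metrics import cohen_kappa_score

# Load Stanford's 50-dimensional GloVe vectors (file name assumed).
embeddings = {}
with open("glove.6B.50d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)

# Words missing from the pre-trained vocabulary map to the UNKNOW token;
# here it gets a zero vector (the paper does not specify its value).
embeddings["UNKNOW"] = np.zeros(50, dtype=np.float32)

def embed_essay(text):
    """Tokenize with NLTK and look up the pre-trained vector per token."""
    tokens = nltk.word_tokenize(text)
    return np.stack([embeddings.get(t.lower(), embeddings["UNKNOW"])
                     for t in tokens])

def qwk(y_true, y_pred):
    """Quadratic weighted Kappa (the QWK metric of Section 3.2)."""
    return cohen_kappa_score(y_true, y_pred, weights="quadratic")
```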
The experiments were run under Windows 10, Python 3.6, and TensorFlow-GPU 1.4, on the following hardware: CPU: Intel(R) Xeon(R) L5640 @ 2.27 GHz; RAM: 16 GB; HDD: 100 GB; GPU: GTX 1080i.
4.3. Results and Discussion
The results are listed in Table 3. Our model SBLSTMA outperforms both baseline models (LSTM-CNN-att and SKIPFLOW) in average QWK (quadratic weighted Kappa), and the improvement is statistically significant under a one-tailed t-test.
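As an illustration of this significance test, the sketch below runs a paired, one-tailed t-test on per-fold QWK scores. The fold scores are hypothetical placeholders, not the paper's numbers; SciPy reports a two-sided p-value, which is halved for the one-tailed case after checking the direction of the difference.

```python
# Hedged sketch: paired one-tailed t-test on per-fold QWK scores.
import numpy as np
from scipy import stats

# Hypothetical per-fold QWK values for illustration only.
qwk_sblstma = np.array([0.81, 0.79, 0.83, 0.80, 0.82])
qwk_baseline = np.array([0.77, 0.76, 0.79, 0.78, 0.77])

t, p_two_sided = stats.ttest_rel(qwk_sblstma, qwk_baseline)
# One-tailed p-value for the hypothesis "SBLSTMA is better".
p_one_tailed = p_two_sided / 2 if t > 0 else 1 - p_two_sided / 2
```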
From Table 3, we can see that the empirical results are significantly improved. We attribute this to the knowledge of the rating criteria, the distance information, which plays a very significant role. To explain this, we further decompose the model SBLSTMA into combined submodels. As described in Section 3.3, SBLSTMA consists of the modules Ma, Mb, and Mc, from which we can form three combined models: Ma + Mc, Mb + Mc, and Ma + Mb + Mc. Ma + Mc receives the essay only, without the rating criteria information, and during training it computes the inner-feature information in the essay. Mb + Mc receives the distance information and, during training, computes the inner-feature information in the distance information. Ma + Mb + Mc receives both the essay and the samples during training and computes both inner-feature and cross-feature information. We give the experimental results in Table 4; the sample sets used are listed in Table 5.
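The ablation wiring can be sketched schematically. In the snippet below, module_a, module_b, and module_c are hypothetical stand-ins for the Ma, Mb, and Mc modules of Section 3.3, whose internals are not reproduced here; passing or omitting each module reproduces the three combined models.

```python
# Hedged sketch of the three combined models; the module callables are
# hypothetical placeholders for Ma, Mb, and Mc.
def combined_model(essay, distance, module_a, module_b, module_c,
                   use_ma=True, use_mb=True):
    features = []
    if use_ma:                        # Ma: inner features of the essay
        features.append(module_a(essay))
    if use_mb:                        # Mb: inner features of the distance info
        features.append(module_b(distance))
    return module_c(features)         # Mc: scoring on the collected features
```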
The distance information is based on the sample set described in Section 3.1 and is directly related to the quality of the experimental results, so we need to find samples that reflect the rating criteria as accurately as possible. The maximum size of the sample set depends on the range of the essay scores, but we cannot select essays of every distinct score as samples, especially for prompts with a large score range; doing so makes training very time-consuming, and the results are not necessarily better. Empirical results show that, for a dataset with a narrow score range, we can usually take one sample per distinct score as the sample set, as for prompts 3, 4, 5, and 6; for a dataset with a large score range, we select a subset of the samples, as for prompts 1, 2, 7, and 8. For a dataset with a large score range, we construct the sample set according to the following steps (a sketch follows the list):
➀ According to Equation (6), compute all the samples of each prompt.
➁ For each sample in a prompt, run a pre-training under Mb + Mc and sort the samples by the Kappa value of the training results.
➂ Take the first sample in the order given by step ➁ as the initial sample set. If the training result is below the threshold (an expected result initialized beforehand), add the next sample in the order to the sample set, and so on, until the result exceeds the threshold or all samples have been added.
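A minimal sketch of this greedy selection is given below. Here pretrain_qwk is a hypothetical helper standing in for a pre-training run of Mb + Mc that returns the resulting Kappa value; the paper does not name such a function.

```python
# Greedy sample-set construction following steps ➀-➂ above.
def select_sample_set(samples, pretrain_qwk, threshold):
    # Step ➁: rank samples by the Kappa value of a pre-training run each.
    ranked = sorted(samples, key=lambda s: pretrain_qwk([s]), reverse=True)
    # Step ➂: grow the set greedily until the result beats the threshold.
    sample_set = []
    for s in ranked:
        sample_set.append(s)
        if pretrain_qwk(sample_set) >= threshold:
            break
    return sample_set
```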
Take prompt 4, for example: each score in its range has a corresponding sample. By pre-training, we obtain an ordering of the samples in which the first sample gives the best training result, the second the next best, and so on. We then take the first sample as the initial sample set, add the second sample next, and so on. Table 5 shows the samples that we used in the experiment.
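Continuing the sketch above, and assuming prompt 4's 0-3 score range, a hypothetical invocation might look as follows; sample_for_score, pretrain_qwk, and the threshold are illustrative, not values from the paper.

```python
# Hypothetical usage for prompt 4 (scores 0-3); `sample_for_score` maps a
# score to its computed sample, and the threshold is illustrative.
samples_p4 = [sample_for_score[s] for s in (0, 1, 2, 3)]
chosen = select_sample_set(samples_p4, pretrain_qwk, threshold=0.75)
```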
The results of each decomposed submodel listed in Table 4 show that the Kappa value under model Mb + Mc is better than under Ma + Mc. This means that the distance information as an input is useful for training: such an input based on the rating criteria carries more rating information and does reflect a certain distance between the essay and the sample. For a more intuitive explanation, Figure 3 provides the Kappa value diagrams of the first 100 epochs of all eight prompts under Ma + Mc and Mb + Mc.
Figure 3 intuitively shows that the Kappa value under Mb + Mc is better than under Ma + Mc. Furthermore, Table 6 shows the mean value and standard deviation under Ma + Mc, Mb + Mc, and Ma + Mb + Mc. The mean value reflects how good the training results are, while the standard deviation indicates the size of the training space and the training stability; given a greater mean value, a greater standard deviation indicates better results.
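These summary statistics are straightforward to compute. In the sketch below, kappa_per_epoch is a hypothetical dict mapping a model name to its per-epoch QWK values; the 100-epoch window matches the one used for Table 6.

```python
# Sketch: per-model mean and standard deviation of QWK over the first
# 100 epochs, as summarized in Table 6.
import numpy as np

def summarize(kappa_per_epoch, n_epochs=100):
    return {name: (np.mean(vals[:n_epochs]), np.std(vals[:n_epochs]))
            for name, vals in kappa_per_epoch.items()}
```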
From Table 6, we can conclude that training under Mb + Mc is better than training under Ma + Mc, and that training under Ma + Mb + Mc is much more stable than under the other two. Table 6 also shows that the mean value and standard deviation of prompt 8 are relatively poor for the first 100 epochs. We attribute this to prompt 8 having the fewest essays, the longest essays, and the largest score range. For the other prompts, we can enlarge the sample set to improve the training effect, but for prompt 8 we cannot: when the sample set of prompt 8 grows, the training process becomes unstable and hard to converge. Therefore, in the experiment, prompt 8 has the smallest sample set.
Furthermore, from Table 4, we know that the results under Ma + Mb + Mc are the best; its average Kappa value is 0.44 greater than that of Mb + Mc. In particular, prompt 2 and prompt 3, which have the worst Kappa values under the baseline models, improved markedly in our model. We think this is because the input under this model contains more information: the essay, the distance information, and the self-feature mechanism, which are good for rating. The parameter denoting the sentence length, defined in Section 3.3.4, was set to 10. To illustrate this clearly, we take prompt 2 and prompt 3 as examples: Figure 4 shows these two prompts' Kappa value diagrams for the first 100 epochs under Ma + Mc, Mb + Mc, and Ma + Mb + Mc. From the figure, we can easily see that model Ma + Mb + Mc improves further over models Ma + Mc and Mb + Mc.