Open Access. Published by De Gruyter on March 29, 2023, under the Creative Commons Attribution 4.0 (CC BY 4.0) license.

An intelligent algorithm for fast machine translation of long English sentences

  • Hengheng He

Abstract

Translation of long sentences is a difficult problem in English machine translation. This work briefly introduced the basic framework of the intelligent machine translation algorithm and improved the long short-term memory (LSTM)-based intelligent machine translation algorithm by introducing a long sentence segmentation module and a reordering module. Simulation experiments were conducted using a public corpus and a local corpus containing self-collected linguistic data, and the improved algorithm was compared with machine translation algorithms based on a recurrent neural network (RNN) and on LSTM alone. The results suggested that the LSTM-based machine translation algorithm with the long sentence segmentation and reordering modules segmented long sentences effectively, translated long English sentences more accurately, and produced more grammatical translations.

AMS mathematics subject classification number: 68W40

1 Introduction

With the development of globalization, international communication has become increasingly frequent, and language is crucial to it. Using a common language that both sides understand avoids misunderstandings and improves the efficiency of the division of labor. English is one of the most widely used common languages, but for non-native speakers, the cost of learning it is high, and communicating freely in it is difficult [1]. In formal situations, and when a large amount of information needs to be exchanged [2], human translation alone can no longer meet the growing demand. Simultaneous interpretation, for example, requires such a high level of attention that interpreters usually cannot work for long stretches. A translation tool is therefore needed to supplement human translation [3]. In actual communication, long English sentences are common [4], and English grammar differs from that of other languages. If a machine translation algorithm translates long English sentences word by word in one-to-one correspondence, the translation will be ungrammatical and, in severe cases, simply wrong. Intelligent algorithms provide a new approach to English machine translation. Relevant studies on improving the efficiency and quality of English machine translation are reviewed below. Lin et al. [5] proposed a neural machine translation method based on a novel beam search evaluation function and found that it effectively improved the quality of English-Chinese translation. Luong and Manning [6] proposed a minimal-risk training method for end-to-end neural machine translation and verified its effectiveness through experiments. Choi et al. [7] contextualized word embedding vectors with a nonlinear bag-of-words representation of the source sentence; their experiments showed that the proposed contextualization and symbolization methods greatly improved the translation quality of neural machine translation systems. This work described the basic framework of the intelligent machine translation algorithm and optimized the long short-term memory (LSTM)-based intelligent machine translation algorithm by introducing a long sentence segmentation module and a reordering module. Simulation experiments were conducted using a public corpus and a local corpus collected by the author.

2 Intelligent machine translation algorithms for long English sentences

The intelligent machine translation algorithm utilizes a neural network. Its basic framework is shown in Figure 1; the main structure comprises an encoder and a decoder. The neural network in the encoder first converts the source text into an intermediate coded string that is neither the source text nor the translation, and the neural network in the decoder then converts that coded string into the translation [8].

Figure 1: Basic framework of the intelligent machine translation algorithm.
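The encode-then-decode flow can be summarized in a few lines. The following is a minimal sketch assuming PyTorch; the layer sizes, embedding dimensions, and class names are illustrative assumptions, not the configuration used in this work.

import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal encoder-decoder pair in the style of Figure 1."""
    def __init__(self, src_vocab, tgt_vocab, emb=256, hidden=512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.LSTM(emb, hidden, batch_first=True)
        self.decoder = nn.LSTM(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src, tgt):
        # The encoder compresses the source sentence into a fixed-size state
        # (the "coded string" of Figure 1) regardless of sentence length.
        _, state = self.encoder(self.src_emb(src))
        # The decoder unfolds the translation from that state.
        dec_out, _ = self.decoder(self.tgt_emb(tgt), state)
        return self.out(dec_out)  # per-step scores over the target vocabulary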

Compared with the traditional method of translating word by word with the help of dictionaries, the neural network-based intelligent machine translation algorithm learns hidden regularities, including grammatical rules, from corpus training samples during training [9], so its translations are better. However, translation quality drops when the intelligent machine translation algorithm translates long English sentences. The reason is that when the encoder converts the source text into a vector code [10], the size of that code is fixed regardless of the length of the source text. The longer the source text, the more information must be compressed into the vector code, and therefore the more semantic information is lost [11].

To improve the translation quality of long English sentences, a long sentence segmentation module and a reordering module were added to the intelligent machine translation algorithm. The segmentation module splits a long English sentence into multiple short sentences according to certain rules; the translation algorithm then translates the short sentences individually and combines the results, reducing the loss of semantic information caused by the fixed-size vector code [12]. The reordering module reorders the short sentences obtained from segmentation so that, after combination, the translation better matches the target-language word order [13]. The flow of the improved intelligent machine translation algorithm is shown in Figure 2.

  1. A source text is input, i.e., the long English sentence.

  2. The long sentence segmentation module segments the long English sentence. Using a maximum entropy classifier [14], it predicts, for every word in the long sentence, the probability that the word is a segmentation word, and takes the word with the highest probability as the segmentation point (a minimal sketch of this classifier appears after this list). The probability is calculated as:

    (1) $p(y \mid x(w)) = \dfrac{\exp\left(\sum_{i} \omega_i g_i(y, x(w))\right)}{\sum_{y' \in Y} \exp\left(\sum_{i} \omega_i g_i(y', x(w))\right)},$

    where $p(y \mid x(w))$ is the probability that word $w$ is used as the segmentation word in the long sentence, $x(w)$ is the contextual information of $w$ (the context containing the word $w$), $y$ is the segmentation label, $Y$ is the set of "segmentation" and "non-segmentation" labels, $g_i(y, x(w))$ is the $i$-th feature function between $x(w)$ and $y$, which is 1 if there is a connection between them and 0 otherwise, and $\omega_i$ is the weight parameter of $g_i(y, x(w))$.

  3. The reordering module reorders the segmented short sentences. With a maximum entropy classifier of the same form (see the sketch after this list), it predicts the probability that two originally adjacent short sentences keep their order. The corresponding formula is:

    (2) $p(o \mid C_{ms}, C_{sn}) = \dfrac{\exp\left(\sum_{j} \omega_j g_j(o, C_{ms}, C_{sn})\right)}{\sum_{o' \in O} \exp\left(\sum_{j} \omega_j g_j(o', C_{ms}, C_{sn})\right)},$

    where $C_{ms}$ and $C_{sn}$ are the phrases before and after the segmentation word $s$, respectively, $o$ is the "order" label, indicating that $C_{ms}$ still comes before $C_{sn}$ after reordering, $O$ is the set of "order" and "reversed order" labels, $p(o \mid C_{ms}, C_{sn})$ is the probability of keeping the order, $g_j(o, C_{ms}, C_{sn})$ is the $j$-th feature function between $C_{ms}$, $C_{sn}$, and $o$, and $\omega_j$ is the weight parameter of $g_j(o, C_{ms}, C_{sn})$.

  4. The segmented and reordered short English sentences are input into the encoder, which uses an LSTM model to encode the source text (these equations are transcribed in the code sketch after Figure 2). The corresponding formulas are:

    (3) $\begin{aligned} f_t &= \sigma(b_f + U_f x_t + W_f h_{t-1}), \\ g_t &= \sigma(b_g + U_g x_t + W_g h_{t-1}), \\ s_t &= f_t \odot s_{t-1} + g_t \odot \sigma(b + U x_t + W h_{t-1}), \\ q_t &= \sigma(b_q + U_q x_t + W_q h_{t-1}), \\ h_t &= \tanh(s_t) \odot q_t, \end{aligned}$

    where $f_t$ is the forget gate output, and $b_f$, $U_f$, and $W_f$ are the forget gate's bias term, input weight, and recurrent weight [15]; $s_t$ is the state unit (cell) output, and $b$, $U$, and $W$ are the bias term, input weight, and recurrent weight of the cell input; $g_t$ is the external input gate output, and $b_g$, $U_g$, and $W_g$ are the input gate's bias term, input weight, and recurrent weight [16]; $q_t$ is the output gate output, and $b_q$, $U_q$, and $W_q$ are the output gate's bias term, input weight, and recurrent weight; and $h_t$ is the hidden state output.

  5. After the encoder produces the vector code, the decoder, which also adopts LSTM, decodes it [17]. Forward calculation over the input vector code yields a probability distribution over translated tokens at each step. Finally, the cluster (beam) search algorithm [18] selects the highest-probability tokens from these distributions, and arranging them in order yields the final translation.
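The classifier behind equations (1) and (2) is the same construction: a softmax over weighted binary feature functions. The following is a minimal sketch in plain Python; the feature templates and weights are hypothetical stand-ins, since the feature set is not specified here.

import math

def maxent_prob(label, context, labels, features, weights):
    """p(label | context) per equations (1) and (2): a softmax over
    weighted feature functions g_i(label, context)."""
    def score(y):
        return math.exp(sum(w * g(y, context) for g, w in zip(features, weights)))
    return score(label) / sum(score(y) for y in labels)

# Segmentation (eq. 1): one "segment"/"non-segment" decision per word.
# Reordering (eq. 2): one "order"/"reversed order" decision per clause pair.
# Two toy indicator features (hypothetical):
features = [
    lambda y, ctx: 1 if y == "segment" and ctx["word"] == "which" else 0,
    lambda y, ctx: 1 if y == "segment" and ctx["prev"] == "," else 0,
]
weights = [1.2, 0.8]
p = maxent_prob("segment", {"word": "which", "prev": ","},
                ["segment", "non-segment"], features, weights)
print("p(segment | context) = %.3f" % p)  # ~0.881 for these toy weights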

Figure 2: Translation process of the improved intelligent machine translation algorithm for long English sentences.
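For concreteness, the encoder step in equation (3) can be transcribed directly. The numpy sketch below implements one time step exactly as written, including the equation's use of $\sigma$ (rather than the more common tanh) for the cell input; the shapes and names of the weight matrices are assumptions.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, s_prev, p):
    """One step of equation (3); p maps names to weight matrices and biases."""
    f_t = sigmoid(p["b_f"] + p["U_f"] @ x_t + p["W_f"] @ h_prev)  # forget gate
    g_t = sigmoid(p["b_g"] + p["U_g"] @ x_t + p["W_g"] @ h_prev)  # input gate
    q_t = sigmoid(p["b_q"] + p["U_q"] @ x_t + p["W_q"] @ h_prev)  # output gate
    # Cell state: retain part of the old state, add the gated new candidate.
    s_t = f_t * s_prev + g_t * sigmoid(p["b"] + p["U"] @ x_t + p["W"] @ h_prev)
    h_t = np.tanh(s_t) * q_t  # hidden state passed to the next time step
    return h_t, s_t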

3 Simulation experiments

3.1 Experimental data

The English-Chinese parallel corpus used for the simulation experiments was UM-Corpus [19], which provides two million aligned English-Chinese sentence pairs from eight text domains: education, law, Weibo, news, science, spoken language, subtitles, and essays. Ten thousand sentences were randomly selected as the training set and 5,000 sentences as the test set.

In addition to this dataset, 3,000 English sentences were collected from newspapers and movie reviews as a local corpus for testing the improved intelligent machine translation algorithm.

3.2 Experimental setup

In the improved intelligent machine translation algorithm, the LSTM in the encoder was configured as follows: four hidden layers, 1,024 nodes per layer, and the sigmoid function as the hidden-layer activation. The LSTM in the decoder had two hidden layers with 1,024 nodes per layer, also with sigmoid activation. The cluster search algorithm transformed the calculated probability distributions over tokens into the translation, with a beam ("cluster") width of 10. The algorithm was trained on the training set with stochastic gradient descent; the learning rate was 0.1, and the maximum number of iterations was 1,000.
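A minimal sketch of the cluster (beam) search used in decoding is given below, with the beam width of 10 from the setup above. Here step_log_probs is a hypothetical stand-in for the decoder's per-step log-probability function; a real decoder would score the full target vocabulary at each step.

import heapq

def beam_search(step_log_probs, start, eos, width=10, max_len=50):
    """Keep the `width` highest-scoring partial translations at each step."""
    beams = [(0.0, [start])]  # (cumulative log-probability, token sequence)
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            if seq[-1] == eos:  # finished hypotheses carry over unchanged
                candidates.append((score, seq))
                continue
            for tok, logp in step_log_probs(seq):  # expand each hypothesis
                candidates.append((score + logp, seq + [tok]))
        beams = heapq.nlargest(width, candidates, key=lambda c: c[0])
        if all(seq[-1] == eos for _, seq in beams):
            break
    return max(beams, key=lambda c: c[0])[1]  # best complete sequence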

In addition to the improved algorithm, the machine translation algorithm without the long sentence segmentation and reordering modules was also tested. Its encoder and decoder likewise adopted LSTM, with parameter settings consistent with those of the improved algorithm.

A machine translation algorithm that adopted the recurrent neural network (RNN) algorithm in the encoder and decoder was also tested. Its parameters were as follows: four hidden layers in the encoder and two in the decoder, 1,024 nodes in every hidden layer, and the sigmoid function as the activation function. Training used the same training set and procedure as the improved algorithm.

3.3 Evaluation criteria

First, the long sentence segmentation in the improved intelligent machine translation algorithm was tested. Long sentence segmentation can be regarded as sequence labeling of the words in a sentence, so the segmentation effect was measured using precision, recall rate, and F value.

The effect of segmenting long sentences was evaluated using the confusion matrix of binary classification, shown in Table 1. The calculation formulas of the evaluation indices are as follows:

(4) $P = \dfrac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}, \qquad R = \dfrac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \qquad F = \dfrac{2PR}{P + R},$

where $P$ is the precision, $R$ is the recall rate, and $F$ is the harmonic mean of precision and recall.

Table 1

Confusion matrix of binary classification

                        Actually positive    Actually negative
Determined as positive  TP                   FP
Determined as negative  FN                   TN
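As a small check of equation (4), the three indices can be computed directly from the confusion-matrix counts; the counts below are hypothetical.

def prf(tp, fp, fn):
    """Precision, recall, and F value per equation (4)."""
    p = tp / (tp + fp)       # fraction of predicted boundaries that are correct
    r = tp / (tp + fn)       # fraction of true boundaries that are found
    f = 2 * p * r / (p + r)  # harmonic mean of precision and recall
    return p, r, f

print("P=%.3f R=%.3f F=%.3f" % prf(tp=97, fp=2, fn=3))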

The performance of the machine translation algorithms at the word level was first evaluated by the word error rate [20], calculated as follows:

(5) $\mathrm{WER} = \dfrac{X + Y + Z}{P} \times 100\%,$

where $X$ is the number of substituted words, $Y$ is the number of deleted words, $Z$ is the number of inserted words, and $P$ is the total number of words in the test set.

In practice, a machine translation algorithm mostly translates long sentences or whole texts. In that setting, besides the accuracy of individual word translations, the grammatical differences between the source and target languages also matter. The bilingual evaluation understudy (BLEU) index was therefore also used to evaluate each translation as a whole (a sketch of both metrics follows equation (6)); its formula is:

(6) $\mathrm{BLEU} = B \cdot \exp\left(\sum_{n=1}^{N} \omega_n \log p_n\right), \qquad B = \begin{cases} 1, & c > r, \\ \exp\left(1 - \dfrac{r}{c}\right), & c \le r, \end{cases}$

where $N$ is the maximum order of the n-grams, $\omega_n$ is the weight of the $n$-gram precision, $p_n$ is the $n$-gram precision (the proportion of $n$-grams in the machine translation that also appear in the reference translation), $B$ is the brevity penalty factor, $c$ is the number of words in the machine translation, and $r$ is the number of words in the reference translation. The n-gram order is chosen according to the translation task being evaluated, and the corresponding weights are set empirically.
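Both metrics are sketched below in plain Python: WER via word-level edit distance per equation (5), and a simplified single-reference BLEU per equation (6) with uniform weights $\omega_n = 1/N$. This is an illustration, not a replacement for a standard evaluation toolkit.

import math
from collections import Counter

def wer(reference, hypothesis):
    """(substitutions + deletions + insertions) / reference length, eq. (5)."""
    r, h = reference.split(), hypothesis.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(r)][len(h)] / len(r)

def bleu(reference, hypothesis, max_n=4):
    """Geometric mean of n-gram precisions with brevity penalty, eq. (6)."""
    r, h = reference.split(), hypothesis.split()
    log_p = 0.0
    for n in range(1, max_n + 1):
        h_grams = Counter(tuple(h[i:i + n]) for i in range(len(h) - n + 1))
        r_grams = Counter(tuple(r[i:i + n]) for i in range(len(r) - n + 1))
        overlap = sum((h_grams & r_grams).values())  # clipped n-gram matches
        log_p += math.log(max(overlap, 1e-9) / max(sum(h_grams.values()), 1)) / max_n
    b = 1.0 if len(h) > len(r) else math.exp(1 - len(r) / len(h))
    return b * math.exp(log_p)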

3.4 Experimental results

Table 2 shows partial results of segmenting long English sentences with the improved intelligent machine translation algorithm, together with the overall segmentation performance. In the segmentation results in Table 2, "(,)" marks a segmentation boundary given by the segmentation module. The long sentence segmentation module segmented the long sentences well, and the segmentation points fell at conjunctions and similar function words, which meets the grammatical criteria for segmentation. The precision, recall rate, and F value of segmentation on the test set were 97.6%, 96.9%, and 97.2%, respectively, indicating that the improved intelligent machine translation algorithm segmented long English sentences quite well.

Table 2

Partial results of segmenting long English sentences with the improved intelligent machine translation algorithm and the overall segmentation performance

English source text:
(1) The weather is good today, which is suitable for outdoor sports, picnics, or leisure activities at home
(2) The lunch in the canteen today is roast meat with potatoes
(3) The test is coming tomorrow. Have you finished reviewing?

Segmentation result ("(,)" marks a predicted boundary):
(1) The weather is good today (,) which is suitable for outdoor sports, picnics (,) or leisure activities (,) at home
(2) The lunch (,) in the canteen today (,) is roast meat (,) with potatoes
(3) The test (,) is coming (,) tomorrow (.) Have you (,) finished reviewing?

Precision of segmentation: 97.6%
Recall rate of segmentation: 96.9%
F value of segmentation: 97.2%

Due to space limitations, only some of the translations of the English source texts by the three machine translation algorithms are shown, in Table 3. Comparing the three algorithms' translations with the reference translation shows that the translation from the improved LSTM-based machine translation algorithm was closest to the reference, while the translation from the plain LSTM-based algorithm differed somewhat. Although the translation from the RNN-based machine translation algorithm conveyed a meaning consistent with the reference translation, its grammar did not follow conventional word order, which makes the translation noticeably awkward to read.

Table 3

Partial translation results of three machine translation algorithms

English source text:
(1) The weather is good today, which is suitable for outdoor sports, picnics, or leisure activities at home
(2) The lunch in the canteen today is roast meat with potatoes
(3) The test is coming tomorrow. Have you finished reviewing?

Reference translation:
(1) 今日天气良好,适合户外运动,也可在外野餐,当然也可以在家休闲活动。
(2) 今天食堂中提供的午餐是土豆烧肉。
(3) 明天就要测试了,复习完成了吗?

RNN-based machine translation:
(1) 天气很好今天,适合在外运动、野餐或者休闲活动在家。
(2) 午餐在餐厅今天,是烧肉和土豆。
(3) 测试要来了明天,你有完成复习吗?

LSTM-based machine translation:
(1) 今日天气良好,适合户外运动,也可在外野餐或在家休闲活动、
(2) 今天在食堂的午餐是烧肉和土豆。
(3) 明天测试要来了。你完成复习了吗?

Improved LSTM-based machine translation:
(1) 今日天气良好,适合户外运动,也可在外野餐,当然也可以在家休闲活动。
(2) 今天食堂中提供的午餐是土豆烧肉。
(3) 明天就要测试了,复习完成了吗?

Figure 3 shows the word error rates of the three machine translation algorithms on the corpus data test set and the self-collected local data test set. The word error rate of the RNN-based machine translation algorithm was 2.6% on the corpus data test set and 2.8% on the local data test set. The word error rates of the LSTM-based machine translation algorithm were 1.5% and 1.7%, and those of the improved LSTM-based machine translation algorithm were 0.9% and 1.1%, respectively. The comparison shows that the translations of the improved LSTM-based machine translation algorithm had the lowest word error rate.

Figure 3: Word error rates of three machine translation algorithms for translating the corpus data test set and local data test set.

Figure 4 shows the BLEU of the translations obtained by the three machine translation algorithms for the corpus and local data test sets. The BLEU of the RNN-based machine translation algorithm was 22.3 for the corpus data test set and 22.4 for the local data test set. The BLEU of the LSTM-based machine translation algorithm was 26.7 for the corpus data test set and 26.1 for the local data test set, while the BLEU of the improved LSTM-based machine translation algorithm was 31.3 for the corpus data test set and 30.9 for the local data test set. It was seen from the comparison that the translation of the improved LSTM-based machine translation algorithm had the highest BLEU, which suggested that this algorithm had the best translation performance.

Figure 4: BLEU of three machine translation algorithms for two test sets.

4 Discussion

As international communication becomes increasingly frequent, English serves as a common language, yet it is difficult for non-native speakers to master. Especially when a large number of texts must be translated on formal occasions, manual translation alone is inefficient. With improvements in computer technology, machine translation algorithms have gradually been applied to the rapid translation of large volumes of English text. However, in actual use, because grammar differs between languages, a machine translation algorithm that translates long English sentences in one-to-one correspondence may produce grammatical mistakes that make the translation difficult to read. In this work, the intelligent translation algorithm cut a long English sentence into a collection of phrases, reordered the phrases in the collection to follow the target-language grammar, and translated the reordered English phrases using the LSTM encoder and decoder. The intelligent translation algorithm was then tested in simulation and compared with a machine translation algorithm using an RNN as the encoder and decoder and with an LSTM algorithm without the long sentence segmentation module. The final results are discussed in Section 3.

In the test of English long sentence segmentation, the proposed intelligent translation algorithm performed well. The subsequent comparison experiments also verified that the proposed algorithm had the best translation performance for long English sentences, the LSTM algorithm without the segmentation module was second best, and the RNN algorithm without the segmentation module was worst. The reasons are as follows. Although the RNN in the RNN-based machine translation algorithm is suitable for processing sequential data, it is prone to gradient explosion or vanishing gradients when processing long sequences, so its performance on long English sentences was poor. The LSTM in the LSTM-based machine translation algorithm derives from the RNN; its forgetting mechanism reduces the impact of gradient explosion or vanishing caused by long sentences, so the LSTM-based algorithm outperformed the RNN-based one. The improved LSTM-based machine translation algorithm further added the long sentence segmentation module and the reordering module: segmenting long sentences reduced the loss of semantic information, and reordering improved the grammatical order of the translation, so the improved algorithm outperformed the plain LSTM-based algorithm.

5 Conclusion

This study briefly introduced the basic framework of the intelligent machine translation algorithm, improved the LSTM-based intelligent machine translation algorithm by introducing a long sentence segmentation module and a reordering module, and conducted simulation experiments on two corpora. The results are summarized below. (1) The improved intelligent machine translation algorithm segmented long English sentences well. (2) The translation from the improved LSTM-based machine translation algorithm was closest to the reference translation, the plain LSTM-based algorithm was second closest, and the translation from the RNN-based algorithm did not conform to conventional grammar. (3) The translation from the improved LSTM-based algorithm had the lowest word error rate, while the RNN-based algorithm had the highest. (4) On both test sets, the improved LSTM-based algorithm achieved the highest BLEU, the plain LSTM-based algorithm the second highest, and the RNN-based algorithm the lowest.

Conflict of interest: The author declares no conflict of interest.

References

[1] Lin L, Liu J, Zhang X, Liang X. Automatic translation of spoken English based on improved machine learning algorithm. J Intell Fuzzy Syst Appl Eng Technol. 2021;40:2385–95. doi:10.3233/JIFS-189234.

[2] Zhang G. Research on the efficiency of intelligent algorithm for English speech recognition and sentence translation. Inform An Int J Comput Inform. 2022;45:309–14. doi:10.31449/inf.v45i2.3564.

[3] Wen H. Intelligent English translation mobile platform and recognition system based on support vector machine. J Intell Fuzzy Syst Appl Eng Technol. 2020;38:7095–106. doi:10.3233/JIFS-179788.

[4] Dandapat S, Federmann C. Iterative data augmentation for neural machine translation: a low resource case study for English-Telugu. In: Proceedings of the 21st Annual Conference of the European Association for Machine Translation; 2018 May 28–30; Alacant, Spain. European Association for Machine Translation; 2018. p. 287–92.

[5] Lin X, Liu J, Zhang J, Lim S. A novel beam search to improve neural machine translation for English-Chinese. Comput Mater Contin (Engl). 2020;65(1):387–404. doi:10.32604/cmc.2020.010984.

[6] Luong MT, Manning CD. Achieving open vocabulary neural machine translation with hybrid word-character models. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics; 2016 Aug; Berlin, Germany. Association for Computational Linguistics; 2016. p. 1054–63. doi:10.18653/v1/P16-1100.

[7] Choi H, Cho K, Bengio Y. Context-dependent word representation for neural machine translation. Comput Speech Lang. 2017;45:149–60. doi:10.1016/j.csl.2017.01.007.

[8] Khan S, Mir U, Shreem S, Al Amari S. Translation divergence patterns handling in English to Urdu machine translation. Int J Artif Intell Tools Archit Lang Algorithms. 2018;27:1–19. doi:10.1142/S0218213018500173.

[9] Zhang L, Zhou Z, Ji P, Mei A. Application of attention mechanism with prior information in natural language processing. Int J Artif Intell Tools Archit Lang Algorithms. 2022;31:1–18. doi:10.1142/S0218213022400085.

[10] Bi S. Intelligent system for English translation using automated knowledge base. J Intell Fuzzy Syst. 2020;39:1–10. doi:10.3233/JIFS-179991.

[11] Malviya P, Rao G. A model literature analysis on machine translation system finding research problem in English to Hindi translation-systems. Int J Comput Intell Theory Pract. 2020;15:127–35.

[12] Chakrawarti RK, Mishra H, Bansal P. Review of machine translation techniques for idea of Hindi to English idiom translation. Int J Comput Intell Res. 2017;13:1059–71.

[13] Kipyatkova I. Experimenting with hybrid TDNN/HMM acoustic models for Russian speech recognition. In: International Conference on Speech and Computer; 2017 Aug 13. Cham: Springer; 2017. doi:10.1007/978-3-319-66429-3_35.

[14] Yoshioka T, Karita S, Nakatani T. Far-field speech recognition using CNN-DNN-HMM with convolution in time. In: IEEE International Conference on Acoustics, Speech and Signal Processing; 2015 Apr 19–24; South Brisbane, QLD, Australia. IEEE; 2015. p. 4360–4. doi:10.1109/ICASSP.2015.7178794.

[15] Wang Y, Bao F, Zhang H, Gao G. Research on Mongolian speech recognition based on FSMN. Nat Lang Process Chin Comput. 2017:243–54. doi:10.1007/978-3-319-73618-1_21.

[16] Song HJ, Heo TS, Kim JD, Park CY, Kim YS. Sentence similarity evaluation using Sent2Vec and siamese neural network with parallel structure. J Intell Fuzzy Syst. 2021;40:1–10. doi:10.3233/JIFS-189593.

[17] Yun H, Hwang Y, Jung K. Improving context-aware neural machine translation using self-attentive sentence embedding. Proc AAAI Conf Artif Intell. 2020;34:9498–506. doi:10.1609/aaai.v34i05.6494.

[18] Alam MJ, Gupta V, Kenny P, Dumouchel P. Speech recognition in reverberant and noisy environments employing multiple feature extractors and i-vector speaker adaptation. Eurasip J Adv Signal Process. 2015;2015:1–13. doi:10.1186/s13634-015-0238-6.

[19] Tian L, Wong DF, Chao L, Quaresma P, Oliveira F, Li S, et al. UM-Corpus: a large English-Chinese parallel corpus for statistical machine translation. In: Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC'14); 2014 May 26–31; Reykjavik, Iceland. European Language Resources Association; 2014.

[20] Hammami N, Bedda M, Nadir F. The second-order derivatives of MFCC for improving spoken Arabic digits recognition using tree distributions approximation model and HMMs. In: International Conference on Communications and Information Technology; 2012 Jun 26–28; Hammamet, Tunisia. IEEE; 2012. p. 1–5. doi:10.1109/ICCITechnol.2012.6285769.

Received: 2022-11-06
Revised: 2022-12-19
Accepted: 2023-02-08
Published Online: 2023-03-29

© 2023 the author(s), published by De Gruyter

This work is licensed under the Creative Commons Attribution 4.0 International License.
