Abstract
Word embeddings have become ubiquitous and are widely used in various natural language processing (NLP) tasks, such as web retrieval, web semantic analysis, and machine translation. Unfortunately, training word embeddings on a relatively large corpus is prohibitively expensive. We propose a graph-based word embedding algorithm, called Word-Graph2vec, which converts a large corpus into a word co-occurrence graph, samples word sequences from this graph via random walks, and finally trains the word embedding on the sampled corpus. We posit that, because English has a limited vocabulary together with a large but fixed stock of idioms and fixed expressions, the size and density of the word co-occurrence graph change only slightly as the training corpus grows. Consequently, Word-Graph2vec has a stable runtime on large-scale datasets, and its performance advantage becomes increasingly evident as the training corpus grows. Extensive experiments on real-world datasets show that the proposed algorithm is four to five times faster than traditional Word2vec and two to three times faster than FastText, while the error introduced by the random walk technique is small.
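The pipeline described in the abstract reduces to three steps: build a word co-occurrence graph from the corpus, sample word sequences by weighted random walks over that graph, and train a skip-gram model on the sampled sequences. The following is a minimal Python sketch of these steps, not the authors' implementation: the function names, the co-occurrence window, and the walk parameters (num_walks, walk_length) are illustrative assumptions, and gensim's Word2Vec stands in for the skip-gram trainer.

import random
from collections import defaultdict

from gensim.models import Word2Vec  # assumed stand-in for the skip-gram trainer


def build_cooccurrence_graph(sentences, window=2):
    """Adjacency map: graph[u][v] = co-occurrence count within `window`."""
    graph = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for i, u in enumerate(tokens):
            for v in tokens[i + 1 : i + 1 + window]:
                graph[u][v] += 1
                graph[v][u] += 1  # treat co-occurrence as undirected
    return graph


def random_walks(graph, num_walks=10, walk_length=40, seed=0):
    """Sample word sequences; next word is drawn proportionally to edge weight."""
    rng = random.Random(seed)
    walks = []
    nodes = list(graph)
    for _ in range(num_walks):
        rng.shuffle(nodes)
        for start in nodes:  # start one walk from every node
            walk = [start]
            while len(walk) < walk_length:
                nbrs = graph[walk[-1]]
                if not nbrs:
                    break
                words, weights = zip(*nbrs.items())
                walk.append(rng.choices(words, weights=weights)[0])
            walks.append(walk)
    return walks


# Toy usage: the sampled walks replace the raw corpus as training input.
corpus = [["graph", "based", "word", "embedding"],
          ["word", "embedding", "on", "graph"]]
g = build_cooccurrence_graph(corpus)
model = Word2Vec(random_walks(g), vector_size=50, window=5, min_count=1)
print(model.wv.most_similar("word"))

Under this design, num_walks times walk_length times the number of graph nodes bounds the size of the training input, which illustrates why the runtime depends on the roughly fixed graph size rather than on the raw corpus size.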
The authors acknowledge funding from the 2023 Special Fund for Science and Technology Innovation Strategy of Guangdong Province (Science and Technology Innovation Cultivation of College Students), the Shenzhen High-Level Hospital Construction Fund (4001020), and the Shenzhen Science and Technology Innovation Committee Funds (JSGG20220919091404008).
Notes
- 1.
The vocabulary of the New Oxford Dictionary is around 170,000 words, but some of them are archaic, so in actual training the word co-occurrence graph contains about 100,000 to 130,000 nodes.
- 2.
- 3.
- 4.
- 5.
Detailed information and source code are available at https://github.com/kudkudak/word-embeddings-benchmarks.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Li, W. et al. (2023). Word-Graph2vec: An Efficient Word Embedding Approach on Word Co-occurrence Graph Using Random Walk Technique. In: Zhang, F., Wang, H., Barhamgi, M., Chen, L., Zhou, R. (eds) Web Information Systems Engineering – WISE 2023. WISE 2023. Lecture Notes in Computer Science, vol 14306. Springer, Singapore. https://doi.org/10.1007/978-981-99-7254-8_68
DOI: https://doi.org/10.1007/978-981-99-7254-8_68
Published:
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-7253-1
Online ISBN: 978-981-99-7254-8
eBook Packages: Computer Science; Computer Science (R0)