short-paper

Efficient Computation of Co-occurrence Based Word Relatedness

Authors: Jie Mei, Evangelos E. Milios

DocEng '15: Proceedings of the 2015 ACM Symposium on Document Engineering

Pages 43 - 46

https://doi.org/10.1145/2682571.2797088

Published: 08 September 2015

Abstract

Measuring document relatedness with unsupervised, co-occurrence-based word relatedness methods is costly in both processing time and memory. This paper applies compact data structures to the efficient computation of word relatedness from corpus statistics. The data structure supports efficient lookup of (1) the corpus statistics, in the Common Word Relatedness Approach, and (2) the pairwise word relatedness, in the Algorithm Specific Word Relatedness Approach. Both approaches significantly reduce the processing time of word relatedness methods and the memory needed to store co-occurrence statistics, making text mining tasks such as classification and clustering based on word relatedness practical.
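The abstract does not specify the data structure, so the following Python sketch is only an illustration of the general idea, not the authors' implementation. It assumes that co-occurrence counts are symmetric, that each unordered word pair appears once in the input, and that pointwise mutual information (PMI) stands in for a relatedness measure. The names `CompactCooccurrence` and `pmi` are hypothetical. Counts are held in flat sorted arrays and looked up by binary search, which avoids the per-entry overhead of a hash map of word pairs.

```python
import math
from array import array
from bisect import bisect_left


class CompactCooccurrence:
    """Hypothetical sorted-array store for symmetric co-occurrence counts."""

    def __init__(self, vocab, unigram_counts, pair_counts):
        # Map each word to a dense integer id.
        self.ids = {word: i for i, word in enumerate(vocab)}
        self.size = len(vocab)
        self.unigrams = array("Q", (unigram_counts[w] for w in vocab))
        # Pack each unordered pair into one integer key, then sort by key.
        # Assumes each unordered pair occurs once in pair_counts.
        packed = sorted((self._key(a, b), c) for (a, b), c in pair_counts.items())
        self.keys = array("Q", (k for k, _ in packed))
        self.counts = array("Q", (c for _, c in packed))

    def _key(self, a, b):
        i, j = sorted((self.ids[a], self.ids[b]))
        return i * self.size + j

    def unigram(self, word):
        """Corpus frequency of a single word."""
        return self.unigrams[self.ids[word]]

    def pair(self, a, b):
        """Co-occurrence count via binary search; 0 for unseen pairs."""
        if a not in self.ids or b not in self.ids:
            return 0
        key = self._key(a, b)
        pos = bisect_left(self.keys, key)
        if pos < len(self.keys) and self.keys[pos] == key:
            return self.counts[pos]
        return 0


def pmi(store, a, b, total):
    """PMI in bits; `total` is the assumed number of counted events."""
    joint = store.pair(a, b)
    if joint == 0:
        return float("-inf")
    return math.log2(joint * total / (store.unigram(a) * store.unigram(b)))
```

Under these assumptions, the first kind of lookup (corpus statistics) corresponds to `unigram` and `pair`. The second kind (pairwise relatedness) could reuse the same layout by storing precomputed relatedness scores in place of raw counts.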


Cited By

Wang W, Islam A, Moh'd A, Soto A, Milios E (2020). Nonuniform language in technical writing: Detection and correction. Natural Language Engineering, pages 1-22. Online publication date: 6-Mar-2020.
https://doi.org/10.1017/S1351324920000133

Chen S, Moh'd A, Nourashrafeddin S, Milios E (2018). Active High-Recall Information Retrieval from Domain-Specific Text Corpora based on Query Documents. Proceedings of the ACM Symposium on Document Engineering 2018, pages 1-10. Online publication date: 28-Aug-2018.
https://dl.acm.org/doi/10.1145/3209280.3209532

Index Terms

Efficient Computation of Co-occurrence Based Word Relatedness
1. Information systems
  1. Information retrieval
    1. Retrieval models and ranking



Published In

DocEng '15: Proceedings of the 2015 ACM Symposium on Document Engineering

September 2015

248 pages

ISBN: 9781450333078

DOI: 10.1145/2682571

General Chair: Christine Vanoirbeek (EPFL, Switzerland)
Program Chair: Pierre Genevès (CNRS, France)

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

Natural Sciences and Engineering Research Council of Canada (NSERC)
Boeing Company

Conference

DocEng '15

Sponsor: SIGWEB

DocEng '15: ACM Symposium on Document Engineering 2015

September 8 - 11, 2015

Lausanne, Switzerland

Acceptance Rates

DocEng '15 Paper Acceptance Rate 11 of 31 submissions, 35%;

Overall Acceptance Rate 194 of 564 submissions, 34%

Article Metrics

Total Citations: 2
Total Downloads: 164
Downloads (Last 12 months): 2
Downloads (Last 6 weeks): 0

Reflects downloads up to 27 Jan 2025
