research-article

Improving text classification by a sense spectrum approach to term expansion

Authors:

Sándor Darányi,

Chew Lim Tan

CoNLL '09: Proceedings of the Thirteenth Conference on Computational Natural Language Learning

Pages 183 - 191

Published: 04 June 2009

Abstract

Experimenting with different mathematical objects for text representation is an important step in building text classification models. To be efficient, the objects of a formal model, such as vectors, have to reproduce reasonably well language-related phenomena such as the word meaning inherent in index terms. We introduce an algorithm for sense-based semantic ordering of index terms that approximates Cruse's description of a sense spectrum. Following semantic ordering, text classification by support vector machines can benefit from semantic smoothing kernels that take semantic relations among index terms into account when computing document similarity. Adding expansion terms to the vector representation can also improve effectiveness. This paper proposes a new kernel that discounts less important expansion terms based on lexical relatedness.
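
To make the notion of a semantic smoothing kernel concrete, the following is a minimal sketch of the generic form k(d1, d2) = d1^T S S^T d2, in which a term-relatedness matrix S spreads each term's weight onto related terms before the dot product is taken. It is not the kernel proposed in this paper: the three-term vocabulary, the relatedness scores in `S`, and all function names are assumptions chosen for illustration only.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def smooth(doc: Vector, S: Matrix) -> Vector:
    """Return S^T d: each term's weight is spread onto related terms."""
    n_terms = len(doc)
    n_cols = len(S[0])
    return [sum(S[i][j] * doc[i] for i in range(n_terms)) for j in range(n_cols)]


def linear_kernel(d1: Vector, d2: Vector) -> float:
    """Plain bag-of-words similarity: the dot product of two term vectors."""
    return sum(a * b for a, b in zip(d1, d2))


def semantic_smoothing_kernel(d1: Vector, d2: Vector, S: Matrix) -> float:
    """k(d1, d2) = (S^T d1) . (S^T d2) = d1^T S S^T d2.

    Because it is a dot product in the smoothed space, the kernel is
    positive semi-definite for any real matrix S.
    """
    return linear_kernel(smooth(d1, S), smooth(d2, S))


# Hypothetical vocabulary: ["car", "automobile", "banana"].
# The relatedness scores below are assumed values, not taken from WordNet
# or from the paper.
S = [
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

doc_a = [1.0, 0.0, 0.0]  # mentions only "car"
doc_b = [0.0, 1.0, 0.0]  # mentions only "automobile"
doc_c = [0.0, 0.0, 1.0]  # mentions only "banana"

plain = linear_kernel(doc_a, doc_b)
smoothed = semantic_smoothing_kernel(doc_a, doc_b, S)
unrelated = semantic_smoothing_kernel(doc_a, doc_c, S)
```

In this toy setting the plain linear kernel treats `doc_a` and `doc_b` as orthogonal because they share no index term, whereas the smoothed kernel assigns them a positive similarity through the assumed relatedness of "car" and "automobile"; `doc_c` stays dissimilar to both.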

References

[1] E. Agirre and O. L. de Lacalle. 2003. Clustering WordNet word senses. In Proceedings of RANLP-03, 4th International Conference on Recent Advances in Natural Language Processing, pages 121--130.

[2] E. Agirre, E. Alfonseca, and O. L. de Lacalle. 2004. Approximating hierarchy-based similarity for WordNet nominal synsets using topic signatures. In Proceedings of GWC-04, 2nd Global WordNet Conference, pages 15--22.

[3] R. Basili, M. Cammisa, and A. Moschitti. 2005. Effective use of WordNet semantics via kernel-based learning. In Proceedings of CoNLL-05, 9th Conference on Computational Natural Language Learning, pages 1--8.

[4] S. Bloehdorn, R. Basili, M. Cammisa, and A. Moschitti. 2006. Semantic kernels for text classification based on topological measures of feature similarity. In Proceedings of ICDM-06, 6th IEEE International Conference on Data Mining.

[5] A. Budanitsky and G. Hirst. 2006. Evaluating WordNet-based measures of lexical semantic relatedness. Computational Linguistics, 32(1):13--47.

[6] N. Cristianini, J. Shawe-Taylor, and H. Lodhi. 2002. Latent semantic kernels. Journal of Intelligent Information Systems, 18(2):127--152.

[7] D. A. Cruse. 1986. Lexical semantics.

[8] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391--407.

[9] C. Dorrer, P. Londero, M. Anderson, S. Wallentowitz, and I. A. Walmsley. 2001. Computing with interference: all-optical single-query 50-element database search. In Proceedings of QELS-01, Quantum Electronics and Laser Science Conference, pages 149--150.

[10] E. Gabrilovich and S. Markovitch. 2005. Feature generation for text categorization using world knowledge. In Proceedings of IJCAI-05, 19th International Joint Conference on Artificial Intelligence, volume 19.

[11] E. Hoenkamp. 2003. Unitary operators on the document space. Journal of the American Society for Information Science and Technology, 54(4):314--320.

[12] A. Hotho, S. Staab, and G. Stumme. 2003. WordNet improves text document clustering. In Proceedings of SIGIR-03, 26th ACM International Conference on Research and Development in Information Retrieval.

[13] J. Hu, L. Fang, Y. Cao, H. J. Zeng, H. Li, Q. Yang, and Z. Chen. 2008. Enhancing text clustering by leveraging Wikipedia semantics. In Proceedings of SIGIR-08, 31st ACM International Conference on Research and Development in Information Retrieval, pages 179--186.

[14] J. J. Jiang and D. W. Conrath. 1997. Semantic similarity based on corpus statistics and lexical taxonomy. In Proceedings of the International Conference on Research in Computational Linguistics, pages 19--33.

[15] T. Joachims. 1998. Text categorization with support vector machines: Learning with many relevant features. In Proceedings of ECML-98, 10th European Conference on Machine Learning, pages 137--142.

[16] D. Lin. 1998. Automatic retrieval and clustering of similar words. In Proceedings of COLING-ACL Workshop on Usage of WordNet in Natural Language Processing Systems, volume 98, pages 768--773.

[17] D. Mavroeidis, G. Tsatsaronis, M. Vazirgiannis, M. Theobald, and G. Weikum. 2005. Word sense disambiguation for exploiting hierarchical thesauri in text classification. In Proceedings of PKDD-05, 9th European Conference on the Principles of Data Mining and Knowledge Discovery, pages 181--192.

[18] S. Mohammad and G. Hirst. 2005. Distributional measures as proxies for semantic relatedness.

[19] H. Paijmans. 1997. Gravity wells of meaning: detecting information-rich passages in scientific texts. Journal of Documentation, 53(5):520--536.

[20] M. Palmer, H. T. Dang, and C. Fellbaum. 2006. Making fine-grained and coarse-grained sense distinctions, both manually and automatically. Natural Language Engineering, 13(2):137--163.

[21] V. V. Raghavan and S. K. M. Wong. 1986. A critical analysis of vector space model for information retrieval. Journal of the American Society for Information Science, 37(5):279--287.

[22] P. Resnik. 1995. Using information content to evaluate semantic similarity in a taxonomy. In Proceedings of IJCAI-95, 14th International Joint Conference on Artificial Intelligence, volume 1, pages 448--453.

[23] J. Rodd, G. Gaskell, and W. Marslen-Wilson. 2002. Making sense of semantic ambiguity: Semantic competition in lexical access. Journal of Memory and Language, 46(2):245--266.

[24] M. D. E. B. Rodriguez and J. M. G. Hidalgo. 1997. Using WordNet to complement training information in text categorisation. In Proceedings of RANLP-97, 2nd International Conference on Recent Advances in Natural Language Processing.

[25] J. Shawe-Taylor and N. Cristianini. 2004. Kernel Methods for Pattern Analysis.

[26] S. Shi, J. R. Wen, Q. Yu, R. Song, and W. Y. Ma. 2005. Gravitation-based model for information retrieval. In Proceedings of SIGIR-05, 28th ACM International Conference on Research and Development in Information Retrieval, pages 488--495. ACM New York, NY, USA.

[27] G. Siolas and F. d'Alché Buc. 2000. Support vector machines based on a semantic kernel for text categorization. In Proceedings of IJCNN-00, IEEE International Joint Conference on Neural Networks.

[28] C. J. van Rijsbergen. 2004. The Geometry of Information Retrieval.

[29] P. Wittek and S. Darányi. 2007. Representing word semantics for IR by continuous functions. In S. Dominich and F. Kiss, editors, Proceedings of ICTIR-07, 1st International Conference of the Theory of Information Retrieval, pages 149--155.

[30] P. Wittek, C. L. Tan, and S. Darányi. 2009. An ordering of terms based on semantic relatedness. In H. Bunt, editor, Proceedings of IWCS-09, 8th International Conference on Computational Semantics.

[31] S. K. M. Wong, W. Ziarko, and P. C. N. Wong. 1985. Generalized vector space model in information retrieval. In Proceedings of SIGIR-85, 8th ACM International Conference on Research and Development in Information Retrieval, pages 18--25.

Cited By

Fu X, Liu L, Gong T, Tao L. (2011). Improving text classification with concept index terms and expansion terms. Proceedings of the 8th international conference on Advances in neural networks - Volume Part III, pages 485-492. DOI: 10.5555/2009463.2009524. Online publication date: 29-May-2011.
https://dl.acm.org/doi/10.5555/2009463.2009524

Ševce O, Tvarožek J, Bieliková M. (2010). Term ranking and categorization for ad-hoc navigation. Proceedings of the 14th international conference on Artificial intelligence: methodology, systems, and applications, pages 71-80. DOI: 10.5555/1885962.1885972. Online publication date: 8-Sep-2010.
https://dl.acm.org/doi/10.5555/1885962.1885972

Wittek P, Darányi S, Dobreva M. (2010). Matching evolving Hilbert spaces and language for semantic access to digital libraries. Proceedings of the role of digital libraries in a time of global change, and 12th international conference on Asia-Pacific digital libraries, pages 262-263. DOI: 10.5555/1875689.1875734. Online publication date: 21-Jun-2010.
https://dl.acm.org/doi/10.5555/1875689.1875734

Index Terms

Improving text classification by a sense spectrum approach to term expansion
1. Applied computing
  1. Document management and text processing
2. Computing methodologies

Published In

CoNLL '09: Proceedings of the Thirteenth Conference on Computational Natural Language Learning

June 2009, 243 pages

ISBN: 9781932432299

Conference Chairs: Suzanne Stevenson (University of Toronto), Xavier Carreras (MIT)

Publisher: Association for Computational Linguistics, United States

