DOI: 10.5555/1631850.1631861

Approximate searching for distributional similarity

Published: 30 June 2005

Abstract

Distributional similarity requires large volumes of data to accurately represent infrequent words. However, the nearest-neighbour approach to finding synonyms suffers from poor scalability. The Spatial Approximation Sample Hierarchy (SASH), proposed by Houle (2003b), is a data structure for approximate nearest-neighbour queries that balances the efficiency/approximation trade-off. We have integrated this into an existing distributional similarity system, tripling efficiency with a minor accuracy penalty.
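
To illustrate the scalability problem the abstract refers to, the following is a minimal sketch of exhaustive nearest-neighbour search over sparse context vectors. It is not the system described in the paper and it does not implement SASH: the cosine measure, the function names (`cosine`, `nearest_neighbours`) and the toy counts are all assumptions made for illustration. The point is only that each query compares the target word against every other word, so finding neighbours for all n words takes on the order of n² comparisons.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse context-count vectors."""
    dot = sum(w * v[c] for c, w in u.items() if c in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def nearest_neighbours(target, vectors, k=2):
    """Exhaustive search: score the target against every other word."""
    scored = [
        (cosine(vectors[target], vec), word)
        for word, vec in vectors.items()
        if word != target
    ]
    scored.sort(reverse=True)
    return [word for _, word in scored[:k]]


# Hypothetical toy context counts, invented for this example only.
vectors = {
    "car": Counter({"drive": 4, "road": 3, "wheel": 2}),
    "truck": Counter({"drive": 3, "road": 2, "load": 2}),
    "apple": Counter({"eat": 5, "tree": 2}),
    "pear": Counter({"eat": 4, "tree": 3}),
}

print(nearest_neighbours("car", vectors, k=1))  # prints ['truck']
```

An approximate index such as SASH aims to avoid scoring every candidate per query, accepting that some true neighbours may occasionally be missed.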

References

[1]
John R. L. Bernard, editor. 1990. The Macquarie Encyclopedic Thesaurus. The Macquarie Library, Sydney, Australia.
[2]
Peter F. Brown, Vincent J. Della Pietra, Peter V. deSouza, Jennifer C. Lai, and Robert L. Mercer. 1992. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467--479, December.
[3]
Lou Burnard, editor. 1995. Users Reference Guide British National Corpus Version 1.0. Oxford University Computing Services.
[4]
Edgar Chávez, Gonzalo Navarro, Ricardo Baeza-Yates, and José L. Marroquín. 2001. Searching in metric spaces. ACM Computing Surveys, 33(3):273--321, September.
[5]
Stephen Clark and David Weir. 2001. Class-based probability estimation using a semantic hierarchy. In Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics, pages 95--102, Pittsburgh, PA USA, 2--7 June.
[6]
James R. Curran and Marc Moens. 2002a. Improvements in automatic thesaurus extraction. In Proceedings of the Workshop of the ACL Special Interest Group on the Lexicon (SIGLEX), pages 59--66, Philadelphia, USA, 12 July.
[7]
James R. Curran and Marc Moens. 2002b. Scaling context space. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 231--238, Philadelphia, USA, 7--12 July.
[8]
James R. Curran. 2004. From Distributional to Semantic Similarity. Ph.D. thesis, University of Edinburgh.
[9]
Christiane Fellbaum, editor. 1998. WordNet: an electronic lexical database. The MIT Press, Cambridge, MA USA.
[10]
Gregory Grefenstette. 1994. Explorations in Automatic Thesaurus Discovery. Kluwer Academic Publishers, Boston, USA.
[11]
Michael E. Houle. 2003a. Navigating massive data sets via local clustering. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 547--552, Washington, DC, USA, 24--27 August.
[12]
Michael E. Houle. 2003b. SASH: a spatial approximation sample hierarchy for similarity search. Technical Report RT0517, IBM Research, Tokyo Research Laboratory, Yamato Kanagawa, Japan, March.
[13]
Guido Minnen, John Carroll, and Darren Pearce. 2000. Robust applied morphological generation. In Proceedings of the First International Natural Language Generation Conference, pages 201--208, Mitzpe Ramon, Israel, 12--16 June.
[14]
Marius Pasca and Sanda Harabagiu. 2001. The informative role of WordNet in open-domain question answering. In Proceedings of the Workshop on WordNet and Other Lexical Resources: Applications, Extensions and Customizations, pages 138--143, Pittsburgh, PA USA, 2--7 June.
[15]
Darren Pearce. 2001. Synonymy in collocation extraction. In Proceedings of the Workshop on WordNet and Other Lexical Resources: Applications, Extensions and Customizations, pages 41--46, Pittsburgh, PA USA, 2--7 June.
[16]
Adwait Ratnaparkhi. 1996. A maximum entropy part-of-speech tagger. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 133--142, 17--18 May.
[17]
Peter Roget. 1911. Thesaurus of English words and phrases. Longmans, Green and Co., London, UK.
[18]
Grady Ward. 1996. Moby Thesaurus. Moby Project.

Cited By

  • (2006) Automatically extracting nominal mentions of events with a bootstrapped probabilistic classifier. Proceedings of the COLING/ACL on Main conference poster sessions, pages 168--175. DOI: 10.5555/1273073.1273095. Online publication date: 17-Jul-2006
  • (2006) Scaling distributional similarity to large corpora. Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics, pages 361--368. DOI: 10.3115/1220175.1220221. Online publication date: 17-Jul-2006

Published In

DeepLA '05: Proceedings of the ACL-SIGLEX Workshop on Deep Lexical Acquisition
June 2005
114 pages

Publisher

Association for Computational Linguistics

United States

Qualifiers

  • Research-article
