Query segmentation revisited

Authors: Matthias Hagen, Martin Potthast, Christof Bräutigam

WWW '11: Proceedings of the 20th international conference on World wide web

Pages 97 - 106

https://doi.org/10.1145/1963405.1963423

Published: 28 March 2011

Abstract

We address the problem of query segmentation: given a keyword query, the task is to group the keywords into phrases, if possible. Previous approaches to the problem achieve reasonable segmentation performance but are tested only against a small corpus of manually segmented queries. In addition, many of the previous approaches are fairly intricate: they use expensive features and are difficult to reimplement.
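To make the task concrete, the following minimal sketch enumerates every possible segmentation of a keyword query. It is only an illustration of the search space, not the paper's method: the function name `segmentations` and the example query are assumptions chosen for this sketch. A query with n keywords has n - 1 gaps, each of which is either a segment break or not, so there are 2^(n-1) candidate segmentations.

```python
from itertools import product


def segmentations(query):
    """Yield every split of the query's keywords into contiguous segments.

    Hypothetical helper for illustration; each segmentation is a list of
    phrase strings that preserves the original keyword order.
    """
    words = query.split()
    if not words:
        return
    # Each of the len(words) - 1 gaps is either a break (True) or not (False).
    for breaks in product([False, True], repeat=len(words) - 1):
        segments, current = [], [words[0]]
        for word, is_break in zip(words[1:], breaks):
            if is_break:
                segments.append(" ".join(current))
                current = [word]
            else:
                current.append(word)
        segments.append(" ".join(current))
        yield segments


if __name__ == "__main__":
    # Illustrative query; prints all candidate segmentations, one per line.
    for candidate in segmentations("new york times"):
        print(" | ".join(candidate))
```

Exhaustive enumeration is feasible here because keyword queries are typically short; for long queries the number of candidates doubles with each additional keyword.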

The main contribution of this paper is a new method for query segmentation that is easy to implement, fast, and achieves a segmentation accuracy comparable to current state-of-the-art techniques. Our method uses only raw web n-gram frequencies and Wikipedia titles that are stored in a hash table. In addition, we introduce a new evaluation corpus for query segmentation. With about 50,000 human-annotated queries, it is two orders of magnitude larger than the corpus used so far.
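The abstract does not give the scoring function, so the following is only a hedged sketch of how n-gram frequencies and Wikipedia titles held in hash tables could drive a segmenter. Everything in it is an assumption made for illustration: the toy counts in `NGRAM_COUNTS`, the title set `WIKI_TITLES`, the length weight `n ** n`, the constant `WIKI_BOOST`, and the function names. Real counts would come from a web n-gram corpus rather than a hand-written dictionary.

```python
from itertools import product

# Toy stand-ins (assumptions): a real system would load web n-gram
# frequencies and Wikipedia titles into hash tables of this shape.
NGRAM_COUNTS = {
    "new york": 1000,
    "york times": 300,
    "times square": 400,
    "new york times": 200,
    "york times square": 5,
    "new york times square": 1,
}
WIKI_TITLES = {"new york", "new york times", "times square"}
WIKI_BOOST = 2  # hypothetical bonus factor for segments that are titles


def segmentations(words):
    """Yield every split of the word list into contiguous segments."""
    for breaks in product([False, True], repeat=len(words) - 1):
        segments, current = [], [words[0]]
        for word, is_break in zip(words[1:], breaks):
            if is_break:
                segments.append(" ".join(current))
                current = [word]
            else:
                current.append(word)
        segments.append(" ".join(current))
        yield segments


def score(segments):
    """Sum length-weighted frequencies of all multi-word segments."""
    total = 0
    for segment in segments:
        n = len(segment.split())
        if n < 2:
            continue  # single keywords contribute nothing in this sketch
        count = NGRAM_COUNTS.get(segment, 0)  # unseen n-grams count as 0
        if segment in WIKI_TITLES:
            count *= WIKI_BOOST
        total += (n ** n) * count  # assumed weight favouring longer phrases
    return total


def best_segmentation(query):
    """Return the highest-scoring segmentation of a non-empty query."""
    return max(segmentations(query.split()), key=score)


if __name__ == "__main__":
    print(best_segmentation("new york times square"))
```

With these toy numbers the segmenter prefers two well-attested two-word phrases over one three-word phrase plus a leftover keyword. Because both lookups are plain hash-table accesses, scoring a candidate costs time proportional to its number of segments.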

Index Terms

Query segmentation revisited
1. Information systems
  1. Information retrieval
    1. Information retrieval query processing

Published In

WWW '11: Proceedings of the 20th international conference on World wide web

March 2011

840 pages

ISBN: 9781450306324

DOI: 10.1145/1963405

General Chairs: S. Sadagopan (IIIT-Bangalore, India), Krithi Ramamritham (IIT-Bombay, India), Arun Kumar (IBM Research, India), M. P. Ravindra (Infosys E & R, India)

Program Chairs: Elisa Bertino (Purdue University, USA), Ravi Kumar (Yahoo! Research, USA)

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web
The International Institute of Information Technology Bangalore

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

WWW '11: 20th International World Wide Web Conference

March 28 - April 1, 2011

Hyderabad, India

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Bibliometrics

Total Citations: 43
Total Downloads: 593
Downloads (Last 12 months): 17
Downloads (Last 6 weeks): 4

Reflects downloads up to 03 Oct 2024
