Extracting structured information from user queries with semi-supervised conditional random fields

Authors:

Alex Acero

SIGIR '09: Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval

Pages 572 - 579

https://doi.org/10.1145/1571941.1572039

Published: 19 July 2009

Abstract

When search is against structured documents, it is beneficial to extract information from user queries in a format that is consistent with the backend data structure. As one step toward this goal, we study the problem of query tagging, which is to assign each query term to a pre-defined category. Our problem could be approached by learning a conditional random field (CRF) model (or other statistical models) in a supervised fashion, but this would require substantial human-annotation effort. In this work, we focus on a semi-supervised learning method for CRFs that utilizes two data sources: (1) a small number of manually labeled queries, and (2) a large number of queries in which some word tokens have derived labels, i.e., label information automatically obtained from additional resources. We present two principled ways of encoding derived label information in a CRF model. Such information is viewed as hard evidence in one setting and as soft evidence in the other. In addition to the general methodology of how to use derived labels in semi-supervised CRFs, we also present a practical method for obtaining them by leveraging user click data and an in-domain database that contains structured documents. Evaluation on product search queries shows the effectiveness of our approach in improving tagging accuracy.
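The distinction between hard and soft evidence can be made concrete with a small sketch. The abstract does not give the paper's formulation, so the code below is only one plausible reading: hard evidence is modeled as a constraint that removes disallowed labels from the linear-chain lattice, and soft evidence as an additive log-weight on each label. The label set, the toy potentials, and every function name are hypothetical.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) that tolerates -inf entries."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_partition(emit, trans, allowed=None, soft=None):
    """Forward algorithm over a linear-chain label lattice.

    emit[t][y]  : log-potential of label y at position t
    trans[a][b] : log-potential of label a followed by label b
    allowed[t]  : set of permitted labels at t, or None (assumed hard evidence)
    soft[t][y]  : extra log-weight for label y at t (assumed soft evidence)
    """
    n_labels = len(trans)

    def node(t, y):
        # Hard evidence: paths through a disallowed label get zero weight.
        if allowed is not None and allowed[t] is not None and y not in allowed[t]:
            return -math.inf
        # Soft evidence: reweight the label instead of forbidding it.
        bonus = soft[t][y] if soft is not None else 0.0
        return emit[t][y] + bonus

    alpha = [node(0, y) for y in range(n_labels)]
    for t in range(1, len(emit)):
        alpha = [
            logsumexp([alpha[a] + trans[a][y] for a in range(n_labels)]) + node(t, y)
            for y in range(n_labels)
        ]
    return logsumexp(alpha)


def evidence_log_score(emit, trans, allowed=None, soft=None):
    """log Z(with evidence) - log Z(without evidence).

    With hard evidence only, this is the log marginal probability of the
    derived labels, summing over all labels of the unconstrained tokens.
    """
    return log_partition(emit, trans, allowed, soft) - log_partition(emit, trans)


if __name__ == "__main__":
    # Hypothetical labels for a three-token product query.
    labels = ["Brand", "Model", "Type"]
    emit = [[0.5, -0.2, 0.1], [0.0, 1.0, -1.0], [0.3, 0.3, 2.0]]
    trans = [[0.0, 0.8, -0.5], [-0.3, 0.2, 0.9], [0.1, -0.4, 0.0]]

    # Hard evidence: the first token has the derived label "Brand".
    hard = [{labels.index("Brand")}, None, None]
    print("hard:", evidence_log_score(emit, trans, allowed=hard))

    # Soft evidence: the same derived label only receives a bonus.
    soft = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    print("soft:", evidence_log_score(emit, trans, soft=soft))
```

In a training loop under these assumptions, the hard-evidence score would be maximized over the partially labeled queries alongside the ordinary log-likelihood of the manually labeled ones. Setting a soft log-weight to negative infinity reproduces the hard constraint, which is why both settings share one forward pass here.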

Index Terms

Extracting structured information from user queries with semi-supervised conditional random fields
1. Information systems
  1. Information retrieval
    1. Information retrieval query processing

Information

Published In

SIGIR '09: Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval

July 2009

896 pages

ISBN:9781605584836

DOI:10.1145/1571941

General Chairs:
James Allan, University of Massachusetts Amherst, USA
Javed Aslam, Northeastern University, USA

Program Chairs:
Mark Sanderson, University of Sheffield, UK
ChengXiang Zhai, University of Illinois at Urbana-Champaign, USA
Justin Zobel, University of Melbourne, Australia

Copyright © 2009 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SIGIR '09

SIGIR '09: The 32nd International ACM SIGIR conference on research and development in Information Retrieval

July 19 - 23, 2009

Boston, MA, USA

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%

Bibliometrics

Article Metrics

Total Citations: 55
Total Downloads: 1,070
Downloads (Last 12 months): 6
Downloads (Last 6 weeks): 1

Reflects downloads up to 02 Feb 2025

Cited By

Bassani E, Pasi G (2022). Evaluating the Use of Synthetic Queries for Pre-training a Semantic Query Tagger. Advances in Information Retrieval, pages 39-46. Online publication date: 10-Apr-2022.
https://dl.acm.org/doi/10.1007/978-3-030-99739-7_5

Bassani E, Pasi G, Diaz F, Shah C, Suel T, Castells P, Jones R, Sakai T (2021). Semantic Query Labeling Through Synthetic Query Generation. Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2278-2282. Online publication date: 11-Jul-2021.
https://dl.acm.org/doi/10.1145/3404835.3463071

Maji S, Patel P, Thakarar B, Kumar M, Tripathi K (2020). A Regularised Intent Model for Discovering Multiple Intents in E-Commerce Tail Queries. Advances in Information Retrieval, pages 651-665. Online publication date: 8-Apr-2020.
https://doi.org/10.1007/978-3-030-45439-5_43

Lee W, Choi J (2019). Precursor-induced conditional random fields: connecting separate entities by induction for improved clinical named entity recognition. BMC Medical Informatics and Decision Making, 19:1. Online publication date: 15-Jul-2019.
https://doi.org/10.1186/s12911-019-0865-1

Ruiz A, Rudovic O, Binefa X, Pantic M (2018). Multi-Instance Dynamic Ordinal Random Fields for Weakly Supervised Facial Behavior Analysis. IEEE Transactions on Image Processing, 27:8, pages 3969-3982. Online publication date: Aug-2018.
https://doi.org/10.1109/TIP.2018.2830189

Balog K, Balog K (2018). Understanding Information Needs. Entity-Oriented Search, pages 225-267. Online publication date: 3-Oct-2018.
https://doi.org/10.1007/978-3-319-93935-3_7

Keyaki A, Miyazaki J, Shin S, Shin D, Lencastre M (2017). Part-of-speech tagging for web search queries using a large-scale web corpus. Proceedings of the Symposium on Applied Computing, pages 931-937. Online publication date: 3-Apr-2017.
https://dl.acm.org/doi/10.1145/3019612.3019694

Wong T, Xie H, Lam W, Wang F (2017). A learning framework for information block search based on probabilistic graphical models and Fisher Kernel. International Journal of Machine Learning and Cybernetics, 9:9, pages 1473-1487. Online publication date: 28-Mar-2017.
https://doi.org/10.1007/s13042-017-0657-9

Chiang F, Andritsos P, Miller R (2016). Data Driven Discovery of Attribute Dictionaries. Transactions on Computational Collective Intelligence XXI - Volume 9630, pages 69-96. Online publication date: 1-Jan-2016.
https://dl.acm.org/doi/10.5555/3090176.3090180

Zhai K, Kozareva Z, Hu Y, Li Q, Guo W, Perego R, Sebastiani F, Aslam J, Ruthven I, Zobel J (2016). Query to Knowledge. Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, pages 255-264. Online publication date: 7-Jul-2016.
https://dl.acm.org/doi/10.1145/2911451.2911495
