DOI: 10.1145/1341531.1341553
research-article

Personal name classification in web queries

Published: 11 February 2008

Abstract

Personal names are an important kind of Web query, and yet they are special in many ways. Strategies for retrieving information on personal names should therefore differ from the strategies used for other types of queries. To improve search quality for personal names, a first step is to detect whether a query is a personal name. Despite the importance of this problem, relatively little previous research has addressed it. Since Web queries are usually short, conventional supervised machine-learning algorithms cannot be applied directly. An alternative is to apply heuristic rules coupled with name-term dictionaries. However, when the dictionaries are small, this method tends to produce false negatives; when the dictionaries are large, it tends to produce false positives. A more serious problem is that this method cannot provide a good trade-off between precision and recall. To solve these problems, we propose an approach based on the construction of probabilistic name-term dictionaries and personal name grammars, and use it to predict the probability that a query is a personal name. In this paper, we develop four different methods for building probabilistic name-term dictionaries, in which each term is assigned the probability of its being a name term. We compared our approach with baseline algorithms, such as dictionary-based look-up methods and supervised classification algorithms including logistic regression and SVM, on some manually labeled test sets. The results validate the effectiveness of our approach, whose F1 value is more than 79.8%, which outperforms the best baseline by more than 11.3%.
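The idea of combining a probabilistic name-term dictionary with a name grammar can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not give the grammars, the four dictionary-building methods, or the scoring formula. The dictionary entries, the single "first name, last name" grammar, the product of term probabilities, the `UNSEEN_PROB` smoothing value, and the function names below are all assumptions made for illustration.

```python
from typing import Dict

# Hypothetical probabilistic dictionaries (illustrative values, not from the paper):
# each term maps to an assumed probability of being a first or last name.
FIRST_NAME_PROB: Dict[str, float] = {"john": 0.95, "david": 0.93, "paris": 0.30}
LAST_NAME_PROB: Dict[str, float] = {"smith": 0.90, "beckham": 0.85, "hilton": 0.60}

# Assumed fallback probability for terms missing from a dictionary.
UNSEEN_PROB = 0.01


def name_probability(query: str) -> float:
    """Score a query against one assumed grammar: <first name> <last name>.

    The score is the product of the two term probabilities, which treats
    the terms as independent (a simplifying assumption of this sketch).
    """
    terms = query.lower().split()
    if len(terms) != 2:
        return 0.0  # the assumed grammar only covers two-term queries
    first, last = terms
    p_first = FIRST_NAME_PROB.get(first, UNSEEN_PROB)
    p_last = LAST_NAME_PROB.get(last, UNSEEN_PROB)
    return p_first * p_last


def is_personal_name(query: str, threshold: float = 0.5) -> bool:
    """Classify a query by thresholding its name probability."""
    return name_probability(query) >= threshold


if __name__ == "__main__":
    for q in ["John Smith", "paris hilton", "cheap flights"]:
        print(q, name_probability(q), is_personal_name(q))
```

Because the output is a probability rather than a yes/no dictionary hit, the decision threshold can be moved: a lower threshold accepts ambiguous queries such as "paris hilton" (favoring recall), while a higher one rejects them (favoring precision). A plain dictionary look-up has no such control, which is the limitation the abstract points out.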


Published In

WSDM '08: Proceedings of the 2008 International Conference on Web Search and Data Mining
February 2008
270 pages
ISBN:9781595939272
DOI:10.1145/1341531
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. personal name classification
  2. probabilistic dictionaries
  3. web query
  4. web search

Qualifiers

  • Research-article

Acceptance Rates

Overall Acceptance Rate 498 of 2,863 submissions, 17%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 3
  • Downloads (Last 6 weeks): 0
Reflects downloads up to 08 Feb 2025

Citations

Cited By

  • (2017) MC4WEPS. Language Resources and Evaluation 51:3 (805-832). DOI: 10.1007/s10579-016-9365-4. Online publication date: 1-Sep-2017
  • (2017) Person Name Disambiguation in the Web Using Adaptive Threshold Clustering. Journal of the Association for Information Science and Technology 68:7 (1751-1762). DOI: 10.1002/asi.23810. Online publication date: 1-Jul-2017
  • (2012) Research on optimizing the merging results of multiple independent retrieval systems by a discrete particle swarm optimization. Journal of Electronics (China) 29:1-2 (111-119). DOI: 10.1007/s11767-012-0751-9. Online publication date: 5-Jun-2012
  • (2011) Joint annotation of search queries. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1 (102-111). DOI: 10.5555/2002472.2002486. Online publication date: 19-Jun-2011
  • (2010) Structural annotation of search queries using pseudo-relevance feedback. Proceedings of the 19th ACM international conference on Information and knowledge management (1537-1540). DOI: 10.1145/1871437.1871666. Online publication date: 26-Oct-2010
  • (2009) Improving web search relevance with semantic features. Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2 - Volume 2 (648-657). DOI: 10.5555/1699571.1699597. Online publication date: 6-Aug-2009
  • (2009) Understanding user's query intent with wikipedia. Proceedings of the 18th international conference on World wide web (471-480). DOI: 10.1145/1526709.1526773. Online publication date: 20-Apr-2009
  • (2008) Query suggestion using hitting time. Proceedings of the 17th ACM conference on Information and knowledge management (469-478). DOI: 10.1145/1458082.1458145. Online publication date: 26-Oct-2008
  • (2008) Large scale learning and recognition of faces in web videos. 2008 8th IEEE International Conference on Automatic Face & Gesture Recognition (1-7). DOI: 10.1109/AFGR.2008.4813381. Online publication date: Sep-2008
