Characteristics of character usage in Chinese Web searching

Published: 01 January 2009

Abstract

Non-English Web search engines are now widely used. Given the popularity of Chinese Web searching and the distinctive characteristics of the Chinese language, studies that focus on Chinese Web search queries are needed. In this paper, we report our research on character usage in the Chinese search logs of a Web search engine in Hong Kong. By examining the distribution of search query terms, we found that users tended to use more diversified terms and that character usage in search queries differed considerably from character usage in general online Chinese text. After studying the Zipf distribution of n-grams for different values of n, we found that the unigram curve is the most curved of all, that the bigram curve follows the Zipf distribution best, and that the curves for larger n (n = 3-6) have similar structures, with β-values in the range 0.66-0.86. The distribution of combined n-grams was also studied. All analyses were performed on the data both before and after the removal of function terms and incomplete terms, and the findings were similar. We believe these findings offer insights for further research on non-English Web searching and will assist in the design of more effective Chinese Web search engines.
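To make the n-gram analysis described in the abstract more concrete, here is a minimal Python sketch of one way to build a rank-frequency list of character n-grams from queries and estimate a Zipf-style exponent. It is an illustration under stated assumptions, not the authors' procedure: the function names (`char_ngrams`, `rank_frequency`, `zipf_slope`) are hypothetical, n-grams are taken within each whitespace-separated term, and the exponent is estimated by an ordinary least-squares fit of log frequency against log rank, assuming frequency is roughly proportional to rank raised to the power of minus β. The abstract does not state how the β-values were actually computed.

```python
from collections import Counter
import math


def char_ngrams(query, n):
    """Return overlapping character n-grams taken within each
    whitespace-separated term of a query (an assumption of this sketch)."""
    grams = []
    for term in query.split():
        grams.extend(term[i:i + n] for i in range(len(term) - n + 1))
    return grams


def rank_frequency(queries, n):
    """Count n-grams over all queries; return frequencies sorted high to low."""
    counts = Counter()
    for q in queries:
        counts.update(char_ngrams(q, n))
    return sorted(counts.values(), reverse=True)


def zipf_slope(freqs):
    """Least-squares slope of log(frequency) against log(rank).

    If f(r) is proportional to r ** (-beta), the slope equals -beta.
    """
    if len(freqs) < 2:
        raise ValueError("need at least two ranks to fit a slope")
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx
```

With a real query log, `rank_frequency(queries, n)` would be called once for each n from 1 to 6 and the fitted slopes compared. A strongly curved log-log plot, as the abstract reports for unigrams, would mean that a single slope summarizes the data poorly.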

Published In

Information Processing and Management: an International Journal  Volume 45, Issue 1
January, 2009
175 pages

Publisher

Pergamon Press, Inc.

United States

Author Tags

  1. Character usage
  2. Chinese
  3. Search log analysis
  4. Web mining
  5. Zipf distribution

Qualifiers

  • Article
