DOI: 10.1145/2184751.2184865
research-article

Web community analysis and its application to language specific crawling

Published: 20 February 2012

Abstract

This paper proposes a novel metric for web community analysis, called language homogeneity. The language homogeneity of a community is the ratio of web pages written in a specific language to all web pages in the community. This simple analysis can provide additional insight into the characteristics of web communities. We analyze web communities extracted from large Thai web datasets with respect to (1) community size distribution, (2) similarity with a web directory, and (3) Thai language homogeneity. Interestingly, we found that most Thai web communities are linguistically homogeneous: web pages inside the same community tend to be written in the same language. Based on these results, we argue that the linguistic homogeneity of web communities can be used to enhance language specific crawling. To this end, we point out current limitations of a language specific crawler and suggest possible ways of exploiting communities' language homogeneity to improve the performance of language specific crawling.
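
The ratio described in the abstract can be illustrated with a minimal Python sketch. It assumes each page in a community already carries a language label (for example "th" or "en") produced by some language identifier; the abstract does not say how pages are labeled, and the function name `language_homogeneity` and the sample data are hypothetical.

```python
from collections import Counter


def language_homogeneity(page_languages, target_language):
    """Return the fraction of pages in one community labeled target_language.

    page_languages: one language label per page (assumed to be given;
    how the labels are obtained is outside this sketch).
    """
    if not page_languages:
        raise ValueError("community has no pages")
    counts = Counter(page_languages)
    # Counter returns 0 for a language that never occurs in the community.
    return counts[target_language] / len(page_languages)


# Hypothetical community of four pages, three of them labeled Thai.
community = ["th", "th", "th", "en"]
print(language_homogeneity(community, "th"))  # 0.75
```

A value close to 1 would correspond to what the abstract calls a linguistically homogeneous community for the chosen language.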


Published In

ICUIMC '12: Proceedings of the 6th International Conference on Ubiquitous Information Management and Communication
February 2012
852 pages
ISBN:9781450311724
DOI:10.1145/2184751

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. language homogeneity
  2. language specific web crawling
  3. web community mining and analysis

Qualifiers

  • Research-article

Conference

ICUIMC '12
Acceptance Rates

Overall Acceptance Rate 251 of 941 submissions, 27%
