Abstract
This work is motivated by the need for a Thai web corpus to test our information retrieval algorithm. Two collections of news web documents are gathered from two different Thai newspaper web sites. Our goal is to find a simple yet effective method to extract news articles from these web collections. We explore the use of machine learning methods to distinguish article pages from non-article pages, such as tables of contents and advertisements. The selected web articles are then compared in a fine-grained manner in order to find informative structures. Both steps of information extraction use the structural features of web documents rather than extracted keywords or terms. Thus, the inherent errors of word segmentation, one of the major problems in Thai text processing, do not affect this method.
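As an illustration of what structure-only features might look like, the following is a minimal sketch and not the authors' implementation. The abstract does not list the features actually used, so the feature names (`num_paragraphs`, `num_links`, `num_tables`, `link_text_ratio`) and the helper `structural_features` are hypothetical. The sketch measures text by character count, so no word segmentation is needed.

```python
from collections import Counter
from html.parser import HTMLParser


class StructuralFeatureParser(HTMLParser):
    """Counts tags and measures how much visible text sits inside links."""

    SKIPPED = {"script", "style"}  # their contents are not visible text

    def __init__(self):
        super().__init__()
        self.tag_counts = Counter()
        self.link_depth = 0
        self.skip_depth = 0
        self.link_text_chars = 0
        self.total_text_chars = 0

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] += 1
        if tag == "a":
            self.link_depth += 1
        elif tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.link_depth > 0:
            self.link_depth -= 1
        elif tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth > 0:
            return
        # Character counts avoid any dependence on word segmentation.
        n = len(data.strip())
        self.total_text_chars += n
        if self.link_depth > 0:
            self.link_text_chars += n


def structural_features(html):
    """Return a dict of hypothetical structure-only features for one page."""
    parser = StructuralFeatureParser()
    parser.feed(html)
    parser.close()
    total = parser.total_text_chars
    return {
        "num_paragraphs": parser.tag_counts["p"],
        "num_links": parser.tag_counts["a"],
        "num_tables": parser.tag_counts["table"],
        "link_text_ratio": parser.link_text_chars / total if total else 0.0,
    }


if __name__ == "__main__":
    article = "<p>Some long article text here.</p><p>More text.</p><a href='x'>home</a>"
    index = "<ul><li><a href='a'>One</a></li><li><a href='b'>Two</a></li></ul>"
    print(structural_features(article))
    print(structural_features(index))
```

Feature vectors of this kind could be passed to any standard classifier. The intuition is that index-like pages tend to have most of their text inside links, while article pages tend to have more paragraph text outside links.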
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Tongchim, S., Sornlertlamvanich, V., Isahara, H. (2006). Classification of News Web Documents Based on Structural Features. In: Salakoski, T., Ginter, F., Pyysalo, S., Pahikkala, T. (eds) Advances in Natural Language Processing. FinTAL 2006. Lecture Notes in Computer Science(), vol 4139. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11816508_17
DOI: https://doi.org/10.1007/11816508_17
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-37334-6
Online ISBN: 978-3-540-37336-0
eBook Packages: Computer Science, Computer Science (R0)