Abstract
Web pages can be classified precisely by evaluating their features, and structural features are an effective complement to textual ones. Because different classifiers have different characteristics, several classifiers can be combined so that they complement one another. In this study, we propose a web page classification method based on heterogeneous features and a combination of multiple classifiers. Rather than counting the frequency of HTML tags, we exploit the tree-like structure of HTML tags to characterize the structural features of a web page. Heterogeneous textual features and the proposed tree-like structural features are converted into vectors and fused. We propose confidence, computed as a classifier's classification accuracy on a set of samples, as a criterion for comparing the classification results of different classifiers. Multiple classifiers are then combined based on confidence using different decision strategies, such as voting, confidence comparison, and direct output, to give the final classification result. Experimental results show that the accuracies on the Amazon, 7-web-genres, and DMOZ datasets increase to 94.2%, 95.4%, and 95.7%, respectively. Fusing the textual features with the proposed structural features gives a more comprehensive representation and yields higher accuracy than using textual features alone. Combining multiple classifiers further improves accuracy, which exceeds that of related web page classification algorithms.
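The confidence-based combination described in the abstract can be illustrated with a minimal Python sketch. The abstract does not state when each decision strategy applies, so the rule used here is an assumption: unanimous agreement gives direct output, a strict majority is settled by voting, and any other disagreement is resolved by confidence comparison. The function names `confidence` and `combine` are hypothetical and are not taken from the paper.

```python
from collections import Counter


def confidence(predictions, labels):
    """Confidence of one classifier: its accuracy on a set of labeled samples."""
    if not labels:
        raise ValueError("need at least one labeled sample")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def combine(predictions, confidences):
    """Combine one prediction per classifier into a final label.

    Returns (label, strategy). The order in which the strategies are
    tried is an assumption made for illustration only.
    """
    counts = Counter(predictions)
    label, votes = counts.most_common(1)[0]
    if votes == len(predictions):
        # Every classifier agrees: output the shared label directly.
        return label, "direct output"
    if votes > len(predictions) / 2:
        # A strict majority agrees: take the majority label.
        return label, "voting"
    # No majority: trust the classifier with the highest confidence.
    best = max(range(len(predictions)), key=lambda i: confidences[i])
    return predictions[best], "confidence comparison"
```

For example, with three classifiers whose confidences are 0.7, 0.9, and 0.8, the predictions `["news", "shop", "blog"]` have no majority, so this sketch returns the label of the second classifier.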
Author information
Contributions
Ji-zhong SHEN and Xin DU designed the research. Li DENG processed the data and drafted the manuscript. Ji-zhong SHEN helped organize the manuscript. Li DENG, Xin DU, and Ji-zhong SHEN revised and finalized the paper.
Ethics declarations
Li DENG, Xin DU, and Ji-zhong SHEN declare that they have no conflict of interest.
Additional information
Project supported by the National Natural Science Foundation of China (No. 61471314) and the Welfare Technology Research Project of Zhejiang Province, China (No. LGG18F010003).
About this article
Cite this article
Deng, L., Du, X. & Shen, Jz. Web page classification based on heterogeneous features and a combination of multiple classifiers. Front Inform Technol Electron Eng 21, 995–1004 (2020). https://doi.org/10.1631/FITEE.1900240
DOI: https://doi.org/10.1631/FITEE.1900240