
Web page classification based on heterogeneous features and a combination of multiple classifiers

Frontiers of Information Technology & Electronic Engineering

Abstract

Precise web page classification relies on evaluating the features of web pages, and the structural features of a page effectively complement its textual features. Classifiers differ in their characteristics, so multiple classifiers can be combined to complement one another. In this study, we propose a web page classification method based on heterogeneous features and a combination of multiple classifiers. Rather than computing the frequency of HTML tags, we exploit the tree-like structure of HTML tags to characterize the structural features of a web page. Heterogeneous textual features and the proposed tree-like structural features are converted into vectors and fused. Confidence, obtained by calculating the classification accuracy on a set of samples, is proposed as a criterion for comparing the classification results of different classifiers. Multiple classifiers are then combined based on confidence using different decision strategies, such as voting, confidence comparison, and direct output, to give the final classification result. Experimental results show that the accuracies on the Amazon, 7-web-genres, and DMOZ datasets increase to 94.2%, 95.4%, and 95.7%, respectively. Fusing the textual features with the proposed structural features is a comprehensive approach that achieves higher accuracy than using textual features alone. Combining multiple classifiers further improves the accuracy of web page classification, which exceeds that of related web page classification algorithms.
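
The abstract does not give the exact feature definition or decision rules, so the following Python sketch is only an illustration under stated assumptions. It uses counts of root-to-node tag paths as a simple stand-in for a tree-like structural feature, treats confidence as accuracy on a held-out sample set, and maps the three named strategies (direct output, voting, confidence comparison) to unanimous, majority, and no-majority cases. The names `TagPathCounter`, `tag_path_counts`, `confidence`, and `combine`, as well as that mapping, are hypothetical and not taken from the paper.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TagPathCounter(HTMLParser):
    """Counts root-to-node tag paths (an assumed stand-in structural feature)."""

    def __init__(self):
        super().__init__()
        self.stack = []          # currently open tags, outermost first
        self.paths = Counter()   # e.g. "html/body/div/p" -> occurrences

    def handle_starttag(self, tag, attrs):
        self.paths["/".join(self.stack + [tag])] += 1
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray end tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def tag_path_counts(html):
    """Return a Counter of tag paths for one HTML document."""
    parser = TagPathCounter()
    parser.feed(html)
    parser.close()
    return parser.paths


def confidence(predicted, actual):
    """Assumed confidence: accuracy of one classifier on a set of samples."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


def combine(predictions, confidences):
    """Combine per-classifier labels; returns (label, strategy used).

    predictions: dict mapping classifier name -> predicted label
    confidences: dict mapping classifier name -> confidence in [0, 1]
    """
    votes = Counter(predictions.values())
    label, count = votes.most_common(1)[0]
    if count == len(predictions):        # all classifiers agree
        return label, "direct"
    if count > len(predictions) / 2:     # strict majority agrees
        return label, "voting"
    # No majority: trust the classifier with the highest confidence.
    best = max(predictions, key=lambda name: confidences[name])
    return predictions[best], "confidence"
```

In this sketch, the path counts from `tag_path_counts` would be vectorized and concatenated with textual feature vectors before training, and `combine` would be called once per page with each classifier's prediction. How the paper actually vectorizes the tree structure and selects among strategies may differ.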

Author information

Contributions

Ji-zhong SHEN and Xin DU designed the research. Li DENG processed the data and drafted the manuscript. Ji-zhong SHEN helped organize the manuscript. Li DENG, Xin DU, and Ji-zhong SHEN revised and finalized the paper.

Corresponding author

Correspondence to Ji-zhong Shen.

Ethics declarations

Li DENG, Xin DU, and Ji-zhong SHEN declare that they have no conflict of interest.

Additional information

Project supported by the National Natural Science Foundation of China (No. 61471314) and the Welfare Technology Research Project of Zhejiang Province, China (No. LGG18F010003).

About this article

Cite this article

Deng, L., Du, X. & Shen, Jz. Web page classification based on heterogeneous features and a combination of multiple classifiers. Front Inform Technol Electron Eng 21, 995–1004 (2020). https://doi.org/10.1631/FITEE.1900240

  • DOI: https://doi.org/10.1631/FITEE.1900240
