A semantic based Web page classification strategy using multi-layered domain ontology

Saleh, Ahmed I.; Al Rahmawy, Mohammed F.; Abulwafa, Arwa E.

doi:10.1007/s11280-016-0415-z

Published: 26 October 2016

World Wide Web, volume 20, pages 939–993 (2017)

Ahmed I. Saleh¹,
Mohammed F. Al Rahmawy² &
Arwa E. Abulwafa¹

Abstract

The World Wide Web is growing continuously, and the volume of Web content is expected to increase substantially over the next few years. Hence, there is a strong need for algorithms that can classify Web pages accurately. Automatic Web page classification differs significantly from traditional text classification because of the additional information provided by the HTML structure. Recently, several techniques have arisen from combinations of artificial intelligence and statistical approaches. However, finding an optimal classification technique for Web pages is not a simple matter. This paper introduces a novel strategy for vertical Web page classification called Classification using Multi-layered Domain Ontology (CMDO). It employs several Web mining techniques and depends mainly on a proposed multi-layered domain ontology. To improve classification accuracy, CMDO uses a distiller to reject pages related to other domains. CMDO also employs a novel classification technique called Graph Based Classification (GBC). The proposed GBC has features that other techniques lack, such as outlier rejection and pruning. Experimental results show that CMDO outperforms recent techniques, achieving better precision, recall, and classification accuracy.
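The abstract does not give the internals of CMDO or GBC, so the following is only a toy illustration of the general "distill, then classify" idea it describes, not the authors' method. The two-layer `ONTOLOGY` dictionary, the term-overlap score, the `threshold` value, and all function names are assumptions made for this sketch.

```python
import re
from collections import Counter

# Hypothetical layered ontology (an assumption, not taken from the paper):
# domain -> class -> concept terms.
ONTOLOGY = {
    "computing": {
        "hardware": {"cpu", "memory", "disk", "motherboard"},
        "software": {"compiler", "kernel", "library", "debugger"},
    }
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def domain_score(tokens, domain):
    """Fraction of tokens that match any concept term of the domain."""
    if not tokens:
        return 0.0
    terms = set().union(*ONTOLOGY[domain].values())
    return sum(t in terms for t in tokens) / len(tokens)


def classify(text, domain="computing", threshold=0.1):
    """Return a class label, or None if the page is rejected."""
    tokens = tokenize(text)
    # Distiller-like step (assumed form): reject pages that appear
    # unrelated to the target domain.
    if domain_score(tokens, domain) < threshold:
        return None
    # Simple stand-in classifier: pick the class whose concept terms
    # occur most often. This is NOT the paper's Graph Based Classification.
    counts = Counter(tokens)
    scores = {
        cls: sum(counts[t] for t in terms)
        for cls, terms in ONTOLOGY[domain].items()
    }
    return max(scores, key=scores.get)
```

With this sketch, a page about kernels and compilers is assigned to the hypothetical "software" class, while a page with no ontology terms at all (a recipe, say) is rejected before classification. Rejecting out-of-domain pages first keeps the classifier from being forced to assign a label to content it was never meant to handle.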

Author information

Authors and Affiliations

Department of Computer Engineering & Systems, Faculty of Engineering, Mansoura University, Mansoura, Egypt
Ahmed I. Saleh & Arwa E. Abulwafa
Department of Computer Science, Faculty of Computers and Information, Mansoura University, Mansoura, Egypt
Mohammed F. Al Rahmawy

Corresponding author

Correspondence to Arwa E. Abulwafa.

About this article

Cite this article

Saleh, A.I., Al Rahmawy, M.F. & Abulwafa, A.E. A semantic based Web page classification strategy using multi-layered domain ontology. World Wide Web 20, 939–993 (2017). https://doi.org/10.1007/s11280-016-0415-z

Received: 03 February 2016
Revised: 13 August 2016
Accepted: 29 August 2016
Published: 26 October 2016
Issue Date: September 2017
DOI: https://doi.org/10.1007/s11280-016-0415-z
