Authors:
António Videira and Nuno Goncalves
Affiliation:
University of Coimbra, Portugal
Keyword(s):
Web Page Classification, Feature Extraction, Feature Selection, Machine Learning.
Related Ontology Subjects/Areas/Topics:
Searching and Browsing; Web Information Systems and Technologies; Web Interfaces and Applications
Abstract:
There is a constantly growing need for automatic classification techniques with higher classification
accuracy. Current systems classify and process web pages automatically using the text content of those
pages, whereas little work has been done on using the visual content of a web page. Our work therefore
focuses on classifying web pages using only their visual content. First, a descriptor is constructed by
extracting different features from each page: simple color and edge histograms, Gabor features and
Tamura features. Two feature selection methods, one based on the Chi-Square criterion and the other on
Principal Component Analysis, are then applied to that descriptor to select the most discriminative
attributes. Another approach uses the Bag of Words (BoW) model, treating the SIFT local features
extracted from each image as words from which a dictionary is constructed. We then classify web pages
based on their aesthetic value, their recency and their type of content. The machine learning methods
used in this work are Naive Bayes, Support Vector Machine, Decision Tree and AdaBoost. Different tests
are performed to evaluate the performance of each classifier. Finally, we show that the visual appearance
of a web page carries rich information that is not exploited by current web crawlers, which rely only on
text content.
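
The Chi-Square selection step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `chi2_scores` and `select_top_k`, the toy descriptor values and the labels "old" and "new" are all assumptions made for illustration. The statistic compares the per-class sum of each non-negative feature (for example, a histogram bin) with the sum expected if the feature were independent of the class.

```python
def chi2_scores(X, y):
    """Chi-square score of each non-negative feature against the class labels."""
    n_samples = len(X)
    n_features = len(X[0])
    classes = sorted(set(y))
    # Observed value: sum of each feature over the samples of each class.
    observed = {c: [0.0] * n_features for c in classes}
    for row, label in zip(X, y):
        for j, value in enumerate(row):
            observed[label][j] += value
    totals = [sum(observed[c][j] for c in classes) for j in range(n_features)]
    class_freq = {c: y.count(c) / n_samples for c in classes}
    scores = []
    for j in range(n_features):
        score = 0.0
        for c in classes:
            # Expected value if feature j were independent of the class.
            expected = class_freq[c] * totals[j]
            if expected > 0:
                score += (observed[c][j] - expected) ** 2 / expected
        scores.append(score)
    return scores


def select_top_k(X, y, k):
    """Keep the k features with the highest chi-square scores."""
    scores = chi2_scores(X, y)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    keep = sorted(ranked[:k])
    return keep, [[row[j] for j in keep] for row in X]


# Hypothetical toy descriptor: 4 pages, 3 histogram-like features each.
X = [[8, 1, 3], [9, 0, 3], [1, 7, 3], [0, 9, 3]]
y = ["old", "old", "new", "new"]  # assumed recency labels
keep, reduced = select_top_k(X, y, k=2)
```

In this toy data the third feature has the same value on every page, so its score is zero and it is dropped, while the two features that differ between the classes are kept. The reduced descriptor could then be passed to any of the classifiers listed in the abstract.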