Tarek Amr Abdallah
Beatriz de la Iglesia
University of East Anglia, United Kingdom
Language Models, Information Retrieval, Web Classification, Web Mining, Machine Learning.
Artificial Intelligence
Clustering and Classification Methods
Computational Intelligence
Evolutionary Computing
Knowledge Discovery and Information Retrieval
Knowledge-Based Systems
Machine Learning
Methodologies and Technologies
Mining Text and Semi-Structured Data
Operational Research
Soft Computing
Symbolic Systems
Web Mining
This paper is concerned with the classification of web pages using their Uniform Resource Locators (URLs) only. There are a number of contexts in which it is important to classify a web page efficiently and reliably from its URL, without the need to visit the page itself. For example, emails or social media messages may contain URLs that require automatic classification. A URL is very concise and may be composed of concatenated words, so classification using only this information is a very challenging task.
Much of the current research on URL-based classification has achieved reasonable accuracy, but current methods do not scale well to large datasets. In this paper, we propose a new solution based on an n-gram language model. Our solution shows good classification performance and scales to larger datasets. It also allows us to tackle the problem of classifying new URLs that contain unseen sub-sequences.
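To make the general idea concrete, the following is a minimal sketch of URL classification with per-class character n-gram language models. It is an illustration under stated assumptions, not the method evaluated in this paper: the class name `NgramUrlClassifier`, the choice of character trigrams, the add-gamma smoothing, the value of `gamma`, and the toy training URLs are all hypothetical. Each class gets its own model, a URL is assigned to the class whose model gives it the highest log-probability, and smoothing keeps the score finite when a URL contains sub-sequences that never occurred in training.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(url, n):
    """Return all overlapping character n-grams of a lowercased URL."""
    text = url.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class NgramUrlClassifier:
    """One character n-gram model per class (illustrative, add-gamma smoothing)."""

    def __init__(self, n=3, gamma=0.1):
        self.n = n                                  # n-gram order (assumed value)
        self.gamma = gamma                          # smoothing constant (assumed value)
        self.ngram_counts = defaultdict(Counter)    # label -> n-gram counts
        self.context_counts = defaultdict(Counter)  # label -> (n-1)-gram context counts
        self.alphabet = set()                       # characters seen in training

    def fit(self, urls, labels):
        for url, label in zip(urls, labels):
            for gram in char_ngrams(url, self.n):
                self.ngram_counts[label][gram] += 1
                self.context_counts[label][gram[:-1]] += 1
            self.alphabet.update(url.lower())
        return self

    def log_score(self, url, label):
        """Sum of log P(last char | previous n-1 chars) under the class model."""
        size = len(self.alphabet) + 1  # +1 reserves mass for unseen characters
        score = 0.0
        for gram in char_ngrams(url, self.n):
            seen = self.ngram_counts[label][gram]
            context = self.context_counts[label][gram[:-1]]
            score += math.log((seen + self.gamma) / (context + self.gamma * size))
        return score

    def predict(self, url):
        labels = list(self.ngram_counts)
        return max(labels, key=lambda label: self.log_score(url, label))


# Toy training data, invented purely for illustration.
train_urls = [
    "espn.com/football/scores",
    "skysports.com/football/news",
    "amazon.com/shop/cart",
    "ebay.com/shop/deals",
]
train_labels = ["sports", "sports", "shopping", "shopping"]

model = NgramUrlClassifier(n=3, gamma=0.1).fit(train_urls, train_labels)
print(model.predict("example.org/football/news"))
print(model.predict("example.org/shop/cart"))
```

Training in this sketch is a single counting pass over the URLs, and scoring is one pass over the n-grams of the query URL, which is the kind of property that makes language-model approaches attractive for large datasets. Working at the character level also sidesteps the need to segment concatenated words before classification.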