1 Introduction

With its huge volume of information resources, the World Wide Web is a powerful environment for distributing information. However, searching it easily leads to information overload, and recommender systems (RSs) help manage this information for users. An RS is a tool that offers useful products to users from among many possible options [25]. One of the fundamental tasks of a recommender system is to collect information about users’ interests and the items in the system. This information can be gathered by explicit or implicit methods [19]. In the explicit method, users state their interests directly. The implicit method is more difficult: the system must infer users’ interests by monitoring their behaviors and activities [5]. RSs are commonly divided into five main categories: content-based, collaborative filtering, demographic filtering, knowledge-based, and hybrid filtering techniques [2]. Such systems are used by most large companies, such as Facebook and Google [28].

Clustering is one of the most widely used algorithms in RSs [10]. A clustering technique groups session data into a number of similar clusters, which strongly influences RS performance. In a web browser, a user’s activity in navigating the desired pages is expressed as a session; a session therefore comprises a sequence of pages that reflects the user’s sequential pattern. User sessions are modeled by clustering them. Many RSs cluster sessions with k-means or modified k-means techniques [1, 8, 20, 23, 30,31,32]. These algorithms require the number of clusters K as an input and then randomly choose K initial seeds (cluster centers) from the dataset. However, guessing K for a large dataset without prior knowledge is difficult, and the result is highly sensitive to the quality of the initial seeds. This research therefore proposes a novel clustering method that finds the number of clusters automatically and uses high-quality initial seeds. One main contribution of the proposed RS is the vectorization of web user sessions using a linear combination: web server log files contain important information about users’ surfing, and this research uses the frequency and duration of page views to calculate users’ interest. Another major contribution is a session clustering method based on a genetic algorithm (GA), called ACGA, which finds the clusters automatically. ACGA treats clustering as an optimization problem, seeking the best value of a validation index to yield the most acceptable result. With the help of a novel fitness function, it finds clusters with high internal similarity; this function serves both as the fitness function and as a cluster evaluation technique.
The fitness function favors clustering solutions with compact clusters and large separations (gaps) between them. However, since the GA is time-consuming on big data, this research proposes a cluster-based similarity step that finds the initial clusters; the GA then takes the average of the data in each cluster as its input. The output of ACGA is a set of clusters, which by itself provides no information about the probability of viewing pages in the future. Sequential pattern discovery models, such as the Markov model (MM), are therefore needed to predict web pages. Researchers have used various orders of MM to predict the next pages in RSs [21, 30] and found that low-order MMs (1st or 2nd order) have low coverage and accuracy, while high-order MMs require large memory and have numerous states. The all-kth-order MM overcomes these problems. The authors of [7, 16, 22] used this model in the prediction phase, but they did not use the different orders of the MM simultaneously. Hence, the present research uses a modified all-kth-order MM that increases prediction accuracy by taking advantage of all orders of the MM concurrently, emphasizing the pages that have high probability in all orders of the MM.

This research compares the proposed RS with the Harmony Session Clustering Recommender (HSCR) [8], Harmony K-means Session Clustering Recommender (HKSCR) [8], Interleaved Harmony K-means Session Clustering Recommender (IHKSCR) [8], K-means with all-kth-order MM (KMM), and K-means with MM and Popularity and Similarity-based Page Rank (KMMPSPR) [30] on the CTI dataset, using three evaluation criteria: accuracy, coverage, and F-measure. The proposed RS outperforms the other techniques on the CTI dataset on all three criteria. Moreover, ACGA has been compared with the Automatic Clustering based on Differential Evolution (ACDE) [29] and K-means [14] algorithms using the Chou-Su (CS) index [4] and the novel cluster evaluation criterion Compactness and Separation Measure of Clustering (CSMC), and showed high performance.

Section 2 reviews and analyzes the previous studies in the field. Section 3 describes the proposed algorithm and its steps. Section 4 discusses the time complexity of the proposed algorithms. Section 5 evaluates and analyzes the output of the proposed recommender system. Section 6 concludes the research and provides a number of suggestions for further research.

2 Literature review

Clustering web sessions is an important part of web usage mining, a technique for discovering patterns in web data. Web sessions are represented by either vector or non-vector models. Many authors have used a binary vector representation for web sessions, while several session clustering approaches use non-vector models. Sequential patterns enclosed in the sessions have been used to predict the probability of page views with various Markov models. The main goal of this research is to propose a weighted vector model for sessions: since frequency and page-visit duration both indicate users’ interest, their combination can improve recommendation accuracy. Another main objective is to design automatic session clustering and to use a modified all-kth-order MM for page prediction. Sessions are clustered with the ACGA method, which treats clustering as an optimization problem: it searches the solution space for an optimal point that optimizes the objective function and clusters the sessions. Each solution in the population has a fitness value that depends on the function being optimized; this research uses a cluster evaluation criterion as the fitness function. The CSMC method is defined as a new validation index covering both the compactness and the separation of the clusters. The recommendation process, as a sequential decision process, is implemented with the modified all-kth-order MM, which combines page probabilities from different orders of the MM and recommends the pages with the highest probability. The remainder of this section reviews the literature on web page recommender systems, clustering techniques based on metaheuristic algorithms, and Markov models.

2.1 Web page recommender systems

An RS is a kind of intelligent decision-support system. It can suggest which web pages to visit next on a large website, guiding web users to the relevant information they need; this type of RS is called a web page recommender system. Such systems use various techniques to recommend pages to a user.

Numerous studies on recommender systems have employed K-means algorithms for clustering sessions [1, 24, 31, 32]. Researchers favor them for their simplicity; however, on large datasets they are highly sensitive to the chosen number of clusters and the initial seeds.

In “A novel optimization algorithm for recommender system using modified fuzzy c-means clustering approach”, Selvi et al. [27] proposed a new RS based on the collaborative filtering approach. Rating users are clustered with a minimal error rate using a modified fuzzy c-means (MFCM) clustering approach, and the users in each cluster are further optimized with a modified cuckoo search (MCS) algorithm that needs fewer iterations. By combining MCS with MFCM, the presented RS reduces the recommendation error rate and provides a highly accurate list of recommendations.

In “Web user session clustering using modified K-means algorithm”, Poornalatha et al. [23] presented a new method for clustering web users’ sessions. Since sessions have variable lengths, a new distance criterion, the Variable Length Vector Distance (VLVD), was defined, and the K-means method was modified accordingly for clustering web sessions. The modified algorithm needs fewer iterations than regular K-means and places sessions in appropriate clusters according to their similarity. However, the two main problems of the K-means technique were not addressed in this research.

Mishra et al. [17], in “A web recommendation system considering sequential information”, presented a recommender system based on the Sequence and Set Similarity Measure (S3M) and Singular Value Decomposition (SVD). After identifying individual users, their clicks are organized into separate sequences, which are then clustered using the S3M similarity function. When a new user arrives, the M most similar clusters are found, and a response matrix is formed from them. A weight vector for the current user is created according to the positions of the visited pages. Before pages are suggested, the SVD algorithm reduces the dimensionality, and the pages with the highest weights in the output of this step are displayed. Mishra et al. thus considered sequential information in web navigation along with content information.

Thwe [30] published “Web page access prediction based on an integrated approach” and provided a recommender system based on KMMPSPR. The main idea is to solve the problem the Markov model has in making suggestions when pages have the same probability. After cleaning the data and identifying users and sessions, the algorithm clusters the sessions with k-means and then applies 1st- and 2nd-order Markov models to each cluster. When the pages identified by the Markov model have identical probabilities, Popularity and Similarity-based Page Rank (PSPR) decides, based on page rank, which one to propose. The limitation of this research is that it considers only low orders of the Markov model for predicting pages.

KMM is a method that combines K-means with an all-3rd-order MM for recommendation. User sessions are first clustered by the K-means algorithm, and MMs of orders 1 to 3 are then applied to each cluster. In the prediction phase, the cluster of an incoming user is first determined, and pages are suggested using the all-3rd-order Markov model. This model vectorizes sessions using frequency only. Moreover, K-means requires the number of clusters as a user input, whereas the proposed RS uses the information in the log file and does not need the number of clusters to be defined.

In “An effective web page recommender using binary data clustering”, Forsati et al. [8] proposed a recommender system called harmonic session clustering. Its main contribution is clustering user sessions, represented as binary vectors, with the harmony search optimization algorithm. Three algorithms, HSC, HKSC, and IHKSC, were presented. In HSC, users’ sessions are clustered by the harmony search algorithm, with the least-squares error as the objective function. In HKSC, candidate centers are treated as problem solutions, the K-means algorithm is run on each of them, and the best result is returned as the final solution. IHKSC consists of two stages: clustering is first performed by the HSC algorithm for N iterations, and the best solution of the final generation is then passed as cluster centers to the K-means algorithm, which performs the clustering. Clustering was assessed with the Average Distance of Sessions to the Cluster (ADSC) and Visit-Coherence (VC); the recommender system was evaluated with accuracy, coverage, and F-measure. In each algorithm, the number of clusters is an input, and the log-file information is not used; only whether a page was seen by the user is considered.

The authors of [3], in “Personal recommender system based on user interest community in social network model”, proposed a novel time-weighted score matrix in which users and items with higher correlation are clustered into the same community using difference equations. First, users’ interest is calculated with a rounding-forgetting function, taking the time and score matrices as inputs. Second, a difference equation follows the clustering’s evolutionary process and groups users and items into several communities. Finally, a recommendation list is produced from the predicted scores. The results indicate that the system improves on collaborative-filtering-based recommendation.

2.2 Clustering technique based on Metaheuristic algorithm

Maulik et al. [15], in “Genetic algorithm-based clustering technique”, designed a genetic algorithm for clustering data. The number of clusters is received as input from the user, and the K cluster centers are chosen randomly from the dataset; each chromosome in the population represents a set of cluster centers. The inverse of the Sum of Squared Errors (SSE) serves as the fitness function. The method was run on artificial and real datasets and, compared with the K-means algorithm, showed better results.

Lin et al. [11], in “An efficient GA-based clustering technique”, proposed a GA-based clustering technique that chooses cluster centers directly from the dataset, with the number of clusters defined by the user. The length of each chromosome equals the size of the dataset, and the ith gene represents the ith data point. If the data point with index i is a candidate cluster center, the corresponding ith gene is set to “1”; otherwise it is set to “0”. Each chromosome in the population is evaluated with the inverse of the Davies–Bouldin (DB) measure as the fitness function. The algorithm was tested on an artificial dataset.

Nonetheless, both techniques, by Maulik et al. [15] and by Lin et al. [11], share two limitations: they require the user to input the number of clusters, and they select the initial seeds randomly from the dataset.

Rahman et al. [26] presented a new clustering technique in “A hybrid clustering technique combining a novel genetic algorithm with K-means”. The technique, called GenClust, combines k-means with a GA and aims to reach higher-quality clusters without requiring the number of clusters as user input. Genes in the initial population are found by the genetic algorithm both randomly and deterministically. Through this novel initial-population determination approach, GenClust automatically finds the right number of clusters and identifies the right genes. To achieve an even higher-quality clustering, the resulting centers are given to K-means as starting seeds, allowing the initial seeds to be adjusted as required.

GenClust suffers from the following limitations. First, its initial population selection procedure has a time complexity of O(n²), which may be problematic for large datasets. Second, it uses a set of user-defined cluster radii to obtain the initial population, yet the radii of the actual clusters may vary from one dataset to another, depending on several factors including the dataset’s dimensionality.

Das et al. [29] developed a novel automatic clustering method called ACDE in “Automatic clustering using an improved differential evolution algorithm”. The method uses activation thresholds (control genes) in the range 0 to 1 to activate or deactivate centers, which allows the number of clusters to be found automatically. The cluster centroids are initialized randomly between Xmax and Xmin, the highest and lowest values of any attribute of the dataset under test, and the results are evaluated with the DB and CS measures. This method can find the number of clusters across various runs; however, the accuracy of the clusters is not perfect.

Thus, to compare these clustering strategies, they are analyzed on a standard dataset, Wifi-localization.

Figure 1 visualizes the performance of four clustering techniques on the Wifi-localization dataset.

Fig. 1

a Three-dimensional plot of the unlabeled Wifi-localization dataset using the first three features. b The labeled Wifi-localization dataset. Clustering of the Wifi-localization dataset by c SSE-based GA, d DB-based GA, e GenClust, and f ACDE

Table 1 compares the outputs of various clustering techniques on the basis of experimental tests.

Table 1 Comparison of clustering methods for Wifi-localization dataset

2.3 Markov model

Markov models are robust mathematical models for discovering sequential patterns and for studying and understanding random processes; they are also useful in web page prediction. Nonetheless, low-order (1st- or 2nd-order) Markov models cannot precisely predict the next page a user will visit, because they do not look deeply enough into the user’s history. Higher-order Markov models, on the other hand, suffer from a large number of states and low coverage.

In “Mining longest repeating subsequences to predict World Wide Web surfing”, Pitkow et al. [22] employed the all-kth-order MM to resolve the problems of low-order MMs. This technique applies the highest-order model first; if that order cannot cover the prediction, the process is repeated with a lower order. The model does not utilize all orders of information simultaneously.
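The fallback scheme can be sketched as follows. This is an illustrative reconstruction of the general all-kth-order idea, not Pitkow et al.’s implementation; all function names are the editor’s own:

```python
from collections import defaultdict

def train_all_kth_order(sessions, max_order=3):
    """Count next-page frequencies for every context of length 1..max_order."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in sessions:
        for k in range(1, max_order + 1):
            for i in range(len(s) - k):
                context = tuple(s[i:i + k])
                counts[context][s[i + k]] += 1
    return counts

def predict(counts, history, max_order=3):
    """Try the longest matching context first, falling back to lower orders."""
    for k in range(min(max_order, len(history)), 0, -1):
        context = tuple(history[-k:])
        if context in counts:
            successors = counts[context]
            return max(successors, key=successors.get)  # most probable next page
    return None  # no order of the model covers this history
```

Note that `predict` stops at the first order that covers the history, which is precisely why this scheme does not combine information from all orders simultaneously.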

Moreover, Mamoun et al. [16] designed a modified MM in “Prediction of user’s web-browsing behavior: application of Markov model”, where sessions consisting of the same pages in different orders are considered the same. They argued that an action on the web may be performed along various paths, irrespective of the order the user chooses. They also reduced the prediction model’s dimension by eliminating sessions in which pages are repeated. This reduced the size of the model without directly affecting its accuracy.

In addition, Dhyani et al. [7] published “Modelling and predicting web page accesses using Markov processes”. Their assumption is that in an MM the probability of viewing pages does not change with time; the matrices of orders 2 to k are therefore obtained by raising the first-order matrix to the corresponding power, after which the all-kth-order MM is employed to suggest pages to the user. The study suffers from a major problem in its method of calculating probabilities, and it does not consider the sequence in which pages are viewed.

3 The proposed algorithm

The proposed technique contains two phases, offline and online. In the offline phase, pre-processing operations are first performed on the web server’s log files to identify users and sessions. The web users’ sessions are then vectorized to capture the users’ interest in web pages. Next, the vectorized sessions are clustered by automatic clustering based on cluster similarity and a genetic algorithm, and a Markov model is applied to each cluster. In the online phase, pages are recommended on the arrival of a new user session, in which the prediction is performed. Figures 2 and 3 portray the structure of the proposed algorithm in the offline and online phases.

Fig. 2

Flowchart of the proposed algorithm in the offline phase

Fig. 3

Flowchart of the proposed algorithm in the online phase

3.1 Data pre-processing

Web server logs are one of the main sources of information in web mining. Extensive studies have considered preparing these data sources for initial processing, collection, and integration for different analyses. Preparing the data poses certain challenges, which has led to algorithms and techniques for initial processing that include data combination and cleaning, user and session identification, and viewed-page identification [6]. After the data are cleaned, user identification is the most significant step in data processing, followed by session identification. A user session is the set of pages visited by a user during a single visit to the website [9].
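Session identification from a cleaned log is commonly done with a time-gap heuristic: a new session starts whenever the gap between a user’s consecutive requests exceeds a timeout. The sketch below assumes a 30-minute timeout, a conventional threshold not specified in this paper, and illustrative function names:

```python
from datetime import timedelta

def split_sessions(requests, timeout_minutes=30):
    """Group one user's (timestamp, page) requests into sessions.

    A new session starts whenever the gap between consecutive requests
    exceeds the timeout (30 min is a common heuristic, assumed here,
    not a value taken from this paper).
    """
    requests = sorted(requests)          # order by timestamp
    timeout = timedelta(minutes=timeout_minutes)
    sessions, current = [], []
    for ts, page in requests:
        if current and ts - current[-1][0] > timeout:
            sessions.append([p for _, p in current])
            current = []
        current.append((ts, page))
    if current:
        sessions.append([p for _, p in current])
    return sessions
```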

3.2 Making session vectors

If P = {p1, p2, …, pn} is the set of pages available to the users of a website, each associated with a unique URL, then S = {s1, s2, …, sm} is the set of web users’ sessions, and every si ∈ S is a subset of P. Each session si is represented by an n-dimensional vector si = {w(p1, si), w(p2, si), …, w(pn, si)}, where w(pj, si) is the weight assigned to the j-th web page visited in session si.

To weight the pages, the users’ interest must be determined; this research uses page frequency and visit duration. Page frequency is the number of visits to a particular page, whereas visit duration is the time a user spends on a given page [12]. In the equations below, duration(page) is the time spent on the page, size(page) is the size of the page in bytes, and number of visits(page) is the number of occurrences of the page in one session [18, 34]:

$$Frequency\left( {user,page} \right) = \frac{{number\,of\,visits\left( {page} \right)}}{{\mathop \sum \nolimits_{page \in visited\,pages} number\,of\,visits\left( {page} \right)}}$$
(1)
$$Duration\left( {user,page} \right) = \frac{{duration\left( {page} \right)/size\left( {page} \right)}}{{\mathop {\max }\nolimits_{page \in visited\,pages} \left( {duration\left( {page} \right)/size\left( {page} \right)} \right)}}$$
(2)

Equation (3) merges the Frequency and Duration parameters into a single weight; \(\lambda\) is adjustable and is obtained by experimentation:

$$Interest\left( {user,page} \right) = \lambda *Frequency\left( {user,page} \right) + \left( {1 - \lambda } \right)*Duration\left( {user,page} \right)$$
(3)
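Equations (1)–(3) can be implemented directly. The sketch below assumes the per-session statistics have already been extracted from the log file; the function and parameter names are illustrative, not taken from the paper:

```python
def interest_weights(visits, durations, sizes, lam=0.5):
    """Combine page frequency and normalized visit duration (Eqs. 1-3).

    visits:    {page: number of visits in the session}
    durations: {page: total seconds spent on the page}
    sizes:     {page: page size in bytes}
    lam:       the mixing parameter lambda, tuned experimentally
    """
    total_visits = sum(visits.values())
    # duration per byte, normalized by the session maximum (Eq. 2)
    dur_per_byte = {p: durations[p] / sizes[p] for p in visits}
    max_dpb = max(dur_per_byte.values())
    weights = {}
    for p in visits:
        freq = visits[p] / total_visits              # Eq. (1)
        dur = dur_per_byte[p] / max_dpb              # Eq. (2)
        weights[p] = lam * freq + (1 - lam) * dur    # Eq. (3)
    return weights
```

Dividing the duration by the page size compensates for large pages naturally taking longer to read, as Eq. (2) prescribes.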

Finally, the identified sessions are divided into two sets, training sessions and test sessions. The training set is used to build the proposed method, and the test set is used to evaluate it.

3.3 Automatic web session clustering

ACGA is a GA-based algorithm for clustering web sessions. It contains two phases; similarity-based clustering is performed in the first phase.

In the first phase, the data with the highest mutual similarities form initial clusters: the data in each cluster are connected to each other directly or indirectly, so this technique clusters the data based on their link structure. The initial clusters discovered in the first phase are then used as the input data for the GA in the second phase, in which the clusters are combined. The chromosome structure is matrix-based, and an objective function assesses the chromosomes’ quality, returning their fitness (or cost) values; a percentage of the chromosomes with the maximum fitness (or minimum cost) is selected for the next iteration. The genetic operators of crossover and mutation are applied to the chromosomes to obtain a better population, and these operations are iterated until the stopping criterion is satisfied. At the final iteration of the algorithm, the fittest chromosome is returned as the best solution found. Figure 4 shows the structure of the proposed clustering algorithm.

Fig. 4

Flowchart of the proposed clustering algorithm

3.3.1 Clustering-based similarity

The present research assumes a dataset D with a set of records R = {R1, R2, …, Rn}, where each record has a set of attributes A = {A1, A2, …, Am}. Equations (4) and (5) give the distance and the similarity between two records:

$$d\left( {R_{i} ,R_{j} } \right) = \sqrt {\mathop \sum \limits_{k = 1}^{m} \left( {R_{ik} - R_{jk} } \right)^{2} }$$
(4)
$$sim\left( {R_{i} ,R_{j} } \right) = \frac{1}{{1 + d\left( {R_{i} ,R_{j} } \right)}}$$
(5)

Equation (4) is the Euclidean distance, where m is the number of dimensions (features). Each record is paired with the record most similar to it, and records with direct or indirect inter-connections are then merged; in this way the similarity-based clusters are identified. To illustrate, assume a dataset with 20 records and 4 attributes, shown in Table 2. Once the distances and similarities between all records have been computed, the nearest record to each record is found, as shown in Table 3.

Table 2 A sample dataset
Table 3 Nearest data for each data

As seen in Table 3, each record is paired with the record of greatest similarity. The resulting clusters are then checked for shared members and combined when they overlap; this procedure, based on intersection and union, continues until no clusters share members. For instance, R4 is most similar to R3 and vice versa. In the first step of clustering, this cluster has just two members, but searching the other clusters finds {R7, R3} and {R9, R4}, which share members with {R3, R4}; they are combined into one cluster {R3, R4, R7, R9}. Finally, this cluster is compared with the remaining clusters: {R14, R9} intersects it, so the final cluster is {R3, R4, R7, R9, R14}. This process is carried out for each cluster to find the final clusters. Figure 5 shows the final clusters found by the similarity-based technique.

Fig. 5

a Link structure between the data in the dataset. b Initial similarity-based clusters
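The two steps of this phase, nearest-record pairing by Eqs. (4)–(5) and transitive merging of overlapping pairs, can be sketched as follows. This is an unoptimized illustration with the editor’s own function names, not the authors’ implementation:

```python
import math

def nearest_neighbor_pairs(data):
    """Pair each record with its most similar record (Eqs. 4-5).

    Maximizing sim = 1/(1 + d) is equivalent to minimizing the
    Euclidean distance d, so we minimize d directly.
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    pairs = []
    for i, ri in enumerate(data):
        best = min((j for j in range(len(data)) if j != i),
                   key=lambda j: dist(ri, data[j]))
        pairs.append({i, best})
    return pairs

def merge_linked(pairs):
    """Union clusters that share a member until all clusters are disjoint."""
    clusters = [set(p) for p in pairs]
    merged = True
    while merged:
        merged = False
        out = []
        while clusters:
            c = clusters.pop()
            for other in clusters:
                if c & other:          # shared member: fold c into other
                    other |= c
                    merged = True
                    break
            else:
                out.append(c)
        clusters = out
    return clusters
```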

3.3.2 Clustering-based genetic algorithm

Since the clustering in the first phase is based only on the similarities and relationships between the data, a technique is needed to combine these clusters under an appropriate measure and extract the clusters with the best combination. The technique applied in this research is the genetic algorithm.

The input of the genetic algorithm is the average of the data in each first-phase cluster. In the example above, the clustering-based similarity step grouped the 20 records into 4 clusters, so the input to the genetic algorithm is a dataset of 4 records, each representing its cluster. The benefits are a reduced problem complexity, clusters that are not broken apart during the integration procedure, and a genetic algorithm that is easier to run because it operates on less data.

3.3.2.1 Representation of the chromosomes

The chromosomes must be represented so that the number of clusters can be determined concurrently with the clustering of the data at runtime. If the maximum number of cluster centers is K and the number of dimensions (features) of the dataset is d, then each chromosome is a \(K \times (d + 1)\) matrix. To obtain the optimal number of clusters, an activation decision variable, a number between 0 and 1, is added to each cluster center: if its value is ≥ 0.5, the center is active; otherwise it is disabled. Figure 6 depicts how each chromosome is represented.

Fig. 6

An example of a chromosome
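This encoding can be sketched as below, with each chromosome row holding an activation gene followed by a d-dimensional center. The initialization ranges and function names are illustrative assumptions:

```python
import random

def random_chromosome(k_max, d, low, high):
    """A chromosome is a (k_max x (d+1)) matrix: each row holds an
    activation gene in [0, 1] followed by a d-dimensional center
    drawn uniformly from [low, high] per attribute."""
    return [[random.random()] + [random.uniform(low, high) for _ in range(d)]
            for _ in range(k_max)]

def active_centers(chromosome):
    """A center takes part in clustering only if its activation >= 0.5."""
    return [row[1:] for row in chromosome if row[0] >= 0.5]
```

The number of active clusters K used by the fitness function is simply `len(active_centers(chromosome))`.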

3.3.2.2 Fitness function

The fitness function calculates the value of each chromosome in the genetic algorithm. First, the distance between each sample and each cluster center is computed, and each sample is assigned to the cluster whose center is closest. The resulting clusters are then assessed with a novel evaluation criterion. A suitable clustering evaluation criterion considers both compactness and separation, so the evaluation function, called CSMC, is built from both factors and serves as the fitness function.

To calculate the separation of one cluster from another, the distances between all members of one cluster and all members of the other are computed; Eq. (6) gives the separation of two clusters as the mean of these distances. \(\left| {c_i} \right|\) is the number of members of cluster \(c_i\), and \(\vec{x}_i \in c_i\) denotes a data point belonging to \(c_i\). K is the number of active clusters in a chromosome, i.e., the number of centers whose activation value exceeds 0.5.

A smaller difference between the maximum and minimum distances within a cluster indicates that the data in the cluster are more similar. Accordingly, the compactness of a cluster in Eq. (7) is the difference between the sum of each point’s largest in-cluster distance and the sum of its smallest in-cluster distance.

$$Sep_{{c_{i} ,c_{j} }} = \frac{1}{{\left| {c_{i} } \right|}}*\frac{1}{{\left| {c_{j} } \right|}}\mathop \sum \limits_{{\forall \vec{x}_{i} \in c_{i} ,\vec{x}_{j} \in c_{j} }} d\left( {\vec{x}_{i} ,\vec{x}_{j} } \right)$$
(6)
$$Com_{{c_{i} }} = \frac{1}{{2*\left| {c_{i} } \right|^{2} }}\left( {\mathop \sum \limits_{{\forall \vec{x}_{i} \in c_{i} }} { \hbox{max} }\left( {d_{{\vec{x}_{i} }} } \right) - \mathop \sum \limits_{{\forall \vec{x}_{i} \in c_{i} }} min\left( {d_{{\vec{x}_{i} }} } \right)} \right)$$
(7)
$$CSMC = \frac{1}{{K*\left( {K - 1} \right)}}\mathop \sum \limits_{i = 1}^{K - 1} \mathop \sum \limits_{j = i + 1}^{K} Sep_{{c_{i} ,c_{j} }} - Com_{{c_{i} }} - Com_{{c_{j} }}$$
(8)
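Equations (6)–(8) can be sketched as follows, assuming Euclidean distance, NumPy arrays of cluster members (each cluster with at least two points), and interpreting \(d_{\vec{x}_i}\) as the distances from a point to the other members of its cluster. This is an illustrative Python sketch with assumed function names, not the paper's implementation:

```python
import numpy as np

def sep(ci, cj):
    """Eq. (6): normalized sum of pairwise distances between two clusters."""
    d = np.linalg.norm(ci[:, None, :] - cj[None, :, :], axis=2)
    return d.sum() / (len(ci) * len(cj))

def com(ci):
    """Eq. (7): compactness from the max/min intra-cluster distances."""
    d = np.linalg.norm(ci[:, None, :] - ci[None, :, :], axis=2)
    np.fill_diagonal(d, np.nan)           # ignore the zero self-distance
    return (np.nansum(np.nanmax(d, axis=1)) -
            np.nansum(np.nanmin(d, axis=1))) / (2 * len(ci) ** 2)

def csmc(clusters):
    """Eq. (8): average separation minus compactness over cluster pairs."""
    K = len(clusters)
    total = 0.0
    for i in range(K - 1):
        for j in range(i + 1, K):
            total += (sep(clusters[i], clusters[j])
                      - com(clusters[i]) - com(clusters[j]))
    return total / (K * (K - 1))

# Two tight, well-separated clusters give a large positive CSMC
a = np.array([[0.0, 0.0], [0.0, 1.0]])
b = np.array([[10.0, 0.0], [10.0, 1.0]])
print(csmc([a, b]))
```

Higher CSMC is better: separation enters with a positive sign and compactness penalties with negative signs, which is why the GA maximizes it.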
3.3.2.3 Primary population production

A primary population must exist before the genetic algorithm can be initiated. A random procedure is the most widely used way to produce the primary population; since it generates the population quickly, it has been employed in this research.

3.3.2.4 Choosing parents

In this phase, parent chromosomes are chosen on the basis of the fitness they received from the evaluation function; after the genetic operators, crossover and mutation, are applied, these chromosomes produce the offspring. The roulette wheel method is one of the most convenient selection algorithms. In this technique, the selection probability values are first accumulated; since they sum to 1, they partition the interval from 0 to 1. A random number is then generated in this interval, and the index of the selected chromosome is determined by comparing the random number with the roulette wheel intervals.
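A minimal Python sketch of roulette wheel selection as described above (the function name is illustrative, and fitness values are assumed to be non-negative with a positive total):

```python
import random

def roulette_select(fitness):
    """Pick a chromosome index with probability proportional to fitness.

    Fitness values are normalized into selection probabilities summing
    to 1; a uniform random number in [0, 1) is matched against the
    cumulative intervals of the wheel.
    """
    total = sum(fitness)
    r = random.random()
    cum = 0.0
    for idx, f in enumerate(fitness):
        cum += f / total
        if r < cum:
            return idx
    return len(fitness) - 1           # guard against rounding at 1.0

random.seed(0)
picks = [roulette_select([1.0, 3.0]) for _ in range(10000)]
print(picks.count(1) / len(picks))    # roughly 0.75
```

A chromosome with three times the fitness is selected roughly three times as often, which is the proportionality the method is meant to provide.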

3.3.2.5 Crossover operator

The crossover operator then combines chromosomes so that the good genes of each parent are transferred to the child and the overall solution approaches the optimum. This research employed double-point crossover with a constant crossover probability \(\mu_{c}\). Figure 7 illustrates this operation with an example.

Fig. 7

a Parent chromosome. b Child chromosome using double point crossover
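Double-point crossover as described could be sketched as follows (Python, illustrative names; the chromosomes are flattened to plain lists for simplicity):

```python
import random

def double_point_crossover(p1, p2, mu_c=0.8):
    """With probability mu_c, swap the gene segment between two random
    cut points; otherwise the children are plain copies of the parents."""
    if random.random() > mu_c:
        return p1[:], p2[:]
    a, b = sorted(random.sample(range(1, len(p1)), 2))
    c1 = p1[:a] + p2[a:b] + p1[b:]
    c2 = p2[:a] + p1[a:b] + p2[b:]
    return c1, c2

random.seed(1)
c1, c2 = double_point_crossover([0] * 6, [1] * 6, mu_c=1.0)
print(c1, c2)
```

Because the same segment is exchanged in both directions, the multiset of genes over the two children is identical to that of the two parents.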

3.3.2.6 Mutation operator

The mutation operator moves from one point to another in the search space by modifying chromosome genes; in the genetic algorithm, this operator contributes significantly to escaping from local optima. This research applied a random mutation method: one gene is chosen at random and replaced with a value within the allowed range of the data. Figure 8 shows the random mutation.

Fig. 8

Random mutation
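A sketch of the random mutation step (Python, illustrative names; the mutation probability `mu_m` is an assumed parameter, and the chromosome is again flattened to a list):

```python
import random

def random_mutation(chrom, low, high, mu_m=0.1):
    """With probability mu_m, replace one randomly chosen gene with a
    uniform random value inside the allowed data range [low, high]."""
    child = chrom[:]                       # leave the parent untouched
    if random.random() < mu_m:
        pos = random.randrange(len(child))
        child[pos] = random.uniform(low, high)
    return child
```

At most one gene changes per call, and the new value always stays within the allowed range, which matches the description above.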

3.3.2.7 Updating population

After the crossover and mutation operators have been applied and the intended children have been created, the major question is which chromosomes pass to the next generation. One method of updating the population is elitism, which has a significant impact on finding an optimum response and on the convergence of the algorithm. A common elitism technique is to transmit N copies of the best chromosome, or of the parent chromosomes, to the new generation; this technique is highly efficient [33]. To analyze the ACGA algorithm, this approach was applied to the Wifi-localization dataset. In the first phase, ACGA discovers 455 clusters based on similarity; in the second phase, the GA finds 4 optimal clusters. The experiment was run five times; Table 4 shows the results, and Fig. 9 presents the optimal clusters.

Table 4 Result of the ACGA algorithm
Fig. 9

Clustering of the Wifi-localization dataset by ACGA

According to Table 4, ACGA shows the best accuracy in comparison with the GA-based SSE, GA-based DB, GenClust, and ACDE algorithms. The run time of ACGA is lower than that of GenClust and ACDE, but higher than that of the GA-based SSE and GA-based DB techniques.

3.4 The all k-th order Markov model for the clusters

Displaying web transactions as sequences of viewed pages makes it possible to use a number of helpful models for discovering and analyzing users' navigation patterns. One such strategy is to model a user's browsing behavior on the website with a Markov chain. The order of a Markov model corresponds to the number of former events used in anticipating an upcoming event; in particular, a Markov model of order k predicts the probability of the next event by observing the k preceding events. After the user sessions have been clustered to discover sequential patterns, the present research employed the Markov model with all orders from 1 to k for every cluster, so a transition probability matrix among web pages is created for each order. To calculate the probabilities, if \(P = \left\{ {p_{1} ,p_{2} ,p_{3} , \ldots ,p_{m} } \right\}\) is the set of pages in a website, the transition probability \(prob\left( {w_{{\left\{ {j_{1} ,j_{2} , \ldots ,j_{k} } \right\} \to i}} } \right)\) is estimated from the dataset by Eq. (9):

$$prob\left( {w_{{\left\{ {j_{1} ,j_{2} , \ldots ,j_{k} } \right\} \to i}} } \right) = \frac{{w_{{\left\{ {j_{1} ,j_{2} , \ldots ,j_{k} } \right\} \to i}} }}{{\mathop \sum \nolimits_{{p_{m} \in P}} w_{{\left\{ {j_{1} ,j_{2} , \ldots ,j_{k} } \right\} \to p_{m} }} }}$$
(9)

In Eq. (9), \(k\) is the order of the Markov model and \(1 \le j \le k\). \(w_{{\left\{ {j_{1} ,j_{2} , \ldots ,j_{k} } \right\} \to i}}\) refers to the number of times page i appears after the sequence \(\left\{ {j_{1} ,j_{2} , \ldots ,j_{k} } \right\}\) across all sessions.
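Equation (9) can be sketched by counting, for each k-page context, how often each page follows it across all sessions, then normalizing the counts (Python, illustrative names):

```python
from collections import defaultdict

def transition_probs(sessions, k):
    """Eq. (9): probability of each page following a k-page context.

    counts[context][page] is the number of times `page` appears right
    after the k-page sequence `context` across all sessions; each row
    is normalized by its total count.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for s in sessions:
        for t in range(len(s) - k):
            context = tuple(s[t:t + k])
            counts[context][s[t + k]] += 1
    return {ctx: {page: n / sum(nxt.values()) for page, n in nxt.items()}
            for ctx, nxt in counts.items()}

sessions = [["P1", "P2", "P3"], ["P1", "P2", "P4"], ["P5", "P2", "P3"]]
probs = transition_probs(sessions, k=2)
print(probs[("P1", "P2")])    # {'P3': 0.5, 'P4': 0.5}
```

Running this for k = 1, 2, ..., up to the chosen maximum order yields the family of matrices the all k-th order model uses.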

3.5 Suggestion of pages based on user behavior

After training, when the current user session arrives in the online phase, prediction first requires specifying to which cluster the current session belongs, i.e., finding the closest cluster to the current user session. The final page of the user's current session is withheld, and all sequences of 1 to k elements are formed from the last k pages of the session. The probability of visiting each page of the cluster is then obtained by looking up these sequences in the transition probability matrices created in the offline phase: the probability of visiting pages after one-element sequences comes from the 1st-order matrix, after two-element sequences from the 2nd-order matrix, and after k-element sequences from the k-th order matrix. Next, the probabilities obtained for the same page across the sequences are combined as expressed in Eq. (10):

$$p\left( {a_{n + 1} |a_{n} a_{n - 1} \ldots a_{n - k} } \right) = \left[ {p\left( {a_{n + 1} |a_{n} } \right) + p\left( {a_{n + 1} |a_{n} a_{n - 1} } \right) + \cdots + p\left( {a_{n + 1} |a_{n} a_{n - 1} \ldots a_{n - k} } \right)} \right]/\left| {N_{p} } \right|$$
(10)

In Eq. (10), \(\left| {N_{p} } \right|\) represents the number of probabilities greater than zero. This probability is computed for all pages; the most probable pages among those likely to be viewed after the current session are then recommended to the user. For instance, if an active user had the session shown below, Table 5 presents the probability of pages after one-item to three-item sequences:

Table 5 The probability of pages after one to three sequences
$$User_{active} = P1,P2,P4,P3,P5,?$$

Hence, page P4 with a probability of 0.75 would be offered to the user as the next page.
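Equation (10) and the example above can be sketched as follows; the transition probabilities below are hypothetical, chosen only so that P4 reproduces the 0.75 outcome of the example (Python, illustrative names, matrices stored as context-to-probability dictionaries):

```python
def score_next_pages(session, matrices, all_pages):
    """Eq. (10): for each candidate page, average the non-zero
    probabilities found across the order-1..k matrices built offline.

    matrices[k] maps a k-page context tuple to {page: probability}.
    """
    scores = {}
    for page in all_pages:
        probs = [m.get(tuple(session[-k:]), {}).get(page, 0.0)
                 for k, m in matrices.items()]
        probs = [p for p in probs if p > 0]
        if probs:
            scores[page] = sum(probs) / len(probs)   # divide by |N_p|
    return scores

# Hypothetical order-1 and order-2 matrices for the active session
matrices = {1: {("P5",): {"P4": 0.8}},
            2: {("P3", "P5"): {"P4": 0.7}}}
session = ["P1", "P2", "P4", "P3", "P5"]
scores = score_next_pages(session, matrices, all_pages=["P4", "P6"])
print(scores)   # P4 scores (0.8 + 0.7) / 2 = 0.75; P6 gets no score
```

Pages are then ranked by score and the top ones are recommended as the likely next pages.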

4 Time complexity analyses

This section deals with the time complexity of the suggested algorithms; thus, the introduced recommender system, ACGA, and the modified all k-th order MM are analyzed. The notations employed in this section are defined as follows:

  • \(S\) refers to the number of sessions.

  • \(MaxIteration\) denotes the maximum number of iterations of the GA.

  • \(NPOP\) stands for the population size of the GA.

  • \(N_{f}\) is the number of final clusters.

  • \(N_{T}\) represents the number of test sessions for which the recommender system produces recommendations.

  • \(k\) indicates the highest order of the Markov model.

  • \(N_{P}\) refers to the number of pages.

  • \(N_{R}\) stands for the number of pages recommended to a user.

Lemma 1

The time complexity of ACGA algorithm is \(O\left( {S + MaxIteration*NPOP} \right)\).

Proof

In the first part of ACGA, the complexity of the similarity-based clustering algorithm depends largely on the number of sessions, \(O\left( S \right)\). In the second part, the GA depends on \(MaxIteration\) and \(NPOP\): in each iteration a solution is produced and the parameters are filled for a population of NPOP, and this is repeated MaxIteration times. Hence, the complexity of the ACGA algorithm is \(O\left( {S + MaxIteration*NPOP} \right)\).□

Lemma 2

The time complexity of modified all k-th order MM algorithm is \(O\left( {k*N_{f} *S} \right)\).

Proof

The running time of the modified all k-th order MM is linear, because building each order of the MM takes time linear in the number of sessions; thus, building all k orders needs \(O\left( {k*S} \right)\). This procedure is executed for each cluster found by the ACGA algorithm, so the time complexity also depends on \(N_{f}\). Hence, the complexity of the modified all k-th order MM is \(O\left( {k*N_{f} *S} \right)\).□

Lemma 3

The time complexity of the suggested recommender system is \(O\left( {ACGA} \right) + O\left( {\text{modified all }k\text{-th order MM}} \right) + O\left( {N_{P} *N_{R} *N_{T} } \right)\).

Proof

Before the main recommendation procedure is executed, the sessions must be clustered and a probability matrix created for each cluster; the time complexities of the clustering algorithm and the Markov procedure are therefore summed in the analysis of the recommender system. During the online phase, an active session must first be assigned to the best cluster. This assignment compares two vectors, the active session and a cluster centroid, so it depends on the number of pages in the system, with time complexity \(O\left( {N_{P} } \right)\). After the cluster of the active session has been determined, the high-probability pages are recommended to the user, which takes \(O\left( {N_{P} *N_{R} } \right)\). If the recommendation is repeated for \(N_{T}\) sessions, the overall time complexity is \(O\left( {N_{P} *N_{R} *N_{T} } \right)\).□

5 Evaluation and simulation results

Excel 2010 and MATLAB R2013a have been applied to simulate the introduced system. The system was first trained with a training dataset; once trained, a set of test data that had not contributed to the development of the sequential model was employed to evaluate the suggested system and simulate the active user.

To this end, each input vector in the test set, which represents the user's interest, was divided into 2 sections. The first part of each vector was treated as the current user session, and the second part as the pages that should be offered to the user. The output of the suggested system was then compared to the second section, and the system's performance was assessed by counting the correctly recommended pages. The experimental analyses follow the suggested recommender system as described in Algorithm 1.


5.1 The CTI dataset

This research applied the CTIFootnote 2 web log file data to implement the suggested method. The CTI dataset was collected from the DePaul CTI Web server and is based on a random sample of users visiting the site over a 2-week interval during April 2002. The unfiltered data comprised 20,950 sessions from 5446 users; the filtered data consisted of 13,745 sessions and 683 page views. At random, 9000 of the sessions in the dataset were chosen for training and the other 4745 for testing. Table 6 reports the data after preprocessing.

Table 6 Dataset used in the experiment

5.2 Effective parameters in evaluation

Evaluating the clusters is one of the most significant phases of clustering analysis, and the unsupervised nature of clustering makes it one of the most challenging. Validation techniques for clustering are grouped into 2 major kinds: internal clustering validation and external clustering validation [13]. The CS measure and the new validation measure proposed in this study have been used as validation procedures; both assess the inside and outside of a cluster based on the pairwise distances of the cluster objects. CS is defined in Eq. (12), where \(N_{i}\) stands for the number of components in the \(i\)th cluster \(C_{i}\). Before the CS measure is computed, the centroid of each cluster is calculated by averaging the data vectors belonging to the cluster via Eq. (11):

$$\vec{m}_{i} = \frac{1}{{N_{i} }}\mathop \sum \limits_{{\vec{x}_{j} \in C_{i} }} \vec{x}_{j}$$
(11)
$$CS = \frac{{\mathop \sum \nolimits_{i = 1}^{K} \left[ {\frac{1}{{N_{i} }}\mathop \sum \nolimits_{{\vec{X}_{i} \in C_{i} }} \mathop {\hbox{max} }\limits_{{\vec{X}_{q} \in C_{i} }} \left\{ {d\left( {\vec{X}_{i} ,\vec{X}_{q} } \right)} \right\}} \right]}}{{\mathop \sum \nolimits_{i = 1}^{K} \left[ {\mathop {\hbox{min} }\limits_{j \in K,j \ne i} \left\{ {d\left( {\vec{m}_{i} ,\vec{m}_{j} } \right)} \right\}} \right]}}$$
(12)
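Equations (11) and (12) can be sketched as follows (Python with NumPy, Euclidean distance, illustrative names):

```python
import numpy as np

def cs_measure(clusters):
    """CS index (Eqs. 11-12): the sum over clusters of the mean maximum
    intra-cluster distance, divided by the sum of each centroid's
    minimum distance to another centroid."""
    centroids = [c.mean(axis=0) for c in clusters]      # Eq. (11)
    intra = 0.0
    for c in clusters:
        d = np.linalg.norm(c[:, None, :] - c[None, :, :], axis=2)
        intra += d.max(axis=1).mean()   # mean of per-point max distances
    inter = sum(min(np.linalg.norm(mi - mj)
                    for j, mj in enumerate(centroids) if j != i)
                for i, mi in enumerate(centroids))
    return intra / inter

# Two tight, well-separated clusters yield a small CS value
a = np.array([[0.0, 0.0], [0.0, 1.0]])
b = np.array([[10.0, 0.0], [10.0, 1.0]])
print(cs_measure([a, b]))   # 0.1
```

Compact, well-separated clusters shrink the numerator and grow the denominator, which is why lower CS is better.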

A low CS value together with a high CSMC value indicates that the clustering maintains acceptable compactness and separation concurrently.

The number of suggested pages and the 3 criteria of coverage, accuracy, and F-measure are the factors used to assess system performance. Accuracy represents the capability of the recommender system to generate exact suggestions; that is, the accuracy equals the ratio of correct suggestions to total suggestions.

$$accuracy\left( {s,r} \right) = \frac{{\left| {s \cap r} \right|}}{\left| r \right|}$$
(13)

Coverage refers to the capability of the recommender system to generate every page actually viewed by the user; it equals the proportion of correct suggestions to the entire remaining pages of the same session.

$$coverage\left( {s,r} \right) = \frac{{\left| {s \cap r} \right|}}{\left| s \right|}$$
(14)

The F-measure combines the coverage and accuracy metrics into a single metric:

$$F - measure = 2*\frac{{accuracy\left( {s,r} \right)*coverage\left( {s,r} \right)}}{{accuracy\left( {s,r} \right) + coverage\left( {s,r} \right)}}$$
(15)
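Equations (13)–(15) can be sketched directly, treating the pages actually visited s and the recommended pages r as lists of page identifiers (Python, illustrative names and data):

```python
def accuracy(s, r):
    """Eq. (13): correct suggestions over all suggestions made."""
    return len(set(s) & set(r)) / len(r)

def coverage(s, r):
    """Eq. (14): correct suggestions over the pages actually visited."""
    return len(set(s) & set(r)) / len(s)

def f_measure(s, r):
    """Eq. (15): harmonic mean of accuracy and coverage."""
    a, c = accuracy(s, r), coverage(s, r)
    return 2 * a * c / (a + c) if a + c else 0.0

visited   = ["P3", "P4", "P7", "P9"]   # second half of a test session
suggested = ["P3", "P4", "P5"]         # system recommendations
print(accuracy(visited, suggested))    # 2 of 3 suggestions correct
print(coverage(visited, suggested))    # 2 of 4 visited pages covered
```

The harmonic mean penalizes a system that scores well on only one of the two criteria, which is why the F-measure is reported alongside them.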

5.3 Results of clustering and evaluation of suggestion pages

Once the user sessions have been identified, the vector for each session is built by the linear function in Eq. (3) with \(\lambda = 0.6\). After the users' interests in the pages have been quantified and the vectors created, the user sessions are clustered via the ACGA algorithm. The first phase of ACGA finds 1243 clusters based on similarity, and the mean of each cluster is sent to the GA in the second phase to discover the final clusters. The outputs of the ACGA algorithm have been compared with K-means and ACDE with regard to the CS and CSMC measures. Table 7 presents the parameter values of the genetic algorithm in ACGA and ACDE; for ACDE [29], the parameters recommended in the original study were used. Table 8 shows the results of the 3 algorithms.

Table 7 The parameters value for ACGA and ACDE
Table 8 Final solution for each algorithm with CS and CSMC measures

Since the K-means algorithm obtains the number of clusters from the user, this algorithm was tested with several distinct cluster counts, using the 2 measures to evaluate the clusters. Table 9 reports the results for different values of K. When the K-means algorithm uses the CS measure, the minimum value is obtained with 11 clusters, so the best clustering of the data has 11 clusters. If CSMC is employed to assess the clusters, however, the best clustering has 6 clusters, where the highest CSMC equals 0.3202.

Table 9 Experiments on the effect of number of clusters (K) in K-means algorithm based CS and CSMC measures

The major objective has been to examine whether the final clusters achieved by ACGA are more acceptable than those of other clustering methods with regard to the fitness function (CSMC) and the cost function (CS). The outputs report the average number of clusters and measure values over five independent runs of the 3 methods. ACGA attained lower CS and greater CSMC values across the runs.

ACGA depends on factors such as the total number of generations, the population size |Ps|, and the number of iterations. The default values in these experiments were number of generations N = 30, |Ps| = 30, and number of iterations iter = 50. To study the effect of these factors on ACGA's performance, 3 distinct population sizes (|Ps| = 10, 30, and 50) were used instead of only the default |Ps| = 30. Tables 10 and 11 present the average CS and CSMC measures and the number of clusters over five runs of ACGA with the 3 distinct |Ps| values. As seen, the clustering outputs usually improve as the population size increases.

Table 10 CS measure of ACGA for 10, 30, and 50 population size \(\left| {P_{s} } \right|\)
Table 11 CSMC measure of ACGA for 10, 30, and 50 population size \(\left| {P_{s} } \right|\)

The present research also applied 3 distinct numbers of generations (20, 30, and 40) for ACGA instead of only the default 30. Tables 12 and 13 report the average CS and CSMC values over five runs of ACGA with the 3 numbers of generations. As shown, a greater number of generations usually yields more acceptable clustering outputs for the CSMC measure; for the CS measure, however, the default number of generations is better.

Table 12 CS measure of ACGA for 20, 30, and 40 generations
Table 13 CSMC measure of ACGA for 20, 30, and 40 generations

Moreover, the research utilized 3 different numbers of iterations (iter = 10, 50, 100) for ACGA instead of only the default 50. According to Tables 14 and 15, ACGA with 100 iterations produces the best clustering, a result that was expected.

Table 14 CS measure of ACGA for 10, 50, and 100 iterations
Table 15 CSMC measure of ACGA for 10, 50, and 100 iterations

After the clustering of the users has ended, the algorithm yields more acceptable results when the initial number of chromosomes is 50. The cluster outputs under the CS and CSMC measures, with 9 and 8 clusters respectively, were used; the Markov models of orders 1-3 were applied to each cluster and its probability matrices were specified. When new users enter and the introduced recommender system runs, the suggested pages are evaluated in terms of accuracy and coverage using the Markov models of orders 1-3 and the Markov model combining all three orders, as reported in Tables 16 and 17.

Table 16 Results of different Markov models in page suggestions with 9 clusters in CS measure
Table 17 Results of different Markov models in page suggestions with 8 clusters in CSMC measure

The results demonstrate that when the recommender system employs the novel Markov model and uses CSMC as the fitness function in ACGA, it achieves higher accuracy and coverage than the other Markov models. A reason may be that the system applies the 3 orders of the Markov model in a single form and covers all three, so the probability of the same page is reinforced across the orders. The recommender system is also compared with various methods, including HSCR [8], HKSCR [8], IHKSCR [8], KMM, and KMMPSPR [30]; the findings reveal that the suggested system performs more acceptably. Figures 10, 11, and 12 show the outputs.

Fig. 10

Comparison of the proposed algorithm with other algorithms in terms of accuracy measure

Fig. 11

Comparison of the proposed algorithm with other algorithms in terms of coverage measure

Fig. 12

Comparison of the proposed algorithm with other algorithms in terms of F-measure

Notably, the evaluation of the suggestions strongly depends on the number of pages presented to the current users; each set of pages possesses distinct coverage and accuracy associated with that number. The results show that the proposed technique enhances the set of suggested pages and identifies the users' interests exactly while maintaining high accuracy. Combining the 2 factors of accuracy and coverage in the F-measure harmonic function further demonstrates the advantage of the proposed algorithm over the other algorithms.

This paper presents the execution times required by the techniques in Fig. 13. The complexity of HSCR is \(O\left( {T_{h} *S} \right) + O\left( {N_{P} *N_{R} *N_{T} } \right)\) [8], where \(T_{h}\) is the number of iterations of the harmony search. The complexity of HKSCR is \(O\left( {T_{hs} *T_{KM} *d*K*S^{2} } \right) + O\left( {N_{P} *N_{R} *N_{T} } \right)\) [8], where \(T_{KM}\), \(d\), and \(K\) are the number of iterations of the K-means algorithm, the number of dimensions, and the number of clusters, respectively.

Fig. 13

Computational time of the techniques for CTI data set

The complexity of IHKSCR is \(O\left( {g*T_{hs} *T_{KM} *d*K*S^{2} } \right) + O\left( {N_{P} *N_{R} *N_{T} } \right)\), where \(g\) is the number of generations for which the recommender system produces recommendations [8].

The complexity of the KMMPSPR algorithm is \(O\left( {T_{KM} *d*K*S} \right) + O\left( {k*K*S} \right) + O\left( {P*k*K*S} \right) + O\left( {N_{P} *N_{R} *N_{T} } \right)\) [30], where \(P\) is the number of pages in the dataset. The complexity of the KMM algorithm is \(O\left( {T_{KM} *d*K*S} \right) + O\left( {k*K*S} \right) + O\left( {N_{P} *N_{R} *N_{T} } \right)\).

The results indicate that the proposed algorithm is computationally more expensive than the KMMPSPR [30] and KMM algorithms, but it is better than HSCR [8], HKSCR [8] and IHKSCR [8].

6 Conclusion

The present research proposed a recommender system for offering web page suggestions to web users. In the first phase, the user sessions are vectorized on the basis of user interest, and an automatic clustering algorithm is used to cluster them; the modified Markov models of orders 1 to 3 then build the probability matrices. During the online phase, the recommender system identifies the nearest cluster as soon as the user logs in and proposes new pages through the Markov model with all k-th orders. The suggested system was finally compared with the baseline algorithms in terms of coverage, accuracy, and F-measure, and it attained greater accuracy on the set of recommended pages. There are two limitations against the advantages of the proposed recommender system. First, the probability matrices of the Markov models require a large amount of memory, and some columns of these matrices hold zero probability and are never used; a possible solution is a hash table that stores only the non-zero probabilities. Second, the initial cluster centers in the second phase of the ACGA algorithm are selected randomly; to increase the clustering accuracy, a possible solution is to select the initial centers through a combination of deterministic and random processes. These problems will be addressed in future work.