1. Introduction
With the rapid growth of the Internet, research on the Internet has revealed some very active topics, such as social networks [1,2], web mining, and so on. Web mining comprises three categories: web content mining, web structure mining, and web usage mining. Web Usage Mining (WUM), also known as web access pattern tracking, can be defined as the analysis of web page access histories; the mining task is the process of extracting interesting patterns from web access logs. Web usage mining is still a popular research area in data mining. With the rapid growth of the Internet, more and more useful information is hidden in web log data. It is essential to learn the favorite web pages of web users and to cluster web users in order to understand the structures that they use.
Many techniques in web usage mining have been proposed [3,4,5,6,7,8], and this field is still a hot topic for research in data mining. Most existing web mining techniques are based on association rule mining or frequent pattern mining, and these methods aim to find relationships among web pages or to predict the behavior of web users. With them, it is difficult to find groups of web users with similar favorite web pages. Furthermore, several articles on clustering web users have been published recently [6,9,10], using different clustering algorithms. However, all of these articles cluster web users based on frequent-pattern mining of common topics. In generating frequent patterns based on a user-specified minimum support threshold, the process obtains the frequent web pages of all web users. This means that if some web pages are frequently accessed by one web user, then they are accessed by other web users with a high probability. Such frequently visited web pages provide no discriminating power for clustering web users; they act as noise pages in clustering.
The discovery of class comparison or discrimination information is an important problem in the field of data mining. Emerging patterns [11,12], defined as multivariate features whose supports change significantly from one class to another, are very useful for discovering distinctions between different classes of data. Using the emerging-pattern mining technique, we can find emerging web pages in web log data. This technique has been detailed in many articles [13,14] and is still a hot topic in computer science. Jumping Emerging Patterns (JEPs) [15] are a special kind of EP, introduced to describe discriminating features that occur in one class but do not occur in the other classes at all.
Term Frequency–Inverse Document Frequency (TF-IDF) [16] is a weighting technique commonly used in information retrieval and information mining. It is considered a measure of how important a term is to a document, and it is widely used in literature classification, text mining, and other related fields.
Clustering algorithms are categorized based on their cluster models, and many clustering algorithms have been proposed, such as the K-means algorithm [17,18], the Self-Organizing Map (SOM) algorithm [19,20], Adaptive Resonance Theory (ART1) [21,22], and K-means & TF-IDF [23,24]. In this paper, the web pages accessed by each user are collected in text form and can be regarded as the identification of that web user. The K-means & TF-IDF approach is used to cluster web users because of the advantage of TF-IDF in text mining.
Folksonomies [25] have been proposed as a collaborative way to classify online items, where the classification is determined by the tagging frequency of user groups. In addition, much research based on Folksonomies [26,27] has been proposed to date. In this paper, we label the clusters based on the concept of Folksonomies.
Many articles have been proposed for finding the interests of users [1,28,29]. The first paper proposed a linear regression-based method that evaluates user interest to calculate a similarity matrix and clusters web users by applying a threshold to the generated matrix. The second paper proposed an entropy-based approach to obtain user interests. The third paper proposed a community-based algorithm to retrieve user interests. These approaches ignore the characteristic of specific web pages that are frequently accessed by one user but barely accessed by others, which should be a pattern of primary consideration.
In this paper, we aim to cluster web users based on user interests found in web log data. We propose an efficient approach that builds on the techniques of emerging-pattern mining. In the mining task, the emerging patterns of each web user are used to capture the interests, i.e., the web pages that are frequently accessed by one user and barely accessed by others. By finding emerging patterns for all web users, we can discard the noise (nonessential) web pages of each web user and cluster web users according to the generated typical web pages.
This paper is organized as follows: The next section introduces our proposed approach; then, we implement our proposed approach with a file of web log data for evaluation; finally, we discuss our conclusions and suggest future work.
2. Proposed Approach
In this section, we generate large web pages from the processed web log data, scan and transform the clean data set into simple page-linked graphs (SPLGs), and then generate emerging patterns in the generated SPLGs. We cluster web users based on the generated emerging patterns and, finally, label the clusters with typical web pages. Our work flow is shown in Figure 1.
2.1. Preprocessing of Data Set
Web log data are automatically recorded in web log files on web servers when web users access the web server through their browsers. Not all of the records stored in the web log files have the right format or are necessary for the mining task, so before analyzing the web log data, a data cleaning phase needs to be implemented.
2.1.1. Removing Records with Missing Value Data
Some of the records stored in the web log file will not be complete, because some of the parameters of the records were lost. For example, if a click-through to a web page was executed while the web server was shut down, then only the IP address, user ID, and access time will be recorded in the log file; the method, URL, referrer, and agent are lost. This kind of record cannot be used for our mining task, so these records must be removed.
2.1.2. Removing Records with Exception Status Numbers
Some records are caused by errors in the requests or by the server. Even if those records are intact, the activity did not execute normally. For example, records with status numbers 400 or 404 are caused by HTTP client errors, such as bad requests or a requested page that was not found. Records with status numbers 500 or 505 are caused by HTTP server errors, such as an internal server error or an unsupported HTTP version. These kinds of data are not needed for our task, so the records must be removed.
2.1.3. Removing Irrelevant Records with No Significant URLs
Some URLs in the records consist of .txt, .jpg, .gif, or .js extensions, which are automatically generated when a web page is requested. These records are irrelevant to our mining task, so they must be removed.
2.1.4. Selecting the Essential Attributes
As shown in the common log format of the web log data in Figure 2, there are many attributes in one record, but for web usage mining, not all of the attributes are necessary. In this paper, the attributes IP address, time, URL, and referrer are essential to our task, so they must remain; the rest should be discarded.
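As an illustration, the four cleaning steps above can be combined into a single pass over the raw log file. The following Python sketch assumes a combined-log-format line layout; the field names, the parsing pattern, and the list of skipped extensions are illustrative assumptions rather than a fixed specification.

import re
from datetime import datetime

# Extensions of automatically requested resources that are irrelevant to the mining task (Section 2.1.3).
IRRELEVANT_EXTENSIONS = ('.txt', '.jpg', '.gif', '.js')
# HTTP status codes treated as client/server errors (Section 2.1.2).
ERROR_STATUS = {400, 404, 500, 505}

# Hypothetical pattern for a combined-log-format record (IP, user, time, request, status, size, referrer, agent).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" (?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def clean_log(lines):
    """Yield (ip, time, url, referrer) tuples for records that survive the cleaning phase."""
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue                       # Section 2.1.1: incomplete or malformed record
        if int(match.group('status')) in ERROR_STATUS:
            continue                       # Section 2.1.2: exception status number
        url = match.group('url')
        if url.lower().endswith(IRRELEVANT_EXTENSIONS):
            continue                       # Section 2.1.3: irrelevant URL
        # Section 2.1.4: keep only the essential attributes
        time = datetime.strptime(match.group('time'), '%d/%b/%Y:%H:%M:%S %z')
        yield match.group('ip'), time, url, match.group('referrer')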
2.2. Generation of Large Web Pages
A large page set is a set of frequent web pages. We define frequent web pages as those whose support is greater than, or equal to, a user-specified minimum support threshold.
In this paper, a web log file denotes a data set; Large Web Pages (LWPs) denote the set of web pages that are accessed by web users with sufficient frequency over a period of time. A special period of time called a user session is an important definition for generating LWPs for web users. Generally, the value of a user session is defined by web designers according to the desired level of security. For some websites with high security requirements, the user session is always set to a short amount of time, such as 15 min or less. For example, a web user who is inactive for a long time may have left the computer to do other things, and it would be unsafe if someone else used the account without the original user's knowledge. For other general websites, the period of a user session can be longer, such as half an hour or one hour; it can also be indefinite. If, for simplicity, the user session time is defined as one hour, then in the process of generating large web pages, we should group the experimental data by periods of one hour for each web user, as sketched below.
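As a minimal sketch of this grouping step, assuming the cleaned (ip, time, url, referrer) tuples produced above and a one-hour session (a new session is opened whenever a request arrives more than one hour after the start of the current one), the session-based data set can be built as follows:

from collections import defaultdict
from datetime import timedelta

SESSION_LENGTH = timedelta(hours=1)   # assumed one-hour user session

def split_sessions(records):
    """Group cleaned (ip, time, url, referrer) records into per-user sessions.

    Returns a dict mapping each IP address to a list of sessions,
    where each session is the list of URLs visited in that session.
    """
    by_user = defaultdict(list)
    for ip, time, url, _referrer in sorted(records, key=lambda r: (r[0], r[1])):
        by_user[ip].append((time, url))

    sessions = defaultdict(list)
    for ip, visits in by_user.items():
        current, session_start = None, None
        for time, url in visits:
            if session_start is None or time - session_start > SESSION_LENGTH:
                current = []                  # open a new session for this user
                sessions[ip].append(current)
                session_start = time
            current.append(url)
    return sessions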
After cleaning the data set, the data are sorted by the values of their IP address field and split by user session. As a result, a session-based data set is obtained, which serves as the input for our proposal. An example of a session-based data set is shown in Table 1, where the user session is defined as one hour. According to the session-based data, candidate large web pages of each web user are extracted, and their supports are calculated. To calculate the support count for each candidate, we need to count the visit times of each web page accessed in different user sessions for each web user. The support is defined in Equation (1):

$$\mathrm{support}(P_{ij}) = \frac{V_{ij}}{S_i} \qquad (1)$$

where $V_{ij}$ is the visit times of web page $j$ in all sessions of web user $i$, and $S_i$ is the number of sessions of web user $i$. Finally, a user-specified Minimum Support threshold for Large Web Pages (MSLWP) must be defined. The MSLWP denotes a kind of abstraction level, i.e., a degree of generalization; the support value is determined by the proportion of sessions in which a web user accesses a web page. The selection of the MSLWP is very important: if it is low, then we obtain information about detailed events, and if it is high, then we obtain information about general events. The pseudocode for obtaining large web pages is shown in Algorithm 1.
Considering the session-based data set of Table 1 as input to Algorithm 1, when setting the parameter value of MSLWP to 0.25, the implementation of steps 5–19 results in a candidate data set, as shown in Table 2. Afterwards, the execution of lines 20–26 results in the large web pages of user 1: {p6, p12, p14, p19}.
Algorithm 1. getLWPs (List SD, double MSLWP)
Input: A set of session-based web data SD; a user-specified minimum support MSLWP.
Output: A set of large web pages for each web user.
1.  Define tmp_IP = SD1.IP;
2.  Define i = 1;
3.  Define out_LWP[][];
4.  Define Si = 0; // initialize the number of sessions for web user i
5.  for each sequence data SDn in SD
6.    if (SDn.IP == tmp_IP)
7.      Si++;
8.      for (int j = 1; j ≤ the number of web pages; j++)
9.        if (SDn.URLs contain Pij) // Pij is the jth web page for web user i
10.         Vij++; // the visit count of web page j by web user i, add one
11.         continue; // go on to check the next web page of this session
12.       end if
13.     end
14.   else
15.     tmp_IP = SDn.IP; // move on to the next web user
16.     i++;
17.     Si = 1; // the current session is the first session of web user i
18.   end if
19. end
20. for each web user i
21.   for each web page j
22.     if (Vij/Si >= MSLWP) // check if web page j for web user i is a large web page
23.       out_LWP[i].add(Pij); // record web page j as a large web page of web user i
24.     end if
25.   end
26. end
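For reference, the following Python sketch is one possible runnable counterpart of Algorithm 1. It assumes the session dictionary produced by split_sessions above and counts each web page at most once per session, so that the support of Equation (1) is the fraction of sessions containing the page; it is an illustrative rewrite, not the paper's reference implementation.

from collections import Counter

def get_lwps(sessions, mslwp):
    """Return, for each web user (IP), the set of large web pages (cf. Algorithm 1).

    sessions: dict mapping IP address -> list of sessions (each a list of URLs).
    mslwp:    user-specified minimum support threshold for large web pages (MSLWP).
    """
    lwps = {}
    for ip, user_sessions in sessions.items():
        s_i = len(user_sessions)            # Si: number of sessions of web user i
        v_ij = Counter()                    # Vij: number of sessions in which page j occurs
        for session in user_sessions:
            for page in set(session):       # count each page at most once per session
                v_ij[page] += 1
        # Equation (1): a page is large if Vij / Si >= MSLWP
        lwps[ip] = {page for page, count in v_ij.items() if count / s_i >= mslwp}
    return lwps

With the session-based data of Table 1 and mslwp = 0.25, this sketch should reproduce the set {p6, p12, p14, p19} for web user 1.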
2.3. Generation of Simple Page-Linked Graph (SPLG)
After generating large web pages for each web user, all of the large web pages are defined as vertices in the SPLG.
In regular page-linked graphs, each edge connects two web pages that are contained in one session. An example of a page-linked graph for web user 1 is shown in Figure 3 (left). However, in an SPLG, each edge connects two large web pages of the web user. Applying the concept of the SPLG to the structure of web page links can reduce large and complex regular page-linked graphs to simple ones and thus reduce noise web pages. In the SPLG, the link between each pair of large web pages should be checked. To check the link between two vertices, the direction of the link does not need to be considered; if the two vertices are visited by one user in one session, then they are connected. The pseudocode for checking the links is shown in Algorithm 2.
Algorithm 2. checkLinks (List SD, String[][] LWP)
Input: A set of session-based web data SD; a set of large web pages LWP[i][j], where i is the index of web users and j is the index of the large web pages of user i.
Output: A set of links with IP addresses.
1.  Define flag_Link = 1;
2.  Define HashMap out_List;
3.  for each web user i
4.    for each pair of large web pages LWP[i][m] and LWP[i][k]
5.      if LWP[i][m] and LWP[i][k] are both included in one session SDi of web user i
6.        flag_Link = 1; // a link between LWP[i][m] and LWP[i][k] is found
7.      else
8.        flag_Link = 0; // no link between LWP[i][m] and LWP[i][k]
9.      end if
10.     if (flag_Link == 1)
11.       out_List.put(SDi.IP, LWP[i][m]:LWP[i][k]);
12.     end if
13.   end
14. end
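A compact Python sketch of the same link check, assuming the sessions and lwps structures from the previous sketches and treating each link as an unordered pair of large web pages:

from itertools import combinations

def check_links(sessions, lwps):
    """Return, for each web user (IP), the set of links between large web pages (cf. Algorithm 2).

    A link {m, k} exists if the two large web pages m and k occur together
    in at least one session of that user; the direction of the link is ignored.
    """
    links = {}
    for ip, user_sessions in sessions.items():
        large = lwps.get(ip, set())
        user_links = set()
        for session in user_sessions:
            present = sorted(large.intersection(session))
            # every unordered pair of large pages seen in the same session is linked
            user_links.update(combinations(present, 2))
        links[ip] = user_links
    return links

Applied to the example of Table 1 with the large web pages {p6, p12, p14, p19}, this should yield the same three links for web user 1 as reported in the example below, up to the ordering within each pair.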
After generating all of the links, the generated links with the same IP address are grouped and linked through the same vertices. Then, the SPLGs of all web users can be generated. For example, consider the experimental data set in Table 1 for web user 1, who visited 20 web pages {p1, p2, ..., p20} in 14 user sessions. With the MSLWP defined as 0.25, the large web pages of user 1 are {p6, p12, p14, p19}, as generated in the previous section. After implementing Algorithm 2, the links {(web user 1, [p6, p12]), (web user 1, [p6, p19]), (web user 1, [p12, p14])} are obtained, and the SPLG of web user 1 can be described as shown in Figure 3 (right).
2.4. Generation of Emerging Patterns
After generating SPLGs for all web users, we try to find emerging patterns in these SPLGs. Examples of SPLGs for some web users are shown in Figure 4.
In the process of emerging-pattern mining, we use the ideas of the $\rho$-EP [30] and the JEP [31,32]. For example, we set the SPLG of web user U1 as class 1, and the SPLGs of the other web users as class 2. Table 3 shows the web pages in these two classes. Table 4 lists all of the possible EPs, their supports, and their growth rates. The support is defined in Equation (2) and the growth rate in Equation (3):

$$\mathrm{supp}(p_i) = \frac{\mathrm{count}(p_i)}{N} \qquad (2)$$

$$\mathrm{GrowthRate}(p_i) = \frac{\mathrm{supp}_1(p_i)}{\mathrm{supp}_2(p_i)} \qquad (3)$$

where $\mathrm{count}(p_i)$ is the number of occurrences of pattern $p_i$ in the class, $N$ is the number of all patterns in the class, $\mathrm{supp}_1$ is the support value in class 1, and $\mathrm{supp}_2$ is the support value in class 2. If we set the minimum growth rate threshold $\rho$ to 1.5, there are nine EPs in class 1: four normal EPs ({p6}, {p12}, {p19} and {p12, p14}), whose growth rates are greater than the specified value of $\rho$, and five JEPs ({p6, p12}, {p6, p19}, {p6, p12, p14}, {p6, p12, p19} and {p6, p12, p14, p19}), whose growth rates are infinite.
The patterns {p6, p12}, {p6, p19}, {p6, p12, p14}, {p6, p12, p19} and {p6, p12, p14, p19} are the JEPs of class 1, with non-zero support in class 1 and zero support in class 2. It can be seen that these JEPs appear many times in class 1 but never in class 2; such differences can be usefully exploited to distinguish the favorite web pages of different web users.
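The support and growth-rate computations of Equations (2) and (3) can be sketched as follows. Here, each class is assumed to be a list of transactions (sets of web pages, e.g., the per-session page sets behind an SPLG), and the candidate patterns are enumerated up to a small maximum length; both choices are illustrative assumptions.

import math
from itertools import chain, combinations

def support(pattern, transactions):
    """Support of a pattern: fraction of transactions (sets of pages) that contain it."""
    if not transactions:
        return 0.0
    return sum(pattern <= t for t in transactions) / len(transactions)

def growth_rate(pattern, class1, class2):
    """Growth rate from class 2 to class 1 (Equation (3)); math.inf marks a JEP."""
    s1, s2 = support(pattern, class1), support(pattern, class2)
    if s2 == 0:
        return math.inf if s1 > 0 else 0.0
    return s1 / s2

def emerging_patterns(class1, class2, rho=1.5, max_len=4):
    """Enumerate candidate patterns over the pages of class 1 and keep those with growth rate >= rho."""
    items = sorted(set(chain.from_iterable(class1)))
    normal_eps, jeps = [], []
    for size in range(1, max_len + 1):
        for combo in combinations(items, size):
            pattern = frozenset(combo)
            gr = growth_rate(pattern, class1, class2)
            if gr == math.inf:
                jeps.append(pattern)         # non-zero support in class 1, zero in class 2
            elif gr >= rho:
                normal_eps.append(pattern)   # normal EP with finite growth rate
    return normal_eps, jeps

With the class-1 and class-2 page sets of Table 3 and rho = 1.5, this enumeration should separate the four normal EPs from the five JEPs listed above.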
2.5. Clustering of Web Users
In this paper, we execute the K-means clustering algorithm [24,25] on emerging patterns to cluster the web users. First, we generate a TF-IDF-based weighted matrix that reflects how important a web page is to a web user. In the matrix construction, the TM matrix is defined as U by P, where U is the number of web users, P is the number of web pages that occur in the emerging patterns of all web users, and $TM_{ij}$ represents the TF-IDF weighted value of web page $j$ visited by web user $i$, with $1 \le i \le U$ and $1 \le j \le P$. According to Equation (4), we can compute the TF-IDF value of web page $j$ for web user $i$, and thereby obtain the TF-IDF-based TM matrix. Then, we execute the K-means algorithm on the generated TM with a specified K value to get the clusters of web users:

$$TM_{ij} = \frac{n_{ij}}{\sum_{i=1}^{U} n_{ij}} \times \log\frac{U}{U_j} \qquad (4)$$

where $n_{ij}$ is the number of occurrences of web page $j$ for user $i$, $\sum_{i=1}^{U} n_{ij}$ is the number of occurrences of web page $j$ for all web users, $U$ is the number of web users, and $U_j$ is the number of web users who accessed web page $j$.
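The following Python sketch illustrates this step. It assumes a counts dictionary that maps each web user to the occurrence counts of the emerging-pattern pages, and it uses scikit-learn's KMeans as one possible K-means implementation; both the data layout and the library choice are assumptions rather than the paper's prescribed setup.

import math
import numpy as np
from sklearn.cluster import KMeans

def tfidf_matrix(counts, pages):
    """Build the U x P TF-IDF matrix TM of Equation (4).

    counts: dict mapping each web user to {page: n_ij} over the emerging-pattern pages.
    pages:  ordered list of the P web pages that occur in the emerging patterns.
    """
    users = sorted(counts)
    col_total = {p: sum(counts[u].get(p, 0) for u in users) for p in pages}           # occurrences of page j over all users
    user_freq = {p: sum(1 for u in users if counts[u].get(p, 0) > 0) for p in pages}  # users who accessed page j
    tm = np.zeros((len(users), len(pages)))
    for i, u in enumerate(users):
        for j, p in enumerate(pages):
            n_ij = counts[u].get(p, 0)
            if n_ij > 0:
                tf = n_ij / col_total[p]
                idf = math.log(len(users) / user_freq[p])
                tm[i, j] = tf * idf
    return users, tm

def cluster_users(counts, pages, k):
    """Cluster web users by running K-means on the TF-IDF matrix; returns user -> cluster id."""
    users, tm = tfidf_matrix(counts, pages)
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(tm)
    return dict(zip(users, labels))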
2.6. Annotation of Clusters
After clustering, we label the clusters based on the concept of Folksonomies. Each cluster is defined as one user group, and the web pages in each cluster are defined as online items; we use TF-IDF to calculate the frequency of each web page in each cluster. According to Equation (5), we can calculate the TF-IDF value of each web page in each cluster, and then select the web pages whose TF-IDF values are among the Top N of that cluster (N can be freely chosen by the user, as long as N is smaller than the number of web pages in the cluster) as the label of the cluster:

$$\mathrm{TFIDF}_{ij} = \frac{n_{ij}}{\sum_{i=1}^{C} n_{ij}} \times \log\frac{C}{C_j} \qquad (5)$$

where $n_{ij}$ is the number of occurrences of web page $j$ in cluster $i$, $\sum_{i=1}^{C} n_{ij}$ is the number of occurrences of web page $j$ in all clusters, $C$ is the number of clusters, and $C_j$ is the number of clusters that contain web page $j$.
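A short sketch of the labeling step is given below. It assumes that cluster_pages maps each cluster id to a Counter of web-page occurrences in that cluster, and that N (top_n) is chosen by the user as described above.

import math

def label_clusters(cluster_pages, top_n):
    """Label each cluster with its Top-N web pages by the cluster-level TF-IDF of Equation (5).

    cluster_pages: dict mapping cluster id -> collections.Counter of web-page occurrences in that cluster.
    """
    num_clusters = len(cluster_pages)
    all_pages = set().union(*cluster_pages.values())
    page_total = {p: sum(c[p] for c in cluster_pages.values()) for p in all_pages}             # occurrences over all clusters
    cluster_freq = {p: sum(1 for c in cluster_pages.values() if c[p] > 0) for p in all_pages}  # clusters containing page j
    labels = {}
    for cid, counter in cluster_pages.items():
        scores = {}
        for page, n_ij in counter.items():
            tf = n_ij / page_total[page]
            idf = math.log(num_clusters / cluster_freq[page])
            scores[page] = tf * idf
        # keep the Top-N pages with the largest TF-IDF values as the label of this cluster
        labels[cid] = [p for p, _ in sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]]
    return labels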