1. Introduction
As the Internet grows more popular, the amount of information increases dramatically, making “information overload” a serious issue [1]. Against this background, personalized information services and recommendations have become possible, and user portraits have become a popular application of big data. A user portrait is a method for labeling users with characteristic information, such as personal attributes, online behaviors, and consumer activities. User portraits allow service providers to know their users better and then offer services that better meet their personal needs. User portraits can also help users know themselves better, encouraging more positive attitudes and better behavior that make individual lives more meaningful.
To predict user attributes accurately, a variety of machine learning techniques have been applied to the task of user portraits, in which historical data are used as the input to the learning model. In this paper, we describe the requirements of user portraits and propose a multi-model approach for constructing them. The proposed model integrates multiple machine learning and deep learning models and uses their outputs as the input to XGBoost (a scalable machine learning system for tree boosting) [2] for further training. We will show that integrating several models allows more accurate information to be extracted and thus improves the accuracy of user portraits.
The rest of this paper is organized as follows. Section 2 reviews some related work and introduces the theoretical basis. Section 3 describes the proposed model, and Section 4 describes the experimental setup as well as the results. Finally, Section 5 concludes the paper.
2. Related Work
The concept of user portraits was first introduced by Alan Cooper [3], the father of interaction design. By analyzing information about users’ social properties and behaviors [4,5], user portraits can be constructed to provide an important data basis for further accurate and rapid analysis of the behaviors and the habits of the user [6]. The results can help enterprises quickly find classified user groups and users’ current needs and, at the same time, give users a profound understanding of themselves.
Current research on user portraits has mainly followed three directions. The first is on user attributes, with the main purpose of understanding the user by collecting feature information through the social annotation system [7]. The second is on user preference, with the main purpose of improving the quality of personalized recommendations by measuring the degree of users’ interest [8]. The third is on user behavior, with the main purpose of predicting user behavior trends to prevent the loss of customers [9] and to devise appropriate measures. In the application of forecasting power companies’ arrears [10], it is very helpful to discover the characteristics of customers and provide decision-making support for power companies.
In order to classify bloggers in terms of age, Rosenthal et al. [11] used text and social features to construct user profiles. Mueller et al. [12] extracted features from usernames on Twitter and made gender judgments to build user profiles. Marquardt et al. [13] proposed using multi-label classification to improve the accuracy of the prediction of gender and age. To describe the user portrait of SNS under the dynamic social network structure and users’ historical preference, Wu et al. [14] proposed a model of user preference prediction and user social advice. To find users with similar habits, Ma et al. [15] proposed a method to portray similar users and to mine shared user habits from mobile devices. Zhu et al. [16] performed emotional analysis and built user portraits by collecting device logs and mobile device usage records. Zhang et al. [17] proposed a data mining model to construct mobile user portraits based on Internet logs, basic user information, package information, terminal information, business orders, and other information. Huang et al. [18] analyzed three aspects of mobile users, i.e., frequent activities, regular behavior, and movement speed, and portrayed mobile users for the purpose of providing personalized services to customers.
In our work, the age, gender, and educational background of users were predicted based on one month of user data. Such work is a kind of document classification task in natural language processing, in which the most commonly used word representation method is the bag-of-words model [19,20]. The drawback of such a method is that it has sparse features with little semantic information. However, previous work [21,22] has demonstrated that neural network models can be more effective in handling a variety of natural language processing tasks. We thus proposed applying a recurrent neural network (RNN) [23] to resolve sentence and context dependencies. Moreover, long short-term memory networks (LSTMs) [24] are currently the most successful RNN structure, and a lot of work has shown that LSTMs can learn long-range dependencies in many natural language processing (NLP) tasks [25,26]. However, since our data are users’ historical query words, which are a kind of short text, there is no clear relationship between the terms, although there are a lot of local features in the query words. A convolutional neural network (CNN) [27] can be used to extract local features, such as words, n-grams, and phrases. Collobert [28,29] was among the first to apply CNN to NLP. Since then, CNN has been successfully applied to document classification tasks [30,31]. However, tag prediction is a simple text classification task in which a deep learning model is not necessarily better than a shallow neural network model [32]. In performing such simple tasks, shallow neural network models such as Word2vec [33] and Doc2vec [34] train faster than deep neural network models. Therefore, we will use shallow neural network models in our method.
3. The Multi-Stage User Portrait Model
Our multi-model fusion method, which uses a two-stage structure, is shown in
Figure 1.
In the first stage, three models are used. The traditional machine learning model, together with the term frequency-inverse document frequency (TF-IDF) [35] representation, is used to extract the differences in user habits, and the neural network models are used to extract the semantic association information of the queries. The corresponding three subtasks can be executed in parallel.
A support vector machine (SVM) [36] is a two-class classifier that aims to find a hyperplane that separates a dataset into two parts. It can also be adapted to solve multi-class problems by combining multiple two-class SVMs. Thus, SVM can be applied to classify users’ retrieval records to obtain classification results on users’ education, age, and gender.
Doc2vec can be used to obtain vector representations of sentences, paragraphs, or documents. The learned vectors can then be used to identify similarities between sentences, paragraphs, or documents by calculating distances, which yields classifications of users’ education, gender, and age.
The convolutional layer and the pooling layer of the CNN extract the features of short text sentences through convolution and pooling operations, respectively, to obtain generalized binary and ternary feature vectors. The two types of feature vectors are concatenated to form a feature matrix, which is input into the LSTM neural network structure to predict user tags. Then, the vector matrix output by the LSTM model is input into the dropout layer to prevent overfitting. Subsequently, the vector matrix is input into the fully connected layer to reduce its dimensionality. Finally, the probability distribution of user tags is obtained through the softmax activation function.
The information on users’ education, age, and gender obtained from these three subtasks is used as the input to the fusion model.
To further optimize the results from the first stage, the XGBoost tree model and stacking multi-model fusion are used to improve the accuracy and the generalization ability of our model. Stacking performs K-fold cross-validation on the data obtained from the three models in the first stage and outputs the prediction results. The prediction results of each model are then combined and averaged to form the new feature set and verification set. These results are input into the XGBTree model for training before linear fusion is performed to get the desired output. The execution process of the model is shown in
Figure 2.
3.1. User Portrait Based on SVM
SVM has a strong theoretical basis that ensures that the global optimal solution, rather than a local minimum, is obtained. In addition, SVM, which was originally intended to solve two-class classification problems, has a good generalization ability for unknown samples. That is why we decided to use SVM for text classification. To perform classification, SVM first extracts features from the original space and maps the samples in the original space to vectors in a higher dimensional feature space so as to overcome linear inseparability in the original space.
Let us assume that the training dataset is (x_1, y_1), (x_2, y_2), …, (x_l, y_l), with x_i ∈ R^n and y_i ∈ {+1, −1}. To find the optimal hyperplane wx + b = 0, we need to solve Formula (1) for α subject to its constraints, and Formula (2) is used after getting α. Then, we get w, which satisfies Formula (3). Thus, we get b. Finally, to determine whether a certain sample z belongs to class α, we evaluate the decision function f(z) of Formula (4): if f(z) = 1, then z is a member of class α; otherwise it is not.
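For reference, the textbook hard-margin linear SVM matches the sequence described here (solve for α, recover w, recover b, classify z). The block below is a sketch of that standard form only; it is an assumption that Formulas (1)–(4) take this shape, and the exact formulation used in this work (for example, a soft-margin bound on α or a kernel in place of the inner product) may differ.

```latex
\begin{aligned}
&\min_{\alpha}\;\frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}\alpha_i\alpha_j y_i y_j\,(x_i\cdot x_j)-\sum_{i=1}^{l}\alpha_i
\quad\text{s.t.}\quad \sum_{i=1}^{l}\alpha_i y_i=0,\;\;\alpha_i\ge 0 \\
&w=\sum_{i=1}^{l}\alpha_i y_i x_i \\
&\alpha_i\,\bigl[y_i(w\cdot x_i+b)-1\bigr]=0
\;\Rightarrow\; b=y_j-w\cdot x_j \ \text{ for any } j \text{ with } \alpha_j>0 \\
&f(z)=\operatorname{sgn}(w\cdot z+b)
\end{aligned}
```

In this form, only the support vectors (samples with α_j > 0) contribute to w and b, and f(z) = 1 places z on the positive side of the hyperplane.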
The steps for constructing a user portrait based on the SVM model are as follows.
SVM is a supervised learning model that uses a manually classified document set Dl to train the classifier. To improve the accuracy of the classification, we should increase the size of the training set by increasing the number of documents contained in Dl. However, the well-classified document set Dl is an expensive and scarce resource compared to the unclassified document set Dw; in fact, |Dw| is generally much larger than |Dl|.
The EM algorithm regards the categories of unclassified documents as incomplete data and transforms Dw into classified documents automatically through iteration, thus enlarging the scale of training set D. The iterative algorithm I_SVM discussed in this paper combines the EM algorithm and the SVM algorithm and sets D = {Dl, Dw}, exhibiting the characteristics of both supervised learning and unsupervised learning. It is thus a semi-supervised learning algorithm.
Let Dl and Dw be the classified and unclassified document sets, respectively. In addition, let Zk represent the collection of categories of unclassified documents at iteration k, where k is the number of the current iteration. The I_SVM algorithm is executed as follows:
- (1)
The SVM classifier is initialized with the classified documents Dl (see Formulas (1)–(3)), from which the parameters α, w, and b are obtained.
- (2)
E_step: The category of each document di ∈ Dw is calculated and judged by the parameters α, w, and b using Formula (4).
- (3)
M_step: The parameters of the SVM model, i.e., α, w, and b, are recalculated based on D = {Dl, Dw}.
- (4)
If the categories assigned to the documents change or k is less than the specified number of iterations, then k = k + 1 and go to step (2).
- (5)
Otherwise, the classifier has become stable, and the final SVM classifier is generated.
- (6)
The final SVM classifier is used to classify test documents and output the classification results.
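The iteration above can be sketched as a self-training loop. The sketch below is illustrative only and makes several assumptions: a nearest-centroid linear classifier (`train_centroid_classifier`, a hypothetical helper) stands in for SVM training so that the example needs no external library, labels are in {+1, −1}, and the loop stops as soon as the pseudo-labels stop changing or an iteration cap is reached, which is one reading of the stopping rule.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Model = Tuple[List[float], float]  # (w, b) of a linear decision function


def train_centroid_classifier(xs: List[Vector], ys: List[int]) -> Model:
    """Stand-in for SVM training (assumption): the hyperplane that lies
    halfway between the two class centroids."""
    dim = len(xs[0])
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == -1]
    mu_pos = [sum(x[d] for x in pos) / len(pos) for d in range(dim)]
    mu_neg = [sum(x[d] for x in neg) / len(neg) for d in range(dim)]
    w = [p - n for p, n in zip(mu_pos, mu_neg)]
    mid = [(p + n) / 2 for p, n in zip(mu_pos, mu_neg)]
    b = -sum(wi * mi for wi, mi in zip(w, mid))
    return w, b


def predict(model: Model, x: Vector) -> int:
    """Decision function f(x) = sgn(w.x + b), returning +1 or -1."""
    w, b = model
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


def iterative_self_training(
    labeled_x: List[Vector],
    labeled_y: List[int],
    unlabeled_x: List[Vector],
    max_iter: int = 10,
) -> Tuple[Model, List[int]]:
    """Alternate pseudo-labeling (E-step) and retraining (M-step)."""
    model = train_centroid_classifier(labeled_x, labeled_y)  # initialize on Dl
    pseudo: List[int] = []
    for _ in range(max_iter):
        new_pseudo = [predict(model, x) for x in unlabeled_x]  # E-step on Dw
        if new_pseudo == pseudo:  # categories no longer change: stable
            break
        pseudo = new_pseudo
        # M-step: retrain on D = Dl plus Dw with its current pseudo-labels
        model = train_centroid_classifier(
            list(labeled_x) + list(unlabeled_x), list(labeled_y) + pseudo
        )
    return model, pseudo


labeled_x = [[0.0, 0.0], [4.0, 4.0]]
labeled_y = [-1, 1]
unlabeled_x = [[0.5, 0.2], [3.5, 4.2], [1.0, 0.5], [3.0, 3.8]]
model, pseudo_labels = iterative_self_training(labeled_x, labeled_y, unlabeled_x)
```

With a real SVM, only the training and prediction helpers would change; the loop structure stays the same.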
3.2. User Portrait Based on Doc2vec Neural Network Model
Based on the Word2vec model, the Doc2vec model learns vector representations of documents by adding a document vector to the training of word vectors. Doc2vec is an unsupervised learning algorithm with several advantages: it does not require sentences of a fixed length, it accepts sentences of different lengths as training samples, and it does not suffer from the lack of semantics of the bag-of-words model. There are two training methods in Doc2vec, i.e., the distributed memory model of paragraph vectors (PV-DM) and the distributed bag of words version of paragraph vector (PV-DBOW) [37]. If the document matrix is D and the word matrix is W, in the process of training PV-DBOW, D is used as the input to predict the word w in the document, as shown in Figure 3, whereas in the process of training PV-DM, D and words other than w are used as the input to predict word w, as shown in Figure 4. Both PV-DM and PV-DBOW update the weight U and document matrix D of the model classifier using the error gradient calculated through back propagation.
3.3. User Portrait Based on CNN+LSTM
A convolutional neural network (CNN) is a feedforward neural network widely used in processing time-series data and image data. CNN is a special deep neural network model whose sparse (not fully connected) connectivity and weight sharing resemble a biological neural network and reduce the number of weights and the complexity of the network model, as shown in Figure 5.
The convolution operation can improve machine learning performance in three respects: sparse connectivity, weight sharing, and equivariant representations.
Based on these features of CNN, a two-layer parallel convolutional neural network is proposed, as shown in Figure 6. The features of the short text can then be extracted and represented. The specific design is described below.
- 1.
Embedding layer
First, the users’ short text data is segmented using the Jieba word segmentation tool to obtain the set of words. Then, Word2vec is used to form word vectors. Since the text is short and concise, the length of a sentence, i.e., the number of words, is limited to 50. Each sentence is passed through the embedding layer, which maps each word to a 256-dimensional vector, so that the output of the layer is a two-dimensional matrix of size 50 × 256. In general, each sentence forms a two-dimensional matrix of size n × m, Z = [w1, …, wi, …, wn], where wi = [xi1, …, xij, …, xim].
- 2.
Convolution layer feature extraction
The function of the convolution layer is to extract the semantic features of sentences, with each convolution kernel corresponding to one extracted feature. Our model sets the number of convolution kernels to 128. For each sentence matrix Z of the embedding layer, the convolution operation is carried out using Formula (5), where S represents the feature matrix extracted by the convolution computation, and the weight matrix W and bias vector b are the parameters learned by the network.
To facilitate the calculation, a non-linear mapping is applied to the convolution result of each convolution kernel, based on Formula (6).
To extract features in a more comprehensive way, binary and ternary features of sentences are extracted using convolution windows of sizes 2 and 3, respectively.
- 3.
K-max pooling feature dimensionality reduction
The extracted features are passed to the pooling layer after the convolution calculation. In this method, we used K-max pooling, which selects the top K values of each filter to express the semantic information of the filter. The value K is determined using Formula (7), where Len is the length of the sentence vector, which is 50, and fs is the size of the convolution window.
After the pooling operation, the number of feature values extracted by each convolution kernel is significantly reduced while the core semantic information of the sentence is retained. As the number of convolution kernels is set to 128, the sentence representation matrix generated after pooling is W ∈ R^(K×128).
The convolution and the pooling layers of the CNN extract the features of short sentences through convolution and pooling operations, respectively, yielding the generalized binary and ternary feature vectors. The fusion layer then concatenates the two feature vectors to form the input matrix to the LSTM model.
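The three steps above (embedding matrix, convolution with windows of 2 and 3, and K-max pooling followed by fusion) can be illustrated with a small pure-Python sketch. Everything here is an assumption made for illustration: the function names are hypothetical, a single kernel per window size is used instead of 128, ReLU is used as the non-linear mapping, K is passed in explicitly rather than derived from Len and fs, and the dimensions are tiny so the values can be checked by hand.

```python
from typing import List

Matrix = List[List[float]]


def pad_or_truncate(sentence: Matrix, max_len: int, dim: int) -> Matrix:
    """Force a sentence (a list of word vectors) to exactly max_len rows,
    cutting long sentences and zero-padding short ones."""
    rows = [list(row) for row in sentence[:max_len]]
    rows += [[0.0] * dim for _ in range(max_len - len(rows))]
    return rows


def conv1d_relu(z: Matrix, kernel: Matrix, bias: float) -> List[float]:
    """Slide one kernel of shape (window, dim) over the rows of z and apply
    ReLU (an assumed choice of non-linearity) to every window response."""
    window = len(kernel)
    features = []
    for start in range(len(z) - window + 1):
        total = bias
        for offset in range(window):
            total += sum(k * x for k, x in zip(kernel[offset], z[start + offset]))
        features.append(max(0.0, total))
    return features


def k_max_pooling(values: List[float], k: int) -> List[float]:
    """Keep the k largest values while preserving their original order."""
    top = sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]
    return [values[i] for i in sorted(top)]


# Three words with 2-dimensional embeddings, padded to a length of 5.
sentence = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]
z = pad_or_truncate(sentence, max_len=5, dim=2)

bigram_kernel = [[1.0, 1.0], [1.0, 1.0]]                 # window size 2
trigram_kernel = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]    # window size 3

bigram_features = conv1d_relu(z, bigram_kernel, bias=0.0)
trigram_features = conv1d_relu(z, trigram_kernel, bias=0.0)

# Fusion: concatenate the pooled binary and ternary features.
fused = k_max_pooling(bigram_features, 2) + k_max_pooling(trigram_features, 2)
```

In the full model each window size would have 128 kernels, so the pooled output per window is a K × 128 matrix rather than a list of K numbers.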
The LSTM neural network is a special recurrent neural network that can learn long-term dependencies and achieves especially good results in text processing. In a language model, it uses a series of contextual information to predict the probability of the next word, and it alleviates the vanishing gradient problem of traditional RNNs in processing long sequences. Since the general principles are the same, we chose the GRU model, a variant of the LSTM model that merges the forget and input gates into a single “update gate” and combines the cell state with the hidden state. For this reason, we still refer to the GRU as the LSTM neural network structure, which is shown in
Figure 7.
The formulas involved in
Figure 7 are expressed in Formulas (8), (9), (10), and (11), respectively.
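For reference, the standard GRU update consists of four equations, which is consistent with the four formulas cited here. The block below gives that textbook form as a sketch; it is an assumption that Formulas (8)–(11) follow this notation, where x_t is the input at time t, h_{t−1} is the previous hidden state, σ is the logistic sigmoid, and ⊙ is element-wise multiplication.

```latex
\begin{aligned}
z_t &= \sigma\bigl(W_z\cdot[h_{t-1},\,x_t]\bigr) \\
r_t &= \sigma\bigl(W_r\cdot[h_{t-1},\,x_t]\bigr) \\
\tilde{h}_t &= \tanh\bigl(W\cdot[r_t\odot h_{t-1},\,x_t]\bigr) \\
h_t &= (1-z_t)\odot h_{t-1} + z_t\odot\tilde{h}_t
\end{aligned}
```

Here z_t is the update gate and r_t is the reset gate; the last line shows how the update gate interpolates between the previous state and the candidate state.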
The feature matrix extracted by the CNN model is input into the LSTM neural network to predict user tags. The input of LSTM at time t is composed of the feature vector and the output ht−1 of LSTM at time t−1.
The vector matrix output by the LSTM model is input into the dropout layer to prevent overfitting. In the training process, a certain proportion of the hidden layer neurons is randomly dropped, while the numbers of neurons in the input layer and in the output layer are kept unchanged. Subsequently, the vector matrix is input into the fully connected layer to reduce the number of dimensions. Finally, the probability distribution of user tags is obtained through the softmax activation function.
3.4. Integration of the Models
The method used to combine individual learners is called the combination strategy. For classification problems, voting is used to select the class predicted by the most learners. For regression problems, averaging is used to combine the outputs of the learners.
Both voting and averaging are very effective combination strategies. A strategy that uses another machine learning algorithm to combine the results of the individual learners is called stacking.
In the training process of each base model, the training set is divided into five parts by using fivefold cross-validation in which four parts are used for training the model in turns and one part is used for validation of the model.
Finally, the prediction results of each base model are spliced together and used as the input to the second layer XGBoost model.
In our multi-model approach, the first-stage models are trained on three sub-tasks, and the output probabilities of the models are used as the input to the next-stage model. The three sub-tasks are user portraits based on SVM, CNN + LSTM, and Doc2vec, respectively. Since the three prediction targets have six, six, and two classes, respectively, the first feature dimension is 4*(6 + 6 + 2) = 56.
In our model, fusion is very important for the following three key reasons:
- (1)
In the multi-classification task, the general model is based on OneVsRest or OneVsOne. Thus, each classifier can only see the classification information of two classes. After the probability value of each class is output by means of stacking, the second-layer model can see all the classification results, which allows threshold judgment, mutual checking, and so on. The fusion effect on the two-class gender task is not as good as that on the other two six-class tasks.
- (2)
There are some correlations between the three subtasks, especially between age and education. Therefore, the second-stage model can learn these relationships between the attributes well. For example, when we removed the age and the gender features when predicting educational background, the results became somewhat worse.
- (3)
This dataset has a problem of data imbalance. However, since the evaluation metric is accuracy, downsampling and upsampling are unnecessary. With the XGBoost model, the optimal threshold of each category can be learned very well.
The specific steps of the multi-model approach are as follows:
- (1)
The dataset is divided into a training set and a reserved (hold-out) set.
- (2)
The training set is processed by K-fold cross-validation, and thus K base classifiers are trained. Their predictions on the validation folds are spliced together and used as the training set of the second-layer model.
- (3)
The base classifiers in step (2) are also used for prediction on the reserved (hold-out) set, with the prediction results being averaged to form the verification set of the second-layer model.
- (4)
The training set of step (2) and the verification set of step (3) are used to train the second layer of the XGBTree model.
- (5)
Step (4) is repeated to train multiple XGBTree models, which are then linearly fused to further improve the generalization ability of the model.
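Steps (2) and (3) can be sketched as follows. This is a minimal illustration under stated assumptions: `stack_features` and `fit_class_prior` are hypothetical names, the toy base learner simply predicts class frequencies and stands in for SVM, Doc2vec, or CNN + LSTM, folds are assigned by index modulo K rather than by shuffling, and the second-layer XGBoost training of step (4) is not shown.

```python
from typing import Callable, List, Sequence, Tuple

Row = Sequence[float]
Predictor = Callable[[Row], List[float]]          # row -> class probabilities
Fit = Callable[[List[Row], List[int]], Predictor]  # training data -> predictor


def fit_class_prior(xs: List[Row], ys: List[int], n_classes: int = 2) -> Predictor:
    """Toy base learner (assumption): always predicts the class
    frequencies observed in its training data."""
    counts = [0] * n_classes
    for y in ys:
        counts[y] += 1
    probs = [c / len(ys) for c in counts]
    return lambda row: list(probs)


def stack_features(
    train_x: List[Row],
    train_y: List[int],
    holdout_x: List[Row],
    fit: Fit,
    k: int = 5,
) -> Tuple[List[List[float]], List[List[float]]]:
    """Return (out-of-fold predictions for the training set,
    fold-averaged predictions for the hold-out set)."""
    n = len(train_x)
    oof: List[List[float]] = [[] for _ in range(n)]
    holdout_sum: List[List[float]] = []
    for fold in range(k):
        val_idx = [i for i in range(n) if i % k == fold]
        fit_idx = [i for i in range(n) if i % k != fold]
        model = fit([train_x[i] for i in fit_idx], [train_y[i] for i in fit_idx])
        # Step (2): each training row is predicted by the model that did not see it.
        for i in val_idx:
            oof[i] = model(train_x[i])
        # Step (3): every fold model also predicts the hold-out set.
        preds = [model(row) for row in holdout_x]
        if not holdout_sum:
            holdout_sum = preds
        else:
            holdout_sum = [[a + b for a, b in zip(s, p)]
                           for s, p in zip(holdout_sum, preds)]
    holdout_avg = [[v / k for v in row] for row in holdout_sum]
    return oof, holdout_avg


train_x = [[float(i)] for i in range(10)]
train_y = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
holdout_x = [[10.0], [11.0]]
oof, holdout_avg = stack_features(train_x, train_y, holdout_x, fit_class_prior, k=5)
```

Running this once per base model and concatenating the resulting probability columns gives the second-layer training matrix (from `oof`) and verification matrix (from `holdout_avg`).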
4. Experiment and Analysis
4.1. Dataset and Experiment Setup
Dataset. The dataset used in the experiment was derived from the “Sogou User Portrait Mining in Big Data Precision Marketing” competition provided by the China Computer Federation (CCF) in 2016. It contained one month of users’ query terms as well as users’ demographic attribute labels, such as gender, age, and education, as the training data. Participants were required to construct classification algorithms through machine learning and data mining techniques to determine the demographic attributes of new users. That is, the age, gender, and educational background were predicted for each record in the test.csv file. The training dataset contained 100,000 data items, and so did the test dataset. The description of all data fields is provided in
Table 1.
Word segmentation. Since there are no spaces between words in Chinese sentences, it was necessary to divide the sentences into words before text categorization could be performed.
Through the analysis of the sample data, we found that spaces, punctuation marks, and many stop words were helpful for distinguishing between the basic attributes of users. Therefore, in this experiment, the spaces, the punctuation marks, and the stop words were not removed during word segmentation and were retained for the TF-IDF feature calculation.
The text in the datasets consisted mainly of users’ short search records, so it was of great importance to perform word segmentation efficiently. Three word segmentation methods, i.e., JIEBA, THULAC, and N-gram, were tried and compared. Eventually, JIEBA was selected to perform word segmentation.
Text representation. Text information is unstructured data, so it is necessary to use a mathematical model to express text data in a form that can be processed by the computer. In our work, the vector space model was used to represent texts. The idea of the model is to map text di to a feature vector V(di), where V(di) = {(ti1,wi1), (ti2,wi2), …, (tin,win)} and tik is the kth characteristic item of document di, and wik is the weight corresponding to the kth characteristic item. In our work, two text representation methods were used, with the first one being the TF-IDF text representation and the second one being the neural network text representation.
The main idea of the TF-IDF model is that if a word appears frequently in one document and rarely in others, the word is considered to have a good ability to separate the document from the others. However, the disadvantage of this method is that it ignores how a term is distributed across classes, although a term that is concentrated in one class should be given a higher weight. In the study, a conditional probability was therefore introduced into the weight of each characteristic item. The improved TF-IDF weight calculation formula is expressed in Formula (12).
In the formula, tf(tik) denotes the frequency with which feature item tik appears in document di, N is the total number of texts, ni is the number of documents in the training set in which the feature item appears, and P(Ci | tik) denotes the probability that a document belongs to class Ci given that feature item tik appears in it. If a large number of documents in Ci contain the term tik while the number of documents containing the term in other classes is small, tik can represent the characteristics of the Ci class and is given a higher weight.
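To make the role of each quantity concrete, the sketch below computes a class-aware TF-IDF weight for the terms of one training document. It is an assumption, not the exact form of Formula (12): the weight is taken to be the product tf × log(N / n) × P(C | t), with P(C | t) estimated as the fraction of documents containing the term that belong to the document's class. The function name and the toy corpus are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def class_aware_tfidf(
    docs: List[List[str]], labels: List[str], doc_index: int
) -> Dict[str, float]:
    """Weight each term of docs[doc_index] by tf * idf * P(class | term).

    The multiplicative combination is an assumed form chosen for
    illustration; only tf, N, n, and P(C | t) come from the text.
    """
    n_docs = len(docs)
    label = labels[doc_index]
    term_freq = Counter(docs[doc_index])
    weights: Dict[str, float] = {}
    for term, freq in term_freq.items():
        containing = [i for i, d in enumerate(docs) if term in d]
        n_t = len(containing)  # documents that contain the term
        in_class = sum(1 for i in containing if labels[i] == label)
        idf = math.log(n_docs / n_t)
        p_class_given_term = in_class / n_t
        weights[term] = freq * idf * p_class_given_term
    return weights


docs = [
    ["game", "game", "news"],
    ["game", "sport"],
    ["recipe", "news"],
    ["recipe", "cook"],
]
labels = ["A", "A", "B", "B"]
weights = class_aware_tfidf(docs, labels, doc_index=0)
```

In this toy corpus, "game" occurs only in class A documents, so it keeps its full TF-IDF weight, whereas "news" is split evenly between the two classes and its weight is halved.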
4.2. Experimental Results
4.2.1. Evaluation Metrics
This paper adopted the classification accuracy rate to perform the evaluation.
True positives (TP) are the number of cases that are correctly classified as positive, i.e., the number of samples that are actually positive and are classified as positive by the classifier. False positives (FP) are the number of cases that are incorrectly classified as positive, i.e., the number of samples that are actually negative but are nonetheless classified as positive by the classifier. False negatives (FN) are the number of cases that are actually positive but are nonetheless classified as negative by the classifier. Finally, true negatives (TN) are the number of negative cases that are correctly classified, i.e., the number of samples that are actually negative and are classified as negative by the classifier.
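From these four counts, accuracy is conventionally computed as (TP + TN) / (TP + FP + FN + TN). The short sketch below shows this standard definition for a single positive class; the function names are hypothetical, and the multi-class accuracy used for age and education is simply the fraction of correctly classified samples.

```python
from typing import Sequence, Tuple


def confusion_counts(
    y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1
) -> Tuple[int, int, int, int]:
    """Return (TP, FP, FN, TN) for one class treated as positive."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Fraction of all samples that were classified correctly."""
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("at least one sample is required")
    return (tp + tn) / total


counts = confusion_counts([1, 0, 1, 1, 0], [1, 1, 0, 1, 0])
acc = accuracy(*counts)
```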
4.2.2. Analysis of the Results
As shown in
Table 6 and in
Figure 8 and
Figure 9, the multi-model fusion approach showed the highest prediction accuracy. In the Doc2vec model, both the DBOW training method and the DM training method produced reasonably good prediction results. However, the prediction results obtained by the DM training method were not as accurate as those of the DBOW method. The performance results of each model were analyzed as follows:
According to the experimental results, the Word2vec + CNN + LSTM model performed better than the other two single models because CNN can mine the main features in sentences better, and LSTM can combine contextual semantics well to compensate for the unclear semantics of Word2vec.
At the 35th iteration, the DBOW-NN model achieved the highest accuracy on the verification dataset. So, we chose the result of the model at this time.
For the multi-stage fusion, the three models in the first stage showed a lot of differences, but it can be seen that the generalization of the model after fusion was very strong. The second stage of the XGBoost model made full use of each sub-task of the first layer to predict the results, thus improving the accuracy of the prediction.