

Margin-based Active Learning and Background Knowledge in Text Mining

Catarina Silva, Bernardete Ribeiro

CISUC - Departamento Eng. Informatica - Universidade de Coimbra, Portugal
ESTG - Instituto Politecnico de Leiria, Portugal
{catarina,bribeiro}@dei.uc.pt

Abstract. Text mining, also known as intelligent text analysis, text data mining or knowledge discovery in text, refers generally to the process of extracting interesting and non-trivial information and knowledge from text. One of the main problems with text mining and classification systems is the lack of labeled data, as well as the cost of labeling unlabeled data (Kiritchenko and Matwin 2001). Thus, there is a growing interest in exploring the use of unlabeled data as a way to improve classification performance in text classification. The ready availability of this kind of data in most applications makes it an appealing source of information. In this work we evaluate the benefits of introducing unlabeled data in a support vector machine automatic text classifier. We further evaluate the possibility of learning actively and propose a method for choosing the samples to be learned.

Keywords: Text Mining, Support Vector Machines, Active Learning.

1 Introduction

Applications of text mining are ubiquitous, since almost 80% of the information available is stored as text. Thus, there is an effective interest in researching and developing applications that better help people handle text-based information. On the other hand, the wealth of text information has made the organization of that information a complex and vitally important task.

Most text categorization methods, e.g., K-Nearest Neighbor, Naive Bayes, Neural Nets and Support Vector Machines, have their performance greatly determined by the available training set. This is one key difficulty with current text categorization algorithms, since they require manual labeling of more documents than a typical user can tolerate (Schohn and Cohn 2000). Thus, methods that need only a small set of labeled examples are currently being explored.

Labeling data is expensive but, in most text categorization tasks, unlabeled data are often inexpensive, abundant and readily available. Therefore, to achieve this purpose, i.e., the use of relatively small training sets, the information that can be extracted from the testing set, or even from unlabeled examples, is being investigated as a way to improve classification performance. Seeger (Seeger 2001) presents a report on learning with unlabeled data that compares several approaches.

Our purpose is to evaluate the benefits of introducing unlabeled data in a support vector machine automatic text classifier and the possibility of actively learning the classification task.

The rest of the paper is organized as follows. Section 2 addresses several text classification issues, setting guidelines for problem formulation. Section 3 presents Support Vector Machines and their application to text mining/classification tasks. Section 4 focuses on the issues related to the use of unlabeled data and Section 5 presents the two approaches proposed and a comparison between them. Section 6 presents the results obtained and, finally, Section 7 presents some conclusions and future work.

2 Text classification

The goal of text classification is the automatic assignment of documents to a fixed number of semantic categories. Each document can be in multiple categories, exactly one, or none at all. Using machine learning, the objective is to learn classifiers from examples, which assign categories automatically. This is usually considered a supervised learning problem. To facilitate effective and efficient learning, each category is treated as a separate binary classification problem. Each such problem answers the question of whether or not a document should be assigned to a particular category (Joachims 1999).

Documents, which typically are strings of characters, have to be transformed into a representation suitable both for the learning algorithm and for the classification task. The most common representation is known as the Bag of Words and represents a document by the words occurring in it. Usually the irrelevant words are filtered out using a stopword list (Silva and Ribeiro 2003) and the word ordering is not deemed relevant for most applications. Information retrieval research proposes that, instead of words, the units of representation could be word stems.
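As a minimal sketch of the Bag of Words step just described (the stopword list here is an illustrative subset, not the one used in the paper):

```python
from collections import Counter

# Illustrative stopword subset; the paper's actual list is not specified.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}

def bag_of_words(document):
    """Represent a document by the words occurring in it, filtering
    irrelevant words with a stopword list and ignoring word order."""
    words = document.lower().split()
    return Counter(w for w in words if w not in STOPWORDS)

print(bag_of_words("The deadline of the paper is the conference deadline"))
```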

Proceedings of the Fourth International Conference on Hybrid Intelligent Systems (HIS04)


0-7695-2291-2/04 $ 20.00 IEEE
A word stem is derived from the occurrence form of a word by removing case and inflection information. For example "viewer", "viewing", and "preview" are all mapped to the same stem "view".

This leads to an attribute-value representation of text. Each distinct word wi corresponds to a feature TF(wi, x), representing the number of times word wi occurs in the document x. Refining this basic representation, it has been shown that scaling the dimensions of the feature vector with their inverse document frequency IDF(wi) leads to an improved performance. IDF(wi) (1) can be calculated from the document frequency DF(wi), which is the number of documents the word wi occurs in:

    IDF(wi) = log( D / DF(wi) )                                  (1)

Here, D is the total number of documents. The inverse document frequency of a word is low if it occurs in many documents and is highest if the word occurs in only one. To disregard different document lengths, each document feature vector x is normalized to unit length (Sebastiani 1999).
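The TF x IDF weighting of equation (1), followed by unit-length normalization, can be sketched as follows (a minimal illustration; the function and variable names are ours, and plain whitespace tokenization stands in for the full preprocessing):

```python
import math

def tfidf_vectors(docs):
    """Weight TF(w, x) by IDF(w) = log(D / DF(w)) as in equation (1),
    then normalize each document vector to unit length."""
    D = len(docs)
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({w for words in tokenized for w in words})
    # DF(w): number of documents the word w occurs in.
    df = {w: sum(1 for words in tokenized if w in words) for w in vocab}
    vectors = []
    for words in tokenized:
        vec = [words.count(w) * math.log(D / df[w]) for w in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0  # guard all-zero vectors
        vectors.append([v / norm for v in vec])
    return vocab, vectors
```

Note that a word occurring in every document gets IDF log(1) = 0 and is effectively discarded, which matches the intuition above.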
3 Support Vector Machines

SVM are a learning method introduced by Vapnik (Vapnik 1995) based on his Statistical Learning Theory and Structural Risk Minimization principle. When using SVM for classification, the basic idea is to find the optimal separating hyperplane between the positive and negative examples. The optimal hyperplane is defined as the one giving the maximum margin between the training examples that are closest to it. Support vectors are the examples that lie closest to the separating hyperplane. Once this hyperplane is found, new examples can be classified simply by determining on which side of the hyperplane they lie. Figure 1 shows a simple two-dimensional example, the optimal separating hyperplane and four support vectors.

[Figure 1. Optimal Separating Hyperplane.]

3.1 Support Vector Classification

Although text categorization is a multi-class, multi-label problem, it can be broken into a number of binary class problems without loss of generality. This means that instead of classifying each document into all available categories, for each pair {document, category} we have a two class problem: the document either belongs or does not belong to the category. Although there are several linear classifiers that can separate both classes, only one, the Optimal Separating Hyperplane, maximizes the margin, i.e., the distance to the nearest data point of each class, thus presenting better generalization potential.

The output of a linear SVM is u = w . x - b, where w is the normal weight vector to the hyperplane and x is the input vector. Maximizing the margin can be seen as the optimization problem:

    minimize    (1/2) ||w||^2,
    subject to  yi (w . xi + b) >= 1, for all i                  (2)

where xi is the ith training example and yi is the correct output for the ith training example, as represented in Figure 1.

Intuitively the classifier with the largest margin will give low expected risk, and hence better generalization.

To deal with the constrained optimization problem in (2), Lagrange multipliers alpha_i >= 0 and the Lagrangian (3) can be introduced:

    Lp = (1/2) ||w||^2 - sum_{i=1..l} alpha_i ( yi (w . xi + b) - 1 )      (3)

The Lagrangian has to be minimized with respect to the primal variables w and b and maximized with respect to the dual variables alpha_i (i.e. a saddle point has to be found) (Scholkopf et al. 1999).

SVM are universal learners. In their basic form, shown so far, SVM learn linear threshold functions. However, using an appropriate kernel function, they can be used to learn polynomial classifiers, radial basis function networks and three-layer sigmoid neural networks.
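To make the margin concrete, here is a small sketch of the linear SVM output u = w . x - b and the corresponding geometric distance to the hyperplane (the weight values are hypothetical, chosen for easy arithmetic):

```python
import math

def svm_output(w, b, x):
    """Linear SVM output u = w . x - b; its sign gives the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) - b

def distance_to_hyperplane(w, b, x):
    """Geometric distance |u| / ||w||; larger means a more confident
    classification, further from the separating hyperplane."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(svm_output(w, b, x)) / norm

w, b = [3.0, 4.0], 1.0  # hypothetical hyperplane, ||w|| = 5
print(svm_output(w, b, [1.0, 1.0]))              # 3 + 4 - 1 = 6.0, positive side
print(distance_to_hyperplane(w, b, [1.0, 1.0]))  # 6 / 5 = 1.2
```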
4 Using Unlabeled Data

To achieve the best classification performance with a machine learning technique, there has to be enough labeled data. However, these data are costly and sometimes difficult to gather. Therefore, using unlabeled data for text classification purposes has recently been actively investigated (Hong and Cho 2002) (Liu et al. 2003).
In general, unlabeled examples are much less expensive and easier to gather than labeled ones. This is particularly true for text classification tasks involving online data sources, such as web pages, email and news stories, where large amounts of text are readily available. Collecting this text can frequently be done automatically, so it is feasible to collect a large set of unlabeled examples. If unlabeled examples can be integrated into supervised learning, then building text classification systems will be significantly faster, less expensive and more effective.

There is a catch, however, because at first glance it might seem that nothing is to be gained from unlabeled data, since an unlabeled document does not contain the most important piece of information - its classification.

Consider the following example to give some insight into how unlabeled data can be useful. Suppose we are interested in recognizing web pages about conferences. We are given just a few conference and non-conference web pages, along with a large number of pages that are unlabeled. By looking at just the labeled data, we determine that pages containing the word "paper" tend to be about conferences. If we use this fact to estimate the classification of the many unlabeled web pages, we might find that the word "deadline" occurs frequently in the documents that are classified in the positive class. This co-occurrence of the words "paper" and "deadline" over the large set of unlabeled training data can provide useful information to construct a more accurate classifier that considers both "paper" and "deadline" as indicators of positive examples.

Some authors (Zelikovitz and Hirsh 2001) refer to unlabeled data as background knowledge, defining it as any unlabeled collection of text from any source that is related to the classification task.

Joachims presents in (Joachims 1999) a study on transductive SVM (TSVM), introduced by Vapnik (Vapnik 1995). TSVM make use of the testing set and extend inductive SVM, finding an optimal separating hyperplane not only for the training examples, but also for the testing examples (Silva and Ribeiro 2004).

The goal of active learning is to design and analyse learning algorithms that can effectively filter or choose the samples to be labeled by a supervisor. The incentive in using active learning is mainly to expedite the learning process and reduce the labeling effort required by the teacher (Baram et al. 2003). Schohn in (Schohn and Cohn 2000) proposes a method to actively learn with SVM, exploring the examples that are orthogonal to the space spanned by the training set, in order to give the classifier information about dimensions not yet explored.

[Figure 2. Testing examples (black dots) with small and large margin.]

5 Background Knowledge and Active Learning

In this section we propose and compare two approaches that incorporate unlabeled examples in the learning/classification task.

The idea underlying both approaches is that the information contained in the testing set (or in any set of unlabeled data that can be gathered) can be useful to improve the classification performance. Therefore, we propose two ways of integrating those examples, based on the margin with which they are classified (see Figure 2).

Approach 1 - Background Knowledge

Choose the testing examples classified by the SVM with more confidence (larger margin) and incorporate them directly in the training set as classified by the baseline inductive SVM. This approach can be considered as the use of background knowledge to improve text classification performance.

Approach 2 - Active Learning

A certain number of testing examples (in our case 40 examples: 20 positive and 20 negative) in which the SVM has less confidence (smaller margin) are chosen. This number of examples cannot be a large one, since the supervisor will be asked to manually classify them. After being correctly classified, they are integrated in the training set. This approach can be regarded as a form of active learning, where the information that an example can introduce in the classification task is considered inversely proportional to its classification margin.

Both approaches have advantages and disadvantages, and we expect to combine them so as to exploit the advantages and mitigate the disadvantages.
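The margin-based selection underlying both approaches can be sketched as follows. This is our own minimal illustration: it ranks testing examples by |u|, the absolute SVM output, and does not reproduce the paper's 20-positive/20-negative balancing:

```python
def select_by_margin(outputs, k_confident, k_uncertain):
    """outputs: list of (example_id, u) pairs, u being the SVM output.
    Approach 1 keeps the k most confidently classified examples together
    with their SVM-predicted labels (background knowledge); Approach 2
    returns the k least confident example ids, to be labeled by the
    supervisor (active learning)."""
    ranked = sorted(outputs, key=lambda pair: abs(pair[1]), reverse=True)
    background = [(i, 1 if u > 0 else -1) for i, u in ranked[:k_confident]]
    to_query = [i for i, _ in ranked[-k_uncertain:]]
    return background, to_query

outputs = [(0, 2.5), (1, -0.1), (2, -3.0), (3, 0.2)]
background, to_query = select_by_margin(outputs, k_confident=1, k_uncertain=1)
print(background)  # [(2, -1)]: largest |u|, labeled by the SVM itself
print(to_query)    # [1]: smallest |u|, sent to the supervisor
```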

For an objective comparison, we can use the following criteria:

1. User interaction: while the first approach is automated, the second approach needs some user interaction, since the selected items must be classified by the supervisor;

2. Correctness of training set: the first approach does not guarantee its correctness, since the added examples are classified by the inductive SVM, whereas in the second approach all examples in the training set are (correctly) classified by the supervisor;

3. Computational time: there is not a significant difference in the computational time used; however, the first approach can take longer, because the examples are automatically classified and there is no limit on the number of examples added;

4. Performance, measured as detailed in Section 6: the second approach has greater potential, since the information added is more reliable, but it is limited by the number of items the supervisor can tolerate/is able to classify (40 in our experiments).

6 Results

The Reuters-21578 collection was used with the ModApte split publicly available at http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html. Reuters-21578 is a financial corpus with news articles averaging 200 words each. Example categories are trade, earn or crude. In this corpus there are about 12000 stories classified into 118 possible categories. The ModApte split uses 75% of the articles (9603 items) for training and 25% (3299 items) for testing. Table 1 presents the ten most frequent categories and the number of positive training and testing examples.

Table 1. Number of positive training and testing documents for Reuters most frequent categories.

    Category      Training  Testing
    Earn              2715     1044
    Acquisitions      1547      680
    Money-fx           496      161
    Grain              395      138
    Crude              358      176
    Trade              346      113
    Interest           313      121
    Ship               186       89
    Wheat              194       66
    Corn               164       52

In addition to the ModApte split, a Small split was also tested. The testing set was exactly the same for the sake of comparison, but the training set, instead of 9603 examples, was randomly reduced to 10 positive examples and 10 negative examples. The idea was to reproduce a real situation in which a real user would be asked to provide these 20 examples.

6.1 Performance Criteria

The simulation results were evaluated using Accuracy, Recall, Precision and F1 measures, which were computed on the testing set for each category.

Accuracy is the percentage of correct classifications obtained (4):

    Accuracy = (categories found and correct) / (total categories)          (4)

Recall is the percentage of total documents for the given topic that are correctly classified (5):

    Recall = (categories found and correct) / (total categories correct)    (5)

Precision is the percentage of predicted documents for the given topic that are correctly classified (6):

    Precision = (categories found and correct) / (total categories found)   (6)

F1 measures were also considered. To compute the F1 measure we have used (7):

    F1 = (2 x Precision x Recall) / (Precision + Recall)                    (7)

Baseline Results

The baseline of comparison will be the results obtained with the SVM in the inductive setting, as described in Section 3; they are presented in Tables 2 and 3 for the ModApte split and the new Small split, respectively.

Approach 1 Results - Background Knowledge

Tables 4 and 5 present the results obtained for the first approach with both training/testing splits.

Analysing F1 values (preferred to analyse the performance of a method, since they fuse precision and recall), there is an improvement of 3% (from 71.72% to 73.86%) where the ModApte split is concerned (Tables 2 and 4), but not with the Small split (Tables 3 and 5), where there is a decrease (from 32.57% to 29.50%).

For this approach to be successful the baseline classifier cannot be too weak, since it is responsible for classifying the testing examples. That is not the case with the Small split: with only 20 examples the initial classifier is not accurate enough to determine new training examples.
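Equations (4)-(7) can be computed from a per-category contingency table; a minimal sketch, reading (4) as the fraction of correct classifications and using hypothetical counts:

```python
def evaluate(tp, fp, fn, tn):
    """Accuracy, Recall, Precision and F1 in the spirit of equations
    (4)-(7), guarding against empty denominators."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, recall, precision, f1

# Hypothetical counts for one category: 50 true positives, 10 false
# positives, 25 false negatives, 915 true negatives.
acc, rec, prec, f1 = evaluate(50, 10, 25, 915)
print(round(acc, 4), round(rec, 4), round(prec, 4), round(f1, 4))
```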

Table 2. Baseline results for ModApte Split (9603/3299): number of support vectors, accuracy, precision, recall, F1 and average values for Reuters most frequent categories.

    Category         SV     Acc   Prec    Rec     F1
    Earn           1632   95.92  95.53  92.50  93.99
    Acquisitions   1751   94.93  93.09  85.15  88.94
    Money-fx        908   96.13  71.43  52.80  60.72
    Grain           771   97.96  92.55  63.04  75.00
    Crude           693   97.04  84.85  63.64  72.73
    Trade           647   97.64  79.49  54.87  64.92
    Interest        742   97.15  77.03  47.11  58.46
    Ship            500   98.45  89.36  51.85  65.62
    Wheat           487   98.77  84.44  57.58  68.47
    Corn            484   99.08  93.33  53.85  68.29
    Average      861.15   97.31  86.11  62.24  71.72

Table 3. Baseline results for Small Split (20/3299).

    Category         SV     Acc   Prec    Rec     F1
    Earn             19   90.32  90.26  82.57  86.24
    Acquisitions     19   49.77  32.07  98.29  48.36
    Money-fx         18   38.33   8.11  95.65  14.95
    Grain            20   81.31  16.06  67.39  25.94
    Crude            18   70.50  15.52  84.66  26.23
    Trade            18   79.41  15.50  93.81  26.60
    Interest         18   54.10   8.02  93.39  14.77
    Ship             19   32.31   3.90  96.30   7.50
    Wheat            19   95.49  29.61  68.18  41.29
    Corn             20   98.20  52.00  25.00  33.77
    Average       18.80   68.97  27.11  80.52  32.57

Table 4. Background Knowledge - Results for ModApte Split (9603/3299).

    Category         SV     Acc   Prec    Rec     F1
    Earn           1651   95.85  93.27  95.59  94.42
    Acquisitions   1800   95.04  92.71  86.03  89.25
    Money-fx        928   96.13  71.07  53.42  60.99
    Grain           802   98.93  92.71  64.49  76.07
    Crude           697   97.18  85.29  65.91  74.36
    Trade           661   97.68  79.75  55.75  65.62
    Interest        744   97.22  76.92  49.59  60.30
    Ship            505   98.49  89.58  53.09  66.67
    Wheat           490   98.77  82.98  59.09  69.03
    Corn            505   99.20  95.62  71.58  81.87
    Average      878.30   97.36  85.99  65.45  73.86

Table 5. Background Knowledge - Results for Small Split (20/3299).

    Category         SV     Acc   Prec    Rec     F1
    Earn             42   90.14  82.43  93.01  87.40
    Acquisitions     92   40.65  28.63  99.12  44.43
    Money-fx         44   99.30  99.25 100.00  99.62
    Grain            23   37.35   6.74  92.75  12.57
    Crude            32   37.56   8.86  97.73  16.25
    Trade            36   21.89   4.81  99.12   9.17
    Interest         33   22.74   5.11  97.52   9.71
    Ship             86   30.83   3.82  90.30   7.33
    Wheat            24    5.03   2.39 100.00   4.67
    Corn             23    9.43   1.98 100.00   3.88
    Average       43.50   39.49  24.40  96.96  29.50

Approach 2 Results - Active Learning

Tables 6 and 7 present the results obtained for the second approach with both training/testing splits.

The improvement is more relevant (an improvement of over 40%, from 32.57% to 46.21%) on the Small split (Tables 3 and 7) than on the ModApte split, a predictable outcome, since the training set was substantially increased (20 initial examples plus 40 examples actively chosen to be classified by the supervisor). Where the ModApte split (Tables 2 and 6) is concerned, this active approach improves the baseline results by about 10% (from 71.72% to 79.32%).

7 Conclusions and Future Work

The results presented in the previous section are encouraging as to the improvement achieved by introducing unlabeled document information in the learning procedure.

The introduction of background knowledge via the SVM-classified testing examples should not be used with small training sets; however, it can constitute a slight improvement when the baseline classifier is not too weak.

The proposed margin-based active learning method has the potential to substantially improve performance when only small training sets are available. This conclusion is very important in text mining tasks, since usually there is a small number of classified examples and a huge number of unlabeled ones.

Future work will deal with the theoretical foundations of these experiments and research on other active methods to incorporate background knowledge. Further testing of the hybrid conjugation of both proposed methods is also foreseen.

Table 6. Active Learning - Results for ModApte Split (9603/3299).

    Category         SV     Acc   Prec    Rec     F1
    Earn           1662   96.27  94.00  95.98  94.98
    Acquisitions   1788   95.28  94.03  85.74  89.69
    Money-fx        947   96.59  76.67  57.14  65.48
    Grain           791   98.28  94.95  68.12  79.33
    Crude           719   97.43  88.72  67.05  76.38
    Trade           676   97.68  79.75  55.75  65.62
    Interest        777   97.94  89.61  62.16  73.40
    Ship            545   98.84  92.86  64.20  75.92
    Wheat           482   99.19  90.57  72.73  80.68
    Corn            537   99.44  97.37  71.15  82.22
    Average      892.40   97.75  90.25  71.24  79.32

Table 7. Active Learning - Results for Small Split (20/3299).

    Category         SV     Acc   Prec    Rec     F1
    Earn             54   92.64  86.78  94.35  90.41
    Acquisitions     56   57.76  35.92  97.50  52.50
    Money-fx         56   93.95  48.15  88.82  62.45
    Grain            55   66.42  12.01  93.48  21.29
    Crude            54   57.48  11.83  90.91  20.94
    Trade            55   95.04  43.81  87.61  58.41
    Interest         53   75.99  13.71  87.60  23.71
    Ship             47   70.47   8.62  97.53  15.84
    Wheat            55   97.92  53.61  78.79  63.81
    Corn             56   97.92  45.21  63.46  52.80
    Average       54.10   80.56  35.97  88.01  46.21

Acknowledgments

CISUC - Center of Informatics and Systems of the University of Coimbra and Project POSI/SRI/41234/2001 are gratefully acknowledged for partial financial support.

References

Y. Baram, R. El-Yaniv and K. Luz (2003), Online Choice of Active Learning Algorithms, International Conference on Machine Learning, 2003.

JinHyuk Hong and Sung-Bae Cho (2002), Incremental Support Vector Machine for Unlabeled Data Classification, International Conference on Neural Information Processing (ICONIP), 2002.

Thorsten Joachims (1999), Transductive Inference for Text Classification using Support Vector Machines, International Conference on Machine Learning, 1999.

Svetlana Kiritchenko and Stan Matwin (2001), Email Classification with Co-Training, 2001 Conference of the Centre for Advanced Studies on Collaborative Research.

Ravi Kothari and Vivek Jain (2003), Learning from Labeled and Unlabeled Data Using a Minimal Number of Queries, IEEE Transactions on Neural Networks, Vol. 14, No. 6, November 2003.

B. Liu, Y. Dai, X. Li, W. Lee and P. Yu (2003), Building Text Classifiers Using Positive and Unlabeled Examples, International Conference on Data Mining, 2003.

Andrew McCallum and Kamal Nigam (1998), Employing EM in Pool-Based Active Learning for Text Classification, International Conference on Machine Learning, pp. 350-358, 1998.

Greg Schohn and David Cohn (2000), Less is More: Active Learning with Support Vector Machines, International Conference on Machine Learning, 2000.

B. Scholkopf, C. Burges and A. Smola (1999), Advances in Kernel Methods - Introduction to Support Vector Learning, MIT Press, pp. 1-15, 1999.

F. Sebastiani (1999), A Tutorial on Automated Text Categorisation, in Analia Amandi and Alejandro Zunino (eds.), Proceedings of ASAI-99, 1st Argentinian Symposium on Artificial Intelligence, Buenos Aires, AR, pp. 7-35, 1999.

Matthias Seeger (2001), Learning with Labeled and Unlabeled Data, Technical Report, Institute for Adaptive and Neural Computation, University of Edinburgh, 2001.

Catarina Silva and Bernardete Ribeiro (2003), On the Evaluation of Text Processing in Inductive Categorization, ICMLA - International Conference on Machine Learning Applications, 2003.

Catarina Silva and Bernardete Ribeiro (2004), Labeled and Unlabeled Data in Text Categorization, IEEE International Joint Conference on Neural Networks, 2004.

Vladimir Vapnik (1995), The Nature of Statistical Learning Theory, Springer, 1995.

Sarah Zelikovitz and Haym Hirsh (2001), Improving Text Classification with LSI Using Background Knowledge, Tenth International Conference on Information and Knowledge Management, 2001.
