UNSUPERVISED ZONING OF SCIENTIFIC ARTICLES USING HUFFMAN TREES
Eugene Kagan, Irad Ben-Gal, Nataly Sharkov, Oded Maimon
Dept. of Industrial Engineering, Tel-Aviv University, Ramat-Aviv, Tel-Aviv 69978, Israel
kaganevg@post.tau.ac.il, bengal@eng.tau.ac.il, sharkovn@post.tau.ac.il, maimon@eng.tau.ac.il
The method is based on informational methods of
sample-space partitioning and construction of decision trees,
as used in the framework of search problems [4]. The
decision tree of argumentative zoning is considered as a
model for the constructed tree. The numerical features
associated with the sentences are used in the same manner
as in the methods of statistical segmentation.
ABSTRACT
In this report we propose a new method of unsupervised
zoning based on Huffman coding trees.
The suggested method acts on the level of sentences
and obtains a Huffman tree whose upper part is equal to the
tree created by the method of argumentative zoning.
The proposed method gives a general framework for
unsupervised zoning, and may be straightforwardly
transformed to supervised zoning by mapping the bits
defined by a human annotator into features.
2. BACKGROUND: ARGUMENTATIVE ZONING
Let us start with a description of argumentative zoning,
which provides the basis for the subsequent considerations.
The method of argumentative zoning was developed by
Teufel [8] and applied to scientific articles. It acts on the
level of sentences of the discourse, while zones are
considered as sets of sentences. The scientific discourse is
divided into zones according to the scientific article structure,
which is generally accepted and recognized by the reader. It is
assumed that the discourse structure includes the following
seven zones [8]:
Index Terms: Text-mining, unsupervised zoning,
Huffman coding, symbolic dynamics.
1. INTRODUCTION
In general, text-mining addresses two main problems:
information retrieval, which deals with the search and
finding of documents satisfying the needs of the users, and
information extraction, which deals with the discovery and
extraction of previously unknown information from written
resources [3].
Both problems imply the existence of a corpus of
documents, and an analysis of a single document which is
processed in the corpus context. A main goal of text mining
of a single document is its segmentation into parts according
to predefined criteria.
Statistical methods of text-mining consider certain
statistical characteristics of the texts [1], [6], [9], [10], while
structural methods deal with linguistic structures, in
particular with rhetorical structures. The method of
segmentation using argumentative structures is widely
known as argumentative zoning [7], [8].
In this report, we address the statistical segmentation of
scientific documents, and incorporate it with argumentative
zoning methods.
The suggested method acts on the level of sentences
and obtains a Huffman decision tree (see, e.g., [2]) whose
upper part is equal to the tree created by the known method
of argumentative zoning [8].
1-4244-2482-5/08/$20.00 ©2008 IEEE
BACKGROUND: Generally accepted background knowledge;
OTHER: Specific other work;
OWN: Own work: method, results, future work;
AIM: Specific research goal;
TEXTUAL: Textual section structure;
CONTRAST: Contrast, comparison, weakness of other solution;
BASIS: Other work provides basis for own work.
The sentences are classified into the specified zones
according to certain grammatical argumentative structures.
That is, if a sentence includes a string like "Our method is
based on," then the sentence corresponds to the BASIS
zone. Similarly, if a sentence includes a string like
"However, no method... was (is)," then the sentence is
associated with the CONTRAST zone.
A decision tree for argumentative zoning is shown in
Fig. 1.
The applicative project ZAISA-1 [7] for zoning of
biological papers is based on a different definition of the
zones, for which the Teufel definition [8] is used as a basic
template. In what follows, we refer to the Teufel zones.
Below, we consider the points as features for the
sentences of the discourse and show how to derive the
probabilities by which the Huffman tree is constructed.
Fig. 1. Teufel decision tree for argumentative zoning [8]
Fig. 2. Huffman tree for the points from Table 1
The Teufel decision tree is based on the rhetorical
structure of the sentences. We use this tree as a model of
the decision tree, yet we create it by using statistical
characteristics of the sentences.
4. DESCRIPTION OF ZONING PROCEDURE
Let us provide a general description ofthe zoning procedure.
The key stages of the procedure will be addressed lIDre
precisely in the next sections.
Recall that for each sentence a realization of the features
can correspond to a numerical vector. The form of the vector,
and the sentence characteristics which are represented by
the vector, depend on the corpus type and the zoning goal.
Following the analyzed paper, the probabilities of the
features in the article are determined. These probabilities are
then used by the Huffman procedure.
The stages of the suggested zoning procedure are the
following:
3. STATISTICS FOR TEUFEL DECISION TREE AND
HUFFMAN PROCEDURE
In the suggested method, instead of using predefined
decision criteria, we fix the procedure of tree construction,
and find such statistical values for the sentences that the
chosen procedure gives "automatically" the required
decision tree.
One of the simplest procedures to build an optimal
decision tree (with respect to the average tree length) is the
Huffman coding procedure (see, e.g., [2]). This procedure
acts on a finite set of points with given probabilities. These
points form the leaves of the constructed Huffman tree.
As seen in Fig. 1, the Teufel tree has seven leaves
according to the number of zones defined by the method of
argumentative zoning. A simple observation shows that the
Huffman procedure can build a tree which has the form of the
Teufel tree (see Fig. 1), if the probabilities of the points in the
leaves are defined as given in Table 1 (where the points are
denoted by τi and the probabilities by pi):
1. A features' space is defined as a finite-dimensional
metric space.
2. Each feature is considered to be a bit-vector value.
Each sentence is mapped to a feature realization.
Sentences with the same feature realization are
considered identical.
3. Training stage: a probability (likelihood) estimate is
obtained for each realization of the feature by
implementing the Perron-Frobenius theory. This stage is
processed once for the entire training set of articles.
4. Zoning stage: the probabilities are associated with the
features' realizations. Realizations that do not appear in
the training set are connected to known features by
using functional relations over the metric space.
5. The Huffman construction procedure is applied, and the
created tree is truncated at the appropriate level, which
includes the required number of zones.
Table 1
τi : τ1    τ2    τ3    τ4   τ5   τ6   τ7
pi : 1/24  1/24  1/12  1/6  1/6  1/6  1/3
The implementation of the Huffman procedure to these
points results in the tree shown in Fig. 2. Note that the trees
in Fig. 1 and Fig. 2 have an equal structure, and that the
points τ1, ..., τ7 in the Huffman tree may be correspondingly
associated with the zones of the argumentative zoning
approach.
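To illustrate, the following Python sketch (a minimal illustration, not the authors' implementation) runs the Huffman procedure on the Table 1 probabilities and reports the resulting leaf depths; the seven leaves form a full binary tree whose minimal average code length is 31/12.

```python
import heapq
from fractions import Fraction

def huffman_depths(probs):
    """Build a Huffman tree over the given probabilities and return the
    depth of each leaf (its code length), in input order."""
    # Heap items: (subtree probability, tie-breaker, [(leaf index, depth)]).
    heap = [(p, i, [(i, 0)]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    counter = len(probs)
    while len(heap) > 1:
        p1, _, leaves1 = heapq.heappop(heap)
        p2, _, leaves2 = heapq.heappop(heap)
        # Merging two subtrees pushes all their leaves one level down.
        merged = [(i, d + 1) for i, d in leaves1 + leaves2]
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    depths = [0] * len(probs)
    for i, d in heap[0][2]:
        depths[i] = d
    return depths

# Table 1 probabilities for the points tau_1, ..., tau_7.
table1 = [Fraction(1, 24), Fraction(1, 24), Fraction(1, 12),
          Fraction(1, 6), Fraction(1, 6), Fraction(1, 6), Fraction(1, 3)]
depths = huffman_depths(table1)
avg = sum(p * d for p, d in zip(table1, depths))
print(depths, avg)  # minimal average code length is 31/12
```

The depths sum to one under the Kraft equality, confirming that the seven points occupy the leaves of a full binary tree, as required for the Teufel-shaped tree.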
The obtained Huffman decision tree is used for the
zoning process in the same way as a Teufel decision tree.
Notice that the obtained Huffman tree and the Teufel tree
are applicable to scientific papers, and moreover to papers
of the same corpus on which the trees were constructed. The
proposed method enables an automated, computerized
construction of the zoning process.
In the proposed method, a feature is represented by a
random vector, which may include both numerical and
non-numerical values [6], [8]. A numerical feature is defined
as a vector f = (f1, f2, ..., fn), whose elements can be
defined, e.g., as follows:

fi = 1, if the sentence includes a given string; 0, otherwise,

or by similar definitions.
To obtain the same tree as in the argumentative zoning,
we assume that the feature is represented by a binary vector,
whose elements are the answers to the questions presented
by the Teufel decision tree (see Fig. 1), with the values "0"
and "1" representing "no" and "yes" correspondingly. In
addition to the questions that are represented by the Teufel
tree, the binary values can also represent answers to
questions regarding the location of the sentence, e.g.:

fn-1 = 1, if the sentence is the first in the paragraph; 0, otherwise.
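Such a binary feature vector can be sketched as follows; the cue strings and the bit layout here are illustrative assumptions for a short example, not the actual cues of [8]:

```python
def feature_realization(sentence, first_in_paragraph, last_in_paragraph):
    """Map a sentence to a binary feature realization (a bit string).
    The cue strings below are illustrative placeholders."""
    text = sentence.lower()
    bits = [
        int("our method is based on" in text),  # cue for the BASIS zone
        int(text.startswith("however")),        # cue for the CONTRAST zone
        int("we propose" in text),              # cue for the AIM zone
        int(first_in_paragraph),                # positional bit f_{n-1}
        int(last_in_paragraph),                 # positional bit f_n
    ]
    return "".join(str(b) for b in bits)

r = feature_realization("Our method is based on Huffman coding.", True, False)
print(r)  # "10010"
```

Sentences mapping to the same bit string are treated as identical realizations, as required by stage 2 of the zoning procedure.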
Table 2. Correspondence between the feature values, the
ZAISA-1 [7] categories, and the Teufel [8] zones:

f1 : The sentence is the first in the paragraph
f2 : The sentence is the last in the paragraph
Difference : CONTRAST
Connection : BASIS
Else : OWN
Implication : OWN
Insight : OWN
Method : OWN
Outline : -
Problem-setting : AIM
Result : OWN
Background : BACKGROUND
6. PROBABILITIES ESTIMATION
As mentioned above, the values of the feature are
considered as realizations of a random vector f. To apply
the Huffman procedure, we need to relate the probabilities
(likelihoods) to these realizations.
In the simplest case, the number of different realizations
of the feature is equal to the number of zones. Then the
probabilities related to the feature realizations obtain the
values given in Table 1.
Nevertheless, if the number of different realizations of a
feature is greater than the number of zones, then the
probabilities that are obtained by a straightforward
calculation (MLE) in a given discourse differ from the ones
given in Table 1. To estimate the probabilities in this case,
we apply the Perron-Frobenius theory. This theory is widely
used in symbolic dynamics [5] and is applicable to the
analysis of search trees [4].
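In the simplest case, the straightforward MLE calculation amounts to counting realizations; a minimal Python sketch (with hypothetical realization strings) is:

```python
from collections import Counter

def mle_probabilities(realizations):
    """Straightforward (maximum-likelihood) estimate of the probability
    of each feature realization in a discourse: its relative frequency
    among all sentences."""
    counts = Counter(realizations)
    total = len(realizations)
    return {r: c / total for r, c in counts.items()}

# Hypothetical realizations for an eight-sentence discourse.
obs = ["100", "100", "100", "100", "010", "010", "001", "110"]
probs = mle_probabilities(obs)
print(probs)  # {'100': 0.5, '010': 0.25, '001': 0.125, '110': 0.125}
```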
Consider the required tree as a graph G = (V, E), where V
Accordingly, the OWN zone can be divided into five
subzones, as done in the ZAISA-1 [7] project.
An example of the correspondence between the sentences
of a paper and the feature realizations is shown in Fig. 3.
is a set of vertexes and E is a set of edges. Denote the
adjacency matrix of the graph G by G. Let λ be a Perron
eigenvalue of the matrix G, that is, an eigenvalue such that
λ ≥ |μ|, where μ is any other eigenvalue of G. Then the
matrix P of transition probabilities pij between the vertexes
of the graph G is defined as follows [5]:

pij = (G)ij rj / (λ ri),
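This construction can be sketched in Python with NumPy; the 7-vertex tree below is an illustrative stand-in for the paper's 13 × 13 matrix, and the symmetry of the adjacency matrix makes the left and right Perron eigenvectors coincide, as in the numerical example:

```python
import numpy as np

# Symmetric adjacency matrix of a small full binary tree (7 vertexes),
# used as a stand-in for the paper's 13 x 13 matrix.
G = np.zeros((7, 7))
for parent, child in [(0, 1), (0, 2), (1, 3), (1, 4), (2, 5), (2, 6)]:
    G[parent, child] = G[child, parent] = 1.0

# Perron eigenvalue (largest eigenvalue) and its positive eigenvector.
eigvals, eigvecs = np.linalg.eig(G)
k = int(np.argmax(eigvals.real))
lam = float(eigvals[k].real)
r = np.abs(eigvecs[:, k].real)

# Transition probabilities p_ij = (G)_ij * r_j / (lam * r_i)  [5].
P = G * r[None, :] / (lam * r[:, None])

print(P.sum(axis=1))           # each row sums to 1
pi = r * r / np.sum(r * r)     # vertex probabilities p_i = l_i * r_i
print(np.allclose(pi @ P, pi)) # pi is stationary for P
```

Row-stochasticity follows directly from the eigenvector equation G r = λ r, and the probabilities pi = li ri form the stationary distribution of P.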
Fig. 3. Example of correspondence between the sentences
and the feature realizations
fn = 1, if the sentence is the last in the paragraph; 0, otherwise.

Thus, for example, if both f1 = 1 and f2 = 1, then the
sentence is related to the TEXTUAL zone.
Similarly, one can define fi = 1, if the sentence is in a given
position in the paragraph; 0, otherwise.
5. FEATURE REALIZATIONS

The feature vector f = (f1, f2, ..., fn) in Fig. 3 represents
twelve values, whose meaning is shown, for example, in
Table 2.
where ri and rj are elements of the right eigenvector of the
Perron eigenvalue λ. In general, the values of the
eigenvector r have to be normalized so that Σi li ri = 1,
where the li are elements of the left eigenvector of the
Perron eigenvalue λ. Following the normalization, the
probabilities of the vertexes satisfy pi = li ri.

8. CONCLUSION
In this report we considered the relation between the Teufel
decision tree, as used in argumentative zoning, and the tree
created by the Huffman procedure based on the statistical
characteristics of the discourse.
We showed that the Teufel tree may be obtained by the
Huffman coding procedure, and suggested a method of
unsupervised zoning based on this observation.
In addition, we suggested a method for estimating the
feature probabilities in the case where the number of
different feature realizations is not equal to the number of
zones. All the proposed methods are preliminary and can
benefit from additional analysis and research work.
Defining the paths from the root to the leaves in the tree
by the transition matrix P, one can obtain the probabilities
that correspond to the leaves. A normalization of these
probabilities results in the required estimation.
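This path-product computation can be sketched as follows, again on an illustrative stand-in tree rather than the tree of Fig. 2:

```python
import numpy as np

# Stand-in tree: root 0 with children lists; vertexes without children
# are the leaves. This is an illustration, not the paper's tree.
children = {0: [1, 2], 1: [3, 4], 2: [5, 6]}
G = np.zeros((7, 7))
for v, cs in children.items():
    for c in cs:
        G[v, c] = G[c, v] = 1.0

# Transition matrix from the Perron eigenvector, as in Section 6.
eigvals, eigvecs = np.linalg.eig(G)
k = int(np.argmax(eigvals.real))
lam = float(eigvals[k].real)
r = np.abs(eigvecs[:, k].real)
P = G * r[None, :] / (lam * r[:, None])

def leaf_probabilities(P, children, root=0):
    """Leaf probability = product of transition probabilities along the
    path from the root to the leaf, normalized over all leaves."""
    raw = {}
    stack = [(root, 1.0)]
    while stack:
        v, p = stack.pop()
        if v not in children:           # v is a leaf
            raw[v] = p
        else:
            for c in children[v]:
                stack.append((c, p * P[v, c]))
    total = sum(raw.values())
    return {v: p / total for v, p in raw.items()}

probs = leaf_probabilities(P, children)
print(probs)  # this symmetric example gives 0.25 for each of the four leaves
```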
7. NUMERICAL EXAMPLE
Let us introduce an example of probability estimation. We
calculate the probability estimates for the Teufel tree and
compare them to the results based on the probabilities given
in Table 1.
For the tree shown in Fig. 2 the adjacency matrix is a
13 × 13 matrix, which has the following form:
G = [13 × 13 zero-one adjacency matrix of the tree shown in Fig. 2]
For this matrix, the Perron eigenvalue is λ = 2.23,
whose left and right eigenvectors are equal.
The transition matrix, which is calculated according to
the above formula, has the following form:

P = [13 × 13 transition matrix with nonzero entries such as
0.44, 0.56, 0.46, 0.34, 0.36, 0.20 and 1.00]

The normalized probabilities that are calculated by the
paths from the root to the leaves according to the matrix P
are given in Table 3.

Table 3
τi : τ1    τ2    τ3    τ4    τ5    τ6    τ7
pi : 0.05  0.05  0.09  0.09  0.14  0.25  0.33

Comparing Table 3 and Table 1, one can see that, in
most cases, the estimated values are close to the
probabilities required for the creation of the Teufel tree by
the Huffman coding procedure.
Nonetheless, the Huffman tree that is created based on
the Table 3 probabilities differs from the Teufel tree. This
inconsistency may be corrected by introducing additional
values into the feature or by a human annotator.

9. REFERENCES

[1] D. Beeferman, A. Berger, J. Lafferty. Statistical Models for
Text Segmentation. Machine Learning, 34 (1-3), pp. 177-210,
1999.
[2] T. M. Cover, J. A. Thomas. Elements of Information Theory.
John Wiley & Sons, New York, 1991.
[3] M. A. Hearst. What is Text Mining? Online publ.:
http://www.sims.berkeley.edu/~hearst, 2003.
[4] E. Kagan, I. Ben-Gal. Symbolic Dynamics Model of
Informational Moving Target Search Problem. Proc. 15th Israeli
Conf. IE&M'08, 2008.
[5] B. P. Kitchens. Symbolic Dynamics: One-sided, Two-sided
and Countable State Markov Shifts. Springer-Verlag, Berlin, 1998.
[6] A. McCallum, D. Freitag, F. Pereira. Maximum Entropy
Markov Models for Information Extraction and Segmentation.
Proc. ICML'00, 2000.
[7] Y. Mizuta, N. Collier. Zone Identification in Biology Articles
as a Basis for Information Extraction. Proc. JNLPBA'04, 2004.
[8] S. Teufel. Argumentative Zoning: Information Extraction
from Scientific Text. PhD Thesis. University of Edinburgh, U.K.,
1999.
[9] J. P. Yamron, I. Carp, L. Gillick, S. Lowe, P. van Mulbregt.
A Hidden Markov Model Approach to Text Segmentation and
Event Tracking. Proc. ICASSP'98, 1998.
[10] M. Utiyama, H. Isahara. A Statistical Model for Domain-
Independent Text Segmentation. Proc. 39th Ann. Meeting ACL'01,
pp. 499-506, 2001.