
Unsupervised zoning of scientific articles using Huffman trees

2008, 2008 IEEE 25th Convention of Electrical and Electronics Engineers in Israel

UNSUPERVISED ZONING OF SCIENTIFIC ARTICLES USING HUFFMAN TREES

Eugene Kagan, Irad Ben-Gal, Nataly Sharkov, Oded Maimon

Dept. of Industrial Engineering, Tel-Aviv University, Ramat-Aviv, Tel-Aviv 69978, Israel
kaganevg@post.tau.ac.il, bengal@eng.tau.ac.il, sharkovn@post.tau.ac.il, maimon@eng.tau.ac.il

1-4244-2482-5/08/$20.00 ©2008 IEEE

ABSTRACT

In this report we propose a new method of unsupervised zoning based on Huffman coding trees. The suggested method acts on the level of sentences and obtains a Huffman tree whose upper part is equal to the tree created by the method of argumentative zoning. The proposed method gives a general framework for unsupervised zoning, and may be straightforwardly transformed to supervised zoning by mapping the bits defined by a human annotator into features.

Index Terms - Text mining, unsupervised zoning, Huffman coding, symbolic dynamics.

1. INTRODUCTION

In general, text mining addresses two main problems: information retrieval, which deals with searching for and finding documents that satisfy the needs of the users, and information extraction, which deals with the discovery and extraction of previously unknown information from written resources [3]. Both problems imply the existence of a corpus of documents, and an analysis of a single document that is processed in the context of the corpus.

A main goal of text mining of a single document is its segmentation into parts according to predefined criteria. Statistical methods of text mining consider certain statistical characteristics of the texts [1], [6], [9], [10], while structural methods deal with linguistic structures, in particular with rhetorical structures. Segmentation by argumentative structures is widely known as argumentative zoning [7], [8].

In this report, we address the statistical segmentation of scientific documents and combine it with argumentative zoning methods. The suggested method acts on the level of sentences and obtains a Huffman decision tree (see e.g. [2]) whose upper part is equal to the tree created by the known method of argumentative zoning [8].

The method is based on informational methods of sample space partitioning and construction of decision trees, as used in the framework of search problems [4]. The decision trees of argumentative zoning are considered as a model of the trees. The numerical features associated with the sentences are used in the same manner as in the methods of statistical segmentation.

2. BACKGROUND: ARGUMENTATIVE ZONING

Let us start with a description of argumentative zoning, which provides the basis for the considerations that follow. The method of argumentative zoning was developed by Teufel [8] and applied to scientific articles. It acts on the level of sentences of the discourse, while zones are considered as sets of sentences. The scientific discourse is divided into zones according to the structure of a scientific article, which is generally accepted and expected by the reader. It is assumed that the discourse structure includes the following seven zones [8]:

BACKGROUND   Generally accepted background knowledge
OTHER        Specific other work
OWN          Own work: method, results, future work
AIM          Specific research goal
TEXTUAL      Textual section structure
CONTRAST     Contrast, comparison, weakness of other solution
BASIS        Other work provides basis for own work

The sentences are classified into the specified zones according to certain grammatical argumentative structures. That is, if the sentence includes a string like "Our method is based on," then the sentence corresponds to the BASIS zone. Similarly, if the sentence includes a string like "However, no method... was (is)," then the sentence is associated with the CONTRAST zone. A decision tree for argumentative zoning is shown in Fig. 1.
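The cue-phrase rule described above can be illustrated with a minimal Python sketch. It is not Teufel's classifier: the table `CUE_PATTERNS` and the function `zone_by_cue` are hypothetical names, and the table holds only the two cue phrases quoted in the text, so any sentence without one of them is left unclassified.

```python
import re

# Hypothetical cue table: only the two cue phrases quoted in the text are
# included; a real system would need a much larger, curated list.
CUE_PATTERNS = [
    ("BASIS", re.compile(r"\bour method is based on\b", re.IGNORECASE)),
    ("CONTRAST", re.compile(r"\bhowever, no method\b.*\b(?:was|is)\b", re.IGNORECASE)),
]


def zone_by_cue(sentence):
    """Return the zone of the first matching cue phrase, or None if no cue matches."""
    for zone, pattern in CUE_PATTERNS:
        if pattern.search(sentence):
            return zone
    return None
```

A call such as `zone_by_cue("Our method is based on Huffman coding.")` yields the BASIS zone, while a sentence with no listed cue yields `None` and would have to be handled by the remaining questions of the decision tree.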
The applied project ZAISA-1 [7] for zoning of biological papers is based on a different definition of the zones, for which the Teufel definition [8] is used as a basic template. In what follows, we refer to the Teufel zones.

Fig. 1. Teufel decision tree for argumentative zoning [8]. (Legible decision nodes: refers to own work; describes aim; describes background; refers to external; negates external; mentions basis.)

The Teufel decision tree is based on the rhetorical structure of the sentences. We use this tree as a model of the decision tree, yet we create it by using statistical characteristics of the sentences.

3. STATISTICS FOR TEUFEL DECISION TREE AND HUFFMAN PROCEDURE

In the suggested method, instead of using predefined decision criteria, we fix the procedure of tree construction and find statistical values for the sentences such that the chosen procedure yields the required decision tree "automatically."

One of the simplest procedures for building an optimal decision tree (with respect to the average path length of the tree) is the Huffman coding procedure (see e.g. [2]). This procedure acts on a finite set of points with given probabilities. These points form the leaves of the constructed Huffman tree. As seen in Fig. 1, the Teufel tree has seven leaves, according to the number of zones defined by the method of argumentative zoning. A simple observation shows that the Huffman procedure can build a tree that has the form of the Teufel tree (see Fig. 1) if the probabilities of the points in the leaves are defined as given in Table 1 (where the points are denoted by $f^i$ and the probabilities by $p_i$):

Table 1
$f^i$   $f^1$   $f^2$   $f^3$   $f^4$   $f^5$   $f^6$   $f^7$
$p_i$   1/24    1/24    1/12    1/6     1/6     1/6     1/3

Applying the Huffman procedure to these points results in the tree shown in Fig. 2. Note that the trees in Fig. 1 and Fig. 2 have the same structure, and that the points $f^1, \dots, f^7$ in the Huffman tree may be correspondingly associated with the zones of the argumentative zoning approach.

Fig. 2. Huffman tree for the points from Table 1.

In what follows, we consider the points as features of the sentences of the discourse and show how to derive the probabilities by which the Huffman tree is constructed.

4. DESCRIPTION OF ZONING PROCEDURE

Let us provide a general description of the zoning procedure. The key stages of the procedure are addressed more precisely in the next sections.

Recall that for each sentence a realization of the feature $f$ can correspond to a numerical vector. The form of the vector and the sentence characteristics that are represented by the vector depend on the corpus type and the zoning goal. For the analyzed paper, the probabilities of the feature $f$ in the article are determined. These probabilities are then used by the Huffman procedure.

The stages of the suggested zoning procedure are the following:

1. A feature space is defined as a finite-dimensional metric space.
2. Each feature is considered to be a bit-vector value. Each sentence is mapped to a feature realization. Sentences with the same feature realization are considered identical.
3. Training stage: a probability (likelihood) estimate is obtained for each realization of the feature by applying the Perron-Frobenius theory. This stage is performed once for the entire training set of articles.
4. Zoning stage: the probabilities are associated with the feature realizations. Realizations that do not appear in the training set are connected to known features by using functional relations over the metric space.
5. The Huffman construction procedure is applied, and the created tree is truncated at the appropriate level, which includes the required number of zones.
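The Huffman step (the construction of Section 3, used again in stage 5) can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation: the function name `huffman_code_lengths` and the labels `f1` to `f7` are assumptions, and the probabilities are the ones listed in Table 1. Exact fractions are used so that ties between equal probabilities are not blurred by rounding.

```python
import heapq
from fractions import Fraction
from itertools import count


def huffman_code_lengths(probs):
    """Return {symbol: leaf depth} for a Huffman tree built over `probs`."""
    tie = count()  # tie-breaker so the heap never has to compare the lists
    heap = [(p, next(tie), [sym]) for sym, p in probs.items()]
    heapq.heapify(heap)
    depth = {sym: 0 for sym in probs}
    while len(heap) > 1:
        # Merge the two least probable subtrees.
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for sym in s1 + s2:
            depth[sym] += 1  # every leaf under the new node moves one level down
        heapq.heappush(heap, (p1 + p2, next(tie), s1 + s2))
    return depth


# Probabilities from Table 1 (the labels f1..f7 are illustrative).
TABLE_1 = {
    "f1": Fraction(1, 24), "f2": Fraction(1, 24), "f3": Fraction(1, 12),
    "f4": Fraction(1, 6), "f5": Fraction(1, 6), "f6": Fraction(1, 6),
    "f7": Fraction(1, 3),
}

lengths = huffman_code_lengths(TABLE_1)
average_length = sum(TABLE_1[s] * lengths[s] for s in TABLE_1)
print(lengths, average_length)
```

Because Table 1 contains several equal probabilities, the shape of the resulting tree depends on how ties are broken; the order used in this sketch is only one possibility and need not reproduce the shape of Fig. 2. The average leaf depth is the same for every tie-breaking order and equals 31/12 for these probabilities.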
The obtained Huffman decision tree is used for the zoning process in the same way as a Teufel decision tree. Notice that the obtained Huffman tree and the Teufel tree are applicable to scientific papers, and moreover to papers of the same corpus on which the trees were constructed. The proposed method enables an "automated" computerized construction of the zoning process.

5. FEATURE REALIZATIONS

In the proposed method, a feature is represented by a random vector, which may include both numerical and non-numerical values [6], [8]. A numerical feature is defined as a vector $f = (f_1, f_2, \dots, f_n)$, whose elements can be defined, e.g., as follows:

$$f_i = \begin{cases} 1, & \text{the sentence includes a given string}, \\ 0, & \text{otherwise}, \end{cases} \qquad f_j = \begin{cases} 1, & \text{the sentence is in a given position in the paragraph}, \\ 0, & \text{otherwise}, \end{cases}$$

or by similar definitions.

To obtain the same tree as in argumentative zoning, we assume that the feature is represented by a binary vector whose elements are the answers to the questions presented by the Teufel decision tree (see Fig. 1), with the values "0" and "1" representing "no" and "yes" respectively. In addition to the questions that are represented by the Teufel tree, the binary values can also represent answers to questions regarding the location of the sentence, e.g.:

$$f_{n-1} = \begin{cases} 1, & \text{the sentence is the first in the paragraph}, \\ 0, & \text{otherwise}, \end{cases} \qquad f_n = \begin{cases} 1, & \text{the sentence is the last in the paragraph}, \\ 0, & \text{otherwise}. \end{cases}$$

An example of the correspondence between the sentences of the paper and the feature realizations is shown in Fig. 3.

Fig. 3. Example of correspondence between the sentences and the feature realizations. (Columns: sentence number, sentence text, feature realization as a binary string.)

The feature vector $f$ in Fig. 3 represents twelve values whose meaning is shown, for example, in Table 2.

Table 2 (the feature index column is only partly legible in the source)
Teufel [8]     ZAISA-1 [7]
BACKGROUND     Background
BASIS          Connection
CONTRAST       Difference
OWN            Else
OWN            Implication
OWN            Insight
OWN            Method
-              Outline
AIM            Problem-setting
OWN            Result
Location features listed under BACKGROUND (1.0):
1.1   The sentence is the first in the paragraph
1.2   The sentence is the last in the paragraph

Thus, for example, if both $f_{1.1} = 1$ and $f_{1.2} = 1$, then the sentence is related to the TEXTUAL zone. Accordingly, the OWN zone can be divided into five subzones, as done in the ZAISA-1 [7] project.

6. PROBABILITIES ESTIMATION

As mentioned above, the values of the feature are considered as realizations of a random vector $f$. To apply the Huffman procedure we need to relate probabilities (likelihoods) to these realizations.

In the simplest case, the number of different realizations of the feature is equal to the number of zones. Then the probabilities related to the feature realizations take the values given in Table 1.

Nevertheless, if the number of different realizations of a feature is greater than the number of zones, then the probabilities obtained by straightforward calculation (MLE) in a given discourse differ from the ones given in Table 1. To estimate the probabilities in this case, we apply the Perron-Frobenius theory. This theory is widely used in symbolic dynamics [5] and is applicable to the analysis of search trees [4].

Consider the required tree as a graph $G = (V, E)$, where $V$ is a set of vertices and $E$ is a set of edges. Denote the adjacency matrix of the graph $G$ by $\mathbf{G}$. Let $\lambda$ be a Perron eigenvalue of the matrix $\mathbf{G}$, that is, an eigenvalue such that $\lambda > |\mu|$, where $\mu$ is any other eigenvalue of $\mathbf{G}$. Then the matrix $\mathbf{P}$ of transition probabilities $p_{ij}$ between the vertices of the graph $G$ is defined as follows [5]:

$$p_{ij} = \frac{(\mathbf{G})_{ij}\, r_j}{\lambda\, r_i},$$

where $r_i$ and $r_j$ are elements of the right eigenvector $r$ of the eigenvalue $\lambda$. In general, the values of the Perron eigenvector $r$ have to be normalized so that $\sum_i l_i r_i = 1$, where $l_i$ are elements of the left eigenvector of the Perron eigenvalue $\lambda$. Following the normalization, the probabilities of the vertices satisfy $p_i = l_i r_i$.

Defining the paths from the root to the leaves in the tree by the transition matrix $\mathbf{P}$, one can obtain the probabilities that correspond to the leaves. A normalization of these probabilities results in the required estimation.

7. NUMERICAL EXAMPLE

Let us introduce an example of probability estimation. We calculate the probability estimates for the Teufel tree and compare them to the results that are based on the probabilities given in Table 1.

For the tree shown in Fig. 2 the adjacency matrix $\mathbf{G}$ is a 13 x 13 matrix.

[13 x 13 adjacency matrix $\mathbf{G}$; entries not legible in the source]

For this matrix, the Perron eigenvalue is $\lambda = 2.23$, whose left and right eigenvectors are equal. The transition matrix $\mathbf{P}$ is calculated according to the above formula.

[13 x 13 transition matrix $\mathbf{P}$; entries not legible in the source]

The normalized probabilities that are calculated by the paths from the root to the leaves according to the matrix $\mathbf{P}$ are given in Table 3.

Table 3
$f^i$   $f^1$   $f^2$   $f^3$   $f^4$   $f^5$   $f^6$   $f^7$
$p_i$   0.05    0.05    0.09    0.09    0.14    0.25    0.33

Comparing Table 3 and Table 1, one can see that, in most cases, the estimated values are close to the probabilities required for the creation of the Teufel tree by the Huffman coding procedure. Nonetheless, the Huffman tree that is created from the Table 3 probabilities differs from the Teufel tree. This inconsistency may be corrected by introducing additional values into the feature or by a human annotator.

8. CONCLUSION

In this report we considered the relation between the Teufel decision tree, as used in argumentative zoning, and the tree created by the Huffman procedure, which is based on the statistical characteristics of the discourse. We showed that the Teufel tree may be obtained by the Huffman coding procedure, and suggested a method of unsupervised zoning based on this observation. In addition, we suggested a method for estimating the feature probabilities in the case where the number of different feature realizations is not equal to the number of zones. All the proposed methods are preliminary and can benefit from additional analysis and research work.

9. REFERENCES

[1] D. Beeferman, A. Berger, J. Lafferty. Statistical Models for Text Segmentation. Machine Learning, 34 (1-3), pp. 177-210, 1999.

[2] T. M. Cover, J. A. Thomas. Elements of Information Theory. John Wiley & Sons, New York, etc., 1991.

[3] M. A. Hearst. What is Text Mining? http://www.sims.berkeley.edu/~hearst, 2003.

[4] E. Kagan, I. Ben-Gal. Symbolic Dynamics Model of Informational Moving Target Search Problem. Proc. 15-th Israeli Conf. IE&M'08, 2008.

[5] B. P. Kitchens. Symbolic Dynamics: One-sided, Two-sided and Countable State Markov Shifts. Springer-Verlag, Berlin, 1998.

[6] A. McCallum, D. Freitag, F. Pereira. Maximum Entropy Markov Models for Information Extraction and Segmentation. Proc. ICML'00, 2000.

[7] Y. Mizuta, N. Collier. Zone Identification in Biology Articles as a Basis for Information Extraction. Proc. JNLPBA'04, 2004.

[8] S. Teufel. Argumentative Zoning: Information Extraction from Scientific Text. PhD Thesis. University of Edinburgh, U.K., 1999.

[9] J. P. Yamron, I. Carp, L. Gillick, S. Lowe, P. van Mulbregt. A Hidden Markov Model Approach to Text Segmentation and Event Tracking. Proc. ICASSP'98, 1998.

[10] M. Utiyama, H. Isahara. A Statistical Model for Domain-Independent Text Segmentation. Proc. 39th Ann. Meeting ACL'01, pp. 499-506, 2001.
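As a supplementary illustration of the estimate in Section 6, the following Python sketch computes the Perron eigenvalue, the transition matrix $p_{ij} = (\mathbf{G})_{ij} r_j / (\lambda r_i)$ and the vertex probabilities $p_i = l_i r_i$ for a small graph. Several points are assumptions made only for this sketch: the three-vertex path graph `PATH` is a toy example and not the 13 x 13 matrix of Section 7, the function names are hypothetical, and the eigenvector is found by power iteration on $\mathbf{G} + I$, a numerical convenience that keeps the iteration from oscillating on bipartite graphs and is not taken from the paper.

```python
def perron(G, iters=500):
    """Perron eigenvalue and right eigenvector of a nonnegative irreducible matrix G."""
    n = len(G)
    r = [1.0] * n
    for _ in range(iters):
        # One power-iteration step on G + I (same eigenvectors as G).
        s = [r[i] + sum(G[i][j] * r[j] for j in range(n)) for i in range(n)]
        top = max(s)
        r = [x / top for x in s]
    # At convergence G r = lam * r, so the ratio of the sums recovers lam.
    Gr = [sum(G[i][j] * r[j] for j in range(n)) for i in range(n)]
    lam = sum(Gr) / sum(r)
    return lam, r


def transition_matrix(G):
    """Transition probabilities p_ij = G_ij * r_j / (lam * r_i)."""
    lam, r = perron(G)
    n = len(G)
    return [[G[i][j] * r[j] / (lam * r[i]) for j in range(n)] for i in range(n)]


def vertex_probabilities(G):
    """Vertex probabilities p_i = l_i * r_i, normalized so that they sum to 1."""
    Gt = [list(col) for col in zip(*G)]  # left eigenvector of G = right eigenvector of G^T
    _, r = perron(G)
    _, l = perron(Gt)
    total = sum(li * ri for li, ri in zip(l, r))
    return [li * ri / total for li, ri in zip(l, r)]


# Toy example (an assumption of this sketch): undirected path on three vertices.
PATH = [[0, 1, 0],
        [1, 0, 1],
        [0, 1, 0]]

lam, r = perron(PATH)
P = transition_matrix(PATH)
p = vertex_probabilities(PATH)
print(lam, P, p)
```

For this toy graph each row of `P` sums to one and `p` is stationary under `P`, which is the property the normalization $\sum_i l_i r_i = 1$ is meant to guarantee. Leaf probabilities for a tree would then be obtained by multiplying the entries of `P` along each root-to-leaf path and renormalizing, as described in Section 6.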