
Investigating Unstructured Texts with Latent Semantic Analysis

2006

Fridolin Wild, Christina Stahl

Institute for Information Systems and New Media, Vienna University of Economics and Business Administration, Augasse 2-6, A-1090 Vienna, Austria, {firstname.surname}@wu-wien.ac.at

Abstract. Latent semantic analysis (LSA) is an algorithm applied to approximate the meaning of texts, thereby exposing semantic structure to computation. LSA combines the classical vector-space model, well known in computational linguistics, with a singular value decomposition (SVD), a two-mode factor analysis. Thus, bag-of-words representations of texts can be mapped into a modified vector space that is assumed to reflect semantic structure. In this contribution the authors describe the lsa package for the statistical language and environment R and illustrate its proper use through examples from the areas of automated essay scoring and knowledge representation.

1 Introduction to Latent Semantic Analysis

Derived from latent semantic indexing, LSA is intended to enable the analysis of the semantic structure of texts. The basic idea behind LSA is that the collocation of terms of a given document-term vector space reflects a higher-order, latent semantic, structure which is obscured by word usage (e.g., by synonyms or ambiguities). By using conceptual indices that are derived statistically via a truncated singular value decomposition, this variability problem is believed to be overcome (Deerwester et al. (1990)).

In a typical LSA process, first a document-term matrix M is constructed from a given text base of n documents containing m terms. The term 'textmatrix' will be used throughout the rest of this contribution to denote this type of document-term matrix. This textmatrix M of the size m x n is then resolved by the singular value decomposition into the term-vector matrix T (constituting the left singular vectors) and the document-vector matrix D (constituting the right singular vectors), both being orthonormal, and the diagonal matrix S. These matrices are then reduced to a particular number of dimensions k, giving the truncated matrices T_k, S_k and D_k, the latent semantic space. Multiplying the truncated matrices results in a new matrix M_k = T_k S_k D_k^T which is the least-squares best fit approximation of M with k singular values. M_k is of the same format as M, i.e., rows represent the same terms, columns the same documents.

To keep additional documents from influencing a previously calculated semantic space, or to simply re-use the structure contained in an already existing factor distribution, new documents can be folded in after the singular value decomposition. For this purpose, the add-on documents can be added to the pre-existing latent semantic space by mapping them into the existing factor structure. Moreover, folding-in is computationally a lot less costly, as no singular value decomposition is needed. To fold in, a pseudo-document vector m̂ needs to be calculated in three steps (Berry et al. (1995)): after constructing a document vector v from the additional documents containing the term frequencies in the exact order constituted by the input textmatrix M, v can be mapped into the latent semantic space by applying (1) and (2):

\hat{d} = v^T T_k S_k^{-1}    (1)

\hat{m} = T_k S_k \hat{d}    (2)

Thereby, T_k and S_k are the truncated matrices from the previously calculated latent semantic space. The resulting vector d̂ of Equation (1) represents an additional column of D_k. The resulting pseudo-document vector m̂ from Equation (2) is identical to an additional column in the textmatrix representation of the latent semantic space.
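To make the fold-in computation tangible, the following minimal sketch reproduces Equations (1) and (2) in base R; the toy textmatrix and the new document vector are invented for illustration. In practice, the fold_in() routine of the lsa package presented in Section 3 performs this step.

# minimal sketch of Equations (1) and (2) in base R; matrix and
# document vector are invented toy values
M    <- matrix(c(1,0,2, 0,1,1, 1,1,0, 0,2,1), nrow=4, byrow=TRUE) # 4 terms x 3 docs
svdM <- svd(M)
k    <- 2                          # number of retained dimensions
Tk   <- svdM$u[, 1:k]              # truncated term-vector matrix T_k
Sk   <- svdM$d[1:k]                # truncated singular values S_k
v    <- c(1, 0, 1, 1)              # new document, term order as in M
d_hat <- t(v) %*% Tk %*% diag(1/Sk)   # Equation (1): additional column of D_k
m_hat <- Tk %*% diag(Sk) %*% t(d_hat) # Equation (2): additional textmatrix column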
2 Influencing Parameters

Several classes of adjustment parameters can be functionally differentiated in the latent semantic analysis process. Every class introduces new parameter settings that drive the effectiveness of the algorithm. The following classes have been identified so far by Wild et al. (2005): textbase compilation and selection, preprocessing methods, weighting schemes, choice of dimensionality, and similarity measurement techniques (see Figure 1).

Different texts create a different factor distribution. Moreover, texts may be split into components such as sentences, paragraphs, chapters, bags-of-words of a fixed size, or even into context bags around certain keywords. The document collection available may be filtered according to specific criteria such as novelty, or reduced to a random sample, so that only a subset of the existing documents will actually be used in the latent semantic analysis. The textbase compilation and selection options form one class of parameters.

Document preprocessing comprises several operations performed on the input texts such as lexical analysis, stop-word filtering, reduction to word stems, filtering of keywords above or below certain frequency thresholds, and the use of controlled vocabularies (Baeza-Yates and Ribeiro-Neto (1999)).

Fig. 1. Parameter classes influencing the algorithm effectiveness: textbase compilation and selection (documents, chapters, paragraphs, sentences, context bags, number of docs); preprocessing (stemming, stopword filtering, global or local frequency bandwidth channel, controlled vocabulary, raw); weighting (local weights: none/raw, binary tf, log tf; global weights: none/raw, normalisation, idf, 1+entropy); dimensionality (singular values k: coverage = 0.3, 0.4, 0.5; coverage >= ndocs; 1/30; 1/50; magic 10; none = vector model); similarity measurement (method: best hit, mean of best; correlation measure: Pearson, Spearman, cosine).

Weighting schemes have been shown to significantly influence the effectiveness of LSA (Wild et al. (2005)). Weighting schemes in general can be differentiated into local (lw) and global (gw) weighting schemes, which may be combined as follows:

\check{m} = lw(m) \cdot gw(m)    (3)

Local schemes only take into account term frequencies within a particular document, whereas global weighting schemes relate term frequencies to the frequency distribution in the whole document collection. Weighting schemes are needed to change the impact of relative and absolute term frequencies, e.g., to emphasize medium-frequency terms, as they are assumed to be most representative for the documents described. Especially when dealing with narrative text, high-frequency terms are often semantically meaningless functional terms (e.g., 'the', 'it'), whereas low-frequency terms can in general be considered to be distractors, generated, for example, through the use of metaphors. See Section 3 for an overview of common weighting mechanisms.
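To illustrate Equation (3), the following sketch applies a local log weighting combined with the global inverse document frequency to a toy textmatrix with invented values; lw_logtf() and gw_idf() are weighting helpers shipped with the lsa package introduced in Section 3 (Listing 1 below combines lw_bintf() with gw_idf() in the same way).

library("lsa")
# toy textmatrix: 3 terms x 3 documents, values invented
m <- matrix(c(2,0,1, 1,1,0, 0,3,1), nrow=3, byrow=TRUE)
# Equation (3): local weight of each cell times the global weight of its term
m_weighted <- lw_logtf(m) * gw_idf(m)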
The choice of the ideal number of dimensions is responsible for the effect that distinguishes LSA from the pure vector-space model: if all dimensions are used, the original matrix will be reconstructed and an unmodified vector-space model is the basis for further processing. If fewer dimensions than available non-zero singular values are used, the original vector space is approximated. Thereby, relevant structure information inherent in the original matrix is captured, reducing noise and variability in word usage. Several methods to determine the optimal number of singular values to be used have been proposed. Wild et al. (2005) report a new method, calculating the number via a share of between 30% and 50% of the cumulated singular values, to show best results.

How the similarity of document or term vectors is measured forms another class of influencing parameters. Both the similarity measure chosen and the similarity measurement method affect the outcomes. Various correlation measures have been applied in LSA. Among others, these comprise the simple cross product, the Pearson correlation (and the nearly identical cosine measure), and Spearman's rho. The measurement method can, for example, simply be a vector-to-vector comparison or the average correlation of a vector with a particular vector set.

3 The lsa Package for R

In order to facilitate the use of LSA, a package for the statistical language and environment R has been implemented by Wild (2005). The package is open-source and available via CRAN, the Comprehensive R Archive Network. A higher-level abstraction is introduced to ease the application of LSA.

Five core methods perform the direct LSA steps. With textmatrix(), a document base can be read in from a specified directory. The documents are converted to a textmatrix (i.e., document-term matrix, see above) object, which holds terms in rows and documents in columns, so that each cell contains the frequency of a particular term in a particular document. Alternatively, pseudo documents can be created with query() from a given text string. The output in this case is also a textmatrix, albeit with only one column (the query). By calling lsa() on a textmatrix, a latent semantic space is constructed, using the singular value decomposition as specified in Section 1. The three truncated matrices from the SVD are returned as a list object. A latent semantic space can be converted back to a textmatrix object with as.textmatrix(). The returned textmatrix has the same terms and documents, however with modified frequencies that now reflect inherent semantic relations not explicit in the original input textmatrix.

Additionally, the package contains several tuning options for the core routines and various support methods which help setting the influencing parameters. Some examples are given below; for additional options see Wild (2005).

Considering text preprocessing, textmatrix() offers several argument options. Two stop-word lists are provided with the package, one for German language texts (370 terms) and one for English (424 terms), which can be used to filter terms. Additionally, a controlled vocabulary can be specified; its sort order will be sustained. Support for Porter's Snowball stemmer is provided through interaction with the Rstem package (Lang (2004)). Furthermore, a lower boundary for word lengths and minimum document frequencies can be specified via an optional switch.

Methods for term weighting include the local weightings (lw) raw, log, and binary, and the global weightings (gw) normalisation, two versions of the inverse document frequency (idf), and entropy in both the original Shannon as well as in a slightly modified, more popular version (Wild (2005)).
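Before turning to dimensionality, the following minimal sketch chains the core methods just described (together with fold_in(), which is also used in Listing 1 below); the directory name docs/ and the query string are placeholders, weighting is omitted for brevity, and lsa() is called with its default dimensionality setting.

library("lsa")
tm    <- textmatrix("docs/")             # read documents into a textmatrix
space <- lsa(tm)                         # construct the latent semantic space
ltm   <- as.textmatrix(space)            # back-conversion, modified frequencies
q     <- query("latent semantic", rownames(tm)) # one-column pseudo document
q_red <- fold_in(q, space)               # map the query into the existing space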
Various methods for finding a useful number of dimensions are offered in the package. A fixed number of values can be directly assigned as an argument in the core routine. The same applies for the common practice of using a fixed fraction of the singular values, e.g., 1/50th or 1/30th. Several support methods are offered to automatically identify a reasonable number of dimensions: a percentage of the cumulated values (e.g., 50%); equalling the number of documents with a share of the cumulated values; dropping all values below 1.0 (the so-called 'Kaiser criterion'); and finally the pure vector model with all available values (Wild (2005)).

4 Demonstrations

In the following section, two examples will be given on how LSA can be applied in practice. The first case illustrates how LSA may be used to automatically score free-text essays in an educational assessment setting. Typically, if conducted by teachers, essays written by students are marked through careful reading and evaluation along specific criteria, among others their content. 'Essay' thereby refers to a test item which requires a response composed by the examinee, usually in the form of one or more sentences, of a nature that no single response or pattern of responses can be listed as correct (Stalnaker (1951)).

Fig. 2. LSA process for both examples: domain-specific and generic documents are converted to a textmatrix (Step 1), from which a latent semantic space is constructed (Step 2); gold-standard and test essays are converted to vectors (Step A), folded in (Step B), and compared via document or term vectors (Step C).

When emulating human understanding with LSA, first a latent semantic space needs to be trained from domain-specific and generic documents. Generic background texts thereby add a reasonably heterogeneous amount of general vocabulary, whereas the domain-specific texts provide the professional vocabulary. The document collection is therefore converted into a textmatrix object (see Figure 2, Step 1). Based on this textmatrix, a latent semantic space is constructed in Step 2. Ideally, this space is an optimal configuration of factors calculated from the training documents and is able to evaluate content similarity. To keep the essays to be tested and a collection of best-practice examples (so-called 'gold-standard essays') from influencing this space, they are folded in after the SVD. In Step A they are converted into a textmatrix applying the vocabulary and term order from the textmatrix generated in Step 1. In Step B they are folded into this existing latent semantic space (see Section 1).

As a very simple scoring method, the Pearson correlation between the test essays and the gold-standard essays can be used for scoring, as indicated in Step C. A high correlation equals a high score. See Listing 1 for the R code.

Listing 1. Essay Scoring with LSA

library("lsa")                       # load package
# load training texts
trm = textmatrix("trainingtexts/")
trm = lw_bintf(trm) * gw_idf(trm)    # weighting
space = lsa(trm)                     # create LSA space

# fold-in test and gold-standard essays
tem = textmatrix("essays/", vocabulary=rownames(trm))
tem = lw_bintf(tem) * gw_idf(tem)    # weighting
tem_red = fold_in(tem, space)

# score essay against gold standard
cor(tem_red[, "gold.txt"], tem_red[, "E1.txt"])  # 0.7
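As a usage note, the single comparison at the end of Listing 1 generalises to all test essays at once; the following sketch assumes the objects from Listing 1 and a gold-standard column named gold.txt.

# score every test essay against the gold standard (objects from Listing 1)
essays <- setdiff(colnames(tem_red), "gold.txt")
scores <- sapply(essays, function(e) cor(tem_red[, "gold.txt"], tem_red[, e]))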
The second case illustrates how a space changes its behavior when both the corpus size of the document collection and the number of dimensions are varied. This example can be used for experiments investigating the two driving parameters 'corpus size' and 'optimal number of dimensions'.

Fig. 3. Highly frequent terms 'eu' vs. 'oesterreich' (Pearson). Fig. 4. Highly frequent terms 'jahr' vs. 'wien' (Pearson). Both figures plot the correlation (0.0 to 1.0) over the number of documents (up to 600) and the number of dimensions (up to 600).

Therefore, as can be seen in Figure 2, a latent semantic space is constructed from a document collection by converting a randomised full sample of the available documents to a textmatrix in Step 1 (see Listing 2, Lines 4-5) and by applying the lsa() method in Step 2 (see Line 16). Step 3 (Lines 17-18) converts the space to textmatrix format and measures the similarity between two terms. By varying corpus size (Line 9, and Lines 10-13 for sanitising) and dimensionality (Line 15), behavior changes of the space can be investigated.

Figure 3 and Figure 4 show visualisations of this behavior data: the terms of Figure 3 were considered to be highly associated and thus were expected to be very similar in their correlations. Evidence for this can be derived from the chart when comparing with Figure 4, visualising the similarities of a term pair considered to be unrelated ('jahr' = 'year', 'wien' = 'vienna'). In fact, the base level of the correlations of the first, highly associated term pair is visibly higher than that of the second, unrelated term pair. Moreover, at the turning point of the cor-dims curves, the correlation levels show an even greater distance, which already stabilises for a comparatively small number of documents.

Listing 2. The Geometry of Meaning

1  tm = textmatrix("texts/", stopwords=stopwords_de)
2
3  # randomize document order
4  rndsample = sample(1:ncol(tm))
5  sm = tm[, rndsample]
6
7  # measure term-term similarities
8  s = NULL
9  for (i in (2:ncol(sm))) {
10    # filter out unused terms
11    if (any(rowSums(sm[,1:i])==0)) {
12      m = sm[-(which(rowSums(sm[,1:i])==0)), 1:i]
13    } else { m = sm }
14    # increase dims
15    for (d in 2:i) {
16      space = lsa(m, dims=d)
17      redm = as.textmatrix(space)
18      s = c(s, cor(redm["jahr",], redm["wien",]))
19    }
20  }

5 Evaluating Algorithm Effectiveness

Evaluating the effectiveness of LSA, especially with changing parameter settings, depends on the targeted application area. Within an information retrieval setting, the same results may lead to a different interpretation than in an essay scoring setting. One evaluation option is to validate externally by comparing machine behavior to human behavior (see Figure 5). For the essay scoring example, the authors have evaluated machine against human scores, finding a man-machine correlation (Spearman's rho) of up to .75, significant at a level below .001, in nine exams tested. In comparison, human-to-human interrater correlation is often reported to vary around .6 (Wild et al. (2005)). In the authors' own tests, the highest human interrater correlation was found to be .8 (for the same exam as the man-machine correlation mentioned above), decreasing rapidly with dropping subject familiarity of the raters.

Fig. 5. Evaluating the algorithm: a vector of machine scores (e.g., 0.2, 0.2, 0.8) is correlated with a vector of human scores (e.g., 0.5, 0.5, 3.5), here yielding rho = 1.
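A minimal sketch of this external validation step; the score vectors below are hypothetical and merely echo the layout of Figure 5.

# hypothetical machine and human scores for six essays (invented values)
machine <- c(0.2, 0.2, 0.8, 0.5, 0.7, 0.3)
human   <- c(0.5, 0.5, 3.5, 2.0, 3.0, 1.0)
# man-machine correlation (Spearman's rho); here 1, matching Figure 5
cor(machine, human, method="spearman")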
6 Conclusion and Identification of Current Challenges

An overview of latent semantic analysis and its implementation in the R package 'lsa' has been given and illustrated with two examples. With the help of the package, LSA can be applied using only a few lines of code. As rolled out in Section 2, however, the various influencing parameters may hinder users in calibrating LSA to achieve optimal results, sometimes even lowering performance below that of the (quicker) simple vector-space model. In general, LSA shows greater effectiveness than the pure vector-space model in settings that benefit from fuzziness (e.g., information retrieval, recommender systems). However, in settings that have to rely on more precise representation structures (e.g., essay scoring, term relationship mining), better means to predict behavior under certain parameter settings could ease the applicability and increase efficiency by reducing tuning times. For the future, this can be regarded as the main challenge: an extensive investigation of the influencing parameters, their settings, and their interdependencies to enable a more effective application.

References

DEERWESTER, S., DUMAIS, S., FURNAS, G., LANDAUER, T. and HARSHMAN, R. (1990): Indexing by Latent Semantic Analysis. JASIS, 41, 391-407.

BERRY, M., DUMAIS, S. and O'BRIEN, G. (1995): Using Linear Algebra for Intelligent Information Retrieval. SIAM Review, 37, 573-595.

WILD, F., STAHL, C., STERMSEK, G. and NEUMANN, G. (2005): Parameters Driving Effectiveness of Automated Essay Scoring with LSA. In: M. Danson (Ed.): Proceedings of the 9th CAA Conference. Professional Development, Loughborough, 485-494.

BAEZA-YATES, R. and RIBEIRO-NETO, B. (1999): Modern Information Retrieval. ACM Press, New York.

WILD, F. (2005): lsa: Latent Semantic Analysis. R package version 0.57.

LANG, D.T. (2004): Rstem. R package version 0.2-0.

STALNAKER, J.M. (1951): The essay type of examination. In: E.F. Lindquist (Ed.): Educational Measurement. George Banta, Menasha, 495-530.