Investigating Unstructured Texts
with Latent Semantic Analysis

Fridolin Wild, Christina Stahl

Institute for Information Systems and New Media,
Vienna University of Economics and Business Administration,
Augasse 2-6, A-1090 Vienna, Austria, {firstname.surname}@wu-wien.ac.at
Abstract. Latent semantic analysis (LSA) is an algorithm applied to approximate the meaning of texts, thereby exposing semantic structure to computation. LSA combines the classical vector-space model well known in computational linguistics with a singular value decomposition (SVD), a two-mode factor analysis. Thus, bag-of-words representations of texts can be mapped into a modified vector space that is assumed to reflect semantic structure. In this contribution the authors describe the lsa package for the statistical language and environment R and illustrate its proper use through examples from the areas of automated essay scoring and knowledge representation.
1 Introduction to Latent Semantic Analysis
Derived from latent semantic indexing, LSA is intended to enable the analysis of the semantic structure of texts. The basic idea behind LSA is that the collocation of terms of a given document-term vector space reflects a higher-order latent semantic structure, which is obscured by word usage (e.g., by synonyms or ambiguities). By using conceptual indices that are derived statistically via a truncated singular value decomposition, this variability problem is believed to be overcome (Deerwester et al. (1990)).
In a typical LSA process, first a document-term matrix M is constructed from a given text base of n documents containing m terms. The term 'textmatrix' will be used throughout the rest of this contribution to denote this type of document-term matrices. This textmatrix M of the size m × n is then resolved by the singular value decomposition into the term-vector matrix T (constituting the left singular vectors) and the document-vector matrix D (constituting the right singular vectors), both being orthonormal, and the diagonal matrix S. These matrices are then reduced to a particular number of dimensions k, giving the truncated matrices Tk, Sk and Dk, the latent semantic space. Multiplying the truncated matrices Tk, Sk and Dk results in a new matrix Mk which is the least-squares best fit approximation of M with k singular values.
Mk is of the same format as M, i.e., rows represent the same terms and columns the same documents.
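The rank-k approximation Mk can be traced with base R's svd(); the small textmatrix below is invented purely for illustration:

```r
# Toy textmatrix M: 3 terms x 4 documents (invented frequencies)
M <- matrix(c(1, 0, 2, 1,
              0, 1, 1, 0,
              1, 1, 0, 2), nrow = 3, byrow = TRUE)

sv <- svd(M)   # full singular value decomposition: M = U S V^T
k <- 2         # number of dimensions to keep

# multiply the truncated matrices to obtain the rank-k approximation Mk
Mk <- sv$u[, 1:k] %*% diag(sv$d[1:k]) %*% t(sv$v[, 1:k])
```

Mk has the same format as M, with the cell values replaced by their least-squares best fit under k singular values.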
To keep additional documents from influencing a previously calculated semantic space, or to simply re-use the structure contained in an already existing factor distribution, new documents can be folded in after the singular value decomposition. For this purpose, the add-on documents can be added to the pre-existing latent semantic space by mapping them into the existing factor structure. Moreover, folding-in is computationally a lot less costly, as no singular value decomposition is needed. To fold in, a pseudo-document vector m̂ needs to be calculated in three steps (Berry et al. (1995)): after constructing a document vector v from the additional documents containing the term frequencies in the exact order constituted by the input textmatrix M, v can be mapped into the latent semantic space by applying (1) and (2).
d̂ = vᵀ Tk Sk⁻¹   (1)

m̂ = Tk Sk d̂   (2)
Thereby, Tk and Sk are the truncated matrices from the previously calculated latent semantic space. The resulting vector d̂ of Equation (1) represents an additional column of Dk. The resulting pseudo-document vector m̂ from Equation (2) is identical to an additional column in the textmatrix representation of the latent semantic space.
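Equations (1) and (2) can be followed numerically with base R; the toy matrix and the new document vector below are invented for the sketch:

```r
# Toy textmatrix: 3 terms x 3 documents (invented frequencies)
M <- matrix(c(1, 0, 2,
              0, 1, 1,
              1, 1, 0), nrow = 3, byrow = TRUE)

# Truncated SVD with k = 2 singular values
sv <- svd(M)
k <- 2
Tk <- sv$u[, 1:k]
Sk <- diag(sv$d[1:k])

# New document: term frequencies in the same term order as M
v <- c(1, 0, 1)

dhat <- t(v) %*% Tk %*% solve(Sk)  # Equation (1): coordinates in factor space
mhat <- Tk %*% Sk %*% t(dhat)      # Equation (2): pseudo-document column
```

mhat has as many rows as M has terms, so it can be appended as an additional column of the textmatrix representation of the space; dhat is the corresponding additional column of Dk.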
2 Influencing Parameters

Several classes of adjustment parameters can be functionally differentiated in the latent semantic analysis process. Every class introduces new parameter settings that drive the effectiveness of the algorithm. The following classes have been identified so far by Wild et al. (2005): textbase compilation and selection, preprocessing methods, weighting schemes, choice of dimensionality, and similarity measurement techniques (see Figure 1).
Different texts create a different factor distribution. Moreover, texts may be split into components such as sentences, paragraphs, chapters, bags-of-words of a fixed size, or even into context bags around certain keywords. The document collection available may be filtered according to specific criteria such as novelty, or reduced to a random sample, so that only a subset of the existing documents will actually be used in the latent semantic analysis. These textbase compilation and selection options form one class of parameters.
Document preprocessing comprises several operations performed on the input texts, such as lexical analysis, stop-word filtering, reduction to word stems, filtering of keywords above or below certain frequency thresholds, and the use of controlled vocabularies (Baeza-Yates and Ribeiro-Neto (1999)).
Fig. 1. Parameter classes influencing the algorithm effectiveness. [Diagram listing the options per class: textbase compilation and selection (documents, chapters, paragraphs, sentences, context bags, number of docs); preprocessing (stemming, stopword filtering, global or local frequency bandwidth, controlled vocabulary, raw); weighting (local: raw, binary tf, log tf; global: raw, normalisation, idf, 1+entropy); dimensionality (coverage = 0.3/0.4/0.5, coverage >= ndocs, 1/30, 1/50, magic 10, none (vector model)); similarity measurement (method: best hit, mean of best; measure: Pearson, Spearman, cosine).]

Weighting schemes have been shown to significantly influence the effectiveness of LSA (Wild et al. (2005)). Weighting schemes in general can be differentiated into local (lw) and global (gw) weighting schemes, which may be combined as follows:
m̌ = lw(m) · gw(m)   (3)
Local schemes only take into account term frequencies within a particular document, whereas global weighting schemes relate term frequencies to the frequency distribution in the whole document collection. Weighting schemes are needed to change the impact of relative and absolute term frequencies, e.g., to emphasise medium-frequency terms, as they are assumed to be most representative for the documents described. Especially when dealing with narrative text, high-frequency terms are often semantically meaningless functional terms (e.g., 'the', 'it'), whereas low-frequency terms can in general be considered to be distractors generated, for example, through the use of metaphors. See Section 3 for an overview of common weighting mechanisms.
The choice of the ideal number of dimensions is responsible for the effect that distinguishes LSA from the pure vector-space model: if all dimensions are used, the original matrix will be reconstructed and an unmodified vector-space model is the basis for further processing. If fewer dimensions than available non-zero singular values are used, the original vector space is approximated. Thereby, relevant structure information inherent in the original matrix is captured, reducing noise and variability in word usage. Several methods to determine the optimal number of singular values have been proposed. Wild et al. (2005) report that a method calculating the number via a share between 30% and 50% of the cumulated singular values shows the best results.
How the similarity of document or term vectors is measured forms another class of influencing parameters. Both the similarity measure chosen and the measurement method affect the outcomes. Various correlation measures have been applied in LSA. Among others, these comprise the simple cross product, the Pearson correlation (and the nearly identical cosine measure), and Spearman's rho. The measurement method can, for example, simply be a vector-to-vector comparison or the average correlation of a vector with a particular vector set.
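As a quick illustration of the cosine measure mentioned above, the lsa package exports a cosine() function; the vectors below are invented:

```r
library("lsa")

a <- c(1, 2, 3)
b <- c(2, 4, 6)    # parallel to a, so the cosine is 1
d <- c(3, -1, 0)   # an arbitrary second comparison vector

cosine(a, b)  # cosine similarity of two parallel vectors
cosine(a, d)  # cosine similarity of two arbitrary vectors
```

A vector-to-vector comparison like this is the simplest measurement method; the "mean of best" method instead averages a vector's correlations with a whole vector set.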
3 The lsa Package for R

In order to facilitate the use of LSA, a package for the statistical language and environment R has been implemented by Wild (2005). The package is open-source and available via CRAN, the Comprehensive R Archive Network. A higher-level abstraction is introduced to ease the application of LSA.
Five core methods perform the direct LSA steps. With textmatrix(), a document base can be read in from a specified directory. The documents are converted to a textmatrix (i.e., document-term matrix, see above) object, which holds terms in rows and documents in columns, so that each cell contains the frequency of a particular term in a particular document. Alternatively, pseudo documents can be created with query() from a given text string. The output in this case is also a textmatrix, albeit with only one column (the query). By calling lsa() on a textmatrix, a latent semantic space is constructed, using the singular value decomposition as specified in Section 1. The three truncated matrices from the SVD are returned as a list object. A latent semantic space can be converted back to a textmatrix object with as.textmatrix(). The returned textmatrix has the same terms and documents, however with modified frequencies that now reflect inherent semantic relations not explicit in the original input textmatrix.
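A minimal end-to-end run of these core methods might look as follows; the two tiny documents and the temporary corpus directory are made up for the sketch:

```r
library("lsa")

# write two invented example documents into a temporary corpus directory
corpus <- file.path(tempdir(), "corpus")
dir.create(corpus, showWarnings = FALSE)
writeLines("dog cat cat mouse", file.path(corpus, "doc1.txt"))
writeLines("dog dog cat bird",  file.path(corpus, "doc2.txt"))

tm <- textmatrix(corpus)      # terms in rows, documents in columns
space <- lsa(tm, dims = 2)    # truncated SVD -> latent semantic space
tm2 <- as.textmatrix(space)   # back to a textmatrix with modified frequencies

# a one-column pseudo document over the same term order
q <- query("dog mouse", rownames(tm))
```

tm2 keeps the terms and documents of tm, while its cell values now reflect the latent structure; q could subsequently be folded into space with fold_in().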
Additionally, the package contains several tuning options for the core routines and various support methods which help setting the influencing parameters. Some examples are given below; for additional options see Wild (2005).

Considering text preprocessing, textmatrix() offers several argument options. Two stop-word lists are provided with the package, one for German-language texts (370 terms) and one for English (424 terms), which can be used to filter terms. Additionally, a controlled vocabulary can be specified; its sort order will be sustained. Support for Porter's Snowball stemmer is provided through interaction with the Rstem package (Lang (2004)). Furthermore, a lower boundary for word lengths and minimum document frequencies can be specified via optional switches.
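For example (directory and documents invented for the sketch), stop-word filtering and a word-length threshold can be switched on directly in textmatrix():

```r
library("lsa")

# invented mini corpus in a temporary directory
corpus <- file.path(tempdir(), "prep")
dir.create(corpus, showWarnings = FALSE)
writeLines("the cat sat on the mat", file.path(corpus, "d1.txt"))
writeLines("a dog sat on a log",     file.path(corpus, "d2.txt"))

# filter with the bundled English stop-word list and
# drop all words shorter than three characters
tm <- textmatrix(corpus, stopwords = stopwords_en, minWordLength = 3)
rownames(tm)  # surviving terms only
```

stopwords_en is the English stop-word list shipped with the package (stopwords_de is the German one); stemming would additionally require the Rstem package.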
Methods for term weighting include the local weightings (lw) raw, log, and binary, as well as the global weightings (gw) normalisation, two versions of the inverse document frequency (idf), and entropy, both in the original Shannon version and in a slightly modified, more popular version (Wild (2005)).
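Applied to a textmatrix, a combination in the sense of Equation (3) is a simple element-wise product of a local and a global weighting; the toy matrix here is hand-made for illustration:

```r
library("lsa")

# invented toy textmatrix: 3 terms x 3 documents
m <- matrix(c(2, 0, 1,
              0, 3, 1,
              1, 1, 4), nrow = 3, byrow = TRUE,
            dimnames = list(c("alpha", "beta", "gamma"),
                            c("d1", "d2", "d3")))

# local log term frequency combined with global inverse document frequency
mw <- lw_logtf(m) * gw_idf(m)
```

The gw_* functions return one weight per term (row), so the product recycles the global weights over the columns of the locally weighted matrix.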
Various methods for finding a useful number of dimensions are offered in the package. A fixed number of values can be directly assigned as an argument to the core routine. The same applies to the common practise of using a fixed fraction of the singular values, e.g., 1/50th or 1/30th. Several support methods are offered to automatically identify a reasonable number of dimensions: a percentage of the cumulated values (e.g., 50%); equalling the number of documents with a share of the cumulated values; dropping all values below 1.0 (the so-called 'Kaiser criterion'); and finally the pure vector model with all available values (Wild (2005)).
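These supports are available as dimcalc_*() helpers, which return functions that lsa() applies to the singular values via its dims argument; the matrix below is a made-up example:

```r
library("lsa")

# invented toy textmatrix: 6 terms x 5 documents
m <- matrix(c(2, 0, 1, 0, 1,
              0, 3, 1, 1, 0,
              1, 1, 4, 0, 2,
              0, 1, 0, 2, 1,
              1, 0, 2, 1, 3,
              2, 1, 0, 1, 0), nrow = 6, byrow = TRUE)

s_share  <- lsa(m, dims = dimcalc_share(share = 0.5))  # 50% of cumulated values
s_kaiser <- lsa(m, dims = dimcalc_kaiser())            # drop values below 1.0
s_raw    <- lsa(m, dims = dimcalc_raw())               # all values: vector model
```

Each call returns the truncated tk, dk and sk components; dimcalc_raw() reproduces the plain vector-space model, while the other helpers truncate the space.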
4 Demonstrations

In the following section, two examples will be given on how LSA can be applied in practise. The first case illustrates how LSA may be used to automatically score free-text essays in an educational assessment setting. Typically, if conducted by teachers, essays written by students are marked through careful reading and evaluation along specific criteria, among others their content. 'Essay' thereby refers to a test item which requires a response composed by the examinee, usually in the form of one or more sentences, of a nature that no single response or pattern of responses can be listed as correct (Stalnaker (1951)).
Fig. 2. LSA process for both examples. [Flow diagram: domain-specific and generic documents are converted to a textmatrix (Step 1) and a latent semantic space is constructed (Step 2); gold-standard essays and test essays are converted to a textmatrix (Step A), folded in (Step B), and compared via document or term vectors (Step C).]
When emulating human understanding with LSA, first a latent semantic space needs to be trained from domain-specific and generic background documents. The generic texts thereby add a reasonably heterogeneous amount of general vocabulary, whereas the domain-specific texts provide the professional vocabulary. The document collection is therefore converted into a textmatrix object (see Figure 2, Step 1). Based on this textmatrix, a latent semantic space is constructed in Step 2. Ideally, this space is an optimal configuration of factors calculated from the training documents and is able to evaluate content similarity. To keep the essays to be tested and a collection of best-practise examples (so-called 'gold-standard essays') from influencing this space, they are folded in after the SVD. In Step A they are converted into a textmatrix applying the vocabulary and term order from the textmatrix generated in Step 1. In Step B they are folded into this existing latent space (see Section 1).
As a very simple scoring method, the Pearson correlation between the test essays and the gold-standard essays can be used, as indicated in Step C. A high correlation equals a high score. See Listing 1 for the R code.
Listing 1. Essay Scoring with LSA

# load package
library("lsa")

# load training texts
trm = textmatrix("trainingtexts/")
trm = lw_bintf(trm) * gw_idf(trm)   # weighting
space = lsa(trm)                    # create LSA space

# fold-in test and gold standard essays
tem = textmatrix("essays/", vocabulary = rownames(trm))
tem = lw_bintf(tem) * gw_idf(tem)   # weighting
tem_red = fold_in(tem, space)

# score essay against gold standard
cor(tem_red[, "gold.txt"], tem_red[, "E1.txt"])   # 0.7
The second case illustrates how a space changes behavior when both the corpus size of the document collection and the number of dimensions are varied. This example can be used for experiments investigating the two driving parameters 'corpus size' and 'optimal number of dimensions'.
Fig. 3. Highly frequent terms 'eu' vs. 'oesterreich' (Pearson). Fig. 4. Highly frequent terms 'jahr' vs. 'wien' (Pearson). [Surface plots of the correlation (cor, 0.0-1.0) over the number of documents (ndocs, 200-600) and the number of dimensions (dims, 200-600).]
Therefore, as can be seen in Figure 2, a latent semantic space is constructed from a document collection by converting a randomised full sample of the available documents to a textmatrix in Step 1 (see Listing 2, Lines 4-5) and by applying the lsa() method in Step 2 (see Line 16). Step 3 (Lines 17-18) converts the space to textmatrix format and measures the similarity between two terms. By varying corpus size (Line 9, and Lines 10-13 for sanitising) and dimensionality (Line 15), behavior changes of the space can be investigated.
Figure 3 and Figure 4 show visualisations of this behavior data: the terms of Figure 3 were considered to be highly associated and thus were expected to be very similar in their correlations. Evidence for this can be derived from the chart when comparing with Figure 4, which visualises the similarities of a term pair considered to be unrelated ('jahr' = 'year', 'wien' = 'vienna'). In fact, the base level of the correlations of the first, highly associated term pair is visibly higher than that of the second, unrelated term pair. Moreover, at the turning point of the cor-dims curves, the correlation levels show an even increased distance, which already stabilises for a comparatively small number of documents.
Listing 2. The Geometry of Meaning

1   # load corpus
2   tm = textmatrix("texts/", stopwords = stopwords_de)
3   # randomize document order
4   rndsample = sample(1:ncol(tm))
5   sm = tm[, rndsample]
6
7   # measure term-term similarities
8   s = NULL
9   for (i in (2:ncol(sm))) {
10      # filter out unused terms
11      if (any(rowSums(sm[, 1:i]) == 0)) {
12        m = sm[-(which(rowSums(sm[, 1:i]) == 0)), 1:i]
13      } else { m = sm }
14      # increase dims
15      for (d in 2:i) {
16        space = lsa(m, dims = d)
17        redm = as.textmatrix(space)
18        s = c(s, cor(redm["jahr", ], redm["wien", ]))
19      }
20    }
5 Evaluating Algorithm Effectiveness

Evaluating the effectiveness of LSA, especially with changing parameter settings, depends on the application area targeted. Within an information retrieval setting, the same results may lead to a different interpretation than in an essay scoring setting. One evaluation option is to validate externally by comparing machine behavior to human behavior (see Figure 5). For the essay scoring example, the authors have evaluated machine against human scores, finding a man-machine correlation (Spearman's rho) of up to .75, significant at a level below .001, in nine exams tested. In comparison, human-to-human interrater correlation is often reported to vary around .6 (Wild et al. (2005)). In the authors' own tests, the highest human inter-rater correlation was found to be .8 (for the same exam as the man-machine correlation mentioned above), decreasing rapidly with dropping subject familiarity of the raters.
Fig. 5. Evaluating the algorithm. [Diagram: machine scores are compared with human scores via rank correlation (rho).]
6 Conclusion and Identification of Current Challenges

An overview of latent semantic analysis and its implementation in the R package 'lsa' has been given and illustrated with two examples. With the help of the package, LSA can be applied using only a few lines of code. As rolled out in Section 2, however, the various influencing parameters may hinder users in calibrating LSA to achieve optimal results, sometimes even lowering performance below that of the (quicker) simple vector-space model.

In general, LSA shows greater effectiveness than the pure vector-space model in settings that benefit from fuzziness (e.g., information retrieval, recommender systems). However, in settings that have to rely on more precise representation structures (e.g., essay scoring, term relationship mining), better means to predict behavior under certain parameter settings could ease the applicability and increase efficiency by reducing tuning times. For the future, this can be regarded as the main challenge: an extensive investigation of the influencing parameters, their settings, and their interdependencies to enable a more effective application.
References

BAEZA-YATES, R. and RIBEIRO-NETO, B. (1999): Modern Information Retrieval. ACM Press, New York.
BERRY, M., DUMAIS, S. and O'BRIEN, G. (1995): Using Linear Algebra for Intelligent Information Retrieval. SIAM Review, 37, 573-595.
DEERWESTER, S., DUMAIS, S., FURNAS, G., LANDAUER, T. and HARSHMAN, R. (1990): Indexing by Latent Semantic Analysis. JASIS, 41, 391-407.
LANG, D.T. (2004): Rstem. R package version 0.2-0.
STALNAKER, J.M. (1951): The essay type of examination. In: E.F. Lindquist (Ed.): Educational Measurement. George Banta, Menasha, 495-530.
WILD, F. (2005): lsa: Latent Semantic Analysis. R package version 0.57.
WILD, F., STAHL, C., STERMSEK, G. and NEUMANN, G. (2005): Parameters Driving Effectiveness of Automated Essay Scoring with LSA. In: M. Danson (Ed.): Proceedings of the 9th CAA Conference. Professional Development, Loughborough, 485-494.