


Information Space based on HTML Structure*

Gregory B. Newby
UNC Chapel Hill

Abstract

The main goal for the Information Space system for TREC9 was early precision. To facilitate this, an emphasis was placed on seeking matches from only the TITLE, H1, H2 and H3 tags in the Web (wt10g) and large Web (wt100) document collections. Ranking of documents was based on a combination of Boolean union sets, term weights, and principal components analysis (PCA). Very large sparse co-occurrence matrices were created for term weighting and PCA. The Information Space system is part of a larger general software package called IRTools.

Introduction

This year’s TREC entry for the Information Space system builds on past years, with some specific goals. Due to 2000 being the first year with Web data for the main task for TREC (instead of newswire and other data, as in past years), it seemed desirable to make use of the structure of HTML. As casual observation of the popular Web search engines (Google, Lycos, etc.) reveals, these systems provide additional weight to terms occurring in the <TITLE> tags of documents, in addition to searching through the terms in each document.

The Information Space (IS) main Web task entry for this year focused only on terms in the <TITLE>, <H1>, <H2>, and <H3> tags in the datasets. This was intended to facilitate early precision, by matching the short TREC topic title or title plus description statements to terms in these tags. The submission for the main Web task was 6 days late, and therefore not judged (although it was counted by NIST as an “official” run). Post hoc analysis of some queries indicates that if results were judged, they probably would not have been substantially better than the non-judged results found in the conference proceedings.

IS also made an entry to the large Web task or VLC. The 100GB VLC (w100) was processed similarly to the main Web task, by focusing only on terms in the same set of tags (title, h1, h2 and h3). Because this run was also submitted late, by nearly 2 weeks, it was not judged. Due to the small number of official VLC submissions and small number of judged documents, no useful recall or precision statistics are available.

This paper will present an overview of the procedure used to index and retrieve from the wt10g and w100 datasets, followed by a brief discussion of the large co-occurrence matrices generated. Then, system-based and relevance-based performance outcomes are discussed. It is concluded that query expansion did not serve well to facilitate early high precision. Furthermore, a lack of sophisticated term weighting also hurt results.

* Contact data: gbnewby@ils.unc.edu, http://ils.unc.edu/gbnewby. CB 3360 Manning Hall, Chapel Hill, NC, 27599-3360 USA.

InformationSpace 1-
The IRTools Software

IS is part of a set of software tools for IR experimentation under development by the author and his colleagues. The software is called the “Information Retrieval Toolkit,” or IRTools. The purpose of IRTools is twofold:

1. To provide an integrated collection of C++ classes designed to facilitate IR experimentation; and
2. To incorporate design for large-scale practical use.

Although modern information scientists have always relied on software for their experiments, relatively few have chosen to make their software freely available to others. For those who have shared, the software is often not suitable for re-use in other experimental settings, due to lack of documentation, cross-platform instability, or non-modular design. IRTools is intended to help address the shortage of software for retrieval experimentation.

Another problem that has often hindered information scientists is the difficulty of demonstrating the scalability of their ideas. IRTools places an emphasis on high performance data structures, file structures and algorithms (Newby, 2000b). Real-world functionality will include the ability to update the document collection (e.g., by spidering the Web periodically). IRTools’ goal is to index billions of documents, with hundreds of millions of unique terms, and over a terabyte of aggregated data.

IRTools is designed modularly, as a library of C++ classes. Currently, IRTools is over 25,000 lines of code including test programs. It makes extensive use of the standard template library (STL). The plan for IRTools is to incorporate the functionality of all major types of experimental IR: probabilistic retrieval, the vector space model, latent semantic indexing, simple Boolean retrieval, and others. IRTools will make it easier and faster for information scientists to perform experiments or expand software. The software development is supported in part by a grant from the NSF under their information technology and research (ITR) program. The project homepage is http://irtools.sourceforge.net.

Information Space Techniques for TREC9

Information Space, or IS, is an approach to information retrieval that is similar to latent semantic indexing (LSI). Over the past several years, IS has incorporated different specific techniques to achieve particular goals. IRTools will enable more of these goals to be integrated. For example, the TREC9 IS programs did not have good facilities for term weighting, even though the utility of term weighting using IS techniques was demonstrated in TREC8 (Newby, 2000a; Newby 1998).

The main distinction between LSI and IS is that LSI utilizes a singular value decomposition (SVD) on the term by document matrix, while IS utilizes principal components analysis (PCA) on the term by term matrix. In both LSI and IS, the distinguishing point from the vector space model (VSM) is that terms are not assumed to be mutually unrelated. The basic process is the same, however: document vectors are computed based on the vectors for terms they contain. A query vector is similarly computed, and the closest documents to the query are retrieved.

Although LSI and IS are comparable, and have a similar intellectual heritage in the mathematics of linear algebra, they actually operationalize a significantly different goal. With both LSI and IS, only k columns of the eigenvectors from the SVD or PCA process are used, rather than all N columns for each of the N terms. With LSI, all columns of the eigenvectors would in fact result in a vector space in which all terms are mutually orthogonal; in other words, the same fundamental model of the VSM. Thus, the k-dimensional vector space representing term relations in LSI is an approximation of an orthogonal term space. By reducing k, LSI attempts to account for assumed “errors” in the original term by document matrix.

With IS, all N columns of the eigenvectors would result in a vector space in which term relations are identically scaled to the numeric relations among terms in the original term by term input co-occurrence matrix. Thus, the k-dimensional vector space representing term relations in IS is an approximation of the relations among terms actually measured in the term by term matrix.

These differences are moderated by the other differences in how the techniques are actually applied. For most purposes, it is accurate to characterize IS as similar to LSI. The author has written a more extensive treatment of this subject which has been submitted elsewhere for publication.

The specific techniques used for both the main Web and VLC in TREC9 are as follows:

Phase 1: Indexing
1. Only terms in the <TITLE>, <H1>, <H2> and <H3> tags were processed. All terms in other tags were ignored, as was any document metadata for the wt10g or w100 collections. Documents without these tags were ignored.
2. All terms with fewer than 20 characters and consisting only of alphabetical characters A-Z (case insensitive) were indexed. No stemming was applied.
3. A term by term co-occurrence matrix was built for all the indexed terms for all the documents they occurred in. This resulted in a very large and very sparse matrix.
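
The indexing phase above can be sketched in a few lines of Python. This is only an illustration, not the IRTools code: the names `TagTermParser`, `extract_tag_terms` and `cooccurrence` are hypothetical, tokens are assumed to be whitespace-delimited, and each unordered pair of distinct terms is assumed to be counted once per tag set.

```python
import re
from collections import Counter
from html.parser import HTMLParser
from itertools import combinations

TARGET_TAGS = {"title", "h1", "h2", "h3"}
# Assumption: a term is a whitespace-delimited token of 1-19 letters A-Z.
TERM_RE = re.compile(r"[A-Za-z]{1,19}")


class TagTermParser(HTMLParser):
    """Collect one list of terms for each TITLE/H1/H2/H3 element."""

    def __init__(self):
        super().__init__()
        self.tag_sets = []   # one list of terms per targeted element
        self._open = None    # targeted tag currently open, if any

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lower case.
        if tag in TARGET_TAGS and self._open is None:
            self._open = tag
            self.tag_sets.append([])

    def handle_endtag(self, tag):
        if tag == self._open:
            self._open = None

    def handle_data(self, data):
        if self._open is None:
            return  # text outside the targeted tags is ignored
        for token in data.split():
            if TERM_RE.fullmatch(token):
                self.tag_sets[-1].append(token.lower())  # no stemming


def extract_tag_terms(html):
    """Return the non-empty tag sets of a document (empty list if none)."""
    parser = TagTermParser()
    parser.feed(html)
    parser.close()
    return [terms for terms in parser.tag_sets if terms]


def cooccurrence(tag_sets):
    """Count unordered pairs of distinct terms that share a tag set."""
    counts = Counter()
    for terms in tag_sets:
        for pair in combinations(sorted(set(terms)), 2):
            counts[pair] += 1
    return counts
```

A document with no targeted tags yields an empty list, which mirrors the rule that such documents were ignored.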

Phase 2: Retrieval
1. Only terms that had been indexed were used; others were stopped. In addition, the SMART stoplist was employed, along with a few additional stop words consisting of HTML tags.
2. Query terms were expanded (by 100 terms for wt10g, and 25 terms for w100). The top co-occurring terms for each query term were added to the query.

3. All documents with any of the expanded query terms were selected for further consideration; the rest of the documents were assumed to be non-relevant.
4. The full (sparse) co-occurrence matrix for all of the expanded query terms was used to calculate the full (dense) correlation matrix for the terms.
5. PCA was performed on this correlation matrix:
   a. The eigenvectors of the correlation matrix were computed.
   b. Term vectors were computed as the dot product of that term’s eigenvector and the term’s standardized (z) scores from the original co-occurrence matrix.
6. Each document under consideration was located at the geometric center of the expanded query terms it contained (terms it contained that were not part of the expanded query were ignored).
7. The query was located at the geometric center of its terms.
8. The query and document locations were normalized to unit length.
9. Distances from each document to the query were ranked, and the closest retrieved.
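
Steps 6 through 9 can be illustrated with a small Python sketch. It assumes the term vectors from step 5 are already available as a dictionary covering only the expanded query terms; the function names and the toy two-dimensional vectors in the usage note are hypothetical, and this is not the IS implementation itself.

```python
import math


def centroid(vectors):
    """Geometric center (component-wise mean) of a list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def unit(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def locate(terms, term_vectors):
    """Steps 6-8: center of the known terms, normalized to unit length.

    Terms without a vector (not part of the expanded query) are ignored.
    Returns None when no term has a vector.
    """
    vectors = [term_vectors[t] for t in terms if t in term_vectors]
    return unit(centroid(vectors)) if vectors else None


def rank_documents(query_terms, docs, term_vectors):
    """Step 9: document IDs ordered by distance to the query, closest first."""
    query = locate(query_terms, term_vectors)
    if query is None:
        return []
    scored = []
    for doc_id, terms in docs.items():
        location = locate(terms, term_vectors)
        if location is not None:
            scored.append((math.dist(query, location), doc_id))
    return [doc_id for _, doc_id in sorted(scored)]
```

For instance, with `term_vectors = {"paris": [1.0, 0.0], "hotel": [0.0, 1.0], "flight": [1.0, 1.0]}`, a document containing both "paris" and "flight" sits at the same location as the query "paris flight" and is ranked ahead of a document containing only "hotel".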

Note that the choice of the geometric distance versus cosine is arbitrary for unit length vectors: the ranking is the same. But for non-uniform vector lengths, the geometric distance is more accurate than the cosine, as the cosine only considers the angle of incidence between vectors, not the difference.
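
The equivalence for unit-length vectors can be checked with a short derivation. The symbols q and d for the query and document locations, and θ for the angle between them, are notation introduced here for illustration:

```latex
\|q - d\|^2 = \|q\|^2 + \|d\|^2 - 2\, q \cdot d = 2 - 2\cos\theta
\qquad \text{when } \|q\| = \|d\| = 1 .
```

Since 2 - 2cos θ decreases as cos θ increases, sorting by ascending distance and sorting by descending cosine give the same order.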

Large Co-Occurrence Matrices

A difficulty of working with co-occurrence matrices with large numbers of terms is that the number of updates to the matrix during indexing can be daunting. Consider that for a document with 1000 terms, (1000 x (1000-1))/2 or 499500 term pairs exist, and must be considered for updating the term by term co-occurrence matrix. Even if term ordering or term counts are ignored, the number of possible term pairs per document can be large.
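
As a quick check of the pair count, the number of unordered pairs among n distinct terms is n(n-1)/2. The helper name below is hypothetical:

```python
def term_pairs(n_terms):
    """Number of unordered pairs of distinct terms among n_terms terms."""
    return n_terms * (n_terms - 1) // 2


print(term_pairs(1000))  # 499500
```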

One approach to avoiding a very large number of term pairs for each document is to consider co-occurrence only within subdocuments (this is also conceptually appealing). A subdocument might be considered as a term plus its surrounding terms (a sliding window), terms within the same paragraph, or terms within the same sentence. Another obvious approach, employed by IS for TREC9, is to only consider terms within the same tag set. Here, the co-occurrence matrix was computed based only on terms that were found together within a title, h1, h2 or h3 tag.

This resulted in a manageable number of term pairs for most documents, as HTML titles and h1, h2 and h3 tags tend to contain fewer than a dozen terms. This also added to the sparsity of the matrix, which helps with storage. Were every cell in a term by term matrix to be filled, the storage size on disk would be N times N (for N terms) times the size of each datum stored. For the 1.2M unique terms identified in the w100 collection and 4 bytes per integer, this is well over 5 terabytes.
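
The dense-storage estimate can be reproduced with a short calculation. The sketch below uses the exact w100 term count of 1207560 from Table 1; the function name is hypothetical.

```python
def dense_bytes(n_terms, bytes_per_cell):
    """Bytes needed to store every cell of an N by N term matrix."""
    return n_terms * n_terms * bytes_per_cell


w100_dense = dense_bytes(1_207_560, 4)
print(w100_dense / 1e12)  # size in terabytes (10**12 bytes)
```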

Using a variation on the Harwell-Boeing sparse matrix format, IS only stored the non-zero cells on disk. The storage required using the IS variation on the H-B format is:

S(3N + 2C + 2N)

where:
S is the number of bytes per integer
N is the number of terms (aka rows)
C is the total number of non-zero column entries

Using this format with the number of non-zero co-occurrence scores reported in Table 1, about 304 Mbytes were required to store values for the co-occurrence matrix for the 1.2M unique terms from w100, a savings of well over 99%. In fact, this is nearly twice as much storage as would be required to store only ½ of the matrix with no loss of information, as the matrix is symmetric. Both sides were used during the retrieval phase described above, so the symmetric matrix was converted to a full matrix after indexing was completed.
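
The 304 Mbyte figure follows from the storage formula and the w100 row of Table 1. The sketch below evaluates the formula and the savings relative to a fully populated matrix; the function name is hypothetical, and only the formula and the counts come from the text.

```python
def sparse_bytes(n_terms, nonzero, bytes_per_int):
    """Storage for the IS variation on the Harwell-Boeing format: S(3N + 2C + 2N)."""
    return bytes_per_int * (3 * n_terms + 2 * nonzero + 2 * n_terms)


N, C, S = 1_207_560, 34_982_212, 4   # w100 values from Table 1, 4-byte integers
sparse = sparse_bytes(N, C, S)
dense = N * N * S                    # every cell stored
savings = 1 - sparse / dense

print(sparse)                        # 304008896 bytes, about 304 Mbytes
print(savings > 0.99)                # True
```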

Table 1: Term co-occurrence matrix properties

Dataset   Term count   Non-zero co-occurrence scores   Sparsity
wt10g     310050       27233214                        0.00028329
w100      1207560      34982212                        0.00002399
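
The Sparsity column is consistent with the number of non-zero scores divided by the total number of cells, N times N. A brief check under that assumption (the function name is hypothetical):

```python
def fill_ratio(n_terms, nonzero):
    """Fraction of the N by N term matrix cells that are non-zero."""
    return nonzero / (n_terms * n_terms)


print(f"{fill_ratio(310_050, 27_233_214):.8f}")    # wt10g
print(f"{fill_ratio(1_207_560, 34_982_212):.8f}")  # w100
```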

Indexing and Retrieval System-Based Performance Measures

For TREC8, IS was able to index w100 in 5 hours, and process all 10K VLC queries in about 52 seconds. The TREC9 implementation did not strive for such high system performance measures: term co-occurrence added significantly to the indexing overhead, as did identification of tag sets within documents. Indexing time for the w100 was about 120 hours; the wt10g took about 20 hours.

As for TREC8, all indexing and retrieval was completed on UNC’s Sun Enterprise Server 10000, a high-end server that was shared with many other processes. The ES10000 had 36 processors and 20GB of memory, but IS utilized only one processor at a time and operated in less than 2GB of memory. A high-speed disk subsystem with a tape-to-disk robot enabled virtually unlimited storage with latency of less than a minute for staging the files to be indexed.

Retrieval for the wt10g took well under .1 seconds per query. Query processing involved minimal disk access: the key to the inverted index was read into memory, as was the term hash and full co-occurrence matrix. Disk access was needed to get inverted index entries (that is, the list of documents containing each expanded term) and to map document ID numbers to TREC document strings.

For the w100, retrieval time depended on what sort of query expansion was used. When simple query expansion by 25 terms was used, as described above, queries were completed in an average of .21 seconds across the 10K topic statements. A more sophisticated query expansion model was attempted, in which several iterations and permutations on the co-occurrence matrix were made. The retrieval performance for this variation is not known, because the w100 runs were not judged, but the system performance of over 9 seconds per query is not favorable.

Table 2: System performance for indexing and retrieval

Measure          Dataset   Value           Notes
Build index      wt10g     20 hours
                 w100      120 hours
Index size       wt10g     .58GB
                 w100      2.7GB
Retrieval time   wt10g     .1 sec/query
                 w100      .21 sec/query   method 1: simple expansion
                 w100      9.7 sec/query   method 2: complicated expansion

Retrieval Performance

Because the results for wt10g were not judged, there is some risk of bias in interpretation of the TREC performance measures. However, an informal evaluation of non-judged documents for a set of 6 topics gave the author some confidence that the retrieval performance measures are reasonably indicative of IS’ performance in TREC9.

Because there are essentially no judgments for the VLC that are useful for evaluating the w100 submission discussed above, no retrieval performance measures can be discussed here.

For the main Web task, recall from above that the main goal for this year’s work was to have high early precision by utilizing the structure of HTML documents. The reasoning was that terms in the title, h1, h2 and h3 tags were most indicative of a document’s content. Thus, indexing and retrieval focused on terms in those tags.

In hindsight, it was poor judgment to apply query expansion. In reading through highly-ranked documents, many documents had expanded terms but no query terms. More effective term weighting would have helped avoid this problem, although computation of term weights was hindered by the particular file structures employed (because counts of the frequency of term occurrences were not kept at a document level, only a tag level).

A better approach would have been to bypass the use of the co-occurrence matrix entirely in order to develop baseline retrieval performance. In other words, to perform simple ranked Boolean retrieval based only on terms occurring in the targeted tag sets. Although this would have resulted in several TREC9 topics with no results, a far larger dataset (either w100 or, more interestingly, the Web as a whole), presumably would have produced results for all 50 topic statements.
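
One possible reading of such a baseline is sketched below: documents matching any query term are retrieved (Boolean union) and ranked by how many distinct query terms they match. The ranking rule, the function name and the dictionary-based inverted index are assumptions for illustration; the paper does not specify them.

```python
from collections import Counter


def ranked_boolean(query_terms, inverted_index):
    """Rank documents by the number of distinct query terms they contain.

    inverted_index maps each indexed term to the set of document IDs whose
    targeted tags contain it. Ties are broken by document ID.
    """
    hits = Counter()
    for term in set(query_terms):
        for doc_id in inverted_index.get(term, ()):
            hits[doc_id] += 1
    return sorted(hits, key=lambda doc_id: (-hits[doc_id], doc_id))
```

A query whose terms never appear in the targeted tags returns an empty list, which corresponds to the topics with no results mentioned above.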

A challenge in seeking strong retrieval performance combined with strong system-based performance measures is the conflict in the number of documents that can be evaluated. Conceptually, IS (like LSI) would like to evaluate the relationship between every single document in the collection and a query. This is because the IS technique (like LSI) enables matching based on concepts even when terms do not match. However, for practical purposes this is not feasible: evaluating all 18M w100 documents would take too long.

There may be a solution to managing the size of the problem for computing all possible document relations, as discussed in the author’s submission to the TREC8 proceedings. But in the meantime, the time-tested approach for IR is to only consider the subset of documents that contain terms of interest: either the query terms themselves, or the query terms plus expanded terms.

Based on the previous paragraphs, the IS system was implemented to evaluate a larger subset of documents than would be evaluated based on a simple Boolean matching of query terms, but far smaller than the complete document set. This is a goal consistent with traditional goals of the IS approach, but (again, in hindsight) probably not a good match for efforts at high early precision based on a limited number of HTML tags.

The specifics of retrieval performance are as follows. For wt10g, four variations on the steps described above were submitted:

1. iswt: title-only
2. iswtd: title + description
3. iswtdn: title + description + narrative
4. isnnwt: title + description + narrative, but with “not” or “non-relevant” phrases automatically removed

Retrieval performance for all four sets was not outstanding. Table 3 shows that the overall number of relevant documents retrieved @1000 is fairly low, with under 10% of relevant documents identified by any set. Intuitively, this would be the retrieval performance statistic most likely to be hurt by non-judged sets.

Table 3: Relevant retrieved @1000

                             iswt    iswtd   iswtdn  isnnwt
Best                         1       2       2       1
>= Median                    3       3       4       1
Worst                        12      11      13      24
Total relevant retrieved     242     236     172     126
% total relevant retrieved   9.25%   9.02%   6.57%   4.81%

Retrieval performance based on average precision tells approximately the same tale. IS tended to have scores above the median when the median scores were relatively low, without ever achieving average precision over 0.33.

Table 4: Average precision

             iswt    iswtd   iswtdn  isnnwt
Best         1       2       2       1
>= Median    8       7       7       5
Worst        13      12      12      24

What of early precision? Precision at 10 docs (P@10) across the 4 sets was not as high as hoped. None of the sets achieved perfect precision at 5 or 10 documents. Fewer than ½ of the queries for all sets resulted in any relevant documents at all in the top 10, which is disappointing. However, as shown in Table 5, there were numerous queries with numbers of relevant documents in the top 5 or 10 documents presented.

Table 5: Precision at 5 and 10 documents

P@5 score    iswt    iswtd   iswtdn  isnnwt
0.8          0       1       0       0
0.6          1       2       3       0
0.4          4       7       3       6
0.2          12      13      16      11

P@10 score   iswt    iswtd   iswtdn  isnnwt
0.5          0       1       0       0
0.4          2       1       2       0
0.3          4       10      4       2
0.2          5       1       4       4
0.1          11      16      14      13
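
For reference, the P@5 and P@10 scores in Table 5 follow the usual definition of precision at k: the fraction of the top k retrieved documents that are relevant. A minimal sketch, with a hypothetical function name and toy document IDs:

```python
def precision_at_k(ranked_doc_ids, relevant, k):
    """Fraction of the top k ranked documents that are in the relevant set."""
    top_k = ranked_doc_ids[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant) / k


# Two relevant documents among the top five gives a P@5 score of 0.4.
print(precision_at_k(["a", "b", "c", "d", "e"], {"a", "d"}, 5))  # 0.4
```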

The main trends evident from examining the TREC9 topics and IS retrieval performance are variability in the HTML document use of tags, and failure of query expansion. Variability is, as mentioned above, perhaps less of a problem in a larger dataset (w100 or the whole Web). Exact matches of title or title + description terms were fairly rare. Furthermore, more effective retrieval would necessitate additional examination of the terms within the documents, not only the four tags used here.

From this result, we tentatively conclude that better retrieval from HTML documents would involve multiple phases or ranking schemes. At one level, documents with matching <TITLE> or other key HTML tags should be given high consideration. At another level, more typical IR techniques should be employed in order to identify potentially useful documents that do not have the query terms in the <TITLE> or other targeted tags. Then, ranking schemes need to be developed to assess which documents from these two sets of candidates are best for retrieval.

For query expansion, as mentioned above, the danger is in retrieving documents on unrelated topics due to the variability in human language. There is little reason to doubt the general utility of query expansion based on the results here, and in fact prior IS entries to TREC have discussed the utility of the term correlation matrix for identifying synonyms.

For query expansion, we suggest that relatively inexpensive approaches, such as the co-occurrence matrix applied here, must be used with caution. More expensive approaches would, presumably, result in fewer ambiguous terms being used; such approaches might be applied at the indexing phase, the query phase, or the document ranking phase. Approaches could include dictionary lookups of term meanings and relations, more detailed statistical analysis (including LSI), and part of speech tagging. In fact, all three approaches and other variations have been used by IS in the past, and will be incorporated for further experimentation in IRTools.

Conclusion

Early precision was not achieved to the extent hoped for. The main problems were query expansion, which added some inappropriate terms to some topic statements, and reliance on only the <TITLE>, <H1>, <H2> and <H3> tags. For future work, terms from other tags will be included in the index, and query expansion will be employed more selectively.

Continued development of IRTools and the IS techniques it contains is anticipated to make it easier to incorporate multiple techniques without a large investment in programming time. A comparison of the relative contributions of the effects of such factors as stemming, PCA and LSI techniques, query expansion, term weighting and other approaches is needed to assess the situations in which each technique is most important for high precision or other goals.

References

Newby, Gregory B. 2000a. "Moving More Quickly Towards Full Term Relations in Information Space." Text REtrieval Conference (TREC-8) Proceedings. Gaithersburg, MD: National Institute of Standards and Technology. November 16-19, 1999.

Newby, Gregory B. 2000b. "The Science of Large-Scale Information Retrieval." Internet Archive 2000 Colloquium. San Francisco, March 8-9.

Newby, Gregory B. 1999. "Information Space Gets Normal." Text REtrieval Conference (TREC-7) Proceedings, pp. 567-571. Gaithersburg, MD: National Institute of Standards and Technology. November 9-11, 1998.
