University of Ljubljana
Institute of Mathematics, Physics and Mechanics
Department of Theoretical Computer Science
Jadranska 19, 1000 Ljubljana, Slovenia
Preprint series, Vol. 40 (2002), 833
NETWORK ANALYSIS OF TEXTS
Vladimir Batagelj, Andrej Mrvar,
Matjaž Zaveršnik
ISSN 1318-4865
First version: July 28, 2002
Math.Subj.Class.(2000): 68 T 50, 91 F 20, 01 A 90, 05 C 70, 05 C 85,
92 H 30, 93 A 15, 68 T 30.
Presented at the Third Language Technologies Conference, October 14-15, 2002, Ljubljana, Slovenia.
Supported by the Ministry of Education, Science and Sport of Slovenia,
Project J1-8532.
Ljubljana, September 22, 2002
Network analysis of texts
Vladimir Batagelj
University of Ljubljana, Faculty of Mathematics and Physics,
Jadranska 19, 1000 Ljubljana
vladimir.batagelj@uni-lj.si
Andrej Mrvar
University of Ljubljana, Faculty of Social Sciences,
Kardeljeva ploščad 5, 1000 Ljubljana
andrej.mrvar@uni-lj.si
Matjaž Zaveršnik
University of Ljubljana, Faculty of Mathematics and Physics,
Jadranska 19, 1000 Ljubljana
matjaz.zaversnik@fmf.uni-lj.si
Abstract
The paper presents different ways to derive networks from textual data, together with an overview of (possible) applications of network analysis to the analysis of texts. Several examples of analyses of different text networks are given as illustrations.
Key words: text analysis, vocabulary, dictionary, citation, collaboration, core, normalization, temporal network.
Math. Subj. Class. (2000): 68 T 50, 91 F 20, 01 A 90, 05 C 70, 05 C 85, 92 H 30, 93 A 15, 68 T 30.
1 Introduction
Different kinds of networks can be generated from already existing electronic sources. Text is a special and frequent form of such data. This paper gives an overview of different ways to derive networks from textual data and to analyse them.
The obtained networks can be very large, having tens or hundreds of thousands of vertices. Therefore special algorithms are needed to analyze and visualize them.
All the analyses in the paper were done with Pajek, a program (for Windows) for large
network analysis and visualization. It is freely available, for noncommercial use, at its site
[2].
We shall assume that the reader is familiar with the basic notions of graph theory (see
for example [32]).
2 Vocabularies
2.1 Transforming water into wine
In the recreational literature we find problems such as: transform the water into wine by
a sequence of words obtained by changing one character (deleting, inserting or replacing it)
each time. For example
water – wader – wade – wane – wine
or
water – waver – wave – wive – wine
Given a vocabulary of a language (we are using Knuth's vocabulary of English [21]) we can construct the corresponding transformations graph G = (V, E) in which the set of vertices V consists of the words from the vocabulary, and there is an edge (u : v) ∈ E linking the words u and v iff v can be obtained from u by changing one character. On this graph the recreational problem turns into the problem of determining a path between two given words. Usually we try to find a shortest such path, also called a geodesic. Figure 1 presents the graph of all geodesics leading from black to white.
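A minimal sketch of the geodesic search in Python, assuming the transformations graph is already available as an adjacency dictionary. The tiny graph below is hypothetical: it contains only the two chains quoted above, and the function name `shortest_path` is illustrative and not part of Pajek.

```python
from collections import deque

def shortest_path(adj, source, target):
    """Breadth-first search; returns one shortest path or None if unreachable."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if u == target:
            # Walk back through the parents to recover the path.
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in adj.get(u, ()):
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return None

# Hypothetical toy graph: only the edges of the two chains from the text.
chains = [
    ["water", "wader", "wade", "wane", "wine"],
    ["water", "waver", "wave", "wive", "wine"],
]
adj = {}
for chain in chains:
    for a, b in zip(chain, chain[1:]):
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

path = shortest_path(adj, "water", "wine")
print(" - ".join(path))
```

Breadth-first search visits vertices in order of their distance from the source, so the first path that reaches the target is a geodesic. Which of the two chains is returned depends on the iteration order of the neighbour sets.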
Constructing the transformations graph is also interesting as a computational problem: how to do it efficiently? We used the following approach: for each word w ∈ V a list of pairs (w′, w) is produced, where w′ is a transformation pattern in which the place of transformation is indicated by a star *. For example, for the word brain we obtain the list
(*rain, brain), (b*ain, brain),
(br*in, brain), (bra*n, brain),
(brai*, brain), (*brain, brain),
(b*rain, brain), (br*ain, brain),
(bra*in, brain), (brai*n, brain),
(brain*, brain)
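The pattern list for a word can be generated mechanically. The following Python sketch is one possible way to do it; the function name `patterns` is an assumption made for illustration. Replacement patterns put the star in place of a character, insertion patterns put it between characters. A deletion needs no pattern of its own, because deleting a character from one word is the same as inserting it into the other.

```python
def patterns(word, star="*"):
    """Return all transformation patterns of a word, replacements first."""
    n = len(word)
    # The star replaces the character at position i.
    replace = [word[:i] + star + word[i + 1:] for i in range(n)]
    # The star is inserted before position i (i == n appends it).
    insert = [word[:i] + star + word[i:] for i in range(n + 1)]
    return replace + insert

pairs = [(p, "brain") for p in patterns("brain")]
for pair in pairs:
    print(pair)
```

A word of length n yields 2n + 1 pairs, so the list for brain has 11 entries.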
Figure 1: black – white (graph of all geodesics leading from black to white).
It holds: (u : v) ∈ E iff there exists a pattern p such that the union L of all lists contains
both pairs (p, u) and (p, v). For example
(rain: brain) ∈ E
since (*rain, rain), (*rain, brain) ∈ L; and
(train: brain) ∈ E
since (*rain, train), (*rain, brain) ∈ L .
To identify all such pairs efficiently we first sort the list L on the first elements of its pairs. In this way pairs with the same pattern are grouped together, and we only have to produce the corresponding edges for each such group. Note also that the list L can be viewed as a 2-mode (bipartite) graph between patterns and words.
Using a standard sorting algorithm the complexity of this procedure is of order O(|V| log |V|); it can be made linear by using bin sort.
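The sort-and-group procedure can be sketched in Python as follows. This is an illustration under stated assumptions, not the program used for the paper: the names `patterns` and `transformation_edges` are hypothetical, and the sketch relies on Python's built-in comparison sort rather than bin sort.

```python
from itertools import combinations, groupby

def patterns(word, star="*"):
    """All replacement and insertion patterns of a word."""
    n = len(word)
    replace = [word[:i] + star + word[i + 1:] for i in range(n)]
    insert = [word[:i] + star + word[i:] for i in range(n + 1)]
    return replace + insert

def transformation_edges(vocabulary):
    """Return the edge set of the transformations graph as sorted word pairs."""
    # The list L of (pattern, word) pairs, sorted on the pattern.
    pairs = sorted((p, w) for w in set(vocabulary) for p in patterns(w))
    edges = set()
    for _, group in groupby(pairs, key=lambda pair: pair[0]):
        words = [w for _, w in group]
        # Every two words sharing a pattern are linked by an edge.
        for u, v in combinations(words, 2):
            edges.add((u, v))
    return edges

print(sorted(transformation_edges(["rain", "brain", "train"])))
```

The edges are collected in a set because two words can share more than one pattern, which would otherwise produce parallel edges.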
Several transformations graphs in Pajek’s format are available at Pajek’s site.
2.2 Things to do
The transformations graphs can also be produced for other languages, provided the language vocabulary is available. For the Slovene language only a vocabulary of all word forms is freely available [24, 16]; it is not appropriate according to the 'recreational rules'.
It is also possible to introduce additional transformations. For example a swap (interchange) of two characters (with empty character allowed):
life – file
and
arc – car
Other, linguistic, relations between words are also interesting. An example of such a data collection is WordNet, a lexical database for the English language [33, 22]. A Pajek version of the WordNet data is in preparation.
3 Dictionaries
On the web several on-line dictionaries are available in which each term is described using
other terms. For example: Online Dictionary of Library and Information Science [25], Free
Online Dictionary of Computing [11], and The GNU collaborative international dictionary
of English [13].
Such a dictionary can be transformed into a directed graph G = (V, A): the terms determine the set of vertices V, and there is an arc (u, v) ∈ A from term u to term v iff the term v appears in the description of term u (as a marked term).
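A minimal Python sketch of this construction, assuming a hypothetical markup in which marked terms are enclosed in curly braces. The three entries are invented for illustration, and real dictionaries may mark terms differently.

```python
import re

def dictionary_arcs(entries):
    """entries: dict term -> description with marked terms in {braces}.

    Returns the set of arcs (u, v) such that v is marked in the description of u.
    """
    arcs = set()
    for term, description in entries.items():
        for marked in re.findall(r"\{([^{}]+)\}", description):
            # Only marked terms that are themselves entries belong to V.
            if marked in entries:
                arcs.add((term, marked))
    return arcs

# Invented toy dictionary.
entries = {
    "graph": "A set of {vertex} objects joined by lines, each called an {edge}.",
    "vertex": "A point of a {graph}.",
    "edge": "A line joining two {vertex} objects of a {graph}.",
}
for arc in sorted(dictionary_arcs(entries)):
    print(arc)
```

Marked terms that have no entry of their own are ignored here; whether to keep them as additional vertices is a design choice.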
We present some approaches to analysis and visualization of dictionaries in a separate
paper [5], demonstrating several options for analysis: searching for important, dense or in
some other way interesting parts of network; searching for important (central) words in
networks; and visualization of results.
4 Bibliographic networks
4.1 Collaboration networks
A 'classical' example of a collaboration network is the Erdős network [3].
On the Internet many bibliographies in BibTeX format are available [8]. From such a bibliography an authors' collaboration network can be built. Its vertices represent different authors. Two authors are linked with an edge iff they wrote a common paper. The weight of the edge is the number of publications they wrote together.
Figure 2: Valued core of the collaboration network of Computational Geometry at level 46.
As an example we produced the authors' collaboration network based on the bibliography obtained from the Computational Geometry Database geombib [19].
Using a simple program in Python, the BibTeX data were transformed into the corresponding network and written to a file in Pajek format. The obtained network has 9072 vertices (authors) and 22577 edges (13567 edges as a simple network).
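The original Python program is not reproduced in the paper. The following sketch shows one way such a transformation could look, assuming the BibTeX records have already been parsed into lists of author names. The function names and the placeholder authors are hypothetical.

```python
from collections import Counter
from itertools import combinations

def collaboration_network(papers):
    """papers: iterable of author lists. Returns Counter {(a, b): joint papers}."""
    weights = Counter()
    for authors in papers:
        # Each unordered pair of co-authors of a paper gets one more joint paper.
        for a, b in combinations(sorted(set(authors)), 2):
            weights[(a, b)] += 1
    return weights

def to_pajek(weights):
    """Serialize a weighted simple network in Pajek's *Vertices / *Edges layout."""
    names = sorted({name for edge in weights for name in edge})
    index = {name: i for i, name in enumerate(names, start=1)}
    lines = [f"*Vertices {len(names)}"]
    lines += [f'{i} "{name}"' for name, i in index.items()]
    lines.append("*Edges")
    lines += [f"{index[a]} {index[b]} {w}" for (a, b), w in sorted(weights.items())]
    return "\n".join(lines)

# Placeholder records, invented for illustration.
papers = [
    ["Author A", "Author B"],
    ["Author A", "Author B", "Author C"],
    ["Author D", "Author E"],
]
weights = collaboration_network(papers)
print(to_pajek(weights))
```

In this representation `len(weights)` is the number of edges of the simple network, while `sum(weights.values())` counts one edge per paper and pair of co-authors. Reading the two edge counts reported above in this way is an assumption.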
The problem with the obtained network is that it contains several vertices corresponding to the same author. Some cases are easy to guess (Pankaj K. Agarwal, P. Agarwal, Pankaj Agarwal, and P.K. Agarwal), but 'insider' information is needed to know that O. Schwarzkopf and O. Cheong are the same person. We manually produced the name equivalence partition and then shrank the network according to it. The reduced simple network contains 7343 vertices and 11898 edges.
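Shrinking a network according to a name equivalence partition can be sketched as follows. The name variants are the ones mentioned above. The edge weights and "Coauthor X" are invented for illustration, and dropping the loops that arise when two variants of one name are linked is an assumption of this sketch.

```python
from collections import Counter

def shrink(weights, canonical):
    """Merge vertices according to a name map and sum the weights of merged edges."""
    merged = Counter()
    for (a, b), w in weights.items():
        a, b = canonical.get(a, a), canonical.get(b, b)
        if a != b:  # assumption: loops are dropped to keep the network simple
            merged[tuple(sorted((a, b)))] += w
    return merged

# Name equivalence partition, written as a map to one representative per class.
canonical = {
    "P. Agarwal": "Pankaj K. Agarwal",
    "Pankaj Agarwal": "Pankaj K. Agarwal",
    "P.K. Agarwal": "Pankaj K. Agarwal",
    "O. Cheong": "O. Schwarzkopf",
}
# Invented weights, for illustration only.
weights = {
    ("P. Agarwal", "Coauthor X"): 2,
    ("Pankaj Agarwal", "Coauthor X"): 1,
    ("P.K. Agarwal", "Pankaj K. Agarwal"): 1,
    ("O. Cheong", "Coauthor X"): 4,
    ("O. Schwarzkopf", "Coauthor X"): 1,
}
reduced = shrink(weights, canonical)
for edge, w in sorted(reduced.items()):
    print(edge, w)
```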
To this network we applied the algorithm for determining valued cores, where the value of a vertex is the sum of the weights of the edges at that vertex [7]. The cut at level 46 gave the network presented in Figure 2.
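The idea of a valued core at a given level can be illustrated by simple peeling: vertices whose sum of edge weights falls below the level are removed repeatedly, until every remaining vertex reaches the level within the remaining subnetwork. The sketch below is a plain illustration of this idea, not the algorithm of [7], and the toy weights are invented.

```python
def valued_core(weights, level):
    """Return the vertices whose weight sum inside the core is at least `level`."""
    adj = {}
    for (a, b), w in weights.items():
        adj.setdefault(a, {})[b] = w
        adj.setdefault(b, {})[a] = w
    strength = {v: sum(nbrs.values()) for v, nbrs in adj.items()}
    stack = [v for v, s in strength.items() if s < level]
    removed = set(stack)
    while stack:
        v = stack.pop()
        # Removing v lowers the value of each neighbour that is still present.
        for u, w in adj[v].items():
            if u not in removed:
                strength[u] -= w
                if strength[u] < level:
                    removed.add(u)
                    stack.append(u)
    return set(adj) - removed

# Invented toy network.
weights = {("a", "b"): 5, ("b", "c"): 5, ("a", "c"): 5, ("c", "d"): 2, ("d", "e"): 9}
print(sorted(valued_core(weights, 10)))
```

In the toy network the vertex e has value 9 and is removed first. This lowers the value of d to 2, so d is removed as well, and the triangle a, b, c remains.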
4.2 Citation networks
Another interesting type of network that can be derived from bibliographical data is the citation network. Here the vertices are different publications from the selected area; two publications are connected by an arc if the first is cited by the second. Citation networks are almost acyclic.
A great source of the necessary data for building citation networks is the Web of Science
[18] from where a selection of networks was constructed [26].
Citation network analysis started with the paper [12] in which, on the example of Asimov's history of DNA, it was shown that the analysis "demonstrated a high degree of coincidence between an historian's account of events and the citational relationship between these events". The next step was made by Hummon and Doreian [17]. They proposed three indices (NPPC, SPLC, SPNP), weights of arcs that provide an automatic way to identify the (most) important part of the citation network: the main path analysis. We developed algorithms to compute Hummon and Doreian's weights efficiently [1], so that they can also be used for the analysis of very large citation networks with several thousands of vertices.
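To illustrate the idea of arc weights and a main path, the sketch below uses a simple search path count as a stand-in: the weight of an arc is the number of paths from a source (a vertex with no incoming arcs) to a sink (a vertex with no outgoing arcs) that pass through it. This count is not claimed to coincide with any of the three indices of [17]. The function names, the toy network and the rule that ties are broken by label order are all assumptions. The network must be acyclic.

```python
def topological_order(vertices, succ):
    """Order the vertices so that every arc points forward (Kahn's method)."""
    indeg = {v: 0 for v in vertices}
    for u in vertices:
        for v in succ[u]:
            indeg[v] += 1
    order = [v for v in vertices if indeg[v] == 0]
    for u in order:  # the list grows while we iterate over it
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                order.append(v)
    if len(order) != len(vertices):
        raise ValueError("network is not acyclic")
    return order

def search_path_counts(arcs):
    """Weight of arc (u, v) = number of source-to-sink paths through it."""
    vertices = sorted({x for arc in arcs for x in arc})
    succ = {v: [] for v in vertices}
    pred = {v: [] for v in vertices}
    for u, v in arcs:
        succ[u].append(v)
        pred[v].append(u)
    order = topological_order(vertices, succ)
    from_sources, to_sinks = {}, {}
    for v in order:  # a source counts as one (empty) path
        from_sources[v] = sum(from_sources[u] for u in pred[v]) or 1
    for v in reversed(order):  # a sink counts as one (empty) path
        to_sinks[v] = sum(to_sinks[w] for w in succ[v]) or 1
    return {(u, v): from_sources[u] * to_sinks[v] for u, v in arcs}

def main_path(arcs):
    """Greedy main path: start at the heaviest source arc, follow heaviest arcs."""
    weight = search_path_counts(arcs)
    succ = {}
    for u, v in sorted(arcs):
        succ.setdefault(u, []).append(v)
    targets = {v for _, v in arcs}
    start_arcs = [(u, v) for u, v in sorted(arcs) if u not in targets]
    u, v = max(start_arcs, key=weight.get)
    path = [u, v]
    while v in succ:
        v = max(succ[v], key=lambda w, u=v: weight[(u, w)])
        path.append(v)
    return path

# Invented toy network; an arc (u, v) means that u is cited by v.
arcs = [("a", "b"), ("b", "c"), ("b", "d"), ("c", "e"),
        ("c", "f"), ("d", "e"), ("e", "g"), ("f", "g")]
print(search_path_counts(arcs))
print(main_path(arcs))
```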
In Figure 3 we present the main path determined in the SOM (self-organizing maps)
citation network (4470 vertices and 12731 arcs).
For the 2001 Graph-Drawing Conference Contest the contest graph A was a self-citing
network [14] of GD Conference proceedings. There is a vertex for every paper in the proceedings of GD94 to GD2000, and an arc, if a paper refers to another GD paper.
5 Text analysis networks
5.1 Reuters terror network
Centering Resonance Analysis (CRA) is a new text analysis technique developed by Steve
Corman and Kevin Dooley at Arizona State University [10]. It uses natural language processing and network text analysis techniques to produce abstract representations of texts.
To demonstrate CRA they produced and analyzed several networks. Among them is the Reuters terror news network, which is based on all news released during 66 consecutive days by the news agency Reuters concerning the September 11 attack on the U.S., beginning at 9:00 AM EST 9/11/01. The vertices of the network are words; there is an edge between two words iff they appear in the same text unit (sentence). The weight of an edge is its frequency.
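The edge definition can be sketched in Python as follows, assuming the text units have already been reduced to lists of words (CRA itself involves linguistic processing that is not shown here). The function name and the sample units are invented for illustration. Keeping one network per day reflects the temporal character of the data.

```python
from collections import Counter
from itertools import combinations

def daily_networks(units):
    """units: iterable of (day, words). Returns {day: Counter of weighted edges}."""
    daily = {}
    for day, words in units:
        network = daily.setdefault(day, Counter())
        # Each pair of distinct words in a text unit co-occurs once.
        for u, v in combinations(sorted(set(words)), 2):
            network[(u, v)] += 1
    return daily

# Invented text units.
units = [
    (1, ["attack", "tower", "plane"]),
    (1, ["attack", "plane"]),
    (2, ["anthrax", "letter"]),
]
daily = daily_networks(units)
total = sum(daily.values(), Counter())  # the total network over all days
for edge, weight in sorted(total.items()):
    print(edge, weight)
```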
This network was selected by Viszards (network visualization group) as the case study network for a special visualization session at the Sunbelt XXII International Sunbelt Social Network Conference, New Orleans, USA, 13-17 February 2002. Different approaches to the analysis of the Reuters terror news network were presented.
Figure 3: Main path in SOM citation network. Vertex labels: POGGIO-T-1975-V19-P201, KOHONEN-T-1976-V21-P85, KOHONEN-T-1976-V22-P159, KOHONEN-T-1977-V2-P1065, COOPER-LN-1979-V33-P9, BIENENSTOCK-EL-1982-V2-P32, ANDERSON-JA-1983-V13-P799, KNAPP-AG-1984-V10-P616, MCCLELLAND-JL-1985-V114-P159, CARPENTER-GA-1987-V37-P54, HECHTNIELSEN-R-1987-V26-P1892, HECHTNIELSEN-R-1987-V26-P4979, HECHTNIELSEN-R-1988-V1-P131, KOHONEN-T-1990-V78-P1464, BAUER-HU-1992-V3-P570, LI-X-1993-V70-P189, GASTEIGER-J-1994-V33-P643, GASTEIGER-J-1994-V116-P4608, BAUKNECHT-H-1996-V36-P1205, SCHNEIDER-G-1998-V70-P175, SCHNEIDER-G-1999-V237-P113, POLANSKI-J-2000-V3-P481, ZUEGGE-J-2001-V280-P19, ROCHE-O-2002-V3-P455.
We transformed the sequence of CRA networks into a single Pajek temporal network and analyzed it using Pajek [4]. It has n = 13332 vertices (different words in the news) and m = 243447 edges. We present here only two results.
Using cores [6], we identified the most important words in the total network and determined their layout. Then we produced a sequence of pictures (one for each day) displaying the changes in the attention of the news. Figure 4 presents the picture for the 58th day. The pictures were realized using SVG with JavaScript support for interactive viewing at different levels.
Figure 4: Main links in messages on the 58th day.
The second picture, in Figure 5, presents a segment of the display of the total matrix of the 1111 most important vertices (determined by a cut). To neutralize the effect of the most frequent words we normalized the matrix using the geometric normalization
\[ \mathrm{Geo}_{uv} = \frac{w_{uv}}{\sqrt{w_{uu} \, w_{vv}}} \]
Different stories appeared as connected components.
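The geometric normalization divides each weight by the geometric mean of the two corresponding diagonal entries. A minimal Python sketch, assuming the matrix is given as a dictionary that contains a positive diagonal entry for every vertex. The function name and the toy values are invented.

```python
import math

def geometric_normalization(w):
    """w: dict {(u, v): weight} that includes the diagonal entries (u, u)."""
    return {
        (u, v): value / math.sqrt(w[(u, u)] * w[(v, v)])
        for (u, v), value in w.items()
    }

# Invented toy matrix with two vertices.
w = {("a", "a"): 4, ("b", "b"): 9, ("a", "b"): 3, ("b", "a"): 3}
geo = geometric_normalization(w)
print(geo[("a", "b")])
```

After normalization all diagonal entries equal 1, so frequent and rare words become comparable.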
5.2 Other
Another source of temporal network data is the KEDS encodings of news [20].
For the 1999 Graph-Drawing Conference Contest the contest graph A was a temporal network representing different relations among characters in the German TV series 'Lindenstrasse' [15].
For some additional ideas on text analysis read [27].
Figure 5: Geometric normalization.
6 Conclusions
The transformation of textual data into a corresponding network is much easier if the data are structured using some kind of markup such as [30, 28]. The spread of XML based applications will contribute a lot in this direction.
We expect also many applications of network analysis in the implementations of the Semantic Web [9, 23, 29, 31].
References
[1] V. Batagelj. Efficient algorithms for citation network analysis. Submitted, 2002.
[2] V. Batagelj and A. Mrvar. Pajek – A program for large network analysis. Connections,
21(2):47–57, 1998. http://vlado.fmf.uni-lj.si/pub/networks/pajek/.
[3] V. Batagelj and A. Mrvar. Some analyses of Erdős collaboration graph. Social Networks, 22:173–186, 2000. http://vlado.fmf.uni-lj.si/pub/networks/doc/erdos/.
[4] V. Batagelj and A. Mrvar. Reuters terror news network analysis with Pajek. To appear in JoSS, 2002. http://www2.heinz.cmu.edu/project/INSNA/joss/.
[5] V. Batagelj, A. Mrvar, and M. Zaveršnik. Network analysis of dictionaries. To appear
in Information Society’02, Language technologies proceedings, 2002.
[6] V. Batagelj and M. Zaveršnik. An O(m) algorithm for cores decomposition of networks. Submitted, 2001.
[7] V. Batagelj and M. Zaveršnik. Generalized cores. Submitted, 2002.
[8] N. H. F. Beebe. Bibliographies page, 2002.
http://www.math.utah.edu/~beebe/bibliographies.html.
[9] T. Berners-Lee, J. Hendler, and O. Lassila. The Semantic Web. Scientific American,
2001.
[10] CRA. Analyses of news stories on the terrorist attack, 2001.
http://locks.asu.edu/terror/.
[11] FOLDOC. Free on-line dictionary of computing, 2002.
http://wombat.doc.ic.ac.uk/foldoc/.
[12] E. Garfield, I. H. Sher, and R. J. Torpie. The use of citation data in writing the history
of science, 1964.
http://www.garfield.library.upenn.edu/papers/useofcitdatawritinghistofsci.pdf.
[13] GCIDE XML. The GNU version of the collaborative international dictionary of English, presented in the extensible markup language, 2002.
http://www.ibiblio.org/webster/.
[14] GD01 contest. GD proceedings self-citation network, 2001.
http://www.infosun.fmi.uni-passau.de/GD2001/graphA/.
[15] GD99 contest. ‘Lindenstrasse’ network, 1999.
http://kam.mff.cuni.cz/conferences/GD99/contest/graphs/A.html.
[16] GNUsl. prosto programje in slovenščina, 2002. http://nl.ijs.si/GNUsl/.
[17] N. P. Hummon and P. Doreian. Connectivity in a citation network: The development
of DNA theory. Social Networks, 11:39–63, 1989.
[18] ISI. Web of science, 2002. http://www.isinet.com/isi/products/citation/wos/.
[19] B. Jones. Computational geometry database, 2002.
http://compgeom.cs.uiuc.edu/~jeffe/compgeom/biblios.html
ftp://ftp.cs.usask.ca/pub/geometry/.
[20] KEDS. Kansas event data system, 2002. http://www.ukans.edu/~keds/.
[21] D. E. Knuth. The Stanford GraphBase: A platform for combinatorial computing.
ACM Press and Addison-Wesley, New York, 1993.
http://www-cs-faculty.stanford.edu/~knuth/sgb.html.
[22] Lexical FreeNet. connected thesaurus, 2002. http://www.lexfn.com/.
[23] M. Marko, M. A. Porter, A. Probst, C. Gershenson, and A. Das. Transforming the
World Wide Web into a complexity-based semantic network, 2002.
http://arxiv.org/html/cs.NI/0205080.
[24] NL. Natural language server at dept. of intelligent systems institute “Jožef Stefan”,
2002. http://nl.ijs.si/.
[25] ODLIS. Online dictionary of library and information science, 2002.
http://vax.wcsu.edu/library/odlis.html.
[26] Pajek’s datasets. Citation networks, 2002.
http://vlado.fmf.uni-lj.si/pub/networks/data/cite/.
[27] R. Popping. Computer-assisted Text Analysis. Sage, London, 2000.
[28] Reuters. Corpus XML, 2002. http://about.reuters.com/researchandstandards/corpus/.
[29] SemanticWeb. The semantic web community portal, 2002.
http://www.semanticweb.org/.
[30] TEI. Consortium Website, 2002. http://www.tei-c.org/.
[31] W3C Semantic Web, 2002. http://www.w3.org/2001/sw/.
[32] R. J. Wilson and J. J. Watkins. Graphs, An Introductory Approach. Wiley, 1990. Slovene translation: DMFA RS, Ljubljana, 1997.
[33] WordNet. A lexical database for the English language, 2002.
http://www.cogsci.princeton.edu/~wn/.