Network analysis of texts

Vladimir Batagelj

University of Ljubljana
Institute of Mathematics, Physics and Mechanics
Department of Theoretical Computer Science
Jadranska 19, 1000 Ljubljana, Slovenia

Preprint series, Vol. 40 (2002), 833
ISSN 1318-4865

Vladimir Batagelj, University of Ljubljana, Faculty of Mathematics and Physics, Jadranska 19, 1000 Ljubljana, vladimir.batagelj@uni-lj.si
Andrej Mrvar, University of Ljubljana, Faculty of Social Sciences, Kardeljeva ploščad 5, 1000 Ljubljana, andrej.mrvar@uni-lj.si
Matjaž Zaveršnik, University of Ljubljana, Faculty of Mathematics and Physics, Jadranska 19, 1000 Ljubljana, matjaz.zaversnik@fmf.uni-lj.si

First version: July 28, 2002. Ljubljana, September 22, 2002.
Math. Subj. Class. (2000): 68 T 50, 91 F 20, 01 A 90, 05 C 70, 05 C 85, 92 H 30, 93 A 15, 68 T 30.
Presented at the Third Language Technologies Conference, October 14-15, 2002, Ljubljana, Slovenia. Supported by the Ministry of Education, Science and Sport of Slovenia, Project J1-8532.

Abstract

This paper presents different ways to derive networks from textual data, together with an overview of (possible) applications of network analysis to the analysis of texts. Several analyses of different text networks are given as illustrations.

Key words: text analysis, vocabulary, dictionary, citation, collaboration, core, normalization, temporal network.

1 Introduction

Different kinds of networks can be generated from existing electronic sources. Text is a special and frequent form of such data. This paper gives an overview of different ways to derive networks from textual data and to analyse them.

The obtained networks can be very large, having tens or hundreds of thousands of vertices.
Therefore special algorithms are needed to analyze and visualize them. All the analyses in the paper were done with Pajek, a program (for Windows) for large network analysis and visualization. It is freely available, for noncommercial use, at its site [2].

We assume that the reader is familiar with the basic notions of graph theory (see for example [32]).

2 Vocabularies

2.1 Transforming water into wine

In the recreational literature we find problems such as: transform water into wine through a sequence of words, each obtained from the previous one by changing one character (deleting, inserting or replacing it). For example

water – wader – wade – wane – wine

or

water – waver – wave – wive – wine

Given a vocabulary of a language (we are using Knuth’s vocabulary of English [21]) we can construct the corresponding transformations graph G = (V, E), in which the set of vertices V consists of the words from the vocabulary, and there is an edge (u : v) ∈ E linking the words u and v iff v can be obtained from u by changing one character. On this graph the recreational problem turns into the problem of finding a path between two given words. Usually we try to find a shortest such path, also called a geodesic. Figure 1 presents the graph of all geodesics leading from black to white.

Constructing the transformations graph is also interesting as a computational problem: how can it be done efficiently? We used the following approach: for each word w ∈ V a list of pairs (w′, w) is produced, where w′ is a transformation pattern in which the place of the transformation is indicated by a star *.
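The pattern approach can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' program: the names `patterns`, `transformation_edges` and `geodesic` and the toy vocabulary are hypothetical, and a hash table groups the words by pattern in place of sorting the list of pairs. Two words are linked when they share a pattern, and a breadth-first search then returns one shortest path (geodesic) between two words.

```python
from collections import defaultdict, deque
from itertools import combinations


def patterns(word):
    """Star patterns of a word: one per replaced character, one per insertion point."""
    replaced = [word[:i] + "*" + word[i + 1:] for i in range(len(word))]
    inserted = [word[:i] + "*" + word[i:] for i in range(len(word) + 1)]
    return replaced + inserted


def transformation_edges(vocabulary):
    """Link two words iff they share a pattern (hash grouping is this sketch's choice)."""
    groups = defaultdict(set)
    for word in vocabulary:
        for pattern in patterns(word):
            groups[pattern].add(word)
    edges = set()
    for words in groups.values():
        # Every pair of distinct words in one pattern group is an edge.
        edges.update(combinations(sorted(words), 2))
    return edges


def geodesic(edges, source, target):
    """Return one shortest path from source to target by breadth-first search, or None."""
    neighbours = defaultdict(set)
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    previous = {source: None}
    queue = deque([source])
    while queue:
        word = queue.popleft()
        if word == target:
            path = []
            while word is not None:
                path.append(word)
                word = previous[word]
            return path[::-1]
        for nxt in sorted(neighbours[word]):
            if nxt not in previous:
                previous[nxt] = word
                queue.append(nxt)
    return None


# Hypothetical toy vocabulary, chosen only to exercise the functions.
toy_vocabulary = ["water", "wader", "wade", "wane", "wine", "brain", "rain", "train"]
toy_edges = transformation_edges(toy_vocabulary)
print(geodesic(toy_edges, "water", "wine"))
```

Deletion needs no separate pattern in this sketch: deleting a character from the longer word is the same as inserting it into the shorter one, so the replacement pattern of the longer word coincides with an insertion pattern of the shorter word.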
For example, for the word brain we obtain the list

(*rain, brain), (b*ain, brain), (br*in, brain), (bra*n, brain), (brai*, brain),
(*brain, brain), (b*rain, brain), (br*ain, brain), (bra*in, brain), (brai*n, brain), (brain*, brain)

Figure 1: black – white

It holds: (u : v) ∈ E iff there exists a pattern p such that the union L of all lists contains both pairs (p, u) and (p, v). For example, (rain : brain) ∈ E since (*rain, rain), (*rain, brain) ∈ L; and (train : brain) ∈ E since (*rain, train), (*rain, brain) ∈ L.

To identify all such pairs efficiently we first sort the list L on the first elements of its pairs. In this way pairs with the same pattern are grouped together, and we only have to produce the corresponding edges for each such group. Note that the list L can also be viewed as a 2-mode (bipartite) graph between patterns and words.

Using a standard sorting algorithm the complexity of this procedure is of order O(|V| log |V|); it can be made linear by using bin sort.

Several transformations graphs in Pajek’s format are available at Pajek’s site.

2.2 Things to do

Transformations graphs can also be produced for other languages, provided the language vocabulary is available. For the Slovene language only a vocabulary of all word forms is freely available [24, 16]; it is not appropriate according to the 'recreational rules'.

It is also possible to introduce additional transformations, for example a swap (interchange) of two characters (with the empty character allowed):

life – file and arc – car

Other, linguistic relations between words are also interesting.
An example of such a data collection is WordNet, a lexical database for the English language [33, 22]. Pajek’s version of the WordNet data is in preparation.

3 Dictionaries

Several on-line dictionaries are available on the web in which each term is described using other terms, for example the Online Dictionary of Library and Information Science [25], the Free Online Dictionary of Computing [11], and The GNU collaborative international dictionary of English [13].

Such a dictionary can be transformed into a directed graph G = (V, A): the terms determine the set of vertices V, and there is an arc (u, v) ∈ A from term u to term v iff the term v appears in the description of term u (as a marked term).

We present some approaches to the analysis and visualization of dictionaries in a separate paper [5], demonstrating several options for analysis: searching for important, dense or in some other way interesting parts of the network; searching for important (central) words in the network; and visualization of the results.

4 Bibliographic networks

4.1 Collaboration networks

A 'classical' example of a collaboration network is the Erdős network [3].

Figure 2: Valued core of the collaboration network of Computational Geometry at level 46.

On the Internet many bibliographies in BibTeX format are available [8]. From such a bibliography an authors' collaboration network can be built. Its vertices represent different authors.
Two authors are linked with an edge iff they wrote a common paper. The weight of the edge is the number of publications they wrote together.

As an example we produced the authors' collaboration network based on the bibliography obtained from the Computational Geometry Database geombib [19]. Using a simple program in Python, the BibTeX data were transformed into the corresponding network and written to a file in Pajek format. The obtained network has 9072 vertices (authors) and 22577 edges, or 13567 edges as a simple network.

The problem with the obtained network is that it contains several vertices corresponding to the same author. Some are easy to guess (Pankaj K. Agarwal, P. Agarwal, Pankaj Agarwal, and P.K. Agarwal), but 'insider' information is needed to know that O. Schwarzkopf and O. Cheong are the same person. We manually produced the name equivalence partition and then shrank the network according to it. The reduced simple network contains 7343 vertices and 11898 edges.

To this network we applied the algorithm for determining valued cores, where the vertex value is the sum of the edge weights at the vertex [7]. The cut at level 46 gave the network presented in Figure 2.

4.2 Citation networks

Another interesting type of network that can be derived from bibliographic data is the citation network. Here the vertices are different publications from the selected area; two publications are connected by an arc if the first is cited by the second. Citation networks are almost acyclic.

A great source of the data needed for building citation networks is the Web of Science [18], from which a selection of networks was constructed [26].

Citation network analysis started with the paper [12] in which, on the example of Asimov’s history of DNA, it was shown that the analysis "demonstrated a high degree of coincidence between an historian’s account of events and the citational relationship between these events". The next step was made by Hummon and Doreian [17].
They proposed three indices (NPPC, SPLC, SPNP), weights of arcs that provide an automatic way to identify the (most) important part of the citation network: the main path analysis.

We developed algorithms to compute Hummon and Doreian’s weights efficiently [1], so that they can also be used for the analysis of very large citation networks with several thousands of vertices. In Figure 3 we present the main path determined in the SOM (self-organizing maps) citation network (4470 vertices and 12731 arcs).

For the 2001 Graph-Drawing Conference Contest the contest graph A was a self-citing network [14] of GD Conference proceedings. There is a vertex for every paper in the proceedings of GD94 to GD2000, and an arc if a paper refers to another GD paper.

5 Text analysis networks

5.1 Reuters terror network

Centering Resonance Analysis (CRA) is a new text analysis technique developed by Steve Corman and Kevin Dooley at Arizona State University [10]. It uses natural language processing and network text analysis techniques to produce abstract representations of texts.

To demonstrate CRA they produced and analyzed several networks. Among them is the Reuters terror news network, which is based on all news released during 66 consecutive days by the news agency Reuters concerning the September 11 attack on the U.S., beginning at 9:00 AM EST 9/11/01. The vertices of the network are words; there is an edge between two words iff they appear in the same text unit (sentence). The weight of an edge is its frequency.
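The construction of such a word network can be illustrated with a short Python sketch. It only counts, for each pair of words, the number of sentences in which both occur. The function name `cooccurrence_network`, the crude whitespace tokenizer and the two toy sentences are assumptions made for illustration; CRA itself applies natural language processing that is not reproduced here.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_network(sentences):
    """Weighted word network: weight = number of sentences containing both words."""
    weights = Counter()
    for sentence in sentences:
        # Hypothetical tokenizer: split on whitespace, strip punctuation, lowercase.
        words = {token.strip(".,;:!?\"'()").lower() for token in sentence.split()}
        words.discard("")
        # Each unordered pair of distinct words in the sentence gains one unit of weight.
        for u, v in combinations(sorted(words), 2):
            weights[(u, v)] += 1
    return weights


# Made-up sentences, not taken from the Reuters data.
toy_sentences = [
    "Taliban rulers reject demands.",
    "Demands on Taliban rulers grow.",
]
network = cooccurrence_network(toy_sentences)
print(network[("rulers", "taliban")])
```

Edges are stored as alphabetically ordered pairs, so each undirected edge has a single key; a pair that never co-occurs simply has weight 0.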
This network was selected by Viszards (a network visualization group) as the case study network for a special visualization session at the Sunbelt XXII International Sunbelt Social Network Conference, New Orleans, USA, 13-17 February 2002. Different approaches to the analysis of the Reuters terror news network were presented.

Figure 3: Main path in SOM citation network.

We transformed the sequence of CRA networks into a single Pajek temporal network and analyzed it using Pajek [4]. It has n = 13332 vertices (different words in the news) and m = 243447 edges. We present here only two results.

Using cores [6], we identified the most important words in the total network and determined their layout. Then we produced a sequence of pictures (one for each day) displaying the changes in the focus of the news. Figure 4 shows the picture for the 58th day. The pictures were realized in SVG with JavaScript support for interactive viewing at different levels.

Figure 4: Main links in messages on the 58th day.

The second picture, in Figure 5, presents a segment of the display of the total matrix of the 1111 most important vertices (determined by a cut). To neutralize the most frequent words we normalized the matrix using the geometric normalization

$$\mathrm{Geo}_{uv} = \frac{w_{uv}}{\sqrt{w_{uu}\, w_{vv}}}$$

Different stories appeared as connected components.
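As a small illustration of the geometric normalization, the sketch below divides each weight w_uv by the square root of the product of the diagonal entries w_uu and w_vv. The function name `geometric_normalization` and the toy matrix are hypothetical. The text does not say how the diagonal is filled, so the sketch simply assumes the diagonal entries are given and leaves an entry at 0 when either one is zero.

```python
import math


def geometric_normalization(w):
    """Return Geo with Geo[u][v] = w[u][v] / sqrt(w[u][u] * w[v][v])."""
    n = len(w)
    geo = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            scale = math.sqrt(w[u][u] * w[v][v])
            # Assumption of this sketch: entries with a zero diagonal stay at 0.
            if scale > 0:
                geo[u][v] = w[u][v] / scale
    return geo


# Made-up symmetric weight matrix with positive diagonal entries.
toy_matrix = [
    [4.0, 2.0, 0.0],
    [2.0, 9.0, 3.0],
    [0.0, 3.0, 1.0],
]
geo = geometric_normalization(toy_matrix)
print(geo[0][1], geo[1][2])
```

After normalization every positive diagonal entry becomes 1, so a word with a very large diagonal value no longer dominates its row and column.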
5.2 Other

Another source of temporal network data is provided by the KEDS encodings of news [20].

For the 1999 Graph-Drawing Conference Contest the contest graph A was a temporal network representing different relations among characters in the German TV series 'Lindenstrasse' [15].

Figure 5: Geometric normalization.

For some additional ideas on text analysis read [27].

6 Conclusions

The transformation of textual data into a corresponding network is much easier if the data are structured using some kind of markup such as [30, 28]. The spread of XML based applications will contribute a lot in this direction.

We also expect many applications of network analysis in the implementations of the Semantic Web [9, 23, 29, 31].
References

[1] V. Batagelj. Efficient algorithms for citation network analysis. Submitted, 2002.
[2] V. Batagelj and A. Mrvar. Pajek – A program for large network analysis. Connections, 21(2):47–57, 1998. http://vlado.fmf.uni-lj.si/pub/networks/pajek/.
[3] V. Batagelj and A. Mrvar. Some analyses of Erdős collaboration graph. Social Networks, 22:173–186, 2000. http://vlado.fmf.uni-lj.si/pub/networks/doc/erdos/.
[4] V. Batagelj and A. Mrvar. Reuters terror news network analysis with Pajek. To appear in JoSS, 2002. http://www2.heinz.cmu.edu/project/INSNA/joss/.
[5] V. Batagelj, A. Mrvar, and M. Zaveršnik. Network analysis of dictionaries. To appear in Information Society’02, Language technologies proceedings, 2002.
[6] V. Batagelj and M. Zaveršnik. An O(m) algorithm for cores decomposition of networks. Submitted, 2001.
[7] V. Batagelj and M. Zaveršnik. Generalized cores. Submitted, 2002.
[8] N. H. F. Beebe. Bibliographies page, 2002. http://www.math.utah.edu/~beebe/bibliographies.html.
[9] T. Berners-Lee, J. Hendler, and O. Lassila. The Semantic Web. Scientific American, 2001.
[10] CRA. Analyses of news stories on the terrorist attack, 2001. http://locks.asu.edu/terror/.
[11] FOLDOC. Free on-line dictionary of computing, 2002. http://wombat.doc.ic.ac.uk/foldoc/.
[12] E. Garfield, I. H. Sher, and R. J. Torpie. The use of citation data in writing the history of science, 1964. http://www.garfield.library.upenn.edu/papers/useofcitdatawritinghistofsci.pdf.
[13] GCIDE XML. The GNU version of the collaborative international dictionary of English, presented in the extensible markup language, 2002. http://www.ibiblio.org/webster/.
[14] GD01 contest. GD proceedings self-citation network, 2001. http://www.infosun.fmi.uni-passau.de/GD2001/graphA/.
[15] GD99 contest. 'Lindenstrasse' network, 1999. http://kam.mff.cuni.cz/conferences/GD99/contest/graphs/A.html.
[16] GNUsl. prosto programje in slovenščina, 2002. http://nl.ijs.si/GNUsl/.
[17] N. P. Hummon and P. Doreian. Connectivity in a citation network: The development of DNA theory. Social Networks, 11:39–63, 1989.
[18] ISI. Web of science, 2002. http://www.isinet.com/isi/products/citation/wos/.
[19] B. Jones. Computational geometry database, 2002. http://compgeom.cs.uiuc.edu/~jeffe/compgeom/biblios.html, ftp://ftp.cs.usask.ca/pub/geometry/.
[20] KEDS. Kansas event data system, 2002. http://www.ukans.edu/~keds/.
[21] D. E. Knuth. The Stanford GraphBase: A platform for combinatorial computing. ACM Press and Addison-Wesley, New York, 1993. http://www-cs-faculty.stanford.edu/~knuth/sgb.html.
[22] Lexical FreeNet. Connected thesaurus, 2002. http://www.lexfn.com/.
[23] M. Marko, M. A. Porter, A. Probst, C. Gershenson, and A. Das. Transforming the World Wide Web into a complexity-based semantic network, 2002. http://arxiv.org/html/cs.NI/0205080.
[24] NL. Natural language server at the Department of Intelligent Systems, Institute "Jožef Stefan", 2002. http://nl.ijs.si/.
[25] ODLIS. Online dictionary of library and information science, 2002. http://vax.wcsu.edu/library/odlis.html.
[26] Pajek’s datasets. Citation networks, 2002. http://vlado.fmf.uni-lj.si/pub/networks/data/cite/.
[27] R. Popping. Computer-assisted Text Analysis. Sage, London, 2000.
[28] Reuters. Corpus XML, 2002. http://about.reuters.com/researchandstandards/corpus/.
[29] SemanticWeb. The semantic web community portal, 2002. http://www.semanticweb.org/.
[30] TEI. Consortium Website, 2002. http://www.tei-c.org/.
[31] W3C Semantic Web, 2002. http://www.w3.org/2001/sw/.
[32] R. J. Wilson and J. J. Watkins. Graphs, An Introductory Approach. Wiley, 1990. Slovene translation: DMFA RS, Ljubljana, 1997.
[33] WordNet. A lexical database for the English language, 2002. http://www.cogsci.princeton.edu/~wn/.
