CS6010 - SOCIAL NETWORK ANALYSIS - Unit 1 Notes
UNIT I
INTRODUCTION
Introduction to Web - Limitations of current Web – Development of Semantic Web – Emergence
of the Social Web – Statistical Properties of Social Networks -Network analysis - Development
of Social Network Analysis - Key concepts and measures in network analysis - Discussion
networks - Blogs and online communities - Web-based networks.
CP5074 – SOCIAL NETWORK ANALYSIS
Fig.1 Search results for the keyword CS6010 Social network analysis using Google
2. Show me photos of Paris
Typing “Paris photos” into a search engine returned the Google Images result shown below. The
search engine fails to discriminate between two categories of images: (i) those related to the city
of Paris and (ii) those showing Paris Hilton. While the search engine does a good job of retrieving
documents, the results of image searches in general are disappointing. For the keyword Paris most
of us would expect photos of places in Paris or maps of the city. In reality only about half of the
photos on the first page, a quarter of the photos on the second page and a fifth on the third page
are directly related to our concept of Paris. The rest are about clouds, people, signs, diagrams etc.
Problems:
Associating photos with keywords is a much more difficult task than simply looking for
keywords in the texts of documents.
Automatic image recognition is currently a largely unsolved research problem.
Search engines attempt to understand the meaning of the image solely from its context.
Find new music that I (might) like
This is a difficult query. From the perspective of automation, music retrieval is just as
problematic as image search. Music search engines do not exist, for several reasons: most music on
the internet is shared illegally through peer-to-peer systems that are completely out of reach for
search engines. Music is also a fast-moving good; search engines typically index the Web once a
month and are therefore too slow for the fast-moving world of music releases. Moreover, our
musical taste might change, in which case this query would need to change its form. A description
of our musical taste is something that we might list on our
homepage but it is not something that we would like to keep typing in again for accessing
different music-related services on the internet.
community, researchers have submitted publications or held an organizing role at any of the past
International Semantic Web Conferences.
The complete list of individuals in this community consists of 608 researchers mostly from
academia (79%) and to a lesser degree from industry (21%). Geographically, the community
covers much of the United States, Europe, with some activity in Japan and Australia.
The core technology of the Semantic Web, logic-based languages for knowledge
representation and reasoning, has been developed in the research field of Artificial Intelligence.
As the potential for connecting information sources on a Web-scale emerged, the languages
that have been used in the past to describe the content of the knowledge bases of stand-alone
expert systems have been adapted to the open, distributed environment of the Web.
Since the exchange of knowledge in standard languages is crucial for the interoperability of tools
and services on the Semantic Web, these languages have been standardized by the W3C.
Technology adoption
The Semantic Web was originally conceptualized as an extension of the current Web, i.e. as the
application of metadata for describing Web content. In this vision, the content that is already on
the Web would be described with metadata.
This vision was soon considered to be less realistic.
The alternative view predicted that the Semantic Web will first break through behind the
scenes and not with the ordinary users, but among large providers of data and services.
The second vision predicts that the Semantic Web will be primarily a “web of data” operated
by data and service providers.
That the Semantic Web is formulated as a vision points to the problem of bootstrapping the
Semantic Web.
Difficulties:
The problem is that the Semantic Web is a technology for developers, so users of the Web never
experience it directly, which makes it difficult to convey Semantic Web technology to
stakeholders. Further, most of the time the gains for developers are achieved over the long term,
i.e. when data and services need to be reused and re-purposed. The Semantic Web also suffers from
the fax effect.
When the first fax machines were introduced, they came with a very hefty price tag. Yet they
were almost useless. The usefulness of a fax comes from being able to communicate with other
fax users. In this sense every fax unit sold increases the value of all fax machines in use.
With the Semantic Web, too, the price of technological investment is very high at the beginning.
One has to adopt the new technology, which requires an investment in learning. The technology
also needs time to become more reliable.
It required a certain kind of agreement to get the system working on a global scale: all fax
machines needed to adopt the same protocol for communicating over the telephone line. This is
similar to the case of the Web where global interoperability is guaranteed by the standard
protocol for communication (HTTP).
In order to exchange meaning there has to be a minimal external agreement on the meaning of
some primitive symbols, i.e. on what is communicated through the network.
Our machines can also help in this task to the extent that some of the meaning can be described
in formal rules (e.g. if A is true, B should follow). But formal knowledge typically captures only
the smaller part of the intended meaning and thus there needs to be a common grounding in an
external reality that is shared by those at separate ends of the line.
To follow the popularity of Semantic Web related concepts and Semantic Web standards on
the Web, a set of temporal queries was executed using the search engine AltaVista.
The queries contained single terms plus a disambiguation term where it was necessary. Each
query measured the number of documents with the given term(s) at the given point in time.
The figure below shows the number of documents with the terms basketball, Computer Science,
and XML. The flat curve for the term basketball validates this strategy: the popularity of
basketball is expected to be roughly stable over this time period. Computer Science takes less and
less share of the Web as the Web shifts from scientific use to everyday use. The share of XML, a
popular pre-Semantic Web technology, seems to grow and stabilize as it becomes a regular part of
the toolkit of Web developers.
Fig.2 Number of web pages with the terms basketball, Computer Science, and XML over time,
as a fraction of the number of pages with the term web.
Against this general backdrop, the share of Semantic Web related terms and formats was
examined, in particular the terms RDF, OWL and the number of ontologies (Semantic Web
Documents) in RDF or OWL. As Figure 1.3.b shows most of the curves have flattened out
after January, 2004. It is not known at this point whether the dip in the share of Semantic
Web is significant. While the use of RDF has settled at a relatively high level, OWL has yet
to break out from a very low trajectory.
Fig.3 Number of web pages with the terms RDF, OWL and the number of ontologies in RDF or
OWL over time. Again, the number is relative to the number of pages with the term web.
The share of pages mentioning a Semantic Web format can be compared with the actual number of
Semantic Web documents using that format. The resulting talking vs. doing curve shows the
phenomenon of technology hype for XML, RDF and OWL alike. The peak of the curve is the point
where the technology “makes the press” and after which it becomes increasingly used on the Web.
Fig.4 The hype cycle of Semantic Web related technologies as shown by the number of web
pages about a given technology relative to its usage
The five-stage hype cycle of Gartner Research is defined as follows: The first phase of a Hype
Cycle is the “technology trigger” or breakthrough, product launch or other event that generates
significant press and interest. In the next phase, a frenzy of publicity typically generates over-
Although the word hype has attracted some negative connotations, hype is unavoidable for the
adoption of network technologies such as the Semantic Web.
While standardization of the Semantic Web is mostly complete, Semantic Web technology is
not yet reaching the mainstream user and developer community of the Web.
In particular, the adoption of RDF is lagging behind XML, even though it provides a better
alternative and thus many hoped it would replace XML over time.
The recent support for Semantic Web standards by vendors such as Oracle will certainly
inspire even more confidence in the corporate world. This could lead to an earlier realization of
the vision of the Semantic Web as a “web of data”, which could ultimately result in a resurgence
of general interest on the Web.
The Web was a read-only medium for a majority of users. The web of the 1990s was much like
the combination of a phone book and the yellow pages and despite the connecting power of
hyperlinks it instilled little sense of community among its users. This passive attitude toward the
Web was broken by a series of changes in usage patterns and technology that are now referred to
as Web 2.0, a buzzword coined by Tim O’Reilly.
Blogs
The first wave of socialization on the Web was due to the appearance of blogs, wikis and
other forms of web-based communication and collaboration. Blogs and wikis attracted mass
popularity from around 2003.
Social networks
The first online social networks, also referred to as social networking services, entered the field
at the same time as blogging and wikis started to take off. One early service attracted over five
million registered users and was followed by Google and Microsoft. These sites allow users to post
a profile with basic
information, to invite others to register and to link to the profiles of their friends. The system also
makes it possible to visualize and browse the resulting network in order to discover friends in
common, friends thought to be lost or potential new friendships based on shared interests.
The latest services are thus using user profiles and networks to stimulate different exchanges:
photos are shared in Flickr, bookmarks are exchanged in del.icio.us, plans and goals unite
members at 43Things. The idea of network based exchange is based on the sociological
observation that social interaction creates similarity and, vice versa, similarity creates
interaction: friends are likely to have acquired or to develop similar interests.
User profiles
Explicit user profiles make it possible for these systems to introduce rating mechanisms whereby
either the users or their contributions are ranked according to usefulness or trustworthiness.
Ratings are explicit forms of social capital that regulate exchanges in online communities, much
as reputation moderates exchanges in the real world. In terms of implementation, the new web
sites rely on new ways of applying some pre-existing technologies. Asynchronous
JavaScript and XML, or AJAX, which drives many of the latest websites, is merely a mix of
technologies that have been supported by browsers for years. User friendliness extends to
development: there is a preference for formats, languages and protocols that are easy to use and
develop with, in particular scripting languages, formats such as JSON and protocols such as REST.
This supports rapid development and prototyping; Flickr is one example. Also, borrowing much
of the ideology of the open source software movement, Web 2.0 applications open up their data
and services for user experimentation: Google, Yahoo and countless smaller web sites do so
through lightweight APIs, while content providers do the same with information in the form of
RSS feeds. The results of user experimentation with combinations of technologies are the
so-called mashups. A mashup is a website based on a combination of data and services provided by
others. The best examples of this development are the mashups based on Google’s mapping service,
such as HousingMaps.
users to encode facts in the text of articles while writing the text. This additional,
machine-processable markup of facts would make it easy to extract, query and aggregate the
knowledge of Wikipedia.
Similar work addresses entirely new wiki systems that combine free-text authoring with the
collaborative editing of structured information.
Information about the choices, preferences, tastes and social networks of users means that the
new breed of applications is able to build on much richer user profiles. Clearly, semantic
technology can help in matching users with similar interests as well as matching users with
available content.
Lastly, in terms of technology what the Semantic Web can offer to the Web 2.0 community is
a standard infrastructure for building creative combinations of data and services. Standard
formats for exchanging data and schema information, support for data integration, along with
standard query languages and protocols for querying remote data sources provide a platform
for the easy development of mashups.
Network analysis provides a vocabulary for describing social structures, provides formal models
that capture the common properties of all (social) networks and a set of methods applicable to the
analysis of networks in general. The concepts and methods of network analysis are grounded in a
formal description of networks as graphs.
Methods of analysis primarily originate from graph theory as these are applied to the graph
representation of social network data. (Network analysis also applies statistical and probabilistic
methods and to a lesser extent algebraic techniques.) It is interesting to note that the
formalization of network analysis has brought much of the same advantages that the
formalization of knowledge on the Web (the Semantic Web) is expected to bring to many
application domains. Previously vaguely defined concepts such as social role or social group
could now be defined on a formal model of networks, allowing more precise discussions in the
literature and the comparison of results across studies.
The methods of data collection in network analysis are aimed at collecting relational data in a
reliable manner. Data collection is typically carried out using standard questionnaires and
observation techniques that aim to ensure the correctness and completeness of network data.
Often records of social interaction (publication databases, meeting notes, newspaper articles,
documents and databases of different sorts) are used to build a model of social networks.
Despite the various efforts, each of the early studies used a different set of concepts and different
methods of representation and analysis of social networks. However, from the 1950s network
analysis began to converge around the unique world view that distinguishes network analysis
from other approaches to sociological research.
This convergence was facilitated by the adoption of a graph representation of social networks
usually credited to Moreno. What Moreno called a sociogram was a visual representation of
social networks as a set of nodes connected by directed links. The nodes represented individuals
in Moreno’s work, while the edges stood for personal relations. However, similar representations
can be used to depict a set of relationships between any kind of social unit such as groups,
organizations, nations etc. While 2D and 3D visual modeling is still an important technique of
network analysis, the sociogram is honored mostly for opening the way to a formal treatment of
network analysis based on graph theory.
The following decades have seen a tremendous increase in the capabilities of network analysis
mostly through new applications. SNA gains its relevance from applications and these settings in
turn provide the theories to be tested and greatly influence the development of the methods and
the interpretation of the outcomes. For example, one of the relatively new areas of network
analysis is the analysis of networks in entrepreneurship, an active area of research that builds on
and contributes to organization and management science.
The vocabulary, models and methods of network analysis also expand continuously through
applications that require handling ever more complex data sets. An example of this process is
the advances in dealing with longitudinal data. New probabilistic models are capable of
modeling the evolution of social networks and answering questions regarding the dynamics of
communities. Formalizing an increasing set of concepts in terms of networks also contributes to
both developing and testing theories in more theoretical branches of sociology.
The increasing variety of applications and related advances in methodology can be best observed
at the yearly Sunbelt Social Networks Conference series, which started in 1980.
The field of Social Network Analysis also has a journal of the same name since 1978,
dedicated largely to methodological issues.
However, articles describing various applications of social network analysis can be found in
almost any field where networks and relational data play an important role.
While the field of network analysis has been growing steadily from the beginning, there have
been two developments in the last two decades that led to an explosion in network literature.
First, advances in information technology brought a wealth of electronic data and significantly
increased analytical power.
Second, the methods of SNA are increasingly applied to networks other than social networks
such as the hyperlink structure on the Web or the electric grid. This advancement —brought
forward primarily by physicists and other natural scientists— is based on the discovery that
many networks in nature share a number of commonalities with social networks.
In the following, we will also talk about networks in general, but it should be clear from the text
that many of the measures in network analysis can only be strictly interpreted in the context of
social networks or have very different interpretation in networks of other kinds.
Fig.6 The upper part shows the location of the workers in the wiring room, while the lower part
is a network image of fights about the windows between workers (W), solderers (S) and
inspectors (I).
The term “social network” was introduced by Barnes in 1954. The convergence of the field was
facilitated by the adoption of a graph representation of social networks called the
sociogram, usually credited to Moreno.
The sociogram was a visual representation of social networks as a set of nodes connected by
directed links. The nodes represented individuals while the edges stood for personal relations.
The sociogram is honored mostly for opening the way to a formal treatment of network analysis
based on graph theory.
Social Network Analysis has developed a set of concepts and methods specific to the analysis of
social networks.
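A social network can be represented as a graph G = (V, E), where V denotes a finite set of
vertices and E denotes a finite set of edges.
Each graph can be associated with its characteristic matrix $M := (m_{i,j})_{n \times n}$, where
$n = |V|$ and $m_{i,j} = 1$ if there is an edge from vertex $v_i$ to vertex $v_j$ and $0$ otherwise.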
A component is a maximal connected subgraph. Two vertices are in the same (strong)
component if and only if there exists a (directed) path between them.
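The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the nested-list matrix and the example vertices are assumptions made here, and only the undirected case of components is covered.

```python
from collections import deque

def adjacency_matrix(vertices, edges, directed=False):
    """Return the n x n characteristic (adjacency) matrix as nested lists."""
    index = {v: i for i, v in enumerate(vertices)}
    n = len(vertices)
    m = [[0] * n for _ in range(n)]
    for u, v in edges:
        m[index[u]][index[v]] = 1
        if not directed:
            m[index[v]][index[u]] = 1  # undirected ties are symmetric
    return m

def components(vertices, edges):
    """Return the connected components of an undirected graph as sets."""
    neighbours = {v: set() for v in vertices}
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    seen, result = set(), []
    for start in vertices:
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from start.
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for nb in neighbours[node]:
                if nb not in component:
                    component.add(nb)
                    queue.append(nb)
        seen |= component
        result.append(component)
    return result

# Hypothetical toy network: two separate groups of actors.
V = ["a", "b", "c", "d", "e"]
E = [("a", "b"), ("b", "c"), ("d", "e")]
M = adjacency_matrix(V, E)
parts = components(V, E)
```

Here `parts` contains one set per maximal connected subgraph, so actors a, b and c end up in one component and d and e in another.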
The American psychologist Stanley Milgram conducted an experiment about the structure of social
networks. Milgram calculated the average length of the chains and concluded that on average
Americans are no more than six steps apart from each other. While this is also the source of the
expression six degrees of separation, the actual number is rather dubious.
A practical impact of Milgram’s finding is that certain structures can be excluded as possible
models for social networks. One example is the two-dimensional lattice model shown in Fig.8.
Fig.8 The 2D lattice model of networks (left). By connecting the nodes on the opposite
borders of the lattice we get a toroidal lattice (right).
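To get a feel for distances in the lattice models, the sketch below builds a plain and a toroidal grid and computes the average geodesic distance between nodes by breadth-first search. The function names and the grid size of 5 are assumptions chosen for illustration. Wrapping the borders shortens the average distance, but in both lattices it keeps growing with the side length of the grid.

```python
from collections import deque
from itertools import product

def lattice(size, toroidal=False):
    """Neighbour map of a size x size grid; borders wrap around if toroidal."""
    graph = {}
    for r, c in product(range(size), repeat=2):
        nbs = set()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if toroidal:
                nbs.add((nr % size, nc % size))
            elif 0 <= nr < size and 0 <= nc < size:
                nbs.add((nr, nc))
        graph[(r, c)] = nbs
    return graph

def distances_from(graph, source):
    """Geodesic distance from source to every reachable node (BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in graph[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist

def average_path_length(graph):
    """Mean geodesic distance over all ordered pairs of distinct nodes."""
    total, pairs = 0, 0
    for s in graph:
        for t, d in distances_from(graph, s).items():
            if t != s:
                total += d
                pairs += 1
    return total / pairs

plain = average_path_length(lattice(5))
torus = average_path_length(lattice(5, toroidal=True))
```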
Clustering for a single vertex can be measured by the actual number of edges between
the neighbors of a vertex divided by the possible number of edges between the neighbors. Taking
the average over all vertices gives the measure known as the clustering coefficient. The
clustering coefficient of a tree is zero, which is easy to see if we consider that there are no
triangles of edges (triads) in the graph. In a tree, it would never be the case that our friends
are friends with each other.
Fig.10 Most real world networks show a structure where densely connected
subgroups are linked together by relatively few bridges
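The clustering measure defined earlier (edges among a vertex's neighbors divided by the possible number of such edges, averaged over all vertices) can be sketched as follows. The dictionary-of-sets graph format and the example graphs are assumptions made for this illustration, as is the convention of assigning zero to vertices with fewer than two neighbors.

```python
def local_clustering(graph, v):
    """Share of the possible edges among the neighbours of v that exist."""
    nbs = graph[v]
    k = len(nbs)
    if k < 2:
        return 0.0  # assumed convention: no pair of neighbours to connect
    links = sum(1 for a in nbs for b in nbs if a < b and b in graph[a])
    return links / (k * (k - 1) / 2)

def clustering_coefficient(graph):
    """Average of the local clustering values over all vertices."""
    return sum(local_clustering(graph, v) for v in graph) / len(graph)

# Hypothetical examples: a triangle with one extra node attached, and a tree.
triangle_with_tail = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
tree = {1: {2, 3}, 2: {1}, 3: {1}}

cc_triangle = clustering_coefficient(triangle_with_tail)
cc_tree = clustering_coefficient(tree)
```

The tree yields zero because it contains no triangles, while the first graph yields a positive value.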
Clustering a graph into subgroups allows us to visualize the connectivity at a group level.
Core-Periphery (C/P) structure is one where nodes can be divided in two distinct subgroups:
nodes in the core are densely connected with each other and the nodes on the periphery, while
peripheral nodes are not connected with each other, only to nodes in the core (see Figure 1.7.e).
The matrix form of a core-periphery structure is a block matrix with ones in the core-core block
and zeros in the periphery-periphery block.
The result of the optimization is a classification of the nodes as core or periphery and a
measure of the error of the solution.
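A simplified sketch of such an optimization is given below. It assumes an ideal pattern taken from the description above (every tie that touches the core is present, ties between peripheral nodes are absent), counts the node pairs that deviate from it, and tries every possible core by brute force. The function names, the error measure and the example network are assumptions for illustration; practical methods use more refined fit functions and search heuristics.

```python
from itertools import combinations

def cp_error(graph, core):
    """Number of node pairs whose tie deviates from the ideal C/P pattern."""
    errors = 0
    for u, v in combinations(sorted(graph), 2):
        expected = u in core or v in core  # ideal: ties exist iff they touch the core
        observed = v in graph[u]
        errors += expected != observed
    return errors

def best_core(graph):
    """Exhaustively search for the core set with the fewest deviations."""
    nodes = sorted(graph)
    best = None
    for k in range(1, len(nodes)):
        for core in combinations(nodes, k):
            err = cp_error(graph, set(core))
            if best is None or err < best[1]:
                best = (set(core), err)
    return best

# Hypothetical network: a and b form the core, but the tie b-e is missing.
g = {
    "a": {"b", "c", "d", "e"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"a", "b"},
    "e": {"a"},
}
core, error = best_core(g)
```

The exhaustive search is only feasible for very small networks, since the number of candidate cores doubles with every added node.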
Fig.11
The structural dimension of social capital refers to patterns of relationships or positions that
provide benefits in terms of accessing large, important parts of the network.
Degree centrality equals the graph theoretic measure of degree, i.e. the number of
(incoming, outgoing or all) links of a node.
Closeness centrality is obtained by calculating the average (geodesic) distance of
a node to all other nodes in the network. In larger networks it makes sense to constrain the size of
the neighborhood in which to measure closeness centrality. It makes little sense, for example, to
talk about the most central node on the level of a society. The resulting measure is called local
closeness centrality.
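Degree and closeness as described above can be sketched as follows, for an undirected graph stored as a dictionary of neighbor sets. The function names and the `max_distance` parameter used for the local variant are assumptions for illustration. Note that with this definition a lower closeness value means a more central node.

```python
from collections import deque

def bfs_distances(graph, source):
    """Geodesic distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in graph[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist

def degree_centrality(graph):
    """Number of links of each node (undirected case)."""
    return {v: len(nbs) for v, nbs in graph.items()}

def closeness(graph, v, max_distance=None):
    """Average geodesic distance from v to the other reachable nodes.

    If max_distance is given, only nodes within that radius are counted,
    which corresponds to local closeness.
    """
    others = [
        d for node, d in bfs_distances(graph, v).items()
        if node != v and (max_distance is None or d <= max_distance)
    ]
    return sum(others) / len(others) if others else None

# Hypothetical chain of five actors: a - b - c - d - e
chain = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c", "e"}, "e": {"d"}}
degrees = degree_centrality(chain)
middle = closeness(chain, "c")
end = closeness(chain, "a")
end_local = closeness(chain, "a", max_distance=2)
```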
Two other measures of power and influence through networks are broker positions and weak
ties.
Betweenness is defined as the proportion of paths — among the geodesics between all pairs of
nodes—that pass through a given actor.
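A small sketch of this definition for undirected graphs follows. For every pair of nodes it adds up the share of geodesics that pass through each intermediate node; the result is left unnormalized, and the function names and example graphs are assumptions for illustration. For large networks a more efficient algorithm would be preferable.

```python
from collections import deque
from itertools import combinations

def bfs_counts(graph, source):
    """Distance and number of geodesics from source to each reachable node."""
    dist, sigma = {source: 0}, {source: 1}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in graph[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                sigma[nb] = 0
                queue.append(nb)
            if dist[nb] == dist[node] + 1:
                sigma[nb] += sigma[node]  # geodesics reaching nb via node
    return dist, sigma

def betweenness(graph):
    """Sum over node pairs of the share of geodesics through each node."""
    info = {v: bfs_counts(graph, v) for v in graph}
    score = {v: 0.0 for v in graph}
    for s, t in combinations(sorted(graph), 2):
        dist_s, sigma_s = info[s]
        if t not in dist_s:
            continue  # s and t are in different components
        dist_t, sigma_t = info[t]
        for v in graph:
            if v in (s, t) or v not in dist_s or v not in dist_t:
                continue
            if dist_s[v] + dist_t[v] == dist_s[t]:  # v lies on a geodesic
                score[v] += sigma_s[v] * sigma_t[v] / sigma_s[t]
    return score

# Hypothetical examples: a star around "hub" and a four-node ring.
star = {"hub": {"x", "y", "z"}, "x": {"hub"}, "y": {"hub"}, "z": {"hub"}}
ring = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"a", "c"}}
star_scores = betweenness(star)
ring_scores = betweenness(ring)
```

In the star every geodesic between two leaves passes through the hub, which makes the hub the only broker. In the ring each pair of opposite nodes is joined by two geodesics, so each intermediate node receives half a path.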
A structural hole occurs in the space that exists between closely clustered communities.
Lastly, he proves that the structural holes measure correlates with creativity by establishing a
linear equation between the network measure and the individual characteristics on one side of the
equation and creativity on the other side.
The studies of electronic communication networks based on email data are limited by
privacy concerns. For example, in the HP case the content of messages had to be ignored by the
researchers and the data set could not be shared with the community.
Public forums and mailing lists can be analyzed without similar concerns. The W3C —
which is also the organization responsible for the standardization of Semantic Web
technologies—is unique among standardization bodies in its commitment to transparency toward
the general public of the Internet and part of this commitment is the openness of the discussions
within the working groups.
Content analysis has also been the most commonly used tool in the computer aided
analysis of blogs (web logs), primarily with the intention of trend analysis for the purposes of
marketing. While blogs are often considered a form of personal publishing, bloggers themselves
know that blogs are much more than that: modern blogging tools make it easy to comment on and
react to the comments of other bloggers, resulting in webs of communication among bloggers.
These discussion networks also lead to the establishment of dynamic communities, which often
manifest themselves through syndicated blogs (aggregated blogs that collect posts from a set of
authors blogging on similar topics), blog rolls (lists of discussion partners on a personal blog)
and even result in real world meetings such as the Blog Walk series of meetings.
Word Press. Yes, there are other blogging platforms and some of them may be easier for new
The 2004 US election campaign represented a turning point in blog research as it has been the
first major electoral contest where blogs have been exploited as a method of building networks
among individual activists and supporters. Blog analysis has suddenly shed its image as relevant
only to marketers interested in understanding product choices of young demographics; following
this campaign there has been an explosion in research on the capacity of web logs for creating and
maintaining stable, long distance social networks of different kinds.
Online community spaces and social networking services such as MySpace and LiveJournal
cater to socialization even more directly than blogs, with features such as social networking
(maintaining lists of friends, joining groups), messaging and photo sharing. As
they are typically used by a much younger demographic, they offer an excellent
opportunity for studying changes in youth culture.
Tie strength was calculated by dividing the number of co-occurrences by the number of pages
returned for the two names individually (see Figure).
Also known as the Jaccard-coefficient, this is basically the ratio of the sizes of two sets: the
intersection of the sets of pages and their union.
The resulting value of tie strength is a number between zero (no co-occurrences) and one (no
separate mentioning, only co-occurrences). If this number exceeded a certain fixed threshold,
it was taken as evidence for the existence of a tie.
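The calculation can be sketched from three page counts: the pages mentioning the first name, the pages mentioning the second name, and the pages mentioning both. The function names, the threshold value of 0.2 and the counts in the example are hypothetical; the source does not state the threshold that was used.

```python
def tie_strength(count_a, count_b, count_both):
    """Jaccard coefficient of two page sets, computed from their sizes."""
    union = count_a + count_b - count_both  # pages mentioning either name
    return count_both / union if union else 0.0

def has_tie(count_a, count_b, count_both, threshold=0.2):
    """A tie is assumed when the strength exceeds the (hypothetical) threshold."""
    return tie_strength(count_a, count_b, count_both) > threshold

# Hypothetical counts: 100 pages for A, 50 for B, 30 mentioning both.
strength = tie_strength(100, 50, 30)
linked = has_tie(100, 50, 30)

# A very popular name paired with a rare one scores low even though
# every page about the rare name also mentions the popular one.
lopsided = tie_strength(10000, 20, 20)
```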
A second threshold applies to the number of pages that can be found for the given individuals or
combination of individuals. The reason is that the Jaccard-coefficient is a relative measure of
co-occurrence and does not take into account the absolute sizes of the sets. If the absolute sizes
are very low, we can easily get spurious results.
A disadvantage of the Jaccard-coefficient is that it penalizes ties between an individual whose
name often occurs on the Web and less popular individuals (see Figure 3.4).
Researchers are associated with topics in a slightly different way: the system calculates the
strength of association between the name of a given person and a certain topic.
There have been several approaches to deal with name ambiguity. Instead of a single name they
assume to have a list of names related to each other. They disambiguate the appearances by
clustering the combined results returned by the search engine for the individual names. The
clustering can be based on various networks between the returned webpages, e.g. based on
hyperlinks between the pages, common links or similarity in content.
The idea is that such key phrases can be added to the search query to reduce the set of results to
those related to the given target individual.
When computing the weight of a directed link between two persons, we consider an ordered list
of pages for the first person and a set of pages for the second (the relevant set), as shown in
Figure:
There are four different sets: the records which were retrieved, the records which were
not retrieved, the relevant records and the irrelevant records (as annotated in the test set). The
intersections of these sets (A, B, C, D) represent the following:
A is the number of irrelevant records not retrieved (true negatives, TN),
B is the number of irrelevant records retrieved (false positives, FP),
C is the number of relevant records not retrieved (false negatives, FN) and
D is the number of relevant records retrieved (true positives, TP).
Recall is defined as: Recall = TP / (TP + FN)
Precision is defined as: Precision = TP / (TP + FP)
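The two definitions can be checked on a small example. The sketch below takes the retrieved and relevant records as sets; the function name and the record identifiers are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved records."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)   # relevant records retrieved
    fp = len(retrieved - relevant)   # irrelevant records retrieved
    fn = len(relevant - retrieved)   # relevant records not retrieved
    precision = tp / (tp + fp) if retrieved else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    return precision, recall

# Hypothetical test set: four records retrieved, three records relevant.
precision, recall = precision_recall({1, 2, 3, 4}, {2, 4, 5})
```

Two of the four retrieved records are relevant, and two of the three relevant records were found.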
We ask the search engine for the top N pages for both persons, but in the case of the
second person the order is irrelevant. Let rel(n) denote the relevance at position n, where
rel(n) is 1 if the document at position n is in the relevant set and zero otherwise (1 ≤ n ≤
N).
The average precision method is more sophisticated in that it takes into account the order in
which the search engine returns documents for a person: it assumes that names of other persons
that occur closer to the top of the list represent more important contacts than names that occur in
pages at the bottom of the list.
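A sketch of how such an order-sensitive weight can be computed is given below. It averages the precision at every position of the first person's ranked list that holds a page from the second person's relevant set. Normalizing by the number of relevant positions found is one common convention and an assumption here, as are the function name and the page identifiers.

```python
def average_precision(ranked_pages, relevant_set):
    """Average of the precision at each position holding a relevant page."""
    hits, total = 0, 0.0
    for n, page in enumerate(ranked_pages, start=1):
        if page in relevant_set:   # rel(n) = 1
            hits += 1
            total += hits / n      # precision at position n
    return total / hits if hits else 0.0

# Hypothetical page identifiers; p1 and p3 also mention the second person.
relevant = {"p1", "p3"}
near_top = average_precision(["p1", "p2", "p3", "p4"], relevant)
near_bottom = average_precision(["p2", "p4", "p1", "p3"], relevant)
```

The same two shared pages give a stronger link when they appear near the top of the first person's list than when they appear at the bottom.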
This strength is determined by taking the number of the pages where the name of an interest and
the name of a person co-occur divided by the total number of pages about the person.
Assign the expertise to an individual if this value is at least one standard deviation higher than
the mean of the values obtained for the same concept.
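The assignment rule can be sketched as follows. The function names, the person names and the page counts are hypothetical, and the use of the population standard deviation is an assumption, since the source does not say which variant was used.

```python
from statistics import mean, pstdev

def association_strengths(cooccurrence_counts, person_counts):
    """Share of each person's pages that also mention the interest."""
    return {
        person: cooccurrence_counts.get(person, 0) / total
        for person, total in person_counts.items()
        if total > 0
    }

def experts(strengths):
    """Persons whose strength is at least one standard deviation above the mean."""
    values = list(strengths.values())
    cutoff = mean(values) + pstdev(values)  # assumption: population std. dev.
    return {person for person, s in strengths.items() if s >= cutoff}

# Hypothetical page counts for one interest.
pages_about_person = {"ann": 100, "bob": 200, "cat": 50, "dan": 400}
pages_with_interest = {"ann": 60, "bob": 10, "cat": 5, "dan": 20}

strengths = association_strengths(pages_with_interest, pages_about_person)
assigned = experts(strengths)
```

Only the person whose pages mention the interest far more often than average is assigned the expertise.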
The biggest technical challenge in social network mining is the disambiguation of person names.
Person names exhibit the same problems of polysemy and synonymy that we have seen in the
general case of web search. Synonymy affects queries for researchers who commonly use different
variations of their name (e.g. Jim Hendler vs. James Hendler).
Polysemy is the association of one word with two or more distinct meanings. A polyseme is a
word or phrase with multiple meanings. In contrast, a one-to-one match between a word and a
meaning is called monosemy. According to some estimates, more than 40% of English words
have more than one meaning. Synonymy is the sense relation that exists between words
with closely related meanings.