
SNSW Unit-5


Social Networks

&
Semantic Web

CO 6: Evaluation of Web-Based Social Network Extraction


Introduction
• Social Network
• A network of social interactions and personal relationships.
• A social structure made up of a set of social actors (such as individuals or
organizations), sets of dyadic ties, and other social interactions between
actors.
• Computer scientists tend to look at social networks in the first way and
study online networks as the equivalents of real-world networks.
• On the other hand, network analysts with a social science background
apply extreme caution and most often still treat electronic
communication networks and online communities as a separate field of
investigation.
Introduction
• Nevertheless, surprisingly little is known about the exact relationship
between real world networks and their online reflections.
• To what extent do electronic data obtained from the Web reveal social
structures?
• Different forms of electronic data could serve as a source for
obtaining information about different types of social relations.
• For electronic data such as email traffic, which represent message exchanges
between individuals, a higher degree of correspondence with social network
data is plausible.
• For others, such as web logs, a more distant perspective seems warranted.
• Social network mining from the Web based on co-occurrence is an
interesting method as it is likely to produce evidence of ties of different
strengths based on the variety of the underlying data.
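The co-occurrence idea can be illustrated with a small sketch: estimate tie strength from search-engine hit counts using a Jaccard-style ratio. The hit counts below are hypothetical stand-ins for real queries, not data from the study.

```python
# Sketch of co-occurrence-based tie strength. The hit counts are
# hypothetical; in practice they would come from search-engine queries
# for each name and for the pair of names together.
def jaccard_strength(hits_a, hits_b, hits_ab):
    """Jaccard-style strength: co-occurring pages over all pages
    mentioning either person."""
    union = hits_a + hits_b - hits_ab
    return hits_ab / union if union else 0.0

# Toy numbers: pages mentioning person A, person B, and both.
strength = jaccard_strength(hits_a=1200, hits_b=800, hits_ab=150)
```

Ties backed by many co-occurring pages receive higher strengths, which is what makes the method sensitive to relationships of different intensity.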
Introduction
• Bearing in mind the limitations, in this unit, we are going to evaluate web-based social
network extraction using members of a research organization as the subjects of our
study.
• The authors chose to evaluate electronic data extraction against the results of a
survey method, which is the dominant approach to data collection for network analysis.
• Standard questionnaires are preferred in theory testing for their fixed structure, which
guarantees the comparability of results among test subjects and across studies.
• Various forms of asking for one’s social relations have been tested through the years for
reliability.
• The fixed structure of questionnaires also allows researchers to directly extract
relational data and attributes for processing with network analysis tools and
statistical packages.
• Questionnaires are also minimally invasive and can be easily mailed, emailed or
administered online.
Differences between survey methods and
electronic data extraction
• Differences in what is measured
• What is not on the Web cannot be extracted from the Web, which limits the
scope of extraction. Also, these data can be biased in case part of the
community to be studied is better represented on the Web than other parts.
• Our network extraction method is likely to find evidence for different kinds
of relationships, resulting in what is called a multiplex network. These
relationships are not easily disentangled, although some progress can be made
by applying machine learning to disambiguate relationships.
• The authors measure a number of relationships in their survey and use these data
to understand the composition of relationships found on the Web.
• The equivalent problem in survey methods is the difficulty of precisely
formulating those questions that address the relationship the researcher
actually wants to study.
Differences between survey methods and
electronic data extraction
• Errors introduced by the extraction method
• There are errors that affect the extraction of particular cases. Homonymy affects
common names (e.g. J. Smith or Xi Li), but can be reduced somewhat by adding
disambiguation terms to queries. Synonymy presents a problem whenever a person
uses different variations of his or her name. Different variations of first names (e.g.
James Hendler vs Jim Hendler), different listing of first and middle names, foreign
accentuation, different alphabets (e.g. Latin vs. Chinese) etc. can all lead to different
name forms denoting the same person.
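The name variations and disambiguation terms discussed above can be sketched as small query-building helpers. The function names and the affiliation term below are illustrative, not part of the study's actual implementation.

```python
def name_variants(first, last, nicknames=()):
    """Generate common name forms that may denote the same person
    (synonymy): full name, initial form, and known nicknames."""
    variants = {f"{first} {last}", f"{first[0]}. {last}"}
    variants.update(f"{nick} {last}" for nick in nicknames)
    return variants

def disambiguated_query(name, context_term):
    """Append a context term (e.g. an affiliation) to a name query to
    reduce homonymy for common names."""
    return f'"{name}" "{context_term}"'
```

For example, `name_variants("James", "Hendler", ("Jim",))` yields the "James Hendler" vs "Jim Hendler" forms mentioned above, and a query for a common name such as "J. Smith" can be narrowed with an affiliation term.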
• They addressed this problem by experimenting with various measures that could
predict if a particular name is likely to be problematic in terms of extraction. We can
test such measures by detecting whether the personal network of such persons is
in fact more difficult to extract than the networks of other persons.
Differences between survey methods and
electronic data extraction
• Errors introduced by survey data collection
• Unlike network mining from the Web, surveys almost never cover a network
completely. Although a response rate lower than 100% is not necessarily an
error, it does require some proof that either the non-respondents are not
significantly different from the respondents with respect to the survey or that
the collected results are so robust that the response from the non-
respondents could not have affected it significantly.
• The respondents are not likely to be equally co-operative either. There are
most likely differences in the level of cooperativeness and fatigue.
• The mere fact of observation can introduce a bias. At best this bias affects all
subjects in an equal manner.
• Not only the type of relationship that is considered by the respondent but
also the recall of contacts is affected by the way a question is formulated.
Context of the empirical study
• The researchers used network data on the social networks of the 123
researchers working at the Department of Computer Science of the Vrije
Universiteit, Amsterdam in September 2006.
• The Department is organized in six Sections of various sizes, which are in
decreasing order of size: Computer Systems (38), Artificial Intelligence (33),
Information Management and Software Engineering (22), Business
Informatics (17), Theoretical Computer Science (9) and Bioinformatics (4).
• The Sections are further divided internally into groups, each led by a
professor. Researchers in each section include part- or full-time PhD
students, postdocs, associate and full professors, but the study excluded
master students, support staff (scientific programmers) and administrative
support.
Context of the empirical study
• The authors chose this community as the subject of their study because
one of them is a member of the Business Informatics group of the
Department.
• Some participants (9) felt that this study should not have been carried
out by “one of their own” as they did not feel comfortable with
providing personal information to a colleague.
• From the remaining 114 researchers, 79 responses were collected
(a response rate of 64%), with above-average response rates in the BI
group (88%) and the closely interlinked AI group (79%).
Data Collection
• The authors used both electronic data extraction and the survey
approach in order to compare them.
• We collected personal and social information using a custom-built
online survey system.
• An online survey offers several advantages compared to a paper
questionnaire:
• Easy accessibility for the participants.
• Greater flexibility in design, allowing for a better survey experience. Using an
electronic survey it is possible to adapt questions presented to the user based
on the answers to previous questions.
• Easier processing for the survey administrator. Our system recorded electronic
data directly in RDF using the FOAF-based semantic representations.
Data Collection
• The survey is divided over several pages.
• The first page asks the participant to enter basic personal information: his or her full-time or
part-time status, age, years at the organization, name of the direct supervisor and research
interests.
• The second and third pages contain standard questions for determining the level of self-
monitoring and the extent someone identifies with the different levels of the organization.
• The fourth page asks the participant to select the persons he or she knows from a
complete list of Department members. This question is included to pre-select those persons
the participant might have any relationship with.
• The next page asks the participant to specify the nature of the relationship with the
persons selected. In particular, the participant is suggested to consider six types of
relationships and asked to specify for each person which type of relationship applies to that
person. The six types of relations we surveyed were advice seeking, advice giving, future
cooperation, similarity perceptions, friendship, and adversarial ties.
• Upon completion of the last page, the survey software stored the results in a Sesame RDF
store.
Preparing Data
• The number of nodes in the network is 79.
• Six networks were constructed, one for each of the six surveyed relations.
• Directionality was removed from both the survey networks and the web-based network.
Optimizing goodness of fit
• The nodes and edges of the web-based network also had to be filtered
before it could be compared with the survey-based networks.
• Filtering the web-based network requires specifying cut-off values
for two parameters:
• The minimal number of pages one must have on the Web to be included
(page count)
• The minimal strength of the relationships (used to filter out ties with too little
support).
• Filtering is either carried out before removing directionality or one
needs to aggregate the weights of the edges going in different
directions before the edges can be filtered.
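The aggregate-then-filter option described above can be sketched as follows. The data structures are hypothetical: a dictionary of page counts per person and a dictionary of directed, weighted edges.

```python
def filter_and_symmetrize(page_count, edges, min_pages, min_strength):
    """Keep only people with enough web pages, aggregate the weights of
    the edges going in both directions, then drop ties whose combined
    strength falls below the threshold."""
    nodes = {v for v, c in page_count.items() if c >= min_pages}
    undirected = {}
    for (u, v), w in edges.items():
        if u in nodes and v in nodes:
            key = tuple(sorted((u, v)))  # remove directionality
            undirected[key] = undirected.get(key, 0.0) + w
    return {e: w for e, w in undirected.items() if w >= min_strength}
```

With a toy network, `filter_and_symmetrize({"a": 10, "b": 5, "c": 1}, {("a","b"): 0.3, ("b","a"): 0.4, ("a","c"): 0.9}, min_pages=2, min_strength=0.5)` drops the poorly covered person "c" and keeps the aggregated a-b tie.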
Histograms representing the number of web pages and the strength of relationships

[Figure 1: Histogram of the number of web pages per individual (x-axis on a
logarithmic scale), and histogram of the strength of relationships based on
the web extraction method.]
Optimizing goodness of fit
• Finding the appropriate parameters for filtering can be considered as
an optimization task where we would like to maximize the similarity
between our survey networks and the extracted network.
• We can consider relationship extraction as an information retrieval
task and apply well-known measures from the field of information
retrieval.
• Let’s denote our graphs to be compared as G1(V1, E1) and G2(V2, E2).
• Precision, recall and the F-measure are common measures in
information retrieval, while the Jaccard-coefficient is also used for
example in UCINET
Optimizing goodness of fit

Similarity measures for graphs based on edge sets (with E1 the survey
network and E2 the extracted network):
• Precision: |E1 ∩ E2| / |E2|
• Recall: |E1 ∩ E2| / |E1|
• F-measure: 2 · precision · recall / (precision + recall)
• Jaccard-coefficient: |E1 ∩ E2| / |E1 ∪ E2|
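These edge-set measures are straightforward to compute; a minimal sketch, treating the survey edges as the gold standard:

```python
def edge_similarities(e1, e2):
    """Precision, recall, F-measure and Jaccard-coefficient between two
    edge sets, with e1 the gold standard (survey) and e2 the extraction."""
    inter = len(e1 & e2)
    precision = inter / len(e2) if e2 else 0.0
    recall = inter / len(e1) if e1 else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    jaccard = inter / len(e1 | e2) if e1 | e2 else 0.0
    return precision, recall, f, jaccard
```

For instance, comparing `{(1,2),(2,3),(3,4)}` with `{(1,2),(2,3),(4,5)}` gives precision and recall of 2/3 and a Jaccard-coefficient of 0.5.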
Optimizing goodness of fit
Once we have chosen a measure,
we can visualize the effect of the
parameters on the similarity using
surface plots.
These figures show the changes in
the similarity between the advice
seeking network and the network
obtained from the Web using the
F-measure. The similarity (plotted
on the vertical, z axis) depends on
the value of the two parameters of
the algorithm.
The plots show how the network obtained from web mining changes as we
vary the page count and strength thresholds.
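The optimization itself can be sketched as an exhaustive grid search over the two cut-offs, scoring each candidate network with the F-measure. The data structures and grids below are illustrative, not the study's actual values.

```python
from itertools import product

def f_measure(gold, extracted):
    """Harmonic mean of precision and recall between two edge sets."""
    inter = len(gold & extracted)
    if not gold or not extracted or not inter:
        return 0.0
    p, r = inter / len(extracted), inter / len(gold)
    return 2 * p * r / (p + r)

def best_thresholds(gold, page_count, weighted_edges, page_grid, strength_grid):
    """Exhaustive search over the two cut-offs, maximizing the F-measure
    against the survey (gold) network."""
    best = (0.0, None, None)
    for min_pages, min_strength in product(page_grid, strength_grid):
        edges = {e for e, w in weighted_edges.items()
                 if w >= min_strength
                 and page_count.get(e[0], 0) >= min_pages
                 and page_count.get(e[1], 0) >= min_pages}
        score = f_measure(gold, edges)
        if score > best[0]:
            best = (score, min_pages, min_strength)
    return best
```

Evaluating the F-measure at every grid point is exactly what the surface plots visualize: the vertical axis is the score, the two horizontal axes are the thresholds.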
Optimizing goodness of fit
• The F-measure, which is the harmonic mean of precision and recall, has a
single highest peak (optimum) and a second peak representing a different
trade-off between precision and recall.
• In general, we note that it seems easier to achieve high recall than high
precision. This suggests a two-stage acquisition process where we first
collect a social network using web mining and then apply a survey in which
we ask respondents to remove the incorrect relations and add the few
missing ones.
• Such a pre-selection approach can be particularly useful in large networks
where listing all names in a survey would result in an overly large table.
• Further, subjects are more easily motivated to correct lists than to provide
lists themselves.
Comparison across methods and networks
• Our Standard survey data also allows a direct comparison of methods
for social network mining.
• In this case they compare the best possible results obtainable by the two
methods, i.e. they choose the parameters for each method separately
such that the chosen similarity measure is optimized.
• They subjected to this test the benchmark method of co-occurrence
analysis and the method based on average precision.
• The results confirm our intuition that the average precision method
produces higher precision, but lower recall resulting in only slightly
higher F-measure values.
Comparison across methods and networks
• It is likely that the relationships we extract from the Web reflect a
number of underlying relationships, including those we asked in our
survey and possibly others we did not.
• To measure to what extent each of our surveyed relationships is
present on the Web it would be possible to perform a p* analysis,
where we assume that the Web-based network is a linear
combination of our survey networks.

• The analysis would then return the coefficients of this linear equation.
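A simplified stand-in for such an analysis is an ordinary least-squares fit of the web network's edge indicators on the indicators of two survey networks. A real p* (exponential random graph) analysis is more involved; this sketch, with hypothetical networks, only illustrates the idea of recovering the coefficients of the linear combination.

```python
def fit_linear_combination(web, survey_a, survey_b, all_edges):
    """Least-squares coefficients (alpha, beta) such that
    web ≈ alpha * survey_a + beta * survey_b over the candidate edges.
    Networks are given as edge sets; indicators are 0/1 per edge."""
    y = [1.0 if e in web else 0.0 for e in all_edges]
    x1 = [1.0 if e in survey_a else 0.0 for e in all_edges]
    x2 = [1.0 if e in survey_b else 0.0 for e in all_edges]
    # Normal equations for two predictors (no intercept).
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s1y = sum(a * c for a, c in zip(x1, y))
    s2y = sum(b * c for b, c in zip(x2, y))
    det = s11 * s22 - s12 * s12
    alpha = (s22 * s1y - s12 * s2y) / det
    beta = (s11 * s2y - s12 * s1y) / det
    return alpha, beta
```

Large coefficients would indicate survey relationships that are strongly reflected in the web-based network.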
Predicting the goodness of fit
• Many personal factors influence the success of extracting social networks
from the Web.
• For example, as we already saw the amount of information available about
a person or the commonality of someone’s name on the Web.
• The author collected attribute data from the survey and used it to
investigate whether some of these factors can be reliably linked to the
success of obtaining a person's social network from the Web.
• If we find measures that help us to predict when the extraction is likely to
fail we can exclude those individuals from the Web-based extraction and
try other methods or data sources for obtaining information about their
relations.
Predicting the goodness of fit
• We have to measure the similarity between personal networks from
the survey and the Web and correlate it with attributes of the
subjects.
• The attributes the author considered are those from the survey:
• The number of relations mentioned (surveydegree).
• The age of the individual.
• The number of years spent at the VU (variables age and entry).
• The author also looks at Web-based indicators such as
• The number of relations extracted (miningdegree).
• The number of pages for someone’s name, which we recode based on its
distribution by taking the logarithm of the values (pagecount).
Predicting the goodness of fit
• Lastly, the author experimented with measures for name ambiguity
based on the existing literature on name disambiguation using
clustering methods.
• The author used two measures
• NC1: Jaccard-coefficient between the first name and the last
name.
• NC2: The ratio of the number of pages for a query that includes
the full name and the term Vrije Universiteit divided by the
number of pages for the full name only.
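The NC2 measure, and the log-recoded page count mentioned earlier, amount to simple ratios; a sketch with hypothetical page counts:

```python
from math import log10

def nc2(pages_with_affiliation, pages_full_name):
    """NC2: pages for the full name plus the affiliation term, divided by
    pages for the full name alone. Values near 1 suggest an unambiguous
    name; low values hint that the name is shared by many people."""
    return pages_with_affiliation / pages_full_name if pages_full_name else 0.0

def log_pagecount(pages):
    """Recode the raw page count on a log scale, as done for the
    pagecount attribute."""
    return log10(pages) if pages > 0 else 0.0
```

For example, if only 80 of a person's 400 pages also mention the affiliation, NC2 = 0.2, signalling a potentially ambiguous name.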
Predicting the goodness of fit
• Findings:
• None of the survey attributes has a direct influence on the result.
• The NC1 measure has no significant effect.
• The NC2 measure has a negative effect on the F-measure.
Evaluation through analysis
• If the network extraction is optimized to match the results of the survey it
will give similar results in analysis.
• However, we will see that a 100% match is not required for
obtaining relevant results in applications: most network measures are
statistical aggregates and thus relatively robust to missing or incorrect
information.
• Group-level analysis, for example, is typically insensitive to errors in the
extraction of specific cases.
• The macrolevel social structure of our department can be retrieved by
collapsing this network to show the relationships between groups using
the affiliations or by clustering the network.
• The two networks reveal the same underlying organization: two of the
groups (the AI and BI sections) built close relationships with each
other and with the similarly densely linked Computer Systems group.
Evaluation through analysis
• Our experiments also show the robustness of centrality measures
such as degree, closeness and betweenness.
• For example, if we compute the list of the top 20 nodes by degree,
closeness and betweenness we find an overlap of 55%, 65% and 50%,
respectively.
• While this is not very high, it signifies a higher agreement than would
have been expected: the correlation between the values is 0.67, 0.49,
0.22, respectively. The higher correlation of degrees can be explained
by the fact that we optimize on degree when we adjust our networks
on precision/recall.
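The top-20 overlap comparison can be sketched as follows; `degrees` and `top_k_overlap` are illustrative helpers, and the toy example uses k = 2 rather than 20.

```python
def degrees(edges, nodes):
    """Degree of each node in an undirected edge list."""
    deg = {n: 0 for n in nodes}
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg

def top_k_overlap(score_a, score_b, k):
    """Fraction of shared members among the top-k nodes of two
    centrality rankings (e.g. survey-based vs web-based degree)."""
    top_a = set(sorted(score_a, key=score_a.get, reverse=True)[:k])
    top_b = set(sorted(score_b, key=score_b.get, reverse=True)[:k])
    return len(top_a & top_b) / k
```

Computing `top_k_overlap` for degree, closeness and betweenness scores from the two networks yields the 55%, 65% and 50% figures reported above.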
