Web Science: An Interdisciplinary Approach to Understanding the Web

Communications of the ACM | July 2008 | Vol. 51 | No. 7
contributed articles
Figure 1. The social interactions enabled by the Web put demands on the Web applications behind them, in turn putting further demands on the Web's infrastructure. (Layers: Social Interactions → Application Needs → Infrastructure Requirements.)
Figure 2. [Diagram; labels include: Idea, creativity, Design, Technology, micro, analysis, Social, Issues, macro, complexity.]
of computer use. However, in many of
these courses, the Web itself is treated as a specific instantiation of more
general principles. In other cases, the
Web is treated primarily as a dynamic
content mechanism that supports the
social interactions among multiple
browser users. Whether in CS studies
or in information-school courses, the
Web is often studied exclusively as the
delivery vehicle for content, technical
or social, rather than as an object of
study in its own right.
Here, we present the emerging interdisciplinary field of Web science,5,6 which takes the Web as its primary object of study. We show there is significant interplay among the social interactions enabled by the Web's design, the
scalable and open applications development mandated to support them,
and the architectural and data requirements of these large-scale applications
(see Figure 1). However, the study of
the relationships among these levels
is often hampered by the disciplinary
boundaries that tend to separate the
study of the underlying networking
from the study of the social applications. We identify some of these relationships and briefly review the status
of Web-related research within computing. We primarily focus on identifying emerging and extremely challenging problems researchers (in their role
as Web scientists) need to explore.
What Is It?
Where physical science is commonly
regarded as an analytic discipline that
aims to find laws that generate or explain observed phenomena, CS is predominantly (though not exclusively)
synthetic, in that formalisms and algorithms are created in order to support
specific desired behaviors. Web science
deliberately seeks to merge these two
paradigms. The Web needs to be studied and understood not only as a phenomenon
but also as something to be engineered
for future growth and capabilities.
At the micro scale, the Web is an infrastructure of artificial languages and
protocols; it is a piece of engineering.
However, it is the interaction of human
beings creating, linking, and consuming information that generates the
Web's behavior as emergent properties at the macro scale. These properties often generate surprising proper-
tion, and later an industry, in its own
right. In other cases, the large-scale system may have emergent properties that
were not predictable by analyzing the
micro technical and/or social effects.
Dealing with these issues can lead to
subsequent generations of technology.
For example, the enormous success of
search engines has inevitably yielded
techniques to game the algorithms (an
unexpected result) to improve search
rank, leading, in turn, to the development of better search technologies to
defeat the gaming.
The essence of our understanding of
what succeeds on the Web and how to
develop better Web applications is that
we must create new ways to understand
how to design systems to produce the
effect we want. The best we can do today
is design and build in the micro, hoping for the best, but how do we know if we've built in the right functionality to ensure the desired macro-scale effects? How do we predict other side effects and the emergent properties of the macro? Further, as the success or failure of a particular Web technology may involve aspects of social interaction among users, a topic we return to later, understanding the Web requires analysis not only of technological issues but also of the social dynamics of perhaps millions of users.
Given the breadth of the Web and its
inherently multi-user (social) nature,
its science is necessarily interdisciplinary, involving at least mathematics, CS,
artificial intelligence, sociology, psychology, biology, and economics. We
invite computer scientists to expand
the discipline by addressing the challenges following from the widespread
adoption of the Web and its profound
influence on social structures, political
systems, commercial organizations,
and educational institutions.
Beneath the Web Graph
One way to understand the Web, familiar to many in CS, is as a graph whose
nodes are Web pages (defined as static
HTML documents) and whose edges
are the hypertext links among these
nodes. This was named the "Web graph" in22, which also included the first related analysis. The in-degree of the Web graph was shown in Kleinberg et al.3 and Kumar et al.24 to follow a power-law distribution; a similar effect was shown in Broder et al.10 for the out-branching of vertices in the graph. An important result in Dill et al.12 showed that large samples of the Web, generated through a variety of methods, all had similar properties, an important finding as the Web graph grows, reported in 2005 to be on the order of seven million new pages a day.17 Various models have
been proposed as to how the Web graph
grows and which models best capture
its evolution; see Donato et al.14 for an
analysis of a number of these models
and their properties.
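The power-law finding can be illustrated with a few lines of standard-library Python: count each node's in-degree, then tabulate how many nodes share each in-degree value. The toy edge list below is invented; on real crawl data, plotting this distribution on log-log axes yields the roughly straight line characteristic of a power law.

```python
from collections import Counter

def in_degree_distribution(edges):
    """For each in-degree k, count how many nodes have that in-degree."""
    indeg = Counter(dst for _, dst in edges)
    return Counter(indeg.values())

# A toy "Web graph": each edge is a (source_page, target_page) hyperlink.
edges = [
    ("a", "hub"), ("b", "hub"), ("c", "hub"), ("d", "hub"),
    ("a", "b"), ("c", "d"), ("hub", "a"),
]
dist = in_degree_distribution(edges)
# "hub" is the one node with in-degree 4; three nodes have in-degree 1.
```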
Along with analyses of this graph
and its growth, a number of algorithms
have been devised to exploit various
properties of the graph. For example,
the HITS algorithm23 and PageRank9
assume that the insertion of a hyperlink from one page to another can be
taken as a sort of endorsement of the
authority of the page being linked to,
an assumption that led to the development of powerful search engines for
finding pages on the Web. While modern search engines use a number of
heuristics beyond these page-authority calculations, due in part to competitive pressure from those trying to
spoof the algorithms and get a higher
rank, these Web-graph-based models
still form the heart of the critical crawlers and rank-assessment algorithms
behind Web search.
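The endorsement intuition behind PageRank can be sketched as a simple power iteration over such a link graph. This is a textbook-style simplification, not the production algorithm of the cited paper; the damping value 0.85 is the conventional choice.

```python
def pagerank(links, damping=0.85, iters=50):
    """links maps each page to the pages it links to; returns a rank per page."""
    nodes = set(links) | {n for outs in links.values() for n in outs}
    count = len(nodes)
    rank = {n: 1.0 / count for n in nodes}
    for _ in range(iters):
        new = {n: (1.0 - damping) / count for n in nodes}
        for node in nodes:
            outs = links.get(node, [])
            if outs:
                # A hyperlink acts as an endorsement: share rank among targets.
                for dst in outs:
                    new[dst] += damping * rank[node] / len(outs)
            else:
                # Dangling page with no outlinks: spread its rank uniformly.
                for dst in nodes:
                    new[dst] += damping * rank[node] / count
        rank = new
    return rank

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
rank = pagerank(links)
# "c" is endorsed by both "a" and "b", so it ends up with the highest rank.
```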
The links in this Web graph represent single instantiations of the
results of calling the HTTP protocol
with a GET request that returns a particular representation (in this case an
HTML page) of a document based on
a universal resource identifier (URI)
that serves as an identifier common
across the entire Web. So, for example,
the URI http://www.acm.org/publications/cacm typed into a standard Web
browser invokes the hypertext transfer
protocol (HTTP) and returns an HTML
page that contains content describing
the publication known as Communications of the ACM. Note, however, that
the content itself contains other URIs
that are themselves pointers to objects
that are also displayed (such as icons
and images) and that the formatting of
the page itself may require retrieving
other resources (such as cascading style sheets) or XML DTD documents. So
what we might naively view as a single link from, say, a research group's Web page may in fact involve a set of complex URIs that use GET requests to pass on state,a thus obscuring the identity of the actual resources.
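As this discussion notes, fetching one page typically drags in further URIs for images, stylesheets, and linked documents. A minimal sketch of harvesting those embedded URIs, using Python's standard html.parser over a made-up page; all markup and URIs here are invented for illustration:

```python
from html.parser import HTMLParser

class ResourceCollector(HTMLParser):
    """Collect the URIs a browser would also have to fetch or follow."""
    def __init__(self):
        super().__init__()
        self.resources = []
    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "script") and "src" in attrs:
            self.resources.append(attrs["src"])   # embedded objects
        elif tag == "link" and "href" in attrs:
            self.resources.append(attrs["href"])  # e.g. stylesheets
        elif tag == "a" and "href" in attrs:
            self.resources.append(attrs["href"])  # hypertext links

# A made-up page standing in for a single "node" in the Web graph:
page = """<html><head><link rel="stylesheet" href="style.css"></head>
<body><img src="logo.png"><a href="http://example.org/group">our group</a></body></html>"""
collector = ResourceCollector()
collector.feed(page)
# collector.resources == ["style.css", "logo.png", "http://example.org/group"]
```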
URIs that carry state are used heavily in Web applications but are, to
date, largely unanalyzed. For example, in a June 2007 talk, Udi Manber,
Google's VP of engineering, addressed the issue of why Web search is so difficult,25 explaining that on an average day, 20% to 25% of the searches seen by
Google have never been submitted before and that each of these searches
generates a unique identifier (using
server-specific encoding information).
So a Web-graph model would represent only the requesting document
(whether a user request or a request
generated by, for example, a dynamic
advertisement content request) linked
to the www.google.com node. However, if, as is widely reported, Google receives more than 100 million queries per day, and if 20% of them are unique, then more than 20 million links, represented as new URIs that encode the search term(s), should show up in the Web graph every day, or more than 200 per second. Do these links follow the same power laws? Do the same growth models explain these behaviors? We simply don't know.
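The distinction between a stable resource and the state its URI carries can be seen with Python's standard urllib.parse. The example URIs are illustrative, not the actual encoding Google uses:

```python
from urllib.parse import urlparse, parse_qs

def uri_state(uri):
    """Split a URI into its stable resource part and the query state it carries."""
    parts = urlparse(uri)
    resource = f"{parts.scheme}://{parts.netloc}{parts.path}"
    state = parse_qs(parts.query)
    return resource, state

# Two searches hit the same resource but carry different state, so a naive
# Web-graph model would mint a brand-new node for each unique query.
r1, s1 = uri_state("http://www.google.com/search?q=web+science&hl=en")
r2, s2 = uri_state("http://www.google.com/search?q=emergent+behavior&hl=en")
# r1 == r2, but s1["q"] != s2["q"]
```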
Analyzing the Web solely as a graph
also ignores many of its dynamics (especially at short timescales). Many
phenomena known to Web users (such
as denial-of-service attacks caused by
flooding a server and the need to click
the same link multiple times before getting a response) cannot be explained by
the Web-graph model and often can't
be expressed in terms amenable to
such graph-based analysis. Representing them at the networking level, ignoring protocols and how they work, also
misses key aspects of the Web, as well
as a number of behaviors that emerge
from the interactions of millions of requests hitting many thousands of servers every second. Web dynamics were analyzed more than a decade ago,20 but the combination of (i) the exponential growth in the amount of Web content, (ii) the change in the number, power, and diversity of Web servers and applications, and (iii) the increasing number of diverse users from everywhere in the world makes a similar analysis impossible today without creating and validating new models of the Web's dynamics. Such models must also pay special attention to the details of the Web's architecture, as well as to the complexity of the interactions actually taking place there.

a. These characters, including ?, #, =, and &, followed by keywords, may follow the last slash in the URI, thus making for the long URIs often generated by dynamic-content servers.
Additionally, modern, sophisticated Web sites provide powerful
user-interface functionality by running large script systems within the
browser. These applications access the
underlying remote data model through
Web APIs. This application architecture allows users and entrepreneurs
to quickly build many new forms of
global systems using the processing
power of users machines and the storage capacity of a mass of conventional
Web servers. Like the basic Web, each
such system is interesting mainly for
its emergent macro-scale properties,
of which we have little understanding.
Are such systems stable? Are they fair?
Do they effectively create a new form
of currency? And if they do, should it
be regulated?
Similarly, many user-generated
content sites now store personal information yet have rather simplistic
systems to restrict access to a person's friends. This information is not available to wide-scale analysis. Other sites must be allowed access by posing as the user or as a friend; a number of three-party authentication protocols are being deployed to allow this. A complex system is thus being built piece by piece, with no invariants (such as "my employer will never see this picture") assured for the user.
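A toy model makes the missing-invariants point concrete. All names, policies, and the "photo_app" delegate below are hypothetical; the sketch only illustrates how friend-level delegation can quietly break an invariant such as "my employer will never see this picture":

```python
# Toy access-control model; all names and policies are made up.
friends = {"alice": {"bob", "photo_app"}}      # photo_app is granted friend status
photo_policy = {"beach.jpg": ("alice", "friends-only")}

def can_view(viewer, photo):
    """Check the site's simple friends-only rule for one photo."""
    owner, rule = photo_policy[photo]
    return viewer == owner or (rule == "friends-only" and viewer in friends[owner])

# The third-party app may re-expose what it fetched; if alice's employer
# reads the app's public feed, the intended invariant silently fails.
app_feed = [p for p in photo_policy if can_view("photo_app", p)]

assert not can_view("employer", "beach.jpg")   # direct access is blocked...
assert "beach.jpg" in app_feed                 # ...but the delegate can republish it
```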
The purpose of this discussion is not
to go into the detail of Web protocols
or the relative merits of Web-modeling
approaches but to stress that they are
critical to the current and continued
working of the Web. Understanding
the protocols and issues is important
to understanding the Web as a technical construct and to analyzing and
modeling its dynamic nature. Our ability to engineer Web systems with desirable properties at scale requires that
we understand these dynamics. This
analysis and modeling are thus an important challenge to computer scientists if they are to be able to understand
the growth and behaviors of the future
Web, as well as to engineer systems
with desired properties in a way that is
significantly less hit-or-miss.
From Power Laws to People
Mathematically based analysis of the
Web involves another potential failing.
Whereas the structure and use of various Web sites (taken mathematically)
may have interesting properties, these
properties may not be very useful in explaining the behavior of the sites over
time. Consider the following example:
Wikipedia (www.wikipedia.org), the
[Figure: log-log plot of P(k) versus k, with k ranging from 1 to 10^7.]
ect or site; rather, technology is needed
to allow user communities to construct,
share, and adapt social machines so
successful models evolve through trial,
use, and refinement.
A number of research challenges
and questions must be resolved before
a new generation of interacting social
machines can be created and evolved
this way:
- What are the fundamental theoretical properties of social machines, and what kinds of algorithms are needed to create them?
- What underlying architectural principles are needed to guide the design and efficient engineering of new Web infrastructure components for this social software?
- How can we extend the current Web infrastructure to provide mechanisms that make the social properties of information-sharing explicit and guarantee that the use of this information conforms to relevant social-policy expectations?
- How do cultural differences affect the development and use of social mechanisms on the Web? As the Web is indeed worldwide, the properties desired by one culture may be seen as counterproductive by others. Can Web infrastructure help bridge cultural divides and/or increase cross-cultural understanding?
In addition, a crucial aspect of human interaction with information is
our ability to represent and reason
over such attributes as trustworthiness, reliability, and tacit expectations
about the use of information, as well as
about privacy, copyright, and other legal rules. While some of this information is available on the Web today, we
lack structures for formally representing and computing over them. Traditional cryptographic security research
and well-known access-control-policy
frameworks have failed to meet these
challenges in today's online environment and are thus insufficient as a
foundation for the social machines of
the future. Recent work on formal models for privacyb has demonstrated that
traditional cryptographic approaches
to privacy protection can fail in open
Web environments. Similar problems
with copyright enforcement have
also hampered the flow of commercial and scholarly information on the
though they are not good general
search terms. On the other hand, in a
specific social context (such as a particular person's photos), the same tag
can be useful since it can designate a
particular individual. The use of a tag
as metadata often depends on such a
context, and the "network effect" in these sites is thus socially organized.19
A more ambitious use of metadata
involves recent applications of semantic Web technologies7 and represents
an important paradigm shift that is a
significant element of emerging Web
technologies. The semantic Web represents a new level of abstraction from
the underlying network infrastructure,
as the Internet and Web did earlier.
The Internet allowed programmers to
create programs that could communicate without concern for the network
of cables through which the communication had to flow. The Web allows programmers and users to work with a set
of interconnected documents without
concern for the details of the computers storing and exchanging them.
The semantic Web will allow programmers and users alike to refer to
real-world objects (people, chemicals, agreements, stars, whatever) without
concern for the underlying documents
in which these things, abstract and
concrete, are described. While basic
semantic Web technologies have been
defined and are being deployed more
widely, little work has sought to explain
the effect of these new capabilities on
the connections within the Web of people who use them.28
The semantic Web arena reflects two principal nexuses of activity. One centers on data (and the Web), the other on the domain (and semantics). The first, based largely on innovation in data-integration applications, focuses on developing Web applications that employ only limited semantics but provide a powerful mechanism for linking data entities using the URIs that are the basis of the Web. Powered by RDF, the Resource Description Framework, these applications focus largely on querying graph-oriented triple-store databases using the emerging SPARQL language, which helps create Web applications and portals that use REST-based models, integrating data from multiple sources without preexisting schema. The second, based largely on the Web Ontology Language, or OWL,
interactions of people enabled by the Web's technology base. We must therefore understand the social machines that may be the critical difference between the success and failure of Web applications and learn to build them in a way that allows interlinking and sharing.
Acknowledgments
Figure 2 is taken from talks Tim Berners-Lee gave in 2007 (www.w3.org/2007/Talks/1018-websci-mit-tbl/Overview.html). We also thank the
other members of the WSRI Scientific
Council (webscience.org/about/people/) for input relating to the goals of
Web science and the interaction of the
Web and computer and information
sciences. We are indebted to Konstantin Mertsalov of Rensselaer Polytechnic
Institute for the DBpedia analysis discussed in the section on power laws.
References
1. Abadi, D., Marcus, A., Madden, S., and Hollenbach, K. Scalable semantic Web data management using vertical partitioning. In Proceedings of the 33rd International Conference on Very Large Data Bases (Vienna, Austria, Sept. 23–27). VLDB Endowment, Heidelberg, 2007.
2. Backstrom, L., Dwork, C., and Kleinberg, J. Wherefore art thou R3579X? Anonymized social networks, hidden patterns, and structural steganography. In Proceedings of the 16th International World Wide Web Conference (Banff, Alberta, Canada, May 8–12). ACM Press, New York, 2007.
3. Barabasi, A. and Albert, R. Emergence of scaling in random networks. Science 286 (1999).
4. Berners-Lee, T., Connolly, D., Kagal, L., Scharf, Y., and Hendler, J. N3Logic: A logical framework for the World Wide Web. Theory and Practice of Logic Programming (2008).
5. Berners-Lee, T., Hall, W., Hendler, J., Shadbolt, N., and Weitzner, D. Creating a science of the Web. Science 311 (2006).
6. Berners-Lee, T., Hall, W., Hendler, J., O'Hara, K., Shadbolt, N., and Weitzner, D. A framework for Web science. Foundations and Trends in Web Science 1, 1 (Sept. 2006).
7. Berners-Lee, T., Hendler, J., and Lassila, O. The semantic Web. Scientific American (May 2001).
8. Berners-Lee, T. and Fischetti, M. Weaving the Web: The Original Design and Ultimate Destiny of the World Wide Web. Harper Collins, New York, 1999.
9. Brin, S. and Page, L. The anatomy of a large-scale hypertextual Web search engine. Presented at the Sixth International World Wide Web Conference (Santa Clara, CA, Apr. 7–11, 1997).
10. Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., Tomkins, A., and Wiener, J. Graph structure in the Web. In Proceedings of the Ninth International World Wide Web Conference (Amsterdam, The Netherlands, May 15–19). Elsevier, Amsterdam, The Netherlands, 2000.
11. Dean, J. and Ghemawat, S. MapReduce: Simplified data processing on large clusters. In Proceedings of the Sixth Symposium on Operating System Design and Implementation (San Francisco, Dec. 6–8). USENIX Association, Berkeley, CA, 2004.
12. Dill, S., Kumar, R., McCurley, K., Rajagopalan, S., Sivakumar, D., and Tomkins, A. Self-similarity in the Web. In Proceedings of the 27th International Conference on Very Large Data Bases (Rome, Italy, Sept. 11–14). Morgan Kaufmann Publishers, Inc., San Francisco, 2001.
13. Domingos, P., Golbeck, J., Mika, P., and Nowak, A. Social networks and intelligent systems. IEEE