Module I
L T P J C
3 0 0 4 4
Dr. J ANURADHA
Associate Professor,
School of Computer Science and Engineering,
VIT University, Vellore, TN, India – 632 014.
SYLLABUS
1. INTRODUCTION TO WEB
2. WEB CRAWLING
3. INDEXING
7. QUERY PROCESSING
8. RECENT TRENDS
VIT University
2
SYLLABUS
➢ WEB CRAWLING
➢ WEB INDEXING
whole.
SYLLABUS
➢ WEB STRUCTURE MINING
services.
WEB MINING
data warehouses.
WEB MINING
Some of the valuable end users for web mining results are:
Module – 1
INTRODUCTION TO WWW
1. Architecture of WWW
3. Web Security
WWW
❖ What is WWW?
❖ WWW vs. Internet
❖ Web Language
❖ HTML
❖ Web Protocols
❖ HTTP / FTP
❖ URL
❖ HTTP / HTTPS
❖ FTP / MAILTO / USENET
❖ URI / URL / URN
❖ W3C
❖ Web Standards / Web Accessibility Standards
BASIC COMPONENTS OF THE WEB
• Web Servers
• Servers
• Web Clients
• HTTP Protocols
• Browser
WEB ARCHITECTURE
• A Uniform Resource Identifier (URI) is used to uniquely identify
resources on the web, and Unicode makes it possible to build web
pages that can be read and written in human languages.
WEB ARCHITECTURE
• RDF Schema (RDFS) allows more standardized description
of taxonomies and other ontological constructs.
WEB ARCHITECTURE
• RIF and SWRL offer rules beyond the constructs that are available
from RDFS and OWL. SPARQL Protocol and RDF Query Language
(SPARQL) is an SQL-like language used for querying RDF data and OWL
ontologies.
• All semantics and rules are executed at the layers below Proof, and their
results are used to prove deductions.
WEB OPERATION
WEB CHALLENGES
❖ As originally proposed by Tim Berners-Lee, the Web was intended
to improve the management of general information about
accelerators and experiments at CERN.
WEB CHALLENGES
❖ The framework proposed by Berners-Lee was very general and
would work very well for any set of documents, providing flexibility
and convenience in accessing large amounts of text.
TimBL Proposed Web vs. Today’s Web
❖ Today’s Web is huge and grows incredibly fast.
❖ About 10 years after the Berners-Lee proposal, the Web was
estimated to have 150 million nodes (pages) and 1.7 billion
edges (links).
WEB CHALLENGES
❖ There are several ways to achieve this:
❖ Use the existing Web and apply sophisticated search
techniques.
❖ Change the way in which we create web pages.
WEB SEARCH ENGINE
A web search engine is a coordinated set of programs that includes:
comScore report February 2016
comScore Explicit Core Search Report* (Desktop Only)
Google is also dominating the mobile/tablet search engine market share with 89%!
WEB SEARCH ENGINE
❖ Web search engines explore the existing (semantics-free) structure
of the Web and try to find documents that match user search
criteria.
❖ The basic idea is to use a set of words (or terms) that the user
specifies and retrieve documents that include (or do not include)
those words. – Keyword Search
❖ Further IR techniques are used to avoid terms that are too general
or too specific and to take into account term distribution
throughout the entire body of documents as well as to explore
document similarity.
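One common way to account for term distribution across the whole collection is TF-IDF weighting. The sketch below is only an illustration under stated assumptions: the three documents are hypothetical, and the weight used is raw term frequency times log(N / df), which is one of several variants in use.

```python
import math
from collections import Counter

# Toy corpus (hypothetical documents, for illustration only).
docs = {
    "d1": "computer science department offers computer programming courses",
    "d2": "the music department offers a program of study",
    "d3": "computer lab hours for the science department",
}

def tf_idf(docs):
    """Return {doc_id: {term: weight}} with weight = tf * log(N / df)."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))
    weights = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        weights[doc_id] = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    return weights

weights = tf_idf(docs)
# "department" occurs in every document, so its weight is zero (too general).
# "programming" occurs in only one document, so it gets a positive weight.
print(weights["d1"]["department"], weights["d1"]["programming"])
```

A term that appears everywhere carries no discriminating power, which is why its weight drops to zero here.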
WEB SEARCH ENGINE
❖ Natural language processing approaches are also used to analyze
term context or lexical information, or to combine several terms into
phrases.
TOPIC DIRECTORIES
➢ Web pages are organized into hierarchical structures that reflect
their meaning. These are known as topic directories, or simply
directories, and are available from almost all web search portals.
TOPIC DIRECTORIES
➢ The directories are usually created manually with the help of
thousands of web page creators and editors.
SEMANTIC WEB
➢ The Semantic Web is an initiative led by the World Wide Web Consortium (W3C, w3c.org).
SEMANTIC WEB
➢ The main idea behind the semantic web is to add formal descriptive
material to each web page that although invisible to people would
make its content easily understandable by computers.
➢ Thus, the Web would be organized and turned into the largest
knowledge base in the world, which with the help of advanced
reasoning techniques developed in the area of artificial intelligence
would be able not just to provide ranked documents that match a
keyword search query, but would also be able to answer questions
and give explanations.
http://www.w3.org/2001/sw/
WEB SEARCH ENGINE CHALLENGES
• Spam
• Content Quality
• Quality Evaluation
• Web Convention
• Duplicate Hosts
• Vaguely-Structured Data
➢ Search Engine Optimization (SEO) involves getting a site the exposure it deserves
on the most targeted keywords, leading to satisfactory search experiences.
➢ [Silverstein et al., 1999] showed that for 85% of the queries only the first
result screen is requested.
➢ Thus, inclusion in the first result screen, which usually shows the top 10
results, can lead to an increase in traffic to a web site, while exclusion
means that only a small fraction of the users will actually see a link to the
web site.
➢ Some web authors deliberately manipulate their pages to obtain a higher
ranking than they deserve. The result of this process is commonly called
search engine spam. To achieve high rankings, authors use one of the
following methods:
➢ a text-based approach
➢ a link-based approach
➢ a cloaking approach (spamdexing technique - SEO)
➢ a combination thereof.
SEARCH ENGINE SPAM
➢ Unfortunately, spamming has become so prevalent that every
commercial search engine has had to take measures to identify and
remove spam. Without such measures, the quality of the rankings
suffers severely.
CONTENT QUALITY
➢ The web is full of noisy, low-quality, unreliable, and indeed
contradictory content.
CONTENT QUALITY
➢ There have been link-based approaches, for instance PageRank
[Brin and Page, 1998], for estimating the quality of web pages.
However, PageRank only uses the link structure of the web to
estimate page quality.
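To show how a score can be derived from link structure alone, here is a minimal power-iteration sketch of PageRank. The four-page graph, the damping factor of 0.85, and the fixed iteration count are assumptions chosen for illustration. It is not a description of any production search engine.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration on a dict {page: [pages it links to]}."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small base share (the "random jump").
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical four-page web graph.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))  # the page with the highest score
```

Note that the score depends only on who links to whom. The text of the pages is never inspected.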
QUALITY EVALUATION
➢ Evaluating the quality of different ranking algorithms is a notoriously
difficult problem.
➢ Users usually will not make the effort to give explicit feedback but
nonetheless leave implicit feedback information such as the results
on which they clicked.
WEB CONVENTION
➢ Since most authors behave this way, we will refer to these rules as web
conventions, even though there has been no formalization or
standardization of such rules.
➢ The main issue here is to identify the various conventions that have evolved
organically and to develop techniques for accurately determining when the
conventions are being violated.
DUPLICATE HOSTS
➢ Web search engines try to avoid crawling and indexing duplicate
and near-duplicate pages, as they do not add new information to the
search results and clutter up the results.
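One widely used way to spot near-duplicate pages is to compare sets of word shingles with the Jaccard coefficient. The slides do not say which method a given engine uses, so treat the sketch below as an assumption. The page texts and the shingle size k=3 are hypothetical.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (contiguous word windows) of a text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

page_a = "web search engines try to avoid crawling duplicate pages"
page_b = "web search engines try to avoid indexing duplicate pages"
page_c = "topic directories organize pages into hierarchical structures"

sim_ab = jaccard(shingles(page_a), shingles(page_b))  # pages differ in one word
sim_ac = jaccard(shingles(page_a), shingles(page_c))  # unrelated pages
print(sim_ab, sim_ac)
```

A crawler could skip a page whose similarity to an already indexed page exceeds a chosen threshold. The threshold itself is a tuning decision.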
VAGUELY-STRUCTURED DATA
➢ The degree of structure present in data has a strong influence
on techniques used for search and retrieval.
➢ Of late, there has been some movement toward the middle with the
database literature considering the imposition of structure over
almost-structured data.
VAGUELY-STRUCTURED DATA
➢ In a similar vein, document management systems use accumulated
meta-information to introduce more structure.
SEMANTIC WEB vs. SEMANTIC SEARCH
• The Semantic Web is a set of technologies for representing,
storing, and querying information. Although these technologies can
be used to store textual data, they typically are used to store
smaller bits of data.
SEMANTIC WEB vs. SEMANTIC SEARCH
• Semantic search focuses on the text, but the semantic web focuses
on pulling data from multiple sources and multiple formats.
• For example, if you type the word “Blackhawks” into your search
bar, you don’t just want listings that contain the word
“Blackhawks”. A semantic search will return listings about
the Native American tribe as well as the Chicago hockey team. You will
also get results for supporting terms like “hockey lessons” and “Stanley Cup,”
even if they never mention the Blackhawks by name.
SEMANTIC WEB SEARCH ENGINE CHALLENGES
➢ The architecture of a semantic search engine must scale to the
Web.
➢ Data from the Web is of varying quality, which poses challenges for
data cleansing and entity consolidation.
➢ The schema of the data is not known a priori, which makes building
user interfaces difficult.
SEMANTIC WEB SEARCH ENGINE ARCHITECTURE
WEB CRAWLERS
❖ The approach is to have web pages organized by topic or to search a
collection of pages indexed by keywords.
❖ The former is done by topic directories and the latter, by search
engines.
❖ Ideally, all web pages are linked (there are no unconnected parts of the
web graph) and there are no multiple links and nodes. Then the job of a
crawler is simple:
❖ to run a complete graph search algorithm, such as depth-first or
breadth-first search, and store all visited pages.
❖ The problem with using structured data is the cost associated with the
process of structuring them. The information that people use is
available primarily in unstructured form.
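The graph search performed by a crawler can be sketched as a breadth-first traversal. To keep the example self-contained, the "Web" here is a hypothetical in-memory dictionary of links and `fetch_links` is an assumed callback. A real crawler would download each page and extract its links instead.

```python
from collections import deque

# Hypothetical link structure standing in for the real Web:
# each key is a page URL, each value is the list of URLs it links to.
WEB = {
    "http://example.org/": ["http://example.org/a", "http://example.org/b"],
    "http://example.org/a": ["http://example.org/b", "http://example.org/c"],
    "http://example.org/b": ["http://example.org/"],
    "http://example.org/c": [],
}

def crawl(seed, fetch_links):
    """Breadth-first traversal from seed; returns pages in discovery order."""
    visited = [seed]
    seen = {seed}
    frontier = deque([seed])
    while frontier:
        url = frontier.popleft()
        for link in fetch_links(url):
            if link not in seen:      # skip pages already queued or visited
                seen.add(link)
                visited.append(link)
                frontier.append(link)
    return visited

order = crawl("http://example.org/", lambda url: WEB.get(url, []))
print(order)
```

The `seen` set is what handles repeated links and cycles, which the ideal case in the slide assumes away.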
INDEXING AND KEYWORD SEARCH
❖ The first step is to fetch the documents from the Web, remove the
HTML tags, and store the documents as plain text files.
❖ Find documents that contain the word computer and the word
programming.
❖ Find documents that contain the word program, but not the word
programming.
❖ Find documents where the words computer and lab are adjacent.
This query is called proximity query, because it takes into account
the lexical distance between words. Another way to do it is by
searching for the phrase computer lab.
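The three example queries can be answered with a positional inverted index. The following is a minimal sketch: the four documents are hypothetical, and tokenization is reduced to lowercasing and splitting on whitespace.

```python
from collections import defaultdict

# Hypothetical plain-text documents (HTML tags already removed).
docs = {
    1: "computer programming lab",
    2: "computer lab schedule",
    3: "degree program in computer science",
    4: "lab computer repair",
}

def build_index(docs):
    """Map each term to {doc_id: [positions]} (a positional inverted index)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

index = build_index(docs)

def docs_with(term):
    """Set of document ids that contain the term."""
    return set(index.get(term, {}))

# Query 1: computer AND programming
both = docs_with("computer") & docs_with("programming")
# Query 2: program AND NOT programming
only_program = docs_with("program") - docs_with("programming")

def adjacent(first, second):
    """Documents where `second` occurs immediately after `first`."""
    hits = set()
    for doc_id in docs_with(first) & docs_with(second):
        second_positions = set(index[second][doc_id])
        if any(p + 1 in second_positions for p in index[first][doc_id]):
            hits.add(doc_id)
    return hits

# Query 3: the phrase "computer lab"
phrase = adjacent("computer", "lab")
print(both, only_program, phrase)
```

Documents 1 and 4 contain both words of the phrase but not next to each other in that order, so only the stored positions can rule them out.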
WEB DOCUMENT REPRESENTATION
❖ Documents are tokenized; that is, all punctuation marks are removed
and the character strings without spaces are considered as tokens
(words, also called terms).
❖ All characters in the documents and in the query are converted to upper
or lower case.
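The two steps above can be sketched in a few lines. This is a simplified tokenizer under an assumed rule: every character that is not a word character or whitespace is treated as punctuation and replaced by a space.

```python
import re

def tokenize(text):
    """Lowercase the text, drop punctuation, and split on whitespace."""
    text = text.lower()
    # Replace anything that is not a word character (letter, digit,
    # underscore) or whitespace with a space.
    text = re.sub(r"[^\w\s]", " ", text)
    return text.split()

tokens = tokenize("Computer Science: programming, labs & Ph.D. programs!")
print(tokens)
```

Note the side effect on abbreviations: "Ph.D." becomes the two tokens "ph" and "d", which is one reason real tokenizers add special cases.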
WEB DOCUMENT REPRESENTATION
❖ This process, called stemming, uses morphological information to
allow matching different variants of words.
❖ The collection of words that are left in the document after all those steps
is different from the original document and may be considered as a
formal representation of the document.
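A rough sketch of stopword removal followed by stemming is given below. Both the stopword list and the suffix-stripping rule are deliberately tiny and hypothetical. In particular, `naive_stem` is not the Porter stemmer or any other standard algorithm.

```python
# A tiny, hypothetical stopword list; real lists are much longer.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "is", "to", "can"}

def naive_stem(word):
    """Strip a few common English suffixes (illustrative only)."""
    for suffix in ("ing", "s"):
        if suffix == "s" and word.endswith("ss"):
            continue  # leave words such as "class" untouched
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling, e.g. "programm" -> "program".
            if suffix == "ing" and stem[-1] == stem[-2]:
                stem = stem[:-1]
            return stem
    return word

def represent(tokens):
    """Drop stopwords, then stem what is left."""
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]

terms = represent(["the", "programs", "of", "programming", "in", "computer", "labs"])
print(terms)
```

After these steps "programs" and "programming" map to the same term, so a query for one matches documents that contain the other.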
WEB DOCUMENT REPRESENTATION
❖ The words are counted after tokenizing the plain text versions of the
documents (without the HTML structures). The term counts are taken
after removing the stopwords but without stemming.
❖ The terms that occur in a document are in fact the parameters (also
called features, attributes, or variables in different contexts) of the
document representation.
TYPE OF WEB DOCUMENT REPRESENTATION
❖ The simplest way to use a term as a feature in a document
representation is to check whether or not the term occurs in the
document. Thus, the term is considered as a Boolean attribute, so the
representation is called Boolean.
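As a small illustration, the sketch below builds the Boolean representation of one document and, for contrast, a term-frequency vector over the same terms. The vocabulary and the document are hypothetical.

```python
from collections import Counter

vocabulary = ["computer", "lab", "program", "programming"]  # hypothetical term list

def boolean_vector(tokens, vocabulary):
    """1 if the term occurs in the document, 0 otherwise."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocabulary]

def frequency_vector(tokens, vocabulary):
    """Number of occurrences of each vocabulary term in the document."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

doc = "computer lab computer programming".split()
bool_vec = boolean_vector(doc, vocabulary)
freq_vec = frequency_vector(doc, vocabulary)
print(bool_vec, freq_vec)
```

The Boolean vector records only presence, so the second occurrence of "computer" is visible in the frequency vector but not in the Boolean one.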
WEB DOCUMENT REPRESENTATION
❖ Term positions may be included along with the frequency. This is a
“complete” representation that preserves most of the information and
may be used to generate the original document from its representation.
WEB DOCUMENT REPRESENTATION
❖ For example, stemming of programming would change the second
query and allow the first one to return more documents (its original
purpose is to identify the Computer Science department, but stemming
would allow more documents to be returned, as they all include the
word program or programs in the sense of “program of study”).
WEB DOCUMENT REPRESENTATION
❖ This problem is also related to the issue of context (lexical or semantic),
which is generally lost in keyword search.
❖ A partial solution to the latter problem is the use of proximity information
or lexical context.
❖ For this purpose a richer document representation can be used that
preserves term positions.
❖ Some punctuation marks can be replaced by placeholders (tokens
that are left in a document but cannot be used for searching), so
that part of the lexical structure of the document, such as sentence
boundaries, can be preserved.
❖ For example, the word “can” usually appears in the stopword list, but
as a noun it may be important for a query.
WEB SECURITY
WEB APPLICATION SECURITY
“This Site is Secure”
• Most applications state that they are secure because they use SSL.
For example:
This site is absolutely secure. It has been designed to use
128-bit Secure Socket Layer (SSL) technology to prevent
unauthorized users from viewing any of your information. You
may use this site with peace of mind that your data is safe with
us.
WEB APPLICATION SECURITY
• Broken authentication (62%) — This category of vulnerability
encompasses various defects within the application’s login
mechanism, which may enable an attacker to guess weak passwords.
WEB APPLICATION SECURITY
• SQL injection (32%) — This vulnerability enables an attacker to
submit crafted input to interfere with the application’s interaction
with back-end databases. An attacker may be able to retrieve
arbitrary data from the application, interfere with its logic, or execute
commands on the database server itself.
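The effect of crafted input can be demonstrated with Python's built-in `sqlite3` module. The table, the user names, and the attacker string below are hypothetical. The sketch contrasts a query built by string concatenation with one that uses a bound parameter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("alice", "a-secret"), ("bob", "b-secret")],
)

crafted = "nobody' OR '1'='1"   # attacker-supplied input

# Vulnerable: the input is pasted into the SQL text and changes the query's logic.
unsafe_sql = "SELECT secret FROM users WHERE name = '" + crafted + "'"
leaked = conn.execute(unsafe_sql).fetchall()

# Safer: a bound parameter is always treated as data, never as SQL.
safe = conn.execute(
    "SELECT secret FROM users WHERE name = ?", (crafted,)
).fetchall()

print(len(leaked), len(safe))
```

With concatenation the WHERE clause becomes always true and every row is returned. With the placeholder the same string is just a name that matches no user.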
WEB APPLICATION SECURITY
• Information leakage (78%) — This involves cases where an
application divulges sensitive information that is of use to an
attacker in developing an assault against the application, through
defective error handling or other behaviour.
WEB SECURITY MODEL
➢ HTML DOM
➢ SAME ORIGIN POLICY
SAME ORIGIN POLICY
RELAXING SAME ORIGIN POLICY
➢ document.domain property
➢ For example, cooperating scripts in documents loaded from
orders.example.com and catalog.example.com might set their
document.domain properties to “example.com”, thereby making
the documents appear to have the same origin and enabling
each document to read properties of the other.
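To see why the two hosts in this example need such a relaxation, recall that an origin is the combination of scheme, host, and port. The sketch below compares origins in Python. The helper names and the default-port table are assumptions for illustration, not a browser API.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}

def origin(url):
    """Return the (scheme, host, port) triple that defines a URL's origin."""
    parts = urlsplit(url)
    # Fall back to the scheme's default port when none is given explicitly.
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    return (parts.scheme, parts.hostname, port)

def same_origin(url_a, url_b):
    """True when both URLs share scheme, host, and port."""
    return origin(url_a) == origin(url_b)

print(same_origin("http://orders.example.com/cart",
                  "http://catalog.example.com/items"))
```

Because the host names differ, the two documents are in different origins by default, even though they share the parent domain example.com.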
RELAXING SAME ORIGIN POLICY
➢ Cross-document messaging
➢ Calling the postMessage() method on a Window object
asynchronously fires an "onmessage" event in that window,
triggering any user-defined event handlers.
➢ WebSockets
➢ Browsers recognize when a WebSocket URI is used and insert
an Origin: header into the request that indicates the origin of
the script requesting the connection.
WEB APPLICATION HACKER’S METHODOLOGY
General Guidelines
➢ Remember that several characters have special meaning in
different parts of the HTTP request. When you are modifying the
data within requests, you should URL-encode these characters to
ensure that they are interpreted in the way you intend.
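As a small illustration of URL-encoding, Python's `urllib.parse.quote` can encode characters such as `&`, `=`, `?`, `#`, and the space. The parameter value and the target URL below are hypothetical.

```python
from urllib.parse import quote, unquote

payload = "a&b=c d?e#f"            # hypothetical parameter value
encoded = quote(payload, safe="")  # safe="" also encodes "/" if present
url = "http://example.com/search?q=" + encoded

print(encoded)
print(unquote(encoded) == payload)
```

Without encoding, the `&` would start a new parameter and the `#` would cut the request off at the fragment, so the server would never see the intended value.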
WEB APPLICATION HACKER’S METHODOLOGY
General Guidelines
➢ Applications typically accumulate an amount of state from previous
requests, which affects how they respond to further requests.
Sometimes, when you are trying to investigate a tentative
vulnerability and isolate the precise cause of a particular piece of
anomalous behaviour, you must remove the effects of any
accumulated state.
WEB APPLICATION HACKER’S METHODOLOGY
General Guidelines
➢ Some applications use a load-balanced configuration in which
consecutive HTTP requests may be handled by different back-end
servers at the web, presentation, data, or other tiers.
➢ Some successful attacks will result in a change in the state of
the specific server that handles your requests — such as the
creation of a new file within the web root.
➢ To isolate the effects of particular actions, it may be necessary
to perform several identical requests in succession, testing the
result of each until your request is handled by the relevant
server.
WEB APPLICATION HACKER’S METHODOLOGY
❖ Map the Application’s Content
❖ Analyse the Application
❖ Test Client-Side Controls
❖ Test the Authentication Mechanism
❖ Test the Session Management Mechanism
❖ Test Access Controls
❖ Test for Input-Based Vulnerabilities
❖ Test for Function-Specific Input Vulnerabilities
❖ Test for Logic Flaws
❖ Test for Shared Hosting Vulnerabilities
❖ Test for Application Server Vulnerabilities
❖ Miscellaneous Checks
❖ Follow Up Any Information Leakage
WEB APPLICATION HACKER’S METHODOLOGY
Map the Application’s Content
➢ Explore Visible Content
➢ Consult Public Resources
➢ Discover Hidden Content
➢ Discover Default Content
➢ Enumerate Identifier-Specified Functions
➢ Test for Debug Parameters
Analyse the Application
➢ Identify Functionality
➢ Identify Data Entry Points
➢ Identify the Technologies Used
➢ Map the Attack Surface
WEB APPLICATION HACKER’S METHODOLOGY
Test Client-Side Controls
➢ Test Transmission of Data Via the Client
➢ Test Client-Side Controls Over User Input
➢ Test Browser Extension Components
Test the Authentication Mechanism
➢ Understand the Mechanism
➢ Test Password Quality
➢ Test for Username Enumeration
➢ Test Resilience to Password Guessing
➢ Test Any Account Recovery Function
➢ Test Any Remember Me Function
➢ Test Any Impersonation Function
WEB APPLICATION HACKER’S METHODOLOGY
➢ Test Username Uniqueness
➢ Test Predictability of Autogenerated Credentials
➢ Check for Unsafe Transmission of Credentials
➢ Check for Unsafe Distribution of Credentials
➢ Test for Insecure Storage
➢ Test for Logic Flaws
➢ Exploit Any Vulnerabilities to Gain Unauthorized Access
Test the Session Management Mechanism
➢ Understand the Mechanism
➢ Test Tokens for Meaning
➢ Test Tokens for Predictability
➢ Check for Insecure Transmission of Tokens
WEB APPLICATION HACKER’S METHODOLOGY
➢ Check for Disclosure of Tokens in Logs
➢ Check Mapping of Tokens to Sessions
➢ Test Session Termination
➢ Check for Session Fixation
➢ Check for CSRF
➢ Check Cookie Scope
Test Access Controls
➢ Understand the Access Control Requirements
➢ Test with Multiple Accounts
➢ Test with Limited Access
➢ Test for Insecure Access Control Methods
WEB APPLICATION HACKER’S METHODOLOGY
Miscellaneous Checks
➢ Check for DOM-Based Attacks
➢ Check for Local Privacy Vulnerabilities
➢ Check for Weak SSL Ciphers
➢ Check Same-Origin Policy Configuration
Thank You