CSE3024 WEB MINING

L T P J C
3 0 0 4 4

Dr. J ANURADHA
Associate Professor,
School of Computer Science and Engineering,
VIT University, Vellore, TN, India – 632 014.
SYLLABUS

1. INTRODUCTION TO WEB

2. WEB CRAWLING

3. INDEXING

4. WEB STRUCTURE MINING

5. WEB CONTENT MINING

6. WEB USAGE MINING

7. QUERY PROCESSING

8. RECENT TRENDS

VIT University
2
SYLLABUS

➢ WEB CRAWLING

❖ A web crawler (also known as a web spider or web robot) is a program or automated script which browses the World Wide Web in a methodical, automated manner.

➢ WEB INDEXING

❖ Web indexing (or Internet indexing) refers to various methods for indexing the contents of a website or of the Internet as a whole.

VIT University
3
SYLLABUS
➢ WEB STRUCTURE MINING

❖ Web structure mining tries to discover useful knowledge from the structure of hyperlinks.

➢ WEB CONTENT MINING

❖ Web content mining aims to extract/mine useful information or knowledge from web page contents.

➢ WEB USAGE MINING

❖ Web usage mining refers to the discovery of user access patterns from Web usage logs.


VIT University
4
WEB MINING

➢ Web mining is the use of data mining techniques to automatically discover and extract information from Web documents and services.

➢ There are three general classes of information that can be discovered by web mining: Web Graph, Web Content, and Web Usage.

➢ It is quite different from Data Mining / Text Mining (Structure, Scale, Speed, Access).

VIT University
5
WEB MINING

➢ The web is not a relation.

❖ Textual information and linkage structure

➢ Usage data is huge and growing rapidly.

❖ Google’s usage logs are bigger than their web crawl.

❖ Data generated per day is comparable to the largest conventional data warehouses.

➢ Ability to react in real-time to usage patterns

❖ No human in the loop

VIT University
6
WEB MINING

Some of the valuable end users for web mining results are:

❖ Search (Biggest Web Miner by far)


❖ Business intelligence
❖ Competitive intelligence
❖ Pricing analysis
❖ Events
❖ Product data
❖ Popularity
❖ Reputation

VIT University
7
Module – 1
INTRODUCTION TO WWW

1. Architecture of WWW

2. Web Search Engine

3. Web Security

4. HTTP, HTTPS, URL

5. JAVA and HTML

VIT University
8
WWW
❖ What is WWW?
❖ WWW vs. Internet
❖ Web Language
❖ HTML
❖ Web Protocols
❖ HTTP / FTP
❖ URL
❖ HTTP / HTTPS
❖ FTP / MAILTO / USENET
❖ URI / URL / URN
❖ W3C
❖ Web Standards / Web Accessibility Standards
VIT University
9
BASIC COMPONENTS OF THE WEB
• Web Servers
• Servers
• Web Clients
• HTTP Protocols
• Browser

VIT University 10
WEB ARCHITECTURE

VIT University 11
WEB ARCHITECTURE
• Uniform Resource Identifier (URI) is used to uniquely identify resources on the web, and UNICODE makes it possible to build web pages that can be read and written in human languages.

• XML (Extensible Markup Language) helps to define a common syntax for the Semantic Web.

• The Resource Description Framework (RDF) helps in defining the core representation of data for the web. RDF represents data about a resource in graph form.

VIT University 12
WEB ARCHITECTURE
• RDF Schema (RDFS) allows a more standardized description of taxonomies and other ontological constructs.

• Web Ontology Language (OWL) offers more constructs on top of RDFS. It comes in the following three versions:
➢ OWL Lite for taxonomies and simple constraints.
➢ OWL DL for full description logic support.
➢ OWL Full for maximum expressiveness and syntactic freedom of RDF.

VIT University 13
WEB ARCHITECTURE
• RIF and SWRL offer rules beyond the constructs available in RDFS and OWL. SPARQL (SPARQL Protocol and RDF Query Language) is an SQL-like language used for querying RDF data and OWL ontologies.

• All the semantics and rules executed at the layers below the Proof layer produce results that are used to prove deductions.

• Cryptographic means, such as digital signatures, are used to verify the origin of sources. On top of these layers, the User Interface and Applications layer is built for user interaction.

VIT University 14
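To make the RDF and SPARQL ideas above concrete, the following is a minimal sketch in Python (not W3C tooling; the resource names such as ex:anna and ex:CSE3024 are invented): RDF-style subject-predicate-object triples are stored as tuples, and a SPARQL-like triple pattern is answered by simple matching, with None playing the role of a query variable.

# Minimal sketch of RDF-style triples and a SPARQL-like pattern query.
# The data and names (ex:anna, ex:teaches, ...) are made up for illustration.
triples = {
    ("ex:anna", "ex:teaches", "ex:CSE3024"),
    ("ex:CSE3024", "ex:title", "Web Mining"),
    ("ex:anna", "ex:worksAt", "ex:VIT"),
}

def match(pattern, graph):
    """Return all triples matching (s, p, o); None acts as a variable."""
    s, p, o = pattern
    return [t for t in graph
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Roughly analogous to: SELECT ?p ?o WHERE { ex:anna ?p ?o . }
print(match(("ex:anna", None, None), triples))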
WEB OPERATION

VIT University 15
WEB CHALLENGES
❖ As originally proposed by Tim Berners-Lee, the Web was intended
to improve the management of general information about
accelerators and experiments at CERN.

❖ His suggestion was to organize the information used at that institution in a graph-like structure where the nodes are documents describing objects, such as notes, articles, departments, or persons, and the links are relations among them, such as “depends on,” “is part of,” “refers to,” or “uses.”

VIT University 16
WEB CHALLENGES
❖ The framework proposed by Berners-Lee was very general and
would work very well for any set of documents, providing flexibility
and convenience in accessing large amounts of text.

❖ A very important development of this idea was that the documents need not be stored on the same computer or in the same database but rather could be distributed over a network of computers.

❖ Luckily, the infrastructure for this type of distribution, the Internet, had already been developed.

In short, this is how the Web was born.

VIT University 17
TimBL Proposed Web vs. Today’s Web
❖ Today’s Web is huge and grows incredibly fast.
❖ About 10 years after the Berners-Lee proposal, the Web was
estimated to have 150 million nodes (pages) and 1.7 billion
edges (links).

❖ The formal semantics of the Web is very restricted.


❖ Nodes are simply web pages and links are of a single type (e.g.,
“refers to”). The meaning of the nodes and links is not a part of
the web system; rather, it is left to web page developers to
describe in the page content what their web documents mean
and what types of relations they have with the documents to
which they are linked.
VIT University 18
WEB CHALLENGES
❖ The Web is the largest repository of knowledge in the world, so
everyone is tempted to use it. But the big question is how to find it.
Answering this question has been the basic driving force in
developing web search technologies (Web Search Engines).

❖ The next challenge is to bring back the semantics of hypertext documents (something that was a part of the original web proposal of Berners-Lee) so that we can easily use the vast amount of information available.

TURN WEB DATA INTO WEB KNOWLEDGE

VIT University 19
WEB CHALLENGES
❖ There are several ways to achieve this:
❖ Use the existing Web and apply sophisticated search
techniques.
❖ Change the way in which we create web pages.

VIT University 20
WEB SEARCH ENGINE
A web search engine is a coordinated set of programs that includes:

❖ A spider (also called a "crawler" or a "bot") that goes to every page or representative pages on every Web site that wants to be searchable and reads it, using hypertext links on each page to discover and read a site's other pages.

❖ A program that creates a huge index (sometimes called a "catalog") from the pages that have been read.

❖ A program that receives your search request, compares it to the entries in the index, and returns results to you.
VIT University 21
WEB SEARCH ENGINE
List of Top 10 Most popular search engine:
• Google
• Bing
• Yahoo
• Ask
• AOL
• Baidu
• Wolframalpha
• DuckDuckGo
• Internet Archive
• Chacha.com

VIT University 22
comScore report February 2016
comScore Explicit Core Search Report* (Desktop Only)
Total U.S. – Desktop Home & Work Locations

Search Entity     Explicit Core Search Share     Explicit Core Search Queries
Google            64.0%                          10,762
Bing              21.4%                          3,594
Yahoo             12.2%                          2,048
Ask Network       1.6%                           273
AOL, Inc.         0.9%                           145

Google is also dominating the mobile/tablet search engine market share with 89%!

*“Explicit Core Search” excludes contextually driven searches that do not reflect specific user intent to interact with the search results.

VIT University 23
WEB SEARCH ENGINE
❖ Web search engines explore the existing (semantics-free) structure
of the Web and try to find documents that match user search
criteria.

❖ The basic idea is to use a set of words (or terms) that the user
specifies and retrieve documents that include (or do not include)
those words. – Keyword Search

❖ Further IR techniques are used to avoid terms that are too general
or too specific and to take into account term distribution
throughout the entire body of documents as well as to explore
document similarity.
VIT University 24
WEB SEARCH ENGINE
❖ Natural language processing approaches are also used to analyze
term context or lexical information, or to combine several terms into
phrases.

❖ After retrieving a set of documents ranked by their degree of matching the keyword query, they are further ranked by importance (popularity, authority), usually based on the web link structure.

VIT University 25
TOPIC DIRECTORIES
➢ Web pages are organized into hierarchical structures that reflect
their meaning. These are known as topic directories, or simply
directories, and are available from almost all web search portals.

➢ The largest is being developed under the Open Directory Project (dmoz.org) and is used by Google in their Web Directory.

➢ The directory structure is often used in the process of web search to better match user criteria or to specialize a search within a specific set of pages from a given category.

VIT University 26
TOPIC DIRECTORIES
➢ The directories are usually created manually with the help of
thousands of web page creators and editors.

➢ There are also approaches to do this automatically by applying machine learning methods for classification and clustering.

VIT University 27
SEMANTIC WEB
➢ The Semantic Web is an initiative led by the web consortium (w3c.org).

➢ Its main objective is to bring formal knowledge representation techniques into the Web.

➢ It is widely acknowledged that the Web is like a “fancy fax machine” used to send good-looking documents worldwide.
➢ The problem here is that the nice format of web pages is very
difficult for computers to understand—something that we
expect search engines to do.

VIT University 28
SEMANTIC WEB
➢ The main idea behind the semantic web is to add formal descriptive
material to each web page that although invisible to people would
make its content easily understandable by computers.

➢ Thus, the Web would be organized and turned into the largest
knowledge base in the world, which with the help of advanced
reasoning techniques developed in the area of artificial intelligence
would be able not just to provide ranked documents that match a
keyword search query, but would also be able to answer questions
and give explanations.

http://www.w3.org/2001/sw/

VIT University 29
WEB SEARCH ENGINE CHALLENGES
• Spam
• Content Quality
• Quality Evaluation
• Web Convention
• Duplicate Hosts
• Vaguely-Structured Data

Source: Henzinger, Monika R., Rajeev Motwani, and Craig Silverstein. “Challenges in web search engines.” ACM SIGIR Forum, Vol. 36, No. 2, ACM, 2002.
VIT University 30
SEARCH ENGINE SPAM
➢ Spamming involves getting a site more exposure than it deserves for its keywords,
leading to unsatisfactory search experiences.

➢ Search Engine Optimization (SEO) involves getting a site the exposure it deserves
on the most targeted keywords, leading to satisfactory search experiences.

➢ [Silverstein et al., 1999] showed that for 85% of the queries only the first
result screen is requested.

➢ Thus, inclusion in the first result screen, which usually shows the top 10
results, can lead to an increase in traffic to a web site, while exclusion
means that only a small fraction of the users will actually see a link to the
web site.

VIT University 31
SEARCH ENGINE SPAM
➢ The result of this process is commonly called search engine spam. To achieve high rankings, authors use one of the following methods:

➢ a text-based approach
➢ a link-based approach
➢ a cloaking approach (spamdexing technique)
➢ a combination thereof.
VIT University 33
SEARCH ENGINE SPAM
➢ Unfortunately, spamming has become so prevalent that every
commercial search engine has had to take measures to identify and
remove spam. Without such measures, the quality of the rankings
suffers severely.

➢ One approach to deal with the spam problem is to construct a spam classifier that tries to label pages as spam or not-spam (see the sketch below).

VIT University 34
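As a rough illustration of the spam-classifier idea above, the sketch below uses scikit-learn (a third-party library, assumed installed) to train a toy Naive Bayes text classifier on a few invented page snippets; a real page-spam classifier would need far more data and richer features (links, hosting, cloaking signals).

# Toy page-spam classifier sketch; the training snippets and labels are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

pages = [
    "cheap pills buy now best price cheap cheap",        # spam-like keyword stuffing
    "win money free casino bonus click here now",        # spam-like
    "lecture notes on web mining and crawling",          # legitimate
    "university department of computer science courses", # legitimate
]
labels = ["spam", "spam", "ham", "ham"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(pages, labels)

print(model.predict(["free bonus pills click now"]))        # expected: ['spam']
print(model.predict(["notes on graph search algorithms"]))  # expected: ['ham']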
CONTENT QUALITY
➢ The web is full of noisy, low-quality, unreliable, and indeed
contradictory content.

➢ A reasonable approach for relatively high-quality content would be to assume that every document in a collection is authoritative and accurate, design techniques for this context, and then tweak the techniques to incorporate the possibility of low-quality content.

➢ In designing a high-quality search engine, one has to start with the assumption that a typical document cannot be "trusted" in isolation; rather, it is the synthesis of a large number of low-quality documents that provides the best set of results.

VIT University 35
CONTENT QUALITY
➢ There have been link-based approaches, for instance PageRank
[Brin and Page, 1998], for estimating the quality of web pages.
However, PageRank only uses the link structure of the web to
estimate page quality.

➢ It seems to us that a better estimate of the quality of a page requires additional sources of information, both within a page (e.g., the reading level of a page) and across different pages (e.g., correlation of content).

VIT University 36
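Since the slide above cites PageRank [Brin and Page, 1998] as a link-based quality estimate, here is a simplified power-iteration sketch over a tiny invented link graph; production systems work on billions of pages with sparse matrices and many refinements not shown here.

# Simplified PageRank by power iteration on a toy directed link graph.
# Graph and damping factor are illustrative; dangling pages spread their rank uniformly.
def pagerank(links, d=0.85, iters=50):
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - d) / n for p in pages}
        for p, outs in links.items():
            if outs:
                share = d * rank[p] / len(outs)
                for q in outs:
                    new[q] += share
            else:  # dangling page: distribute its rank over all pages
                for q in pages:
                    new[q] += d * rank[p] / n
        rank = new
    return rank

toy_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
print(pagerank(toy_graph))  # C, which is linked from A, B, and D, ends up highest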
QUALITY EVALUATION
➢ Evaluating the quality of different ranking algorithms is a notoriously
difficult problem.

➢ Commercial search engines have the benefit of large amounts of user-behaviour data they can use to help evaluate ranking.

➢ Users usually will not make the effort to give explicit feedback but
nonetheless leave implicit feedback information such as the results
on which they clicked.

➢ The research issue is to exploit the implicit feedback to evaluate different ranking strategies.
VIT University 37
WEB CONVENTIONS
➢ For Example, many webmasters use the anchor text of a link to provide a
succinct description of the target page.

➢ Since most authors behave this way, we will refer to these rules as web
conventions, even though there has been no formalization or
standardization of such rules.

➢ Search engines rely on these web conventions to improve the quality of their results. Consequently, when webmasters violate these conventions they can confuse search engines.

➢ The main issue here is to identify the various conventions that have evolved and to develop techniques for accurately determining when the conventions are being violated.

VIT University 38
DUPLICATE HOSTS
➢ Web search engines try to avoid crawling and indexing duplicate
and near-duplicate pages, as they do not add new information to the
search results and clutter up the results.

➢ However, if a search engine can avoid crawling the duplicate content in the first place, the gain is even larger.

➢ In general, predicting whether a page will end up being a duplicate of an already-crawled page is chancy work, but the problem becomes more tractable if we limit it to finding duplicate hosts, that is, two hostnames that serve the same content.

VIT University 39
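One simple way to decide that two hosts serve the same content, assuming their pages have already been fetched, is to compare word-shingle sets with Jaccard similarity, as in the toy sketch below; real engines use more scalable fingerprints such as MinHash or SimHash, which this sketch does not implement.

# Toy near-duplicate check: Jaccard similarity over word 3-shingles.
# The sample texts are invented; a real system would fingerprint whole hosts.
def shingles(text, k=3):
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

page_on_host1 = "web mining uses data mining techniques on web documents"
page_on_host2 = "web mining uses data mining techniques on web pages"

sim = jaccard(shingles(page_on_host1), shingles(page_on_host2))
print(f"similarity = {sim:.2f}")
if sim > 0.7:
    print("likely duplicate hosts; crawl only one of the two")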
VAGUELY-STRUCTURED DATA
➢ The degree of structure present in data has a strong influence
on techniques used for search and retrieval.

➢ At one extreme, the database community has focused on highly-structured, relational data, while at the other the information retrieval community has been more concerned with essentially unstructured text documents.

➢ Of late, there has been some movement toward the middle with the
database literature considering the imposition of structure over
almost-structured data.

VIT University 40
VAGUELY-STRUCTURED DATA
➢ In a similar vein, document management systems use accumulated
meta-information to introduce more structure.

➢ The emergence of XML has led to a flurry of research involving extraction, imposition, or maintenance of partially-structured data.

VIT University 41
SEMANTIC WEB vs. SEMANTIC SEARCH
• The Semantic Web is a set of technologies for representing,
storing, and querying information. Although these technologies can
be used to store textual data, they typically are used to store
smaller bits of data.

• Semantic search is the process of typing something into a search engine and getting more results than just those that feature the exact keyword you typed into the search box. Semantic search will take into account the context and meaning of your search terms. It’s about understanding the assumptions that the searcher is making when typing in that search query.

VIT University 42
SEMANTIC WEB vs. SEMANTIC SEARCH
• Semantic search focuses on the text, but the semantic web focuses
on pulling data from multiple sources and multiple formats.

• For example, if you type in the word “Blackhawks” into your search
bar you don’t just want to get listings that have the word
“Blackhawks” in them. A semantic search will return listings about
the Native American tribe as well as the Chicago hockey team. You will
also get supporting terms like “hockey lessons” and “Stanley Cup,”
even if they never mention anything about the Blackhawks exactly.

VIT University 43
SEMANTIC WEB SEARCH ENGINE CHALLENGES
➢ The architecture of a semantic search engine must scale to the
Web.

➢ Dealing with data rather than documents requires a different indexing approach compared to traditional information retrieval systems.

➢ Data from the Web is of varying quality, which poses challenges for
data cleansing and entity consolidation.

➢ The schema of the data is not known a priori, which makes building
user interfaces difficult.
VIT University 44
SEMANTIC WEB SEARCH ENGINE ARCHITECTURE

VIT University 45
WEB CRAWLERS
❖ The approach is to have web pages organized by topic or to search a
collection of pages indexed by keywords.
❖ The former is done by topic directories and the latter, by search
engines.

❖ Collecting “all” web documents can be done by browsing the Web systematically and exhaustively and storing all visited pages. This is done by crawlers (also called spiders or robots).

❖ Ideally, all web pages are linked (there are no unconnected parts of the web graph) and there are no multiple links and nodes. Then the job of a crawler is simple:
❖ to run a complete graph search algorithm, such as depth-first or breadth-first search, and store all visited pages (see the sketch below).

❖ A good example of such a crawler is WebSPHINX (https://www.cs.cmu.edu/~rcm/websphinx/).
VIT University 46
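The sketch referenced above shows the breadth-first crawling idea using only Python's standard library; the seed URL is a placeholder, and a real crawler would also respect robots.txt, politeness delays, and URL canonicalization.

# Minimal breadth-first web crawler sketch (standard library only).
# The seed URL is a placeholder; real crawlers must honor robots.txt and rate limits.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=10):
    visited, queue = set(), deque([seed])
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="ignore")
        except Exception:
            continue
        visited.add(url)            # "store" the visited page (here: just remember its URL)
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            queue.append(urljoin(url, link))   # resolve relative links
    return visited

print(crawl("https://example.com/"))           # placeholder seed URL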
INDEXING AND KEYWORD SEARCH
❖ Generally, there are two types of data: structured and unstructured.

❖ Structured data have keys (attributes, features) associated with each data item that reflect its content, meaning, or usage.
❖ A typical example of structured data is a relational table in a
database.

❖ The problem with using structured data is the cost associated with the
process of structuring them. The information that people use is
available primarily in unstructured form.

❖ This is the keyword search approach known from the area of information retrieval (IR).
❖ The idea of IR is to retrieve documents by using a simple Boolean
criterion: the presence or absence of specific words (keywords,
terms) in the documents

VIT University 47
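A minimal sketch of the IR idea above: build an inverted index over a few invented plain-text "documents" and answer Boolean presence/absence queries of the kind listed on the next slide.

# Tiny inverted index supporting Boolean keyword queries (AND / NOT).
# The three "documents" are invented examples.
docs = {
    1: "the computer lab offers programming courses",
    2: "the program of study includes a computer course",
    3: "web mining and data mining techniques",
}

index = {}                                 # term -> set of document ids
for doc_id, text in docs.items():
    for term in text.lower().split():
        index.setdefault(term, set()).add(doc_id)

def contains(term):
    return index.get(term, set())

# documents that contain the word computer AND the word programming
print(contains("computer") & contains("programming"))   # {1}
# documents that contain the word program but NOT the word programming
print(contains("program") - contains("programming"))    # {2}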
INDEXING AND KEYWORD SEARCH

VIT University 48
INDEXING AND KEYWORD SEARCH

VIT University 49
INDEXING AND KEYWORD SEARCH
❖ The first step is to fetch the documents from the Web, remove the
HTML tags, and store the documents as plain text files.

❖ Then the keyword search approach can be used to answer such queries as:

❖ Find documents that contain the word computer and the word
programming.

❖ Find documents that contain the word program, but not the word
programming.

❖ Find documents where the words computer and lab are adjacent.
This query is called proximity query, because it takes into account
the lexical distance between words. Another way to do it is by
searching for the phrase computer lab.

VIT University 50
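For the proximity query above ("computer" adjacent to "lab"), a positional index can be used; the sketch below, over two invented snippets, treats adjacency as consecutive word positions.

# Positional index sketch for proximity queries (e.g., the phrase "computer lab").
docs = {
    1: "the computer lab is open",
    2: "the lab has a new computer",
}

positions = {}            # term -> {doc_id: [positions]}
for doc_id, text in docs.items():
    for pos, term in enumerate(text.lower().split()):
        positions.setdefault(term, {}).setdefault(doc_id, []).append(pos)

def adjacent(t1, t2):
    """Documents where t2 occurs immediately after t1."""
    hits = set()
    for doc_id, p1 in positions.get(t1, {}).items():
        p2 = positions.get(t2, {}).get(doc_id, [])
        if any(b - a == 1 for a in p1 for b in p2):
            hits.add(doc_id)
    return hits

print(adjacent("computer", "lab"))   # {1}: only document 1 contains the phrase "computer lab"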
WEB DOCUMENT REPRESENTATION
❖ Documents are tokenized; that is, all punctuation marks are removed
and the character strings without spaces are considered as tokens
(words, also called terms).

❖ All characters in the documents and in the query are converted to upper
or lower case.

❖ Words are reduced to their canonical form (stem, base, or root).


❖ For example, variant forms such as is and are are replaced with be,
various endings are removed, or the words are transformed into
their root form, such as programs and programming into program.

VIT University 51
WEB DOCUMENT REPRESENTATION
❖ This process, called stemming, uses morphological information to
allow matching different variants of words.

❖ Articles, prepositions, and other common words that appear frequently in text documents but do not bring any meaning or help distinguish documents are called stopwords. These words are usually removed.

❖ Examples are a, an, the, on, in, and at.

❖ The collection of words that are left in the document after all those steps
is different from the original document and may be considered as a
formal representation of the document.

VIT University 52
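A minimal sketch of the preprocessing steps above (tokenization, case folding, stopword removal, and a crude suffix-stripping stand-in for stemming); real systems use a proper stemmer such as Porter's, which this toy rule does not replicate.

# Toy document preprocessing: tokenize, lowercase, remove stopwords, crude "stemming".
import re

STOPWORDS = {"a", "an", "the", "on", "in", "at", "is", "are", "and", "of"}

def crude_stem(word):
    # Very rough suffix stripping, only for illustration (not a real stemmer).
    for suffix in ("ming", "ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = re.findall(r"[a-z0-9]+", text.lower())   # strip punctuation, tokenize
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [crude_stem(t) for t in tokens]

print(preprocess("The programs and programming courses are in the Computer Lab."))
# -> ['program', 'program', 'course', 'computer', 'lab']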
WEB DOCUMENT
REPRESENTATION

VIT University 53
WEB DOCUMENT REPRESENTATION
❖ The words are counted after tokenizing the plain text versions of the
documents (without the HTML structures). The term counts are taken
after removing the stopwords but without stemming.

❖ To emphasize this difference, we call the words in this collection terms.


The collection of words (terms) in the entire set of documents is called
the text corpus.

❖ The terms that occur in a document are in fact the parameters (also
called features, attributes, or variables in different contexts) of the
document representation.

VIT University 54
TYPE OF WEB DOCUMENT REPRESENTATION
❖ The simplest way to use a term as a feature in a document
representation is to check whether or not the term occurs in the
document. Thus, the term is considered as a Boolean attribute, so the
representation is called Boolean.

❖ The value of a term as a feature in a document representation may be the number of occurrences of the term (term frequency) in the document or in the entire corpus.
❖ Document representation that includes the term frequencies but not
the term positions is called a bag-of-words representation because
formally it is a multiset or bag.

VIT University 55
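To illustrate the Boolean and bag-of-words (term-frequency) representations described above, here is a small sketch over an invented two-document corpus.

# Boolean and term-frequency (bag-of-words) document representations.
from collections import Counter

docs = {
    "d1": "web mining uses data mining techniques",
    "d2": "text mining differs from web mining",
}

# Vocabulary (terms) of the corpus, in a fixed order.
vocab = sorted({t for text in docs.values() for t in text.split()})

for doc_id, text in docs.items():
    counts = Counter(text.split())
    boolean_vector = [1 if t in counts else 0 for t in vocab]   # presence/absence
    tf_vector = [counts[t] for t in vocab]                      # term frequencies
    print(doc_id, "boolean:", boolean_vector)
    print(doc_id, "tf:     ", tf_vector)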
WEB DOCUMENT REPRESENTATION
❖ Term positions may be included along with the frequency. This is a
“complete” representation that preserves most of the information and
may be used to generate the original document from its representation.

❖ The purpose of the document representation is to help the process of keyword matching.
❖ However, it may also result in loss of information, which generally increases the number of documents returned in response to the keyword query. Thus some irrelevant documents may also be returned.

VIT University 56
WEB DOCUMENT
REPRESENTATION

VIT University 57
WEB DOCUMENT
REPRESENTATION

VIT University 58
WEB DOCUMENT REPRESENTATION
❖ For example, stemming of programming would change the second
query and allow the first one to return more documents (its original
purpose is to identify the Computer Science department, but stemming
would allow more documents to be returned, as they all include the
word program or programs in the sense of “program of study”).

❖ Therefore, stemming should be applied with care and even avoided, especially for Web searches, where a lot of common words are used with specific technical meaning.

VIT University 59
WEB DOCUMENT REPRESENTATION
❖ This problem is also related to the issue of context (lexical or semantic),
which is generally lost in keyword search.
❖ A partial solution to the latter problem is the use of proximity information or lexical context.
❖ For this purpose a richer document representation can be used that preserves term positions.
❖ Some punctuation marks can be replaced by placeholders (tokens that are left in a document but cannot be used for searching), so that part of the lexical structure of the document, such as sentence boundaries, can be preserved.


VIT University 60
WEB DOCUMENT REPRESENTATION
❖ Another approach, called part-of-speech tagging, is to attach tags to
words that reflect their part-of-speech roles (e.g., verb or noun).

❖ For example, the word can usually appears in the stopword list, but
as a noun it may be important for a query.

VIT University 61
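As a hedged illustration of part-of-speech tagging, the snippet below uses the third-party NLTK library (assumed installed, with its tokenizer and tagger resources downloaded); it is meant only to show how the same word "can" receives different tags in different contexts.

# Part-of-speech tagging sketch with NLTK (third-party; resources must be downloaded first):
#   import nltk; nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
import nltk

for sentence in ["She can swim very fast.", "He kicked the can down the road."]:
    tokens = nltk.word_tokenize(sentence)
    print(nltk.pos_tag(tokens))
# Typically, "can" is tagged MD (modal) in the first sentence and NN (noun) in the
# second, so a query for the noun "can" need not be lost to the stopword list.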
WEB SECURITY

GOTO WEB SECURITY

VIT University 62
WEB APPLICATION SECURITY
“This Site is Secure”
• Most applications state that they are secure because they use SSL.
For example:
This site is absolutely secure. It has been designed to use
128-bit Secure Socket Layer (SSL) technology to prevent
unauthorized users from viewing any of your information. You
may use this site with peace of mind that your data is safe with
us.

• Increasingly, organizations also cite their compliance with Payment Card Industry (PCI) standards to reassure users that they are secure.
VIT University 63
WEB APPLICATION SECURITY
• In fact, the majority of web applications are insecure, despite the
widespread usage of SSL technology and the adoption of regular
PCI scanning.

VIT University 64
WEB APPLICATION SECURITY
• Broken authentication (62%) — This category of vulnerability encompasses various defects within the application’s login mechanism, which may enable an attacker to guess weak passwords, launch a brute-force attack, or bypass the login.

• Broken access controls (71%) — This involves cases where the application fails to properly protect access to its data and functionality, potentially enabling an attacker to view other users’ sensitive data held on the server or carry out privileged actions.

VIT University 65
WEB APPLICATION SECURITY
• SQL injection (32%) — This vulnerability enables an attacker to
submit crafted input to interfere with the application’s interaction
with back-end databases. An attacker may be able to retrieve
arbitrary data from the application, interfere with its logic, or execute
commands on the database server itself.

• Cross-site scripting (94%) — This vulnerability enables an attacker to target other users of the application, potentially gaining access to their data, performing unauthorized actions on their behalf, or carrying out other attacks against them (see the defence sketch below).

VIT University 66
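To ground the SQL injection and cross-site scripting items above, the sketch below shows the two standard defences in Python: parameterized queries (with the built-in sqlite3 module) and HTML-escaping of untrusted output; the table and field names are invented for illustration.

# Defences against SQL injection and XSS (illustrative table/field names).
import html
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, bio TEXT)")
conn.execute("INSERT INTO users VALUES (?, ?)",
             ("alice", "<script>alert('xss')</script>"))

user_input = "alice' OR '1'='1"      # a classic injection attempt

# SQL injection defence: never concatenate input into SQL; bind it as a parameter.
rows = conn.execute("SELECT name, bio FROM users WHERE name = ?",
                    (user_input,)).fetchall()
print(rows)                          # [] -- the injection string matches nothing

# XSS defence: escape untrusted data before inserting it into an HTML response.
name, bio = conn.execute("SELECT name, bio FROM users").fetchone()
print(f"<p>{html.escape(bio)}</p>")  # <script> is rendered harmless as &lt;script&gt;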
WEB APPLICATION SECURITY
• Information leakage (78%) — This involves cases where an
application divulges sensitive information that is of use to an
attacker in developing an assault against the application, through
defective error handling or other behaviour.

• Cross-site request forgery (92%) — This flaw means that application users can be induced to perform unintended actions on the application within their user context and privilege level. The vulnerability allows a malicious web site visited by the victim user to interact with the application to perform actions that the user did not intend.

VIT University 67
WEB SECURITY MODEL

➢ HTML DOM
➢ SAME ORIGIN POLICY

VIT University 68
SAME ORIGIN POLICY

VIT University 69
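The same-origin policy compares the scheme, host, and port of two URLs. The sketch below shows that comparison in Python; it is a simplification of the real browser rules (it only fills in default ports and ignores mechanisms such as document.domain, covered on the next slide).

# Simplified same-origin check: two URLs share an origin if scheme, host, and port match.
from urllib.parse import urlsplit

def origin(url):
    parts = urlsplit(url)
    # Default ports are filled in here (a simplification of the browser rules).
    port = parts.port or {"http": 80, "https": 443}.get(parts.scheme)
    return (parts.scheme, parts.hostname, port)

def same_origin(a, b):
    return origin(a) == origin(b)

print(same_origin("https://example.com/a", "https://example.com:443/b"))  # True
print(same_origin("https://example.com/a", "http://example.com/a"))       # False (scheme)
print(same_origin("https://example.com/a", "https://shop.example.com/"))  # False (host)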
RELAXING SAME ORIGIN POLICY

➢ document.domain property
➢ For example, cooperating scripts in documents loaded from
orders.example.com and catalog.example.com might set their
document.domain properties to “example.com”, thereby making
the documents appear to have the same origin and enabling
each document to read properties of the other.

➢ Cross-Origin Resource Sharing


➢ It allows servers to use a header to explicitly list origins that
may request a file or to use a wildcard and allow a file to be
requested by any site

VIT University 70
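A minimal sketch of the CORS idea above using only Python's standard http.server: the server explicitly names the origin that may read the response; the allowed origin and port are placeholders.

# Minimal CORS sketch: the server names the origin allowed to read this response.
# The allowed origin and the port below are placeholders for illustration.
from http.server import BaseHTTPRequestHandler, HTTPServer

ALLOWED_ORIGIN = "https://catalog.example.com"

class CorsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        # Browsers will let scripts from this origin read the response body.
        self.send_header("Access-Control-Allow-Origin", ALLOWED_ORIGIN)
        self.end_headers()
        self.wfile.write(b'{"orders": []}')

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), CorsHandler).serve_forever()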
RELAXING SAME ORIGIN POLICY

➢ Cross-document messaging
➢ Calling the postMessage() method on a Window object
asynchronously fires an "onmessage" event in that window,
triggering any user-defined event handlers.

➢ WebSockets
➢ they recognize when a WebSocket URI is used, and insert
an Origin: header into the request that indicates the origin of
the script requesting the connection.

VIT University 71
WEB APPLICATION HACKER’S METHODOLOGY
General Guidelines
➢ Remember that several characters have special meaning in
different parts of the HTTP request. When you are modifying the
data within requests, you should URL-encode these characters to
ensure that they are interpreted in the way you intend.

➢ Furthermore, note that entering URL-encoded data into a form usually causes your browser to perform another layer of encoding.
For example, submitting %00 in a form will probably result in a
value of %2500 being sent to the server. For this reason it is
normally best to observe the final request within an intercepting
proxy.
VIT University 72
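The encoding behaviour described above can be reproduced with Python's urllib.parse.quote(): it URL-encodes special characters, and applying it to an already-encoded value shows how %00 becomes %2500.

# URL-encoding special characters, and the "double encoding" effect described above.
from urllib.parse import quote

payload = "a b&c=d;%00"
print(quote(payload, safe=""))   # a%20b%26c%3Dd%3B%2500

# A browser form typically applies another layer of encoding, so a value that is
# already URL-encoded gets encoded again:
print(quote("%00", safe=""))     # %2500  (the % sign itself is what gets encoded)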
WEB APPLICATION HACKER’S METHODOLOGY
General Guidelines
➢ Many tests for common web application vulnerabilities involve sending various crafted input strings and monitoring the application’s responses for anomalies, which indicate that a vulnerability is present.
➢ In any case where specific crafted input results in behaviour
associated with a vulnerability (such as a particular error message),
you should double-check whether submitting benign input in the
relevant parameter also causes the same behaviour. If it does, your
tentative finding is probably a false positive.

VIT University 73
WEB APPLICATION HACKER’S METHODOLOGY
General Guidelines
➢ Applications typically accumulate an amount of state from previous
requests, which affects how they respond to further requests.
Sometimes, when you are trying to investigate a tentative
vulnerability and isolate the precise cause of a particular piece of
anomalous behaviour, you must remove the effects of any
accumulated state.

VIT University 74
WEB APPLICATION HACKER’S METHODOLOGY
General Guidelines
➢ Some applications use a load-balanced configuration in which
consecutive HTTP requests may be handled by different back-end
servers at the web, presentation, data, or other tiers.
➢ some successful attacks will result in a change in the state of
the specific server that handles your requests — such as the
creation of a new file within the web root.
➢ To isolate the effects of particular actions, it may be necessary
to perform several identical requests in succession, testing the
result of each until your request is handled by the relevant
server.

VIT University 75
WEB APPLICATION HACKER’S METHODOLOGY
❖ Map the Application’s Content
❖ Analyse the Application
❖ Test Client-Side Controls
❖ Test the Authentication Mechanism
❖ Test the Session Management Mechanism
❖ Test Access Controls
❖ Test for Input-Based Vulnerabilities
❖ Test for Function-Specific Input Vulnerabilities
❖ Test for Logic Flaws
❖ Test for Shared Hosting Vulnerabilities
❖ Test for Application Server Vulnerabilities
❖ Miscellaneous Checks
❖ Follow Up Any Information Leakage
VIT University 76
WEB APPLICATION HACKER’S METHODOLOGY
Map the Application’s Content
➢ Explore Visible Content
➢ Consult Public Resources
➢ Discover Hidden Content
➢ Discover Default Content
➢ Enumerate Identifier-Specified Functions
➢ Test for Debug Parameters
Analyse the Application
➢ Identify Functionality
➢ Identify Data Entry Points
➢ Identify the Technologies Used
➢ Map the Attack Surface
VIT University 77
WEB APPLICATION HACKER’S METHODOLOGY
Test Client-Side Controls
➢ Test Transmission of Data Via the Client
➢ Test Client-Side Controls Over User Input
➢ Test Browser Extension Components
Test the Authentication Mechanism
➢ Understand the Mechanism
➢ Test Password Quality
➢ Test for Username Enumeration
➢ Test Resilience to Password Guessing
➢ Test Any Account Recovery Function
➢ Test Any Remember Me Function
➢ Test Any Impersonation Function
77
VIT University
WEB APPLICATION HACKER’S METHODOLOGY
➢ Test Username
➢ Test Predictability of Autogenerated Credentials
➢ Check for Unsafe Transmission of Credentials
➢ Check for Unsafe Distribution of Credentials
➢ Test for Insecure Storage
➢ Test for Logic Flaws
➢ Exploit Any Vulnerabilities to Gain Unauthorized Access
Test the Session Management Mechanism
➢ Understand the Mechanism
➢ Test Tokens for Meaning
➢ Test Tokens for Predictability
➢ Check for Insecure Transmission of Tokens
VIT University 79
WEB APPLICATION HACKER’S METHODOLOGY
➢ Check for Disclosure of Tokens in Logs
➢ Check Mapping of Tokens to Sessions
➢ Test Session Termination
➢ Check for Session Fixation
➢ Check for CSRF
➢ Check Cookie Scope
Test Access Controls
➢ Understand the Access Control Requirements
➢ Test with Multiple Accounts
➢ Test with Limited Access
➢ Test for Insecure Access Control Methods
VIT University 80
WEB APPLICATION HACKER’S METHODOLOGY

Test for Input-Based Vulnerabilities


➢ Fuzz All Request Parameters
➢ Test for SQL Injection
➢ Test for XSS and Other Response Injection
➢ Test for OS Command Injection
➢ Test for Path Traversal
➢ Test for Script Injection
➢ Test for File Inclusion
Test for Function-Specific Input Vulnerabilities
➢ Test for SMTP Injection
➢ Test for Native Software Vulnerabilities
➢ Test for SOAP Injection VIT University 81
WEB APPLICATION HACKER’S METHODOLOGY

➢ Test for LDAP Injection


➢ Test for XPath Injection
➢ Test for Back-End Request Injection
➢ Test for XXE Injection
Test for Logic Flaws
➢ Identify the Key Attack Surface
➢ Test Multistage Processes
➢ Test Handling of Incomplete Input
➢ Test Trust Boundaries
➢ Test Transaction Logic

VIT University 82
WEB APPLICATION HACKER’S METHODOLOGY

Test for Shared Hosting Vulnerabilities


➢ Test Segregation in Shared Infrastructures
➢ Test Segregation Between ASP-Hosted Applications
Test for Application Server Vulnerabilities
➢ Test for Default Credentials
➢ Test for Default Content
➢ Test for Dangerous HTTP Methods
➢ Test for Proxy Functionality
➢ Test for Virtual Hosting Misconfiguration
➢ Test for Web Server Software Bugs
➢ Test for Web Application Firewalling
VIT University 83
WEB APPLICATION HACKER’S METHODOLOGY

Miscellaneous Checks
➢ Check for DOM-Based Attacks
➢ Check for Local Privacy Vulnerabilities
➢ Check for Weak SSL Ciphers
➢ Check Same-Origin Policy Configuration

VIT University 84
Thank You

VIT University 85
