Web Mining
1.1 Introduction to Web Mining: Web mining is the application of data mining techniques to
automatically discover and extract information from Web documents and services. The
main purpose of web mining is to discover useful information from the World Wide Web
and its usage patterns.
Applications of Web Mining:
Web mining is the process of discovering patterns, structures, and relationships in web
data. It involves using data mining techniques to analyze web data and extract valuable
insights. The applications of web mining are wide-ranging and include:
E-commerce:
Web mining can be used to analyze customer behavior on e-commerce websites. This
information can be used to improve the user experience and increase sales by
recommending products based on customer preferences.
Search engine optimization:
Web mining can be used to analyze search engine queries and search engine results pages
(SERPs). This information can be used to improve the visibility of websites in search engine
results and increase traffic to the website.
Fraud detection:
Web mining can be used to detect fraudulent activity on websites. This information can be
used to prevent financial fraud, identity theft, and other types of online fraud.
Sentiment analysis:
Web mining can be used to analyze social media data and extract sentiment from posts,
comments, and reviews. This information can be used to understand customer sentiment
towards products and services and make informed business decisions.
Content analysis:
Web mining can be used to analyze web content and extract valuable information such as
keywords, topics, and themes. This information can be used to improve the relevance of
web content and optimize search engine rankings.
Customer service:
Web mining can be used to analyze customer service interactions on websites and social
media platforms. This information can be used to improve the quality of customer service
and identify areas for improvement.
Healthcare:
Web mining can be used to analyze health-related websites and extract valuable
information about diseases, treatments, and medications. This information can be used to
improve the quality of healthcare and inform medical research.
Web mining can be broadly divided into three different types of mining techniques: Web
Content Mining, Web Structure Mining, and Web Usage Mining. These are explained below.
1.2 Web Content Mining: Web content mining is the application of extracting useful
information from the content of web documents. Web content consists of several types of
data: text, image, audio, video, etc. Content data is the collection of facts a web page is
designed to convey, and it can provide effective and interesting patterns about user needs.
Mining text documents draws on text mining, machine learning, and natural language
processing, which is why this type of mining is also known as text mining. It performs
scanning and mining of text, images, and structure information from the web.
It describes the discovery of useful information from web content. In simple words, it is the
application of web mining that extracts relevant or useful content from the Web. Web
content mining is related to, but different from, other mining techniques like data mining
and text mining. Due to the heterogeneity and lack of structure of web data, automated
discovery of new knowledge patterns can be challenging to some extent.
Web data are generally semi-structured and/or unstructured, while data mining is primarily
concerned with structured data. Web content mining also groups web pages according to
the content of the user's input, as when a search engine displays its result list. For example,
if the user is searching for a particular song, the search engine will display or suggest
results relevant to it.
Web content mining deals with different kinds of data such as text, audio, video, and image.
Its main tasks, sketched in the example below, are:
1. Pre-processing
2. Clustering
3. Classifying
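As a minimal sketch of these tasks (assuming the scikit-learn library; the sample documents and cluster count are made up for illustration), the following pre-processes a handful of text documents with TF-IDF weighting and then clusters them with k-means:

```python
# Minimal sketch: pre-process web text with TF-IDF, then cluster with k-means.
# Assumes scikit-learn is installed; documents and k are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

documents = [
    "cheap flights and hotel deals for your holiday",
    "book hotels and flight tickets online",
    "python tutorial for machine learning beginners",
    "learn data mining and machine learning with python",
]

# Pre-processing: lowercase, tokenize, drop stop words, weight terms by TF-IDF.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(documents)

# Clustering: group documents with similar term profiles.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
for doc, label in zip(documents, kmeans.labels_):
    print(label, doc)
```

Classifying would follow the same shape: once labeled pages are available, the clustering model can be swapped for a supervised one trained on the same TF-IDF features.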
1.3 Crawlers: A web crawler is a bot that downloads content from the internet and
indexes it. The main purpose of this bot is to learn about the different web pages on the
internet. Such bots are mostly operated by search engines. By applying search algorithms
to the data collected by web crawlers, search engines can provide relevant links in
response to the user's request. A basic crawler works as follows.
Approach: The idea behind this algorithm is to parse the raw HTML of the website and
look for other URLs in the obtained data. If a URL is found and has not been visited yet, it
is added to a queue of pages to crawl next, as in the sketch below.
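A minimal sketch of this crawling loop (assuming the widely used requests and beautifulsoup4 packages; the seed URL and page limit are illustrative, not part of the original text):

```python
# Minimal breadth-first crawler sketch: fetch a page, parse its HTML,
# collect the URLs it links to, and enqueue unseen ones.
# Assumes the requests and beautifulsoup4 packages; seed URL is illustrative.
from collections import deque
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup

def crawl(seed, max_pages=10):
    queue, seen, crawled = deque([seed]), {seed}, 0
    while queue and crawled < max_pages:
        url = queue.popleft()
        try:
            html = requests.get(url, timeout=5).text
        except requests.RequestException:
            continue  # skip unreachable pages
        crawled += 1
        print("crawled:", url)
        # Parse the raw HTML and look for other URLs in the obtained data.
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"])
            if link.startswith("http") and link not in seen:
                seen.add(link)      # remember it so it is never enqueued twice
                queue.append(link)  # add it to the queue of pages to visit

crawl("https://example.com")
```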
Problems caused by web crawlers: Web crawlers can accidentally flood websites with
requests. To avoid this inefficiency, web crawlers follow politeness policies, such as
respecting robots.txt and rate-limiting requests to each host. Two further concerns shape
a crawler's revisit schedule:
1. Freshness: Since web pages change over time, a web crawler needs to keep revisiting
pages. To check freshness it uses the HTTP protocol, which has a special request type
called HEAD that returns information about the last-updated date of a webpage, so the
crawler can decide whether to re-download it (see the sketch below).
2. Age: The age of a webpage is T days after it was last crawled. On average, webpage
updates follow a Poisson distribution, and the older a page gets, the more it costs to
crawl it, so age is a more important factor for a crawler than freshness.
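As a sketch of the freshness check (assuming the requests package; the URL is illustrative), a crawler can issue a HEAD request and read the Last-Modified header without downloading the page body:

```python
# Sketch: check page freshness with an HTTP HEAD request instead of a full GET.
# Assumes the requests package; the URL is illustrative.
import requests

response = requests.head("https://example.com", timeout=5)
# The Last-Modified header (when the server provides it) tells the crawler
# whether the page changed since the last crawl, without fetching the body.
print(response.headers.get("Last-Modified", "server did not report a date"))
```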
Applications: This kind of web crawler is used to acquire important parameters of the
web, such as:
● Which websites are important in the network as a whole?
1.4 Harvest System: Data harvesting means getting data and information from an online
resource. The term is usually interchangeable with web scraping, web crawling, and data
extraction. Harvesting is an agricultural term for gathering ripe crops from the fields, an act
of collection and relocation; likewise, data harvesting is extracting valuable data from
target websites and putting it into your database in a structured format.
To conduct data harvesting, you need an automated crawler to parse the target websites,
capture valuable information, extract the data, and finally export it into a structured format
for further analysis. In this sense, data harvesting doesn't itself involve algorithms, machine
learning, or statistics; instead, it relies on programming in languages like Python, R, and
Java.
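As an illustration of the final export step (standard library only; the records and file name are made up), harvested data can be written out in a structured CSV format:

```python
# Sketch: export harvested records into a structured format (CSV).
# Standard library only; the records and file name are illustrative.
import csv

records = [  # pretend these were captured from target web pages
    {"product": "laptop", "price": "899.00", "url": "https://example.com/p/1"},
    {"product": "mouse", "price": "19.99", "url": "https://example.com/p/2"},
]

with open("harvest.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["product", "price", "url"])
    writer.writeheader()
    writer.writerows(records)
```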
Many data extraction tools and service providers can conduct web harvesting for you.
Octoparse stands out as the best web scraping tool. Whether you are a first-time self-starter
or an experienced programmer, it is the best choice to harvest the data from the internet.
1.5 Virtual Web View:
1. Multiple layered database (MLDB) is used to handle large amounts of unstructured data
on the Web.
2. This database is massive and distributed. Each layer is more generalised than the layer
beneath it.
3. The MLDB provides an abstracted and condensed view of a portion of the Web. A view of
the MLDB, called a Virtual Web View (VWV), can be constructed.
4. Generalisation tools are proposed, and concept hierarchies are used to assist in the
generalisation process for constructing the higher levels of the MLDB.
5. WebML, a web data mining query language, is proposed to provide data mining
operations on the MLDB. It is an extension of DMQL.
1.6 Web Structure Mining: Web Structure Mining is one of the three different types of
techniques in Web Mining. Web Structure Mining is the technique of discovering structure
information from the web. It uses graph theory to analyze the nodes and connections in the
structure of a website.
Depending upon the type of web structural data, web structure mining can be divided into
two kinds:
1. Extracting patterns from hyperlinks in the Web: The Web works through a system of
hyperlinks using the Hypertext Transfer Protocol (HTTP). A hyperlink is a structural
component that connects web pages at different locations. Any page can create a hyperlink
to any other page, and that page can in turn be linked to yet another page. This intertwined
or self-referential nature of the web lends itself to some unique network analytical
algorithms, and the structure of web pages can be analyzed to examine the pattern of
hyperlinks among pages.
2. Mining the document structure: This is the analysis of the tree-like structure of a web
page to describe HTML or XML tag usage. There are different terms associated with the
web graph:
● Node(s): A node represents a web page in the web graph.
● Edge(s): An edge represents a hyperlink between web pages in the web graph.
Page Rank: The PageRank algorithm is applicable to web pages and is used by Google
Search to rank websites in its search engine results. It was named after Larry Page, one of
the founders of Google. In essence, PageRank is a way of measuring the importance of
website pages. The Web is modeled as a directed graph with two components, namely
nodes (pages) and edges (hyperlinks).
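In its simplest form, each page shares its rank evenly over its outgoing links, damped by a factor d (commonly 0.85): PR(u) = (1 - d)/N + d * sum of PR(v)/out(v) over the pages v that link to u, where N is the number of pages. Below is a minimal sketch on a made-up four-page graph:

```python
# Minimal PageRank sketch on a toy directed web graph.
# graph maps each page (node) to the pages it links to (edges).
# The graph, damping factor, and iteration count are illustrative.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}

def pagerank(graph, d=0.85, iterations=50):
    n = len(graph)
    rank = {page: 1.0 / n for page in graph}
    for _ in range(iterations):
        new_rank = {page: (1.0 - d) / n for page in graph}
        for page, links in graph.items():
            for target in links:
                # Each page shares its rank evenly among its out-links.
                new_rank[target] += d * rank[page] / len(links)
        rank = new_rank
    return rank

for page, score in sorted(pagerank(graph).items(), key=lambda x: -x[1]):
    print(page, round(score, 3))
```

Pages with many incoming links from other important pages (here C) end up with the highest rank.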
1.7 Web Usage Mining: Web usage mining, a subset of data mining, is the extraction of
various kinds of interesting usage data readily available in the ocean of web pages that
make up the Internet, formally known as the World Wide Web (WWW).
As one of the applications of data mining techniques, it helps analyse user activities on
different web pages and track them over a period of time. The usage data comes from
several sources:
1. Web Server Data: Web server data generally includes the IP address, browser logs,
proxy server logs, user profiles, etc. User logs are collected by the web server.
2. Application Server Data: An added feature of commercial application servers is the
ability to build applications on top of them, tracking various business events and logging
them into application server logs.
3. Application-Level Data: Various new kinds of events can occur in an application, and
the logging feature enabled in them helps keep a past record of these events and
activities.
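As a small illustration of working with web server data (standard library only; the sample log line is made up), entries in the common Apache/NGINX access-log format can be parsed like this:

```python
# Sketch: parse web server access-log entries (Common Log Format).
# Standard library only; the sample log line is made up.
import re

LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) \S+'
)

line = '203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] "GET /products HTTP/1.1" 200 2326'
match = LOG_PATTERN.match(line)
if match:
    print(match.group("ip"), match.group("path"), match.group("status"))
```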
Web usage mining brings benefits as well as concerns:
● Customer relationships are better understood by the company with the aid of
these mining tools, helping it satisfy the needs of customers faster and more
efficiently.
● Privacy stands out as a major issue. Analyzing data for the benefit of customers
is good, but using the same data for something else can be dangerous. Using it
without the individual's knowledge can pose a big threat to the company, and
seemingly harmless attributes can be combined to reveal personal information
about the user, compromising their privacy.
Common techniques in web usage mining include association rules, classification, and
clustering.
Association Rules: This technique focuses on relations among web pages that frequently
appear together in users' sessions; pages accessed together are put into a single server
session. Association rules help in the reconstruction of websites using access logs, which
generally contain information about requests approaching the web server. The major
drawback of this technique is that producing so many sets of rules together may leave
some of the rules completely uninteresting or redundant.
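A minimal sketch of the idea (standard library only; the sessions are made up): compute support and confidence for page pairs drawn from user sessions.

```python
# Sketch: find page pairs that frequently occur together in user sessions,
# and compute support and confidence for the rule A -> B.
# Standard library only; the sessions are made up.
from itertools import combinations

sessions = [
    {"home", "products", "cart"},
    {"home", "products"},
    {"home", "blog"},
    {"products", "cart"},
]
n = len(sessions)

def support(pages):
    # Fraction of sessions containing all the given pages.
    return sum(pages <= s for s in sessions) / n

for a, b in combinations(["home", "products", "cart", "blog"], 2):
    pair_support = support({a, b})
    if pair_support >= 0.5:  # keep only frequent pairs
        confidence = pair_support / support({a})
        print(f"{a} -> {b}: support={pair_support:.2f}, confidence={confidence:.2f}")
```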
Classification: The main goal here in web usage mining is to develop the kind of profile of
users/customers that is associated with a particular class/category. For this, one needs to
extract the features best suited to the associated class. Classification can be performed
with algorithms such as support vector machines, K-Nearest Neighbors, logistic regression,
decision trees, etc. For example, given a track record of customers' purchase history over
the last 6 months, each customer can be classified as a frequent or non-frequent buyer.
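For instance, a minimal sketch (assuming scikit-learn; the purchase counts, order values, and labels are made up) that trains a decision tree to separate frequent from non-frequent buyers:

```python
# Sketch: classify customers as frequent (1) or non-frequent (0) buyers
# from simple purchase-history features. Assumes scikit-learn; the data
# (purchases and average order value over 6 months) is made up.
from sklearn.tree import DecisionTreeClassifier

X = [[12, 80.0], [1, 15.0], [9, 60.0], [0, 0.0], [7, 45.0], [2, 20.0]]
y = [1, 0, 1, 0, 1, 0]  # 1 = frequent buyer, 0 = non-frequent

model = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(model.predict([[8, 55.0], [1, 10.0]]))  # frequent, non-frequent
```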
Clustering: Clustering groups together users or items that share similar features/traits.
There are mainly two types of clusters: usage clusters and page clusters. In usage-based
clustering, items that are commonly accessed or purchased together are automatically
organized into groups, and the clustering of users tends to establish groups of users
exhibiting similar browsing patterns. In page clustering, which can be readily performed
based on the usage data, the basic idea is to group pages so that information can be
found quickly over the web.
Applications of web usage mining include:
1. Personalization of Web Content: The World Wide Web holds a lot of information and is
expanding very rapidly day by day. The problem is that users' specific needs grow on an
everyday basis, and they quite often don't get the query results they want. Web usage
mining personalizes content for the user based on tracking the user's navigational
behavior and interests.
2. E-commerce: Web usage mining plays a vital role in web-based companies, whose
ultimate focus is customer attraction, customer retention, cross-sales, etc. To build a
strong relationship with the customer, it is very useful for a web-based company to rely
on web usage mining, which yields many insights about customers' interests and also
tells the company how to improve its web design in some aspects.
3. Prefetching and Caching: Frequently needed data can be fetched before it is required,
to decrease the time spent waiting for that data, hence the term 'prefetch'. The results
obtained from web usage mining can be used to produce prefetching and caching
strategies, which in turn can greatly reduce server response time.