Web Mining
1.1 Introduction to Web Mining: Web mining is the application of data mining techniques to
automatically discover and extract information from Web documents and services. The
main purpose of web mining is to discover useful information from the World Wide Web
and its usage patterns.
Applications of Web Mining:
Web mining is the process of discovering patterns, structures, and relationships in web
data. It involves using data mining techniques to analyze web data and extract valuable
insights. The applications of web mining are wide-ranging and include:
E-commerce:
Web mining can be used to analyze customer behavior on e-commerce websites. This
information can be used to improve the user experience and increase sales by
recommending products based on customer preferences.
Search engine optimization:
Web mining can be used to analyze search engine queries and search engine results pages
(SERPs). This information can be used to improve the visibility of websites in search engine
results and increase traffic to the website.
Fraud detection:
Web mining can be used to detect fraudulent activity on websites. This information can be
used to prevent financial fraud, identity theft, and other types of online fraud.
Sentiment analysis:
Web mining can be used to analyze social media data and extract sentiment from posts,
comments, and reviews. This information can be used to understand customer sentiment
towards products and services and make informed business decisions.
Content analysis:
Web mining can be used to analyze web content and extract valuable information such as
keywords, topics, and themes. This information can be used to improve the relevance of
web content and optimize search engine rankings.
Customer service:
Web mining can be used to analyze customer service interactions on websites and social
media platforms. This information can be used to improve the quality of customer service
and identify areas for improvement.
Healthcare:
Web mining can be used to analyze health-related websites and extract valuable
information about diseases, treatments, and medications. This information can be used to
improve the quality of healthcare and inform medical research.
Web mining can be broadly divided into three different types of mining techniques: Web
Content Mining, Web Structure Mining, and Web Usage Mining. These are explained below.
1.2 Web Content Mining: Web content mining is the application of extracting useful
information from the content of web documents. Web content consists of several types of
data: text, image, audio, video, etc. Content data is the collection of facts a web page is
designed to convey, and it can provide effective and interesting patterns about user needs.
Mining text documents draws on text mining, machine learning, and natural language
processing, which is why this type of mining is also known as text mining. It performs
scanning and mining of text, images, and structure information from the web.
It describes the discovery of useful information from web content. In simple words, it is the
application of web mining that extracts relevant or useful content from the Web. Web
content mining is related to, but different from, other mining techniques like data mining
and text mining. Due to the heterogeneity and lack of structure of web data, automated
discovery of new knowledge patterns can be challenging to some extent.
Web data are generally semi-structured and/or unstructured, while data mining is primarily
concerned with structured data. Web content mining also groups web pages according to
the content of the user's input, as when a search engine displays its result list. For example,
if the user is searching for a particular song, the search engine will display or suggest
results relevant to it.
Web content mining deals with different kinds of data such as text, audio, video, and image.
Its main tasks, sketched in the example below, are:
1. Pre-processing
2. Clustering
3. Classifying
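As a minimal sketch of these tasks (assuming the scikit-learn library; the sample documents and cluster count are made up for illustration), the following pre-processes a handful of text documents with TF-IDF weighting and then clusters them with k-means:

```python
# Minimal sketch: pre-process web text with TF-IDF, then cluster with k-means.
# Assumes scikit-learn is installed; documents and k are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

documents = [
    "cheap flights and hotel deals for your holiday",
    "book hotels and flight tickets online",
    "python tutorial for machine learning beginners",
    "learn data mining and machine learning with python",
]

# Pre-processing: lowercase, tokenize, drop stop words, weight terms by TF-IDF.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(documents)

# Clustering: group documents with similar term profiles.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
for doc, label in zip(documents, kmeans.labels_):
    print(label, doc)
```

Classifying would follow the same shape: once labeled pages are available, the clustering model can be swapped for a supervised one trained on the same TF-IDF features.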
1.3 Crawlers: A web crawler is a bot that downloads content from the internet and
indexes it. The main purpose of this bot is to learn about the different web pages on the
internet. Such bots are mostly operated by search engines. By applying search algorithms
to the data collected by web crawlers, search engines can provide relevant links in
response to the user's request. A basic crawler works as follows.
Approach: The idea behind this algorithm is to parse the raw HTML of the website and
look for other URLs in the obtained data. If a URL is found and has not been visited yet, it
is added to a queue of pages to crawl next, as in the sketch below.
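A minimal sketch of this crawling loop (assuming the widely used requests and beautifulsoup4 packages; the seed URL and page limit are illustrative, not part of the original text):

```python
# Minimal breadth-first crawler sketch: fetch a page, parse its HTML,
# collect the URLs it links to, and enqueue unseen ones.
# Assumes the requests and beautifulsoup4 packages; seed URL is illustrative.
from collections import deque
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup

def crawl(seed, max_pages=10):
    queue, seen, crawled = deque([seed]), {seed}, 0
    while queue and crawled < max_pages:
        url = queue.popleft()
        try:
            html = requests.get(url, timeout=5).text
        except requests.RequestException:
            continue  # skip unreachable pages
        crawled += 1
        print("crawled:", url)
        # Parse the raw HTML and look for other URLs in the obtained data.
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"])
            if link.startswith("http") and link not in seen:
                seen.add(link)      # remember it so it is never enqueued twice
                queue.append(link)  # add it to the queue of pages to visit

crawl("https://example.com")
```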
Problems caused by web crawlers: Web crawlers can accidentally flood websites with
requests. To avoid this inefficiency, web crawlers follow politeness policies, such as
respecting robots.txt and rate-limiting requests to each host. Two further concerns shape
a crawler's revisit schedule:
1. Freshness: Since web pages change over time, a web crawler needs to keep revisiting
pages. To check freshness it uses the HTTP protocol, which has a special request type
called HEAD that returns information about the last-updated date of a webpage, so the
crawler can decide whether to re-download it (see the sketch below).
2. Age: The age of a webpage is T days after it was last crawled. On average, webpage
updates follow a Poisson distribution, and the older a page gets, the more it costs to
crawl it, so age is a more important factor for a crawler than freshness.
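As a sketch of the freshness check (assuming the requests package; the URL is illustrative), a crawler can issue a HEAD request and read the Last-Modified header without downloading the page body:

```python
# Sketch: check page freshness with an HTTP HEAD request instead of a full GET.
# Assumes the requests package; the URL is illustrative.
import requests

response = requests.head("https://example.com", timeout=5)
# The Last-Modified header (when the server provides it) tells the crawler
# whether the page changed since the last crawl, without fetching the body.
print(response.headers.get("Last-Modified", "server did not report a date"))
```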
Applications: This kind of web crawler is used to acquire important parameters of the
web, such as:
● Which websites are important in the network as a whole?
1.4 Harvest System: Data harvesting means getting data and information from an online
resource. The term is usually interchangeable with web scraping, web crawling, and data
extraction. Harvesting is an agricultural term for gathering ripe crops from the fields, an act
of collection and relocation; likewise, data harvesting is extracting valuable data from
target websites and putting it into your database in a structured format.
To conduct data harvesting, you need an automated crawler to parse the target websites,
capture valuable information, extract the data, and finally export it into a structured format
for further analysis. In this sense, data harvesting doesn't itself involve algorithms, machine
learning, or statistics; instead, it relies on programming in languages like Python, R, and
Java.
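As an illustration of the final export step (standard library only; the records and file name are made up), harvested data can be written out in a structured CSV format:

```python
# Sketch: export harvested records into a structured format (CSV).
# Standard library only; the records and file name are illustrative.
import csv

records = [  # pretend these were captured from target web pages
    {"product": "laptop", "price": "899.00", "url": "https://example.com/p/1"},
    {"product": "mouse", "price": "19.99", "url": "https://example.com/p/2"},
]

with open("harvest.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["product", "price", "url"])
    writer.writeheader()
    writer.writerows(records)
```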
Many data extraction tools and service providers can conduct web harvesting for you.
Octoparse stands out as the best web scraping tool. Whether you are a first-time self-starter
or an experienced programmer, it is the best choice to harvest the data from the internet.
1.5 Virtual Web View:
1. Multiple layered database (MLDB) is used to handle large amounts of unstructured data
on the Web.
2. This database is massive and distributed. Each layer is more generalised than the layer
beneath it.
3. The MLDB provides an abstracted and condensed view of a portion of the Web. A view of
the MLDB, called a Virtual Web View (VWV), can be constructed.
4. Generalisation tools are proposed, and concept hierarchies are used to assist in the
generalisation process for constructing the higher levels of the MLDB.
5. WebML, a web data mining query language, is proposed to provide data mining
operations on the MLDB. It is an extension of DMQL.
1.6 Web Structure Mining: Web Structure Mining is one of the three different types of
techniques in Web Mining. Web Structure Mining is the technique of discovering structure
information from the web. It uses graph theory to analyze the nodes and connections in the
structure of a website.
Depending upon the type of web structural data, web structure mining can be divided into
two kinds:
1. Extracting patterns from hyperlinks in the Web: The Web works through a system of
hyperlinks using the Hypertext Transfer Protocol (HTTP). A hyperlink is a structural
component that connects web pages at different locations. Any page can create a hyperlink
to any other page, and that page can in turn be linked to yet another page. This intertwined
or self-referential nature of the web lends itself to some unique network analytical
algorithms, and the structure of web pages can be analyzed to examine the pattern of
hyperlinks among pages.
2. Mining the document structure: This is the analysis of the tree-like structure of a web
page to describe HTML or XML tag usage. There are different terms associated with the
web graph:
● Node(s): A node represents a web page in the web graph.
● Edge(s): An edge represents a hyperlink between web pages in the web graph.
Page Rank: The PageRank algorithm is applicable to web pages and is used by Google
Search to rank websites in its search engine results. It was named after Larry Page, one of
the founders of Google. In essence, PageRank is a way of measuring the importance of
website pages. The Web is modeled as a directed graph with two components, namely
nodes (pages) and edges (hyperlinks).
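In its simplest form, each page shares its rank evenly over its outgoing links, damped by a factor d (commonly 0.85): PR(u) = (1 - d)/N + d * sum of PR(v)/out(v) over the pages v that link to u, where N is the number of pages. Below is a minimal sketch on a made-up four-page graph:

```python
# Minimal PageRank sketch on a toy directed web graph.
# graph maps each page (node) to the pages it links to (edges).
# The graph, damping factor, and iteration count are illustrative.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}

def pagerank(graph, d=0.85, iterations=50):
    n = len(graph)
    rank = {page: 1.0 / n for page in graph}
    for _ in range(iterations):
        new_rank = {page: (1.0 - d) / n for page in graph}
        for page, links in graph.items():
            for target in links:
                # Each page shares its rank evenly among its out-links.
                new_rank[target] += d * rank[page] / len(links)
        rank = new_rank
    return rank

for page, score in sorted(pagerank(graph).items(), key=lambda x: -x[1]):
    print(page, round(score, 3))
```

Pages with many incoming links from other important pages (here C) end up with the highest rank.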
1.7 Web Usage Mining: Web usage mining, a subset of data mining, is the extraction of
various kinds of interesting usage data readily available in the ocean of web pages that
make up the Internet, formally known as the World Wide Web (WWW).
As one of the applications of data mining techniques, it helps analyse user activities on
different web pages and track them over a period of time. The usage data comes from
several sources:
1. Web Server Data: Web server data generally includes the IP address, browser logs,
proxy server logs, user profiles, etc. User logs are collected by the web server.
2. Application Server Data: An added feature of commercial application servers is the
ability to build applications on top of them, tracking various business events and logging
them into application server logs.
3. Application-Level Data: Various new kinds of events can occur in an application, and
the logging feature enabled in them helps keep a past record of these events and
activities.
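As a small illustration of working with web server data (standard library only; the sample log line is made up), entries in the common Apache/NGINX access-log format can be parsed like this:

```python
# Sketch: parse web server access-log entries (Common Log Format).
# Standard library only; the sample log line is made up.
import re

LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) \S+'
)

line = '203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] "GET /products HTTP/1.1" 200 2326'
match = LOG_PATTERN.match(line)
if match:
    print(match.group("ip"), match.group("path"), match.group("status"))
```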
Web usage mining brings benefits as well as concerns:
● Customer relationships are better understood by the company with the aid of
these mining tools, helping it satisfy the needs of customers faster and more
efficiently.
● Privacy stands out as a major issue. Analyzing data for the benefit of customers
is good, but using the same data for something else can be dangerous. Using it
without the individual's knowledge can pose a big threat to the company, and
seemingly harmless attributes can be combined to reveal personal information
about the user, compromising their privacy.
Common techniques in web usage mining include association rules, classification, and
clustering.
Association Rules: This technique focuses on relations among web pages that frequently
appear together in users' sessions; pages accessed together are put into a single server
session. Association rules help in the reconstruction of websites using access logs, which
generally contain information about requests approaching the web server. The major
drawback of this technique is that producing so many sets of rules together may leave
some of the rules completely uninteresting or redundant.
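A minimal sketch of the idea (standard library only; the sessions are made up): compute support and confidence for page pairs drawn from user sessions.

```python
# Sketch: find page pairs that frequently occur together in user sessions,
# and compute support and confidence for the rule A -> B.
# Standard library only; the sessions are made up.
from itertools import combinations

sessions = [
    {"home", "products", "cart"},
    {"home", "products"},
    {"home", "blog"},
    {"products", "cart"},
]
n = len(sessions)

def support(pages):
    # Fraction of sessions containing all the given pages.
    return sum(pages <= s for s in sessions) / n

for a, b in combinations(["home", "products", "cart", "blog"], 2):
    pair_support = support({a, b})
    if pair_support >= 0.5:  # keep only frequent pairs
        confidence = pair_support / support({a})
        print(f"{a} -> {b}: support={pair_support:.2f}, confidence={confidence:.2f}")
```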
Classification: The main goal here in web usage mining is to develop the kind of profile of
users/customers that is associated with a particular class/category. For this, one needs to
extract the features best suited to the associated class. Classification can be performed
with algorithms such as support vector machines, K-Nearest Neighbors, logistic regression,
decision trees, etc. For example, given a track record of customers' purchase history over
the last 6 months, each customer can be classified as a frequent or non-frequent buyer.
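For instance, a minimal sketch (assuming scikit-learn; the purchase counts, order values, and labels are made up) that trains a decision tree to separate frequent from non-frequent buyers:

```python
# Sketch: classify customers as frequent (1) or non-frequent (0) buyers
# from simple purchase-history features. Assumes scikit-learn; the data
# (purchases and average order value over 6 months) is made up.
from sklearn.tree import DecisionTreeClassifier

X = [[12, 80.0], [1, 15.0], [9, 60.0], [0, 0.0], [7, 45.0], [2, 20.0]]
y = [1, 0, 1, 0, 1, 0]  # 1 = frequent buyer, 0 = non-frequent

model = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(model.predict([[8, 55.0], [1, 10.0]]))  # frequent, non-frequent
```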
Clustering: Clustering groups together users or items that share similar features/traits.
There are mainly two types of clusters: usage clusters and page clusters. In usage-based
clustering, items that are commonly accessed or purchased together are automatically
organized into groups, and the clustering of users tends to establish groups of users
exhibiting similar browsing patterns. In page clustering, which can be readily performed
based on the usage data, the basic idea is to group pages so that information can be
found quickly over the web.
Applications of web usage mining include:
1. Personalization of Web Content: The World Wide Web holds a lot of information and is
expanding very rapidly day by day. The problem is that users' specific needs grow on an
everyday basis, and they quite often don't get the query results they want. Web usage
mining personalizes content for the user based on tracking the user's navigational
behavior and interests.
2. E-commerce: Web usage mining plays a vital role in web-based companies, whose
ultimate focus is customer attraction, customer retention, cross-sales, etc. To build a
strong relationship with the customer, it is very useful for a web-based company to rely
on web usage mining, which yields many insights about customers' interests and also
tells the company how to improve its web design in some aspects.
3. Prefetching and Caching: Frequently needed data can be fetched before it is required,
to decrease the time spent waiting for that data, hence the term 'prefetch'. The results
obtained from web usage mining can be used to produce prefetching and caching
strategies, which in turn can greatly reduce server response time.