Course Manual on Data Mining_CSC 425_015446
INTRODUCTION
What is Data Mining?
Data Mining is the computational process of discovering patterns in large data sets
using methods from artificial intelligence, machine learning, statistics, and database
systems to extract information from a data set and transform it into an
understandable structure for further use. One of the most frequently cited definitions
of data mining defines the technique as “the nontrivial process of identifying valid,
novel, potentially useful, and ultimately understandable patterns in data”.
Simply storing information in a data warehouse does not provide the benefits that an
organization is seeking. To realise the value of a data warehouse, it is necessary to
extract the knowledge hidden within the warehouse. However, as the amount and
complexity of the data in a data warehouse grows, it becomes increasingly difficult, if
not impossible, for business analysts to identify trends and relationships in the data
using simple query and reporting tools. Data mining is one of the best ways to extract
meaningful trends and patterns from huge amounts of data. Data mining discovers
information within data warehouses that mere queries and reports cannot effectively
reveal.
Many people treat data mining as a synonym for another popularly used term,
knowledge discovery from data, or KDD, while others view data mining as merely an
essential step in the process of knowledge discovery. It is also important to
distinguish data mining from querying a database: the former can reveal knowledge
hidden in the data stored in a database (as well as other relevant information not
explicitly stored), whereas the latter is limited to retrieving information about the
data already stored. It is on record that statisticians were the first to use the term
“data mining”, and more recently computer scientists have treated data mining as an
algorithmic problem.
In the commercial field, the questions to be asked are not only ‘how many customers
have bought this product in this period?’ but also ‘what is their profile?’, ‘what other
products are they interested in?’ and ‘when will they be interested?’ The profiles to
be discovered are generally complex combinations in which the discriminant variables
are not necessarily those we might have imagined at first, and they could not be found
by chance, especially in the case of rare behaviours. Data mining methods are
certainly more complex than those of elementary descriptive statistics.
They are based on artificial intelligence tools (neural networks), information theory
(decision trees), machine learning theory, and above all, inferential statistics and
‘conventional’ data analysis including factor analysis, clustering and discriminant
analysis, etc.
Data mining software helps to explore unknown patterns that are significant to the
success of a business. The actual data mining task is the automatic analysis of large
quantities of data to extract previously unknown, interesting patterns such as groups
of records (cluster analysis), unusual records (anomaly detection), and dependencies
(association rule mining, sequential pattern mining).
Data mining has its origin in various disciplines; the two most important are statistics
and machine learning. The figure below summarizes some disciplines through which
data mining originates.
Data Pre-processing
Data Quality: Why is it necessary to pre-process data?
Data have quality if they satisfy the requirements of the intended use. Data quality
comprises accuracy, completeness, consistency and interpretability. Let us look at
a scenario. Imagine that you are a manager at XYZ company and have been charged
with analysing the company’s data with respect to your branch’s sales. You
immediately set out to perform this task. You carefully inspect the company’s
database and data warehouse, identifying and selecting the attributes or dimensions
such as item, price, and units sold to be included in your analysis. Then you notice
that several of the attributes for various tuples have no recorded value. For your
analysis, you would like to include information as to whether each item purchased
was advertised as on sale, yet you discover that this information has not been
recorded. Furthermore, users of your database system have reported errors, unusual
values, and inconsistencies in the data recorded for some transactions. In other
words, the data you wish to analyze by data mining techniques are incomplete (they
lack attribute values or certain attributes of interest, or contain only aggregate
data), inaccurate or noisy, and inconsistent.
This is a typical example of what real-world data look like. This scenario illustrates
three of the elements defining data quality: accuracy, completeness, and consistency.
Inaccurate, incomplete, and inconsistent data are commonplace properties of large
real-world databases and data warehouses. Today’s real-world databases are highly
susceptible to noisy, missing, and inconsistent data due to their typically huge size
which sometimes may be several gigabytes or more. Data inconsistency may also result
from where the data originate or from their heterogeneous sources. Low-quality data
will lead to low-quality mining results. However, pre-processing of data can:
i. improve the quality of the data
ii. improve the mining results
iii. improve the efficiency and ease of the mining process
There are several data pre-processing techniques. Not all of them may be required
for a particular mining task; the data to be explored determine which specific
methods should be applied. Some of the data pre-processing techniques are:
i. Data cleaning: Cleaning removes noise and resolves inconsistencies in data. The
process involves filling in missing values, smoothing noisy data, and identifying or
removing outliers. If users believe the data are dirty, they are unlikely to trust the
results of any data mining that has been applied. Furthermore, dirty data can confuse
the mining procedure, resulting in unreliable output. The first step in data cleaning as
a process is discrepancy detection. Discrepancies can be caused by several factors,
including poorly designed data entry forms that have many optional fields, human
error in data entry, deliberate errors (for instance, respondents who are unwilling to
divulge personal information), and data decay (e.g., outdated addresses).
Discrepancies may also arise from inconsistent data representations and inconsistent
use of codes.
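To make this concrete, the following minimal Python sketch (not part of the original manual; it assumes the pandas library and uses invented toy data) fills in missing values and flags possible outliers:

# A hedged data-cleaning sketch with invented example data.
import pandas as pd

sales = pd.DataFrame({
    "item":       ["pen", "book", "bag", "pen", "book"],
    "price":      [1.2, None, 15.0, 1.1, 250.0],    # None marks a missing value
    "units_sold": [10, 5, None, 8, 7],
})

# Fill missing numeric values (one simple strategy among many).
sales["price"] = sales["price"].fillna(sales["price"].mean())
sales["units_sold"] = sales["units_sold"].fillna(sales["units_sold"].median())

# Flag likely outliers: values more than three standard deviations from the mean.
z = (sales["price"] - sales["price"].mean()) / sales["price"].std()
sales["price_outlier"] = z.abs() > 3
print(sales)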
ii. Data integration: In the process of integrating data, data from multiple sources are
merged into a coherent data store such as a data warehouse. Getting back to
your task at XYZ company, suppose that you would like to include data from multiple
sources in your analysis. This would involve integrating multiple databases, or files.
Yet some attributes representing a given concept may have different names in
different databases, causing inconsistencies and redundancies. For example, the
attribute for customer identification may be referred to as customer id in one data
store and cust id in another.
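As a hedged illustration of such integration (assuming pandas and hypothetical column names), the two customer identifiers can be reconciled and the sources merged into one coherent store:

# A minimal data-integration sketch with hypothetical sources.
import pandas as pd

store_a = pd.DataFrame({"customer_id": [1, 2, 3], "total_spent": [120, 45, 300]})
store_b = pd.DataFrame({"cust_id": [2, 3, 4], "loyalty_points": [10, 55, 5]})

# Resolve the naming inconsistency, then merge the sources.
store_b = store_b.rename(columns={"cust_id": "customer_id"})
customers = store_a.merge(store_b, on="customer_id", how="outer")
print(customers)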
In a situation where the data selected for analysis are very large, the probability of a
slow mining process becomes very high. Such a data set can often be reduced without
jeopardizing the data mining results. This is achievable through the data reduction
approach.
iii. Data reduction: This strategy obtains a reduced representation of the data set
that is much smaller in volume and in the number of attributes, yet produces the
same or almost the same analytical results. Data reduction strategies include:
a. dimensionality reduction
b. numerosity reduction.
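As a hedged example of dimensionality reduction (assuming scikit-learn is available and using random toy data), principal component analysis can shrink ten attributes down to two derived ones:

# A minimal dimensionality-reduction sketch using PCA.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))        # 100 objects described by 10 attributes

pca = PCA(n_components=2)             # keep only 2 derived attributes
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (100, 2): a much smaller representation
print(pca.explained_variance_ratio_)  # share of variance each component retains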
iv. Data transformation: Returning again to the company's data, suppose you have
decided to use a distance-based mining algorithm for your analysis, such as neural
networks, nearest-neighbour classifiers, or clustering. Such methods provide better
results if the data to be analysed have been normalized, that is, scaled to a smaller
range such as [0.0, 1.0]. This technique usually improves the accuracy and efficiency
of mining algorithms involving distance measurements. Discretization can also be
applied, in which raw numeric values (e.g., price) are replaced by conceptual labels
such as inexpensive, moderately priced, and expensive, thereby reducing the number
of data values to be handled by the mining process.
Finally, in data transformation, the data are transformed or consolidated into forms
appropriate for mining. Transformation is one of the data pre-processing techniques,
and the strategies it uses can be summarized as follows:
1. Smoothing: this is a way of ensuring that data are free from noise. The techniques
used here may include regression and clustering.
2. Normalization: the attribute data are scaled so as to fall within a smaller range,
such as -1.0 to 1.0, or 0.0 to 1.0.
3. Discretization: the raw values of a numeric attribute (e.g., age) are replaced by
interval labels (e.g., 0–10, 11–20, etc.) or conceptual labels (e.g., youth, adult,
senior).
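A brief, hedged sketch of the last two strategies (assuming pandas and invented ages) might look as follows:

# Min-max normalization and discretization on toy data.
import pandas as pd

ages = pd.Series([3, 15, 22, 37, 48, 65, 80])

# Normalization: scale the values into the range [0.0, 1.0].
ages_scaled = (ages - ages.min()) / (ages.max() - ages.min())

# Discretization: replace raw ages with conceptual labels.
labels = pd.cut(ages, bins=[0, 12, 19, 59, 120],
                labels=["child", "youth", "adult", "senior"])

print(pd.DataFrame({"age": ages, "scaled": ages_scaled, "label": labels}))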
Association analysis can, for example, be used to determine which products are
frequently bought together and to use this information for marketing purposes. This is
sometimes referred to as market basket analysis.
Market basket analysis:
This process analyzes customer buying habits by finding associations between the
different items that customers place in their shopping baskets. The discovery of such
associations can help retailers develop marketing strategies by gaining insight into
which items are frequently purchased together by customers. For instance, if
customers are buying milk, how likely are they also to buy bread (and what kind of
bread) on the same trip to the supermarket? Such information can lead to increased
sales by helping retailers carry out selective marketing and plan their shelf space.
Example:
If customers who purchase computers also tend to buy antivirus software at the same
time, then placing the hardware display close to the software display may help
increase the sales of both items. In an alternative strategy, placing hardware and
software at opposite ends of the store may entice customers who purchase such items
to pick up other items along the way. For instance, after deciding on an expensive
computer, a customer may observe security systems for sale while heading toward the
software display to purchase antivirus software and may decide to purchase a home
security system as well. Market basket analysis can also help retailers plan which
items to put on sale at reduced prices. If customers tend to purchase computers and
printers together, then having a sale on printers may encourage the sale of printers as
well as computers.
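The two basic quantities behind such association rules are support and confidence. The following minimal sketch (plain Python, invented transactions) computes them for the hypothetical rule "computer => antivirus":

# Support and confidence for one candidate association rule.
transactions = [
    {"computer", "antivirus", "printer"},
    {"computer", "antivirus"},
    {"printer", "paper"},
    {"computer", "printer"},
    {"computer", "antivirus", "webcam"},
]

def support(itemset):
    # Fraction of transactions containing every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

sup_rule = support({"computer", "antivirus"})
conf_rule = sup_rule / support({"computer"})
print(f"support = {sup_rule:.2f}, confidence = {conf_rule:.2f}")
# support = 0.60, confidence = 0.75: 75% of computer buyers also bought antivirus.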
3. Clustering
This is the task of discovering groups and structures in the data that are in some way
or another "similar", without using known structures in the data.
Another approach to revealing patterns in a dataset is through description.
Data mining functionalities are used to specify the kinds of patterns to be found in
data mining tasks. In general, such tasks can be classified into two categories:
descriptive and predictive. A typical example of data description is clustering.
Clustering techniques apply when there is no class to be predicted but the instances
are to be divided into natural groups. In order to achieve good cluster results, a
correlation-based analysis method can be used to perform attribute relevance analysis
and filter out statistically irrelevant or weakly relevant attributes from the descriptive
mining process.
The following are typical requirements of clustering in data mining:
i. Scalability: Many clustering algorithms work well on small data sets containing no
more than a few hundred data objects; however, a large database may contain millions
or even billions of objects, particularly in Web search scenarios. Clustering on only a
sample of a given large data set may lead to biased results. Therefore, highly scalable
clustering algorithms are needed.
ii. Ability to deal with different types of attributes: Many algorithms are designed
to cluster numeric data. However, applications may require clustering other data
types, such as binary, nominal (categorical), or mixtures of these data types.
Recently, more and more applications need clustering techniques for complex data
types such as graphs, sequences, images, and documents.
iii. Ability to deal with noisy data: Most real-world data sets contain outliers and/or
missing, unknown, or erroneous data. Sensor readings, for example, are often noisy.
Some readings may be inaccurate due to the sensing mechanisms, and some readings
may be erroneous due to interferences from surrounding transient objects. Clustering
algorithms can be sensitive to such noise and may produce poor-quality clusters.
Therefore, we need clustering methods that are robust to noise.
The major clustering approaches can be categorized as follows:
1. Partitioning methods: Partitioning methods conduct one-level partitioning on data
sets. The basic partitioning methods typically adopt exclusive cluster separation; that
is, each object must belong to exactly one group. Most partitioning methods are
distance-based. Given k, the number of partitions to construct, a partitioning method
creates an initial partitioning. It then uses an iterative relocation technique that
attempts to improve the partitioning by moving objects from one group to another.
The general criterion of a good partitioning is that objects in the same cluster are
"close" or related to each other, whereas objects in different clusters are "far apart"
or very different.
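A hedged k-means sketch (k-means being the best-known partitioning algorithm; scikit-learn assumed, toy data invented) illustrates this relocation-based approach:

# Partitioning 100 artificial 2-D points into k = 2 clusters with k-means.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, size=(50, 2)),
               rng.normal(6, 1, size=(50, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)   # one centre per partition
print(kmeans.labels_[:10])       # each object belongs to exactly one cluster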
2. Hierarchical methods: A hierarchical method creates a hierarchical decomposition
of the given set of data objects, and can be either agglomerative or divisive:
a. The agglomerative approach, also called the bottom-up approach, starts with
each object forming a separate group. It successively merges the objects or
groups close to one another, until all the groups are merged into one (the
topmost level of the hierarchy), or a termination condition holds.
b. The divisive approach, also called the top-down approach, starts with all the
objects in the same cluster. In each successive iteration, a cluster is split
into smaller clusters, until eventually each object is in a cluster of its own, or a
termination condition holds.
3. Density-based methods: Most partitioning methods cluster objects based on the
distance between objects. Such methods can find only spherical-shaped clusters
and encounter difficulty in discovering clusters of arbitrary shapes. Other
clustering methods have been developed based on the notion of density. Their
general idea is to continue growing a given cluster as long as the density (number
of objects or data points) in the "neighbourhood" exceeds some threshold. For
example, for each data point within a given cluster, the neighbourhood of a given
radius has to contain at least a minimum number of points. Such a method can be
used to filter out noise or outliers and discover clusters of arbitrary shape.
4. Grid-based methods: Grid-based methods quantize the object space into a finite
number of cells that form a grid structure. All the clustering operations are
performed on the grid structure (i.e., on the quantized space). The main advantage
of this approach is its fast processing time, which is typically independent of the
number of data objects and dependent only on the number of cells in each
dimension in the quantized space.
Using grids is often an efficient approach to many spatial data mining problems,
including clustering. Therefore, grid-based methods can be integrated with other
clustering methods such as density-based methods and hierarchical methods.
Some clustering algorithms integrate the ideas of several clustering methods, so
that it is sometimes difficult to classify a given algorithm as uniquely belonging to
only one clustering method category. Furthermore, some applications may have
clustering criteria that require the integration of several clustering techniques.
5. Regression: this task attempts to find a function which models the data with the
least error. Regression is a statistical method that tries to determine the strength and
character of the relationship between one dependent variable and a series of others.
Although data mining can help to reveal patterns and relationships, it does not tell the
user the value or significance of these patterns. The user must make these types of
determinations. Similarly, the validity of the patterns discovered is dependent on how
they compare to "real world" circumstances.
Another limitation of data mining is that while it can identify connections between
behaviours or variables, it does not necessarily identify causal relationships. For
example, an application may identify that a pattern of behaviour such as the
propensity to purchase an airline ticket just shortly before the flight is scheduled to
depart, is related to characteristics such as income, level of education and internet use.
However, that does not necessarily indicate that the ticket purchasing behaviour is
caused by one or more of these variables. The individual's behaviour could be affected
by some additional variable(s) such as occupation (the need to make the trip on short
notice), family status (a sick relative needing care), or a hobby (taking advantage of
last-minute discounts to visit new destinations).
TYPES OF LEARNING
In the domain of data mining and machine learning, learning basically takes three
forms: supervised, unsupervised and reinforcement learning. Each of these is
discussed as follows:
A. Supervised Learning
The goal of the supervised learning approach is for the model to learn a mapping from
inputs to outputs so it can predict the output for new, unseen data.
Characteristics:
Labelled Data: The dataset contains both input data (features) and the
corresponding correct output (labels).
Learning Task: The model tries to learn the relationship between the input
features and the output labels.
Goal: To predict the output for new data based on the learned relationship.
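As a hedged illustration (assuming scikit-learn and its bundled iris data set), labelled examples are used to fit a classifier, which then predicts labels for unseen data:

# A minimal supervised-learning sketch: labelled data -> model -> prediction.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)            # features (inputs) and labels (outputs)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)                  # learn the input-to-output mapping
print("accuracy on unseen data:", model.score(X_test, y_test))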
B. Unsupervised Learning
In unsupervised learning, the agent learns patterns in the input even though no explicit
feedback is supplied. The most common unsupervised learning task is clustering:
detecting potentially useful clusters of input examples. For example, a taxi agent might
gradually develop a concept of "good traffic days" and "bad traffic days" without ever
being given labelled examples of each by a teacher. The goal is to find patterns,
structures, or relationships in the data without predefined outputs. The model tries to
infer the underlying structure or distribution of the data.
Characteristics
Unlabeled Data: The dataset consists only of input data without any associated
labels.
Learning Task: The model attempts to identify patterns, clusters, or
dimensionality reduction techniques from the data itself.
Goal: To discover hidden patterns or intrinsic structures in the data
C. Reinforcement learning
In reinforcement learning, the agent is not given labelled examples; instead, it learns
from a series of reinforcements (rewards and punishments) received as it interacts
with its environment, and adjusts its behaviour accordingly.
In practice, these distinctions are not always so crisp. In semi-supervised learning, we are
given a few labelled examples and must make what we can of a large collection of unlabelled
examples. Even the labels themselves may not be the oracular truths that we hope for. Imagine
that you are trying to build a system to guess a person's age from a photo. You gather some
labelled examples by snapping pictures of people and asking their age. That's supervised
learning. But in reality, some of the people lied about their age. It's not just that there is
random noise in the data; rather, the inaccuracies are systematic, and to uncover them is an
unsupervised learning problem involving images, self-reported ages, and true (unknown)
ages. Thus, both noise and lack of labels create a continuum between supervised and
unsupervised learning.
Classification
Classification maps data into predefined groups or classes on the basis of previously
labelled examples.
Example: An airport security screening station is used to determine whether passengers are
potential terrorists or criminals. To do this, the face of each passenger is scanned and its basic
pattern (distance between eyes, size and shape of mouth, head, etc.) is identified. This pattern
is compared with entries in a database to see if it matches any patterns that are associated with
known offenders.
Classification models can be represented in several forms, for example:
1) IF-THEN rules, e.g., IF student(class, "undergraduate") AND concentration(level,
"high") THEN class = A
2) A neural network
[Figure: a neural network whose inputs UG (undergraduate status) and CL
(concentration level) feed hidden units F1–F8, which connect to the outputs CLASS A,
CLASS B and CLASS C.]
Prediction
Prediction typically finds missing or unavailable data values rather than class labels.
Although prediction may refer to both data value prediction and class label prediction, it is
usually confined to data value prediction and is thus distinct from classification. It is also
often referred to as regression. Prediction also encompasses the identification of distribution
trends based on the available data.
Regression can also be used to handle prediction problems. It is about using some
independent variables to predict the dependent variable. For instance, if the task is to predict
house prices based on features like size, location, and age, the model learns from examples of
houses with known prices to predict the price of new houses.
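A hedged sketch of this idea (scikit-learn assumed, toy house data invented) fits a linear model to houses with known prices and predicts the price of a new one:

# A minimal regression sketch: predicting a house price from size and age.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[50, 30], [80, 10], [120, 5], [65, 20], [100, 15]])  # [size m2, age yrs]
y = np.array([100_000, 180_000, 260_000, 140_000, 220_000])        # known prices

model = LinearRegression().fit(X, y)       # learn from houses with known prices
new_house = np.array([[90, 12]])
print("predicted price:", round(model.predict(new_house)[0]))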
In classification you have a set of predefined classes and want to know which class a new
object belongs to.
Clustering tries to group a set of objects and find whether there is some relationship between
the objects.
In the 1940s, foundational efforts in AI involved modelling the neurons in the brain, which
gave rise to the field of neural networks. An artificial neural network consists of a large
collection of neural units (artificial neurons), whose behaviour is roughly based on how real
neurons communicate with each other in the brain. Each neural unit is connected with many
other neural units, and links can enhance or inhibit the activation state of adjoining units.
The network architecture consists of multiple layers of neural units. A signal initiates at the
input layer, traverses through hidden layers, and finally culminates at the output layer. Once
the logical approach to AI became dominant in the 1950s, neural networks fell from
popularity. However, new algorithms for training neural networks and dramatically increased
computer processing speed resulted in a re-emergence of the use of neural nets in the field
called deep learning. Deep learning neural network architectures differ from older neural
networks in that they often have more hidden layers.
Furthermore, deep learning networks can be trained using both unsupervised and supervised
learning. Deep learning has been used to solve tasks like computer vision and speech
recognition, which were difficult with other approaches.
Learning takes two forms: supervised and unsupervised learning. In supervised learning, the
training data is composed of input-output pairs. A neural network tries to find a function
which, when given the inputs, produces the outputs. Through repeated application of training
data, the network then approximates a function for that input domain. There are two main
types of neural networks: fixed (non-adaptive) and dynamic (adaptive).
Fixed neural networks, sometimes referred to as Pre-Trained Neural Networks (PTNN), are
those that have undergone training and then become set. The internal structure of the network
remains unchanged during operation. After training is complete, all weights, connections, and
node configurations remain the same, and the network reduces to a repeatable function. A
common use of a fixed neural network might be a classification system to identify malformed
products on a manufacturing line where the definition of an undesirable characteristic would
not change and the network would be expected to perform the same classification repeatedly.
Transfer Function: The summation function computes the internal stimulation, or activation
level of the neuron. The relationship between the internal activation level and the output can
be linear or nonlinear. The relationship is expressed by one of several types of transfer
functions. The transfer function combines (i.e., adds up) the inputs coming into a neuron from
other neurons/sources and then produces an output based on the choice of the transfer
function. Selection of the specific function affects the network's operation. The sigmoid
activation function (or sigmoid transfer function) is an S-shaped transfer function in the range
of 0 to 1, and it is a popular as well as useful nonlinear transfer function. Other activation
functions that can be used are the step, sign and linear functions. The step and sign
activation functions are sometimes referred to as hard-limit functions. They are mostly used in
decision-making neurons for classification and pattern recognition tasks.
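For illustration, a minimal sketch of these transfer functions (NumPy assumed) is given below; the thresholds and ranges follow the common textbook definitions:

# Common transfer (activation) functions.
import numpy as np

def step(x):        # hard-limit: output 0 or 1
    return np.where(x >= 0, 1, 0)

def sign(x):        # hard-limit: output -1 or +1
    return np.where(x >= 0, 1, -1)

def sigmoid(x):     # S-shaped, output in the range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def linear(x):      # output equals the activation level
    return x

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(step(x), sign(x), sigmoid(x).round(2), linear(x), sep="\n")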
Back-propagation Algorithm
Back-propagation is a neural network learning algorithm. During the learning phase, the
network learns by adjusting the weights so as to be able to predict the correct class label of the
input tuples. Neural network learning is also referred to as connectionist learning due to the
connections between units. Neural networks involve long training times and are therefore
more suitable for applications where this is feasible. They require several parameters that are
typically best determined empirically, such as the network topology or "structure". Neural
networks have been criticized for their poor interpretability. For example, it is difficult for
humans to interpret the symbolic meaning behind the learned weights and the "hidden units" in
the network. These features initially made neural networks less desirable for data mining.
Advantages of neural networks, however, include their high tolerance of noisy data as well as
their ability to classify patterns on which they have not been trained. They can be used when
you may have little knowledge of the relationships between attributes and classes. They are
well suited for continuous-valued inputs and outputs, unlike most decision tree algorithms.
They have been successful on a wide array of real-world data, including handwritten character
recognition, pathology and laboratory medicine, and training a computer to pronounce English
text. Neural network algorithms are inherently parallel; parallelization techniques can be used
to speed up the computation process.
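To give a feel for how weight adjustment works, here is a hedged, minimal back-propagation sketch (NumPy assumed, toy XOR data, plain gradient descent); production systems use far more elaborate tooling:

# A tiny 2-layer network trained with back-propagation on the XOR problem.
import numpy as np

rng = np.random.default_rng(42)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)      # target class labels

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Weights: 2 inputs -> 4 hidden units -> 1 output unit.
W1 = rng.normal(size=(2, 4)); b1 = np.zeros(4)
W2 = rng.normal(size=(4, 1)); b2 = np.zeros(1)
lr = 0.5                                             # learning rate

for epoch in range(5000):
    # Forward pass: the signal flows input -> hidden -> output.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Backward pass: propagate the error and adjust the weights.
    err_out = (out - y) * out * (1 - out)            # uses the sigmoid derivative
    err_hid = (err_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ err_out;  b2 -= lr * err_out.sum(axis=0)
    W1 -= lr * X.T @ err_hid;  b1 -= lr * err_hid.sum(axis=0)

print(out.round(2))   # the outputs should move toward [0, 1, 1, 0]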
DECISION TREES
Decision tree induction is one of the simplest and yet most successful forms of machine
learning. A decision tree represents a function that takes as input a vector of attribute values
and returns a "decision" in the form of a single output value. The input and output values can
be discrete or continuous. A decision tree is a tree-shaped structure that represents a set of
decisions. Nodes in a decision tree involve testing a particular attribute. Usually, the test
compares an attribute value with a constant. Leaf nodes give a classification that applies to all
instances that reach the leaf, or a set of classifications, or a probability distribution over all
possible classifications. Specific decision tree methods include Classification and Regression
Trees (CART). Other widely used decision tree algorithms include ID3, C4.5 and C5.0.
To classify an unknown instance, it is routed down the tree according to the values of the
attributes tested in successive nodes, and when a leaf is reached the instance is classified
according to the class assigned to the leaf. If the attribute is numeric, the test at a node usually
determines whether its value is greater or less than a predetermined constant, giving a two-way
split. Alternatively, a three-way split may be used, in which case there are several different
possibilities. A decision tree can be illustrated with a small example, such as the hedged
sketch below.
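The sketch below (an illustrative stand-in, assuming scikit-learn and its bundled iris data rather than reproducing the manual's original figure) grows a shallow tree and prints its attribute tests and leaf classes:

# A small decision tree: each node tests an attribute against a constant.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

print(export_text(tree, feature_names=["sepal len", "sepal wid",
                                       "petal len", "petal wid"]))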
Data Warehousing
Nowadays, computing power and mass storage have reached the point where virtually
every data item generated by an enterprise can be saved. Also,
enterprise databases have become extremely large and architecturally complex. An
appropriate environment for applying various tools and techniques that can achieve a
cleansed, stable, offline repository was needed and data warehouses were born.
As the data warehouses continue to grow, the need to create architecturally compatible
functional subsets, or data marts, has been recognized. The immediate future is
moving everything toward cloud computing. This will include the elimination of many
local storage disks as data is pushed to a vast array of external servers accessible over
the internet. Data mining in the cloud will continue to grow in importance as network
connectivity and data accessibility become virtually infinite.
Integrating data from different sources usually presents many challenges: not deep
issues of principle, but nasty realities of practice. Different departments will often use
different conventions, formats, time periods and identifiers, and their data will contain
different kinds of errors.
Terminologies
In the context of data warehousing, there are several key terminologies that are
important for understanding how data is structured, processed, and analysed. These
terms help in organizing and managing large amounts of data that are used for
decision-making and business analysis. Here's an overview of the most important
terminologies:
1. ETL (Extract, Transform, Load):
Extract: The process of extracting data from different source systems.
Transform: The data is cleaned, validated, and transformed into a format
suitable for analysis.
Load: The transformed data is loaded into the data warehouse for further
processing and analysis.
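A hedged, minimal ETL sketch (pandas assumed; the table, columns, and file name are invented) shows the three steps end to end:

# Extract -> Transform -> Load on a toy data set.
import pandas as pd

# Extract: obtain raw data from a source system (a literal frame stands in for pd.read_csv).
raw = pd.DataFrame({"cust": [" Ada ", "Bayo", None],
                    "amount": ["100", "250", "75"]})

# Transform: clean, validate and convert into an analysis-friendly format.
clean = raw.dropna(subset=["cust"]).copy()
clean["cust"] = clean["cust"].str.strip()
clean["amount"] = clean["amount"].astype(float)

# Load: write the transformed data into the warehouse (a CSV file stands in here).
clean.to_csv("warehouse_sales.csv", index=False)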
2. Data Mart: A data mart is a subset of a data warehouse, often focused on a specific
department or business function (like marketing, sales, finance). It is a smaller, more
specialized version of a data warehouse.
3. Fact Table: A fact table is the central table in a data warehouse schema. It contains
quantitative data (measures) for analysis, such as sales figures, revenue, or quantities.
It is usually large and contains keys that link it to the dimension tables.
4. Dimension Table: A dimension table contains descriptive attributes that provide context for
the facts, such as time, product, location, or customer. For instance, in a sales data
warehouse, dimensions could be Product, Time, and Store.
5. Star Schema: The star schema is a type of database schema used in data
warehouses. It has a central fact table connected to multiple dimension tables, which
resemble a star shape when visualized. Its benefits are a simple structure and fast
query performance.
Data cube
A data cube provides a multidimensional view of data and allows further computations
and fast access of summarized data. The cube can display three dimensions, for
example the customers' address or location, the item type, and time with quarter values
Q1, Q2, Q3, and Q4. By providing multidimensional data views and the pre-computation
of summarized data, data warehouse systems can provide inherent support for Online
Analytical Processing (OLAP). The operations of OLAP make use
of background knowledge regarding the domain of the data being studied to allow the
presentation of data at different levels of abstraction. Such operations accommodate
different user viewpoints. Examples of OLAP operations include drill-down and roll-
up, which allow the user to view the data at different degrees of summarization.
While data cubes can exist as a simple representation of data, without any extensive
capabilities to analyze large volumes, OLAP data cubes are particularly valuable for
complex data analysis, including business intelligence as they provide a
comprehensive view of information across different dimensions, such as time,
products, locations, or customer segments. For example, if you are looking at a
sales data cube, different dimensions can show you data by year, product category,
locations, customers, etc.
Data cubes support various operations that allow users to examine and analyze data
from different perspectives. Several tools can be used to create, manage, and visualize
data cubes. Some popular software platforms include Microsoft SQL Server Analysis
Services (SSAS), IBM Cognos Analytics, Oracle OLAP etc. All these tools can create
and analyse data cubes for large datasets.
Common data cube operations include the following:
i. Roll-up: Aggregating the data along a dimension to view it at a higher level of
summarization, for example moving from monthly sales up to quarterly sales.
ii. Drill-down: The reverse of roll-up, navigating from summarized data to more
detailed data, for example from quarterly sales down to individual months.
iii. Slicing: When users want to focus on a specific set of facts from a particular
dimension, they can filter the data to focus on that subset. Slicing a sales data
cube to focus on "Electronics" would restrict the data to sales of electronic
products only.
iv. Dicing: Breaking the data into multiple slices from a data cube can isolate a
particular combination of factors for analysis. By selecting a subset of values
from each dimension, the user can focus on the point where the two dimensions
intersect each other. For example, dicing the product dimension to
"Electronics" and the region dimension to "Asia" would restrict the data to
sales of electronic products in the Asian region.
v. Pivoting: Pivoting means rotating the cube to view the data from a unique
perspective or reorienting analysis to focus on a different aspect. Pivoting the
sales data cube to swap the product and region dimensions would shift the
focus from sales by product to sales by region.
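These operations can be imitated, in a hedged way, with pandas on a toy sales table (invented data; real OLAP servers work on pre-computed cubes):

# Cube-style operations expressed with pandas.
import pandas as pd

sales = pd.DataFrame({
    "quarter": ["Q1", "Q1", "Q2", "Q2", "Q1", "Q2"],
    "region":  ["Asia", "Europe", "Asia", "Europe", "Asia", "Asia"],
    "product": ["Electronics", "Electronics", "Food", "Electronics", "Food", "Electronics"],
    "amount":  [100, 80, 40, 90, 30, 120],
})

# Roll-up: summarize amounts up to the quarter level.
print(sales.groupby("quarter")["amount"].sum())

# Slice: fix one dimension (product = "Electronics").
print(sales[sales["product"] == "Electronics"])

# Dice: select a sub-cube (Electronics sold in Asia).
print(sales[(sales["product"] == "Electronics") & (sales["region"] == "Asia")])

# Pivot: rotate the view so regions become rows and quarters become columns.
print(pd.pivot_table(sales, values="amount", index="region",
                     columns="quarter", aggfunc="sum"))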
Banks, for example, collect and analyze data on customer interactions with their various
products and services. This data-driven approach allows banks to offer personalized services
and promotions, enhancing customer satisfaction and optimizing business performance.
Data cubes offer several benefits:
i. Fast: Data cubes are prepared before the semantic layer is appended to them,
which means most of the required calculations reside in cache memory.
These calculations expedite query response times, which helps users retrieve
and analyse large datasets quickly.
ii. Convenient: By processing data in advance, these cubes ensure that operations
remain smooth irrespective of data volume. As the data grows, some complexity can
creep in, but the robust structure and pre-calculated relationships can still
conveniently handle user queries.
Several data warehousing frameworks are commonly used in the industry, each with
its own unique components, but they all focus on ensuring data integration,
consistency, and accessibility. Here's a summary of the main data warehousing
frameworks:
One widely used modern approach is the cloud-based data warehousing framework,
emphasizing flexibility, scalability, and easier integration with big data tools. This
framework is commonly used in cloud-native environments.
Key components:
i. Cloud Storage: AWS Redshift, Google BigQuery, Snowflake, Azure Synapse.
ii. Data Integration Tools: Fivetran, Stitch, Matillion, or cloud-native connectors.
iii. Data Transformation: ELT (Extract, Load, Transform) is more common than
ETL in cloud environments because of cloud computing's powerful processing
capabilities (e.g., SQL-based transformations).
This framework is best for organizations that want to leverage cloud scalability, quick
time-to-market, and lower operational costs.
A further benefit of data warehousing is historical analysis: storing historical data
allows businesses to track trends over time and perform advanced analytics.
Web Mining
Introduction
The rapid growth of the Web in the past two decades has made it the largest publicly
accessible data source in the world. Web mining aims to discover useful information or
knowledge from Web hyperlinks, page contents, and usage logs. Based on the primary kinds
of data used in the mining process, Web mining tasks can be categorized into three main
types: Web structure mining, Web content mining and Web usage mining. Web structure
mining discovers knowledge from hyperlinks, which represent the structure of the Web. Web
content mining extracts useful information/knowledge from Web page contents. Web usage
mining mines user activity patterns from usage logs and other forms of logs of user
interactions with Web systems.
The World Wide Web (or the Web for short) has impacted almost every aspect of our lives. It
is the biggest and most widely known information source that is easily accessible and
searchable. It consists of billions of interconnected documents (called Web pages) which are
authored by millions of people. Since its inception, the Web has dramatically changed our
information seeking behaviour. Before the Web, finding information meant asking a friend or
an expert, or buying/borrowing a book to read. However, with the Web, everything is just a
few clicks away from the comfort of our homes or offices. We can not only find needed
information on the Web, but also easily share our information and knowledge with others.
The Web has also become an important channel for conducting businesses. We can buy
almost anything from online stores without needing to go to a physical shop. The Web also
provides a convenient means for us to communicate with each other, to express our views and
opinions, and to discuss with people from anywhere in the world. The Web is truly a virtual
society.
In addition, a great deal of information on today's Web is generated by users themselves, for
example on social media and other community sites. Such information offers new types of
data that enable many new mining tasks, e.g., opinion mining and social network analysis.
Web structure mining techniques include the following:
Link Analysis: Studying the hyperlinks between web pages to understand their
structure and influence. Common methods:
i. PageRank Algorithm: Used by Google to rank web pages based on their link
structure (a minimal sketch is given after this list).
ii. HITS (Hyperlink-Induced Topic Search): Identifies hubs and authorities in a
web graph.
Graph Mining: Treating the web as a graph where pages are nodes and links are
edges. Techniques like graph clustering, centrality measures, and community
detection help uncover hidden patterns.
Social Network Analysis (SNA): Examining social media platforms or other online
networks to understand user relationships and interactions, often using graph theory
and network analysis algorithms.
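The PageRank sketch promised above is given here: a hedged power-iteration version on a tiny invented four-page graph (NumPy assumed, damping factor 0.85 as commonly used):

# PageRank by power iteration on a toy web graph.
import numpy as np

links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}   # page -> pages it links to
n, d = 4, 0.85

# Column-stochastic matrix: M[j, i] = 1/outdegree(i) if page i links to page j.
M = np.zeros((n, n))
for i, outs in links.items():
    for j in outs:
        M[j, i] = 1.0 / len(outs)

rank = np.full(n, 1.0 / n)
for _ in range(100):                          # iterate until (near) convergence
    rank = (1 - d) / n + d * M @ rank

print(rank.round(3))                          # pages 2 and 0 receive the highest ranks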
3. Web Usage Mining
Web usage mining focuses on analyzing user behaviour on the web, often by
examining web logs to understand how users interact with websites. The techniques
used here include:
Log File Analysis: Analyzing server logs or user interaction logs to gather insights
about user behaviour, popular pages, and browsing patterns.
Clickstream Analysis: Studying the sequence of clicks made by users on a website to
understand their navigation patterns and identify popular areas of interest.
User Profiling: Building user profiles based on their browsing history to personalize
website content, recommend products, or optimize user experience.
Session Analysis: Grouping activities into sessions to better understand how users
interact with a site over time.
TEXT MINING
Text mining can be broadly defined as a knowledge-intensive process in which a user
interacts with a document collection over time by using a suite of analysis tools. In a manner
analogous to data mining, text mining seeks to extract useful information from data sources
through the identification and exploration of interesting patterns. In the case of text mining,
however, the data sources are document collections, and interesting patterns are found not
among formalized database records but in the unstructured textual data in the documents in
these collections.
Certainly, text mining derives much of its inspiration and direction from seminal research on
data mining. Therefore, it is not surprising to find that text mining and data mining systems
evince many high-level architectural similarities. For instance, both types of systems rely on
preprocessing routines, pattern-discovery algorithms, and presentation-layer elements such as
visualization tools to enhance the browsing of answer sets. Further, text mining adopts many
of the specific types of patterns in its core knowledge discovery operations that were first
introduced and vetted in data mining research.
Text mining involves extracting valuable information and patterns from textual data. To do
this, various techniques are used, depending on the goals and complexity of the task. Here are
some key techniques employed in text mining:
1. Text Preprocessing
Tokenization: Breaking text into smaller components like words, sentences, or phrases.
This is a fundamental step for analysis.
Stop-word Removal: Removing common words (e.g., "and", "the", "is") that do not carry
significant meaning in the context of analysis.
Stemming and Lemmatization: Reducing words to their root form. Stemming removes
prefixes or suffixes (e.g., "running" becomes "run"), while lemmatization converts a word to
its base form (e.g., "better" becomes "good").
Normalization: Standardizing text to a common format, like converting all letters to
lowercase or removing punctuation and special characters.
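A hedged, deliberately naive preprocessing sketch (plain Python; the stop-word list and suffix rules are invented and far cruder than real stemmers) ties these steps together:

# Normalization, tokenization, stop-word removal and crude suffix stripping.
import re

STOP_WORDS = {"the", "is", "and", "a", "of"}      # tiny illustrative list

def preprocess(text):
    text = text.lower()                           # normalization
    tokens = re.findall(r"[a-z]+", text)          # tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]
    tokens = [re.sub(r"(ing|ed|s)$", "", t) for t in tokens]   # naive "stemming"
    return tokens

print(preprocess("The miners are mining and running the tests."))
# -> ['miner', 'are', 'min', 'runn', 'test']  (which is why real stemmers are smarter)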
2. Text Representation Techniques
Bag-of-Words (BoW): A simple representation where text is treated as a collection of
words, ignoring grammar and word order, with each word assigned a frequency count.
TF-IDF (Term Frequency-Inverse Document Frequency): A statistical measure used
to evaluate the importance of a word in a document relative to a collection of
documents (corpus). It helps highlight unique words in a document while
downplaying frequent but less meaningful words.
Latent Semantic Analysis (LSA): A technique for extracting and representing the
contextual meaning of words by analyzing the relationships between a set of
documents and the terms they contain.
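As a hedged illustration (scikit-learn assumed, toy documents invented), both bag-of-words counts and TF-IDF weights can be produced in a few lines:

# Bag-of-words and TF-IDF representations of three toy documents.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["data mining finds patterns in data",
        "text mining mines text documents",
        "web mining analyses web pages"]

bow = CountVectorizer()                     # bag-of-words: raw term counts
print(bow.fit_transform(docs).toarray())
print(sorted(bow.vocabulary_))              # the terms behind the columns

tfidf = TfidfVectorizer()                   # TF-IDF: down-weights common terms
print(tfidf.fit_transform(docs).toarray().round(2))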
3. Classification
Supervised Learning: Using labelled data to train models to classify new, unseen text
data. Algorithms such as Naive Bayes, Support Vector Machines (SVM), and Random
Forests are commonly applied in text classification.
Text Categorization: Organizing text into predefined categories (e.g., spam vs. non-
spam, sentiment analysis).
4. Clustering
Unsupervised Learning: Grouping similar documents together without predefined
labels. Common algorithms include K-means clustering and hierarchical clustering.
5. Sentiment Analysis
Polarity Detection: Determining the sentiment of a piece of text (positive, negative, or
neutral). This involves analyzing the text to assess the emotional tone behind words or
phrases.
6. Association Rule Mining
Identifying relationships between different elements in large datasets. In text mining,
this could be identifying which terms or phrases tend to appear together across
different documents.
7. Text Summarization
Extractive Summarization: Identifying the most relevant sentences or phrases in a text
and combining them to form a summary.
Abstractive Summarization: Generating new sentences that convey the core ideas of
the text in a condensed form, often using techniques from deep learning.
In information retrieval (IR), unlike in database systems, the information in text is
unstructured; there is no structured query language like SQL for text retrieval.
It is safe to say that Web search is the single most important application of IR. To a great
extent, Web search also helped IR. Indeed, the tremendous success of search engines has
pushed IR to the center stage. Search is, however, not simply a straightforward application of
traditional IR models. It uses some IR results, but it also has its unique techniques and
presents many new problems for IR research.
First of all, efficiency is a paramount issue for Web search, but is only secondary in
traditional IR systems mainly due to the fact that document collections in most IR systems
are not very large. However, the number of pages on the Web is huge. For example, at the
moment, Google has indexed more than 8 billion pages. Web users also demand very fast
responses. No matter how effective an algorithm is, if the retrieval cannot be done efficiently,
few people will use it.
Web pages are also quite different from conventional text documents used in traditional
IR systems. First, Web pages have hyperlinks and anchor texts, which do not exist in
traditional documents (except citations in research publications). Hyperlinks are extremely
important for search and play a central role in search ranking algorithms as we will see in the
next chapter. Anchor texts associated with hyperlinks too are crucial because a piece of
anchor text is often a more accurate description of the page that its hyperlink points to.
Second, Web pages are semi-structured.
Finally, spamming is a major issue on the Web, but not a concern for traditional IR. This
is so because the rank position of a page returned by a search engine is extremely important.
If a page is relevant to a query but is ranked very low (e.g., below top 30), then the user is
unlikely to look at the page. If the page sells a product, then this is bad for the business. In
order to improve the ranking of some target pages, "illegitimate" means, called spamming,
are often used to boost their rank positions. Detecting and fighting Web spam is a critical
issue as it can push low-quality (even irrelevant) pages to the top of the search rank, which
harms the quality of the search results and the user's search experience.
4. Proximity queries
A proximity query asks for documents in which the query terms occur close to one another;
the user may also be able to specify the maximum allowed distance between the query terms.
Most search engines consider both term proximity and term ordering in retrieval.
5. Full document queries
When the query is a full document, the user wants to find other documents that are similar to
the query document. Some search engines (e.g., Google) allow the user to issue such a query
by providing the URL of a query page. Additionally, in the returned results of a search
engine, each snippet may have a link called "more like this" or "similar pages." When the
user clicks on the link, a set of pages similar to the page in the snippet is returned.
6. Natural language questions
This is the most complex case, and also the ideal case. The user expresses his/her information
need as a natural language question. The system then finds the answer. However, such queries
are still hard to handle due to the difficulty of natural language understanding. Nevertheless,
this is an active research area, called question answering. Some search systems are starting to
provide question answering services on some specific types of questions, e.g., definition
questions, which ask for definitions of technical terms. Definition questions are usually easier
to answer because there are strong linguistic patterns indicating definition sentences, e.g.,
"defined as", "refers to", etc.
The indexer is the module that indexes the original raw documents in some data structures to
enable efficient retrieval. The retrieval system computes a relevance score for each indexed
document to the query. According to their relevance scores, the documents are ranked and
presented to the user. Note that it usually does not compare the user query with every
document in the collection, which is too inefficient. Instead, only a small subset of the
documents that contains at least one query term is first found from the index and relevance
scores with the user query are then computed only for this subset of documents.
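A minimal, hedged sketch of such an index (plain Python, invented documents) shows how candidate documents are narrowed down before scoring:

# An inverted index: term -> set of documents that contain the term.
from collections import defaultdict

docs = {1: "data mining finds hidden patterns",
        2: "web search engines index web pages",
        3: "search engines rank pages by relevance"}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

query = "search engines".split()
candidates = set().union(*(index[t] for t in query))
print(sorted(candidates))     # [2, 3]: only these documents need relevance scores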
ii. Vector Space Model
This model is perhaps the best known and most widely used IR model. A document in the
vector space model is represented as a weight vector, in which each component weight is
computed based on some variation of the term frequency (TF) or TF-IDF scheme. The weight
wij of term ti in document dj is no longer in {0, 1} as in the Boolean model, but can be any
number.
iii. Statistical Language Model
Statistical language models (or simply language models) are based on probability and have
foundations in statistical theory. The basic idea of this approach to retrieval is simple. It first
estimates a language model for each document, and then ranks documents by the likelihood
of the query given the language model.
Evaluation Measures
Evaluation is the key to making real progress in data mining, as there is always a
need to examine the resulting outputs of a model for correctness. A number of evaluation
measures are available to determine the presence of error in a model; which one is appropriate
depends on the type of model, i.e., what the model does. In order to evaluate a regression
model, i.e., a model whose target is a continuous value, several alternative measures
(presented in Table 5.1) can be used to evaluate the success of the numeric predictions. Four
of the evaluation measures that can be used to compute the accuracy of numeric prediction
are as follows:
1. Mean Squared Error (MSE): This is the principal and most commonly used
measurement; it is sometimes used as the objective function. The square root is sometimes
taken to give the measure the same dimensions as the predicted value itself (the root mean-
squared error). Many mathematical techniques such as linear regression use the mean-squared
error because it tends to be the easiest measure to manipulate; mathematicians say it is "well
behaved". The MSE can be used in several ways, but here it is being used as a performance
measure. Generally, most of the performance measures are easy to calculate, so mean-squared
error has no exceptional advantage for this purpose.
2. Mean Absolute Error (MAE): This is the average of the magnitudes of the individual
errors regardless of their sign. Mean squared error tends to exaggerate the effect of
outliers in a dataset when the prediction error is larger than the others, but the MAE does
not have this effect: all sizes of error are treated evenly according to their magnitude. In
terms of importance, sometimes it is the relative rather than absolute error values that
may be seen as vital. For example, if a 10% error is equally important whether it is an
error of 50 in a prediction of 500 cases or an error of 0.2 in a prediction of 2 cases, then
averages of absolute error will be meaningless, the relative error appears to be appropriate
in an instance like this.
3. Relative Squared Error (RSE): This differs a bit from the previous error measurements.
Here, the error is made relative to what it would have been if a simple predictor had been
used. The simple predictor in question is just the average of the actual values from the
training data, which is denoted by 'a' in Table 5.1. Thus, relative squared error takes the
total squared error and normalizes it by dividing by the total squared error of the default
predictor. The root relative squared error is obtained in the obvious way.
4. Relative Absolute Error (RAE): This is simply the total absolute error, with the same kind
of normalization. In the relative error measures, the errors are normalized by the error of
the simple predictor that predicts average values. These measurements of numeric
predictions are further summarized in Table 5.1
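Since Table 5.1 itself is not reproduced here, the following hedged sketch (NumPy assumed, toy numbers invented) computes the four measures directly from their textual definitions, with p denoting the predictions and a the actual values:

# MSE, MAE, RSE and RAE for a small set of numeric predictions.
import numpy as np

p = np.array([12.0,  9.5, 20.0, 15.0])       # predicted values
a = np.array([10.0, 10.0, 22.0, 14.0])       # actual values

mse = np.mean((p - a) ** 2)                  # Mean Squared Error
mae = np.mean(np.abs(p - a))                 # Mean Absolute Error

baseline = np.mean(a)                        # the "simple predictor": mean of actual values
rse = np.sum((p - a) ** 2) / np.sum((a - baseline) ** 2)    # Relative Squared Error
rae = np.sum(np.abs(p - a)) / np.sum(np.abs(a - baseline))  # Relative Absolute Error

print(f"MSE={mse:.2f}  MAE={mae:.2f}  RSE={rse:.2f}  RAE={rae:.2f}")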
For classification tasks, accuracy alone can be misleading when the classes are highly
imbalanced. For instance, suppose 99% of the cases are normal in an intrusion detection data
set. Then a classifier can achieve 99% accuracy (without doing anything) by simply classifying
every test case as "not intrusion". This is, however, useless.
Precision and recall are more suitable in such applications because they measure how
precise and how complete the classification is on the positive class. It is convenient to
introduce these measures using a confusion matrix. A confusion matrix contains information
about actual and predicted results given by a classifier.
Based on the confusion matrix, the precision (p) and recall (r) of the positive class are defined
as follows:
p = TP / (TP + FP)    and    r = TP / (TP + FN),
where TP is the number of true positives (positive examples correctly classified as positive),
FP is the number of false positives (negative examples classified as positive) and FN is the
number of false negatives (positive examples classified as negative).
In words, precision p is the number of correctly classified positive examples divided by the
total number of examples that are classified as positive. Recall r is the number of correctly
classified positive examples divided by the total number of actual positive examples in the
test set. The intuitive meanings of these two measures are quite obvious. However, it is hard
to compare classifiers based on two measures, which are not functionally related. For a test
set, the precision may be very high but the recall can be very low, and vice versa.
Example 1: A test data set has 100 positive examples and 1000 negative examples. After
classification using a classifier, we have the following confusion matrix (Table 5.2):

Table 5.2 Confusion matrix of a classifier.
                      Classified positive    Classified negative
Actual positive                1                       99
Actual negative                0                     1000

This confusion matrix gives the precision p = 1 / (1 + 0) = 100% and the recall
r = 1 / 100 = 1%, because we only classified one positive example correctly and classified no
negative examples wrongly.
Although in theory precision and recall are not related, in practice high precision is achieved
almost always at the expense of recall and high recall is achieved at the expense of precision.
In an application, which measure is more important depends on the nature of the application.
If we need a single measure to compare different classifiers, the F-score is often used.
The F-score (also called the F1-score) is the harmonic mean of precision and recall,
F = 2pr / (p + r).
The harmonic mean of two numbers tends to be closer to the smaller of the two. Thus, for the
F-score to be high, both p and r must be high. There is also another measure, called the precision
and recall breakeven point, which is used in the information retrieval community.
The breakeven point is when the precision and the recall are equal. This measure assumes
that the test cases can be ranked by the classifier based on their likelihoods of being positive.
For instance, in decision tree classification, we can use the confidence of each leaf node as
the value to rank test cases.
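The classification measures above can be computed directly from the confusion-matrix counts; the hedged sketch below reuses the counts of Example 1 and also shows why accuracy alone is misleading:

# Precision, recall, F-score and accuracy from confusion-matrix counts.
TP, FN = 1, 99        # of 100 actual positives, only 1 is classified correctly
FP, TN = 0, 1000      # no negative example is classified as positive

precision = TP / (TP + FP)
recall    = TP / (TP + FN)
f_score   = 2 * precision * recall / (precision + recall)
accuracy  = (TP + TN) / (TP + TN + FP + FN)

print(f"p={precision:.2f}  r={recall:.2f}  F1={f_score:.3f}  accuracy={accuracy:.3f}")
# p=1.00  r=0.01  F1=0.020  accuracy=0.910: high accuracy, yet almost no positives found.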
Example 2: We have the following ranking of 20 text documents. 1 represents the highest
rank and 20 represents the lowest rank. "+" ("−") represents an actual positive (negative)
document.
CHALLENGES OF DATA MINING
1. Privacy and Security
Data mining can inadvertently expose personal or confidential information, leading to privacy
violations. This is particularly concerning when working with personal, financial, or medical
data. Also, if security measures are not sufficiently robust, unauthorized individuals could
access sensitive datasets, leading to potential data breaches.
2. Data Quality
Data mining models are only as good as the data they are based on. Incomplete, incorrect, or
inconsistent data can lead to flawed results. If irrelevant or noisy data are included in the
analysis, they can obscure meaningful patterns and reduce the accuracy of predictions or
classifications.
3. Overfitting
A model may fit the training data too closely, capturing noise rather than the underlying
patterns, and consequently generalize poorly to new, unseen data.
4. Lack of Interpretability
Black Box Models: Some advanced algorithms (e.g., deep learning) are hard to
interpret, making it difficult for users to understand how decisions are made. This lack
of transparency can undermine trust in the model's outcomes.
5. Misleading Patterns
Correlation vs Causation: Data mining often uncovers correlations, but these may
not imply causality. Relying solely on correlations can lead to erroneous or
misleading conclusions.
6. Dependence on Domain Knowledge
Inability to Validate Models: Domain knowledge is also critical for validating and
assessing the effectiveness of data mining models in real-world situations.
THE END