Data Mining Notes
Data mining is a process used by companies to turn raw data into useful information. By using
software to look for patterns in large batches of data, businesses can learn more about their
customers to develop more effective marketing strategies, increase sales and decrease costs. Data
mining depends on effective data collection, warehousing, and computer processing.
Data mining is a significant method where previously unknown and potentially useful
information is extracted from the vast amount of data. The data mining process involves several
components, and these components constitute a data mining system architecture.
The significant components of data mining systems are a data source, data mining engine, data
warehouse server, the pattern evaluation module, graphical user interface, and knowledge base.
Data Source:
The actual sources of data are databases, data warehouses, the World Wide Web (WWW), text files,
and other documents. You need a large amount of historical data for data mining to be
successful. Organizations typically store data in databases or data warehouses. Data warehouses
may comprise one or more databases, text files, spreadsheets, or other repositories of data.
Sometimes even plain text files or spreadsheets contain useful information. Another primary
source of data is the World Wide Web.
Different processes:
Before the data is passed to the database or data warehouse server, it must be cleaned,
integrated, and selected. Because the information comes from various sources and in different
formats, it can't be used directly for data mining: it may be incomplete or inaccurate. So the
data first needs to be cleaned and unified. More information than needed will be collected from
the various data sources, and only the data of interest has to be selected and passed to the
server. These procedures are not as simple as they sound. Several methods may be applied to the
data as part of selection, integration, and cleaning.
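The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout, the field names (customer_id, name, age, last_visit) and the helper functions clean, integrate and select are hypothetical and do not come from any particular tool.

```python
def clean(records):
    """Drop incomplete records and normalise text fields."""
    cleaned = []
    for rec in records:
        if any(v is None or v == "" for v in rec.values()):
            continue  # incomplete record: skip it
        cleaned.append({k: v.strip().lower() if isinstance(v, str) else v
                        for k, v in rec.items()})
    return cleaned


def integrate(*sources):
    """Merge several sources into one record per customer_id."""
    merged = {}
    for source in sources:
        for rec in source:
            merged.setdefault(rec["customer_id"], {}).update(rec)
    return list(merged.values())


def select(records, fields):
    """Keep only the attributes of interest."""
    return [{f: rec[f] for f in fields if f in rec} for rec in records]


# Two hypothetical sources in different shapes.
crm = [{"customer_id": 1, "name": " Alice ", "age": 34},
       {"customer_id": 2, "name": "", "age": 51}]  # missing name
web = [{"customer_id": 1, "last_visit": "2024-01-05"}]

prepared = select(integrate(clean(crm), clean(web)),
                  ["customer_id", "age", "last_visit"])
print(prepared)
```

Real pipelines usually repair or impute bad values rather than simply dropping the record; dropping is used here only to keep the sketch short.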
The database or data warehouse server holds the original data that is ready to be processed.
Hence, the server is responsible for retrieving the relevant data, based on the user's data
mining request.
The data mining engine is a major component of any data mining system. It contains several
modules for operating data mining tasks, including association, characterization, classification,
clustering, prediction, time-series analysis, etc.
In other words, the data mining engine is the core of the data mining architecture. It comprises
the instruments and software used to obtain insights and knowledge from data collected from
various data sources and stored within the data warehouse.
The pattern evaluation module is primarily responsible for measuring how interesting a
discovered pattern is, using a threshold value. It collaborates with the data mining engine to
focus the search on interesting patterns.
This module commonly employs interestingness measures that cooperate with the data mining
modules to focus the search towards interesting patterns. It might use an interestingness
threshold to filter out discovered patterns. Alternatively, the pattern evaluation module might
be integrated with the mining module, depending on the implementation of the data mining
techniques used. For efficient data mining, it is highly recommended to push the evaluation of
pattern interestingness as deep as possible into the mining process, so that the search is
confined to only the interesting patterns.
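As a rough illustration of threshold-based pattern evaluation, the sketch below keeps only the patterns whose score reaches a threshold. The Pattern type, the score values and the function name filter_interesting are assumptions made for this example; a real system would compute the score from a measure such as support or confidence.

```python
from typing import NamedTuple


class Pattern(NamedTuple):
    description: str
    score: float  # hypothetical interestingness score in [0, 1]


def filter_interesting(patterns, threshold):
    """Keep patterns whose score meets the threshold, best first."""
    kept = [p for p in patterns if p.score >= threshold]
    return sorted(kept, key=lambda p: p.score, reverse=True)


found = [
    Pattern("bread -> butter", 0.72),
    Pattern("pen -> milk", 0.05),
    Pattern("phone -> cover", 0.64),
]
interesting = filter_interesting(found, threshold=0.5)
for p in interesting:
    print(p.description, p.score)
```

Filtering after mining, as done here, is the simple variant; pushing the same test into the mining loop avoids generating uninteresting patterns in the first place.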
The graphical user interface (GUI) module communicates between the data mining system and
the user. This module helps the user to easily and efficiently use the system without knowing the
complexity of the process. This module cooperates with the data mining system when the user
specifies a query or a task and displays the results.
Knowledge Base:
The knowledge base is helpful throughout the data mining process. It might be used to guide
the search or to evaluate the interestingness of the resulting patterns. The knowledge base may
even contain user views and data from user experiences that might be helpful in the data mining
process. The data mining engine may receive inputs from the knowledge base to make the result
more accurate and reliable. The pattern evaluation module regularly interacts with the
knowledge base to get inputs and also to update it.
Effective data mining aids in various aspects of planning business strategies and managing
operations. That includes customer-facing functions such as marketing, advertising, sales and
customer support, plus manufacturing, supply chain management, finance and HR. Data mining
supports fraud detection, risk management, cybersecurity planning and many other critical
business use cases. It also plays an important role in healthcare, government, scientific research,
mathematics, sports and more.
There are a number of different data repositories on which mining can be performed. In
principle, data mining should be applicable to any kind of data repository, as well as to
transient data, such as data streams. Data repositories include relational databases, data
warehouses, transactional databases, advanced database systems, data streams, and the World
Wide Web. These data repositories are called data types for data mining. The various data types
for data mining are as follows:
Relational Database —
A relational database is a collection of tables, each of which is assigned a unique name. Each
table consists of a set of attributes (columns or fields) and usually stores a large set of
tuples (records or rows). Each tuple in a relational table represents an object identified by a
unique key and described by a set of attribute values. A semantic data model, such as an
entity-relationship (ER) data model, is often constructed for relational databases.
The AllElectronics company is described by the following tables: customer, item, employee, and
branch. Each table has its own attributes describing its properties. When data mining is applied
to relational databases, we can go further by searching for trends or data patterns. For
example, a data mining system can analyze customer data to predict the credit risk of new
customers based on their income, age, and previous credit information. A data mining system may
also detect deviations, such as items whose sales are far from those expected in comparison
with the previous year. Such deviations can then be further investigated.
Data Warehouse —
A data warehouse is a repository of information collected from multiple sources, stored under a
unified schema, and usually residing at a single site. Data warehouses are constructed via a
process of data cleaning, data integration, data transformation, data loading, and periodic data
refreshing. The figure below shows the typical framework for the construction and use of a data
warehouse for the AllElectronics company described above.
To facilitate decision making, the data in a data warehouse are organized around major
subjects, such as customer, item, supplier, and activity. The data are stored to provide
information from a historical perspective and are typically summarized. For example, rather than
storing the details of each sales transaction, the data warehouse may store a summary of the
transactions at a higher level, for each sales region.
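The summarization described above can be sketched as a simple roll-up from individual transactions to one row per sales region. The field names (trans_id, region, amount) and the sample values are assumptions made for illustration only.

```python
from collections import defaultdict


def summarize_by_region(transactions):
    """Roll individual sales up to one summary row per region."""
    totals = defaultdict(lambda: {"sales": 0.0, "count": 0})
    for t in transactions:
        totals[t["region"]]["sales"] += t["amount"]
        totals[t["region"]]["count"] += 1
    return dict(totals)


# Hypothetical detail-level transactions.
transactions = [
    {"trans_id": "T100", "region": "north", "amount": 120.0},
    {"trans_id": "T200", "region": "south", "amount": 80.0},
    {"trans_id": "T300", "region": "north", "amount": 30.0},
]

summary = summarize_by_region(transactions)
print(summary)
```

The warehouse would store only the summary rows, which is why detail such as the individual trans_id values is no longer available at this level.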
Transactional Database —
In general, a transactional database consists of a file where each record represents a
transaction. A transaction typically includes a unique transaction identity number (trans_ID)
and a list of the items making up the transaction (such as items purchased in a store).
The transactional database may have additional tables associated with it, which contain other
information regarding the sale, such as the date of the transaction, the customer's ID number,
the ID number of the salesperson and of the branch at which the sale occurred, and so on. In
the transactional database for AllElectronics, transactions can be stored in a table, with one
record per transaction. From a relational database point of view, the sales table is a nested
relation because of the attribute list of items.
[Figure: A fragment of a transactional database for AllElectronics.]
1. A set of variables that describe the object. These correspond to attributes in
entity-relationship models.
2. A set of messages that the object can use to communicate with other objects or with the rest
of the database system.
3. A set of methods, where each method holds the code to implement a message.
Objects that share a common set of properties can be grouped into an object class. Each
object is an instance of its class. Object classes can be organized into class/subclass
hierarchies so that each class represents properties that are common to the objects in that
class. For instance, an employee class can contain variables like name, address, and birth
date. Suppose that the class sales_person is a subclass of the class employee. A sales_person
object would then inherit all of the variables pertaining to its superclass, employee.
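The class/subclass idea maps directly onto inheritance in an object-oriented language. The sketch below mirrors the employee and sales_person example; the commission_rate attribute and the commission method are hypothetical additions used only to show a subclass extending its superclass.

```python
class Employee:
    """Superclass holding the variables common to all employees."""

    def __init__(self, name, address, birth_date):
        self.name = name
        self.address = address
        self.birth_date = birth_date


class SalesPerson(Employee):
    """Subclass: inherits name, address and birth_date from Employee."""

    def __init__(self, name, address, birth_date, commission_rate):
        super().__init__(name, address, birth_date)
        self.commission_rate = commission_rate  # hypothetical extra variable

    def commission(self, sale_amount):
        """A method, that is, the code that implements a message."""
        return sale_amount * self.commission_rate


sp = SalesPerson("Ana", "12 Main St", "1990-04-01", commission_rate=0.05)
print(sp.name, sp.address, sp.birth_date)  # inherited variables
print(isinstance(sp, Employee))
```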
For data mining in object-relational systems, techniques need to be developed for handling
complex object structures, complex data types, class and subclass hierarchies, property
inheritance, and methods and procedures.
2. Temporal Database –
A temporal database typically stores relational data that include time-related attributes.
These attributes may involve several timestamps, each having different semantics.
3. Sequence Database –
A sequence database stores sequences of ordered events, with or without a concrete notion of
time. Examples include customer shopping sequences, web clickstreams, and biological
sequences.
4. Time-series Database –
A time-series database stores sequences of values or events obtained over repeated
measurements of time. Examples include data collected from the stock exchange, inventory
control, and the observation of natural phenomena.
5. Spatial Database –
For example, a 2-D satellite image may be represented as raster data, where each pixel registers
the rainfall in a given area. Maps can be represented in vector format, where roads, bridges,
buildings, and lakes are represented as unions or overlays of basic geometric constructs, such
as points, lines, polygons, and the partitions and networks formed by these components.
6. Spatiotemporal Database—
A spatial database that stores spatial objects that change with time is called a spatiotemporal
database, from which interesting information can be mined. For example, we may be able to
group the trends of moving objects and identify some strangely moving vehicles, or distinguish a
bioterrorist attack from a normal outbreak of the flu based on the geographic spread of a
disease over time.
7. Text Database —
Text databases are databases that contain word descriptions for objects. These word descriptions
are usually not simple keywords but rather long sentences or paragraphs, such as product
specifications, error or bug reports, warning messages, summary reports, notes, or other
documents. Text databases may be highly unstructured, such as some Web pages on the WWW.
Some text databases may be somewhat structured, that is, semistructured, whereas others are
relatively well structured. Text databases with highly regular structures typically can be
implemented using relational database systems.
8. Multimedia Database —
Multimedia databases store image, audio, and video data. They are used in applications such as
picture content-based retrieval, voice-mail systems, video-on-demand systems, and speech-based
user interfaces that recognize spoken commands. Multimedia databases must support
large objects because data objects such as video can require gigabytes of storage. Specialized
storage and search techniques are also required. Because video and audio data require real-time
retrieval at a steady and predetermined rate in order to avoid picture or sound gaps and system
buffer overflows, such data are referred to as continuous-media data.
9. Heterogeneous Database —
Many enterprises acquire legacy databases as a result of the long history of information
technology development (including the application of different hardware and operating systems).
A legacy database is a group of heterogeneous databases that combines different kinds of data
systems, such as relational or object-oriented databases, hierarchical databases, network
databases, spreadsheets, multimedia databases, or file systems. The heterogeneous databases in a
legacy database may be connected by intra- or inter-computer networks.
Data mining and data analysis are often confused. Data mining functions are used to define the
trends or correlations uncovered by data mining activities. Data analysis is used to test
statistical models that fit the dataset, for example, the analysis of a marketing campaign,
whereas data mining uses machine learning and mathematical and statistical models to discover
patterns hidden in the data. Data mining activities can be divided into two categories:
Descriptive Data Mining: It provides knowledge about what is happening within the data without
any prior assumption. It highlights the common features of the data set, for example, counts
and averages.
Predictive Data Mining: It assigns values or labels to attributes that are not yet known. With
previously available or historical data, data mining can be used to make predictions about
critical business metrics based on the data's linearity. For example, predicting the volume of
business next quarter based on performance in the previous quarters over several years, or
judging from the findings of a patient's medical examinations whether the patient is suffering
from a particular disease.
Data mining functionalities are used to represent the type of patterns that have to be discovered
in data mining tasks. Data mining tasks can be classified into two types: descriptive and
predictive. Descriptive mining tasks define the common features of the data in the database, and
predictive mining tasks perform inference on the current data to make predictions.
Data mining is extensively used in many areas and sectors to characterize data and to make
predictions. The ultimate objective of the data mining functionalities is to observe the
various trends in the data. The main data mining functionalities are:
1. Class/Concept Descriptions
A class or concept implies there is a data set or set of features that define the class or a concept.
A class can be a category of items on a shop floor, and a concept could be the abstract idea on
which data may be categorized like products to be put on clearance sale and non-sale products.
There are two concepts here, one that helps with grouping and the other that helps in
differentiating.
Data Characterization: This refers to the summary of general characteristics or features of the
class, resulting in specific rules that define a target class. A data analysis technique called
Attribute-oriented Induction is employed on the data set for achieving characterization.
Data Discrimination: Discrimination is used to separate distinct data sets based on the
disparity in attribute values. It compares features of a class with features of one or more
contrasting classes. The results can be presented in forms such as bar charts, curves, and pie
charts.
One of the functions of data mining is finding frequent patterns, that is, patterns that occur
most often in the data. Several kinds of frequent patterns can be found in a dataset.
Frequent item set: A group of items that are commonly found together, such as milk and sugar.
Frequent substructure: Structural forms, such as trees and graphs, that occur frequently and
may be combined with item sets or subsequences.
Frequent subsequence: A frequently occurring ordered series of events, such as buying a phone
followed by a cover.
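A frequent item set can be found, in the simplest case, by counting how often each pair of items appears in the same transaction. The sketch below is a brute-force count for pairs only, assuming small in-memory baskets. It is not the Apriori or FP-growth algorithm, and the function name frequent_pairs and the sample baskets are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_count):
    """Return item pairs that occur together in at least min_count baskets."""
    counts = Counter()
    for items in transactions:
        # sorted(set(...)) gives each pair one canonical order per basket
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}


baskets = [
    ["milk", "sugar", "bread"],
    ["milk", "sugar"],
    ["bread", "butter"],
    ["milk", "sugar", "butter"],
]
print(frequent_pairs(baskets, min_count=3))
```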
3. Association Analysis
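It analyses the sets of items that generally occur together in a transactional dataset. It is
also known as Market Basket Analysis because of its wide use in retail sales. Two parameters
are used for determining the association rules: support, which measures how often the items
occur together in the dataset, and confidence, which measures how often the rule holds when its
condition is met.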
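In the standard formulation of association rules, the two parameters are support (the fraction of transactions that contain all the items of a rule) and confidence (the fraction of transactions containing the left-hand side that also contain the right-hand side). The sketch below computes both for a rule such as milk -> sugar. The function names and the sample baskets are assumptions made for this example.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """support(antecedent and consequent) / support(antecedent)."""
    both = support(transactions, set(antecedent) | set(consequent))
    base = support(transactions, antecedent)
    return both / base if base else 0.0


baskets = [
    ["milk", "sugar", "bread"],
    ["milk", "sugar"],
    ["milk", "bread"],
    ["bread", "butter"],
]
print(support(baskets, {"milk", "sugar"}))       # 2 of 4 baskets
print(confidence(baskets, {"milk"}, {"sugar"}))  # 2 of the 3 milk baskets
```

A rule is usually reported only when both values exceed user-chosen minimum thresholds.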
4. Classification
Classification is a data mining technique that categorizes items in a collection based on some
predefined properties. It uses methods like if-then rules, decision trees, or neural networks to predict a
class or essentially classify a collection of items. A training set containing items whose
properties are known is used to train the system to predict the category of items from an
unknown collection of items.
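To show the train-then-predict idea in the smallest possible form, the sketch below uses a nearest-neighbour rule: an unknown item receives the class of the most similar training item. This is a deliberately simpler technique than the decision trees or neural networks mentioned above, and the two numeric features and the labels are hypothetical.

```python
import math


def predict_1nn(training_set, item):
    """Return the label of the training item closest to item."""
    _features, label = min(
        training_set, key=lambda pair: math.dist(pair[0], item)
    )
    return label


# Training set: (features, known class) pairs with made-up values.
training = [
    ((1.0, 1.0), "budget"),
    ((1.2, 0.8), "budget"),
    ((5.0, 5.5), "premium"),
    ((6.0, 5.0), "premium"),
]

print(predict_1nn(training, (1.1, 1.1)))
print(predict_1nn(training, (5.5, 5.2)))
```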
5. Prediction
It predicts unavailable data values or spending trends. The value for an object can be
predicted based on the attribute values of the object and the attribute values of the classes.
It can be a prediction of missing numerical values or of increasing or decreasing trends in
time-related information.
There are primarily two types of predictions in data mining: numeric and class predictions.
Numeric predictions are made by creating a linear regression model that is based on historical
data. Prediction of numeric values helps businesses ramp up for a future event that might impact
the business positively or negatively.
Class predictions are used to fill in missing class information for products using a training data
set where the class for products is known.
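A numeric prediction of the kind described above can be sketched with ordinary least squares for a single input variable. The quarterly sales figures are invented for the example and the function name fit_line is an assumption; real data would rarely fit a line this cleanly.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


quarters = [1, 2, 3, 4]
sales = [110, 120, 130, 140]  # hypothetical historical data

slope, intercept = fit_line(quarters, sales)
next_quarter = slope * 5 + intercept
print(slope, intercept, next_quarter)
```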
6. Cluster Analysis
In image processing, pattern recognition and bioinformatics, clustering is a popular data mining
functionality. It is similar to classification, but the classes are not predefined. Data attributes
represent the classes. Similar data are grouped together, with the difference being that a class
label is not known. Clustering algorithms group data based on similar features and
dissimilarities.
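One common way to group similar data without predefined classes is k-means. The sketch below is a minimal version that assumes the caller supplies the starting centroids, which keeps the run deterministic; the points and the function name k_means are illustrative only.

```python
import math


def k_means(points, centroids, iterations=10):
    """Assign points to the nearest centroid, then move each centroid
    to the mean of its cluster; repeat a fixed number of times."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster))
            if cluster else centroids[i]  # keep an empty cluster's centroid
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, centroids=[(0, 0), (10, 10)])
print(centroids)
print(clusters)
```

No class labels are given to the algorithm; the two groups emerge only from the distances between points.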
7. Outlier Analysis
Outlier analysis is important for understanding the quality of data. If there are too many
outliers, you cannot trust the data or draw patterns from it. An outlier analysis determines
whether there is something out of the ordinary in the data and whether it indicates a situation
that a business needs to consider and take measures to mitigate. Data that the algorithms
cannot group into any class is pulled up for outlier analysis.
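A simple way to flag values that are out of the ordinary is the z-score: how many standard deviations a value lies from the mean. The sketch below assumes roughly bell-shaped numeric data; the threshold of 2.0 and the sample values are arbitrary choices for illustration.

```python
import statistics


def z_score_outliers(values, threshold=2.0):
    """Return the values lying more than threshold standard deviations
    from the mean (population standard deviation)."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []  # all values identical: nothing stands out
    return [v for v in values if abs(v - mean) / spread > threshold]


daily_orders = [10, 12, 11, 13, 12, 11, 95]  # hypothetical counts
print(z_score_outliers(daily_orders))
```

With very small samples a single extreme value inflates the standard deviation, so methods based on the median or the interquartile range are often more robust.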
9. Correlation Analysis
Correlation is a mathematical technique for determining whether and how strongly two attributes
are related to one another. It determines how well two numerically measured continuous
variables are linked. Researchers can use this type of analysis to see if there are any
possible correlations between variables in their study.
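One widely used measure of how strongly two numeric attributes are linked is the Pearson correlation coefficient, which ranges from -1 (perfect inverse linear relation) to +1 (perfect direct linear relation). The sketch below is a plain implementation; the function name pearson and the sample values are assumptions, and it expects both attributes to vary (non-zero variance).

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


ad_spend = [1, 2, 3, 4]
revenue = [2, 4, 6, 8]  # hypothetical, perfectly linear in ad_spend
print(pearson(ad_spend, revenue))
print(pearson([1, 2, 3], [3, 2, 1]))
```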
(OPTIONAL)
With data mining, businesses have been found to gain more profit. It has helped not only in
understanding customer demand but also in developing effective strategies to improve overall
business turnover. It has helped in determining business objectives for making clear decisions.
Data collection and data warehousing, and computer processing are some of the strongest pillars
of data mining. Data mining utilizes the concept of mathematical algorithms to segment the data
and assess the possibility of occurrence of future events.
To understand the system and meet the desired requirements, data mining can be classified into
the following systems:
Classification based on the mined Databases
Classification based on the type of mined knowledge
Classification based on statistics
Classification based on Machine Learning
Classification based on visualization
Classification based on Information Science
Classification based on utilized techniques
Classification based on adapted applications
A data mining system can be classified based on the types of databases that have been mined. A
database system can be further segmented based on distinct principles, such as data models,
types of data, etc., which further assist in classifying a data mining system.
For example, if we want to classify a database based on the data model, we need to select either
relational, transactional, object-relational or data warehouse mining systems.
A data mining system categorized based on the kind of knowledge mined may have the following
functionalities:
1. Characterization
2. Discrimination
3. Association and Correlation Analysis
4. Classification
5. Prediction
6. Outlier Analysis
7. Evolution Analysis
A data mining system can also be classified based on the type of techniques that are being
incorporated. These techniques can be assessed based on the degree of user interaction
involved or the methods of analysis employed.
Data mining systems classified based on the applications adapted are as follows:
1. Finance
2. Telecommunications
3. DNA
4. Stock Markets
5. E-mail
No Coupling
In the no-coupling scheme, the data mining system does not use any database or data warehouse
system functions.
Loose Coupling
In loose coupling, data mining utilizes some of the database or data warehouse system
functionalities. It mainly fetches the data from the data repository managed by these systems and
then performs data mining. The results are kept either in the file or any designated place in the
database or data warehouse.
Semi-Tight Coupling
In semi-tight coupling, data mining is linked to either the DB or DW system and provides an
efficient implementation of data mining primitives within the database.
Tight Coupling
In tight coupling, the data mining system is smoothly integrated with a database or data
warehouse system.
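The loose-coupling scheme (fetch data from the database, mine it outside the database, store the result back in a designated place) can be sketched with Python's built-in sqlite3 module. The table names sales and mining_result, the columns and the sample rows are hypothetical, and an in-memory database stands in for a real database or data warehouse.

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (trans_id TEXT, item TEXT)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("T100", "milk"), ("T100", "sugar"), ("T200", "milk"), ("T300", "bread")],
)

# 1. Fetch the data from the repository managed by the database system.
rows = conn.execute("SELECT item FROM sales").fetchall()

# 2. Perform the mining step outside the database (here: item frequencies).
item_counts = Counter(item for (item,) in rows)

# 3. Store the result in a designated place in the database.
conn.execute("CREATE TABLE mining_result (item TEXT, frequency INTEGER)")
conn.executemany("INSERT INTO mining_result VALUES (?, ?)", item_counts.items())

top = conn.execute(
    "SELECT item, frequency FROM mining_result ORDER BY frequency DESC LIMIT 1"
).fetchone()
print(top)
```

In a semi-tight or tight scheme, more of step 2 would run inside the database itself, for example as aggregation queries or built-in mining primitives.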
Data Mining System Classification
A data mining system can be classified according to the following criteria −
Database Technology
Statistics
Machine Learning
Information Science
Visualization
Other Disciplines
Apart from these, a data mining system can also be classified based on the kind of (a) databases
mined, (b) knowledge mined, (c) techniques utilized, and (d) applications adapted.
We can classify a data mining system according to the kind of databases mined. Database systems
can be classified according to different criteria such as data models, types of data, etc., and
the data mining system can be classified accordingly.
For example, if we classify a database according to the data model, then we may have a
relational, transactional, object-relational, or data warehouse mining system.
We can classify a data mining system according to the kind of knowledge mined. It means the
data mining system is classified on the basis of functionalities such as −
Characterization
Discrimination
Association and Correlation Analysis
Classification
Prediction
Outlier Analysis
Evolution Analysis
We can classify a data mining system according to the kind of techniques used. We can describe
these techniques according to the degree of user interaction involved or the methods of analysis
employed.
We can classify a data mining system according to the applications adapted. These applications
are as follows −
Finance
Telecommunications
DNA
Stock Markets
E-mail
If a data mining system is not integrated with a database or a data warehouse system, then there
will be no system to communicate with. This scheme is known as the non-coupling scheme. In
this scheme, the main focus is on data mining design and on developing efficient and effective
algorithms for mining the available data sets.
No Coupling − In this scheme, the data mining system does not utilize any of the
database or data warehouse functions. It fetches the data from a particular source and
processes that data using some data mining algorithms. The data mining result is stored in
another file.
Loose Coupling − In this scheme, the data mining system may use some of the functions
of the database and data warehouse system. It fetches the data from the data repository
managed by these systems and performs data mining on that data. It then stores the
mining result either in a file or in a designated place in a database or in a data warehouse.
Semi−tight Coupling − In this scheme, the data mining system is linked with a database
or a data warehouse system and in addition to that, efficient implementations of a few
data mining primitives can be provided in the database.
Tight coupling − In this coupling scheme, the data mining system is smoothly integrated
into the database or data warehouse system. The data mining subsystem is treated as one
functional component of an information system.
Data Mining - Issues
Data mining is not an easy task, as the algorithms used can get very complex and data is not
always available at one place. It needs to be integrated from various heterogeneous data sources.
These factors also create some issues. Here in this tutorial, we will discuss the major issues
regarding −
Performance Issues
Handling of relational and complex types of data − The database may contain complex
data objects, multimedia data objects, spatial data, temporal data, etc. It is not possible for
one system to mine all these kinds of data.
Mining information from heterogeneous databases and global information systems
− The data is available at different data sources on a LAN or WAN. These data sources may
be structured, semi-structured, or unstructured. Therefore, mining the knowledge from
them adds challenges to data mining.
Challenges of Implementation in Data mining
Although data mining is very powerful, it faces many challenges during its execution. Various
challenges could be related to performance, data, methods, and techniques, etc. The process of
data mining becomes effective when the challenges or problems are correctly recognized and
adequately resolved.
Data mining is the process of extracting useful information from large volumes of data.
Real-world data is heterogeneous, incomplete, and noisy. Data in huge quantities will often be
inaccurate or unreliable. These problems may occur because of faulty measuring instruments or
because of human error. Suppose a retail chain collects the phone numbers of customers who
spend more than $500, and the accounting employees enter the information into their system. An
employee may mistype a digit when entering a phone number, which results in incorrect data.
Some customers may not be willing to disclose their phone numbers, which results in incomplete
data. The data could also get changed due to human or system error. All these problems (noisy
and incomplete data) make data mining challenging.
Data Distribution:
Complex Data:
Real-world data is heterogeneous, and it could be multimedia data, including audio and video,
images, complex data, spatial data, time series, and so on. Managing these various types of data
and extracting useful information is a tough task. Most of the time, new technologies, new tools,
and methodologies would have to be refined to obtain specific information.
Performance:
The data mining system's performance relies primarily on the efficiency of algorithms and
techniques used. If the designed algorithm and techniques are not up to the mark, then the
efficiency of the data mining process will be affected adversely.
Data mining can raise serious issues in terms of data security, governance, and privacy.
For example, if a retailer analyzes the details of purchased items, the analysis reveals data
about the buying habits and preferences of customers without their permission.
Data Visualization:
In data mining, data visualization is a very important process because it is the primary method
that shows the output to the user in a presentable way. The extracted data should convey the
exact meaning of what it intends to express. But many times, representing the information to the
end-user in a precise and easy way is difficult. Because the input data and the output
information are complicated, efficient and effective data visualization techniques need to be
implemented to present them successfully.