Data Mining Notes

UNIT 1

What is data mining?


Data mining is the process of sorting through large data sets to identify patterns and relationships
that can help solve business problems through data analysis. Data mining techniques and tools
enable enterprises to predict future trends and make more-informed business decisions.

Data mining is a process used by companies to turn raw data into useful information. By using
software to look for patterns in large batches of data, businesses can learn more about their
customers to develop more effective marketing strategies, increase sales and decrease costs. Data
mining depends on effective data collection, warehousing, and computer processing.

Data mining is a significant method by which previously unknown and potentially useful
information is extracted from vast amounts of data. The data mining process involves several
components, and these components constitute a data mining system architecture.

Data Mining Architecture

The significant components of data mining systems are a data source, data mining engine, data
warehouse server, the pattern evaluation module, graphical user interface, and knowledge base.
Data Source:

The actual sources of data are databases, data warehouses, the World Wide Web (WWW), text files,
and other documents. You need a huge amount of historical data for data mining to be
successful. Organizations typically store data in databases or data warehouses. Data warehouses
may comprise one or more databases, text files, spreadsheets, or other repositories of data.
Sometimes, even plain text files or spreadsheets may contain information. Another primary
source of data is the World Wide Web, or the internet.

Different processes:

Before passing the data to the database or data warehouse server, the data must be cleaned,
integrated, and selected. As the information comes from various sources and in different formats,
it can't be used directly for the data mining procedure, because the data may not be complete and
accurate. So, the data first needs to be cleaned and unified. More information than needed will
be collected from various data sources, and only the data of interest has to be selected and
passed to the server. These procedures are not as easy as they sound. Several methods may be
performed on the data as part of selection, integration, and cleaning.
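The cleaning, integration, and selection steps described above can be illustrated with a minimal Python sketch using pandas. The column names, sources, and values below are invented purely for illustration and are not part of any particular system.

    import numpy as np
    import pandas as pd

    # Two hypothetical sources arriving in different formats (illustrative data only).
    store_sales = pd.DataFrame({
        "cust_id": [1, 2, 2, 3],
        "amount": [120.0, np.nan, 85.5, 60.0],   # missing value to be cleaned
        "region": ["north", "north", "north", "south"],
    })
    web_sales = pd.DataFrame({
        "customer": [2, 3, 4],
        "amount": [40.0, 75.0, 200.0],
        "region": ["NORTH", "south", "east"],
    })

    # Integration: unify column names and value formats across the sources.
    web_sales = web_sales.rename(columns={"customer": "cust_id"})
    combined = pd.concat([store_sales, web_sales], ignore_index=True)
    combined["region"] = combined["region"].str.lower()

    # Cleaning: drop duplicate records and fill missing amounts with the column mean.
    combined = combined.drop_duplicates()
    combined["amount"] = combined["amount"].fillna(combined["amount"].mean())

    # Selection: pass only the attributes of interest on to the mining step.
    selected = combined[["cust_id", "amount"]]
    print(selected)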

Database or Data Warehouse Server:

The database or data warehouse server consists of the original data that is ready to be processed.
Hence, the server is responsible for retrieving the relevant data based on the user's data mining
request.

Data Mining Engine:

The data mining engine is a major component of any data mining system. It contains several
modules for operating data mining tasks, including association, characterization, classification,
clustering, prediction, time-series analysis, etc.

In other words, the data mining engine is the root of the data mining architecture. It comprises
the instruments and software used to obtain insights and knowledge from data collected from various
data sources and stored within the data warehouse.

Pattern Evaluation Module:

The pattern evaluation module is primarily responsible for measuring how interesting a discovered
pattern is, typically by using a threshold value. It collaborates with the data mining engine to focus
the search on interesting patterns.

This segment commonly employs interestingness measures that cooperate with the data mining modules to
focus the search towards interesting patterns. It might utilize an interestingness threshold to filter out
discovered patterns. Alternatively, the pattern evaluation module may be integrated with
the mining module, depending on the implementation of the data mining techniques used. For
efficient data mining, it is highly recommended to push the evaluation of pattern interestingness as deep
as possible into the mining procedure so as to confine the search to only the interesting patterns.

Graphical User Interface:

The graphical user interface (GUI) module communicates between the data mining system and
the user. This module helps the user to easily and efficiently use the system without knowing the
complexity of the process. This module cooperates with the data mining system when the user
specifies a query or a task and displays the results.

Knowledge Base:

The knowledge base is helpful in the entire process of data mining. It might be used to guide
the search or to evaluate the interestingness of the resulting patterns. The knowledge base may even contain user
views and data from user experiences that might be helpful in the data mining process. The data
mining engine may receive inputs from the knowledge base to make the results more accurate and
reliable. The pattern evaluation module regularly interacts with the knowledge base to get
inputs and also to update it.

Why is data mining important?


Data mining is a crucial component of successful analytics initiatives in organizations. The
information it generates can be used in business intelligence (BI) and advanced analytics
applications that involve analysis of historical data, as well as real-time analytics applications
that examine streaming data as it's created or collected.

Effective data mining aids in various aspects of planning business strategies and managing
operations. That includes customer-facing functions such as marketing, advertising, sales and
customer support, plus manufacturing, supply chain management, finance and HR. Data mining
supports fraud detection, risk management, cybersecurity planning and many other critical
business use cases. It also plays an important role in healthcare, government, scientific research,
mathematics, sports and more.

Advantages of Data Mining


 The data mining technique enables organizations to obtain knowledge-based data.
 Data mining enables organizations to make lucrative modifications in operation and
production.
 Compared with other statistical data applications, data mining is cost-efficient.
 Data mining helps the decision-making process of an organization.
 It facilitates the automated discovery of hidden patterns as well as the prediction of
trends and behaviors.
 It can be introduced into new systems as well as existing platforms.
 It is a quick process that makes it easy for new users to analyze enormous amounts of
data in a short time.

Various data types for data mining, with their applications

There are a number of different data repositories on which mining can be performed. In
principle, data mining should be applicable to any kind of data repository, as well as to transient
data, such as data streams. Data repositories include relational databases, data warehouses,
transactional databases, advanced database systems, data streams, and the World Wide Web.
These data repositories are called data types for data mining. The various data types for data mining
are as follows:

Relational Database —

A relational database is a collection of tables, each of which is assigned a unique name. Each
table consists of a set of attributes (columns or fields) and usually stores a large set of tuples
(records or rows). Each tuple in a relational table represents an object identified by a unique key
and described by a set of attribute values. A semantic data model, such as an entity-relationship
(ER) data model, is often constructed for relational databases.

The AllElectronics company is described by the following tables: customer, item, employee, and
branch. Each table has its own attributes describing its properties. When data mining is applied
to relational databases, we can go further by searching for trends or data patterns. For example,
a data mining system can analyze customer data to predict the credit risk of new customers
based on their income, age, and previous credit information. Data mining systems may also detect
deviations, such as items whose sales are far from those expected in comparison with the previous
year; such deviations can then be further investigated.

Data Warehouse —

A data warehouse is a repository of information collected from multiple sources, stored under a
unified schema, and usually residing at a single site. A data warehouse is constructed via a
process of data cleaning, data integration, data transformation, data loading, and periodic data
refreshing; a typical framework for the construction and use of a data warehouse follows this
process for the AllElectronics company described above.

To facilitate decision making, the data in a data warehouse are organized around major
subjects, such as customer, item, supplier, and activity. The data are stored so as to provide
information from a historical perspective and are typically summarized. For example, rather than
storing the details of each sales transaction, the data warehouse may store a summary of the
transactions at a higher level, for each sales region.

A data warehouse is usually modeled by a multidimensional database structure, where each
dimension corresponds to an attribute or a set of attributes in the schema, and each cell stores
the value of some aggregate measure, such as count or sales_amount. The actual physical
structure of a data warehouse may be a relational data store or a multidimensional data cube.
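As a rough illustration of this multidimensional view, the short pandas sketch below summarizes a hypothetical detailed sales table along two dimensions (region and item), with each cell holding an aggregate sales_amount; the data and column names are assumptions made only for this example.

    import pandas as pd

    # Hypothetical detailed sales transactions (illustrative data only).
    sales = pd.DataFrame({
        "region": ["east", "east", "west", "west", "east"],
        "item": ["tv", "phone", "tv", "tv", "phone"],
        "sales_amount": [300, 150, 320, 310, 170],
    })

    # Summarize to a higher level: each cell of the result holds an aggregate
    # measure (the sum of sales_amount) for one (region, item) combination,
    # much like a small two-dimensional data cube.
    cube = pd.pivot_table(sales, values="sales_amount", index="region",
                          columns="item", aggfunc="sum", fill_value=0)
    print(cube)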

Transactional Database —

In general, a transactional database consists of a file where each record represents a transaction. A
transaction typically includes a unique transaction identity number (trans_ID) and a list of the
items making up the transaction (such as items purchased in a store).

The transactional database may have additional tables associated with it, which contain other
information regarding the sale, such as the date of the transaction, the customer's ID number,
the ID number of the salesperson and of the branch at which the sale occurred, and so on. In the
transactional database for AllElectronics, transactions can be stored in a table, with one record
per transaction. From a relational database point of view, the sales table is a nested relation
because of the attribute list of items.

Advanced database systems and their applications


1. Object-relational database –

Object-relational databases are constructed based on an object-relational data model. This model
extends the relational model by providing a rich data type for handling complex objects and
object orientation. Because most sophisticated database applications need to handle complex
objects and structures, object-relational databases are becoming increasingly popular in industry
and applications. Conceptually, the object-relational data model inherits the essential concepts of
an object-oriented database, where, in general terms, each entity is considered an object. Data
and code relating to an object are encapsulated in a single unit. Each object has associated with it
the following:

1. A set of variables that describe the object. These correspond to attributes in entity-relationship
models.
2. A set of messages that the object can use to communicate with other objects or with the rest of the
database system.
3. A set of methods, where each method holds the code to implement a message.

Objects that share a common set of properties can be grouped into an object class. Each
object is an instance of its class. Object classes can be organized into class/subclass hierarchies
so that each class represents properties that are common to objects in the class. For instance, an
employee class can contain variables like name, address, and birth date. Suppose that the class
sales_person is a subclass of the class employee; a sales_person object would then inherit all of the
variables pertaining to its superclass, employee.

For data mining in object-relational systems, techniques need to be developed for handling
complex object structures, complex data types, class and subclass hierarchies, property
inheritance, and methods and procedures.

2. Temporal Database –

A temporal database typically stores relational data that include time-related attributes.
These attributes may involve several timestamps, each having different semantics.

3. Sequence Database –

A sequence database stores sequences of ordered events, with or without a concrete notion of
time. Examples include customer shopping sequences, web clickstreams, and biological
sequences.

4. Time-series Database –

A time-series database stores sequences of values or events obtained over repeated
measurements of time. Examples include data collected from the stock exchange, inventory
control, and the observation of natural phenomena.

5. Spatial Database –

Spatial databases contain spatial-related information. Examples include geographic (map)
databases, very-large-scale integration (VLSI) or computer-aided design databases, and medical
and satellite image databases. Spatial data may be represented in raster format, consisting of
n-dimensional bit maps or pixel maps.

For example, a 2-D satellite image may be represented as raster data, where each pixel registers
the rainfall in a given area. Maps can be represented in vector format, where roads, bridges,
buildings, and lakes are represented as unions or overlays of basic geometric constructs, such as
points, lines, polygons, and the partitions and networks formed by these components.

6. Spatiotemporal Database—

A spatial database that stores spatial objects that change with time is called a spatiotemporal
database, from which interesting information can be mined. For example, we may be able to
group the trends of moving objects and identify some strangely moving vehicles, or distinguish a
bioterrorist attack from a normal outbreak of the flu based on the geographic spread of a disease
over time.

7. Text Database —

Text databases are databases that contain word descriptions of objects. These word descriptions
are usually not simple keywords but rather long sentences or paragraphs, such as product
specifications, error or bug reports, warning messages, summary reports, notes, or other
documents. Text databases may be highly unstructured, such as some Web pages on the WWW.
Some text databases may be somewhat structured, that is, semi-structured, whereas others are
relatively well structured. Text databases with highly regular structures typically can be
implemented using relational database systems.

8. Multimedia Database —

Multimedia databases store image, audio, and video data. They are used in applications such as
picture content-based retrieval, voice-mail systems, video-on-demand systems, and speech-based
user interfaces that recognize spoken commands. Multimedia databases must support
large objects, because data objects such as video can require gigabytes of storage. Specialized storage
and search techniques are also required. Because video and audio data require real-time retrieval at a
steady and predetermined rate in order to avoid picture or sound gaps and system buffer
overflows, such data are referred to as continuous-media data.

9. Heterogeneous Database —

A heterogeneous database consists of a set of interconnected, autonomous component databases.

The components communicate in order to exchange information and answer queries. Objects in
one component database may differ greatly from objects in another component database, making
it difficult to assimilate their semantics into the overall heterogeneous database.
10. Legacy Database-

Many enterprises acquire legacy databases as a result of the long history of information
technology development (including the application of different hardware and operating systems).
A legacy database is a group of heterogeneous databases that combines different kinds of data
systems, such as relational or object-oriented databases, hierarchical databases, network
databases, spreadsheets, multimedia databases, or file systems. The heterogeneous databases in a
legacy database may be connected by intra- or inter-computer networks.

Tasks and Functionalities of Data Mining


Data mining tasks are designed to be semi-automatic or fully automatic and are run on large data sets to
uncover patterns such as groups or clusters, unusual records (anomaly detection), and dependencies
such as associations and sequential patterns. Once patterns are uncovered, they can be thought of
as a summary of the input data, and further analysis may be carried out using machine learning
and predictive analytics. For example, the data mining step might help identify multiple groups
in the data that a decision support system can use. Note that data collection, preparation, and
reporting are not part of data mining.

There is a lot of confusion between data mining and data analysis. Data mining functions are
used to define the trends or correlations contained in data mining activities. While data analysis
is used to test statistical models that fit the dataset, for example, the analysis of a marketing
campaign, data mining uses machine learning and mathematical and statistical models to
discover patterns hidden in the data. Data mining activities can be divided into
two categories:

 Descriptive Data Mining: It uses existing data to understand what is happening within
the data, without any prior hypothesis. The common features of the data set are highlighted, for
example, count, average, etc.
 Predictive Data Mining: It helps developers provide definitions for unlabeled attributes. With
previously available or historical data, data mining can be used to make predictions about critical
business metrics based on trends in the data. For example, predicting the volume of business next
quarter based on performance in the previous quarters over several years, or judging from the
findings of a patient's medical examinations whether he is suffering from a particular disease.

Functionalities of Data Mining

Data mining functionalities are used to represent the type of patterns that have to be discovered
in data mining tasks. Data mining tasks can be classified into two types: descriptive and
predictive. Descriptive mining tasks define the common features of the data in the database, and
the predictive mining tasks act in inference on the current information to develop predictions.

Data mining is extensively used in many areas or sectors. It is used to predict and characterize
data. But the ultimate objective in Data Mining Functionalities is to observe the various trends
in data mining. There are several data mining functionalities that the organized and scientific
methods offer, such as:

1. Class/Concept Descriptions

A class or concept implies there is a data set or set of features that define the class or a concept.
A class can be a category of items on a shop floor, and a concept could be the abstract idea on
which data may be categorized like products to be put on clearance sale and non-sale products.
There are two concepts here, one that helps with grouping and the other that helps in
differentiating.

 Data Characterization: This refers to the summary of general characteristics or features of the
class, resulting in specific rules that define a target class. A data analysis technique called
attribute-oriented induction is employed on the data set for achieving characterization.
 Data Discrimination: Discrimination is used to separate distinct data sets based on the disparity
in attribute values. It compares features of a class with features of one or more contrasting
classes. The output can be presented in the form of, e.g., bar charts, curves, and pie charts
(a small sketch of both operations follows this list).
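As a minimal Python sketch of these two ideas (with entirely hypothetical product records), characterization summarizes the target class on its own, while discrimination compares it against the contrasting class:

    import pandas as pd

    # Hypothetical product records with a class label (illustrative data only).
    products = pd.DataFrame({
        "on_clearance": [True, True, False, False, True],
        "price": [9.99, 14.50, 120.0, 80.0, 5.25],
        "stock_age_days": [210, 365, 30, 45, 400],
    })

    # Characterization: summarize general features of the target class
    # (clearance products), e.g. average price and stock age.
    target = products[products["on_clearance"]]
    print(target[["price", "stock_age_days"]].mean())

    # Discrimination: compare the target class with the contrasting class.
    print(products.groupby("on_clearance")[["price", "stock_age_days"]].mean())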

2. Mining Frequent Patterns

One of the functions of data mining is finding data patterns. Frequent patterns are things that are
discovered to be most common in data. Various types of frequency can be found in the dataset.

 Frequent item set: This term refers to a group of items that are commonly found together, such
as milk and sugar.
 Frequent substructure: It refers to the various types of data structures, such as trees and graphs,
that can be combined with an item set or subsequences.
 Frequent subsequence: A regular pattern series, such as buying a phone followed by a cover.

3. Association Analysis
It analyses the set of items that generally occur together in a transactional dataset. It is also
known as Market Basket Analysis because of its wide use in retail sales. Two parameters are used for
determining the association rules (see the sketch after this list):

 Support, which identifies the common item sets in the database.

 Confidence, which is the conditional probability that an item occurs in a transaction when
another item occurs.
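To make support and confidence concrete, here is a minimal pure-Python sketch over a small, made-up set of transactions; the items and values are assumptions chosen only for illustration.

    # Hypothetical shopping transactions (illustrative data only).
    transactions = [
        {"milk", "sugar", "bread"},
        {"milk", "sugar"},
        {"bread", "butter"},
        {"milk", "sugar", "butter"},
    ]
    n = len(transactions)

    # Support of an item set = fraction of transactions containing it.
    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Confidence of the rule A -> B = support(A union B) / support(A).
    def confidence(antecedent, consequent):
        return support(antecedent | consequent) / support(antecedent)

    print(support({"milk", "sugar"}))        # 0.75: milk and sugar co-occur in 3 of 4 baskets
    print(confidence({"milk"}, {"sugar"}))   # 1.0: every basket with milk also has sugar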

4. Classification

Classification is a data mining technique that categorizes items in a collection based on some
predefined properties. It uses methods like if-then rules, decision trees, or neural networks to predict a
class, essentially classifying a collection of items. A training set containing items whose
class is known is used to train the system to predict the category of items in an
unknown collection.
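A minimal classification sketch, assuming scikit-learn is available, trains a decision tree on a labeled training set and then predicts the class of new items; the features (income, age) and class labels below are invented for illustration only.

    from sklearn.tree import DecisionTreeClassifier

    # Hypothetical training set: [income, age] -> credit-risk class (illustrative only).
    X_train = [[25000, 22], [48000, 35], [90000, 45], [30000, 28], [120000, 50]]
    y_train = ["high", "medium", "low", "high", "low"]

    # Train the model on items whose class is already known.
    clf = DecisionTreeClassifier(random_state=0)
    clf.fit(X_train, y_train)

    # Predict the class of previously unseen items.
    print(clf.predict([[40000, 30], [100000, 48]]))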

5. Prediction

It refers to predicting some unavailable data values or pending trends. An object can be anticipated
based on the attribute values of the object and the attribute values of the classes. It can be a
prediction of missing numerical values or of increasing or decreasing trends in time-related information.
There are primarily two types of predictions in data mining: numeric and class predictions.

 Numeric predictions are made by creating a linear regression model that is based on historical
data. Prediction of numeric values helps businesses ramp up for a future event that might impact
the business positively or negatively (a small sketch of such a model follows this list).
 Class predictions are used to fill in missing class information for products using a training data
set where the class for products is known.
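For example, a minimal numeric-prediction sketch might fit a linear regression model to historical quarterly figures with NumPy and use it to predict the next quarter; the numbers are made up for illustration.

    import numpy as np

    # Hypothetical historical sales volume per quarter (illustrative data only).
    quarters = np.array([1, 2, 3, 4, 5, 6, 7, 8])
    sales = np.array([100, 110, 118, 130, 138, 150, 158, 170])

    # Fit a simple linear regression model (degree-1 polynomial) to the history.
    slope, intercept = np.polyfit(quarters, sales, deg=1)

    # Predict the currently unavailable value for the next quarter.
    next_quarter = 9
    print(slope * next_quarter + intercept)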

6. Cluster Analysis

In image processing, pattern recognition and bioinformatics, clustering is a popular data mining
functionality. It is similar to classification, but the classes are not predefined. Data attributes
represent the classes. Similar data are grouped together, with the difference being that a class
label is not known. Clustering algorithms group data based on similar features and
dissimilarities.
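As a small illustration (assuming scikit-learn), the sketch below groups unlabeled, invented customer records into two clusters purely by feature similarity, with no class labels supplied.

    from sklearn.cluster import KMeans

    # Hypothetical unlabeled customer records: [annual spend, visits per month].
    X = [[200, 1], [220, 2], [210, 1],      # low-spend, infrequent visitors
         [1500, 8], [1600, 9], [1550, 10]]  # high-spend, frequent visitors

    # Group the records into 2 clusters based on similarity of their features;
    # no class labels are provided to the algorithm.
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(kmeans.labels_)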

7. Outlier Analysis

Outlier analysis is important for understanding the quality of data. If there are too many outliers, you
cannot trust the data or the patterns drawn from it. An outlier analysis determines whether there is
something out of the ordinary in the data and whether it indicates a situation that the business needs
to consider and take measures to mitigate. Data that cannot be grouped into any class by the
algorithms is pulled up as outliers.
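A minimal outlier-analysis sketch using a simple z-score rule is shown below; the transaction amounts and the threshold of 2 standard deviations are assumptions chosen only for illustration.

    import numpy as np

    # Hypothetical daily transaction amounts containing one suspicious value.
    amounts = np.array([52.0, 48.5, 50.2, 49.8, 51.1, 500.0, 50.6])

    # Flag values lying more than 2 standard deviations from the mean as outliers.
    z_scores = (amounts - amounts.mean()) / amounts.std()
    outliers = amounts[np.abs(z_scores) > 2]
    print(outliers)   # the 500.0 transaction is flagged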

8. Evolution and Deviation Analysis


Evolution Analysis pertains to the study of data sets that change over time. Evolution analysis
models are designed to capture evolutionary trends in data helping to characterize, classify,
cluster or discriminate time-related data.

9. Correlation Analysis

Correlation is a mathematical technique for determining whether and how strongly two attributes
are related to one another. It determines how well two numerically measured continuous variables
are linked. Researchers can use this type of analysis to see whether there are any possible
correlations between the variables in their study.
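For instance, a minimal sketch computing the Pearson correlation coefficient between two numerically measured attributes; the paired values are invented for the example.

    import numpy as np

    # Hypothetical paired measurements: advertising spend vs. units sold.
    ad_spend = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
    units_sold = np.array([105.0, 190.0, 310.0, 395.0, 520.0])

    # Pearson correlation coefficient: a value close to +1 indicates a strong
    # positive linear relationship between the two attributes.
    r = np.corrcoef(ad_spend, units_sold)[0, 1]
    print(round(r, 3))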

(OPTIONAL)
Data mining functionalities are used to represent the type of patterns that have to be discovered
in data mining tasks. In general, data mining tasks can be classified into two types:
descriptive and predictive. Descriptive mining tasks define the common features of the data in
the database, and the predictive mining tasks perform inference on the current information to develop
predictions.

There are various data mining functionalities which are as follows −

 Data characterization − It is a summarization of the general characteristics of an object


class of data. The data corresponding to the user-specified class is generally collected by
a database query. The output of data characterization can be presented in multiple forms.
 Data discrimination − It is a comparison of the general characteristics of target class
data objects with the general characteristics of objects from one or a set of contrasting
classes. The target and contrasting classes can be represented by the user, and the
equivalent data objects fetched through database queries.
 Association Analysis − It analyses the set of items that generally occur together in a
transactional dataset. There are two parameters that are used for determining the
association rules −
o Support, which identifies the common item sets in the database.
o Confidence is the conditional probability that an item occurs in a transaction when
another item occurs.
 Classification − Classification is the procedure of discovering a model that represents
and distinguishes data classes or concepts, with the objective of being able to use the
model to predict the class of objects whose class label is unknown. The derived model
is based on the analysis of a set of training data (i.e., data objects whose class label
is known).
 Prediction − It defines predict some unavailable data values or pending trends. An object
can be anticipated based on the attribute values of the object and attribute values of the
classes. It can be a prediction of missing numerical values or increase/decrease trends in
time-related information.
 Clustering − It is similar to classification but the classes are not predefined. The classes
are represented by data attributes. It is unsupervised learning. The objects are clustered or
grouped based on the principle of maximizing the intraclass similarity and minimizing
the interclass similarity.
 Outlier analysis − Outliers are data elements that cannot be grouped in a given class or
cluster. These are data objects whose behaviour differs from the general
behaviour of other data objects. The analysis of this type of data can be essential for mining
knowledge.
 Evolution analysis − It defines the trends for objects whose behaviour changes over
some time.

Classification of Data Mining Systems


Data mining refers to the process of extracting important information from raw data. It analyses the data
patterns in huge sets of data with the help of software. Ever since the development of
data mining, it has been incorporated by researchers in the research and development field.

With data mining, businesses are found to gain more profit. It has not only helped in
understanding customer demand but also in developing effective strategies to improve overall
business turnover. It has helped in determining business objectives for making clear decisions.

Data collection, data warehousing, and computer processing are some of the strongest pillars
of data mining. Data mining utilizes the concept of mathematical algorithms to segment the data
and assess the possibility of occurrence of future events.

To understand the system and meet the desired requirements, data mining can be classified into
the following systems:
 Classification based on the mined Databases
 Classification based on the type of mined knowledge
 Classification based on statistics
 Classification based on Machine Learning
 Classification based on visualization
 Classification based on Information Science
 Classification based on utilized techniques
 Classification based on adapted applications

Classification Based on the mined Databases

A data mining system can be classified based on the types of databases that have been mined. A
database system can be further segmented based on distinct principles, such as data models,
types of data, etc., which further assist in classifying a data mining system.

For example, if we want to classify a database based on the data model, we need to select either
relational, transactional, object-relational or data warehouse mining systems.

Classification Based on the type of Knowledge Mined

A data mining system categorized based on the kind of knowledge mined may have the following
functionalities:
1. Characterization
2. Discrimination
3. Association and Correlation Analysis
4. Classification
5. Prediction
6. Outlier Analysis
7. Evolution Analysis

Classification Based on the Techniques Utilized

A data mining system can also be classified based on the type of techniques that are being
incorporated. These techniques can be assessed based on the involvement of user interaction
involved or the methods of analysis employed.

Classification Based on the Applications Adapted

Data mining systems classified based on the applications adapted are as follows:

1. Finance
2. Telecommunications
3. DNA
4. Stock Markets
5. E-mail

Examples of Classification Task

Following are some of the main examples of classification tasks:

 Classification helps in determining tumor cells as benign or malignant.


 Classification of credit card transactions as fraudulent or legitimate.
 Classification of secondary structures of protein as alpha-helix, beta-sheet, or random coil.
 Classification of news stories into distinct categories such as finance, weather, entertainment,
sports, etc.
Integration schemes of Database and Data warehouse systems

No Coupling

In the no coupling scheme, the data mining system does not use any database or data warehouse
system functions.

Loose Coupling

In loose coupling, data mining utilizes some of the database or data warehouse system
functionalities. It mainly fetches the data from the data repository managed by these systems and
then performs data mining. The results are kept either in the file or any designated place in the
database or data warehouse.

Semi-Tight Coupling

In semi-tight coupling, data mining is linked to either the DB or DW system and provides an
efficient implementation of data mining primitives within the database.

Tight Coupling

A data mining system can be effortlessly combined with a database or data warehouse system in
tight coupling.
Data Mining System Classification
A data mining system can be classified according to the following criteria −

 Database Technology
 Statistics
 Machine Learning
 Information Science
 Visualization
 Other Disciplines

Apart from these, a data mining system can also be classified based on the kind of (a) databases
mined, (b) knowledge mined, (c) techniques utilized, and (d) applications adapted.

Classification Based on the Databases Mined

We can classify a data mining system according to the kind of databases mined. Database system
can be classified according to different criteria such as data models, types of data, etc. And the
data mining system can be classified accordingly.

For example, if we classify a database according to the data model, then we may have a
relational, transactional, object-relational, or data warehouse mining system.

Classification Based on the kind of Knowledge Mined

We can classify a data mining system according to the kind of knowledge mined. It means the
data mining system is classified on the basis of functionalities such as −

 Characterization
 Discrimination
 Association and Correlation Analysis
 Classification
 Prediction
 Outlier Analysis
 Evolution Analysis

Classification Based on the Techniques Utilized

We can classify a data mining system according to the kind of techniques used. We can describe
these techniques according to the degree of user interaction involved or the methods of analysis
employed.

Classification Based on the Applications Adapted

We can classify a data mining system according to the applications adapted. These applications
are as follows −

 Finance
 Telecommunications
 DNA
 Stock Markets
 E-mail

Integrating a Data Mining System with a DB/DW System

If a data mining system is not integrated with a database or a data warehouse system, then there
will be no system to communicate with. This scheme is known as the non-coupling scheme. In
this scheme, the main focus is on data mining design and on developing efficient and effective
algorithms for mining the available data sets.

The list of Integration Schemes is as follows −

 No Coupling − In this scheme, the data mining system does not utilize any of the
database or data warehouse functions. It fetches the data from a particular source and
processes that data using some data mining algorithms. The data mining result is stored in
another file.
 Loose Coupling − In this scheme, the data mining system may use some of the functions
of the database and data warehouse system. It fetches the data from the data repository
managed by these systems and performs data mining on that data. It then stores the
mining result either in a file or in a designated place in a database or in a data warehouse.
 Semi−tight Coupling − In this scheme, the data mining system is linked with a database
or a data warehouse system and in addition to that, efficient implementations of a few
data mining primitives can be provided in the database.
 Tight coupling − In this coupling scheme, the data mining system is smoothly integrated
into the database or data warehouse system. The data mining subsystem is treated as one
functional component of an information system.
Data Mining - Issues
Data mining is not an easy task, as the algorithms used can get very complex and data is not
always available at one place. It needs to be integrated from various heterogeneous data sources.
These factors also create some issues. Here in this tutorial, we will discuss the major issues
regarding −

 Mining Methodology and User Interaction


 Performance Issues
 Diverse Data Types Issues


Mining Methodology and User Interaction Issues

It refers to the following kinds of issues −

 Mining different kinds of knowledge in databases − Different users may be interested


in different kinds of knowledge. Therefore it is necessary for data mining to cover a
broad range of knowledge discovery tasks.
 Interactive mining of knowledge at multiple levels of abstraction − The data mining
process needs to be interactive because it allows users to focus the search for patterns,
providing and refining data mining requests based on the returned results.
 Incorporation of background knowledge − To guide the discovery process and to express
the discovered patterns, background knowledge can be used. Background knowledge
may be used to express the discovered patterns not only in concise terms but at multiple
levels of abstraction.
 Data mining query languages and ad hoc data mining − Data Mining Query language
that allows the user to describe ad hoc mining tasks, should be integrated with a data
warehouse query language and optimized for efficient and flexible data mining.
 Presentation and visualization of data mining results − Once the patterns are
discovered, they need to be expressed in high-level languages and visual representations.
These representations should be easily understandable.
 Handling noisy or incomplete data − The data cleaning methods are required to handle
the noise and incomplete objects while mining the data regularities. If the data cleaning
methods are not there then the accuracy of the discovered patterns will be poor.
 Pattern evaluation − The patterns discovered may be uninteresting if they represent
common knowledge or lack novelty, so measures are needed to evaluate the interestingness
of the discovered patterns.

Performance Issues

There can be performance-related issues such as follows −

 Efficiency and scalability of data mining algorithms − In order to effectively extract


the information from huge amounts of data in databases, data mining algorithms must be
efficient and scalable.
 Parallel, distributed, and incremental mining algorithms − Factors such as the huge
size of databases, the wide distribution of data, and the complexity of data mining methods
motivate the development of parallel and distributed data mining algorithms. These
algorithms divide the data into partitions, which are further processed in a parallel fashion.
Then the results from the partitions are merged. Incremental algorithms update
databases without mining the data again from scratch.

Diverse Data Types Issues

 Handling of relational and complex types of data − The database may contain complex
data objects, multimedia data objects, spatial data, temporal data, etc. It is not possible for
one system to mine all these kinds of data.
 Mining information from heterogeneous databases and global information systems
− The data is available at different data sources on a LAN or WAN. These data sources may
be structured, semi-structured, or unstructured. Therefore, mining the knowledge from
them adds challenges to data mining.
Challenges of Implementation in Data mining
Although data mining is very powerful, it faces many challenges during its execution. Various
challenges could be related to performance, data, methods, and techniques, etc. The process of
data mining becomes effective when the challenges or problems are correctly recognized and
adequately resolved.

Incomplete and noisy data:

The process of extracting useful information from large volumes of data is data mining. Data in the
real world is heterogeneous, incomplete, and noisy. Data in huge quantities will usually be
inaccurate or unreliable. These problems may occur due to errors of the data-measuring instruments or
because of human errors. Suppose a retail chain collects the phone numbers of customers who spend more
than $500, and the accounting employees put the information into their system. The person may
make a digit mistake when entering the phone number, which results in incorrect data. Even
some customers may not be willing to disclose their phone numbers, which results in incomplete
data. The data could even get changed due to human or system error. All these consequences (noisy
and incomplete data) make data mining challenging.
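Continuing the phone-number example above, the small sketch below shows one way such noisy and incomplete entries might be detected; the ten-digit format rule and the records are assumptions made only for illustration.

    import re

    # Hypothetical customer records: some phone numbers are mistyped or missing.
    records = [
        {"name": "Customer A", "phone": "9876543210"},
        {"name": "Customer B", "phone": "98765432"},   # digit missing (noisy)
        {"name": "Customer C", "phone": None},         # not disclosed (incomplete)
    ]

    # Assume a valid phone number is exactly 10 digits.
    valid = re.compile(r"^\d{10}$")

    for rec in records:
        phone = rec["phone"]
        if phone is None:
            print(rec["name"], "-> incomplete record")
        elif not valid.match(phone):
            print(rec["name"], "-> noisy / invalid phone number")
        else:
            print(rec["name"], "-> ok")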

Data Distribution:

Real-world data is usually stored on various platforms in a distributed computing environment.

It might be in a database, in individual systems, or even on the internet. Practically, it is quite a
tough task to bring all the data into a centralized data repository, mainly due to organizational and
technical concerns. For example, various regional offices may have their own servers to store their
data. It is not feasible to store all the data from all the offices on a central server. Therefore, data
mining requires the development of tools and algorithms that allow the mining of distributed
data.

Complex Data:

Real-world data is heterogeneous, and it could be multimedia data, including audio and video,
images, complex data, spatial data, time series, and so on. Managing these various types of data
and extracting useful information is a tough task. Most of the time, new technologies, new tools,
and methodologies would have to be refined to obtain specific information.

Performance:

The data mining system's performance relies primarily on the efficiency of algorithms and
techniques used. If the designed algorithm and techniques are not up to the mark, then the
efficiency of the data mining process will be affected adversely.

Data Privacy and Security:

Data mining usually leads to serious issues in terms of data security, governance, and privacy.
For example, if a retailer analyzes the details of the purchased items, then it reveals data about
buying habits and preferences of the customers without their permission.

Data Visualization:

In data mining, data visualization is a very important process because it is the primary method
for showing the output to the user in a presentable way. The extracted data should convey the
exact meaning of what it intends to express. But many times, representing the information to the
end user in a precise and easy way is difficult. Since the input data and the output information are
complicated, very efficient and successful data visualization processes need to be implemented
to make it successful.
