
UNIT- 1: Data Science

1.1 Introduction to Data Science and Data Mining


1.2 Uses of Data Mining
1.3 Data Mining Techniques Overviews
1.3.1 Artificial Neural Networks
1.3.2 Decision Trees
1.3.3 Genetic Algorithms
1.3.4 Nearest Neighbour Method
1.3.5 Rule Induction
1.4 Data Mining Process
1.5 Data Mining Applications
1.1 Introduction to Data Science

● As the world entered the era of big data, the need for data storage also grew. Until 2010, this was the main
challenge and concern for enterprise industries, and the focus was on building frameworks and solutions to
store data. Now the focus has shifted to processing this data.
What is Data Science?
● Data Science is the future of Artificial Intelligence. Data Science is a blend of various tools, algorithms,
and machine learning principles whose goal is to discover hidden patterns in big or large datasets (the
data may be raw). The term Data Science has emerged because of the evolution of
mathematical statistics, data analysis, and big data.

● Data Science is an interdisciplinary field that allows you to extract knowledge from structured or
unstructured data.
● Data science enables you to translate a business problem into a research project and then translate it
back into a practical solution.
Why Data Science?
● Traditionally, the data that we had was mostly structured and small in size, and it could be analyzed by
using simple BI tools. Unlike data in traditional systems, which was mostly structured, today most of
the data is unstructured or semi-structured.
● Data trends show that by 2020, more than 80% of data would be unstructured.
● This data is generated from different sources like financial logs, text files, multimedia forms, sensors,
and instruments.
● Simple BI tools are not capable of processing this huge volume and variety of data. This is why we need
more complex and advanced analytical tools and algorithms for processing, analyzing and drawing
meaningful insights out of it.

Features     | Business Intelligence (BI)                       | Data Science
Data Sources | Structured (usually SQL, often a Data Warehouse) | Both structured and unstructured (logs, cloud data, SQL, NoSQL, sensor data, satellite data, text)
Approach     | Statistics and Visualization                     | Statistics, Data Mining, Machine Learning, Graph Analysis, Natural Language Processing (NLP)
Focus        | Past and Present                                 | Present and Future
Tools        | Pentaho, Microsoft BI, QlikView etc.             | RapidMiner, BigML, Weka etc.

Life Cycle of Data Science


Data science’s lifecycle consists of six distinct stages, each with its own tasks. As the data science process
stages help in converting raw data into monetary gains and overall profits, any data scientist should be
well aware of the process and its significance.
1) Framing the Problem: Whenever we are trying to solve a Data Science problem, we must first
understand the scope and depth of the problem that we are trying to solve. If we make a mistake in
this step, then we end up solving a problem that we did not need to solve, and we end up spending
a lot of time and resources on a project that will not yield the desired effect.
2) Collecting Data: After defining the problem, you will need to collect the requisite data to derive
insights and turn the business problem into a probable solution. The process involves thinking
through your data and finding ways to collect and get the data you need. It can include scanning
your internal databases or purchasing databases from external sources.
3) Processing the Data: After the first and second steps, when you have all the data you need, you will
have to process it before analyzing it further. Data can be messy if it has not been appropriately
maintained, leading to errors that easily corrupt the analysis. These issues can be values set to null
when they should be zero (or the exact opposite), missing values, duplicate values, and many more. You
will have to go through the data and check it for such problems to get more accurate insights (a small
cleaning sketch appears after this list).
4) Exploring the Data: In this step, you will have to develop ideas that can help identify hidden
patterns and insights. You will have to find more interesting patterns in the data, such as why sales
of a particular product or service have gone up or down. You must analyze or notice this kind of data
more thoroughly. This is one of the most crucial steps in a data science process.
5) Analyzing the Data: This step will test your mathematical, statistical, and technological knowledge.
You must use all the data science tools to crunch the data successfully and discover every insight you
can. You might have to prepare a predictive model that can compare your average customer with
those who are underperforming. You might find several reasons in your analysis, like age or social
media activity, as crucial factors in predicting the consumers of a service or product.
6) Communicating Results: After all these steps, it is vital to convey your insights and findings to the
sales head and make them understand their importance. It will help if you communicate
appropriately to solve the problem you have been given. Proper communication will lead to action,
whereas improper communication may lead to inaction.
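To make step 3 (processing the data) concrete, here is a minimal cleaning sketch in Python with pandas. It assumes a hypothetical file sales.csv with columns age and amount; the actual checks always depend on the data at hand.

```python
import pandas as pd

# Load the raw data (hypothetical file and column names).
df = pd.read_csv("sales.csv")

df = df.drop_duplicates()                          # remove duplicate records
df["age"] = df["age"].fillna(df["age"].median())   # fill missing ages with the median
df = df[df["amount"] >= 0]                         # drop rows with impossible negative amounts

print(df.describe())                               # quick sanity check of the cleaned data
```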

Tools Used for Data Science:

Data Analyst Vs Data Scientist


✔ A data scientist understands the data from a business point of view. His work is to give the most accurate
predictions. A data scientist fosters decision-making in the company: based on the predictions, a data
scientist contributes to calculated, data-driven business decisions.
In artificial intelligence and machine learning, the data scientist has a great role to play, so for a data
scientist, knowledge of machine learning is a must. Machine learning is one of the most impressive
technologies in use today.
A data scientist needs to be well versed with machine learning algorithms and must be able to assess
situations in order to apply these algorithms. Finally, a data scientist must know the in-depth
working of an algorithm in order to apply it.

✔ Data Analytics is the basic level of data science. Data Analysts usually deal with static data and perform
descriptive analysis as well as inferential analysis. They are responsible for testing and rejecting models
and hypotheses.
1.1: Introduction to Data Mining
What is data mining?
● Data Mining plays a vital role in organizing and analysing data.
● Data mining is the process of extracting hidden and interesting patterns and rules from large
databases.
● Data Mining is also known as “Knowledge Mining”, because it is used to mine knowledge from massive
databases. Knowledge in the form of patterns and rules is obtained through data mining, which is why
data mining is popularly used in almost all kinds of application areas. Data mining uses different
techniques, such as association rule mining, clustering, and classification, to find meaningful and useful
data and rules.
● Data Mining is also known as Knowledge Discovery in Databases (KDD), which refers to mining knowledge from
large amounts of data. Data Mining carries different meanings, such as knowledge extraction and data or
pattern analysis. Mining can be done on advanced database systems such as time-series data, spatial
databases, multimedia databases, the WWW, text databases, medical databases, criminal databases, or
other application-oriented databases. Data mining functions include clustering, classification,
prediction, and link analysis (associations).
● Effective data mining aids in various aspects of planning business strategies and managing operations.
That includes customer-facing functions such as marketing, advertising, sales and customer support, plus
manufacturing, supply chain management, finance and HR. Data mining supports fraud detection, risk
management, cyber security planning and many other critical business use cases. It also plays an
important role in healthcare, government, scientific research, mathematics, sports and more.
Difference between Data Science and Data Mining
● Data science is a broad field that includes the processes of capturing data, analyzing it, and deriving
insights from it. On the other hand, data mining is mainly about finding useful information in a dataset
and utilizing that information to uncover hidden patterns.
● Another major difference between data science and data mining is that the former is a multidisciplinary
field that consists of statistics, social sciences, data visualizations, natural language processing, data
mining etc while the latter is a subset of the former.
● The role of a data science professional can be considered as a combination of an AI researcher, a deep
learning engineer, a machine learning engineer, or a data analyst, to some extent. The person might be
able to perform the role of a data engineer as well. On the contrary, a data mining professional doesn’t
necessarily have to be able to perform all these roles.
● Another notable difference between data science and data mining lies in the type of data used by these
professionals. Usually, data science deals with every type of data whether structured, semi-structured,
or unstructured. On the other hand, data mining mostly deals with structured data.

1.2 Uses of Data Mining

Data mining is used in various fields like research, business, marketing, sales, product development,
education, and healthcare.
● Data mining techniques are used by Air France: trip searches, bookings, social media, flight operations,
call centers, and interactions in the airport lounge are analyzed, and a 360-degree customer view is
created.
● Grocery stores use data mining by giving loyalty cards to customers that make it easy for the
cardholders to avail of special prices that are not made available to non-cardholders.
● The above are a few examples of data mining helping companies increase efficiency, streamline
operations, reduce costs, and improve profits.
● When used appropriately, data mining provides a significant advantage over competitors by providing
more information about customers and helping to develop better and more effective marketing
strategies, which raise revenue and lower costs.
● In order to achieve excellent results from data mining, a number of tools and techniques are
required.
Advantages of Data Mining:-
✔ The Data Mining technique enables organizations to obtain knowledge-based data.
✔ Data mining enables organizations to make profitable/productive modifications in operation and
production.
✔ Compared with other statistical data applications, data mining is cost-efficient.
✔ Data Mining helps the decision-making process of an organization.
✔ It facilitates the automated discovery of hidden patterns as well as the prediction of trends and
behaviours.
✔ It can be introduced into new systems as well as existing platforms.
✔ It is a quick process that makes it easy for new users to analyze huge amounts of data in a short
time.

Disadvantages of Data Mining


✔ There is a probability that the organizations may sell useful data of customers to other
organizations for money. As per the report, American Express has sold credit card purchases of
their customers to other organizations.
✔ Many data mining analytics software packages are difficult to operate and need advanced training to
work with.
✔ Different data mining instruments operate in distinct ways due to the different algorithms used in
their design. Therefore, the selection of the right data mining tools is a very challenging task.
✔ The data mining techniques are not always precise, so they may lead to severe consequences in
certain conditions.

1.4 Data Mining Process/Life Cycle


The data mining process is divided into two parts i.e. Data Preprocessing and Data Mining. Data
Preprocessing involves data cleaning, data integration, data reduction, and data transformation. The data
mining part performs data mining, pattern evaluation and knowledge representation of data.
There are six main steps for knowledge discovery in databases, as described below.

1) Data cleaning and Data integration:


Data cleaning is the first step in data mining. It is important because dirty data, if used directly in
mining, can cause confusion in procedures and produce inaccurate results. Basically, this step involves
the removal of noisy or incomplete data from the collection. Many methods that clean data
automatically are available, but they are not robust.

This step carries out the routine cleaning work by:


(i) Fill in the Missing Data:
Missing data can be filled by methods such as:
● Ignoring the tuple.
● Filling in the missing value manually.
● Using a measure of central tendency, such as the mean or median.
● Filling in the most probable value.
(ii) Remove the Noisy Data: Noisy data is data containing random error.
Methods to remove noise are (a short illustrative sketch follows this list):
Binning: Binning methods are applied by sorting values into buckets or bins, and smoothing is
performed by consulting the neighbouring values.
Smoothing by bin means replaces each value in a bin with the mean of the bin; smoothing by bin
medians replaces each bin value with the bin median; and smoothing by bin boundaries replaces each
bin value with the closest of the bin's minimum and maximum values (the bin boundaries).
● Identifying the Outliers
● Resolving Inconsistencies
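The sketch below illustrates two of these cleaning steps in Python: filling a missing value with the median and smoothing by bin means. The price list and the choice of four bins are made up for illustration.

```python
import numpy as np
import pandas as pd

# A small illustrative list of sorted prices.
prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34], dtype=float)

# Fill a missing value with a measure of central tendency (here the median).
with_gap = prices.copy()
with_gap.iloc[3] = np.nan
filled = with_gap.fillna(with_gap.median())

# Smoothing by bin means: sort values into equal-frequency bins and
# replace every value in a bin with the mean of its bin.
bins = np.array_split(np.sort(prices.to_numpy()), 4)     # 4 bins of 3 values each
smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])

print(filled.tolist())
print(smoothed)
```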
Data integration: When multiple heterogeneous data sources such as databases, data cubes, or files are
combined for analysis, the process is called data integration. This can help improve the accuracy and
speed of the data mining process.
Different databases use different naming conventions for variables, which causes redundancies in the
integrated data. Additional data cleaning can be performed to remove these redundancies and
inconsistencies without affecting the reliability of the data.
Data integration can be performed using data migration tools such as Oracle Data Service Integrator,
Microsoft SQL, etc.
2) Data Warehouse and Data Reduction: Summary information from the data collected from multiple
sources, needed for decision-making, is stored in a data warehouse.
Data reduction is then applied to obtain relevant data for analysis from the collection. The size of the
reduced representation is much smaller in volume while maintaining integrity. Data reduction is
performed using methods such as Naive Bayes, decision trees, neural networks, etc.
Some strategies of data reduction are (a brief sketch follows this list):

● Dimensionality Reduction: Reducing the number of attributes in the dataset.


● Numerosity Reduction: Replacing the original data volume by smaller forms of data representation.
● Data Compression: Compressed representation of the original data.
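As one possible illustration of dimensionality reduction (the section above does not prescribe a specific method), the sketch below uses Principal Component Analysis from scikit-learn to shrink the well-known iris data set from four attributes to two.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)        # 150 samples, 4 attributes each
pca = PCA(n_components=2)                # keep only 2 derived attributes
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)    # (150, 4) -> (150, 2)
print(pca.explained_variance_ratio_)     # how much variance each component preserves
```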
3) Data Selection and Transformation: In this process, data is transformed into a form suitable for the data
mining process. Data is consolidated/combined so that the mining process is more efficient and the
patterns are easier to understand. Data transformation involves data mapping and the code generation
process.
Strategies for data transformation are (a short sketch follows this list):
● Smoothing: Removing noise from data using clustering, regression techniques, etc.
● Aggregation: Summary operations are applied to data.
● Normalization: Scaling of data to fall within a smaller range.
● Discretization: Raw values of numeric data are replaced by intervals. For Example, Age.
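A small sketch of the normalization and discretization strategies in Python; the ages and the interval boundaries are made up for illustration.

```python
import pandas as pd

ages = pd.Series([15, 22, 37, 45, 58, 63, 70])

# Normalization: min-max scaling so the values fall within [0, 1].
normalized = (ages - ages.min()) / (ages.max() - ages.min())

# Discretization: replace raw numeric ages with interval labels.
groups = pd.cut(ages, bins=[0, 18, 40, 60, 100],
                labels=["child", "young", "middle-aged", "senior"])

print(pd.DataFrame({"age": ages, "normalized": normalized, "group": groups}))
```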
4) Data Mining: Data mining is the process of identifying interesting patterns and knowledge in a large
amount of data. In this step, intelligent methods are applied to extract data patterns. The data is
represented in the form of patterns, and models are structured using classification and clustering
techniques. Patterns are what data mining produces as its result.
5) Pattern Evaluation & Presentation: This step involves identifying interesting patterns representing the
knowledge based on interestingness measures. Data summarization and visualization methods are
used to make the data understandable by the user.
Presentation: Knowledge representation is a step where data visualization and knowledge
representation tools are used to represent the mined data. Data is visualized in the form of reports,
tables, etc.

Types of Data Mining Analysis

Data mining analysis can be divided into two basic parts. They are:

1. Predictive Data Mining Analysis


2. Descriptive Data Mining Analysis

1) Predictive Data Mining Analysis:-


As the name signifies, predictive data mining analysis works on data that may help to project what
may happen later in the business. Predictive data mining tasks can be further divided into four types.
✔ Classification Analysis: It is used to fetch important and relevant information about data and
metadata. It classifies data into the various categories it belongs to. Email providers are a good example of
classification analysis: they use algorithms that can classify mail as legitimate or mark it as spam.
✔ Regression Analysis: It tries to state the dependency between variables. It is generally used for
forecasting and prediction.
✔ Time Series Analysis: It works on a sequence of well-defined data points measured at consistent time
intervals.

✔ Prediction Analysis: It is related to time series analysis, but the time is not bound.

2) Descriptive Data Mining Tasks:-


Its purpose is to summarize or turn data into relevant information. Descriptive data mining tasks can be
further divided into four types (a small association-counting sketch follows this list).
✔ Clustering Analysis: It is the process of identifying data sets that are similar to one another. For example,
clusters of customers with similar buying behaviour can be targeted with similar products to
increase the conversion rate.
✔ Summarization Analysis: It involves techniques for finding a compact description of a dataset.
✔ Association Rule Learning: This method helps in identifying interesting relations between different
variables in large databases. The best example is the retail industry: as a festive season approaches,
retail stores stock up on chocolates, whose sales increase before festival time, and data mining helps
identify such patterns.
✔ Sequence Discovery Analysis: It is about finding a sequence of activity. For example, in a store a
user may often buy shaving gel before a razor. It is about the sequence in which the user buys
products, and based on that the store owner can arrange the items.
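The small sketch below counts how often pairs of items appear together in a handful of made-up transactions, which is the core idea behind association rule learning; a full algorithm such as Apriori would also compute confidence and prune itemsets, which is omitted here.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"chocolate", "bread"},
    {"chocolate", "milk", "bread"},
    {"butter", "milk"},
]

# Count how often each pair of items is bought together (its support).
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

n = len(transactions)
for pair, count in pair_counts.most_common(3):
    print(pair, "support =", round(count / n, 2))
```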

1.3 Data Mining Techniques Overviews

1. Artificial Neural Network Architecture:-


The term "Artificial neural network" refers to a biologically inspired sub-field of artificial intelligence modelled
after the brain. ANN works on unsupervised learning. An Artificial neural network is usually a computational
network based on biological neural networks that construct the structure of the human brain. Similar to a human
brain has neurons interconnected to each other, artificial neural networks also have neurons that are linked to
each other in various layers of the networks. These neurons are known as nodes.
[Figure: diagram of Biological Neural Network]

[Figure: diagram of Artificial Neural Network]

Relationship between Biological neural network and artificial neural network: Dendrites from Biological Neural
Network represent inputs in Artificial Neural Networks, cell nucleus represents Nodes, synapse represents
Weights, and Axon represents Output.

Artificial neural networks (ANNs) use learning algorithms that can independently make adjustments - or learn, in
a sense - as they receive new input. This makes them a very effective tool for non-linear statistical data
modelling.

Neural networks have been applied in diverse fields including aerospace, banking, defence, electronics,
entertainment, financial, insurance, manufacturing, medical, oil and gas, speech, securities, telecommunications,
transportation, and environment.

The Architecture of an Artificial Neural Network:


To understand the architecture of an artificial neural network, we have to understand what a
neural network consists of. A neural network consists of a large number of artificial
neurons, termed units, arranged in a sequence of layers.
[Figure: Basic Structure of Artificial Neural Network]

Input Layer: As the name suggests, it accepts inputs in several different formats provided by the programmer.

Hidden Layer: The hidden layer presents in-between input and output layers. It performs all the calculations to
find hidden features and patterns.

Output Layer: The input goes through a series of transformations using the hidden layer, which finally results in
output that is conveyed using this layer.

The artificial neural network takes the inputs, computes the weighted sum of the inputs, and includes a bias. This
computation is represented in the form of a transfer function.

The weighted total is then passed as input to an activation function to produce the output. Activation
functions decide whether a node should fire or not; only the nodes that fire contribute to the output layer. There
are distinct activation functions available that can be applied depending on the sort of task we are performing.
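The sketch below works through the computation just described for a single node: a weighted sum of the inputs plus a bias, passed through an activation function. The weights, bias, and the choice of a sigmoid activation are illustrative only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))     # one common choice of activation function

x = np.array([0.5, 0.2, 0.8])           # inputs arriving from the input layer
w = np.array([0.4, -0.6, 0.9])          # one weight per input (illustrative values)
b = 0.1                                 # bias term

weighted_sum = np.dot(w, x) + b         # weighted sum of inputs plus bias
output = sigmoid(weighted_sum)          # activation decides how strongly the node "fires"
print(weighted_sum, output)
```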

Advantages of Artificial Neural Network (ANN):-


✔ Parallel processing capability: Artificial neural networks can perform more than one task
simultaneously.

✔ Storing data on the entire network: The information used is stored across the whole network, not in a
single database. The disappearance of a couple of pieces of data in one place doesn't prevent the
network from working.
✔ Capability to work with incomplete knowledge: After training, an ANN may produce output
even with inadequate data. The loss of performance here depends on the significance of the missing data.

✔ Having a memory distribution: For an ANN to be able to adapt, it is important to determine the examples and
to train the network according to the desired output by demonstrating these examples to the network.
The success of the network is directly proportional to the chosen instances, and if the event can't be shown
to the network in all its aspects, it can produce false output.

✔ Having fault tolerance: Corruption of one or more cells of an ANN does not prevent it from generating output,
and this feature makes the network fault-tolerant.

Disadvantages of Artificial Neural Network

✔ Assurance of proper network structure: There is no particular guideline for determining the structure of
artificial neural networks. The appropriate network structure is found through experience and trial and
error.

✔ Unrecognized behaviour of the network: This is the most significant issue with ANNs. When an ANN produces a
solution, it does not provide insight concerning why and how, which decreases trust in the network.

✔ Hardware dependence: Artificial neural networks need processors with parallel processing power, as per
their structure. Therefore, their realization depends on suitable hardware.

✔ Difficulty of showing the issue to the network: ANNs can work only with numerical data, so problems must be
converted into numerical values before being introduced to the ANN. The representation chosen here will
directly impact the performance of the network, and it relies on the user's abilities.

✔ The duration of training is unknown: The network is trained until the error is reduced to a specific value, and this
value does not guarantee optimum results.

Different Neural Network Method in Data Mining


The neural network model can be broadly divided into the following three types:

● Feed-Forward Neural Networks: In a feed-forward network, output values cannot be traced back to
the input values; for every input node an output is calculated, so there is a forward flow of
information and no feedback between the layers. In simple words, the information moves in only one direction
(forward): from the input nodes, through the hidden nodes (if any), to the output nodes. Such a type of
network is known as a feed-forward network.

✔ Input Layer: As the name suggests, it accepts inputs in several different formats provided by the programmer.
✔ Hidden Layer: The hidden layer presents in-between input and output layers. It performs all the calculations
to find hidden features and patterns.
✔ Output Layer: The input goes through a series of transformations using the hidden layer, which finally results
in output that is conveyed using this layer.
● Feedback Neural Network: Signals can travel in both directions in a feedback network. Feedback neural networks
are very powerful and can become very complex. Feedback networks are dynamic: the "states" in such a
network are constantly changing until an equilibrium point is reached. They stay at equilibrium until the input
changes and a new equilibrium needs to be found. Feedback neural network architectures are also known as
interactive or recurrent. Feedback loops are allowed in such networks. They are used for content-addressable
memory.

● Self-Organizing Neural Network: A Self-Organizing Neural Network (SONN) is a type of artificial neural
network that is trained using competitive learning rather than the error-correction learning (e.g., backpropagation
with gradient descent) used by other artificial neural networks. A Self-Organizing Neural Network (SONN) is an
unsupervised learning model in artificial neural networks, also termed a Self-Organizing Feature Map or Kohonen
Map. It is used to produce a low-dimensional (typically two-dimensional) representation of a
higher-dimensional data set while preserving the topological structure of the data.

Why use Neural Network Method in Data Mining?


Neural networks help in mining large amounts of data in various sectors such as retail, banking (fraud
detection), bioinformatics (genome sequencing), etc. Finding useful hidden information in large data sets
is very challenging and also very necessary. Data mining uses neural networks to harvest
information from the large datasets held by data warehousing organizations, which helps the user in
decision-making.
2. Decision Trees:-
Decision Tree is a supervised learning method used in data mining for classification and regression tasks. It is a
tree that helps us in decision-making. It separates a data set into smaller subsets, and at the same time,
the decision tree is steadily developed. The final tree is a tree with decision nodes and leaf nodes. A decision
node has at least two branches. The leaf nodes show a classification or decision; we can't split leaf nodes
any further. The uppermost decision node in a tree, which relates to the best predictor, is called the root node.
Decision trees can deal with both categorical and numerical data.

The following decision tree is for the concept "buys computer"; it indicates whether a customer at a company is likely
to buy a computer or not. Each internal node represents a test on an attribute. Each leaf node represents a class.
Why are decision trees useful?
✔ It enables us to analyse the possible consequences of a decision thoroughly.
✔ It provides us a framework to measure the values of outcomes and the probability of accomplishing them.
✔ It helps us to make the best decisions based on existing data and best speculations.
✔ In other words, we can say that a decision tree is a hierarchical tree structure that can be used to split an
extensive collection of records into smaller sets of classes by implementing a sequence of simple decision
rules. A decision tree model comprises a set of rules for partitioning a huge heterogeneous population into
smaller, more homogeneous, or mutually exclusive classes.
✔ The attributes of the classes can be any variables with nominal, ordinal, binary, or quantitative values; in
contrast, the classes must be of a qualitative type, such as categorical, ordinal, or binary. In brief, given data
of attributes together with its class, a decision tree creates a set of rules that can be used to identify the class.
One rule is implemented after another, resulting in a hierarchy of segments within a segment. The hierarchy is
known as the tree, and each segment is called a node. With each progressive division, the members of the
subsequent sets become more and more similar to each other. Hence, the algorithm used to build a decision
tree is referred to as recursive partitioning. The algorithm is known as CART (Classification and Regression
Trees). A small illustrative sketch appears after the list of benefits below.
The benefits of having a decision tree are as follows:
✔ A decision tree does not need scaling of information.
✔ Missing values in data also do not influence the process of building a decision tree to any considerable extent.
✔ A decision tree model is automatic and simple to explain to the technical team as well as stakeholders.
✔ Compared to other algorithms, decision trees need less exertion for data preparation during pre-processing.
✔ A decision tree does not require a standardization of data.
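A brief sketch of inducing a CART-style decision tree with scikit-learn. The iris data set stands in for the "buys computer" example, which is not reproduced in this document; the printed text shows the induced hierarchy of if/else splits.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0)   # CART-style classifier
tree.fit(X, y)

print(export_text(tree))                       # the induced if/else rules, one per split
print(tree.predict([[5.0, 3.4, 1.5, 0.2]]))    # classify a new, unseen record
```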
3. Genetic Algorithms:-
Genetic Algorithm (GA) is a search-based optimization technique based on the principles of Genetics and Natural
Selection. It is frequently used to find optimal or near-optimal solutions to difficult problems which otherwise would
take a lifetime to solve. It is frequently used to solve optimization problems, in research, and in machine learning.

Introduction to Optimization

Optimization is the process of making something better. In any process, we have a set of inputs and a set of outputs
as shown in the following figure.

Optimization refers to finding the values of inputs in such a way that we get the “best” output values. The
definition of “best” varies from problem to problem, but in mathematical terms, it refers to maximizing or
minimizing one or more objective functions, by varying the input parameters.

The set of all possible solutions or values which the inputs can take makes up the search space. In this search
space lies a point or a set of points which gives the optimal solution. The aim of optimization is to find that point
or set of points in the search space.

What are Genetic Algorithms?


Nature has always been a great source of inspiration to all mankind. Genetic Algorithms are search based
algorithms based on the concepts of natural selection and genetics. Genetic Algorithms are a subset of a much
larger branch of computation known as Evolutionary Computation.

Genetic Algorithms were developed by John Holland and his students and colleagues at the University of
Michigan, most notably David E. Goldberg, and have since been tried on various optimization problems with a high
degree of success.

In GAs, we have a pool or a population of possible solutions to the given problem. These solutions then undergo
recombination and mutation (like in natural genetics), producing new children, and the process is repeated over
various generations. Each individual (or candidate solution) is assigned a fitness value (based on its objective
function value) and the fitter individuals are given a higher chance to mate and yield more “fitter” individuals.
This is in line with the Darwinian Theory of Survival of the “Fittest”.

In this way we keep “evolving” better individuals or solutions over generations, till we reach a stopping criterion.

Genetic Algorithms are sufficiently randomized in nature, but they perform much better than random local
search (in which we just try various random solutions, keeping track of the best so far), as they exploit historical
information as well.
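The sketch below runs the loop just described: a population of candidate solutions, fitness-based selection, recombination, and mutation over many generations. The objective f(x) = x·sin(x) on [0, 10], averaging as recombination, and Gaussian noise as mutation are arbitrary illustrative choices.

```python
import math
import random

def fitness(x):
    return x * math.sin(x)                    # the objective to maximize (illustrative)

population = [random.uniform(0, 10) for _ in range(20)]   # initial pool of solutions

for generation in range(50):
    # Selection: the fitter half of the population survives and may mate.
    population.sort(key=fitness, reverse=True)
    parents = population[:10]

    # Recombination and mutation: children blend two parents plus random noise.
    children = []
    while len(children) < 10:
        a, b = random.sample(parents, 2)
        child = (a + b) / 2 + random.gauss(0, 0.3)
        children.append(min(max(child, 0.0), 10.0))        # stay inside the search space
    population = parents + children

best = max(population, key=fitness)
print(round(best, 3), round(fitness(best), 3))
```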
Advantages of Genetic Algorithms

GAs have various advantages which have made them immensely popular. These include:

✔ Does not require any derivative information (which may not be available for many real-world
problems).
✔ Is faster and more efficient as compared to the traditional methods.
✔ Has very good parallel capabilities.
✔ Optimizes both continuous and discrete functions and also multi-objective problems.
✔ Provides a list of “good” solutions and not just a single solution.
✔ Always gets an answer to the problem, which gets better over the time.
✔ Useful when the search space is very large and there are a large number of parameters involved.

Limitations of Genetic Algorithms


Like any technique, GAs also suffer from a few limitations. These include:

✔ GAs are not suited for all problems, especially problems which are simple and for which derivative
information is available.
✔ Fitness value is calculated repeatedly which might be computationally expensive for some problems.
✔ Being stochastic, there are no guarantees on the optimality or the quality of the solution.
✔ If not implemented properly, the GA may not converge to the optimal solution.

4. KNN Algorithm – Finding Nearest Neighbours:-

✔ K-Nearest Neighbour is one of the simplest Machine Learning algorithms based on Supervised Learning
technique.
✔ The K-NN algorithm assumes similarity between the new case/data and the available cases and puts the new case into
the category that is most similar to the available categories.
✔ The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means
that when new data appears, it can be easily classified into a well-suited category by using the K-NN algorithm.
✔ K-NN algorithm can be used for Regression as well as for Classification but mostly it is used for the Classification
problems.
✔ K-NN is a non-parametric algorithm, which means it does not make any assumption on underlying data.
✔ It is also called a lazy learner algorithm because it does not learn from the training set immediately instead it
stores the dataset and at the time of classification, it performs an action on the dataset.
✔ The KNN algorithm at the training phase just stores the dataset, and when it gets new data, it classifies that data
into the category that is most similar to the new data.
Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, but we want to know whether it is
a cat or a dog. For this identification, we can use the KNN algorithm, as it works on a similarity measure. Our KNN
model will find the features of the new image that are similar to the cat and dog images, and based on the most similar
features it will put the image in either the cat or the dog category.

Why do we need a K-NN Algorithm?


Suppose there are two categories, i.e., Category A and Category B, and we have a new data point x1; we need
to determine which of these categories this data point lies in. To solve this type of problem, we need a K-NN
algorithm. With the help of K-NN, we can easily identify the category or class of a particular data point. Consider the
below diagram:

How does K-NN work?


The K-NN working can be explained on the basis of the below algorithm:
Step-1: Select the number K of the neighbours
Step-2: Calculate the Euclidean distance of K number of neighbours
Step-3: Take the K nearest neighbours as per the calculated Euclidean distance.
Step-4: Among these k neighbours, count the number of the data points in each category.
Step-5: Assign the new data points to that category for which the number of the neighbour is maximum.
Step-6: Our model is ready.

Suppose we have a new data point and we need to put it in the required category. Consider the below
image:

✔ Firstly, we will choose the number of neighbours, so we will choose k = 5.
✔ Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is the
distance between two points, which we have already studied in geometry. For two points (x1, y1) and
(x2, y2) it can be calculated as: d = sqrt((x2 - x1)^2 + (y2 - y1)^2)

✔ By calculating the Euclidean distance we get the nearest neighbours: three nearest neighbours in
category A and two nearest neighbours in category B. Consider the below image:
As we can see, the 3 nearest neighbours are from category A; hence this new data point must belong to category A.
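The following sketch walks through the same steps on made-up two-dimensional points: compute Euclidean distances, take the k nearest neighbours, count categories, and assign the majority category. With k = 5, the majority of neighbours here fall in category A, matching the walkthrough above.

```python
import math
from collections import Counter

# Hypothetical labelled points: (x, y, category).
data = [(1.0, 1.2, "A"), (1.5, 1.8, "A"), (2.0, 1.0, "A"),
        (5.0, 5.2, "B"), (5.5, 4.8, "B"), (6.0, 5.5, "B")]

def knn_classify(new_point, k=5):
    # Steps 2-3: compute Euclidean distances and keep the k nearest neighbours.
    nearest = sorted(data, key=lambda p: math.dist(new_point, p[:2]))[:k]
    # Steps 4-5: count the categories among the neighbours and pick the majority.
    votes = Counter(label for _, _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_classify((2.0, 2.0)))   # expected to fall in category "A"
```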

Advantages of KNN Algorithm:


✔ It is simple to implement.
✔ It is robust to the noisy training data
✔ It can be more effective if the training data is large.

Disadvantages of KNN Algorithm:


✔ Always needs to determine the value of K, which may sometimes be complex.
✔ The computation cost is high because of calculating the distance between the data points for all the training
samples.

5. Rule Induction:-
✔ Rule induction is a data mining process of deducing if-then rules from a data set. These symbolic decision rules
explain an inherent relationship between the attributes and class labels in the data set.
✔ Many real-life experiences are based on intuitive rule induction. For example, we can proclaim a rule that states
"if it is 8 a.m. on a weekday, then highway traffic will be heavy" and "if it is 8 p.m. on a Sunday, then the traffic
will be light." These rules are not necessarily right all the time; 8 a.m. weekday traffic may be light during a
holiday season. But, in general, these rules hold true and are deduced from real-life experience based on our
everyday observations. Rule induction provides a powerful classification approach.

IF-THEN Rules

A rule-based classifier makes use of a set of IF-THEN rules for classification. We can express a rule in the following
form:
IF condition THEN conclusion
Let us consider a rule R1 of this form (a small sketch of such a classifier follows the list below).
Points to remember:
● The IF part of the rule is called the rule antecedent or precondition.
● The THEN part of the rule is called the rule consequent.
● The antecedent part (the condition) consists of one or more attribute tests, and these tests are logically
ANDed.
● The consequent part consists of the class prediction.
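A minimal sketch of such a rule-based classifier in Python. The rule set, attribute names, and the example rule in the style of R1 (IF age = "youth" AND student = "yes" THEN buys_computer = "yes") are hypothetical, chosen only to illustrate ANDed antecedents and a class-prediction consequent.

```python
# Each rule is (antecedent, consequent): a dict of attribute tests and a class label.
rules = [
    ({"age": "youth", "student": "yes"}, "yes"),     # hypothetical rule R1
    ({"age": "senior", "credit": "fair"}, "no"),     # hypothetical rule R2
]

def classify(record, default="unknown"):
    for antecedent, consequent in rules:
        # The attribute tests in the antecedent are logically ANDed.
        if all(record.get(attr) == value for attr, value in antecedent.items()):
            return consequent                        # the consequent is the class prediction
    return default

print(classify({"age": "youth", "student": "yes", "credit": "fair"}))   # -> "yes"
```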

1.5 Data Mining Applications:-


There are many measurable benefits that have been achieved in different application areas from data mining. So,
let’s discuss different applications of Data Mining:

Scientific Analysis: Scientific simulations are generating bulks of data every day. This includes data collected
from nuclear laboratories, data about human psychology, etc. Data mining techniques are capable of analysing
this data. We can now capture and store new data faster than we can analyse the old data already
accumulated. Examples of scientific analysis:

● Sequence analysis in bioinformatics


● Classification of astronomical objects
● Medical decision support.

Intrusion Detection: A network intrusion refers to any unauthorized activity on a digital network. Network
intrusions often involve stealing valuable network resources. Data mining techniques play a vital role in
intrusion detection and in finding network attacks and anomalies. These techniques help in selecting and refining
useful and relevant information from large data sets, and help classify relevant data for an Intrusion
Detection System. An Intrusion Detection System generates alarms about foreign invasions in
the network traffic. For example:

● Detect security violations


● Misuse Detection
● Anomaly Detection

Business Transactions: Every business transaction is recorded for perpetuity. Such transactions are usually
time-related and can be inter-business deals or intra-business operations. The effective and timely use of this data
for competitive decision-making is definitely the most important problem to solve for
businesses that struggle to survive in a highly competitive world. Data mining helps to analyze these business
transactions, identify marketing approaches, and support decision-making. Examples:

● Direct mail targeting


● Stock trading
● Customer segmentation
● Churn prediction (Churn prediction is one of the most popular Big Data use cases in business)

Market Basket Analysis: Market Basket Analysis is a technique based on the careful study of purchases made by
a customer in a supermarket. It identifies patterns of items frequently purchased together by customers. This
analysis can help companies promote deals, offers, and sales, and data mining techniques help achieve this
analysis task. Example:

● Data mining concepts are in use for Sales and marketing to provide better customer service, to improve
cross-selling opportunities, to increase direct mail response rates.
● Customer Retention in the form of pattern identification and prediction of likely defections is possible by Data
mining.
● Risk Assessment and Fraud area also use the data-mining concept for identifying inappropriate or unusual
behavior etc.

Education: For analyzing the education sector, data mining uses the Educational Data Mining (EDM) method. This
method generates patterns that can be used both by learners and educators. Using EDM, we can
perform educational tasks such as:

● Predicting students' admission into higher education
● Student profiling
● Predicting student performance
● Evaluating teachers' teaching performance
● Curriculum development
● Predicting student placement opportunities
Research: Data mining techniques can perform prediction, classification, clustering, association, and grouping
of data with precision in the research area. Rules generated by data mining are useful for finding results. In most
technical research in data mining, we create a training model and a testing model. The train/test approach is a
strategy to measure the precision of the proposed model: we split the data set into
two sets, a training data set and a testing data set. The training data set is used to build the model, whereas the
testing data set is used to evaluate it. Example:

● Classification of uncertain data.


● Information-based clustering.
● Decision support system
● Web Mining
● Domain-driven data mining
● IoT (Internet of Things)and Cybersecurity
● Smart farming IoT(Internet of Things)

Healthcare and Insurance: A pharmaceutical company can examine its recent sales force activity and its
outcomes to improve the targeting of high-value physicians and figure out which marketing activities will have the
best effect in the upcoming months. In the insurance sector, data mining can help to predict which
customers will buy new policies, identify behaviour patterns of risky customers, and identify fraudulent behaviour
of customers.

● Claims analysis i.e which medical procedures are claimed together.


● Identify successful medical therapies for different illnesses.
● Characterizes patient behavior to predict office visits.
Transportation: A diversified transportation company with a large direct sales force can apply data mining to
identify the best prospects for its services. A large consumer merchandise organization can apply data
mining to improve its sales cycle to retailers.

● Determine the distribution schedules among outlets.


● Analyze loading patterns.
Financial/Banking Sector: A credit card company can leverage its vast warehouse of customer transaction data
to identify customers most likely to be interested in a new credit product.

● Credit card fraud detection.


● Identify ‘Loyal’ customers.
● Extraction of information related to customers.
● Determine credit card spending by customer groups.
