Introduction of Data Science.docx
● As the world entered the era of big data, the need for its storage also grew. This was the main challenge and concern for enterprise industries until 2010, when the main focus was on building frameworks and solutions to store data. Now the focus has shifted to processing this data.
What is Data Science?
● Data Science is the future of Artificial Intelligence. Data Science is a blend of various tools, algorithms, and machine learning principles whose goal is to discover hidden patterns in big or large datasets (the data may be raw). The term Data Science has emerged because of the evolution of mathematical statistics, data analysis, and big data.
● Data Science is an interdisciplinary field that allows you to extract knowledge from structured or
unstructured data.
● Data science enables you to translate a business problem into a research project and then translate it
back into a practical solution.
Why Data Science?
● Traditionally, the data that we had was mostly structured and small in size, which could be analyzed by
using simple BI tools. Unlike data in the traditional systems which was mostly structured, today most of
the data is unstructured or semi-structured.
● Data trends indicated that by 2020, more than 80% of all data would be unstructured.
● This data is generated from different sources like financial logs, text files, multimedia forms, sensors,
and instruments.
● Simple BI tools are not capable of processing this huge volume and variety of data. This is why we need
more complex and advanced analytical tools and algorithms for processing, analyzing and drawing
meaningful insights out of it.
✔ Data Analytics is the basic level of data science. Data Analysts usually deal with static data and perform
descriptive analysis as well as inferential analysis. They are responsible for testing and rejecting models
and hypotheses.
1.1: Introduction to Data Mining
What is data mining?
● Data Mining plays a vital role in organizing and analysing data.
● Data mining is the process of extracting hidden and interesting patterns and rules from large databases.
● Data Mining is also known as “Knowledge Mining”, since it is used to mine knowledge from massive databases. The knowledge analysed by data mining takes the form of patterns and rules, which is why data mining is popularly used in almost all kinds of application areas. Data mining offers different techniques, such as association rule mining, clustering, and classification, to find meaningful and useful data and rules.
● Data Mining is also known as Knowledge Discovery from Data (KDD), which refers to mining knowledge from large amounts of data. Data Mining carries different meanings, such as knowledge extraction and data or pattern analysis. Mining can be done on advanced database systems such as time-series data, spatial databases, multimedia databases, the WWW, text databases, medical databases, criminal databases, or other application-oriented databases. Data mining functions include clustering, classification, prediction, and link analysis (associations).
● Effective data mining aids in various aspects of planning business strategies and managing operations.
That includes customer-facing functions such as marketing, advertising, sales and customer support, plus
manufacturing, supply chain management, finance and HR. Data mining supports fraud detection, risk
management, cyber security planning and many other critical business use cases. It also plays an
important role in healthcare, government, scientific research, mathematics, sports and more.
Difference between Data Science and Data Mining
● Data science is a broad field that includes the processes of capturing data, analyzing it, and deriving insights from it. On the other hand, data mining is mainly about finding useful information in a dataset and utilizing that information to uncover hidden patterns.
● Another major difference between data science and data mining is that the former is a multidisciplinary field that consists of statistics, social sciences, data visualization, natural language processing, data mining, etc., while the latter is a subset of the former.
● The role of a data science professional can be considered as a combination of an AI researcher, a deep
learning engineer, a machine learning engineer, or a data analyst, to some extent. The person might be
able to perform the role of a data engineer as well. On the contrary, a data mining professional doesn’t
necessarily have to be able to perform all these roles.
● Another notable difference between data science and data mining lies in the type of data used by these
professionals. Usually, data science deals with every type of data whether structured, semi-structured,
or unstructured. On the other hand, data mining mostly deals with structured data.
Data mining is used in various fields like research, business, marketing, sales, product development,
education, and healthcare.
● Data mining techniques are used by Air France.
● Trip searches, bookings, social media, flight operations, call centers, and interactions in the airport
lounge are analyzed and a 360-degree customer view is created.
● Grocery stores use data mining by giving loyalty cards to customers that make it easy for the
cardholders to avail of special prices that are not made available to non-cardholders.
● The above are a few examples of data mining helping companies to increase efficiency, streamline operations, reduce costs, and improve profits.
● When used appropriately, data mining provides a significant advantage over competing establishments by providing more information about customers and helping to develop better and more effective marketing strategies, which raise revenue and lower costs.
● In order to achieve excellent results from data mining, a number of tools and techniques are
required.
Advantages of Data Mining:-
✔ The Data Mining technique enables organizations to obtain knowledge-based data.
✔ Data mining enables organizations to make profitable/productive modifications in operation and
production.
✔ Compared with other statistical data applications, data mining is cost-efficient.
✔ Data Mining helps the decision-making process of an organization.
✔ It facilitates the automated discovery of hidden patterns as well as the prediction of trends and
behaviours.
✔ It can be introduced in new systems as well as existing platforms.
✔ It is a quick process that makes it easy for new users to analyze huge amounts of data in a short
time.
The Data Mining Analysis can be divided into two basic parts. They are:
✔ Prediction Analysis: It is related to time-series data, but the time period is not bound.
Relationship between a biological neural network and an artificial neural network: dendrites in the biological neural network correspond to the inputs of an artificial neural network, the cell nucleus corresponds to the nodes, synapses correspond to the weights, and the axon corresponds to the output.
Artificial neural networks (ANNs) use learning algorithms that can independently make adjustments - or learn, in
a sense - as they receive new input. This makes them a very effective tool for non-linear statistical data
modelling.
Neural networks have been applied in diverse fields including aerospace, banking, defence, electronics,
entertainment, financial, insurance, manufacturing, medical, oil and gas, speech, securities, telecommunications,
transportation, and environment.
Input Layer: As the name suggests, it accepts inputs in several different formats provided by the programmer.
Hidden Layer: The hidden layer lies between the input and output layers. It performs all the calculations to find hidden features and patterns.
Output Layer: The input goes through a series of transformations using the hidden layer, which finally results in
output that is conveyed using this layer.
The artificial neural network takes the inputs, computes their weighted sum, and adds a bias. This computation is represented in the form of a transfer function. The weighted total is then passed as an input to an activation function to produce the output. Activation functions decide whether a node should fire or not; only the nodes that fire contribute to the output layer. There are different activation functions available, and the choice depends on the type of task being performed.
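As an illustration of this computation, the following minimal Python sketch computes the weighted sum plus bias for a single node and applies two common activation functions. The weights, bias, and threshold are made-up values, not taken from the text.

import numpy as np

# Minimal sketch of a single artificial neuron: weighted sum of the inputs
# plus a bias, followed by an activation function. Illustrative values only.
inputs = np.array([0.8, 0.2, 0.6])
weights = np.array([0.4, -0.7, 0.9])
bias = 0.5

weighted_sum = np.dot(inputs, weights) + bias    # the transfer function

def step_activation(z, threshold=0.0):
    # Decide whether the node "fires" (1) or not (0).
    return 1 if z >= threshold else 0

def sigmoid_activation(z):
    # A smoother alternative activation.
    return 1.0 / (1.0 + np.exp(-z))

print(weighted_sum, step_activation(weighted_sum), sigmoid_activation(weighted_sum))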
Advantages of Artificial Neural Networks:-
✔ Storing data on the entire network: Unlike traditional programming, where data is stored in a database, the data here is stored on the whole network, so the disappearance of a few pieces of data in one place does not prevent the network from working.
✔ Capability to work with incomplete knowledge: After training, an ANN may produce output even with inadequate data; the loss of performance depends on the importance of the missing data.
✔ Having a distributed memory: For an ANN to be able to adapt, it is important to determine the examples and to train the network according to the desired output by showing these examples to the network. The success of the network is directly proportional to the chosen instances; if the event cannot be shown to the network in all its aspects, the network can produce false output.
✔ Having fault tolerance: Corruption of one or more cells of an ANN does not prevent it from generating output, and this feature makes the network fault-tolerant.
Disadvantages of Artificial Neural Networks:-
✔ Assurance of proper network structure: There is no particular guideline for determining the structure of artificial neural networks; the appropriate network structure is found through experience and trial and error.
✔ Unrecognized behaviour of the network: This is the most significant issue with ANNs. When an ANN produces a solution, it does not explain why or how it did so, which decreases trust in the network.
✔ Hardware dependence: Artificial neural networks require processors with parallel processing power, in accordance with their structure; the realization of the network is therefore dependent on suitable equipment.
✔ Difficulty of showing the issue to the network: ANNs can work only with numerical data. Problems must be converted into numerical values before being introduced to the ANN. The representation chosen here directly impacts the performance of the network and depends on the user's abilities.
✔ The duration of the network is unknown: The network is trained down to a specific value of the error, and this value does not guarantee optimum results.
● Feed-Forward Neural Networks: In a feed-forward network, output values cannot be traced back to the input values; for every input an output is calculated, there is a forward flow of information, and there is no feedback between the layers. In simple words, the information moves in only one direction (forward) from the input nodes, through the hidden nodes (if any), to the output nodes. Such a network is known as a feed-forward network (see the short sketch after the layer list below).
✔ Input Layer: As the name suggests, it accepts inputs in several different formats provided by the programmer.
✔ Hidden Layer: The hidden layer lies between the input and output layers. It performs all the calculations to find hidden features and patterns.
✔ Output Layer: The input goes through a series of transformations using the hidden layer, which finally results
in output that is conveyed using this layer.
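As a concrete illustration of this one-directional flow, the following minimal Python sketch (using NumPy) pushes a single input through one hidden layer to an output. The weights, biases, and input values are made-up numbers used only for illustration.

import numpy as np

# Minimal sketch of one forward pass through a feed-forward network with a
# single hidden layer; information flows input -> hidden -> output only.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])                  # input layer: 3 features

w_hidden = np.array([[0.2, -0.4, 0.1],          # hidden layer: 2 nodes
                     [0.7,  0.3, -0.6]])
b_hidden = np.array([0.1, -0.2])
hidden = sigmoid(w_hidden @ x + b_hidden)       # hidden activations

w_out = np.array([[1.5, -2.0]])                 # output layer: 1 node
b_out = np.array([0.05])
output = sigmoid(w_out @ hidden + b_out)

print(output)    # the prediction emerges purely from forward computation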
● Feedback Neural Network: Signals can travel in both directions in a feedback network. Feedback neural networks are very powerful and can become very complex. Feedback networks are dynamic: the “states” in such a network are constantly changing until an equilibrium point is reached. They stay at equilibrium until the input changes and a new equilibrium needs to be found. Feedback neural network architectures are also known as interactive or recurrent. Feedback loops are allowed in such networks, and they are used for content-addressable memory.
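To make the idea of a feedback network settling into an equilibrium concrete, below is a minimal Python sketch of a Hopfield-style recurrent network used as content-addressable memory. The patterns, sizes, and update steps are illustrative assumptions, not taken from the text.

import numpy as np

# Minimal sketch in the spirit of a Hopfield network: store patterns with
# Hebbian weights, then iterate the state until it reaches an equilibrium.
def train(patterns):
    # Build a symmetric weight matrix from the stored patterns.
    n = patterns.shape[1]
    w = np.zeros((n, n))
    for p in patterns:
        w += np.outer(p, p)
    np.fill_diagonal(w, 0)          # no self-feedback
    return w / patterns.shape[0]

def recall(w, state, steps=20):
    # Update the state until it settles into an equilibrium point.
    for _ in range(steps):
        new_state = np.sign(w @ state)
        new_state[new_state == 0] = 1
        if np.array_equal(new_state, state):   # equilibrium reached
            break
        state = new_state
    return state

patterns = np.array([[1, 1, 1, 1, -1, -1, -1, -1],
                     [1, -1, 1, -1, 1, -1, 1, -1]])
w = train(patterns)
noisy = np.array([-1, 1, 1, 1, -1, -1, -1, -1])   # corrupted copy of pattern 0
print(recall(w, noisy))                           # recovers the stored pattern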
● Self-Organization Neural Network: A Self-Organizing Neural Network (SONN) is a type of artificial neural network that is trained using competitive learning rather than the error-correction learning (e.g., back propagation with gradient descent) used by other artificial neural networks. A Self-Organizing Neural Network is an unsupervised learning model in Artificial Neural Networks, also termed Self-Organizing Feature Maps or Kohonen Maps. It is used to produce a low-dimensional (typically two-dimensional) representation of a higher-dimensional data set while preserving the topological structure of the data.
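Below is a minimal Python sketch of the competitive learning idea behind a Kohonen map: each input is matched to its closest unit, and that unit and its grid neighbours are pulled toward the input. The map size, learning-rate and radius schedules, and the random data are illustrative assumptions.

import numpy as np

# Minimal sketch of competitive learning in a Self-Organizing (Kohonen) Map:
# a tiny 5x2 map trained on random 3-D points.
rng = np.random.default_rng(0)
data = rng.random((200, 3))                      # 200 samples, 3 features
grid_h, grid_w, dim = 5, 2, 3
weights = rng.random((grid_h, grid_w, dim))

# Grid coordinates of every neuron, used by the neighbourhood function.
coords = np.array([[i, j] for i in range(grid_h) for j in range(grid_w)])
coords = coords.reshape(grid_h, grid_w, 2)

n_epochs = 20
for epoch in range(n_epochs):
    lr = 0.5 * (1 - epoch / n_epochs)              # decaying learning rate
    radius = 2.0 * (1 - epoch / n_epochs) + 0.5    # decaying neighbourhood radius
    for x in data:
        # Competitive step: find the Best Matching Unit (closest weight vector).
        dists = np.linalg.norm(weights - x, axis=2)
        bmu = np.unravel_index(np.argmin(dists), dists.shape)
        # Cooperative step: pull the BMU and its grid neighbours toward x.
        grid_dist = np.linalg.norm(coords - np.array(bmu), axis=2)
        influence = np.exp(-(grid_dist ** 2) / (2 * radius ** 2))
        weights += lr * influence[..., None] * (x - weights)

# Each input can now be summarised by the 2-D grid position of its BMU.
print(weights.round(2))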
A decision tree for the concept buys_computer, for example, indicates whether a customer at a company is likely to buy a computer or not. Each internal node represents a test on an attribute, and each leaf node represents a class.
Why are decision trees useful?
✔ It enables us to analyse the possible consequences of a decision thoroughly.
✔ It provides a framework to measure the values of outcomes and the probability of achieving them.
✔ It helps us to make the best decisions based on existing data and the best available estimates.
✔ In other words, we can say that a decision tree is a hierarchical tree structure that can be used to split an
extensive collection of records into smaller sets of the class by implementing a sequence of simple decision
rules. A decision tree model comprises a set of rules for partitioning a large heterogeneous population into smaller, more homogeneous, or mutually exclusive classes.
✔ The attributes can be any type of variable, from nominal, ordinal, and binary to quantitative values, whereas the classes must be of a qualitative type, such as categorical, ordinal, or binary. In brief, given the data of attributes together with its class, a decision tree creates a set of rules that can be used to identify the class. One rule is applied after another, resulting in a hierarchy of segments within a segment. The hierarchy is known as the tree, and each segment is called a node. With each progressive division, the members of the resulting sets become more and more similar to each other. Hence, the algorithm used to build a decision tree is referred to as recursive partitioning. The algorithm is known as CART (Classification and Regression Trees); a minimal sketch follows below.
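As an illustration of recursive partitioning, the following minimal Python sketch builds a CART-style decision tree with scikit-learn on a tiny, invented buys_computer-style table and prints the learned attribute tests. The records and attribute values are assumptions made only for this example.

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical records: the attributes are nominal, the class is categorical.
data = pd.DataFrame({
    "age":     ["youth", "youth", "middle", "senior", "senior", "middle"],
    "income":  ["high",  "high",  "high",   "medium", "low",    "low"],
    "student": ["no",    "no",    "no",     "no",     "yes",    "yes"],
    "buys_computer": ["no", "no", "yes", "yes", "yes", "yes"],
})

X = OrdinalEncoder().fit_transform(data.drop(columns="buys_computer"))
y = data["buys_computer"]

tree = DecisionTreeClassifier(criterion="gini", max_depth=3).fit(X, y)
# Each internal node is a test on an attribute; each leaf is a class.
print(export_text(tree, feature_names=["age", "income", "student"]))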
The benefits of having a decision tree are as follows:
✔ A decision tree does not need scaling of information.
✔ Missing values in data also do not influence the process of building a decision tree to any considerable extent.
✔ A decision tree model is intuitive and simple to explain to the technical team as well as to stakeholders.
✔ Compared to other algorithms, decision trees need less exertion for data preparation during pre-processing.
✔ A decision tree does not require a standardization of data.
3. Genetic Algorithms:-
Genetic Algorithm (GA) is a search-based optimization technique based on the principles of Genetics and Natural
Selection. It is frequently used to find optimal or near-optimal solutions to difficult problems which otherwise would
take a lifetime to solve. It is frequently used to solve optimization problems, in research, and in machine learning.
Introduction to Optimization
Optimization is the process of making something better. In any process, we have a set of inputs and a set of outputs.
Optimization refers to finding the values of inputs in such a way that we get the “best” output values. The
definition of “best” varies from problem to problem, but in mathematical terms, it refers to maximizing or
minimizing one or more objective functions, by varying the input parameters.
The set of all possible solutions or values which the inputs can take makes up the search space. In this search space lies a point, or a set of points, which gives the optimal solution. The aim of optimization is to find that point or set of points in the search space.
Genetic Algorithms were developed by John Holland and his students and colleagues at the University of Michigan, most notably David E. Goldberg, and have since been tried on various optimization problems with a high degree of success.
In GAs, we have a pool or a population of possible solutions to the given problem. These solutions then undergo
recombination and mutation (like in natural genetics), producing new children, and the process is repeated over
various generations. Each individual (or candidate solution) is assigned a fitness value (based on its objective
function value) and the fitter individuals are given a higher chance to mate and yield more “fitter” individuals.
This is in line with the Darwinian Theory of Survival of the “Fittest”.
In this way we keep “evolving” better individuals or solutions over generations, till we reach a stopping criterion.
Genetic Algorithms are sufficiently randomized in nature, but they perform much better than random local
search (in which we just try various random solutions, keeping track of the best so far), as they exploit historical
information as well.
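The following minimal Python sketch illustrates this evolutionary loop on a toy problem (maximising the number of 1s in a bit string): fitness evaluation, selection of fitter individuals, recombination, and mutation over several generations. The population size, rates, and fitness function are illustrative choices, not part of the text.

import random

# Minimal sketch of a genetic algorithm on the "OneMax" toy problem.
GENES, POP, GENERATIONS, MUTATION = 20, 30, 40, 0.02

def fitness(ind):                 # objective function value
    return sum(ind)

def select(pop):                  # tournament selection: fitter individuals mate more
    return max(random.sample(pop, 3), key=fitness)

def crossover(a, b):              # single-point recombination
    point = random.randrange(1, GENES)
    return a[:point] + b[point:]

def mutate(ind):                  # random bit flips
    return [1 - g if random.random() < MUTATION else g for g in ind]

population = [[random.randint(0, 1) for _ in range(GENES)] for _ in range(POP)]
for gen in range(GENERATIONS):
    population = [mutate(crossover(select(population), select(population)))
                  for _ in range(POP)]
    best = max(population, key=fitness)
print("best fitness:", fitness(best), best)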
Advantages of Genetic Algorithms
GAs have various advantages, which have made them immensely popular. These include:
✔ Does not require any derivative information (which may not be available for many real-world
problems).
✔ Is faster and more efficient as compared to the traditional methods.
✔ Has very good parallel capabilities.
✔ Optimizes both continuous and discrete functions and also multi-objective problems.
✔ Provides a list of “good” solutions and not just a single solution.
✔ Always gets an answer to the problem, which gets better over time.
✔ Useful when the search space is very large and there are a large number of parameters involved.
Limitations of Genetic Algorithms
✔ GAs are not suited for all problems, especially problems which are simple and for which derivative information is available.
✔ Fitness value is calculated repeatedly which might be computationally expensive for some problems.
✔ Being stochastic, there are no guarantees on the optimality or the quality of the solution.
✔ If not implemented properly, the GA may not converge to the optimal solution.
4. K-Nearest Neighbour (KNN) Algorithm:-
✔ K-Nearest Neighbour is one of the simplest Machine Learning algorithms, based on the Supervised Learning technique.
✔ K-NN algorithm assumes the similarity between the new case/data and the available cases and puts the new case into the category that is most similar to the available categories.
✔ K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can easily be classified into a well-suited category using the K-NN algorithm.
✔ K-NN algorithm can be used for Regression as well as for Classification but mostly it is used for the Classification
problems.
✔ K-NN is a non-parametric algorithm, which means it does not make any assumption on underlying data.
✔ It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset and, at the time of classification, performs an action on the dataset.
✔ The KNN algorithm, at the training phase, just stores the dataset, and when it gets new data, it classifies that data into the category that is most similar to the new data.
Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, and we want to know whether it is a cat or a dog. For this identification, we can use the KNN algorithm, as it works on a similarity measure. Our KNN model will find the features of the new image that are similar to those of the cat and dog images and, based on the most similar features, will put it in either the cat or the dog category.
Suppose we have a new data point and we need to put it in the required category. We can apply K-NN as follows:
✔ Firstly, we will choose the number of neighbours; here we choose k = 5.
✔ Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is the distance between two points, which we have already studied in geometry; between points (x1, y1) and (x2, y2) it is calculated as √((x2 − x1)² + (y2 − y1)²).
✔ By calculating the Euclidean distances we obtain the nearest neighbours: three nearest neighbours in category A and two nearest neighbours in category B.
As we can see, 3 of the 5 nearest neighbours are from category A; hence this new data point must belong to category A. A minimal sketch of these steps follows.
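The following minimal Python sketch follows these steps directly: it stores a small set of labelled points, computes Euclidean distances to a new point, takes the 5 nearest neighbours, and assigns the majority category. All points and labels are invented for illustration.

import math
from collections import Counter

# Minimal sketch of K-NN classification with k = 5.
training_data = [
    ((1.0, 1.2), "A"), ((1.5, 1.8), "A"), ((2.0, 1.0), "A"), ((1.2, 2.2), "A"),
    ((6.0, 6.5), "B"), ((6.5, 7.0), "B"), ((7.0, 6.0), "B"), ((5.8, 7.2), "B"),
]

def euclidean(p, q):
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

def knn_predict(new_point, data, k=5):
    # Lazy learner: all the work happens at classification time.
    neighbours = sorted(data, key=lambda item: euclidean(new_point, item[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

print(knn_predict((2.0, 2.0), training_data))   # expected: "A"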
5. Rule Induction:-
✔ Rule induction is a data mining process of deducing if-then rules from a data set. These symbolic decision rules
explain an inherent relationship between the attributes and class labels in the data set.
✔ Many real-life experiences are based on intuitive rule induction. For example, we can proclaim a rule that states
“if it is 8 a.m. on a weekday, then highway traffic will be heavy” and “if it is 8 p.m. on a Sunday, then the traffic
will be light.” These rules are not necessarily right all the time. 8 a.m. weekday traffic may be light during a
holiday season. But, in general, these rules hold true and are deduced from real-life experience based on our
everyday observations. Rule induction provides a powerful classification approach...
IF-THEN Rules
Rule-based classifier makes use of a set of IF-THEN rules for classification. We can express a rule in the following
form −
IF condition THEN conclusion
Let us consider a rule R1,
Points to remember −
● The IF part of the rule is called rule antecedent or precondition.
● The THEN part of the rule is called rule consequent.
● The antecedent part (the condition) consists of one or more attribute tests, and these tests are logically ANDed.
● The consequent part consists of class prediction.
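As a small illustration of how such IF-THEN rules can be applied, the following Python sketch encodes the traffic rules mentioned above and classifies records using the first rule whose antecedent holds. The rules and the default class are illustrative assumptions.

# Minimal sketch of a rule-based classifier with IF-THEN rules.
rules = [
    # (antecedent: attribute tests logically ANDed, consequent: class)
    (lambda r: r["hour"] == 8 and r["day_type"] == "weekday", "heavy traffic"),
    (lambda r: r["hour"] == 20 and r["day_type"] == "sunday", "light traffic"),
]

def classify(record, default="moderate traffic"):
    for antecedent, consequent in rules:
        if antecedent(record):        # the first rule whose IF part holds fires
            return consequent
    return default                    # no rule covers the record

print(classify({"hour": 8, "day_type": "weekday"}))   # heavy traffic
print(classify({"hour": 14, "day_type": "weekday"}))  # moderate traffic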
Scientific Analysis: Scientific simulations generate bulk data every day. This includes data collected from nuclear laboratories, data about human psychology, etc. Data mining techniques are capable of analysing these data. We can now capture and store new data faster than we can analyse the data already accumulated. Example of scientific analysis:
Intrusion Detection: A network intrusion refers to any unauthorized activity on a digital network. Network intrusions often involve stealing valuable network resources. Data mining techniques play a vital role in detecting intrusions, network attacks, and anomalies. These techniques help in selecting and refining useful and relevant information from large data sets. Data mining helps classify relevant data for an Intrusion Detection System, which generates alarms about foreign invasions in the network traffic. For example:
Business Transactions: Every business transaction is recorded for perpetuity. Such transactions are usually time-related and can be inter-business deals or intra-business operations. The effective and timely use of this data for competitive decision-making is the most important problem for businesses that struggle to survive in a highly competitive world. Data mining helps to analyze these business transactions, identify marketing approaches, and support decision-making. Example:
Market Basket Analysis: Market Basket Analysis is a technique based on the careful study of the purchases made by a customer in a supermarket. It identifies patterns of items frequently purchased together by customers. This analysis can help companies to promote deals, offers, and sales, and data mining techniques help to achieve this analysis task. Example:
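As a simple illustration of market basket analysis, the following Python sketch counts how often pairs of items appear together across some invented transactions and reports the most frequent pairs with their support. The baskets are made up for the example.

from collections import Counter
from itertools import combinations

# Minimal sketch of market basket analysis: co-occurrence counts of item pairs.
transactions = [
    {"bread", "milk", "butter"},
    {"bread", "milk"},
    {"milk", "diapers", "beer"},
    {"bread", "butter"},
    {"bread", "milk", "diapers"},
]

pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Frequent pairs suggest promotions such as "customers who buy bread also buy milk".
for pair, count in pair_counts.most_common(3):
    support = count / len(transactions)
    print(pair, f"support={support:.2f}")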
● Data mining concepts are in use for Sales and marketing to provide better customer service, to improve
cross-selling opportunities, to increase direct mail response rates.
● Customer Retention in the form of pattern identification and prediction of likely defections is possible by Data
mining.
● The Risk Assessment and Fraud areas also use data-mining concepts for identifying inappropriate or unusual behaviour, etc.
Education: For analyzing the education sector, data mining uses the Educational Data Mining (EDM) method. This method generates patterns that can be used both by learners and educators. Using EDM, we can perform educational tasks such as:
Healthcare and Insurance: The pharmaceutical sector can examine its recent sales force activity and its outcomes to improve the targeting of high-value physicians and figure out which promotional activities will have the best effect in the upcoming months, whereas in the insurance sector, data mining can help to predict which customers will buy new policies, identify behaviour patterns of risky customers, and identify fraudulent behaviour of customers.