CHAPTER 1
INTRODUCTION
1. About Institute
1.1. Name
Naresh Information Technologies Hyderabad
1.2. Address
2nd Floor, Durga Bhavani Plaza, Satyam Theatre Road, Ameerpet
1.3. Certifications
Naresh IT is an ISO Certified Training Organization.
1.4. Courses
Naresh IT provides many courses taught by experienced trainers. A few of these courses are: Java (Core and Advanced), Python (Core and Advanced), Selenium, Dot Net, C, C++, C#, Data Structures, Oracle, Data Science, Big Data, Artificial Intelligence, Web Services, DevOps, AWS, Blockchain, etc.
2. About Trainer
Naveen Pathuru is a leading Data Science trainer in Hyderabad. He has trained thousands of students over his 22 years of work.
CHAPTER 2
COURSE
1. Introduction
Data science is a multi-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data.
Data science is a "concept to unify statistics, data analysis, machine learning and their
related methods" in order to "understand and analyze actual phenomena" with data. It employs
techniques and theories drawn from many fields within the context of mathematics, statistics,
computer science, and information science. Turing award winner Jim Gray imagined data
science as a "fourth paradigm" of science (empirical, theoretical, computational and now data-
driven) and asserted that "everything about science is changing because of the impact of
information technology" and the data deluge. In 2015, the American Statistical Association
identified database management, statistics and machine learning, and distributed and parallel
systems as the three emerging foundational professional communities.
In 2012, when Harvard Business Review called it "The Sexiest Job of the 21st Century”,
the term "data science" became a buzzword. It is now often used interchangeably with earlier
concepts like business analytics, business intelligence, predictive modeling, and statistics.
Even the suggestion that data science is sexy was paraphrasing Hans Rosling, featured in a
2011 BBC documentary with the quote, "Statistics is now the sexiest subject around." Nate
Silver referred to data science as a sexed up term for statistics. In many cases, earlier
approaches and solutions are now simply rebranded as "data science" to be more attractive,
which can cause the term to become "dilute[d] beyond usefulness." While many university
programs now offer a data science degree, there exists no consensus on a definition or suitable
curriculum contents. To the discredit of the discipline, however, many data-science and big-
data projects fail to deliver useful results, often as a result of poor management and utilization
of resources.
Data science is related to computer science, but is a separate field. Computer science
involves creating programs and algorithms to record and process data, while data science
covers any type of data analysis, which may or may not use computers. Data science is more closely related to the mathematical field of statistics, which includes the collection, organization, analysis, and presentation of data.
Because of the large amounts of data modern companies and organizations maintain, data
science has become an integral part of IT. For example, a company that has petabytes of user
data may use data science to develop effective ways to store, manage, and analyze the data.
The company may use the scientific method to run tests and extract results that can provide
meaningful insights about their users.
2. Covered Tools and Technologies
2.1. Python
Python is an interpreted, high-level, general-purpose programming language.
Python is a highly used language among data scientists due to its "write less, work more" property: concise code accomplishes a great deal of work.
2.2. R Programming
R is a programming language and free software environment for statistical
computing and graphics supported by the R Foundation for Statistical Computing. The R
language is widely used among statisticians and data miners for developing statistical
software and data analysis. R is best for descriptive data analysis.
2.4. Statistics
Statistics is a form of mathematical analysis that uses quantified models,
representations and synopses for a given set of experimental data or real-life studies.
Statistics studies methodologies to gather, review, analyze and draw conclusions from data, using techniques such as analysis of variance (ANOVA).
2.8. Big Data Analytics
Big Data analytics is the process of collecting, organizing and analyzing large sets of
data (called Big Data) to discover patterns and other useful information. This big data is
gathered from a wide variety of sources, including social networks, videos, digital images,
sensors, and sales transaction records.
2.9. Tableau
Tableau is a powerful and fast-growing data visualization tool used in the Business Intelligence industry. It helps simplify raw data into an easily understandable format.
Data analysis is very fast with Tableau, and the visualizations created take the form of dashboards and worksheets. The output created using Tableau can be understood by professionals at any level in an organization, and it even allows a non-technical user to create a customized dashboard. Key capabilities include:
Data blending
Real-time analysis
Collaboration on data
The great thing about Tableau is that it doesn't require any technical or programming skills to operate. The tool has garnered interest among people from all sectors, such as business, research, and other industries.
CHAPTER 3
R PROGRAMMING
1. Introduction
R is a language and environment for statistical computing and graphics. It is a GNU project
which is similar to the S language and environment which was developed at Bell Laboratories
(formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be
considered as a different implementation of S. There are some important differences, but much
code written for S runs unaltered under R.
R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical
tests, time-series analysis, classification, clustering, …) and graphical techniques, and is highly
extensible. The S language is often the vehicle of choice for research in statistical
methodology, and R provides an Open Source route to participation in that activity.
One of R’s strengths is the ease with which well-designed publication-quality plots can be
produced, including mathematical symbols and formulae where needed. Great care has been
taken over the defaults for the minor design choices in graphics, but the user retains full control.
R is available as Free Software under the terms of the Free Software Foundation’s GNU
General Public License in source code form. It compiles and runs on a wide variety of UNIX
platforms and similar systems (including FreeBSD and Linux), Windows and MacOS.
R is an integrated suite of software facilities for data manipulation, calculation and graphical
display. It includes
an effective data handling and storage facility
a suite of operators for calculations on arrays, in particular matrices
a large, coherent, integrated collection of intermediate tools for data analysis,
graphical facilities for data analysis and display either on-screen or on hardcopy
a well-developed, simple and effective programming language which includes
conditionals, loops, user-defined recursive functions and input and output facilities.
The term “environment” is intended to characterize it as a fully planned and coherent
system, rather than an incremental accretion of very specific and inflexible tools, as is
frequently the case with other data analysis software.
2. Installation
Step 1 – Download the R installer from https://cran.r-project.org/
Step 2 – Install R.
Step 3 – Download RStudio: https://www.rstudio.com/products/rstudio/download/
Step 4 – Install RStudio.
Step 5 – Check that R and RStudio are working. Open RStudio.
Step 6 – Install R packages required for the workshop.
3. IDE
RStudio is an integrated development environment for R, a programming language for
statistical computing and graphics. The RStudio IDE is developed by RStudio, Inc., a
commercial enterprise founded by JJ Allaire, creator of the programming language
ColdFusion.
4. Analysis Works
4.1. NBA Data Analysis
4.1.1. Data
Data is provided in the form of R vectors and matrices.
4.1.2. Goal
To find important insights from data.
4.1.3. Results
The analysis ended with several striking results, such as a skewed salary distribution and a continuous fall in top players' accuracy.
4.2.2. Goal
To analyze the data for important insights about blockbuster and super-hit movies.
4.2.3. Results
Fig 4.2.3.1: Audience Ratings Vs Critic Ratings with Genre and Budget
Fig 4.2.3.2: Audience Ratings Vs Critic Ratings with Genre
CHAPTER 4
MACHINE LEARNING
1. Introduction
Machine learning is concerned with computer programs that automatically improve their performance through experience.
Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves.
The process of learning begins with observations or data, such as examples, direct experience, or instruction, in order to look for patterns in data and make better decisions in the future based on the examples that we provide. The primary aim is to allow computers to learn automatically, without human intervention or assistance, and to adjust their actions accordingly.
Fig: Traditional programming versus machine learning. Traditional programming takes data and a program and produces output; machine learning takes data and the desired output and produces a program (model).
3. Types of Machine Learning
3.1. Supervised Learning
Supervised machine learning algorithms can apply what has been learned in the past
to new data using labeled examples to predict future events. Starting from the analysis of
a known training dataset, the learning algorithm produces an inferred function to make
predictions about the output values. The system is able to provide targets for any new input
after sufficient training. The learning algorithm can also compare its output with the
correct, intended output and find errors in order to modify the model accordingly.
Supervised learning problems can be further grouped into regression and classification
problems.
3.1.1.1. Classification
A classification problem is when the output variable is a category, such as “red” or
“blue” or “disease” and “no disease”.
3.1.1.2. Regression
A regression problem is when the output variable is a real value, such as “dollars”
or “weight”.
3.2. Unsupervised Learning
Unsupervised machine learning algorithms are used when the training data is neither classified nor labeled; the system explores the data to discover hidden structure on its own. Unsupervised learning problems can be further grouped into clustering and association problems.
3.2.1.1. Clustering
A clustering problem is where you want to discover the inherent groupings in the
data, such as grouping customers by purchasing behavior.
3.2.1.2. Association
An association rule learning problem is where you want to discover rules that
describe large portions of your data, such as people that buy X also tend to buy Y.
3.3. Reinforcement Learning
In reinforcement learning, an agent learns by interacting with its environment through actions and rewards: simple reward feedback is required for the agent to learn which action is best; this is known as the reinforcement signal.
4. Basic Algorithms
4.1. Linear Regression
Linear Regression is a machine learning algorithm based on supervised learning. It performs a regression task: regression models a target prediction value based on independent variables, and it is mostly used for finding the relationship between variables and for forecasting. Different regression models differ in the kind of relationship they assume between the dependent and independent variables and in the number of independent variables used.
Linear regression predicts a dependent variable value (y) based on a given independent variable (x); that is, it finds a linear relationship between x (input) and y (output), hence the name linear regression.
In the figure above, X (input) is the work experience and Y (output) is the salary of a
person. The regression line is the best fit line for our model.
Fig 4.1.2: Hypothesis Function for LR (y = θ1 + θ2·x)
When training the model, it fits the best line to predict the value of y for a given value of x. The model obtains the best regression fit line by finding the best θ1 and θ2 values:
θ1: intercept
θ2: coefficient of x
Once we find the best θ1 and θ2 values, we get the best fit line. So when we are finally
using our model for prediction, it will predict the value of y for the input value of x.
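The report does not include code for this section; the following is a minimal illustrative sketch using scikit-learn, where the experience and salary values are invented for demonstration:

# Minimal linear regression sketch (illustrative; the data values are invented).
import numpy as np
from sklearn.linear_model import LinearRegression

# x: years of work experience, y: salary (hypothetical numbers)
x = np.array([1, 2, 3, 4, 5, 6]).reshape(-1, 1)
y = np.array([30000, 35000, 41000, 46000, 52000, 58000])

model = LinearRegression()
model.fit(x, y)

print("theta1 (intercept):", model.intercept_)
print("theta2 (coefficient of x):", model.coef_[0])
print("predicted salary for 7 years:", model.predict([[7]])[0])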
4.2. Logistic Regression
Logistic regression is a supervised classification technique that models the probability that an input belongs to a particular class. In order to map predicted values to probabilities, we use the sigmoid function. The function maps any real value to another value between 0 and 1, and in machine learning we use the sigmoid to map predictions to probabilities.
Fig 4.2.2: Sigmoid Function Graph
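As an illustration of the sigmoid described above, a small NumPy sketch (not part of the original report):

# Sigmoid: maps any real value to the interval (0, 1).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))    # 0.5
print(sigmoid(5))    # close to 1
print(sigmoid(-5))   # close to 0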
4.3. K Nearest Neighbors
KNN (K-Nearest Neighbors) is one of many supervised learning algorithms used in data mining and machine learning. It is a classifier algorithm where the learning is based on how similar one data point (a vector) is to the others.
The idea behind KNN is pretty simple. Imagine that you have data about colored balls:
Purple balls;
Yellow balls;
And a ball whose color you don't know (purple or yellow), but for which you have all the data except the color label.
Each line refers to a ball and each column refers to a characteristic of that ball; in the last column we have the class (color) of each ball:
R -> purple;
A -> yellow
We have 5 balls there (5 lines), each one with its classification. You can try to discover the new ball's color (that is, its class) in N ways. One of these ways is to compare the new ball's characteristics with those of all the others and see which it most resembles: if the data (characteristics) of the new ball, whose correct class you don't know, is more similar to the data of the yellow balls, then the new ball is yellow; if its data is more similar to the data of the purple balls than to the yellow ones, then it is purple. It looks very simple, and that is almost what KNN does, but in a more sophisticated way.
In fact, KNN does not directly "compare" the new (unclassified) data point with all the others; it performs a mathematical calculation to measure the distance between data points in order to make the classification, which amounts to almost the same thing. This calculation can be any measure of the distance between two points, for example: Euclidean, Manhattan, Minkowski, or weighted distances.
Fig.4.3.1: KNN
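A minimal sketch of the idea with scikit-learn's KNeighborsClassifier; the ball measurements and labels below are invented for illustration:

# KNN sketch: classify a new ball by the majority class of its k nearest neighbours.
from sklearn.neighbors import KNeighborsClassifier

# Each row: hypothetical ball characteristics; labels: R = purple, A = yellow.
X = [[1.0, 5.2], [1.1, 4.9], [4.8, 1.0], [5.1, 0.9], [4.9, 1.2]]
y = ["R", "R", "A", "A", "A"]

knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn.fit(X, y)

new_ball = [[4.7, 1.1]]
print(knn.predict(new_ball))   # expected: ['A'] (yellow)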
4.4. K Means
Clustering is one of the most common exploratory data analysis techniques, used to get an intuition about the structure of the data. It can be defined as the task of identifying
subgroups in the data such that data points in the same subgroup (cluster) are very similar
while data points in different clusters are very different. In other words, we try to find
homogeneous subgroups within the data such that data points in each cluster are as similar
as possible according to a similarity measure such as Euclidean-based distance or
correlation-based distance. The decision of which similarity measure to use is application-
specific.
The K-means algorithm works as follows (a sketch of these steps is given below):
1. Initialize centroids by first shuffling the dataset and then randomly selecting K data points for the centroids, without replacement.
2. Keep iterating until there is no change to the centroids, i.e. the assignment of data points to clusters isn't changing:
a. Compute the squared distance between each data point and all centroids.
b. Assign each data point to the closest cluster (centroid).
c. Compute the centroids of the clusters by taking the average of all data points that belong to each cluster.
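A minimal NumPy sketch of these steps (illustrative only; the sample points are invented and empty clusters are not handled):

# K-means sketch following the steps above.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k data points at random, without replacement.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance of every point to every centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # Assign each point to its closest centroid.
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points
        # (simplified: assumes no cluster ends up empty).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # assignments no longer change
        centroids = new_centroids
    return labels, centroids

X = np.array([[1.0, 1.1], [0.9, 1.0], [5.0, 5.2], [5.1, 4.9], [9.0, 0.2], [8.8, 0.1]])
labels, centroids = kmeans(X, k=3)
print(labels)
print(centroids)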
4.5. Decision Trees
A decision tree is a decision support tool that uses a tree-like graph or model of decisions
and their possible consequences, including chance event outcomes, resource costs, and
utility. It is one way to display an algorithm that only contains conditional control
statements.
The first algorithm for random decision forests was created by Tin Kam Ho using the
random subspace method, which, in Ho's formulation, is a way to implement the
"stochastic discrimination" approach to classification proposed by Eugene Kleinberg.
An extension of the algorithm was developed by Leo Breiman and Adele Cutler, who registered "Random Forests" as a trademark (as of 2019, owned by Minitab, Inc.). The extension combines Breiman's "bagging" idea with random selection of features, introduced first by Ho and later independently by Amit and Geman, in order to construct a collection of decision trees with controlled variance.
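The report does not include code here; as a hedged illustration, the scikit-learn sketch below contrasts a single decision tree with a random forest (bagging plus random feature selection) on scikit-learn's bundled iris dataset:

# Decision tree vs. random forest sketch (illustrative).
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

print("single tree accuracy  :", tree.score(X_test, y_test))
print("random forest accuracy:", forest.score(X_test, y_test))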
5. Advanced Techniques
5.1. Ensemble Modelling
Ensemble modelling is a machine learning technique that combines several base models in order to produce one optimal predictive model. To better understand this definition, it helps to take a step back to the ultimate goal of machine learning and model building; the idea becomes clearer with specific examples of why ensemble methods are used.
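As one such example (a minimal sketch, not from the original report), several base models can be combined with simple majority voting using scikit-learn's VotingClassifier:

# Ensemble sketch: combine several base models with majority voting.
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("knn", KNeighborsClassifier()),
        ("tree", DecisionTreeClassifier(random_state=0)),
    ],
    voting="hard",  # each base model gets one vote
)

print("cross-validated accuracy:", cross_val_score(ensemble, X, y, cv=5).mean())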
5.3. XG Boost
XGBoost is a decision-tree-based ensemble machine learning algorithm that uses a gradient boosting framework. It supports a wide range of applications and can be used to solve regression, classification, ranking, and user-defined prediction problems.
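A minimal usage sketch, assuming the separate xgboost Python package is installed (pip install xgboost); the dataset is scikit-learn's bundled breast cancer data, chosen only for illustration:

# XGBoost sketch: gradient-boosted decision trees on a small classification task.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = XGBClassifier(n_estimators=200, max_depth=3, learning_rate=0.1)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))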
CHAPTER 5
DEEP LEARNING
1. Introduction
Deep learning is a machine learning technique that teaches computers to do what comes
naturally to humans: learn by example. Deep learning is a key technology behind driverless
cars, enabling them to recognize a stop sign, or to distinguish a pedestrian from a lamppost. It
is the key to voice control in consumer devices like phones, tablets, TVs, and hands-free
speakers. Deep learning is getting lots of attention lately and for good reason. It’s achieving
results that were not possible before. Deep learning is often brought in where traditional machine learning approaches fall short.
In deep learning, a computer model learns to perform classification tasks directly from images,
text, or sound. Deep learning models can achieve state-of-the-art accuracy, sometimes
exceeding human-level performance. Models are trained by using a large set of labeled data
and neural network architectures that contain many layers.
Deep learning maps inputs to outputs. It finds correlations. It is known as a “universal
approximator”, because it can learn to approximate an unknown function f(x) = y between any
input x and any output y, assuming they are related at all (by correlation or causation, for
example). In the process of learning, a neural network finds the right f, or the correct manner
of transforming x into y, whether that be f(x) = 3x + 12 or f(x) = 9x - 0.1. Here are a few
examples of what deep learning can do.
1.1. Classification
All classification tasks depend upon labeled datasets; that is, humans must transfer their
knowledge to the dataset in order for a neural network to learn the correlation between labels
and data. This is known as supervised learning.
Detect faces, identify people in images, recognize facial expressions (angry, joyful)
Identify objects in images (stop signs, pedestrians, lane markers…)
Recognize gestures in video
Detect voices, identify speakers, transcribe speech to text, recognize sentiment in voices
Classify text as spam (in emails), or fraudulent (in insurance claims); recognize sentiment in
text (customer feedback)
Any labels that humans can generate, any outcomes that you care about and which correlate to
data, can be used to train a neural network.
1.2. Clustering
Clustering or grouping is the detection of similarities. Deep learning does not require labels
to detect similarities. Learning without labels is called unsupervised learning. Unlabeled data
is the majority of data in the world. One law of machine learning is: the more data an algorithm
can train on, the more accurate it will be. Therefore, unsupervised learning has the potential to
produce highly accurate models.
2. Neural Networks
Deep learning is implemented through neural networks. A neural network is a series of
algorithms that endeavors to recognize underlying relationships in a set of data through a
process that mimics the way the human brain operates. In this sense, neural networks refer to
systems of neurons, either organic or artificial in nature.
Neural networks help us cluster and classify. You can think of them as a clustering and
classification layer on top of the data you store and manage. They help to group unlabeled data
according to similarities among the example inputs, and they classify data when they have a
labeled dataset to train on. (Neural networks can also extract features that are fed to other algorithms for clustering and classification; so you can think of deep neural networks as components of larger machine-learning applications involving algorithms for reinforcement learning, classification and regression.)
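The report does not show code for this section; as a hedged sketch, a small classification network can be built with Keras (TensorFlow). The data below is randomly generated and purely illustrative:

# Minimal neural-network sketch with Keras: classify vectors into labelled groups.
# (Illustrative only; assumes TensorFlow/Keras is installed.)
import numpy as np
from tensorflow import keras

# Tiny made-up dataset: 100 samples with 4 features, 3 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = rng.integers(0, 3, size=100)

model = keras.Sequential([
    keras.layers.Dense(16, activation="relu", input_shape=(4,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(3, activation="softmax"),  # one probability per class
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, verbose=0)
print(model.predict(X[:3]))  # class probabilities for the first three samples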
3. Network Architectures
The following types of network architectures are commonly used:
3.1. Artificial Neural Networks
A lot of the advances in artificial intelligence are new statistical models, but the
overwhelming majority of the advances are in a technology called artificial neural
networks (ANN). If you’ve read anything about them before, you’ll have read that these
ANNs are a very rough model of how the human brain is structured. Take note that there
is a difference between artificial neural networks and neural networks. Though most
people drop the artificial for the sake of brevity, the word artificial was prepended to the
phrase so that people in computational neurobiology could still use the term neural
network to refer to their work. Below is a diagram of actual neurons and synapses in the
brain compared to artificial ones.
In the following figure of an ANN, we have these units of calculation called neurons. These artificial neurons are connected by synapses, which are really just weighted values. What this means is that, given a number, a neuron will perform some sort of calculation (for example the sigmoid function), and then the result of this calculation will be multiplied by a weight as it "travels." The weighted result can sometimes be the output of your neural network, or, as I'll talk about soon, you can have more neurons configured in layers, which is the basic concept behind the idea that we call deep learning.
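The "weighted calculation" described above can be written in a few lines of NumPy; the input values, weights, and bias below are arbitrary numbers chosen only for illustration:

# A single artificial neuron: weighted sum of inputs passed through an activation.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

inputs = np.array([0.5, -1.2, 3.0])    # values arriving on the synapses
weights = np.array([0.8, 0.1, -0.4])   # synapse weights (arbitrary)
bias = 0.2

output = sigmoid(np.dot(inputs, weights) + bias)
print(output)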
Artificial neural networks are not a new concept. In fact, we didn't even always call them neural networks, and they certainly don't look the same now as they did at their inception. Back in the 1960s we had what was called a perceptron. Perceptrons were made of McCulloch-Pitts neurons. We even had biased perceptrons, and ultimately people started creating multilayer perceptrons, which are synonymous with the general artificial neural networks we hear about now.
Fig: 3.1.2: A simple multilayer perceptron
But wait, if we've had neural networks since the 1960s, why are they just now getting huge? It's a long story, and I encourage you to listen to the podcast episode in which the "fathers" of modern ANNs talk about their perspective on the topic. To quickly summarize, there is a handful of factors that kept ANNs from becoming more popular: we didn't have the computer processing power and we didn't have the data to train them, and using them was frowned upon because they seemed to have an arbitrary ability to perform well. Each one of these factors is changing. Our computers are getting faster and more powerful, and with the internet, we have all kinds of data being shared for use.
Fig: 2.3.2: A simple backpropagation algorithm
3.3. Convolutional Neural Networks
Convolutional Neural Networks (CNN), sometimes called LeNets (named after Yann
LeCun), are artificial neural networks where the connections between layers appear to be
somewhat arbitrary. However, the reason for the synapses to be setup the way they are is
to help reduce the number of parameters that need to be optimized. This is done by
noting a certain symmetry in how the neurons are connected, and so you can essentially
“re-use” neurons to have identical copies without necessarily needing the same number
of synapses. CNNs are commonly used in working with images thanks to their ability to
recognize patterns in surrounding pixels. There’s redundant information contained when
you look at each individual pixel compared to its surrounding pixels, and you can
actually compress some of this information thanks to their symmetrical properties.
CHAPTER 6
BIG DATA ANALYTICS
1. Introduction to Big Data
Big data is a term applied to data sets whose size is beyond the ability of commonly used
software tools to capture, manage, and process the data within a tolerable elapsed time.
2.2. Data management. Data needs to be high quality and well-governed before it can be reliably
analyzed. With data constantly flowing in and out of an organization, it's important to
establish repeatable processes to build and maintain standards for data quality. Once data is
reliable, organizations should establish a master data management program that gets the
entire enterprise on the same page.
2.3. Data mining. Data mining technology helps you examine large amounts of data to discover
patterns in the data – and this information can be used for further analysis to help answer
complex business questions. With data mining software, you can sift through all the chaotic
and repetitive noise in data, pinpoint what's relevant, use that information to assess likely
outcomes, and then accelerate the pace of making informed decisions.
2.4. Hadoop. This open source software framework can store large amounts of data and run
applications on clusters of commodity hardware. It has become a key technology to doing
business due to the constant increase of data volumes and varieties, and its distributed
computing model processes big data fast. An additional benefit is that Hadoop's open source
framework is free and uses commodity hardware to store large quantities of data.
2.5. In-memory analytics. By analyzing data from system memory (instead of from your hard
disk drive), you can derive immediate insights from your data and act on them quickly. This
technology is able to remove data prep and analytical processing latencies to test new
scenarios and create models; it's not only an easy way for organizations to stay agile and
make better business decisions, it also enables them to run iterative and interactive analytics
scenarios.
2.6. Predictive analytics. Predictive analytics technology uses data, statistical algorithms and
machine-learning techniques to identify the likelihood of future outcomes based on historical
data. It's all about providing a best assessment on what will happen in the future, so
organizations can feel more confident that they're making the best possible business decision.
Some of the most common applications of predictive analytics include fraud detection, risk,
operations and marketing.
2.7. Text mining. With text mining technology, you can analyze text data from the web,
comment fields, books and other text-based sources to uncover insights you hadn't noticed
before. Text mining uses machine learning or natural language processing technology to
comb through documents – emails, blogs, Twitter feeds, surveys, competitive intelligence
and more – to help you analyze large amounts of information and discover new topics and
term relationships.
CHAPTER 7
PROJECTS
1. Performance Improvement
2. Projects
2.1. Handwritten Letters Recognition
2.1.1. Objective
To predict letters from images.
2.1.2. Results
Fig: 2.1.2.1 Handwritten letters prediction
2.1.3. Model Summary
Model: "sequential_1"
Layer (type) Output Shape PARAM #
conv2d_2 (Conv2D) multiple 640
max_pooling2d_2 (MaxPooling2 multiple 0
dropout_2 (Dropout) multiple 0
conv2d_3 (Conv2D) multiple 36928
max_pooling2d_3 (MaxPooling2 multiple 0
dropout_3 (Dropout) multiple 0
global_average_pooling2d_1 multiple 0
dense_1 (Dense) multiple 1690
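A Keras model consistent with the layer list and parameter counts above would look roughly like the sketch below. The kernel sizes, dropout rates, 28x28x1 input shape, and 26 output classes (one per letter) are inferred from the parameter counts and are therefore assumptions, not taken from the report itself:

# Plausible reconstruction of the summarised model (illustrative; see assumptions above).
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Conv2D(64, (3, 3), activation="relu", input_shape=(28, 28, 1)),  # 640 params
    keras.layers.MaxPooling2D(),
    keras.layers.Dropout(0.25),
    keras.layers.Conv2D(64, (3, 3), activation="relu"),                           # 36,928 params
    keras.layers.MaxPooling2D(),
    keras.layers.Dropout(0.25),
    keras.layers.GlobalAveragePooling2D(),
    keras.layers.Dense(26, activation="softmax"),                                 # 1,690 params (26 letters)
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()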
2.2. Stock Market Forecasting
2.2.1. Objective
To predict the future performance of the stock market.
2.2.2. Output
Fig:2.2.2.2 Prediction