CHAPTER 1
INTRODUCTION
1. About Institute
1.1. Name
Naresh Information Technologies Hyderabad
1.2. Address
2nd Floor, Durga Bhavani Plaza, Satyam Theatre Road, Ameerpet
1.3. Certifications
Naresh IT is an ISO Certified Training Organization.
1.4. Courses
Naresh IT provides many courses taught by experienced trainers. A few of these courses are: Java (Core and Advanced), Python (Core and Advanced), Selenium, Dot Net, C, C++, C#, Data Structures, Oracle, Data Science, Big Data, Artificial Intelligence, Web Services, DevOps, AWS, Blockchain, etc.
2. About Trainer
Naveen Pathuru is a leading Data Science trainer in Hyderabad. He has trained thousands of students over his 22 years of work.
CHAPTER 2
COURSE
1. Introduction
Data science is a multi-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data.
Data science is a "concept to unify statistics, data analysis, machine learning and their
related methods" in order to "understand and analyze actual phenomena" with data. It employs
techniques and theories drawn from many fields within the context of mathematics, statistics,
computer science, and information science. Turing award winner Jim Gray imagined data
science as a "fourth paradigm" of science (empirical, theoretical, computational and now data-
driven) and asserted that "everything about science is changing because of the impact of
information technology" and the data deluge. In 2015, the American Statistical Association
identified database management, statistics and machine learning, and distributed and parallel
systems as the three emerging foundational professional communities.
In 2012, when Harvard Business Review called it "The Sexiest Job of the 21st Century”,
the term "data science" became a buzzword. It is now often used interchangeably with earlier
concepts like business analytics, business intelligence, predictive modeling, and statistics.
Even the suggestion that data science is sexy was paraphrasing Hans Rosling, featured in a
2011 BBC documentary with the quote, "Statistics is now the sexiest subject around." Nate
Silver referred to data science as a sexed up term for statistics. In many cases, earlier
approaches and solutions are now simply rebranded as "data science" to be more attractive,
which can cause the term to become "dilute[d] beyond usefulness." While many university
programs now offer a data science degree, there exists no consensus on a definition or suitable
curriculum contents. To the discredit of the discipline, however, many data-science and big-
data projects fail to deliver useful results, often as a result of poor management and utilization
of resources.
Data science is related to computer science, but is a separate field. Computer science
involves creating programs and algorithms to record and process data, while data science
covers any type of data analysis, which may or may not use computers. Data science is more closely related to the mathematical field of statistics, which includes the collection, organization, analysis, and presentation of data.
Because of the large amounts of data modern companies and organizations maintain, data
science has become an integral part of IT. For example, a company that has petabytes of user
data may use data science to develop effective ways to store, manage, and analyze the data.
The company may use the scientific method to run tests and extract results that can provide
meaningful insights about their users.
2. Covered Tools and Technologies
2.1. Python
Python is an interpreted, high-level, general-purpose programming language.
Python is a highly used language among data scientists due to its "write less, work more" property: concise code accomplishes a great deal of work.
2.2. R Programming
R is a programming language and free software environment for statistical
computing and graphics supported by the R Foundation for Statistical Computing. The R
language is widely used among statisticians and data miners for developing statistical
software and data analysis. R is best for descriptive data analysis.
2.4. Statistics
Statistics is a form of mathematical analysis that uses quantified models,
representations and synopses for a given set of experimental data or real-life studies.
Statistics studies methodologies to gather, review, analyze and draw conclusions from data, using techniques such as analysis of variance (ANOVA).
2.8. Big Data Analytics
Big Data analytics is the process of collecting, organizing and analyzing large sets of
data (called Big Data) to discover patterns and other useful information. This big data is
gathered from a wide variety of sources, including social networks, videos, digital images,
sensors, and sales transaction records.
2.9. Tableau
Tableau is a powerful and fast-growing data visualization tool used in the Business Intelligence industry. It helps simplify raw data into an easily understandable format.
Data analysis is very fast with Tableau, and the visualizations created take the form of dashboards and worksheets. The output created using Tableau can be understood by professionals at any level in an organization, and it even allows a non-technical user to create a customized dashboard. Key capabilities include:
Data blending
Real-time analysis
Collaboration on data
The great thing about Tableau is that it doesn't require any technical or programming skills to operate. The tool has garnered interest among people from all sectors, such as business, research, and other industries.
CHAPTER 3
R PROGRAMMING
1. Introduction
R is a language and environment for statistical computing and graphics. It is a GNU project
which is similar to the S language and environment which was developed at Bell Laboratories
(formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be
considered as a different implementation of S. There are some important differences, but much
code written for S runs unaltered under R.
R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical
tests, time-series analysis, classification, clustering, …) and graphical techniques, and is highly
extensible. The S language is often the vehicle of choice for research in statistical
methodology, and R provides an Open Source route to participation in that activity.
One of R’s strengths is the ease with which well-designed publication-quality plots can be
produced, including mathematical symbols and formulae where needed. Great care has been
taken over the defaults for the minor design choices in graphics, but the user retains full control.
R is available as Free Software under the terms of the Free Software Foundation’s GNU
General Public License in source code form. It compiles and runs on a wide variety of UNIX
platforms and similar systems (including FreeBSD and Linux), Windows and MacOS.
R is an integrated suite of software facilities for data manipulation, calculation and graphical
display. It includes
an effective data handling and storage facility
a suite of operators for calculations on arrays, in particular matrices
a large, coherent, integrated collection of intermediate tools for data analysis,
graphical facilities for data analysis and display either on-screen or on hardcopy
a well-developed, simple and effective programming language which includes
conditionals, loops, user-defined recursive functions and input and output facilities.
The term “environment” is intended to characterize it as a fully planned and coherent
system, rather than an incremental accretion of very specific and inflexible tools, as is
frequently the case with other data analysis software.
2. Installation
Step 1 – Download the R installer from https://cran.r-project.org/
Step 2 – Install R.
Step 3 – Download RStudio: https://www.rstudio.com/products/rstudio/download/
Step 4 – Install RStudio.
Step 5 – Check that R and RStudio are working. Open RStudio.
Step 6 – Install R packages required for the workshop.
3. IDE
RStudio is an integrated development environment for R, a programming language for
statistical computing and graphics. The RStudio IDE is developed by RStudio, Inc., a
commercial enterprise founded by JJ Allaire, creator of the programming language
ColdFusion.
4. Analysis Works
4.1. NBA Data Analysis
4.1.1. Data
Data is provided in the form of R vectors and matrices.
4.1.2. Goal
To find important insights from data.
4.1.3. Results
The analysis ended with several striking results, such as a skewed salary distribution and a continuous fall in top players' accuracy.
4.2.2. Goal
To analyze the data for important insights about blockbuster and super-hit movies.
4.2.3. Results
Fig 4.2.3.1: Audience Ratings Vs Critic Ratings with Genre and Budget
Fig 4.2.3.2: Audience Ratings Vs Critic Ratings with Genre
CHAPTER 4
MACHINE LEARNING
1. Introduction
Machine learning is concerned with computer programs that automatically improve their performance through experience.
Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves.
The process of learning begins with observations or data, such as examples, direct experience, or instruction, in order to look for patterns in data and make better decisions in the future based on the examples that we provide. The primary aim is to allow computers to learn automatically, without human intervention or assistance, and to adjust their actions accordingly.
Fig: Traditional programming versus machine learning. Traditional programming takes data and a program and produces output; machine learning takes data and the desired output and produces a program (model).
3. Types of Machine Learning
3.1. Supervised Learning
Supervised machine learning algorithms can apply what has been learned in the past
to new data using labeled examples to predict future events. Starting from the analysis of
a known training dataset, the learning algorithm produces an inferred function to make
predictions about the output values. The system is able to provide targets for any new input
after sufficient training. The learning algorithm can also compare its output with the
correct, intended output and find errors in order to modify the model accordingly.
Supervised learning problems can be further grouped into regression and classification
problems.
3.1.1.1. Classification
A classification problem is when the output variable is a category, such as “red” or
“blue” or “disease” and “no disease”.
3.1.1.2. Regression
A regression problem is when the output variable is a real value, such as “dollars”
or “weight”.
3.2. Unsupervised Learning
Unsupervised machine learning algorithms are used when the training data is neither classified nor labeled; the system explores the data to discover hidden structure on its own. Unsupervised learning problems can be further grouped into clustering and association problems.
3.2.1.1. Clustering
A clustering problem is where you want to discover the inherent groupings in the
data, such as grouping customers by purchasing behavior.
3.2.1.2. Association
An association rule learning problem is where you want to discover rules that
describe large portions of your data, such as people that buy X also tend to buy Y.
3.3. Reinforcement Learning
In reinforcement learning, an agent learns by interacting with its environment through actions and rewards: simple reward feedback is required for the agent to learn which action is best; this is known as the reinforcement signal.
4. Basic Algorithms
4.1. Linear Regression
Linear Regression is a machine learning algorithm based on supervised learning. It performs a regression task: regression models a target prediction value based on independent variables, and it is mostly used for finding the relationship between variables and for forecasting. Different regression models differ in the kind of relationship they assume between the dependent and independent variables and in the number of independent variables used.
Linear regression predicts a dependent variable value (y) based on a given independent variable (x); that is, it finds a linear relationship between x (input) and y (output), hence the name linear regression.
In the figure above, X (input) is the work experience and Y (output) is the salary of a
person. The regression line is the best fit line for our model.
Fig 4.1.2: Hypothesis Function for LR (y = θ1 + θ2·x)
When training the model, it fits the best line to predict the value of y for a given value of x. The model obtains the best regression fit line by finding the best θ1 and θ2 values:
θ1: intercept
θ2: coefficient of x
Once we find the best θ1 and θ2 values, we get the best fit line. So when we are finally
using our model for prediction, it will predict the value of y for the input value of x.
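The report does not include code for this section; the following is a minimal illustrative sketch using scikit-learn, where the experience and salary values are invented for demonstration:

# Minimal linear regression sketch (illustrative; the data values are invented).
import numpy as np
from sklearn.linear_model import LinearRegression

# x: years of work experience, y: salary (hypothetical numbers)
x = np.array([1, 2, 3, 4, 5, 6]).reshape(-1, 1)
y = np.array([30000, 35000, 41000, 46000, 52000, 58000])

model = LinearRegression()
model.fit(x, y)

print("theta1 (intercept):", model.intercept_)
print("theta2 (coefficient of x):", model.coef_[0])
print("predicted salary for 7 years:", model.predict([[7]])[0])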
4.2. Logistic Regression
Logistic regression is a supervised classification technique that models the probability that an input belongs to a particular class. In order to map predicted values to probabilities, we use the sigmoid function. The function maps any real value to another value between 0 and 1, and in machine learning we use the sigmoid to map predictions to probabilities.
Fig 4.2.2: Sigmoid Function Graph
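As an illustration of the sigmoid described above, a small NumPy sketch (not part of the original report):

# Sigmoid: maps any real value to the interval (0, 1).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))    # 0.5
print(sigmoid(5))    # close to 1
print(sigmoid(-5))   # close to 0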
4.3. K Nearest Neighbors
KNN (K-Nearest Neighbors) is one of many supervised learning algorithms used in data mining and machine learning. It is a classifier algorithm where the learning is based on how similar one data point (a vector) is to the others.
The idea behind KNN is pretty simple. Imagine that you have data about colored balls:
Purple balls;
Yellow balls;
And a ball whose color you don't know (purple or yellow), but for which you have all the data except the color label.
Each line refers to a ball and each column refers to a characteristic of that ball; in the last column we have the class (color) of each ball:
R -> purple;
A -> yellow
We have 5 balls there (5 lines), each one with its classification. You can try to discover the new ball's color (that is, its class) in N ways. One of these ways is to compare the new ball's characteristics with those of all the others and see which it most resembles: if the data (characteristics) of the new ball, whose correct class you don't know, is more similar to the data of the yellow balls, then the new ball is yellow; if its data is more similar to the data of the purple balls than to the yellow ones, then it is purple. It looks very simple, and that is almost what KNN does, but in a more sophisticated way.
In fact, KNN does not directly "compare" the new (unclassified) data point with all the others; it performs a mathematical calculation to measure the distance between data points in order to make the classification, which amounts to almost the same thing. This calculation can be any measure of the distance between two points, for example: Euclidean, Manhattan, Minkowski, or weighted distances.
Fig.4.3.1: KNN
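A minimal sketch of the idea with scikit-learn's KNeighborsClassifier; the ball measurements and labels below are invented for illustration:

# KNN sketch: classify a new ball by the majority class of its k nearest neighbours.
from sklearn.neighbors import KNeighborsClassifier

# Each row: hypothetical ball characteristics; labels: R = purple, A = yellow.
X = [[1.0, 5.2], [1.1, 4.9], [4.8, 1.0], [5.1, 0.9], [4.9, 1.2]]
y = ["R", "R", "A", "A", "A"]

knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn.fit(X, y)

new_ball = [[4.7, 1.1]]
print(knn.predict(new_ball))   # expected: ['A'] (yellow)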
4.4. K Means
Clustering is one of the most common exploratory data analysis techniques, used to get an intuition about the structure of the data. It can be defined as the task of identifying
subgroups in the data such that data points in the same subgroup (cluster) are very similar
while data points in different clusters are very different. In other words, we try to find
homogeneous subgroups within the data such that data points in each cluster are as similar
as possible according to a similarity measure such as Euclidean-based distance or
correlation-based distance. The decision of which similarity measure to use is application-
specific.
The K-means algorithm works as follows (a sketch of these steps is given below):
1. Initialize centroids by first shuffling the dataset and then randomly selecting K data points for the centroids, without replacement.
2. Keep iterating until there is no change to the centroids, i.e. the assignment of data points to clusters isn't changing:
a. Compute the squared distance between each data point and all centroids.
b. Assign each data point to the closest cluster (centroid).
c. Compute the centroids of the clusters by taking the average of all data points that belong to each cluster.
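A minimal NumPy sketch of these steps (illustrative only; the sample points are invented and empty clusters are not handled):

# K-means sketch following the steps above.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k data points at random, without replacement.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance of every point to every centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # Assign each point to its closest centroid.
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points
        # (simplified: assumes no cluster ends up empty).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # assignments no longer change
        centroids = new_centroids
    return labels, centroids

X = np.array([[1.0, 1.1], [0.9, 1.0], [5.0, 5.2], [5.1, 4.9], [9.0, 0.2], [8.8, 0.1]])
labels, centroids = kmeans(X, k=3)
print(labels)
print(centroids)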
4.5. Decision Trees
A decision tree is a decision support tool that uses a tree-like graph or model of decisions
and their possible consequences, including chance event outcomes, resource costs, and
utility. It is one way to display an algorithm that only contains conditional control
statements.
The first algorithm for random decision forests was created by Tin Kam Ho using the
random subspace method, which, in Ho's formulation, is a way to implement the
"stochastic discrimination" approach to classification proposed by Eugene Kleinberg.
An extension of the algorithm was developed by Leo Breiman and Adele Cutler, who registered "Random Forests" as a trademark (as of 2019, owned by Minitab, Inc.). The extension combines Breiman's "bagging" idea with random selection of features, introduced first by Ho and later independently by Amit and Geman, in order to construct a collection of decision trees with controlled variance.
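The report does not include code here; as a hedged illustration, the scikit-learn sketch below contrasts a single decision tree with a random forest (bagging plus random feature selection) on scikit-learn's bundled iris dataset:

# Decision tree vs. random forest sketch (illustrative).
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

print("single tree accuracy  :", tree.score(X_test, y_test))
print("random forest accuracy:", forest.score(X_test, y_test))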
5. Advanced Techniques
5.1. Ensemble Modelling
Ensemble modelling is a machine learning technique that combines several base models in order to produce one optimal predictive model. To better understand this definition, it helps to take a step back to the ultimate goal of machine learning and model building; the idea becomes clearer with specific examples of why ensemble methods are used.
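As one such example (a minimal sketch, not from the original report), several base models can be combined with simple majority voting using scikit-learn's VotingClassifier:

# Ensemble sketch: combine several base models with majority voting.
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("knn", KNeighborsClassifier()),
        ("tree", DecisionTreeClassifier(random_state=0)),
    ],
    voting="hard",  # each base model gets one vote
)

print("cross-validated accuracy:", cross_val_score(ensemble, X, y, cv=5).mean())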
5.3. XG Boost
XGBoost is a decision-tree-based ensemble machine learning algorithm that uses a gradient boosting framework. It supports a wide range of applications and can be used to solve regression, classification, ranking, and user-defined prediction problems.
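A minimal usage sketch, assuming the separate xgboost Python package is installed (pip install xgboost); the dataset is scikit-learn's bundled breast cancer data, chosen only for illustration:

# XGBoost sketch: gradient-boosted decision trees on a small classification task.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = XGBClassifier(n_estimators=200, max_depth=3, learning_rate=0.1)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))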
CHAPTER 5
DEEP LEARNING
1. Introduction
Deep learning is a machine learning technique that teaches computers to do what comes
naturally to humans: learn by example. Deep learning is a key technology behind driverless
cars, enabling them to recognize a stop sign, or to distinguish a pedestrian from a lamppost. It
is the key to voice control in consumer devices like phones, tablets, TVs, and hands-free
speakers. Deep learning is getting lots of attention lately and for good reason. It’s achieving
results that were not possible before. Deep learning is often brought in where traditional machine learning approaches fall short.
In deep learning, a computer model learns to perform classification tasks directly from images,
text, or sound. Deep learning models can achieve state-of-the-art accuracy, sometimes
exceeding human-level performance. Models are trained by using a large set of labeled data
and neural network architectures that contain many layers.
Deep learning maps inputs to outputs. It finds correlations. It is known as a “universal
approximator”, because it can learn to approximate an unknown function f(x) = y between any
input x and any output y, assuming they are related at all (by correlation or causation, for
example). In the process of learning, a neural network finds the right f, or the correct manner
of transforming x into y, whether that be f(x) = 3x + 12 or f(x) = 9x - 0.1. Here are a few
examples of what deep learning can do.
1.1. Classification
All classification tasks depend upon labeled datasets; that is, humans must transfer their
knowledge to the dataset in order for a neural network to learn the correlation between labels
and data. This is known as supervised learning.
Detect faces, identify people in images, recognize facial expressions (angry, joyful)
Identify objects in images (stop signs, pedestrians, lane markers…)
Recognize gestures in video
Detect voices, identify speakers, transcribe speech to text, recognize sentiment in voices
Classify text as spam (in emails), or fraudulent (in insurance claims); recognize sentiment in
text (customer feedback)
Any labels that humans can generate, any outcomes that you care about and which correlate to
data, can be used to train a neural network.
1.2. Clustering
Clustering or grouping is the detection of similarities. Deep learning does not require labels
to detect similarities. Learning without labels is called unsupervised learning. Unlabeled data
is the majority of data in the world. One law of machine learning is: the more data an algorithm
can train on, the more accurate it will be. Therefore, unsupervised learning has the potential to
produce highly accurate models.
2. Neural Networks
Deep learning is implemented through neural networks. A neural network is a series of
algorithms that endeavors to recognize underlying relationships in a set of data through a
process that mimics the way the human brain operates. In this sense, neural networks refer to
systems of neurons, either organic or artificial in nature.
Neural networks help us cluster and classify. You can think of them as a clustering and
classification layer on top of the data you store and manage. They help to group unlabeled data
according to similarities among the example inputs, and they classify data when they have a
labeled dataset to train on. (Neural networks can also extract features that are fed to other algorithms for clustering and classification; so you can think of deep neural networks as components of larger machine-learning applications involving algorithms for reinforcement learning, classification and regression.)
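The report does not show code for this section; as a hedged sketch, a small classification network can be built with Keras (TensorFlow). The data below is randomly generated and purely illustrative:

# Minimal neural-network sketch with Keras: classify vectors into labelled groups.
# (Illustrative only; assumes TensorFlow/Keras is installed.)
import numpy as np
from tensorflow import keras

# Tiny made-up dataset: 100 samples with 4 features, 3 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = rng.integers(0, 3, size=100)

model = keras.Sequential([
    keras.layers.Dense(16, activation="relu", input_shape=(4,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(3, activation="softmax"),  # one probability per class
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, verbose=0)
print(model.predict(X[:3]))  # class probabilities for the first three samples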
3. Network Architectures
The following types of network architectures are commonly used:
3.1. Artificial Neural Networks
A lot of the advances in artificial intelligence are new statistical models, but the
overwhelming majority of the advances are in a technology called artificial neural
networks (ANN). If you’ve read anything about them before, you’ll have read that these
ANNs are a very rough model of how the human brain is structured. Take note that there
is a difference between artificial neural networks and neural networks. Though most
people drop the artificial for the sake of brevity, the word artificial was prepended to the
phrase so that people in computational neurobiology could still use the term neural
network to refer to their work. Below is a diagram of actual neurons and synapses in the
brain compared to artificial ones.
In the following figure of an ANN, we have these units of calculation called neurons. These artificial neurons are connected by synapses, which are really just weighted values. What this means is that, given a number, a neuron will perform some sort of calculation (for example the sigmoid function), and then the result of this calculation will be multiplied by a weight as it "travels." The weighted result can sometimes be the output of your neural network, or, as I'll talk about soon, you can have more neurons configured in layers, which is the basic concept behind the idea that we call deep learning.
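The "weighted calculation" described above can be written in a few lines of NumPy; the input values, weights, and bias below are arbitrary numbers chosen only for illustration:

# A single artificial neuron: weighted sum of inputs passed through an activation.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

inputs = np.array([0.5, -1.2, 3.0])    # values arriving on the synapses
weights = np.array([0.8, 0.1, -0.4])   # synapse weights (arbitrary)
bias = 0.2

output = sigmoid(np.dot(inputs, weights) + bias)
print(output)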
Artificial neural networks are not a new concept. In fact, we didn't even always call them neural networks, and they certainly don't look the same now as they did at their inception. Back in the 1960s we had what was called a perceptron. Perceptrons were made of McCulloch-Pitts neurons. We even had biased perceptrons, and ultimately people started creating multilayer perceptrons, which are synonymous with the general artificial neural networks we hear about now.
Fig: 3.1.2: A simple multilayer perceptron
But wait, if we've had neural networks since the 1960s, why are they just now getting huge? It's a long story, and I encourage you to listen to the podcast episode in which the "fathers" of modern ANNs talk about their perspective on the topic. To quickly summarize, there is a handful of factors that kept ANNs from becoming more popular: we didn't have the computer processing power and we didn't have the data to train them, and using them was frowned upon because they seemed to have an arbitrary ability to perform well. Each one of these factors is changing. Our computers are getting faster and more powerful, and with the internet, we have all kinds of data being shared for use.
Fig: 2.3.2: A simple backpropagation algorithm
3.3. Convolutional Neural Networks
Convolutional Neural Networks (CNN), sometimes called LeNets (named after Yann
LeCun), are artificial neural networks where the connections between layers appear to be
somewhat arbitrary. However, the reason for the synapses to be setup the way they are is
to help reduce the number of parameters that need to be optimized. This is done by
noting a certain symmetry in how the neurons are connected, and so you can essentially
“re-use” neurons to have identical copies without necessarily needing the same number
of synapses. CNNs are commonly used in working with images thanks to their ability to
recognize patterns in surrounding pixels. There’s redundant information contained when
you look at each individual pixel compared to its surrounding pixels, and you can
actually compress some of this information thanks to their symmetrical properties.
CHAPTER 6
BIG DATA ANALYTICS
1. Introduction to Big Data
Big data is a term applied to data sets whose size is beyond the ability of commonly used
software tools to capture, manage, and process the data within a tolerable elapsed time.
2.2. Data management. Data needs to be high quality and well-governed before it can be reliably
analyzed. With data constantly flowing in and out of an organization, it's important to
establish repeatable processes to build and maintain standards for data quality. Once data is
reliable, organizations should establish a master data management program that gets the
entire enterprise on the same page.
2.3. Data mining. Data mining technology helps you examine large amounts of data to discover
patterns in the data – and this information can be used for further analysis to help answer
complex business questions. With data mining software, you can sift through all the chaotic
and repetitive noise in data, pinpoint what's relevant, use that information to assess likely
outcomes, and then accelerate the pace of making informed decisions.
2.4. Hadoop. This open source software framework can store large amounts of data and run
applications on clusters of commodity hardware. It has become a key technology to doing
business due to the constant increase of data volumes and varieties, and its distributed
computing model processes big data fast. An additional benefit is that Hadoop's open source
framework is free and uses commodity hardware to store large quantities of data.
2.5. In-memory analytics. By analyzing data from system memory (instead of from your hard
disk drive), you can derive immediate insights from your data and act on them quickly. This
technology is able to remove data prep and analytical processing latencies to test new
scenarios and create models; it's not only an easy way for organizations to stay agile and
make better business decisions, it also enables them to run iterative and interactive analytics
scenarios.
2.6. Predictive analytics. Predictive analytics technology uses data, statistical algorithms and
machine-learning techniques to identify the likelihood of future outcomes based on historical
data. It's all about providing a best assessment on what will happen in the future, so
organizations can feel more confident that they're making the best possible business decision.
Some of the most common applications of predictive analytics include fraud detection, risk,
operations and marketing.
2.7. Text mining. With text mining technology, you can analyze text data from the web,
comment fields, books and other text-based sources to uncover insights you hadn't noticed
before. Text mining uses machine learning or natural language processing technology to
comb through documents – emails, blogs, Twitter feeds, surveys, competitive intelligence
and more – to help you analyze large amounts of information and discover new topics and
term relationships.
CHAPTER 7
PROJECTS
1. Performance Improvement
2. Projects
2.1. Handwritten Letters Recognition
2.1.1. Objective
To predict letters from images.
2.1.2. Results
Fig: 2.1.2.1 Handwritten letters prediction
2.1.3. Model Summary
Model: "sequential_1"
Layer (type) Output Shape PARAM #
conv2d_2 (Conv2D) multiple 640
max_pooling2d_2 (MaxPooling2 multiple 0
dropout_2 (Dropout) multiple 0
conv2d_3 (Conv2D) multiple 36928
max_pooling2d_3 (MaxPooling2 multiple 0
dropout_3 (Dropout) multiple 0
global_average_pooling2d_1 multiple 0
dense_1 (Dense) multiple 1690
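A Keras model consistent with the layer list and parameter counts above would look roughly like the sketch below. The kernel sizes, dropout rates, 28x28x1 input shape, and 26 output classes (one per letter) are inferred from the parameter counts and are therefore assumptions, not taken from the report itself:

# Plausible reconstruction of the summarised model (illustrative; see assumptions above).
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Conv2D(64, (3, 3), activation="relu", input_shape=(28, 28, 1)),  # 640 params
    keras.layers.MaxPooling2D(),
    keras.layers.Dropout(0.25),
    keras.layers.Conv2D(64, (3, 3), activation="relu"),                           # 36,928 params
    keras.layers.MaxPooling2D(),
    keras.layers.Dropout(0.25),
    keras.layers.GlobalAveragePooling2D(),
    keras.layers.Dense(26, activation="softmax"),                                 # 1,690 params (26 letters)
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()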
2.2. Stock Market Forecasting
2.2.1. Objective
To predict the future performance of the stock market.
2.2.2. Output
Fig:2.2.2.2 Prediction