B Ei

The statistician William S. Cleveland defined data science as an interdisciplinary field
larger than statistics itself.
Father of Data Science:
The modern conception of data science as an independent discipline is
sometimes attributed to William S. Cleveland.
Difference between Deep Learning and Machine Learning
Machine Learning means computers learning from data using algorithms to perform a
task without being explicitly programmed.
Deep learning uses a complex structure of algorithms modeled on the human brain.
This enables the processing of unstructured data such as documents, images, and text.
What is Data Science?
• It is about data gathering, analysis and decision making
• It helps to find patterns in data through analysis and predictions
• Helps to discover patterns and make better decisions
• Statistics is the science of analyzing data
• Data science is an interdisciplinary field that uses scientific
methods, processes, algorithms and systems to extract or
extrapolate knowledge and insights from noisy, structured and
unstructured data, and apply knowledge from data across a
broad range of application domains.
• Science is the intellectual and practical activity dealing with the
systematic study of the structure and behaviour of the physical
and natural world through observation and experiment; in data
science, that study is carried out on data
Data
• It is a collection of information
• Data can be categorized as structured or unstructured
• A variable can be measured or counted
• Data is categorized as numerical (discrete or
continuous), categorical (color or type) or
ordinal (grading system)
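As a minimal sketch in R, the categories listed above could be represented as follows. All variable names and values are made up for illustration.

```r
# Hypothetical example values, chosen only for illustration.
counts  <- c(2L, 5L, 3L)                     # numerical, discrete (counted)
heights <- c(1.62, 1.75, 1.80)               # numerical, continuous (measured)
colors  <- factor(c("red", "blue", "red"))   # categorical (no order)
grades  <- factor(c("B", "A", "C"),
                  levels = c("C", "B", "A"),
                  ordered = TRUE)            # ordinal (ordered categories)

is.ordered(colors)      # FALSE: the categories have no ranking
is.ordered(grades)      # TRUE: C < B < A
grades[1] > grades[3]   # TRUE: "B" ranks above "C"
```

An ordered factor is one reasonable way to store ordinal data in R because it allows comparisons such as `>` while still treating the values as categories.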
By understanding the various techniques, methods, tools and analytical
approaches, data scientists can help the organizations that employ them achieve
strategic and competitive benefits.
The three types of machine learning methods
• Supervised Learning: It is based on the outcomes of a similar process
in the past. Supervised learning helps in predicting an outcome
based on historical patterns. Algorithms for supervised
learning include SVMs, Random Forest and Linear Regression.
• Unsupervised Learning: This learning method works without an
existing (labeled) outcome. Instead, it focuses on analyzing the
connections and the relationships between data elements. An
example of an unsupervised learning algorithm is K-means
clustering.
• Reinforcement Learning: Reinforcement Learning (RL) is a type of
machine learning technique that enables an agent to learn in an
interactive environment by trial and error using feedback from its
own actions and experiences. Algorithms for RL include
Q-Learning and Deep Q Network.
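The contrast between the first two methods can be sketched in base R. This is an illustration on made-up numbers, not a recommended workflow: `lm()` stands in for a supervised method and `kmeans()` for an unsupervised one.

```r
# Supervised learning: the outcome y is known for the historical data.
x <- c(1, 2, 3, 4, 5)
y <- 2 * x + 1                       # made-up outcome with an exact linear pattern
fit  <- lm(y ~ x)                    # linear regression learns from (x, y) pairs
pred <- predict(fit, newdata = data.frame(x = 6))
pred                                 # prediction for an unseen input

# Unsupervised learning: no outcome column, only the data points.
set.seed(1)                          # k-means starts from random centers
pts <- matrix(c(1, 1.2, 0.8, 10, 10.2, 9.8), ncol = 1)
km  <- kmeans(pts, centers = 2, nstart = 5)
km$cluster                           # group label assigned to each point
```

The cluster labels themselves are arbitrary; what matters is which points end up sharing a label.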
What is a data science model?
A data science model organizes data elements and
standardizes how the data elements relate to one
another and to the properties of real-world entities.
What are the different models in data analytics?
Different models in data analytics include linear
regression, logistic regression, SVMs (Support Vector
Machines), Random Forest, Naïve Bayes Classifiers
and Decision Trees etc.
What are the different ML models?
Each machine learning algorithm can be categorized
into one of three types: supervised learning,
unsupervised learning, or reinforcement learning.
What are the main types of data science models?
Different models in data analytics include linear regression,
logistic regression, SVMs (Support Vector Machines), Random
Forest, Naïve Bayes Classifiers and Decision Trees
What are the 3 main uses of data science?
Healthcare: Data science can identify and predict disease, and
personalize healthcare recommendations.
Which tool is best for data science?
Data Science Tools For Data Storage
• Apache Hadoop is a free, open-source framework that can
manage and store tons and tons of data. ...
• Tableau is the most popular data visualization tool used in
the market. ...
• QlikView is another data visualization tool
What are the main components of data science?
Statistics: Statistics is one of the most important components of data science. ...
Domain Expertise: In data science, domain expertise binds data science together. ...
Data engineering: Data engineering is a part of data science, which involves
acquiring, storing, retrieving, and transforming the data.
6 Types of Data in Statistics & Research: Key in Data Science
• Quantitative data. Quantitative data seems to be
the easiest to explain. ...
• Qualitative data. Qualitative data can't be
expressed as a number and can't be measured. ...
• Nominal data
• Ordinal data
• Discrete data
• Continuous data
Disciplines of Data Science
• data engineering
• data preparation
• data mining
• predictive analytics
• machine learning
• data visualization
• statistics
• mathematics
• software programming
What are data science tools?
• MATLAB, as a popular data science tool, finds
numerous applications in Data Science.
• R programming
• Python
Requirements for Data Science
• Knowing which programming language to use for learning data science.
• The mathematics behind the methods, in particular important concepts
in linear algebra.
• A good understanding of machine learning and data science algorithms.
• Statistics that are relevant for data science.
• Describing data analysis problems in a structured framework.
• Identifying comprehensive solution strategies for data analysis
problems, and classifying and recognizing different types of data analysis
problems.
• Providing conceptual descriptions that are easy to understand for
selected machine learning algorithms.
• Validation: deciding what to use, running the algorithm, and then checking
whether your assumptions are validated by the results.
• Judging the appropriateness of the proposed solution based on the
observed results.
1. Data Quality and Monitoring
• Build a robust data quality subsystem
• Each error instance of data quality is captured
• Based on data warehouse
2. Data Modelling
• Evaluate various models/algorithms (clustering,
classification, regression, ...)
• Tune parameters
• Iterative experiments
Processing
Applications
R programming
• R is an expression language with a very simple syntax.
• R is an open-source programming language that is widely used as statistical
software and a data analysis tool.
• R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New
Zealand.
• It is case sensitive, as are most UNIX-based packages.
• Commands are separated either by a semi-colon (‘;’) or by a newline.
• Elementary commands can be grouped together into one compound expression by
braces (‘{’ and ‘}’).
• Comments can be put almost anywhere, starting with a hashmark (‘#’).
• If a command is not complete at the end of a line, R will give a different prompt, by
default +
• RStudio is an integrated development environment for R.
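The syntax rules above can be seen together in a short script. This is only an illustrative sketch; the names `a`, `b`, `s` and `total` are arbitrary.

```r
# Comments start with a hash mark and run to the end of the line.
a <- 1; b <- 2          # two commands on one line, separated by a semicolon

# Braces group elementary commands into one compound expression;
# its value is the value of the last command inside.
total <- {
  s <- a + b
  s * 10
}
total                   # 30

# R is case sensitive: 'Total' would be a different name from 'total'.
```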
Data structures are very important to understand because these are the objects you will
manipulate on a day-to-day basis in R
Everything in R is an object.
R has 6 basic data types
• character
• numeric (real or decimal)
• integer
• logical
• complex
• raw
Numeric Constants:
• All numbers fall under this category. They can be of type integer, double or
complex.
• It can be checked with the typeof() function.
• Numeric constants followed by L are regarded as integer and those followed
by i are regarded as complex.
> typeof(5)
[1] "double"
> typeof(5.9)
[1] "double"
> typeof(5.9i)
[1] "complex"
> typeof(5L)
[1] "integer"
> typeof("5")
[1] "character"
> month.name
[1] "January" "February" "March" "April" "May" "June"
[7] "July" "August" "September" "October" "November" "December"
> 'character'
[1] "character"
R provides many functions to examine features of vectors and other objects, for
example
• class() - what kind of object is it (high-level)?
• typeof() - what is the object’s data type (low-level)?
• length() - how long is it? What about two dimensional objects?
• attributes() - does it have any metadata?
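A small sketch of these four functions on a vector and on a two-dimensional object. The objects `v` and `m` are made up for illustration, and the `class(m)` result shown assumes R 4.0.0 or later.

```r
v <- c(1.5, 2.5, 3.5)
class(v)        # "numeric"
typeof(v)       # "double"
length(v)       # 3
attributes(v)   # NULL: a plain vector carries no metadata

m <- matrix(1:6, nrow = 2)
class(m)        # "matrix" "array" (R 4.0.0 and later)
typeof(m)       # "integer"
length(m)       # 6: the total number of elements
dim(m)          # 2 3: rows and columns
```

For a two-dimensional object, `length()` counts all elements, so `dim()` is the function to use when the number of rows and columns is wanted.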
Object Attributes
Objects can have attributes. Attributes are part of the object. These include:
• names
• dimnames
• dim
• class: used to check the class
• attributes (contain metadata)
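The attributes listed above can be set and inspected directly. The following is a minimal sketch with hypothetical labels.

```r
scores <- c(90, 75, 82)
names(scores) <- c("ann", "bob", "cy")   # hypothetical labels
attributes(scores)                       # a list holding $names

tbl <- 1:6
dim(tbl) <- c(2, 3)                      # setting dim turns the vector into a matrix
dimnames(tbl) <- list(c("r1", "r2"), c("a", "b", "c"))
attributes(tbl)                          # $dim and $dimnames
tbl["r2", "b"]                           # 4: elements are filled column by column
```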
R Data Structures
• R Vectors
• R Matrix
• R array
• List in R
• R Data Frame
• R Factor
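One way to create each of these structures in base R is sketched below. The object names and values are arbitrary.

```r
vec <- c(10, 20, 30)                          # vector: one type, one dimension
mat <- matrix(1:6, nrow = 2)                  # matrix: one type, two dimensions
arr <- array(1:24, dim = c(2, 3, 4))          # array: one type, any number of dimensions
lst <- list(id = 1L, tag = "a", ok = TRUE)    # list: elements may differ in type
df  <- data.frame(name = c("x", "y"),         # data frame: columns of equal length,
                  value = c(1.5, 2.5))        #   each column may have its own type
fac <- factor(c("low", "high", "low"))        # factor: categorical values with levels

str(df)       # compact summary of the data frame
levels(fac)   # "high" "low" (sorted alphabetically by default)
```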
A matrix is a two-dimensional data structure in R programming.
A matrix is similar to a vector but additionally contains the dimension attribute.
All attributes of an object can be checked with the attributes() function
(dimension can be checked directly with the dim() function).
We can check if a variable is a matrix or not with the class() function.
Matrices
> x <- matrix(1:9, nrow = 3, ncol = 3)
> x
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
> # It is also possible to change names
> colnames(x) <- c("C1","C2","C3")
> rownames(x) <- c("R1","R2","R3")
> matrix(1:9, nrow=3, byrow=TRUE)
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
> x <- c(1,2,3,4,5,6)
> class(x)
[1] "numeric"
> dim(x) <- c(2,3)
> print(x)
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
> class(x)
[1] "matrix" "array"
• Regression is a statistical method for estimating
a relationship between one dependent variable
(usually denoted by Y) and a series of other
variables (known as independent variables).
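A minimal sketch of such an estimate with base R's `lm()`, on made-up data where the true relationship is known exactly. The column names `x1`, `x2` and `Y` are illustrative.

```r
# Made-up data: Y depends on two independent variables x1 and x2.
dat <- data.frame(x1 = c(1, 2, 3, 4, 5, 6),
                  x2 = c(2, 1, 4, 3, 6, 5))
dat$Y <- 3 + 2 * dat$x1 - 1 * dat$x2     # exact relationship, no noise

fit <- lm(Y ~ x1 + x2, data = dat)       # estimate the relationship
coef(fit)                                # intercept and one slope per variable
```

Because the data contain no noise, the estimated coefficients recover the intercept 3 and the slopes 2 and -1 up to floating-point error; real data would give estimates with uncertainty.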
Dimensionality reduction
• Dimensionality reduction is the transformation of data from a high-dimensional space
into a low-dimensional space. It is the process of reducing
the number of random variables under consideration by obtaining a set of principal
variables. It can be divided into feature selection and feature extraction.
• In machine learning classification problems, there are often too many factors on the
basis of which the final classification is done. These factors are basically variables
called features. The higher the number of features, the harder it gets to visualize the
training set and then work on it. Sometimes, most of these features are correlated,
and hence redundant. This is where dimensionality reduction algorithms come into
play.
• Dimensionality reduction is a machine learning (ML) or statistical technique for
reducing the number of random variables in a problem by obtaining a set of
principal variables
• Principal Component Analysis (PCA), Factor Analysis (FA), Linear Discriminant
Analysis (LDA) and Truncated Singular Value Decomposition (SVD) are examples of
linear dimensionality reduction methods
• Dimensionality reduction is common in fields that deal with large numbers of
observations and/or large numbers of variables, such as signal processing,
speech recognition, neuroinformatics, and bioinformatics.
• Dimensionality reduction can be used for noise reduction, data visualization,
cluster analysis, or as an intermediate step to facilitate other analyses
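As a sketch of feature extraction on correlated features, base R's `prcomp()` can be applied to made-up data in which the second feature is nearly redundant. The data and the decision to keep one component are assumptions for illustration only.

```r
# Made-up data: two features that are almost perfectly correlated.
set.seed(42)
f1 <- rnorm(50)
f2 <- 2 * f1 + rnorm(50, sd = 0.05)   # f2 is nearly redundant given f1
X  <- cbind(f1, f2)

pca <- prcomp(X, center = TRUE, scale. = TRUE)
var_share <- pca$sdev^2 / sum(pca$sdev^2)
var_share                             # share of variance per principal component

reduced <- pca$x[, 1, drop = FALSE]   # keep only the first component
dim(reduced)                          # 50 rows, 1 column instead of 2
```

Here almost all of the variance falls on the first component, which is why dropping the second loses very little information.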
Statistical technique
• Principal component analysis (PCA) performs a linear mapping of the data to a lower-dimensional
space in such a way that the variance of the data in the low-dimensional representation is
maximized.
• Principal component analysis can be employed in a nonlinear way by means of the kernel
trick.
• Linear discriminant analysis (LDA) is a generalization of Fisher's linear discriminant, a method
used in statistics, pattern recognition and machine learning to find a linear combination of
features that characterizes or separates two or more classes of objects or events
• T-distributed Stochastic Neighbor Embedding (t-SNE) is a nonlinear dimensionality reduction
technique useful for visualization of high-dimensional datasets. It is not recommended for
use in analysis such as clustering or outlier detection since it does not necessarily preserve
densities or distances well
• Multidimensional scaling (MDS) is a visual representation of distances or dissimilarities
between sets of objects (for example Euclidean, Manhattan, Chebyshev, or Minkowski
distance)
• Factor Analysis (FA) is an exploratory data analysis method used to search for influential
underlying factors or latent variables from a set of observed variables
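Multidimensional scaling can be sketched with base R's `dist()` and `cmdscale()` (classical MDS). The four objects below are hypothetical points chosen so that the distances are easy to check by hand.

```r
# Four hypothetical objects described by two features each.
objs <- rbind(a = c(0, 0), b = c(3, 0), c = c(0, 4), d = c(3, 4))

d_euc <- dist(objs, method = "euclidean")   # pairwise Euclidean distances
d_man <- dist(objs, method = "manhattan")   # pairwise Manhattan distances

coords <- cmdscale(d_euc, k = 2)            # classical MDS into 2 dimensions
# The embedding reproduces the original Euclidean distances
# (up to rotation and reflection of the points).
max(abs(dist(coords) - d_euc))              # close to zero
```

Since the objects already live in two dimensions, a two-dimensional embedding loses nothing; with higher-dimensional data the embedded distances would only approximate the originals.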
Distances
• The Manhattan distance between two points $x = (x_1, x_2, \dots, x_n)$ and $y = (y_1, y_2, \dots,
y_n)$ in n-dimensional space is the sum of the absolute differences in each dimension.
The Manhattan distance is defined by $D_m(x, y) = \sum_{i=1}^{n} |x_i - y_i|$
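A direct translation of the Manhattan distance into R, as a small sketch. The function name `manhattan` and the example points are illustrative.

```r
# Manhattan distance: sum of absolute coordinate differences.
manhattan <- function(x, y) {
  stopifnot(length(x) == length(y))
  sum(abs(x - y))
}

manhattan(c(1, 2, 3), c(4, 0, 3))   # |1-4| + |2-0| + |3-3| = 5
```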
• Minkowski distance is a distance measured between two points in N-dimensional
space. It is basically a generalization of the Euclidean distance and the Manhattan
distance.
For two points $x = (x_1, \dots, x_n)$ and $y = (y_1, \dots, y_n)$, the Minkowski distance of
order $p \ge 1$ is given by $D_p(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$.
In a 3-dimensional space this is $\left( |x_1 - y_1|^p + |x_2 - y_2|^p + |x_3 - y_3|^p \right)^{1/p}$.
Setting $p = 1$ gives the Manhattan distance and $p = 2$ gives the Euclidean distance.
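The sense in which the Minkowski distance generalizes the other two can be checked numerically. In this sketch, `minkowski` is an illustrative helper that raises the absolute coordinate differences to the power p, sums them, and takes the p-th root; the points are made up.

```r
minkowski <- function(x, y, p) {
  stopifnot(length(x) == length(y), p >= 1)
  sum(abs(x - y)^p)^(1 / p)
}

x <- c(0, 0, 0)
y <- c(3, 4, 0)
minkowski(x, y, p = 1)   # 7, the Manhattan distance
minkowski(x, y, p = 2)   # 5, the Euclidean distance
```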
What is Euclidean Distance? In mathematics, the Euclidean distance is defined as the
distance between two points.
Thus, the Euclidean distance formula is given by:
$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$
Where,
“d” is the Euclidean distance
$(x_1, y_1)$ is the coordinate of the first point
$(x_2, y_2)$ is the coordinate of the second point.
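A worked example of the Euclidean formula in R. The helper name `euclidean` and the two points are chosen for illustration.

```r
# Euclidean distance: square root of the summed squared differences.
euclidean <- function(p1, p2) sqrt(sum((p2 - p1)^2))

euclidean(c(1, 2), c(4, 6))   # sqrt(3^2 + 4^2) = 5
```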
The Chebyshev distance, commonly known as the
"maximum metric" in mathematics, measures distance between two
points as the maximum absolute difference over any of their axis values. In a 2D
grid, for instance, if we have two points $(x_1, y_1)$ and $(x_2, y_2)$, the
Chebyshev distance between them is $\max(|x_2 - x_1|, |y_2 - y_1|)$.
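A short sketch of the Chebyshev distance in R, taking the absolute value of each coordinate difference before the maximum so that the result does not depend on the order of the points. The helper name `chebyshev` and the points are illustrative.

```r
# Chebyshev distance: largest absolute difference along any axis.
chebyshev <- function(p1, p2) max(abs(p2 - p1))

chebyshev(c(1, 5), c(4, 3))   # max(|4-1|, |3-5|) = 3

# Base R's dist() calls this metric "maximum".
dist(rbind(c(1, 5), c(4, 3)), method = "maximum")
```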
