Data Science Notes
TCS 733
Compiled by Dr. Vijay Singh
Associate Professor,
Department of Computer Science and Engineering
Graphic Era Deemed to be University, Dehradun
+91-9760322316
Vijaysingh.cse@geu.ac.in
Syllabus
Data science is an interdisciplinary field that uses
scientific methods, processes, algorithms and systems
to extract or extrapolate knowledge and insights from
noisy, structured and unstructured data, and apply
knowledge from data across a broad range of
application domains. Data science is related to data
mining, machine learning and big data.
Data science
A data scientist is a person who knows how to extract insights from the
data by using various processes, methods, systems, and algorithms.
A data scientist requires a range of skills to analyze, interpret, and
visualize data in order to make informed decisions.
• R-intro.pdf (r-project.org)
• Arithmetic operators
• Assignment operators
• Comparison operators
• Logical operators
• Miscellaneous operators
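A quick sketch of each operator family at the R prompt:

```r
# Arithmetic operators
5 + 3        # 8
2^3          # 8 (exponent)
7 %% 3       # 1 (modulus: remainder of division)
7 %/% 3      # 2 (integer division)

# Assignment operator
x <- 10

# Comparison operators
x == 10      # TRUE
x != 5       # TRUE

# Logical operators
TRUE & FALSE # FALSE
TRUE | FALSE # TRUE

# Miscellaneous operators: sequence and membership
1:5          # 1 2 3 4 5
3 %in% 1:5   # TRUE
```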
Conditional expression in R
A conditional expression or a conditional statement is a programming
construct where a decision is made to execute some code based on a
Boolean (true or false) condition. A more commonly used term for
conditional expression in programming is an 'if-else' condition. In plain
English, this is stated as 'if this test is true then do this operation;
otherwise do this different operation'.
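A minimal sketch of an if-else condition in R:

```r
x <- 7
if (x %% 2 == 0) {   # Boolean test: is x divisible by 2?
  print("even")
} else {
  print("odd")
}

# ifelse() is the vectorized form of the same idea
ifelse(c(1, 2, 3) %% 2 == 0, "even", "odd")  # "odd" "even" "odd"
```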
Loops in R
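R's main loop constructs are for, while and repeat; a short sketch:

```r
# for: iterate over the elements of a vector
for (i in 1:3) {
  cat("iteration", i, "\n")
}

# while: repeat as long as a condition holds
n <- 1
while (n < 10) {
  n <- n * 2
}
n  # 16

# repeat: loop until an explicit break
count <- 0
repeat {
  count <- count + 1
  if (count >= 3) break
}
count  # 3
```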
An R script file is a file with the extension “.R” that
contains a program (a set of commands). Rscript is the
R interpreter that executes the R commands present in
the script file.
Functions in R
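A function in R is created with the function keyword; a minimal example:

```r
# Define a function that squares its input and returns the result
square <- function(x) {
  x^2  # the last evaluated expression is the return value
}

square(4)    # 16
square(1:3)  # 1 4 9 - the arithmetic is vectorized
```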
A random variable, usually written X, is a variable
whose possible values are numerical outcomes of a
random phenomenon. There are two types of random
variables, discrete and continuous.
Discrete Random Variables
A discrete random variable is one which may take on only a countable number of
distinct values such as 0, 1, 2, 3, 4, … Discrete random variables are usually (but not
necessarily) counts. If a random variable can take only a finite number of distinct
values, then it must be discrete. Examples of discrete random variables include the
number of children in a family, the Friday night attendance at a cinema, the number
of patients in a doctor's surgery, the number of defective light bulbs in a box of ten.
For a discrete random variable X with P(X = 1) = 0.1, P(X = 2) = 0.3 and P(X = 3) = 0.4, the
probability that X is equal to 2 or 3 is the sum of the two probabilities: P(X = 2 or X = 3) = P(X
= 2) + P(X = 3) = 0.3 + 0.4 = 0.7. Similarly, the probability that X is greater than 1 is 1 - P(X
= 1) = 1 - 0.1 = 0.9, by the complement rule. This distribution may also be described by
a probability histogram (figure not reproduced here).
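These calculations can be reproduced in R. The full distribution is not shown in the notes, so placing the remaining probability mass of 0.2 on X = 4 is purely an illustrative assumption:

```r
# Hypothetical distribution consistent with the text:
# P(X=1)=0.1, P(X=2)=0.3, P(X=3)=0.4; the remaining 0.2 is
# assigned to X=4 for illustration only (not stated in the notes)
p <- c("1" = 0.1, "2" = 0.3, "3" = 0.4, "4" = 0.2)

sum(p)           # 1   - probabilities of a discrete RV sum to 1
p["2"] + p["3"]  # 0.7 - P(X = 2 or X = 3), addition rule
1 - p["1"]       # 0.9 - P(X > 1), complement rule
```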
Continuous Random Variables
A continuous random variable is one which takes an infinite number of
possible values. Continuous random variables are usually
measurements. Examples include height, weight, the amount of sugar
in an orange, the time required to run a mile.
(Definition taken from Valerie J. Easton and John H. McColl's Statistics
Glossary v1.1)
A continuous random variable is not defined at specific values. Instead,
it is defined over an interval of values, and is represented by the area
under a curve (in advanced mathematics, this is known as an integral).
The probability of observing any single value is equal to 0, since the
number of values which may be assumed by the random variable is
infinite.
Continuous Random Variables
Suppose a random variable X may take all values over an interval of real
numbers. Then the probability that X is in the set of outcomes A, P(A),
is defined to be the area above A and under a curve. The curve, which
represents a function p(x), must satisfy the following: the curve has no
negative values (p(x) ≥ 0 for all x), and the total area under the curve
is equal to 1.
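As a concrete sketch in R, take the uniform distribution on [0, 10], whose density curve is the flat line p(x) = 0.1 on that interval:

```r
# Probabilities for a continuous random variable are areas under the curve
punif(3, min = 0, max = 10)        # P(X <= 3) = 0.3
punif(7, 0, 10) - punif(3, 0, 10)  # P(3 < X < 7) = 0.4
dunif(5, 0, 10)                    # density at 5 is 0.1 - not a probability

# The total area under the density curve must be 1
integrate(dunif, lower = 0, upper = 10, min = 0, max = 10)$value
```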
Missing Values
In statistics, missing data, or missing values, occur when no data value
is stored for the variable in an observation. Missing data are a common
occurrence and can have a significant effect on the conclusions that can
be drawn from the data.
• The problem of missing values is quite common in many real-life
datasets. Missing values can bias the results of machine learning
models and/or reduce the accuracy of a model.
Why Is Data Missing From the Dataset?
• There can be multiple reasons why certain values are missing from the data.
• Reasons for the missing data from the dataset affect the approach of handling
missing data.
• So it’s necessary to understand why the data could be missing.
Missing Completely At Random (MCAR)
In the case of MCAR (Missing Completely At Random), the data could be
missing due to human error, a system or equipment failure, loss of a
sample, or some unsatisfactory technicalities while recording the values.
For Example, suppose in a library there are some overdue books. Some
values of overdue books in the computer system are missing. The
reason might be a human error like the librarian forgot to type in the
values. So, the missing values of overdue books are not related to any
other variable/data in the system.
MCAR should not be assumed by default, as it is a rare case in practice.
The advantage of such data is that the statistical analysis remains
unbiased.
Missing At Random (MAR)
• Missing at random (MAR) means that the reason for missing values can be
explained by variables on which you have complete information as there is
some relationship between the missing data and other values/data.
• In this case, the data is not missing for all the observations. It is missing
only within sub-samples of the data and there is some pattern in the
missing values.
• For example, if you check the survey data, you may find that all the people
have answered their ‘Gender’ but ‘Age’ values are mostly missing for
people who have answered their ‘Gender’ as ‘female’. (The reason being
most of the females don’t want to reveal their age.)
• So, the probability of data being missing depends only on the observed
data.
Missing At Random (MAR)
• In this case, the variables ‘Gender’ and ‘Age’ are related and the
reason for missing values of the ‘Age’ variable can be explained by the
‘Gender’ variable but you can not predict the missing value itself.
• Suppose a poll is taken about overdue library books, asking for gender
and the number of overdue books. Assume that most women answer the
poll while men are less likely to answer. Then why the data is missing
can be explained by another factor: gender.
• In this case, the statistical analysis might result in bias.
Missing Not At Random (MNAR)
• Missing values depend on the unobserved data.
• If there is some structure/pattern in missing data and other observed
data can not explain it, then it is Missing Not At Random (MNAR).
• If the missing data does not fall under the MCAR or MAR then it can
be categorized as MNAR.
• It can happen due to the reluctance of people in providing the
required information. A specific group of people may not answer
some questions in a survey.
Missing Not At Random (MNAR)
• For example, suppose a library poll asks for a person’s name and
their number of overdue books. People having no overdue books are
likely to answer the poll, while people having many overdue books are
less likely to answer.
• So in this case, the missing value of the number of overdue books
depends on the number of overdue books itself, which is unobserved.
How to deal with missing values using R
R has various packages to deal with the missing data.
List of R Packages
• MICE
• Amelia
• missForest
• Hmisc
• mi
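Each of these packages has its own API; before reaching for them, the basic detection and imputation steps can be sketched in base R (mean imputation shown only for illustration):

```r
# Toy vector with a missing value
x <- c(4, NA, 6, 8)

is.na(x)               # FALSE TRUE FALSE FALSE - locate missing entries
mean(x)                # NA: most R functions propagate NA by default
mean(x, na.rm = TRUE)  # 6: compute while ignoring missing values

# Simple mean imputation (na.omit(x) would instead drop the value)
x[is.na(x)] <- mean(x, na.rm = TRUE)
x                      # 4 6 6 8
```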
Decoding the job description
The data analyst role is one of many job titles that contain the word “analyst.” To name a few
others that sound similar but may not be the same role:
• Business analyst — analyzes data to help businesses improve processes, products, or services
• Data analytics consultant — analyzes the systems and models for using data
• Data engineer — prepares and integrates data from different sources for analytical use
• Data scientist — uses expert skills in technology and social science to find trends through data
analysis
• Data specialist — organizes or converts data for use in databases or software systems
• Operations analyst — analyzes data to assess the performance of business operations and
workflows
The six data analysis phases
There are six data analysis phases that will help you make seamless decisions: ask,
prepare, process, analyze, share, and act. Keep in mind, these are different from
the data life cycle, which describes the changes data goes through over its lifetime.
Let’s walk through the steps to see how they can help you solve problems you
might face on the job.
Step 1: Ask
It’s impossible to solve a problem if you don’t know what it is. These are some things to consider:
• Define the problem you’re trying to solve
• Make sure you fully understand the stakeholder’s expectations
• Focus on the actual problem and avoid any distractions
• Collaborate with stakeholders and keep an open line of communication
• Take a step back and see the whole situation in context
Step 3: Process
Clean data is the best data, and you will need to clean up your data to get rid of any possible
errors, inaccuracies, or inconsistencies. This might mean:
• Using spreadsheet functions to find incorrectly entered data
• Using SQL functions to check for extra spaces
• Removing repeated entries
• Checking as much as possible for bias in the data
Step 4: Analyze
You will want to think analytically about your data. At this stage, you might sort and format your
data to make it easier to:
• Perform calculations
• Combine data from multiple sources
• Create tables with your results
Along the way, check for common types of bias:
• Observer bias
• Interpretation bias
• Confirmation bias
Data Visualization
In ggplot2, a graph is composed of the following components:
• data
• aesthetic mapping
• geometric object
• statistical transformations
• scales
• coordinate system
• position adjustments
• faceting
• You pass the dataset mtcars to ggplot().
• Inside the aes() argument, you map the x-axis to the factor variable cyl.
• The + sign means you want R to keep reading the code. It makes the code more readable by
breaking it into lines.
• Use geom_bar() for the geometric object.
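The bullets above correspond to code along these lines (assuming the ggplot2 package is installed):

```r
# Bar chart of counts per cylinder group in the built-in mtcars dataset
library(ggplot2)

p <- ggplot(mtcars, aes(x = factor(cyl))) +  # data and aesthetic mapping
  geom_bar()                                 # geometric object: bars of counts
print(p)  # draws the plot
```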
The four Vs of Big Data
Big data is the collective name for the large amount of recorded digital
data and its continual growth. The aim is to convert this stream of
information into valuable information for the company.
To gain more insight into Big Data, IBM devised the system of the four
Vs. These Vs stand for the four dimensions of Big Data: Volume,
Velocity, Variety and Veracity.
Volume
The high speed and considerable volume are related to the variety of
forms of data. After all, smart IT solutions are available today for all
sectors, from the medical world to construction and business. Consider,
for example, the electronic patient records in healthcare, which
contribute to many trillions of gigabytes of data. And that’s not even
counting the videos we watch on YouTube, the posts we share on
Facebook and the blog articles we write. When all parts of the world
have internet access in the future, the volume and variety will only
increase.
Veracity
How truthful Big Data is remains a difficult point. Data quickly becomes
outdated and the information shared via the internet and social media
does not necessarily have to be correct. Many managers and directors
in the business community do not dare to make decisions based on Big
Data. Data scientists and IT professionals have their hands full
organizing and accessing the right data. It is very important that they
find a good way to do this, because if Big Data is organized and used in
the right way, it can be of great value in our lives, from predicting
business trends to preventing disease and crime.
Classification vs Regression
Classification predictive modeling problems are different from
regression predictive modeling problems.
• Classification is the task of predicting a discrete class label.
• Regression is the task of predicting a continuous quantity.
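A minimal base-R sketch of the distinction, using the built-in mtcars data (the variable choices are illustrative):

```r
# Regression: predict a continuous quantity (mpg) from weight with lm()
fit_reg <- lm(mpg ~ wt, data = mtcars)
predict(fit_reg, data.frame(wt = 3))  # a continuous prediction

# Classification: predict a discrete label (am: 0 = automatic, 1 = manual)
# using logistic regression via glm()
fit_cls <- glm(am ~ wt, data = mtcars, family = binomial)
prob <- predict(fit_cls, data.frame(wt = 3), type = "response")
as.integer(prob > 0.5)  # a discrete class label (0 or 1)
```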
• R Operators (w3schools.com)