Data Science Introduction_lecture Class.ppt
Data Science is a combination of multiple disciplines that uses statistics, data analysis, and
machine learning to analyze data and to extract knowledge and insights from it.
Data Science is about finding patterns in data through analysis, and making future predictions.
Data Science can be applied in nearly every part of a business where data is available. Examples
are:
Consumer goods
Stock markets
Industry
Politics
Logistic companies
E-commerce
A Data Scientist requires knowledge in several fields:
Machine Learning
Statistics
Programming (Python or R)
Mathematics
Databases
A Data Scientist must find patterns within the data. Before the patterns can be found, the data
must be organized in a standard format.
Where to Start?
In this tutorial, we will start by presenting what data is and how data can be analyzed.
You will learn how to use statistics and mathematical functions to make predictions.
One purpose of Data Science is to structure data, making it interpretable and easy to work with.
Structured data
Unstructured data
Unstructured Data
Unstructured data is not organized. We must organize the data for analysis purposes.
Structured Data
Structured data is organized and easier to work with.
How to Structure Data?
We can use an array or a database table to structure or present data.
Example of an array:
[80, 85, 90, 95, 100, 105, 110, 115, 120, 125]
Example
Array = [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]
print(Array)
In this tutorial we will try to make it as easy as possible to understand the concepts of Data
Science. We will therefore work with a small data set that is easy to interpret.
Database Table
A database table is a table with structured data.
The following table shows a database table with health data extracted from a sports watch:
Duration  Average_Pulse  Max_Pulse  Calorie_Burnage  Hours_Work  Hours_Sleep
30        80             120        240              10          7
30        85             120        250              10          7
45        90             130        260              8           7
45        95             130        270              8           7
This data set contains information about typical training sessions, such as duration, average
pulse, calorie burnage, etc.
Variables
A variable is defined as something that can be measured or counted.
In the example below, we can observe that each column represents a variable:
Duration  Average_Pulse  Max_Pulse  Calorie_Burnage  Hours_Work  Hours_Sleep
30        80             120        240              10          7
30        85             120        250              10          7
45        90             130        260              8           7
45        95             130        270              8           7
There are 6 columns, meaning that there are 6 variables (Duration, Average_Pulse, Max_Pulse,
Calorie_Burnage, Hours_Work, Hours_Sleep).
There are 11 rows in the full data set, meaning that each variable has 10 observations (only the
first four are shown above).
But if there are 11 rows, how come there are only 10 observations?
It is because the first row is the label, meaning that it is the name of the variable.
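As a small sketch of this (using a fictional in-memory CSV with the health data), Pandas treats the first row as column labels, so it is not counted as an observation:

```python
import io
import pandas as pd

# A fictional CSV with 5 lines: 1 header row + 4 observation rows
csv_text = """Duration,Average_Pulse,Max_Pulse,Calorie_Burnage,Hours_Work,Hours_Sleep
30,80,120,240,10,7
30,85,120,250,10,7
45,90,130,260,8,7
45,95,130,270,8,7"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)  # (4, 6): 4 observations and 6 variables - the header row is not counted
```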
Python
Python is a programming language widely used by Data Scientists.
Python has built-in mathematical libraries and functions, making it easier to solve
mathematical problems and to perform data analysis.
Python Libraries
Python has libraries with large collections of mathematical functions and analytical tools.
Pandas - used for structured data operations, such as importing CSV files, creating
DataFrames, and preparing data
Numpy - a mathematical library with a powerful N-dimensional array object, linear algebra,
Fourier transforms, etc.
Matplotlib - used for visualization of data
SciPy - a scientific computing library with modules for linear algebra, optimization, and statistics
Let's define a data frame with 3 columns and 5 rows with fictional numbers:
Example
import pandas as pd
d = {'col1': [1, 2, 3, 4, 7], 'col2': [4, 5, 6, 9, 5], 'col3': [7, 8, 12, 1, 11]}
df = pd.DataFrame(data=d)
print(df)
Example Explained
We write pd. in front of DataFrame() to tell Python that we want to use the DataFrame()
constructor from the Pandas library.
Do not be confused by the vertical numbers ranging from 0-4. They form the index, which tells us
the position of each row.
Example
count_column = df.shape[1]  # number of columns
print(count_column)
Example
count_row = df.shape[0]  # number of rows
print(count_row)
Why Can We Not Just Count the Rows and Columns Ourselves?
If we work with larger data sets with many columns and rows, it is easy to miscount them by
hand. If we use the built-in functions in Python correctly, we ensure that the count is correct.
Data Science Functions
This chapter shows three commonly used functions when working with Data Science: max(),
min(), and mean().
Duration  Average_Pulse  Max_Pulse  Calorie_Burnage  Hours_Work  Hours_Sleep
30        80             120        240              10          7
30        85             120        250              10          7
45        90             130        260              8           7
45        95             130        270              8           7
We use an underscore (_) to separate words in variable names, because Python does not allow
spaces in names.
The max() function
The Python max() function is used to find the highest value in an array.
Example
Average_pulse_max = max(80, 85, 90, 95, 100, 105, 110, 115, 120, 125)
print(Average_pulse_max)
Example
Average_pulse_min = min(80, 85, 90, 95, 100, 105, 110, 115, 120, 125)
print(Average_pulse_min)
Example
import numpy as np
Calorie_burnage = [240, 250, 260, 270, 280, 290, 300, 310, 320, 330]
Average_calorie_burnage = np.mean(Calorie_burnage)
print(Average_calorie_burnage)
We write np. in front of mean to tell Python that we want to use the mean() function from the
Numpy library.
Data Science - Data Preparation
Before analyzing data, a Data Scientist must extract the data, and make it clean and valuable.
In the example below, we show you how to import data using Pandas in Python.
We use the read_csv() function to import a CSV file with the health data:
Example
import pandas as pd

# "data.csv" is an assumed file name for the health data set
health_data = pd.read_csv("data.csv", header=0, sep=",")

print(health_data)
Example Explained
Tip: If you have a large CSV file, you can use the head() function to only show the top 5 rows:
Example
import pandas as pd

health_data = pd.read_csv("data.csv", header=0, sep=",")  # assumed file name

print(health_data.head())
Data Cleaning
Look at the imported data. As you can see, the data are "dirty" with wrongly registered or missing
values:
Solution: We can remove the rows with missing observations to fix this problem.
When we load a data set using Pandas, all blank cells are automatically converted into "NaN"
values.
NaN stands for Not A Number and is one of the common ways to represent the missing value in
the data. It is a special floating-point value and cannot be converted to any other type than float.
NaN value is one of the major problems in Data Analysis.
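A quick sketch, with fictional values, of why a NaN forces the float type:

```python
import numpy as np
import pandas as pd

# NaN is itself a floating-point value
print(type(np.nan))  # <class 'float'>

# A column that contains a NaN is stored as float64,
# even if every other value is a whole number
s = pd.Series([120, np.nan, 130])
print(s.dtype)         # float64
print(s.isna().sum())  # 1 missing value
```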
So, removing the NaN cells gives us a clean data set that can be analyzed.
We can use the dropna() function to remove the NaNs. axis=0 means that we want to remove
all rows that contain a NaN value:
When inplace=True, the data is modified in place: the call returns nothing and the DataFrame
itself is updated. When inplace=False, which is the default, the operation is performed and a
modified copy of the object is returned.
Example
health_data.dropna(axis=0,inplace=True)
print(health_data)
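The difference between the two inplace modes can be sketched with a small fictional DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Max_Pulse": [120, np.nan, 130]})

# inplace=False (the default): the original keeps its NaN row, a cleaned copy is returned
clean_copy = df.dropna(axis=0)
print(len(df))          # 3
print(len(clean_copy))  # 2

# inplace=True: the DataFrame itself is updated, and the call returns None
df.dropna(axis=0, inplace=True)
print(len(df))          # 2
```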
Data Categories
To analyze data, we also need to know the types of data we are dealing with.
By knowing the type of your data, you will be able to know what technique to use when
analyzing them.
Data Types
We can use the info() function to list the data types within our data set:
Example
print(health_data.info())
Result:
We see that this data set has two different data types:
float64
object
We cannot perform calculations on object columns. We must convert the type object to
float64 (float64 is a number with decimals in Python).
We can use the astype() function to convert the data into float64.
The following example converts "Average_Pulse" and "Max_Pulse" into data type float64 (the
other variables are already of data type float64):
Example
health_data["Average_Pulse"] = health_data["Average_Pulse"].astype(float)
health_data["Max_Pulse"] = health_data["Max_Pulse"].astype(float)
print(health_data.info())
Result:
Example
print(health_data.describe())
Result:
Mathematical functions are important to know as a data scientist, because we want to make
predictions and interpret them.
Linear Functions
In mathematics a function is used to relate one variable to another variable.
Suppose we consider the relationship between calorie burnage and average pulse. It is reasonable
to assume that, in general, the calorie burnage will change as the average pulse changes - we say
that the calorie burnage depends upon the average pulse.
Furthermore, it may be reasonable to assume that as the average pulse increases, so will the
calorie burnage. Calorie burnage and average pulse are the two variables being considered.
Because the calorie burnage depends upon the average pulse, we say that calorie burnage is the
dependent variable and the average pulse is the independent variable.
The relationship between a dependent and an independent variable can often be expressed
mathematically using a formula (function).
A linear function has one independent variable (x) and one dependent variable (y), and has
the following form:
y = f(x) = ax + b
This function is used to calculate a value for the dependent variable when we choose a value for
the independent variable.
Explanation:
Let us say we want to predict calorie burnage using average pulse. We have the following
formula:
f(x) = 2x + 80
f(x) = The output. This number is where we get the predicted value of Calorie_Burnage
x = The input, which is Average_Pulse
2 = Slope = Specifies how much Calorie_Burnage increases if Average_Pulse increases by one. It
tells us how "steep" the diagonal line is
80 = Intercept = A fixed value. It is the value of the dependent variable when x = 0
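We can check this interpretation of the slope and intercept directly:

```python
# f(x) = 2x + 80: slope 2, intercept 80
def f(x):
    return 2 * x + 80

print(f(0))    # 80  - the intercept: the prediction when Average_Pulse is 0
print(f(100))  # 280
print(f(101))  # 282 - one extra pulse beat adds exactly the slope (2)
```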
The horizontal axis is generally called the x-axis. Here, it represents Average_Pulse.
The vertical axis is generally called the y-axis. Here, it represents Calorie_Burnage.
Calorie_Burnage is a function of Average_Pulse, because Calorie_Burnage is assumed to be
dependent on Average_Pulse.
In other words, we use Average_Pulse to predict Calorie_Burnage.
The blue (diagonal) line represents the structure of the mathematical function that predicts
calorie burnage.
Duration  Average_Pulse  Max_Pulse  Calorie_Burnage  Hours_Work  Hours_Sleep
30        80             120        240              10          7
30        85             120        250              10          7
45        90             130        260              8           7
45        95             130        270              8           7
The plot() function is used to draw the data points in a diagram:
Example
import matplotlib.pyplot as plt

# Plot Average_Pulse against Calorie_Burnage as a line (assumed column names)
health_data.plot(x="Average_Pulse", y="Calorie_Burnage", kind="line")
plt.show()
Example Explained
As it turns out:
There is a pattern. If average pulse increases by 10, the calorie burnage increases by 20.
f(x) = 2x + 80
The image below points to the Slope - which indicates how steep the line is, and the Intercept -
which is the value of y, when x = 0 (the point where the diagonal line crosses the vertical axis).
The red line is the continuation of the blue line from previous page.
Find The Slope
The slope is defined as how much calorie burnage increases, if average pulse increases by one. It
tells us how "steep" the diagonal line is.
We can find the slope by using the proportional difference of two points from the graph.
We see that if average pulse increases by 10, the calorie burnage increases by 20.
Slope = 20/10 = 2
The slope is 2.
Be consistent when defining the observations in the correct order! If not, the prediction will not
be correct!
Example
def slope(x1, y1, x2, y2):
    s = (y2 - y1) / (x2 - x1)
    return s

print(slope(80, 240, 90, 260))
The intercept is where the diagonal line crosses the y-axis, if it were fully drawn.
Here, we see that if average pulse (x) is zero, then the calorie burnage (y) is 80. Does that make
sense? No - you would be dead, and you certainly would not burn any calories.
However, we need to include the intercept in order to complete the mathematical function's
ability to predict Calorie_Burnage correctly.
Other examples where the intercept of a mathematical function can have a practical meaning:
Predicting next year's revenue from marketing expenditure (how much revenue will we have
next year if marketing expenditure is zero?). It is reasonable to assume that a company will still
have some revenue even if it does not spend money on marketing.
Fuel usage against speed (how much fuel do we use if speed is equal to 0 mph?). A car that runs
on gasoline still uses fuel when it is idle.
Find the Slope and Intercept Using Python
The np.polyfit() function returns the slope and intercept.
If we proceed with the following code, we can both get the slope and intercept from the function.
Example
import numpy as np
x = health_data["Average_Pulse"]
y = health_data["Calorie_Burnage"]
slope_intercept = np.polyfit(x,y,1)
print(slope_intercept)
Example Explained:
Isolate the variables Average_Pulse (x) and Calorie_Burnage (y) from health_data.
Call the np.polyfit() function.
The last parameter of the function specifies the degree of the polynomial, which in this case is 1.
Tip: a linear function is a first-degree polynomial - all of its coefficients multiply x raised to the
power of one.
We have now calculated the slope (2) and the intercept (80). We can write the mathematical
function as follows:
f(x) = 2x + 80
Task: Use the function to predict Calorie_Burnage when Average_Pulse is 135.
Remember that the intercept is a constant. A constant is a number that does not change.
Example
def my_function(x):
    return 2 * x + 80

print(my_function(135))
Set the maximum value of the y-axis to 400 and of the x-axis to 150:
Example
import matplotlib.pyplot as plt

# Plot the health data with fixed axis limits (assumed column names)
health_data.plot(x="Average_Pulse", y="Calorie_Burnage", kind="line")
plt.ylim(ymin=0, ymax=400)
plt.xlim(xmin=0, xmax=150)
plt.show()
Introduction to Statistics
Statistics is the science of analyzing data.
When we have created a model for prediction, we must assess the prediction's reliability.
Descriptive Statistics
We will first cover some basic descriptive statistics.
Count
Sum
Standard Deviation
Percentile
Average
Etc..
Example
print(full_health_data.describe())
Output:
Do you see anything interesting here?
The 25% percentile of Average_Pulse means that 25% of all of the training sessions have an
average pulse of 100 beats per minute or lower. If we flip the statement, it means that 75% of
all of the training sessions have an average pulse of 100 beats per minute or higher
The 75% percentile of Average_Pulse means that 75% of all the training sessions have an average
pulse of 111 or lower. If we flip the statement, it means that 25% of all of the training sessions
have an average pulse of 111 beats per minute or higher.
Example
import numpy as np
Max_Pulse = full_health_data["Max_Pulse"]
percentile10 = np.percentile(Max_Pulse, 10)
print(percentile10)
Max_Pulse = full_health_data["Max_Pulse"] - Isolate the variable Max_Pulse from the full health
data set.
np.percentile() is used to compute the 10% percentile of Max_Pulse.
The 10% percentile of Max_Pulse is 120. This means that 10% of all the training sessions have a
Max_Pulse of 120 or lower.
Data Science - Statistics Standard Deviation
Standard Deviation
Standard deviation is a number that describes how spread out the observations are.
A mathematical function will have difficulties predicting precise values if the observations are
"spread". Standard deviation is a measure of uncertainty.
A low standard deviation means that most of the numbers are close to the mean (average) value.
A high standard deviation means that the values are spread out over a wider range.
We can use the std() function from Numpy to find the standard deviation of a variable:
Example
import numpy as np
std = np.std(full_health_data)
print(std)
The output:
What do these numbers mean?
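As a small illustration with fictional numbers, two samples can share the same mean while having very different standard deviations:

```python
import numpy as np

# Two fictional samples with the same mean (100) but different spread
tight = [98, 99, 100, 101, 102]
spread = [60, 80, 100, 120, 140]

print(np.std(tight))   # ~1.41  - low: the values cluster around the mean
print(np.std(spread))  # ~28.28 - high: the values are spread out
```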
Coefficient of Variation
The coefficient of variation is used to get an idea of how large the standard deviation is.
Example
import numpy as np
cv = np.std(full_health_data) / np.mean(full_health_data)
print(cv)
The output:
We see that the variables Duration, Calorie_Burnage and Hours_Work have a high coefficient of
variation compared to Max_Pulse, Average_Pulse and Hours_Sleep.
Variance
Variance is another number that indicates how spread out the values are.
In fact, if you take the square root of the variance, you get the standard deviation. Or the other
way around, if you multiply the standard deviation by itself, you get the variance!
We will first use the data set with 10 observations to give an example of how we can calculate
the variance:
Duration  Average_Pulse  Max_Pulse  Calorie_Burnage  Hours_Work  Hours_Sleep
30        80             120        240              10          7
30        85             120        250              10          7
45        90             130        260              8           7
45        95             130        270              8           7
Step 1: Find the Mean
1. Find the mean of Average_Pulse:
(80+85+90+95+100+105+110+115+120+125) / 10 = 102.5
Step 2: For Each Value - Find the Difference From the Mean
2. Find the difference from the mean for each value:
80 - 102.5 = -22.5
85 - 102.5 = -17.5
90 - 102.5 = -12.5
95 - 102.5 = -7.5
100 - 102.5 = -2.5
105 - 102.5 = 2.5
110 - 102.5 = 7.5
115 - 102.5 = 12.5
120 - 102.5 = 17.5
125 - 102.5 = 22.5
Step 3: For Each Difference - Find the Square Value
3. Find the square value of each difference:
(-22.5)^2 = 506.25
(-17.5)^2 = 306.25
(-12.5)^2 = 156.25
(-7.5)^2 = 56.25
(-2.5)^2 = 6.25
2.5^2 = 6.25
7.5^2 = 56.25
12.5^2 = 156.25
17.5^2 = 306.25
22.5^2 = 506.25
Step 4: The Variance Is the Average of These Squared Differences
4. Sum the squared differences and divide by the number of observations:
(506.25 + 306.25 + 156.25 + 56.25 + 6.25 + 6.25 + 56.25 + 156.25 + 306.25 + 506.25) / 10 = 206.25
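The manual calculation above can be verified with Numpy, using the same ten Average_Pulse values:

```python
import numpy as np

average_pulse = [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]

variance = np.var(average_pulse)
print(variance)  # 206.25 - matches the calculation by hand

# The standard deviation is the square root of the variance
print(np.sqrt(variance))      # ~14.36
print(np.std(average_pulse))  # the same number
```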
Example
import numpy as np
var = np.var(health_data)
print(var)
The output:
Example
import numpy as np
var_full = np.var(full_health_data)
print(var_full)
The output:
Data Science - Statistics Correlation
Correlation
Correlation measures the relationship between two variables.
We mentioned that the purpose of a function is to predict a value, by converting an input (x) into
an output (f(x)). We can also say that a function uses the relationship between two variables for
prediction.
Correlation Coefficient
The correlation coefficient measures the relationship between two variables.
1 = there is a perfect linear relationship between the variables (like Average_Pulse against
Calorie_Burnage)
0 = there is no linear relationship between the variables
-1 = there is a perfect negative linear relationship between the variables (e.g. fewer hours worked
leads to higher calorie burnage during a training session)
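The three cases can be checked numerically with np.corrcoef, using small fictional arrays:

```python
import numpy as np

x = [80, 85, 90, 95, 100]

y_up = [240, 250, 260, 270, 280]  # rises in lockstep with x
y_down = [280, 270, 260, 250, 240]  # falls as x rises
y_none = [1, -1, 1, -1, 1]          # no linear relationship with x

print(np.corrcoef(x, y_up)[0, 1])    # 1.0
print(np.corrcoef(x, y_down)[0, 1])  # -1.0
print(np.corrcoef(x, y_none)[0, 1])  # ~0.0
```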
Example
import matplotlib.pyplot as plt

# Scatter plot of Average_Pulse against Calorie_Burnage (assumed column names)
health_data.plot(x="Average_Pulse", y="Calorie_Burnage", kind="scatter")
plt.show()
Output:
As we saw earlier, there is a perfect linear relationship between Average_Pulse and
Calorie_Burnage.
If we work longer hours, we tend to have lower calorie burnage because we are exhausted before
the training session.
Example
import pandas as pd
import matplotlib.pyplot as plt

# Scatter plot of Duration against Max_Pulse from the full data set
full_health_data.plot(x="Duration", y="Max_Pulse", kind="scatter")
plt.show()
Here, we have plotted Max_Pulse against Duration from the full_health_data set.
As you can see, there is no linear relationship between the two variables. It means that longer
training sessions do not lead to a higher Max_Pulse.
Example
import pandas as pd

# Show the correlation coefficients between all pairs of variables
Corr_Matrix = round(full_health_data.corr(), 2)
print(Corr_Matrix)
A correlation matrix is simply a table showing the correlation coefficients between variables.
Here, the variables are represented in the first row, and in the first column:
The table above has used data from the full health data set.
Observations:
We observe that Duration and Calorie_Burnage are closely related, with a correlation coefficient
of 0.89. This makes sense, as the longer we train, the more calories we burn.
We observe that there is almost no linear relationship between Average_Pulse and
Calorie_Burnage (correlation coefficient of 0.02).
Can we conclude that Average_Pulse does not affect Calorie_Burnage? No. We will come back
to answer this question later!
Using a Heatmap
We can use a heatmap to visualize the correlation between variables:
The closer the correlation coefficient is to 1, the greener the squares get.
The closer the correlation coefficient is to -1, the browner the squares get.
Example
import matplotlib.pyplot as plt
import seaborn as sns
correlation_full_health = full_health_data.corr()
axis_corr = sns.heatmap(
correlation_full_health,
vmin=-1, vmax=1, center=0,
cmap=sns.diverging_palette(50, 500, n=500),
square=True
)
plt.show()
Example Explained:
A high correlation coefficient (close to 1) does not mean that we can conclude an actual
relationship between two variables for certain.
A classic example: ice cream sales and drowning accidents are strongly correlated, because both
increase in warm weather. Does this mean that an increase in ice cream sales is a direct cause of
increased drowning accidents?
The Beach Example in Python
Here, we constructed a fictional data set for you to try:
Example
import pandas as pd
import matplotlib.pyplot as plt
Drowning = {"Drowning_Accident": [20,40,60,80,100,120,140,160,180,200],
"Ice_Cream_Sale": [20,40,60,80,100,120,140,160,180,200]}
Drowning = pd.DataFrame(data=Drowning)
correlation_beach = Drowning.corr()
print(correlation_beach)
Output:
Correlation vs Causality - The Beach Example
In other words: can we use ice cream sales to predict drowning accidents?
It is likely that these two variables are correlating with each other accidentally. Drowning
accidents have other, more direct causes, such as:
Unskilled swimmers
Waves
Cramp
Seizure disorders
Lack of supervision
Alcohol (mis)use
etc.
Let us reverse the argument:
Does a low correlation coefficient (close to zero) mean that change in x does not affect y?
Can we conclude that Average_Pulse does not affect Calorie_Burnage because of a low
correlation coefficient?
Correlation is a number that measures how closely the data are related
Causality is the conclusion that x causes y.
We are missing one important variable that affects Calorie_Burnage, which is the Duration of the
training session.
Linear Regression
The term regression is used when you try to find the relationship between variables.
In Machine Learning and in statistical modeling, that relationship is used to predict the outcome
of events.
The concept is to draw a line through all the plotted data points. The line is positioned in a way
that it minimizes the distance to all of the data points.
The red dashed lines represent the distances from the data points to the drawn mathematical
function.
Example
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
x = full_health_data["Average_Pulse"]
y = full_health_data["Calorie_Burnage"]

# Find the slope and intercept with a linear regression
slope, intercept, r, p, std_err = stats.linregress(x, y)

def myfunc(x):
    return slope * x + intercept

plt.scatter(x, y)
plt.plot(x, myfunc(x))
plt.ylim(ymin=0, ymax=2000)
plt.xlim(xmin=0, xmax=200)
plt.xlabel("Average_Pulse")
plt.ylabel("Calorie_Burnage")
plt.show()
Example Explained:
We will show that the variable Average_Pulse alone is not enough to make precise prediction of
Calorie_Burnage.
Regression Table
The output from linear regression can be summarized in a regression table.
Example
import pandas as pd
import statsmodels.formula.api as smf

# Fit Calorie_Burnage as a linear function of Average_Pulse (ordinary least squares)
model = smf.ols("Calorie_Burnage ~ Average_Pulse", data=full_health_data)
results = model.fit()
print(results.summary())
Coef is short for coefficient. It is the output of the linear regression function. We can use the
fitted coefficients to predict Calorie_Burnage:
Example
def Predict_Calorie_Burnage(Average_Pulse):
    # intercept and coefficient taken from the fitted model
    return results.params["Intercept"] + results.params["Average_Pulse"] * Average_Pulse

print(Predict_Calorie_Burnage(120))
print(Predict_Calorie_Burnage(130))
print(Predict_Calorie_Burnage(150))
print(Predict_Calorie_Burnage(180))
Now, we want to test whether the coefficients from the linear regression function have a
significant impact on the dependent variable (Calorie_Burnage).
This means that we want to show, using statistical tests, that a relationship exists between
Average_Pulse and Calorie_Burnage.
There are four components that explain the statistics of the coefficients:
The P-value
The P-value is a statistical number used to conclude whether there is a relationship between
Average_Pulse and Calorie_Burnage.
We test whether the true value of the coefficient is equal to zero (no relationship). The statistical
procedure for this is called hypothesis testing.
A low P-value (< 0.05) means that the coefficient is likely not to equal zero.
A high P-value (> 0.05) means that we cannot conclude that the explanatory variable
affects the dependent variable (here: if Average_Pulse affects Calorie_Burnage).
A high P-value is also called an insignificant P-value.
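A minimal sketch of the 0.05 decision rule, using scipy's linregress on fictional observations (the data values here are made up for the example):

```python
from scipy import stats

# Fictional observations where y clearly depends on x
x = [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]
y = [240, 252, 258, 271, 279, 290, 301, 309, 322, 329]

result = stats.linregress(x, y)
print(result.slope, result.pvalue)

# The decision rule described above
if result.pvalue < 0.05:
    print("Reject H0: the coefficient is likely not zero")
else:
    print("Cannot conclude a relationship")
```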
Hypothesis Testing
Hypothesis testing is a statistical procedure to test whether your results are valid.
In our example, we are testing whether the true coefficient of Average_Pulse, and the intercept,
are equal to zero.
A hypothesis test has two statements: the null hypothesis and the alternative hypothesis.
Mathematically written:
H0: Average_Pulse = 0
HA: Average_Pulse ≠ 0
H0: Intercept = 0
HA: Intercept ≠ 0
If we reject the null hypothesis, we conclude that a relationship exists between Average_Pulse
and Calorie_Burnage. The P-value is used for this conclusion.
Note: A P-value threshold of 0.05 means that 5% of the time, we will falsely reject the null
hypothesis - we accept that 5% of the time, we might falsely conclude a relationship.
If the P-value is lower than 0.05, we can reject the null hypothesis and conclude that a
relationship exists between the variables.
However, the P-value of Average_Pulse is 0.824, so we cannot conclude a relationship between
Average_Pulse and Calorie_Burnage.
A P-value of 0.824 means that if the true coefficient of Average_Pulse were zero, we would still
expect to observe a result like this 82.4% of the time.
The intercept is used to adjust the regression function's ability to predict more precisely. It is
therefore uncommon to interpret the P-value of the intercept.
R - Squared
R-Squared and Adjusted R-Squared describe how well the linear regression model fits the data
points:
The value of R-Squared is always between 0 and 1 (0% to 100%).
A high R-Squared value means that many data points are close to the linear regression function
line.
A low R-Squared value means that the linear regression function line does not fit the data well.
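One way to obtain R-Squared for a simple linear regression is to square the correlation coefficient r returned by scipy's linregress (a sketch with fictional data):

```python
from scipy import stats

# Fictional training data
x = [30, 30, 45, 45, 60, 60]
y = [240, 250, 260, 270, 280, 290]

slope, intercept, r, p, std_err = stats.linregress(x, y)

# R-Squared is the square of the correlation coefficient r
print(r ** 2)  # always between 0 and 1; closer to 1 means a better fit
```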
This can be visualized when we plot the linear regression function through the data points of
Average_Pulse and Calorie_Burnage: the data points are far from the line, so the R-Squared
value is low.
However, if we plot Duration and Calorie_Burnage, the R-Squared increases. Here, we see that
the data points are close to the linear regression function line:
Visual Example of a High R-Squared Value (0.79)
Here is the code in Python:
Example
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

x = full_health_data["Duration"]
y = full_health_data["Calorie_Burnage"]

# Find the slope and intercept with a linear regression
slope, intercept, r, p, std_err = stats.linregress(x, y)

def myfunc(x):
    return slope * x + intercept

mymodel = list(map(myfunc, x))
print(mymodel)

plt.scatter(x, y)
plt.plot(x, mymodel)
plt.ylim(ymin=0, ymax=2000)
plt.xlim(xmin=0, xmax=200)
plt.xlabel("Duration")
plt.ylabel("Calorie_Burnage")
plt.show()
Summarized, the linear regression with Average_Pulse as the only explanatory variable gave:
A coefficient of 0.3296, which means that Average_Pulse has a very small effect on
Calorie_Burnage.
High P-value (0.824), which means that we cannot conclude a relationship between
Average_Pulse and Calorie_Burnage.
R-Squared value of 0, which means that the linear regression function line does not fit the data
well.
Example
import pandas as pd
import statsmodels.formula.api as smf

# Regression with two explanatory variables: Average_Pulse and Duration
model = smf.ols("Calorie_Burnage ~ Average_Pulse + Duration", data=full_health_data)
results = model.fit()
print(results.summary())
Output:
Example
def Predict_Calorie_Burnage(Average_Pulse, Duration):
    return 3.1695 * Average_Pulse + 5.8434 * Duration - 334.5194

print(Predict_Calorie_Burnage(110, 60))
print(Predict_Calorie_Burnage(140, 45))
print(Predict_Calorie_Burnage(175, 20))
The Answers:
Average pulse is 110 and duration of the training session is 60 minutes = 365 Calories
Average pulse is 140 and duration of the training session is 45 minutes = 372 Calories
Average pulse is 175 and duration of the training session is 20 minutes = 337 Calories
So here we can conclude that Average_Pulse and Duration have a relationship with
Calorie_Burnage.
Adjusted R-Squared
There is a problem with R-squared if we have more than one explanatory variable.
R-squared will almost always increase, and never decrease, when we add more explanatory
variables, because each extra variable gives the function more freedom to fit the data points.
If we add random variables that do not affect Calorie_Burnage, we risk falsely concluding that
the linear regression function is a good fit. Adjusted R-squared adjusts for this problem.
It is therefore better to look at the adjusted R-squared value if we have more than one
explanatory variable.
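The usual adjustment (the formula itself is an assumption here, as it is not shown in the text) penalizes the number of explanatory variables p relative to the number of observations n:

```python
def adjusted_r_squared(r_squared, n, p):
    """n = number of observations, p = number of explanatory variables."""
    return 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

# Adding a variable that barely improves the raw fit lowers the adjusted value
print(adjusted_r_squared(0.800, 30, 1))  # one explanatory variable
print(adjusted_r_squared(0.805, 30, 2))  # slightly better raw fit, lower adjusted value
```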
Congratulations! You have now finished the final module of the data science library.