Bayesian Statistics Using Python
Edited by AI Publishing
eBook Converted and Cover by Gazler Studio
Published by AI Publishing LLC
ISBN-13: 978-1-7347901-6-0
Legal Notice:
You are not permitted to amend, use, distribute, sell, quote, or paraphrase
any part of the content within this book without the specific consent of
the author.
Disclaimer Notice:
Kindly note that the information contained within this document is
solely for educational and entertainment purposes. No warranties of
any kind are expressed or implied. Readers accept that the author
is not providing any legal, professional, financial, or medical advice.
Kindly consult a licensed professional before trying out any techniques
explained in this book.
Chapter 0: Preface
Why Learn Statistics?
The Difference between Frequentist and Bayesian Statistics
What's in This Book?
Background for Reading the Book
How to Use This Book?
Requirements
This box lists everything that needs to be in place before
proceeding to the next topic. Generally, it works as a checklist
to confirm that everything is ready before starting a tutorial.
Further Readings
Here, you will be pointed to external references or sources
that provide additional content about the specific topic
being studied. In general, these consist of packages,
documentation, and cheat sheets.
Hands-on Time
Here, you will be pointed to an external file to train and test all
the knowledge acquired about a tool that has been studied.
Generally, these files are Jupyter notebooks (.ipynb), Python
(.py) files, or documents (.pdf).
https://www.aispublishing.net/book-sccb
1.1.1 Windows
1. Download the graphical Windows installer from https://
www.anaconda.com/products/individual.
2. Double-click the downloaded file. Next, click Continue
to begin the installation.
3. On the subsequent screens, answer the Introduction,
Read Me, and License prompts.
1.1.2 Apple OS X
1. Download the graphical MacOS installer from https://
www.anaconda.com/products/individual.
2. Double-click the downloaded file. Next, click Continue
to begin the installation.
3. Then, click the Install button. Anaconda will install in
the specified directory.
1.1.3 GNU/Linux
Since graphical installation is not available, we use the Linux
command line to install Anaconda. A copy of the installation
file is downloaded from https://www.anaconda.com/products/
individual.
Figure 1.15: Launching Jupyter Notebook from the Windows search box.
We are now ready to write our first program. Place the cursor
in the cell. Type print("Hello World"), and click on the Run
button on the toolbar. The output of this line of code appears
on the next line, as shown in Figure 1.18.
Integer numbers, e.g., 20, have type int, whereas
2.5 has type float. To get an integer result from division by
discarding the fractional part, we use the // operator.
1. 20 // 6
Output:
3
1. x ^ y
Output:
2
Output:
'An example string in single quotes.'
Output:
'A string in double quotes.'
Output:
Mango and apple are fruits
Output:
File "<ipython-input-96-868ddf679a10>", line 1
print('Why don't we visit the museum?')
              ^
SyntaxError: invalid syntax
Output:
Why don't we visit the museum?
Output:
Why don't we visit the museum?
Output:
Why don't we visit
the museum?
Output:
c:\new_directory
§§ String Indexing
The string elements, individual characters, or a substring
can be accessed in Python using positive as well as negative
indices. Indexing used by Python is shown in Figure 1.21.
Output:
g
Output:
a test string
The string slice begins from the start of the string even if we
omit the first index: x[:i2] and x[0:i2] give the same result.
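A minimal sketch of these indexing and slicing rules, assuming a hypothetical string x:

x = "a test string"   # hypothetical example string

print(x[0])      # first character: 'a'
print(x[-1])     # last character, using a negative index: 'g'
print(x[2:6])    # substring from index 2 up to (not including) 6: 'test'
print(x[:6])     # same as x[0:6]; omitting the first index starts at 0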
§§ if(condition):
Statement1
Note that the line following the if(condition): is indented. We
may specify more indented statements after the Statement1.
Statements indented the same amount after the colon (:)
are part of the if statement. These statements run when the
condition of the if statement is True. In case this condition is
False, Statement1 and other indented statements will not be
executed. Figure 1.22 shows the flowchart of an if statement.
Note that the input() function gets the input from the user; it
saves the input as a string. We use int(input()) to get an integer
from the string. If we execute the aforementioned program
and the input marks are greater than 100, we get the warning
"Marks exceed 100". If the entered marks are 100 or less, no
warning message is displayed.
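A minimal sketch of such a program (the book's exact listing is not reproduced here, so the prompt text is an assumption):

marks = int(input("Enter your marks: "))   # read input and convert the string to an integer

if(marks > 100):
    print("Marks exceed 100")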
§§ else Statement
The else statement is always accompanied by an if statement,
i.e., if-else. The syntax of if-else is given below.
if(condition):
    Statement1
else:
    Statement2
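As a quick illustration, a hedged sketch reusing the marks example from the previous section (the pass mark of 50 is only an assumption):

marks = int(input("Enter your marks: "))

if(marks >= 50):
    print("You passed")
else:
    print("You failed")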
§§ Nested Decisions
To make decisions under multiple conditions, Python allows
us to perform nested decisions. The if-elif-else statements
check several conditions one after another and execute the
block of the first condition that evaluates to True, as sketched
below.
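A minimal sketch of if-elif-else (the grade boundaries here are illustrative assumptions):

marks = int(input("Enter your marks: "))

if(marks >= 80):
    print("Grade A")
elif(marks >= 60):
    print("Grade B")
else:
    print("Grade C")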
Further Readings
More information about conditional statements can be found
at https://bit.ly/38exHbQ
Output:
0
1
2
3
4
We can also specify a step size other than 1 within the range()
function, as follows.
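A minimal sketch, printing every second number from 0 to 8:

for i in range(0, 10, 2):   # start at 0, stop before 10, step by 2
    print(i)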
while (condition):
§§ Nested Loops
A loop, either for or while, can be used inside another loop.
This is known as a nested loop, which is useful when we
work with data in two dimensions. The following program
uses two for loops, one nested inside another, to print all the
combinations of two variables.
1. attributes = ["heavy", "wooden", "fast"]
2. objects = ["chair", "table", "computer"]
3. for j in attributes:
4.     for k in objects:
5.         print(j, k)
Output:
heavy chair
heavy table
heavy computer
wooden chair
wooden table
wooden computer
fast chair
fast table
fast computer
Further Readings
More information about iteration statements can be found
at
https://bit.ly/2TQeW6j
https://bit.ly/389E68j
Output:
This is a test function.
1. def my_function2(str_in):
2.     print("This function prints its input that is " + str_in)
3.
4. my_function2("computer")
5. my_function2("table")
6. my_function2("chair")
Output:
This function prints its input that is computer.
This function prints its input that is table.
This function prints its input that is chair.
Output:
0
-20
10
1.6.1 Lists
A list is a mutable and ordered collection of elements. Lists
are specified using square brackets in Python. For instance, to
create a list named item_list, type the following code:
1. item_list = ["backpack", "laptop", "ballpoint", "sunglasses"]
2. print(item_list)
Output:
['backpack', 'laptop', 'ballpoint', 'sunglasses']
To return the second and the third list items, we type the
following code.
print(item_list[1:3]) # Elements at index 1 and 2 but not the one at 3.
Output:
['laptop', 'ballpoint']
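Because lists are mutable, their elements can be changed or extended in place; a minimal sketch (the replacement values are assumptions):

item_list = ["backpack", "laptop", "ballpoint", "sunglasses"]

item_list[1] = "notebook"     # replace the element at index 1
item_list.append("wallet")    # add a new element at the end
print(item_list)
# ['backpack', 'notebook', 'ballpoint', 'sunglasses', 'wallet']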
1.6.2 Tuples
A tuple is an immutable and ordered collection of items. In
Python, tuples are specified using round brackets (). The
following code creates a tuple:
1. py_stat = ("Python", "for", "statistics")
2. print(py_stat)
Output:
('Python', 'for', 'statistics')
Output:
0
1
1.6.3 Sets
A set is an unindexed and unordered collection of items.
Python specifies sets using curly brackets { }. For example,
type the following code.
1. my_animals = {"cat", "dog", "tiger", "fox"}
2. print(my_animals)
Output:
{'dog', 'tiger', 'cat', 'fox'}
1. print("tiger" in my_animals)
2. print("lion" in my_animals)
Output:
True
False
Moreover, the method pop() removes an arbitrary item from a set
(sets are unordered, so there is no "last" item). The method clear()
empties the set, and the keyword del before the name of the set
deletes the set completely.
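A minimal sketch of these set operations, reusing the my_animals set from above:

my_animals = {"cat", "dog", "tiger", "fox"}

my_animals.add("lion")       # add an element
my_animals.remove("dog")     # remove a specific element
removed = my_animals.pop()   # remove and return an arbitrary element
print(my_animals)

my_animals.clear()           # empty the set
print(my_animals)            # set()

del my_animals               # delete the set object completely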
1.6.4 Dictionaries
A dictionary is a mutable, unordered, and indexed collection
of items. A dictionary in Python has a key:value pair for each
of its elements. Dictionaries retrieve values when the keys are
known. To create a dictionary, we use curly braces { }, and
put key: value elements inside these braces where each pair is
separated from others by commas. For example, the following
piece of code creates a dictionary named py_stat_dict.
1. py_stat_dict = {
2.     "name": "Python",
3.     "purpose": "Statistics",
4.     "year": 2020
5. }
6. print(py_stat_dict)
Output:
{'name': 'Python', 'purpose': 'Statistics', 'year': 2020}
The keys of a dictionary must be unique, whereas the values can be
of any data type and can repeat. Square brackets are used to
refer to a key name to access a specified value of a dictionary.
1. print(py_stat_dict['name'])        # accesses value for key 'name'
2. print(py_stat_dict.get('purpose')) # accesses value for key 'purpose'
Output:
Python
Statistics
Output:
None
1. py_stat_dict2 = py_stat_dict.copy()
2. print(py_stat_dict)
3. print(py_stat_dict2)
Output:
{'name': 'Python', 'purpose': 'Data science', 'year': 2020}
{'name': 'Python', 'purpose': 'Data science', 'year': 2020}
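The 'Data science' value in the output above suggests the dictionary was updated between the listings. Since dictionaries are mutable, keys can be added, updated, or removed; a minimal sketch (the new 'version' key is only an illustration):

py_stat_dict = {"name": "Python", "purpose": "Statistics", "year": 2020}

py_stat_dict["purpose"] = "Data science"   # update an existing value
py_stat_dict["version"] = 3                # add a new key:value pair
del py_stat_dict["year"]                   # remove a key:value pair
print(py_stat_dict)
# {'name': 'Python', 'purpose': 'Data science', 'version': 3}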
Further Reading
A detailed tutorial on Python is given in
https://bit.ly/34WfGwY
https://bit.ly/389E68j
3. print(my_arr2)
Output:
[[10 20 30 40]
[90 80 70 60]]
3. print(my_arr3)
Output:
[[[ 10 20 30]
[ 40 50 60]]
[[ 70 80 90]
[100 110 120]]]
9.
10. print("Subtraction of a and b :",np.subtract(a,b))
11.
12. print("division of a and b :",np.divide(a,b))
13.
14. print("a raised to b is:",np.power(a,b))
15.
16. print("mod of a and b :",np.mod(a,b))
17.
18. print("remainders when a/b :",np.remainder(a,b))
19.
20. a = np.array([2.33, 4.15, 5.85])
21.
22. rounded_a = np.round_(a,2)
23. print("Rounded array is: ",rounded_a)
24.
25. floor_a = np.floor(a)
26. print("Floor of the array is: ",floor_a)
27.
28. print("Square root of the array is: ",np.sqrt(a))
Output:
Addition of a and b : [5 7 9]
Multiplication of a and b : [ 4 10 18]
Subtraction of a and b : [-3 -3 -3]
division of a and b : [0.25 0.4 0.5 ]
a raised to b is: [ 1 32 729]
mod of a and b : [1 2 3]
remainders when a/b : [1 2 3]
Rounded array is: [2.33 4.15 5.85]
Floor of the array is: [2. 4. 5.]
Square root of the array is: [1.52643375 2.03715488
2.41867732]
Output:
0   -2
1   -1
2    0
3    1
4    2
dtype: int64
The last line of the output, dtype: int64, indicates that the values
of my series are 64-bit integers.
A Series object contains two arrays, the index and the values, that
are linked to each other. The first array, on the left of the output
of the previous program, holds the index of the data, while the
second array, on the right, contains the actual values of the
series.
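The listing that produced the Series above is not reproduced in this extract; a minimal sketch that builds an equivalent Series and shows its two linked arrays:

import pandas as pd

my_series = pd.Series([-2, -1, 0, 1, 2])   # default integer index 0..4
print(my_series)
print(my_series.index)    # the index array
print(my_series.values)   # the values array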
6 70
7 80
8 90
dtype: int32
Output:
Dataframe from a dict of series is:
one two
a 1.2 1.5
b 2.3 2.4
c 3.4 3.2
d NaN 4.1
Dataframe from a dict of series with custom indexes is:
one two
d NaN 4.1
b 2.3 2.4
a 1.2 1.5
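A hedged sketch of how the two DataFrames shown above can be built from a dict of Series (the column and index labels follow the output; the original listing is not shown here):

import pandas as pd

d = {"one": pd.Series([1.2, 2.3, 3.4], index=["a", "b", "c"]),
     "two": pd.Series([1.5, 2.4, 3.2, 4.1], index=["a", "b", "c", "d"])}

df = pd.DataFrame(d)
print("Dataframe from a dict of series is:")
print(df)                     # index label d has no value in column one, so it becomes NaN

df2 = pd.DataFrame(d, index=["d", "b", "a"])   # custom order of index labels
print("Dataframe from a dict of series with custom indexes is:")
print(df2)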
Note that we get pstdev = 2.75 and stdev = 2.87. The function
st.mode() gives the data point that occurs the most in the data.
7.
8. all = stats.describe(c)
9. gmean = stats.gmean(c)
10. hmean = stats.hmean(c)
11. mode = stats.mode(c)
12. skewness = stats.skew(c)
13. iqr = stats.iqr(c)
14. z_score = stats.zscore(c)
15.
16. print("\nDescribe\n",all)
17. print("\nGeometric mean\n",gmean)
18. print("\nharmonic mean\n",hmean)
19. print("\nMode\n",mode)
20. print("\nSkewness\n",skewness)
21. print("\nInter Quantile Range\n",iqr)
22. print("\nZ Score\n",z_score)
Output:
Describe
DescribeResult(nobs=9, minmax=(2, 8), mean=4.666666666666667,
variance=4.5, skewness=0.12962962962962918,
kurtosis=-1.1574074074074074)
Geometric mean
4.196001296532889
harmonic mean
3.722304283604136
Mode
ModeResult(mode=array([2]), count=array([2]))
Skewness
0.12962962962962918
Z Score
[ 0.16666667 -1.33333333 0.16666667 0.66666667 -0.83333333
-0.33333333 -1.33333333 1.66666667 1.16666667]
8. X = sm.add_constant(X)
9.
10. beta = [1, .1, .5]
11.
12. e = np.random.random(nobs)
13.
14. # Dot product of two arrays
15. y = np.dot(X, beta) + e
16.
17. # Fit regression model
18. results = sm.OLS(y, X).fit()
19.
20. # Inspect the results
21. print(results.summary())
Output:
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.079
Model: OLS Adj. R-squared: 0.060
Method: Least Squares F-statistic: 4.155
Date: Fri, 04 Sep 2020 Prob (F-statistic): 0.0186
Time: 22:18:38 Log-Likelihood: -25.991
No. Observations: 100 AIC: 57.98
Df Residuals: 97 BIC: 65.80
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 1.6532 0.088 18.781 0.000 1.478 1.828
x1 0.0062 0.120 0.052 0.959 -0.233 0.245
x2 0.3266 0.114 2.856 0.005 0.100 0.553
==============================================================================
Omnibus: 71.909 Durbin-Watson: 2.124
Prob(Omnibus): 0.000 Jarque-Bera (JB): 7.872
Skew: -0.133 Prob(JB): 0.0195
Kurtosis: 1.651 Cond. No. 5.72
=============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly
specified.
Further Reading
More information about Python and its commonly used
functions can be found at
https://bit.ly/3oQ6LFc
Hands-on Time
To test your understanding of the concepts presented in this
chapter, complete the following exercise.
Question 2:
Question 3:
Question 4:
Question 5:
Question 6:
word = 'Python'
word[1]
A. 'P'
B. 'p'
C. 'y'
D. 'Y'
Question 7:
word = 'Python'
word[-2]
A. 'n'
B. 'o'
C. 'h'
D. 'P'
Question 8:
Question 9:
Question 10:
S = {1, 2, 3, 4, 5, 6}.
S = {0, 1, 2}.
S = {0, 1, 2, 3, 4, …}
S = [0, ∞),
E = {0, 1}.
E1 = {1, 3, 5}.
E2 = {1, 2, 3}.
E3 = E1∩E2 = {1, 3}
Figure 2.1: A Venn diagram showing the intersection of two events E1 and E2.
Figure 2.2: A Venn diagram showing the union of two events E1 and E2.
1. P(S) = 1.
The probability that at least one of all the possible outcomes
of the sample space S will occur is 1. Alternatively, when
an experiment is performed, some outcome in the sample
space of this experiment will always occur.
2. P(E)≥0
If E is an event (a subset of S), its probability is equal to or
greater than zero.
3. P(E1∪E2 ) = P(E1) + P(E2) for mutually exclusive or disjoint
events E1 and E2.
The symbol ∪ stands for the set union. We can redefine
this as: If E1 and E2 have nothing in common, i.e., these are
mutually exclusive, the probability of either E1 or E2 equals
the probability of occurrence of E1 plus the probability of
occurrence of E2.
A = {TT} and
B = {HH, TH}.
P(A) = 1/4
P(B) = 2/4
P (A|B) = P(A)
P (B|A) = P(B)
26.
27.
28. # Print each probability
29. print('The product of probability of getting a king and the probability of drawing a queen = ',round(prob_king1 * prob_queen1,5))
30. print('The probability of getting a king after drawing a queen in the first draw with replacement = ',round(prob_king_and_queen1,5))
31. print('The probability of getting a king after drawing a queen in the first draw without replacement = ',round(prob_king_and_queen2,5))
Output:
The product of the probability of getting a king and the
probability of drawing a queen = 0.00592
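The fragment above refers to variables computed earlier in the book's listing. A self-contained sketch of the same calculations, assuming a standard 52-card deck with four kings and four queens (variable names follow the fragment):

# Probability of drawing a king, and of drawing a queen, from a full deck
prob_king1 = 4 / 52
prob_queen1 = 4 / 52

# With replacement, the two draws are independent
prob_king_and_queen1 = prob_queen1 * prob_king1

# Without replacement, only 51 cards remain for the second draw
prob_king_and_queen2 = (4 / 52) * (4 / 51)

print(round(prob_king1 * prob_queen1, 5))   # 0.00592
print(round(prob_king_and_queen1, 5))       # 0.00592
print(round(prob_king_and_queen2, 5))       # 0.00603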
and
P(B|A) = (0.66)(0.3)/0.4
= 0.495 = 49.5%.
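A minimal sketch reproducing the arithmetic above with Bayes' theorem (the probability values are taken from the example):

# Bayes' theorem: P(B|A) = P(A|B) * P(B) / P(A)
p_a_given_b = 0.66
p_b = 0.3
p_a = 0.4

p_b_given_a = p_a_given_b * p_b / p_a
print(round(p_b_given_a, 3))   # 0.495, i.e., 49.5%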
Requirements
The Python scripts presented in this chapter have been
executed using the Jupyter notebook. Thus, to implement
the Python scripts, you should have the Jupyter notebook
installed. Since the Anaconda distribution ships with the
required libraries, we do not need to install them separately.
Further Readings
For the practice of questions related to probability theory,
please visit the following links:
https://bit.ly/2I3YQDM
For details and applications of Bayes’ theorem, visit
https://bit.ly/3mT9IDi
Question 2:
Question 3:
Question 4:
Question 5:
If the probability of an event P(E) = 0.4, P(not E) will be:
A. 0.4
B. 0.5
C. 0.6
D. 1
Question 6:
The probability of drawing an ace from a deck of 52 cards is:
A. 1/52
B. 1/26
C. 4/13
D. 1/13
Question 7:
A die is rolled. What is the probability of getting either a 2
or a 3?
A. 1/6
B. 2/6
C. 1/3
D. 1/2
Question 8:
If a card is chosen from a deck of 52 cards, what is the
probability of getting a one or a two?
A. 4/52
B. 1/26
C. 8/52
D. 1/169
Question 9:
Question 10:
of time in the interval [0, T]. In this case, the sample space
is continuous in contrast to the discrete sample space of the
experiment involving the dice.
0.55
[[43 20 36]
[41 48 24]
[28 48 3]
[35 25 2]]
Figure 3.1: The Venn diagram showing the sample space of the
experiment in which a coin is tossed twice.
X          0     1     2
P(X = x)   1/4   1/2   1/4
Figure 3.2: The probability mass function (PMF) of the random variable
representing the number of tails when a coin is tossed twice.
34.
35. outcome_value, outcome_count = np.unique(num_tails_rep, return_counts=True)
36. prob_count = outcome_count / len(num_tails_rep)
37.
38. # Now that we have tossed the coin twice for 1000 times, we plot the results
39. plt.bar(outcome_value, prob_count)
40. plt.ylabel("Probability")
41. plt.xlabel("Outcome")
42. plt.title("Probability Mass Function")
43. plt.show()
Output:
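Only the final lines of the listing are shown above. A self-contained sketch of the whole simulation, assuming 1,000 repetitions of tossing a fair coin twice and counting the tails (the seed is an assumption):

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(1)
num_reps = 1000

# Each repetition: toss a coin twice (0 = head, 1 = tail) and count the tails
tosses = np.random.randint(0, 2, size=(num_reps, 2))
num_tails_rep = tosses.sum(axis=1)

outcome_value, outcome_count = np.unique(num_tails_rep, return_counts=True)
prob_count = outcome_count / len(num_tails_rep)

plt.bar(outcome_value, prob_count)
plt.ylabel("Probability")
plt.xlabel("Outcome")
plt.title("Probability Mass Function")
plt.show()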
Further Readings
The Python module SciPy.Stats offers a variety of continuous
and discrete random variables along with useful functions
to work with these distributions. For details of SciPy.Stats
functions, visit
https://bit.ly/3jYIMQs
For instance, the sample space for tossing one fair coin twice
is S = {HH, HT, TH, TT}. If X denotes the number of tails,
P(X = 0) = P(HH) = ¼,
P(X = 1) = P({HT, TH}) = ½,
P(X = 2) = P(TT) = ¼.
μ = E(X) = Σ x·P(X = x)
μ = E(X) = (1)(1/6) + (2)(1/6) + (3)(1/6) + (4)(1/6) + (5)(1/6) + (6)(1/6) = 3.5.
E(X) = Σ x·P(X = x)
     = 0·(1 − p) + 1·p = p
Output:
If we run the same code with a sample size of 2,000, we get the
following output:
Further Reading
More information about functions provided by SciPy.Stats
and the probability distributions using Python can be found
at https://bit.ly/3lb6JFX
The median and mode of the numbers 15, 11, 9, 5, 15, 13, 17 are
respectively:
A. 13, 6
B. 13, 18
C. 13, 15
D. 15, 16
Question 2:
Question 3:
Question 4:
Question 5:
Question 6:
Question 7:
Question 8:
Question 9:
Question 10:
Question 11:
Question 12:
μ = E(X) = Σ x·P(X = x)
μ = E(X) = (1)(1/6) + (2)(1/6) + (3)(1/6) + (4)(1/6) + (5)(1/6) + (6)(1/6) = 3.5.
= (18+20) / 2 = 19
Figure 4.1: A Box and Whisker plot. The length of the box represents IQR.
IQR = Q3 – Q1
x      0     1     2     3     4
f(x)   0.2   0.1   0.3   0.3   0.1
σ² = E[(X − μ)²] = Σ (x − μ)² f(x)
STD = √σ2 = σ
The terms (x-μx) and (y-μy) are computed for each data point.
These are multiplied together, and finally, the mean or average
of all the products is calculated to find a single number as the
covariance between features x and y.
1. import numpy as np
2. mycov = np.cov([1, 2, 3], [1.0, 2.5, 7.5])
3. print('The covariance between x and y is \n')
4. mycov
Output:
array([[ 1.        ,  3.25      ],
       [ 3.25      , 11.58333333]])
5. mycorr = np.corrcoef([1, 2, 3], [1.0, 2.5, 7.5])
6. print('The correlation between x and y is \n')
7. mycorr
Output:
array([[1.        , 0.95491911],
       [0.95491911, 1.        ]])
Further Reading
More information about descriptive statistics and Python
code to implement these statistics can be found at
https://bit.ly/3kZcw0W
Question 2:
Question 3:
Question 4:
Question 5:
Question 6:
The mode of the data [18, 11, 10, 12, 14, 4, 5, 11, 5, 8, 6, 3, 12, 11,
5] is:
A. 11
B. 5
C. 0
D. No unique mode
Question 7:
The range of the data [21, 18, 9, 12, 8, 14, 23] is:
A. 23
B. 8
C. 15
D. 7
Question 8:
If the sample variance of the data [21, 18, 23] is 2.51, the
population variance would be:
A. less than the sample variance
B. more than the sample variance
C. equal to the sample variance
D. cannot be determined.
5
Exploratory Analysis:
Data Visualization
5.1 Introduction
In the previous chapter, we saw that descriptive statistics
provide a useful summary of the underlying data. In this
chapter, we perform further exploration using plots and
visualization tools.
Output:
Output:
These options help distinguish the plots from each other and give
better visualization. The options marker, markerfacecolor, color,
linewidth, and markersize are used to adjust the marker type,
marker face color, line color, line thickness, and marker size,
respectively.
16.
17. # draw points for mean, median, mode
18. ax.plot([st.mean(raindata)], [st.mean(raindata)], color='r', marker="o", markersize=15)
19. ax.plot([st.median(raindata)], [st.median(raindata)], color='g', marker="o", markersize=15)
20. ax.plot([st.mode(raindata)], [st.mode(raindata)], color='k', marker="o", markersize=15)
21.
22. # Annotation
23. plt.annotate("Mean", (st.mean(raindata), st.mean(raindata)+0.3), color="r")
24. plt.annotate("Median", (st.median(raindata), st.median(raindata)-0.7), color="g")
25. plt.annotate("Mode", (st.mode(raindata), st.mode(raindata)-0.7), color="k")
26.
27. plt.show()
Output:
We plot the mean, the median, and the mode of the given
data as well using the commands given in lines 18 to 20. Lines
23 to 25 are used to annotate the measures of the center of
the data.
5.6 Histogram
A histogram is a bar chart that shows the frequency distribution
or shape of a numeric feature in the data. This allows us to
discover the underlying distribution of the data by visual
inspection. To plot a histogram, we pass a collection of numeric
values to the method hist () of the Matplotlib.pyplot package.
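The listing that produced the output below is not reproduced in this extract. A minimal sketch, assuming randomNumbers holds 10,000 samples from a standard Normal distribution:

import numpy as np
import matplotlib.pyplot as plt

randomNumbers = np.random.randn(10000)   # 10,000 standard Normal samples

plt.hist(randomNumbers)                  # 10 bins by default
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()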
Output:
The plot shown in the output of the program reveals that more
than 2,500 data points out of 10,000 have a value around 0. A
few values are less than −3 and greater than 3. By default, the
method hist () uses 10 bins or groups to plot the distribution
of the data. We can change the number of bins in line 10 of the
code by using the option bins.
1. plt.hist(randomNumbers, bins=100)
Output:
1. plt.figure(figsize=[10,8])
2.
3. # Creating random numbers using numpy
4. x = 0.75 * np.random.randn(10000)
5. y = 1.5 * np.random.randn(10000) + 5
6.
7. plt.hist([x, y], bins=100, label=['Zero mean, 0.75 STD', 'Five mean, 1.5 STD'])
8. plt.xlabel('Value',fontsize=12)
9. plt.ylabel('Frequency',fontsize=12)
10. plt.title('Two Histograms Together',fontsize=12)
11. plt.legend()
12.
13. plt.show()
Output:
1. import numpy as np
2. import matplotlib.pyplot as plt
3.
4. weight_range = ['less than 50 kg', '50–60 kg', '60–70 kg', '70–80 kg', '80–90 kg', '90–100 kg']
5. num_students = [4, 15, 20, 22, 5, 2]
6.
7. # plotting the frequency distribution
8. plt.figure(figsize=[10,8])
9.
10. plt.bar(weight_range, num_students)
11. plt.xlabel('Range of weights',fontsize=12)
12. plt.ylabel('Frequency',fontsize=12)
13. plt.title('Number of students in different ranges of weight', fontsize=12)
14. plt.show()
Output:
The bar() method takes the values of the x and y axes and plots the
values of y as vertical bars.
Output:
Weight range       Frequency   Cumulative Frequency
less than 50 kg 4 4
50–60 kg 15 19
60–70 kg 20 39
70–80 kg 22 61
80–90 kg 5 66
90–100 kg 2 68
1. import pandas as pd
2. import matplotlib.pyplot as plt
3.
4. count_people = 600
5.
6. people_car_data = {'Number of cars': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
7.                    'People having number of cars': [300, 150, 100, 15, 10, 8, 6, 5, 4, 2]}
8.
9. df = pd.DataFrame(data=people_car_data)
10. print(df)
11.
12. df.plot(kind='bar', x='Number of cars', y='People having number of cars',
13.         figsize=(8, 6), color='r');
14.
15. plt.grid(axis='y', alpha=1)
16. plt.title("Count of People for Number of cars they own", y=1.01, fontsize=12)
17. plt.ylabel("Count of People", fontsize=12)
18. plt.xlabel("Number of Cars", fontsize=12)
Output:
Number of cars People having number of cars
0 1 300
1 2 150
2 3 100
3 4 15
4 5 10
5 6 8
6 7 6
7 8 5
8 9 4
9 10 2
Text(0.5, 0, 'Number of Cars')
Output:
Number of cars People having number of cars totalPeople
0 1 300 300
1 2 150 450
2 3 100 550
3 4 15 565
4 5 10 575
5 6 8 583
6 7 6 589
7 8 5 594
8 9 4 598
9 10 2 600
Text(0.5, 0, 'Number of Cars')
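The totalPeople column shown above is a running (cumulative) total. A hedged sketch of how it can be computed and plotted, continuing the df DataFrame from the previous listing:

# Cumulative count of people, added as a new column
df['totalPeople'] = df['People having number of cars'].cumsum()
print(df)

df.plot(kind='bar', x='Number of cars', y='totalPeople',
        figsize=(8, 6), color='b')
plt.ylabel("Count of People", fontsize=12)
plt.xlabel("Number of Cars", fontsize=12)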
F(x) = P(X ≤ x),
where F(x) is the CDF, and the right-hand side of the equation is
the probability that the random variable X takes any value equal
to or less than the value x.
Q(p) = F⁻¹(p), or
Q(p) = min{x : F(x) ≥ p},
where min on the right side of the equation means the quantile
function returns the minimum value of x from all those values
such that their distribution F(x) equals or exceeds the probability
p, and p takes all probability values in the 0 to 1 range. The
value x returned by Q(p) obeys the CDF equation.
The PPF or the quantile function can be used to get the values/
samples of the variable X from the given distribution. If F(x) is
the distribution function, we can use the quantile function to
generate the random variable that has F(x) as its distribution
function.
X = [-2, 0, 1, 3]
We plot both F(X) and the quantile function using the following
Python script.
1. ## Quantile function
2.
3. import numpy as np
4. import matplotlib.pyplot as plt
5.
6. x = [-2, 0, 1, 3]
7. cdf_func = [0.2, 0.3, 0.6, 1]
8.
For a normal RV, we plot its PDF, CDF, and the quantile function
using the Python script given below:
1. from scipy.stats import norm
2. import numpy as np
3. import matplotlib.pyplot as plt
4.
5. # Generating a range of values from -4 to 4 because a standard Normal RV has most values between -3 to 3
6. x = np.arange(-4,4,0.01)
7.
8. # Plot of PDF
9. plt.plot(x,norm.pdf(x))
10. plt.xlabel('Values of RV X')
11. plt.ylabel('Probability')
12. plt.title('Probability Density Function of a Normal RV')
13. plt.show()
14.
15. # Plot of CDF
16. plt.plot(x,norm.cdf(x))
17. plt.xlabel('Values of RV X')
18. plt.ylabel('Probability')
19. plt.title('Cumulative Distribution Function of a Normal RV')
20. plt.show()
21.
22. # Plot of Inverse CDF (or PPF or Quantile function)
23. # The PPF takes probabilities, so we evaluate it on values in (0, 1)
24. p = np.arange(0.001, 1, 0.001)
25. plt.plot(p, norm.ppf(p))
26. plt.xlabel('Probability')
27. plt.ylabel('Values of RV X')
28. plt.title('Quantile Function of a Normal RV')
29. plt.show()
Output:
17.
18. # get the cumulative probability for values
19. print('P(x<25): %.3f' % ecdf(25))
20. print('P(x<50): %.3f' % ecdf(50))
21. print('P(x<75): %.3f' % ecdf(75))
22.
23. # plot the ecdf
24. pyplot.plot(ecdf.x, ecdf.y)
25. pyplot.show()
Output:
P(x<25): 0.569
P(x<50): 0.993
P(x<75): 1.000
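Only the tail of the listing is shown above. A self-contained sketch using the ECDF class from statsmodels (the sample here is a hypothetical one, so the printed probabilities will differ from those shown above):

import numpy as np
from matplotlib import pyplot
from statsmodels.distributions.empirical_distribution import ECDF

np.random.seed(1)
sample = np.random.normal(loc=25, scale=15, size=1000)   # hypothetical sample

ecdf = ECDF(sample)

# Cumulative probability for selected values
print('P(x<25): %.3f' % ecdf(25))
print('P(x<50): %.3f' % ecdf(50))
print('P(x<75): %.3f' % ecdf(75))

pyplot.plot(ecdf.x, ecdf.y)
pyplot.show()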
Question 2:
Question 3:
Question 4:
Question 5:
Question 7:
Question 8:
The z-score is computed as z = (x − μ)/σ, where x, μ, and σ are
a data point, the mean, and the standard deviation, respectively.
The area under the standard Normal curve up to the value z gives
the cumulative probability of observing a value at most z. Recall
from Section 3.6.3 on the Normal distribution that the area
under the curve from −σ to +σ is 68.3 percent of the overall
area. Furthermore, the area under the curve from −2σ to +2σ
is 95.5 percent of the overall area.
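These areas can be verified numerically with SciPy; a minimal sketch:

from scipy.stats import norm

# Area under the standard Normal curve between -1 and +1 standard deviations
print(norm.cdf(1) - norm.cdf(-1))    # about 0.683
# Area between -2 and +2 standard deviations
print(norm.cdf(2) - norm.cdf(-2))    # about 0.954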
6.5.5 p-value
The examination of the observed data makes use of decision
rules for rejecting or accepting the null hypothesis. The decision
rules are described by using a probability value known as the
p-value.
Figure 6.4: For a 95% confidence interval, regions of acceptance
(unshaded) and rejection (shaded) for (a) one-tailed (left-tailed), (b) one-
tailed (right-tailed), and (c) a two-tailed test. If the p-value falls in shaded
regions, it is considered significant to reject the null hypothesis.
A very small p-value implies that under the null hypothesis such
an extreme observed outcome would not be likely. A p-value
is compared against the significance level, α. If the p-value is
less than α, we have a very unlikely observation. Thus, we say
that the results are statistically significant because the sample
data is giving us this evidence.
Figure 6.5: (a) A confidence level of 90% corresponding to α =1−0.90 = 0.1,
α/2 = 0.1/2 = 0.05 area of rejection is on either side, (b) A confidence level
of 95%, corresponding to α =1−0.95 = 0.05, α/2 = 0.05/2 = 0.025 area of
rejection is on either side.
When the z-score is outside the range −1.96 to 1.96, e.g., 2.8,
the corresponding p-value will be smaller than 0.025. In this
instance, the null hypothesis can be rejected straightaway.
Note that both z-value and p-value are associated with the
standard Normal distribution. These methods do not work with
other distributions. Thus, we assume our sampling distribution
to be a Normal distribution due to the central limit theorem.
Output:
Further Readings
More information on the point and interval estimates can be
found at:
https://bit.ly/365cSNu
For more information about hypothesis testing, visit:
https://bit.ly/38bWyNm
Question 2:
Question 3:
Question 4:
Question 5:
A small positive z-score, for example, 0.5, implies that the value
under consideration is ___________, and the corresponding
p-value would be ___________.
A. Highly likely, small
B. Highly likely, large
C. Less likely, small
D. Less likely, large
Question 6:
Question 7:
Output:
Year InterestRate UnemploymentRate StockPrice
0 2001 2.00 5.1 30000
1 2002 2.15 5.1 30010
2 2003 2.45 5.3 30500
3 2004 2.75 5.2 32104
4 2005 2.30 5.7 27098
5 2006 2.65 4.9 28459
6 2007 3.50 6.0 33512
7 2008 3.15 4.9 29565
8 2009 3.25 4.8 30931
9 2010 4.15 4.1 34958
10 2011 4.50 3.2 33211
11 2012 3.45 3.1 34293
12 2013 3.75 4.1 36384
13 2014 5.25 4.1 38866
14 2015 5.75 3.9 38776
15 2016 5.50 3.2 40822
16 2017 3.05 3.2 35704
17 2018 6.00 4.1 36719
18 2019 6.20 4.2 40000
19 2020 6.05 4.4 40500
20 2021 6.30 4.6 42300
Predictions:
0 28900.344332
1 29278.508818
2 29868.786739
3 30708.141239
4 29158.520145
5 30705.108162
6 31934.759459
7 31965.656451
8 32300.791636
9 35150.957245
10 36780.570789
11 34216.444907
12 34142.518613
13 37924.163482
14 39350.762825
15 39301.667368
16 33124.980749
17 39814.985916
18 40236.179705
19 39691.964165
20 40156.187256
dtype: float64
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly
specified.
const = 2.809e+04,
Output:
sample mean: mu_sample : 3.940637
sample standard deviation: sigma_sample : 2.758312
confidence interval of mu_sample using normal distribution
: (2.859398, 5.021875)
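A hedged sketch of how such an interval can be computed with SciPy. The sample itself is not shown above, so the sample size n = 25 used here is only an assumption:

import numpy as np
from scipy import stats as st

mu_sample = 3.940637       # sample mean reported above
sigma_sample = 2.758312    # sample standard deviation reported above
n = 25                     # assumed sample size

# 95% confidence interval for the mean using the Normal distribution
ci = st.norm.interval(0.95, loc=mu_sample, scale=sigma_sample / np.sqrt(n))
print(ci)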
1. %matplotlib inline
2. import numpy as np
3. from scipy import stats
4. import statsmodels.api as sm
5. import matplotlib.pyplot as plt
6. from statsmodels.distributions.mixture_rvs import mixture_rvs
7.
8. # Seed the random number generator for reproducible
results
9. np.random.seed(12345)
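The rest of the listing is not shown in this extract. A hedged sketch of the kind of kernel density estimate the imports above set up, continuing from those imports (the mixture parameters are assumptions); it also illustrates the figsize, 111, and zorder options discussed next:

# Draw from a two-component Normal mixture (parameters are illustrative)
obs_dist = mixture_rvs(prob=[0.25, 0.75], size=10000,
                       dist=[stats.norm, stats.norm],
                       kwargs=(dict(loc=-1, scale=0.5), dict(loc=1, scale=0.5)))

# Fit a univariate kernel density estimate with statsmodels
kde = sm.nonparametric.KDEUnivariate(obs_dist)
kde.fit()

fig = plt.figure(figsize=(12, 5))
ax = fig.add_subplot(111)
ax.hist(obs_dist, bins=50, density=True, alpha=0.5, zorder=5)   # histogram of the data
ax.plot(kde.support, kde.density, lw=3, zorder=10)              # estimated density on top
plt.show()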
The first line of the program adjusts the size of the figures,
whereas the second line specifies how many subplots we need.
We have specified 111 in line 2. The first two digits, 11, indicate
a grid of one row and one column of subplots, and the last digit
selects the first (and only) subplot in that grid.
The option zorder in all four plots defines the order of the
appearance of the plots. For example, a plot having a greater
zorder value appears in front of a plot that has a smaller
zorder value.
Further Readings
For the official documentation of the stable release of the
Statsmodel package and the kernel density estimation, visit
the following webpage:
https://bit.ly/32avvhC
https://bit.ly/3mLm4gy
16.
17. # Null Hypothesis
18. # sample_mean < 19
19. # Alternate Hypothesis
20. # sample_mean > 19
21.
22. pop_stdev = population.std()
23. print("Population Standard Deviation:", pop_stdev)
24. z_test = (sample_mean - population_mean) / (pop_stdev/math.sqrt(sample_size))
25. print("Z test value: ", z_test)
26.
27. confidence_level = 0.95
28. z_critical_val = st.norm.ppf(confidence_level)
29. print("Z critical value: ", z_critical_val)
30.
31. if(z_test > z_critical_val):
32.     print("Null Hypothesis is rejected.")
33. else:
34.     print("Null Hypothesis is accepted.")
Output:
Population Mean: 25.214285714285715
Sample Mean: 24.083333333333332
Population Standard Deviation: 18.020538169498916
Z test value: -0.21740382738025502
Z critical value: 1.6448536269514722
Null Hypothesis is accepted.
Since the p-value, 0.1916 is greater than α/2, we accept the null
hypothesis.
Question 2:
Question 3:
Question 4:
Question 5:
Question 6:
where
Thus,
34.     plt.plot(list1)
35.
36.     return results/n
37.
38. # Calling the function:
39.
40. answer = monte_carlo(100)
41. print("Final value :", answer)
Output:
The flipped coin value is 1
Final value : 0.49
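Only the tail of the listing is shown above. A self-contained sketch of the same Monte Carlo idea, estimating the probability of heads by repeated coin flips (the function body is reconstructed, so treat the details as assumptions; names follow the fragment):

import random
import matplotlib.pyplot as plt

def monte_carlo(n):
    results = 0
    list1 = []                       # running estimate after each flip
    for i in range(n):
        flip = random.randint(0, 1)  # 1 = heads, 0 = tails
        results = results + flip
        list1.append(results / (i + 1))

    plt.plot(list1)
    plt.xlabel("Number of flips")
    plt.ylabel("Estimated P(heads)")
    plt.show()

    return results / n

# Calling the function:
answer = monte_carlo(100)
print("Final value :", answer)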
35.
36.
37. # Use find_MAP to find the maximum a posteriori estimate from a pymc model
38. map_estimate = pm.find_MAP(model=basic_model)
39. print(map_estimate)
Output:
100.00% [19/19 00:00<00:00 logp = -163.64, ||grad|| = 11.014]
{'alpha': array(1.03540327), 'beta': array([0.85459263, 2.29026671]), 'sigma_log__': array(0.0619924), 'sigma': array(1.06395426)}
Output:
Lines 2 and 3 of the code show our initial belief before the run
of the experiment. We update our belief in lines 5 and 6 of
the code by simply adding the number of heads to α and the
number of tails to β. Lines 9 to 12 are used to plot the posterior
distribution. It can be seen that the distribution is shifted
toward the right (toward higher probabilities of heads) because
the value of α is greater than β.
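The listing itself is not reproduced above. A hedged sketch of the update it describes, using SciPy's Beta distribution (the prior parameters and observed counts here are assumptions):

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta

# Initial belief (prior) about the probability of heads
a, b = 2, 2                   # assumed prior parameters

# Update the belief with the observed data
num_heads, num_tails = 12, 5  # assumed experiment outcome
a = a + num_heads
b = b + num_tails

# Plot the posterior Beta(a, b) distribution
x = np.linspace(0, 1, 200)
plt.plot(x, beta.pdf(x, a, b))
plt.xlabel("Probability of heads")
plt.ylabel("Density")
plt.show()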
Output:
Lines 2 and 3 of the code use the btdtri() function from the
SciPy.special package. This function computes the quantiles
of the Beta distribution specified by the inputs a and b
(corresponding to α and β). It means that it uses the inverse
cumulative distribution function (the quantile function) of the
Beta distribution. Details of the quantile function are given in
Chapter 5. For example, the 0.025 and 0.975 values in these lines
compute the points on the Beta distribution that correspond
to the 2.5th and 97.5th percentiles of the distribution.
The difference 97.5−2.5 = 95 gives the 95 percent credible
interval. The 95 percent credible interval means that there is
a 95 percent probability that our parameter (probability of
success or heads) falls within the range of 0.625 and 0.799.
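A minimal sketch of that computation (the posterior parameters a and b below are assumptions; with the book's actual values, the reported interval is 0.625 to 0.799):

from scipy.special import btdtri

a, b = 14, 7                  # assumed posterior Beta parameters

lower = btdtri(a, b, 0.025)   # 2.5th percentile of Beta(a, b)
upper = btdtri(a, b, 0.975)   # 97.5th percentile of Beta(a, b)
print(lower, upper)           # the 95 percent credible interval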
5! = 5 × 4 × 3 × 2 × 1 = 120.
Output:
1. import pymc3 as pm
2. model = pm.Model()
3. with model:
4.
5.     # Define the prior of the parameter lambda.
6.     lam = pm.Gamma('lambda', alpha=a, beta=b)
7.     # Define the likelihood function.
8.     y_obs = pm.Poisson('y_obs', mu=lam, observed=y)
9.     # Consider 1000 draws and 2 chains.
10.     trace = pm.sample(draws=1000, chains=2)
Output:
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 4 jobs)
NUTS: [lambda]
To plot the output trace and the posterior, we may type the
following commands:
1. pm.traceplot(trace)
2.
3. pm.plot_posterior(trace, hdi_prob=.95)
Output:
34.
35. # Create a Gaussian Classifier
36. model = GaussianNB()
37.
38. # Train the model using the training sets
39. model.fit(features,label)
40.
41. # Predict Output for input 0:Overcast, 2:Mild
42.
43. predicted = model.predict([[0,2]])
44. print("Predicted Value:", predicted)
Output:
Weather: [2 2 0 1 1 1 0 2 2 1 2 0 0 1]
Temp: [1 1 1 2 0 0 0 2 0 2 2 2 1 2]
Play: [0 0 1 1 1 0 1 0 1 1 1 1 1 0]
Combined feature: [[2 1]
[2 1]
[0 1]
[1 2]
[1 0]
[1 0]
[0 0]
[2 2]
[2 0]
[1 2]
[2 2]
[0 2]
[0 1]
[1 2]]
Predicted Value: [1]
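Only lines 34 to 44 of the listing are shown above. A self-contained sketch that reproduces the encoded arrays in the output; the raw weather, temperature, and play lists are reconstructed from the encoded values, so treat them as assumptions:

from sklearn import preprocessing
from sklearn.naive_bayes import GaussianNB

# Raw categorical data (reconstructed to match the encoded output above)
weather = ['Sunny','Sunny','Overcast','Rainy','Rainy','Rainy','Overcast',
           'Sunny','Sunny','Rainy','Sunny','Overcast','Overcast','Rainy']
temp = ['Hot','Hot','Hot','Mild','Cool','Cool','Cool','Mild','Cool','Mild',
        'Mild','Mild','Hot','Mild']
play = ['No','No','Yes','Yes','Yes','No','Yes','No','Yes','Yes','Yes','Yes',
        'Yes','No']

# Encode the string labels as integers
le = preprocessing.LabelEncoder()
weather_encoded = le.fit_transform(weather)
temp_encoded = le.fit_transform(temp)
label = le.fit_transform(play)

# Combine the two features into (weather, temp) pairs
features = list(zip(weather_encoded, temp_encoded))

# Train the Gaussian Naive Bayes classifier and predict for 0:Overcast, 2:Mild
model = GaussianNB()
model.fit(features, label)
predicted = model.predict([[0, 2]])
print("Predicted Value:", predicted)   # [1] means Play = Yes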
Question 2:
Question 3:
Question 4:
Question 5:
Question 6:
Question 7:
Question 8:
Question 9:
Question 10:
The hypotheses are H0: p = p0 and Ha: p ≠ p0,
where p0 and p are the conversion rates of the old and the
new designs, respectively. The null hypothesis H0 can be
interpreted as: the probabilities of conversion in the control
and the test groups are equal. The alternative hypothesis Ha
states that the probabilities of conversion in the control and
the test groups are not equal.
𝛼 = (1−0.95) = 0.05.
1. df = pd.read_csv('ab_data.csv')
2. df.info()
3. df.head(10)
Output:
<class ‘pandas.core.frame.DataFrame’>
RangeIndex: 294478 entries, 0 to 294477
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 user_id 294478 non-null int64
1 timestamp 294478 non-null object
2 group 294478 non-null object
3 landing_page 294478 non-null object
4 converted 294478 non-null int64
dtypes: int64(2), object(3)
memory usage: 11.2+ MB
The first line gets the indices of the users who appear in
multiple sessions. These indices are used in the isin() function,
which flags the values in df['user_id'] that are in the given
list, and the ~ at the beginning is a not operator.
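A hedged sketch of the de-duplication step described above (variable names are assumptions):

# Users that appear in more than one session
session_counts = df['user_id'].value_counts()
multi_users = session_counts[session_counts > 1].index

# Keep only the rows whose user_id is NOT in that list (~ negates isin)
df = df[~df['user_id'].isin(multi_users)]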
treatment 3064
control 3064
Name: group, dtype: int64
The results show that the p-value = 0.700 is greater than the
set significance level 𝛼 = 0.05. This implies that the observed
difference is quite likely under the null hypothesis, so we shall
not reject the null hypothesis 𝐻0. In conclusion, the new web
page (new design) does not perform significantly better than
the old design.
1. # Importing packages
2. import pandas as pd
3. import numpy as np
4. import matplotlib.pyplot as plt
5. %matplotlib inline
6. import seaborn as sns
7. import scipy # Scipy for statistics
8. # PyMC3 for Bayesian Inference
9. import pymc3 as pm
10.
11. # Loading and displaying datasets
12. exercise = pd.read_csv('exercise.csv') # give path of the dataset files
13. calories = pd.read_csv('calories.csv')
14. df = pd.merge(exercise, calories, on = 'User_ID')
15. df = df[df['Calories'] < 300]
16. df = df.reset_index()
17. df['Intercept'] = 1
18. df.head(10)
Output:
The following code plots the calories burnt against the time
spent in exercise.
1. plt.figure(figsize=(8, 8))
2.
3. plt.plot(df['Duration'], df['Calories'], 'rx');
4. plt.xlabel('Duration (min)', size = 15); plt.ylabel('Calories', size = 15);
5. plt.title('Calories burned vs Duration of Exercise', size = 15);
Output:
Each red x mark on this graph shows one data point
(observation) in the combined dataset. The time is measured
in minutes along the horizontal axis.
Output:
The blue color line that passes almost in the middle through
the observed data points is the OLS fit to our data. Note
that a line in a 2-dimensional space is described by two
parameters as in the simple linear regression model. Had we
used multiple input features, we would have to estimate more
than two parameters of the linear regression model in higher
dimensions.
The black line corresponds to the OLS fit, whereas the blue
lines correspond to the Bayesian posterior fit lines. It can be
seen that the Bayesian estimate of the parameters closely
matches the frequentist estimates.
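The listing that produced these Bayesian fit lines is not reproduced here. A hedged sketch of a PyMC3 linear model for the same data, continuing the imports and df from the earlier project code (the sampler settings are assumptions):

with pm.Model() as linear_model:
    # Linear model: Calories ~ Intercept + slope * Duration
    pm.glm.GLM.from_formula('Calories ~ Duration', df)
    # Draw posterior samples
    trace = pm.sample(1000, tune=1000)

# Each posterior sample of (intercept, slope) defines one of the blue fit lines
pm.plot_posterior(trace)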
Output:
Chapter 1
Question 1: A
Question 2: C
Question 3: B
Question 4: A
Question 5: B
Question 6: C
Question 7: B
Question 8: C
Question 9: C
Question 10: C
Chapter 2
Question 1: B
Question 2: A
Question 3: C
Question 4: C
Question 5: C
Question 6: D
Question 7: B
Question 8: C
Question 9: D
Question 10: D
Chapter 3
Question 1: C
Question 2: A
Question 3: B
Question 4: B
Question 5: C
Question 6: C
Question 7: A
Question 8: C
Question 9: B
Question 10: B
Question 11: B
Question 12: B
Chapter 4
Question 1: D
Question 2: B
Question 3: B
Question 4: C
Question 5: C
Question 6: D
Question 7: C
Question 8: A
Chapter 5
Question 1: C
Question 2: D
Question 3: B
Question 4: A
Question 5: D
Question 6: B
Question 7: D
Question 8: C
Chapter 6
Question 1: B
Question 2: A
Question 3: B
Question 4: C
Question 5: B
Question 6: D
Question 7: A
Chapter 7
Question 1: A
Question 2: B
Question 3: A
Question 4: A
Question 5: A
Question 6: C
Chapter 8
Question 1: C
Question 2: D
Question 3: B
Question 4: C
Question 5: A
Question 6: B
Question 7: B
Question 8: C
Question 9: A
Question 10: B