Programming Python Statistics
Tate Knight
12/4/16
COMP112
Section 5
Abstract
For this project, I used Python to create a small statistical package. My goals for this
program were threefold: (i) to create a visual representation of data, (ii) to return basic
descriptive values for the dataset, and (iii) to calculate the probability of a certain observation
belonging to the dataset. The user inputs a dataset, and the program returns the following: a
histogram of the dataset, a list of descriptive statistics of the dataset, and an opportunity to
calculate the probability that an observation belongs to the dataset. Though these techniques are
only a fraction of the full range of statistical capabilities and analyses, they are nonetheless
important for understanding the basic structure of a dataset.
Introduction
The analysis of data is a common practice in all disciplines. Whether calculating grades
for a test or analyzing cell samples, statistics are a powerful tool for understanding the driving
forces behind data. Understanding the structure of a sample dataset allows for inferences about
the population it describes. Statistics is simply a method of predicting population-scale
trends. The two main values that account for the structure of a dataset are the mean and standard
deviation. The mean describes the most likely observation within the dataset, while the standard
deviation describes the spread of the data. The majority of sample datasets are normally
distributed; that is, the frequency of observations of a sample is greatest towards the mean of the
data, and is lowest towards the tails. Thus, normally distributed data resembles a bell-shaped
curve, or more formally, a Gaussian distribution. The Gaussian distribution describes a
probability density function based on the mean and standard deviation of a dataset (Figure 1).
The mean determines the center of the distribution, while the standard deviation determines its
width. The probability that an observation belongs to a dataset is derived from the Gaussian
distribution; for example, 95.4% of observations in a dataset fall within two standard
deviations of the mean. This relationship between units of standard deviation and
probability underlies the z-scores discussed in the methods section.
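That relationship can be sketched in a few lines of Python, using only the standard library's math module (the function names here are illustrative, not from the original program):

```python
import math

def z_score(x, mean, sd):
    # Number of standard deviations the observation x lies from the mean
    return (x - mean) / sd

def prob_within(k):
    # Probability that an observation falls within k standard deviations
    # of the mean of a Gaussian distribution: erf(k / sqrt(2))
    return math.erf(k / math.sqrt(2))

print(round(prob_within(2), 3))  # 0.954, the two-standard-deviation figure above
```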
First, it is important to visualize a dataset before diving into the statistics. A simple
histogram of a dataset describes its general distribution, and reveals any outliers or anomalies
that may need addressing. Second, and most importantly, a user must know the values which
describe the data. These values include, for example, the number of observations, the mean, and
the standard deviation. There are other descriptive values as well, but the majority of statistical
analyses are derived from these three important values. Third, it is helpful to know the likelihood
of an observation falling within your dataset. This technique relies on the z-scores of
observations, explained in the methods section. For example, if only 10 out of 20 people show up
to my club meeting for two weeks in a row, what is the likelihood that 15 people will show up on
the third week? Probability is a powerful tool when hypothesizing about future events. Using
Python to create a statistical package results in an efficient and simple approach to basic
statistics; this program would be most helpful for students in an Intro to Statistics class, where
basic descriptive values are frequently calculated.
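As a sketch of how the three core descriptive values might be computed in plain Python (the function name and sample data are illustrative, not taken from the original program):

```python
def describe(data):
    # Count, mean, and sample standard deviation of a dataset
    n = len(data)
    mean = sum(data) / n
    # Sample variance uses n - 1 in the denominator (Bessel's correction)
    variance = sum((x - mean) ** 2 for x in data) / (n - 1)
    return n, mean, variance ** 0.5

n, mean, sd = describe([2, 4, 4, 4, 5, 5, 7, 9])
print(n, mean, round(sd, 2))  # 8 5.0 2.14
```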
Methods: Plotting
I first imported numpy and pylab; this allows the program to open Python Launcher and use
its plotting capabilities. I bin the data relative to the number of observations
(# bins = observations/3), and plot the data on a histogram. If the bins are too wide, the plot
has little substance; if they are too narrow, the observation frequencies in each bin are
difficult to differentiate from one another. The x-axis is the binned observation value, and the
y-axis is the frequency of observations in each bin. Although it is not
especially complicated, it is important for a user to visualize the data before proceeding with
analysis.
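A minimal sketch of this plotting step, assuming the modern matplotlib interface rather than the original pylab calls (the off-screen backend and output filename are illustrative choices, not from the original program):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; the original opens Python Launcher
import matplotlib.pyplot as plt

def bin_count(n):
    # One bin per three observations, as in the report, with a floor of 1
    return max(1, n // 3)

def plot_histogram(data, filename="histogram.png"):
    plt.hist(data, bins=bin_count(len(data)))
    plt.xlabel("Observation value")
    plt.ylabel("Frequency")
    plt.savefig(filename)
```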
Results: Plotting
As expected, Python Launcher opens and displays a histogram of the data (Figure 7).
The plot is nothing particularly exciting; allowing the user to choose the number of bins
or supply axis titles would be a helpful extension.
Discussion
This was a very fulfilling project; it was great to create a program to do basic calculations
that regularly consume my time when analyzing data. This project contains a range of techniques
we learned in class. The program takes user input (a dataset), and returns a plot of the data, a list
of descriptive statistics, and the probability of a value belonging to the dataset. This program
relies on a consistently structured dataset; I used a try/except block to ensure that the data was
in the correct format before any later manipulation. I plotted the data, a technique we used in
one of our labs. I called upon and manipulated a list (the dataset) to calculate the descriptive
statistics. I created a dictionary for the z-values to calculate probabilities of certain observations.
Finally, I used a series of functions and return and print statements to create straightforward
results for the user. The "z" dictionary was the most difficult part. Instead of creating it
myself, I could have downloaded and imported an external module to execute the probability
calculations: rather than calling on my dictionary "z", I would have called the module, supplied
the appropriate values, and gathered the result. However, I wanted to make this program
accessible without the need for many external downloads, so I opted out of that method and
instead embedded the dictionary in the code. Something I wish I had more time to complete
was the creation of the t-tests. This would have required input of two datasets and a matrix of F-
values, critical values that are compared against variations in datasets. However, my program
gives the basic values needed to calculate t-tests separately. I enjoyed this project because there
is little uncertainty to statistics. I had no need for if/else statements, because the math was pure
and straightforward, and there is little choice behind statistics. I knew what needed to be done
with the numbers, and I created a program to do just that.
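A miniature sketch of the "z" dictionary approach described above; the real program's table would be far finer-grained, and the function name here is illustrative. The probabilities are standard normal-table values for the cumulative distribution:

```python
# z-score (rounded to the nearest 0.5) mapped to the cumulative
# probability of the standard normal distribution
z = {0.0: 0.5000, 0.5: 0.6915, 1.0: 0.8413, 1.5: 0.9332, 2.0: 0.9772}

def probability_below(x, mean, sd):
    # Look up the nearest tabulated z-score; observations below the
    # mean use the symmetry of the Gaussian distribution
    score = min(round(abs(x - mean) / sd * 2) / 2, 2.0)
    p = z[score]
    return p if x >= mean else 1 - p

print(probability_below(7, 5, 2))  # 0.8413 (z-score of 1.0)
```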
Conclusion
My project did not create any significant advancements in the field of statistics. Instead,
my project offers an alternate and simple approach to dealing with datasets, which may be
preferable for certain users. However, my project is a glimpse at what Python is capable of; there
is much more to statistics that could be addressed with proper code. For example, there is a
range of tests that can be executed on one dataset or a group of datasets: t-tests, F-tests,
ANOVA, etc. Existing software such as MATLAB and SAS manipulates and analyzes data far
more comprehensively than my program does; however, it was enlightening to discover what
kinds of code structures are needed to create these types of powerful programs. I certainly enjoyed
creating this program, and will expand and refine it for a more tailored use in my classes and
work.
Figure 1: Sample plots of Gaussian distributions, according to mean and sd values