Programming Python Statistics
Tate Knight
12/4/16
COMP112
Section 5
Abstract
For this project, I used Python to create a small statistical package. My goals for this
program were threefold: (i) to create a visual representation of data, (ii) to return basic
descriptive values for the dataset, and (iii) to calculate the probability of a certain observation
belonging to the dataset. The user inputs a dataset, and the program returns the following: a
histogram of the dataset, a list of descriptive statistics of the dataset, and an opportunity to
calculate the probability that an observation belongs to the dataset. Though these techniques are
only a fraction of the full range of statistical capabilities and analyses, they are nonetheless
important for understanding the basic structure of a dataset.
Introduction
The analysis of data is a common practice in all disciplines. Whether calculating grades
for a test or analyzing cell samples, statistics are a powerful tool for understanding the driving
forces behind data. Understanding the structure of a sample dataset allows for inferences about
the population it describes. Statistics is simply a method of predicting population-scale
trends. The two main values that account for the structure of a dataset are the mean and standard
deviation. The mean describes the most likely observation within the dataset, while the standard
deviation describes the spread of the data. The majority of sample datasets are normally
distributed; that is, the frequency of observations of a sample is greatest towards the mean of the
data, and is lowest towards the tails. Thus, normally distributed data resembles a bell-shaped
curve, or more formally, a Gaussian distribution. The Gaussian distribution describes a
probability density function based on the mean and standard deviation of a dataset (Figure 1).
The mean determines the center of the distribution, while the standard deviation determines its
width. The probability that an observation belongs to a dataset is derived from the Gaussian
distribution; for example, 95.4% of observations in a dataset fall within two standard
deviations of the mean. This relationship between units of standard deviation and
probability underlies the z-scores discussed in the methods section.
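That relationship can be sketched in a few lines of Python, using only the standard library's math module (the function names here are illustrative, not from the original program):

```python
import math

def z_score(x, mean, sd):
    # Number of standard deviations the observation x lies from the mean
    return (x - mean) / sd

def prob_within(k):
    # Probability that an observation falls within k standard deviations
    # of the mean of a Gaussian distribution: erf(k / sqrt(2))
    return math.erf(k / math.sqrt(2))

print(round(prob_within(2), 3))  # 0.954, the two-standard-deviation figure above
```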
First, it is important to visualize a dataset before diving into the statistics. A simple
histogram of a dataset describes its general distribution, and reveals any outliers or anomalies
that may need addressing. Second, and most importantly, a user must know the values which
describe the data. These values include, for example, the number of observations, the mean, and
the standard deviation. There are other descriptive values as well, but the majority of statistical
analyses are derived from these three important values. Third, it is helpful to know the likelihood
of an observation falling within your dataset. This technique relies on the z-scores of
observations, explained in the methods section. For example, if only 10 out of 20 people show up
to my club meeting for two weeks in a row, what is the likelihood that 15 people will show up on
the third week? Probability is a powerful tool when hypothesizing about future events. Using
Python to create a statistical package results in an efficient and simple approach to basic
statistics; this program would be most helpful for students in an Intro to Statistics class, where
basic descriptive values are frequently calculated.
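As a sketch of how the three core descriptive values might be computed in plain Python (the function name and sample data are illustrative, not taken from the original program):

```python
def describe(data):
    # Count, mean, and sample standard deviation of a dataset
    n = len(data)
    mean = sum(data) / n
    # Sample variance uses n - 1 in the denominator (Bessel's correction)
    variance = sum((x - mean) ** 2 for x in data) / (n - 1)
    return n, mean, variance ** 0.5

n, mean, sd = describe([2, 4, 4, 4, 5, 5, 7, 9])
print(n, mean, round(sd, 2))  # 8 5.0 2.14
```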
Methods: Plotting
I first imported numpy and pylab; this allows the program to open Python Launcher and use
its plotting capabilities. I bin the data relative to the number of observations
(# bins = observations/3), and plot the data on a histogram. If the bins are too wide, the plot
has little substance; if they are too narrow, the observation frequencies in each bin are
difficult to differentiate from one another. The x-axis is the binned observation value, and the
y-axis is the frequency of observations in each bin. Although it is not
especially complicated, it is important for a user to visualize the data before proceeding with
analysis.
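A minimal sketch of this plotting step, assuming the modern matplotlib interface rather than the original pylab calls (the off-screen backend and output filename are illustrative choices, not from the original program):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; the original opens Python Launcher
import matplotlib.pyplot as plt

def bin_count(n):
    # One bin per three observations, as in the report, with a floor of 1
    return max(1, n // 3)

def plot_histogram(data, filename="histogram.png"):
    plt.hist(data, bins=bin_count(len(data)))
    plt.xlabel("Observation value")
    plt.ylabel("Frequency")
    plt.savefig(filename)
```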
Results: Plotting
As expected, Python Launcher opens and displays a histogram of the data (Figure 7).
The plot is nothing particularly exciting; allowing the user to choose the number of bins
or supply axis titles would be a helpful extension.
Discussion
This was a very fulfilling project; it was great to create a program to do basic calculations
that regularly consume my time when analyzing data. This project contains a range of techniques
we learned in class. The program takes user input (a dataset), and returns a plot of the data, a list
of descriptive statistics, and the probability of a value belonging to the dataset. This program
relies on a consistently structured dataset; I used a try/except block to ensure that the data was
in the correct format before any later manipulation. I plotted the data, a technique we used in
one of our labs. I called upon and manipulated a list (the dataset) to calculate the descriptive
statistics. I created a dictionary for the z-values to calculate probabilities of certain observations.
Finally, I used a series of functions and return and print statements to create straightforward
results for the user. The "z" dictionary was the most difficult part. Instead of creating it
myself, I could have downloaded and imported an external module to execute the probability
calculations: rather than calling on my dictionary "z", I would have called the module, supplied
the appropriate values, and gathered the result. However, I wanted to make this program
accessible without the need for many external downloads, so I opted out of that method and
instead embedded the dictionary in the code. Something I wish I had more time to complete
was the creation of the t-tests. This would have required input of two datasets and a matrix of F-
values, critical values that are compared against variations in datasets. However, my program
gives the basic values needed to calculate t-tests separately. I enjoyed this project because there
is little uncertainty to statistics. I had no need for if/else statements, because the math was pure
and straightforward, and there is little choice behind statistics. I knew what needed to be done
with the numbers, and I created a program to do just that.
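A miniature sketch of the "z" dictionary approach described above; the real program's table would be far finer-grained, and the function name here is illustrative. The probabilities are standard normal-table values for the cumulative distribution:

```python
# z-score (rounded to the nearest 0.5) mapped to the cumulative
# probability of the standard normal distribution
z = {0.0: 0.5000, 0.5: 0.6915, 1.0: 0.8413, 1.5: 0.9332, 2.0: 0.9772}

def probability_below(x, mean, sd):
    # Look up the nearest tabulated z-score; observations below the
    # mean use the symmetry of the Gaussian distribution
    score = min(round(abs(x - mean) / sd * 2) / 2, 2.0)
    p = z[score]
    return p if x >= mean else 1 - p

print(probability_below(7, 5, 2))  # 0.8413 (z-score of 1.0)
```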
Conclusion
My project did not create any significant advancements in the field of statistics. Instead,
my project offers an alternate and simple approach to dealing with datasets, which may be
preferable for certain users. However, my project is a glimpse at what Python is capable of; there
is much more to statistics that could be addressed with proper code. For example, there is a
range of tests that can be executed on one dataset or a group of datasets: t-tests, F-tests,
ANOVA, etc. Existing software such as MATLAB and SAS manipulates and analyzes data far
more comprehensively than my program does; however, it was enlightening to discover what
kinds of code structures are needed to create these types of powerful programs. I certainly enjoyed
creating this program, and will expand and refine it for a more tailored use in my classes and
work.
Figure 1: Sample plots of Gaussian distributions, according to mean and sd values