
Deep diving statistical distributions with Python for Data Scientists

Astha Puri
Towards Data Science
8 min read · Sep 28, 2021


For any data scientist, it is important to understand statistics, data distributions, and how they apply to real-world scenarios. Beyond theoretical knowledge, it also helps to be able to use them in modeling, data analysis, and driving insights.

Before diving into statistical distributions, here is a quick recap on random number generation with Python.


We can use numpy as well as the built-in random module in Python to do the same job of generating random numbers.
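As a quick sketch (the variable names here are just for illustration), both modules can produce the same kinds of draws:

```python
import random
import numpy as np

random.seed(42)      # seed both generators for reproducibility
np.random.seed(42)

r1 = random.random()                    # one float in [0.0, 1.0)
r2 = np.random.random(5)                # array of 5 floats in [0.0, 1.0)
r3 = random.randint(1, 10)              # integer from 1 to 10, both inclusive
r4 = np.random.randint(1, 10, size=3)   # integers in [1, 10): high is exclusive
```

Note the subtle difference: `random.randint` includes the upper bound, while `np.random.randint` excludes it.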

We can also draw random numbers from a particular distribution such as normal, uniform, etc. I will go into those details under each specific distribution. We will use the SciPy library in Python to generate the statistical distributions.

Uniform Distribution

In this sort of distribution, values within a specific range are equally likely to occur. Values outside that given range never occur. Let’s generate 100,000 numbers from a uniform distribution and plot them to visualize this. The function used to generate random numbers from a distribution is called rvs. To define the bounds, we use loc and scale.
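The generation step described above can be sketched as follows; `loc=0` and `scale=10` give the range [0, 10] (the histogram call is one common way to visualize the result):

```python
from scipy import stats

# draw 100,000 values uniformly from the range [0, 10]
uniform_data = stats.uniform.rvs(size=100000,
                                 loc=0,     # start of the range
                                 scale=10)  # width of the range

# every value lands inside [0, 10]; plot with pd.DataFrame(uniform_data).hist()
print(uniform_data.min(), uniform_data.max())
```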

We get this graph. We can see that between 0 and 10, every number is equally likely to occur but outside it, the probability of every number is 0.

uniform distribution

We can draw further insights from this distribution. If we want to find the probability that an observation drawn from this distribution falls below a specific value, we can do that using cdf. It gives the area under the density curve up to a certain value on the x axis. Say we draw the cutoff at 5 and want to know the area under the curve up to x=5, i.e.:

using cdf on uniform distribution
from scipy import stats

stats.uniform.cdf(x=5.0,   # cutoff
                  loc=0,   # distribution start
                  scale=10) # width of the range

This will give the output 0.5, which means that if we pick an observation from this particular uniform distribution, there is a 50% chance that it falls between 0 and 5.

The inverse of cdf is ppf. Given a probability, ppf gives the cutoff on the x axis. For example, to get the cutoff value below which we have a 30% chance of drawing an observation:

stats.uniform.ppf(q=0.3,   # probability cutoff
                  loc=0,   # distribution start
                  scale=10) # width of the range (end = loc + scale)

This will result in the value 3. So we can slice the distribution at x=3 such that 30% of the area lies to the left of that cutoff.
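Since ppf is the inverse of cdf, the two should round-trip; a quick sanity check:

```python
from scipy import stats

# ppf and cdf are inverses of each other
cutoff = stats.uniform.ppf(q=0.3, loc=0, scale=10)    # the 30% cutoff: 3.0
prob = stats.uniform.cdf(x=cutoff, loc=0, scale=10)   # round-trips back to 0.3
print(cutoff, prob)
```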

To get the actual probability density at a given value x, we use pdf. This gives the height of the distribution at that value x. Since the uniform distribution is flat, in our case above, all x values between 0 and 10 have the same probability density, and all points outside of this range have probability density 0.
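A quick check of the flat density, using the same bounds as above (the height inside the range is 1/scale = 0.1):

```python
from scipy import stats

# flat density: height is 1/scale = 0.1 everywhere inside [0, 10]
inside = stats.uniform.pdf(x=5.0, loc=0, scale=10)
outside = stats.uniform.pdf(x=12.0, loc=0, scale=10)
print(inside, outside)  # 0.1 and 0.0
```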

NumPy can also draw uniform random numbers. With no arguments, this returns a single value from [0, 1):

np.random.uniform()

To get random numbers from the uniform distribution used above, use:

np.random.uniform(0,       # lower limit
                  10,      # upper limit
                  (3, 4))  # size

This results in a 3×4 array of random values between 0 and 10.

Normal/Gaussian Distribution

Same as above, we can use SciPy's rvs to generate this distribution:

from scipy import stats

stats.norm.rvs(size=10000)  # standard normal: mean 0, std dev 1

Here is what it looks like:

normal distribution

We can use cdf to get area under the curve below a cutoff value on x axis. For example:

print(stats.norm.cdf(x=0.4))

Similarly, we can use ppf to get the cutoff on the x axis for a certain probability (% of area under the curve). For example:

#Find the quantile for the 97.5% cutoff
print(stats.norm.ppf(q=0.975))

Generating random numbers from some normal distribution can be done in multiple ways.

If we want to generate random numbers from a normal distribution of a particular mean and standard deviation:

np.random.normal(1,      # mean
                 2,      # standard deviation
                 (3, 2)) # size

If we want to generate uniform random numbers between 0 and 1 (note that rand samples uniformly, not normally):

np.random.rand(3,2)

If we want to generate random numbers from a standard normal distribution

np.random.randn(3,2)

Binomial Distribution

This is a discrete probability distribution, with only 2 possible outcomes per experiment. We can use the binomial distribution to find the probability of success: it tells you how likely it is to get a given number of successes in n events. On a histogram, the x axis would be the number of successes in a trial and the y axis the number of trials with that count. We need 2 parameters to define a binomial distribution: the probability of success in a single event and the number of events per trial. A trial can consist of multiple events; for example, flipping a fair coin 10 times = 1 trial.
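As a sanity check on these two parameters, SciPy's pmf (used a bit further below) matches the textbook binomial formula C(n, k) · p^k · (1−p)^(n−k); the values here are for the fair-coin case:

```python
import math
from scipy import stats

n, p, k = 10, 0.5, 5
# textbook formula: C(n, k) * p**k * (1-p)**(n-k)
by_hand = math.comb(n, k) * p**k * (1 - p)**(n - k)
by_scipy = stats.binom.pmf(k=k, n=n, p=p)
print(by_hand, by_scipy)  # both about 0.246
```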

from scipy import stats
import pandas as pd

coin = stats.binom.rvs(size=10000,  # number of trials
                       n=10,        # number of flips in a trial
                       p=0.5)       # probability of success (say, getting heads)

print(pd.crosstab(index="counts", columns=coin))

This shows that out of all 10,000 trials, only 10 produced no heads at all. That makes sense because it is a fair coin with equal probability of heads and tails. We can see that in 2,442 trials we got exactly 5 heads, and in only 10 of the 10,000 did all 10 flips come up heads.

pd.DataFrame(coin).hist()
fair coin flips

If we now flip an unfair coin with an 80% chance of heads, the graph should shift to the right, with most trials producing 7 to 9 heads (the tail of the histogram now stretches to the left).

from scipy import stats
import pandas as pd

coin = stats.binom.rvs(size=10000,  # number of trials
                       n=10,        # number of flips in a trial
                       p=0.8)       # probability of success

print(pd.crosstab(index="counts", columns=coin))

It is interesting to note that since this coin is biased toward heads, out of 10,000 trials there was not a single one in which we got only 1 head. The fewest heads in a trial was 2, and that happened just once. We can clearly see the bias toward more heads per trial in this case.

pd.DataFrame(coin).hist()
biased coin flips

As with other distributions, cdf gives the cumulative probability of successes up to a cutoff. For example, if we want to find the probability of getting 5 heads or fewer in 10 flips of the biased coin:

stats.binom.cdf(k=5,   # probability of 5 successes or fewer
                n=10,  # with 10 flips
                p=0.8) # success probability 0.8

Probability of MORE than 5 successes would then be:

1 - stats.binom.cdf(k=5,   # probability of 5 successes or fewer
                    n=10,  # with 10 flips
                    p=0.8) # success probability 0.8

In discrete distributions like this one, we have pmf instead of pdf. pmf stands for probability mass function; it gives the probability of observing exactly k successes.

stats.binom.pmf(k=5,   # probability of exactly 5 successes
                n=10,  # with 10 flips
                p=0.5) # success probability 0.5

We can generate random numbers from a particular binomial distribution by giving the parameters n (number of trials) and p (probability of success). For example:

np.random.binomial(n=52, p=0.7, size=(2,3))

Geometric Distribution

This is also a discrete distribution. It models the number of trials it takes for the first success to occur. For example, if success = heads, how many flips does it take to get a head with a fair coin?

from scipy import stats
import pandas as pd

heads = stats.geom.rvs(size=10000,  # generate 10,000 trials of flips
                       p=0.5)       # fair coin

print(pd.crosstab(index="counts", columns=heads))

In the crosstab we can see that about half the time it takes only 1 flip to get a head. That makes sense, since we are flipping a fair coin. The graph is therefore right skewed.

pd.DataFrame(heads).hist()
geometric distribution

We can use cdf to draw more insight. For example, what is the probability of success within the first 3 flips?

three_flip = stats.geom.cdf(k=3,   # success within the first 3 flips
                            p=0.5)
print(three_flip)

What is the probability that we get success in exactly 2 flips? This can be solved using pmf.

stats.geom.pmf(k=2,   # exactly 2 flips to the first success
               p=0.5)
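This matches the textbook geometric formula (1−p)^(k−1) · p, the chance of k−1 failures followed by a single success:

```python
from scipy import stats

p, k = 0.5, 2
# success on exactly flip k = (k-1) failures followed by one success
by_hand = (1 - p)**(k - 1) * p
by_scipy = stats.geom.pmf(k=k, p=p)
print(by_hand, by_scipy)  # both 0.25
```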

Exponential Distribution

This is the continuous version of the geometric distribution. It models the amount of time it takes for a certain event to occur, given an occurrence rate.

For example, if occurrence rate is once per hour, what is the probability of waiting for more than an hour for an event to occur?

prob = stats.expon.cdf(x=1,      # waiting-time cutoff of 1 hour
                       scale=1)  # scale = 1 / arrival rate, i.e. the mean wait
1 - prob  # probability of waiting MORE than an hour
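We can sanity-check this with a simulation; the fraction of simulated waits longer than an hour should land near e^−1 ≈ 0.368 (the seed is chosen arbitrarily for reproducibility):

```python
import numpy as np
from scipy import stats

np.random.seed(0)  # reproducible draw
# scale is the MEAN wait (1 / arrival rate); a rate of once per hour -> scale=1
waits = stats.expon.rvs(scale=1, size=100000)
frac_over_hour = (waits > 1).mean()
print(frac_over_hour)  # close to exp(-1), about 0.368
```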

Poisson Distribution

It models the probability of a given number of events occurring within a fixed time interval. For example, if a waiting room has an arrival rate of once per hour, how many arrivals happen in an hour?

from scipy import stats
import pandas as pd

arr = stats.poisson.rvs(size=10000,
                        mu=1)  # average number of arrivals per hour

print(pd.crosstab(index="counts", columns=arr))

It is interesting to see that we often get 0 arrivals in an hour even when the arrival rate is once per hour. We also sometimes see more arrivals; these are probably the busier hours.

pd.DataFrame(arr).hist()

So in such a case, we might want, say, more chairs in the waiting room for these busy hours. An arrival rate of once per hour can be deceptive here, and we can run into a shortage of resources.
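To make that concrete, here is a short sketch (the 3-chair threshold is a hypothetical number) of sizing the room using the complement of the cdf:

```python
from scipy import stats

mu = 1  # average arrivals per hour
# chance that an hour sees MORE than 3 arrivals (complement of the cdf)
p_over_3 = 1 - stats.poisson.cdf(k=3, mu=mu)
print(p_over_3)  # about 0.019, so roughly 2% of hours would overflow 3 chairs
```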

With statistical distributions, we gain a lot more insight into our data and the business for which we are doing this sort of modeling.
