How to Find the Best Theoretical Distribution for Your Data

Knowing the underlying data distribution is an essential step for data modeling and has many applications, such as anomaly detection, synthetic data creation, and data compression.

Published in Towards Data Science

19 min read · Feb 3, 2023

Knowing the underlying (probability) distribution of your data has many modeling advantages. The easiest way to identify a candidate distribution is to visually inspect the random variable(s) using a histogram. With a candidate distribution in hand, various plots can be created, such as the probability density function (PDF) plot, the cumulative distribution function (CDF) plot, and the QQ plot. However, to determine the exact distribution parameters (e.g., loc, scale), it is essential to use quantitative methods. In this blog, I will describe why it is important to determine the underlying probability distribution for your data set, what the differences are between parametric and non-parametric distributions, how to determine the best fit using a quantitative approach, and how to confirm it using visual inspections. Analyses are performed using the distfit library, and an accompanying notebook is provided for easy access and experimenting.
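To make the quantitative idea concrete before diving in, here is a minimal sketch of fitting a few candidate distributions and ranking them by how well their density matches the histogram. It uses scipy.stats directly rather than distfit. The helper name `fit_candidates`, the three candidates, the number of bins, and the residual sum of squares (RSS) score are all assumptions made for illustration.

```python
import numpy as np
from scipy import stats


def fit_candidates(data, candidates=("norm", "expon", "uniform"), bins=50):
    """Hypothetical helper: fit each candidate and score it by RSS against the histogram."""
    # Empirical density: normalized histogram heights, evaluated at the bin centers.
    density, edges = np.histogram(data, bins=bins, density=True)
    centers = (edges[:-1] + edges[1:]) / 2.0

    results = []
    for name in candidates:
        dist = getattr(stats, name)
        # Maximum likelihood fit; for these candidates the parameters are (loc, scale).
        params = dist.fit(data)
        pdf = dist.pdf(centers, *params)
        # Lower RSS means the fitted PDF follows the histogram more closely.
        rss = float(np.sum((density - pdf) ** 2))
        results.append({"name": name, "params": params, "rss": rss})

    # Best fit (lowest RSS) first.
    return sorted(results, key=lambda r: r["rss"])


# Synthetic example data (an assumption): normal with loc=10 and scale=2.
rng = np.random.default_rng(seed=0)
X = rng.normal(loc=10.0, scale=2.0, size=5000)

ranked = fit_candidates(X)
best = ranked[0]
print(best["name"], best["params"])
```

The histogram gives the visual impression, while the fitted loc and scale come from the quantitative step. A score such as RSS is only one possible choice for ranking candidates, and the best-ranked fit should still be confirmed visually.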

Written by Erdogan Taskesen