Statistical Modeling
1. Supervised learning
In the supervised learning model, the algorithm learns from a labeled data set, with an answer key it uses to measure its accuracy as it trains on the data. Supervised learning techniques in statistical modeling include regression and classification.
2. Unsupervised learning
In the unsupervised learning model, the algorithm is given unlabeled data and attempts to extract features and determine patterns on its own. Clustering algorithms and association rules are two common examples of unsupervised learning.
A statistical model specifies a probabilistic model for the data, so that quantities such as the effects of predictor variables can be identified and interpreted. It establishes the magnitude, significance, and scale of the relationships between variables. Models based on machine learning, by contrast, are more empirical.
Job opportunities
Data science positions that involve machine learning demand statistical data analysis skills. Interviewers may ask you to solve some typical statistics problems.
With a proper background in statistics and math, it is possible to
optimize linear regression models and understand how decision trees
calculate impurity at each node. These are some of the top reasons
machine learning needs statistics. Taking online courses on statistics can
get you started.
Example data set: temperature (Temp) versus sales (Sales):
Temp   Sales
12     200
14     200
16     300
18     400
20     400
22     500
23     550
25     600
We can place the line "by eye": try to keep the line as close as possible to all the points, with a similar number of points above and below it. But for better accuracy, let's see how to calculate the line using Least Squares Regression.
The Line
Our aim is to calculate the values m (slope) and b (y-intercept) in
the equation of a line:
y = mx + b
Where:
y = how far up
x = how far along
m = Slope or Gradient (how steep the line is)
b = the Y Intercept (where the line crosses the Y axis)
Steps
To find the line of best fit for N points:
Step 1: For each (x,y) point calculate x² and xy
Step 2: Sum all x, y, x² and xy, which gives us Σx, Σy, Σx² and Σxy (Σ
means "sum up")
Step 3: Calculate Slope m:
m = (N Σ(xy) − Σx Σy) / (N Σ(x²) − (Σx)²)
(N is the number of points.)
Step 4: Calculate Intercept b:
b = (Σy − m Σx) / N
Step 5: Assemble the equation of a line
y = mx + b
Done!
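The five steps above can be sketched in a few lines of plain Python (the function name least_squares_line is my own; the data used to try it out is the ice-cream example that follows):

```python
def least_squares_line(points):
    """Return (m, b) for the best-fit line y = mx + b through the points."""
    n = len(points)
    # Steps 1 and 2: the four sums
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_x2 = sum(x * x for x, _ in points)
    sum_xy = sum(x * y for x, y in points)
    # Step 3: slope
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # Step 4: intercept
    b = (sum_y - m * sum_x) / n
    return m, b

# Step 5: assemble y = mx + b for the sunshine/ice-cream data
m, b = least_squares_line([(2, 4), (3, 5), (5, 7), (7, 10), (9, 15)])
print(round(m, 3), round(b, 3))  # 1.518 0.305
```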
Example
Let's have an example to see how to do it!
Example: Sam found how many hours of sunshine vs how many ice
creams were sold at the shop from Monday to Friday:
"y"
"x"
Ice
Hours of
Creams
Sunshine
Sold
2 4
3 5
5 7
7 10
9 15
Let us find the best m (slope) and b (y-intercept) that suit that data:
y = mx + b
Steps 1 and 2: with N = 5 points, the sums are Σx = 26, Σy = 41, Σx² = 168 and Σxy = 263.
Step 3: m = (5×263 − 26×41) / (5×168 − 26²) = (1315 − 1066) / (840 − 676) = 249/164 ≈ 1.518
Step 4: b = (41 − (249/164)×26) / 5 ≈ 0.305
Step 5: y = 1.518x + 0.305
Nice fit!
Sam hears the weather forecast which says "we expect 8 hours of sun
tomorrow", so he uses the above equation to estimate that he will sell
y = 1.518 × 8 + 0.305 = 12.45 Ice Creams
Sam makes fresh waffle cone mixture for 14 ice creams just in case.
Yum.
How does it work?
It works by making the total of the squares of the errors as small as possible (that is why it is called "least squares").
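This can be checked numerically. A minimal sketch (the helper name sse and the comparison lines are my own choices), using the ice-cream data and the fitted line y = 1.518x + 0.305:

```python
def sse(m, b, points):
    """Total of the squared vertical errors between the points and y = mx + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in points)

data = [(2, 4), (3, 5), (5, 7), (7, 10), (9, 15)]

# The fitted line has a smaller total squared error than nearby alternatives.
print(sse(1.518, 0.305, data))  # about 3.19
print(sse(1.6, 0.0, data))      # larger
print(sse(1.518, 0.5, data))    # larger
```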
Activity
Suppose we want to assess the association between BMI and systolic blood pressure using data from an exam attended by a total of n = 3,539 participants. Their mean systolic blood pressure was 127.3 with a standard deviation of 19.0, and the mean BMI in the sample was 28.2 with a standard deviation of 5.3. A simple linear regression analysis reveals the following:
Independent Variable   Regression Coefficient   t-statistic   p-value
Intercept              108.28                   62.61         0.0001
BMI                    0.67                     11.06         0.0001
Solution
𝑌̂ = 108.28 + 0.67(𝐵𝑀𝐼)
Where 𝑌̂ is the predicted or expected systolic blood pressure. The regression coefficient associated with BMI is 0.67, suggesting that each one-unit increase in BMI is associated with a 0.67-unit increase in systolic blood pressure. The association between BMI and systolic blood pressure is also statistically significant (p=0.0001).
More generally, a multiple linear regression model with k predictors takes the form:
𝑌̂ = 𝛽0 + 𝛽1 𝑋1 + 𝛽2 𝑋2 + 𝛽3 𝑋3 + ⋯ + 𝛽𝑘 𝑋𝑘
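As a quick illustration, the fitted simple regression above can be used to make predictions (predicted_sbp is a hypothetical helper name; the coefficients come from the regression table in the activity):

```python
# Prediction from the fitted equation Y-hat = 108.28 + 0.67 * BMI.
def predicted_sbp(bmi):
    """Predicted systolic blood pressure for a given BMI."""
    return 108.28 + 0.67 * bmi

# At the sample mean BMI of 28.2, the prediction is close to the
# sample mean systolic blood pressure of 127.3, as expected:
print(round(predicted_sbp(28.2), 2))  # 127.17
```

With unrounded coefficients, a least squares line always passes exactly through the point of sample means.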