Steps of Implementation of a GLM
2. Descriptive Statistics:
Compute summary statistics that describe the central tendency (mean, median, mode), the dispersion (variance, standard deviation, range), and the shape of a dataset's distribution (skewness, kurtosis) for each variable.
Functions in Python such as `describe()` in pandas can provide quick insights into your
numerical variables - count, mean, standard deviation, minimum and maximum values, and
the quartiles.
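For example, a minimal sketch (the DataFrame `df` and its columns are hypothetical names to replace with your own data):

```python
import pandas as pd

# Hypothetical example data; replace with your own dataset.
df = pd.DataFrame({"claim_amount": [120.0, 340.5, 89.9, 410.0],
                   "num_claims": [1, 3, 0, 2]})

# describe() summarizes numerical columns: count, mean, std, min, quartiles, max.
print(df.describe())
```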
Time-related variables:
Occurrence and From / To (start / end) dates offer the possibility of creating new quantitative features, for example durations or time between events.
Quantitative variables can be summarized directly with the descriptive statistics above.
4. Data Distribution:
Visualize your data to understand their distributions. Histograms, box plots, and density
plots are useful for continuous variables. For categorical variables, consider bar plots. Check
for skewness and kurtosis in your variables.
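As a rough sketch of how these checks might look with pandas and matplotlib (column names such as `claim_amount` and `region` are illustrative, and `df` is assumed from the step above):

```python
import matplotlib.pyplot as plt

# Histogram and box plot for a continuous variable.
df["claim_amount"].plot(kind="hist", bins=30, title="claim_amount histogram")
plt.show()
df["claim_amount"].plot(kind="box")
plt.show()

# Bar plot for a categorical variable.
df["region"].value_counts().plot(kind="bar", title="region counts")
plt.show()

# Skewness and kurtosis of the numerical columns.
print(df.skew(numeric_only=True))
print(df.kurtosis(numeric_only=True))
```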
5. Missing Values:
Check your data for missing values. Depending on how much data you're missing, you might
decide to impute the missing values or drop the rows or columns that contain them.
Visualizations like the missingno matrix can help in understanding the missingness pattern.
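A quick sketch of how you might inspect missingness (assuming the optional `missingno` package is installed and `df` is your DataFrame):

```python
import missingno as msno

# Count of missing values per column.
print(df.isna().sum())

# Share of missing values per column, largest first.
print(df.isna().mean().sort_values(ascending=False))

# Matrix plot of the missingness pattern.
msno.matrix(df)
```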
6. Outliers:
Detect outliers in your data. Outliers might be errors, but they can also be valuable pieces
of information. Boxplots and scatter plots can help you visually identify outliers.
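For instance, a simple IQR-based check alongside a box plot (a sketch, not a definitive rule; the column name is hypothetical):

```python
import matplotlib.pyplot as plt

# Flag values outside 1.5 * IQR as potential outliers.
q1 = df["claim_amount"].quantile(0.25)
q3 = df["claim_amount"].quantile(0.75)
iqr = q3 - q1
outliers = df[(df["claim_amount"] < q1 - 1.5 * iqr) |
              (df["claim_amount"] > q3 + 1.5 * iqr)]
print(outliers)

# Visual check with a box plot.
df["claim_amount"].plot(kind="box")
plt.show()
```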
7. Feature Engineering Opportunities:
As you're exploring your data, start thinking about possible feature engineering opportunities. Feature engineering is the process of creating new features or modifying existing ones to improve your machine learning model's performance.
Remember, the goal of this stage is to become a subject matter expert on your dataset. The
better you understand your data, the better equipped you'll be to build an effective model.
1. Handle Missing Values:
The first step in preprocessing should be handling missing values, as many algorithms do not handle them well. Here are a few strategies:
- Imputation: Replace missing values with statistical measures such as mean, median or
mode.
- Prediction: Use a machine learning algorithm to predict missing values. This could be a
simple model like linear regression for continuous data or logistic regression for categorical
data.
- Deletion: If only a few rows contain missing values, and if those rows aren’t crucial, they can
be deleted. However, be careful, as this can lead to bias if the missing data is not missing
completely at random.
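A sketch of the imputation and deletion strategies using pandas and scikit-learn (column names are placeholders):

```python
from sklearn.impute import SimpleImputer

# Imputation: replace missing values with the column median.
imputer = SimpleImputer(strategy="median")
df[["claim_amount"]] = imputer.fit_transform(df[["claim_amount"]])

# Mode imputation for a categorical column.
df["region"] = df["region"].fillna(df["region"].mode()[0])

# Deletion: drop any rows that still contain missing values.
df = df.dropna()
```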
2. Encode Categorical Variables:
- One-Hot Encoding: This method creates new columns indicating the presence (or absence)
of each possible value in the original data.
- Label Encoding: This method converts each value in a column to a number. Useful for
ordinal data where the order matters.
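A minimal sketch of both encodings (the `region` and `severity` columns and their category order are hypothetical):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# One-hot encoding: one new 0/1 column per category.
df = pd.get_dummies(df, columns=["region"])

# Label / ordinal encoding: map each category to an integer, respecting the order.
encoder = OrdinalEncoder(categories=[["low", "medium", "high"]])
df["severity_encoded"] = encoder.fit_transform(df[["severity"]]).ravel()
```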
3. Handle Outliers:
Depending on the context and the specific algorithms you want to use, you might want to
handle outliers in your data. Outliers can skew statistical measures and data distributions,
leading to misleading results.
- Winsorization: This method limits the extreme values in the data to reduce the impact of
spurious outliers. However, it doesn’t entirely eliminate the impact of these extreme values.
- Trimming: This completely removes the data outliers from the dataset, but it could lead to
losing valuable information in the case that those outliers were genuine observations and not
errors.
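A sketch of both approaches, assuming SciPy is available and `claim_amount` is a hypothetical column:

```python
from scipy.stats.mstats import winsorize

# Winsorization: cap the lowest and highest 5% of values.
df["claim_amount_w"] = winsorize(df["claim_amount"], limits=[0.05, 0.05])

# Trimming: drop observations outside the 1st-99th percentile range.
low, high = df["claim_amount"].quantile([0.01, 0.99])
df_trimmed = df[df["claim_amount"].between(low, high)]
```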
4. Feature Scaling:
Many machine learning algorithms perform better when numerical input variables are scaled
to a standard range. This includes algorithms that use a weighted sum of the input, like linear
regression, and algorithms that use distance measures, like k-nearest neighbors.
- Normalization / Min-Max Scaling: This method scales and shifts the features so that they
range from zero to one.
- Standardization / Z-Score Normalization: This method standardizes features by removing
the mean and scaling to unit variance.
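A sketch using scikit-learn's scalers (column names are placeholders; pick one of the two scalers rather than chaining both):

```python
from sklearn.preprocessing import MinMaxScaler, StandardScaler

numeric_cols = ["claim_amount", "num_claims"]  # hypothetical columns

# Min-max scaling to the [0, 1] range.
df[numeric_cols] = MinMaxScaler().fit_transform(df[numeric_cols])

# Or, alternatively, standardization to zero mean and unit variance.
df[numeric_cols] = StandardScaler().fit_transform(df[numeric_cols])
```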
5. Feature Engineering:
Creating new informative features can often help improve the performance of machine
learning models. This could involve arithmetic combinations of features, grouping sparse
classes in categorical features, creating interaction features, etc.
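For example, a few hypothetical engineered features (all column names are illustrative):

```python
# Arithmetic combination: claims per policy year.
df["claims_per_year"] = df["num_claims"] / df["policy_years"]

# Interaction feature between two predictors.
df["age_x_exposure"] = df["driver_age"] * df["exposure"]

# Group sparse categories into an "other" bucket.
counts = df["region"].value_counts()
rare = counts[counts < 30].index
df["region_grouped"] = df["region"].where(~df["region"].isin(rare), "other")
```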
6. Train-Test Split:
Finally, you should split your data into a training set and a test set. The training set is used to
train the model, while the test set is used to evaluate the model's performance on unseen
data.
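A typical split with scikit-learn (the 80/20 ratio and the `target` column name are assumptions):

```python
from sklearn.model_selection import train_test_split

X = df.drop(columns=["target"])  # "target" is a placeholder for your response variable
y = df["target"]

# Hold out 20% of the data for the final evaluation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```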
7. Apply Transformations:
Certain machine learning algorithms might assume that your data follows a normal
distribution. If your data is heavily skewed, applying a transformation like a logarithm or
square root can help.
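For instance, a log or square-root transform for a right-skewed, non-negative variable (a sketch; `log1p` is used so that zeros are handled):

```python
import numpy as np

# Log transform for a right-skewed, non-negative variable (hypothetical column).
df["claim_amount_log"] = np.log1p(df["claim_amount"])

# Square-root transform as a milder alternative.
df["claim_amount_sqrt"] = np.sqrt(df["claim_amount"])
```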
Each of these steps may or may not be necessary depending on your data and the specific
algorithms you're planning to use. The best approach is to understand your data and use this
understanding to guide your preprocessing decisions.
Choosing the appropriate GLM family is crucial to a successful model. Here are some
common GLM families and their applications:
1. Gaussian: This family should be chosen when your response variable is normally
distributed. It's used for regression problems where the output is a real value (like the cost of
a house).
2. Binomial: This family should be chosen when your response variable is binary. It's often
used for classification problems where the output is either 0 or 1.
3. Poisson: This family should be chosen when your response variable is a count (i.e., non-
negative integers). It assumes that the mean and variance of your response variable are the
same.
4. Negative Binomial: This family is an alternative to the Poisson family when the variance of
your response variable is larger than the mean (i.e., overdispersion).
5. Gamma: This family should be chosen when your response variable is continuous, positive, and right-skewed, with a variance that grows with the square of the mean (i.e., a roughly constant coefficient of variation). It's often used when modeling time until some event or when modeling amounts.
```python
import statsmodels.api as sm
```
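As an illustration, a minimal sketch of fitting a GLM with statsmodels; the Poisson family, the variable names (`X_train`, `y_train`, `X_test` from the split above), and the use of the array API rather than formulas are assumptions to adapt to your data:

```python
import statsmodels.api as sm

# Add an intercept column, which statsmodels does not include by default.
X_train_const = sm.add_constant(X_train)

# Choose the family according to your response variable, e.g. Poisson for counts.
model = sm.GLM(y_train, X_train_const, family=sm.families.Poisson())
results = model.fit()
print(results.summary())

# Predictions on the held-out test set.
y_pred = results.predict(sm.add_constant(X_test))
```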
After training your model, it's important to evaluate its performance to understand how well
it's doing. Here's how you can do that for your Generalized Linear Model (GLM):
1. Calculate error metrics:
There are many error metrics you could use, but here are a few that are commonly used for
regression problems:
- Mean Absolute Error (MAE): This is the average absolute difference between the predicted
and actual values.
- Root Mean Square Error (RMSE): This is the square root of the average of the squared
differences between the predicted and actual values. It can be more sensitive to large errors
compared to MAE because it squares the differences.
- R-squared: This represents the proportion of variance in the dependent variable that can
be explained by the independent variables. The higher the R-squared, the better the model
fits your data.
- Adjusted R-squared: This is similar to R-squared but takes the number of predictors into
account. It's particularly useful when comparing models with different numbers of predictors.
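A sketch of computing these metrics with scikit-learn, assuming `y_test`, `y_pred`, and `X_test` from the steps above:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

# Adjusted R-squared accounts for the number of predictors p.
n, p = X_test.shape
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(f"MAE: {mae:.3f}, RMSE: {rmse:.3f}, R2: {r2:.3f}, adjusted R2: {adj_r2:.3f}")
```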
2. Plot residuals:
Residual plots can be very useful for understanding your model's performance. The residuals
are the differences between the actual and predicted values. Ideally, you'd want your
residuals to be randomly scattered around zero.
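A simple residuals-vs-predicted sketch with matplotlib (reusing `y_test` and `y_pred` from above):

```python
import matplotlib.pyplot as plt

residuals = y_test - y_pred

plt.scatter(y_pred, residuals, alpha=0.5)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("Predicted values")
plt.ylabel("Residuals")
plt.title("Residuals vs predicted values")
plt.show()
```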
3. Check assumptions:
GLMs make several assumptions (e.g., about the distribution of the residuals). It's important
to check these assumptions because if they're violated, your model's predictions could be
unreliable.
You can check the normality of the residuals by using a Q-Q plot and the homoscedasticity by
using a scale-location plot:
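For example, a sketch reusing the residuals computed above; the scale-location plot uses the square root of the absolute standardized residuals:

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Q-Q plot: points should lie close to the 45-degree line if the residuals are normal.
sm.qqplot(residuals, line="45", fit=True)
plt.show()

# Scale-location plot: the spread should be roughly constant across predicted values.
std_resid = (residuals - residuals.mean()) / residuals.std()
plt.scatter(y_pred, np.sqrt(np.abs(std_resid)), alpha=0.5)
plt.xlabel("Predicted values")
plt.ylabel("sqrt(|standardized residuals|)")
plt.title("Scale-location plot")
plt.show()
```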
Remember, evaluating your model isn't just about getting good numbers. It's also about
understanding your model's weaknesses, validating its assumptions, and gaining insights to
help improve it.
6. Feature Selection:
Feature selection is a critical process in machine learning, which involves selecting the most
useful features (input variables) to use in modeling. Effective feature selection can result in
simpler, more interpretable models that perform better on unseen data.
Here are some methods that can be used for feature selection with a Generalized Linear
Model (GLM):
1. Univariate Selection:
One common method for univariate selection is the chi-squared test. This test is used to
determine if there's a significant association between two categorical variables. For numerical
variables, you might use a correlation coefficient (like Pearson's) or a simple linear regression.
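A sketch of univariate selection with scikit-learn; `f_regression` is used here for a continuous target, `chi2` would be the choice for non-negative features and a categorical target, and `k=10` is an arbitrary placeholder:

```python
from sklearn.feature_selection import SelectKBest, chi2, f_regression

# Score each feature with a univariate linear regression against the target.
selector = SelectKBest(score_func=f_regression, k=10)
X_train_selected = selector.fit_transform(X_train, y_train)

# Names of the retained features.
print(X_train.columns[selector.get_support()])
```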
2. Regularization (Lasso and Ridge):
Both L1 (Lasso) and L2 (Ridge) regularization work by penalizing the magnitude of the coefficients of
features and at the same time minimizing the error between the predicted and actual
observations. The key difference is that Lasso shrinks the less important feature’s coefficient
to zero, effectively excluding some features. Ridge regression, on the other hand, doesn't
exclude them but it does reduce their impact.
```python
from sklearn.linear_model import LassoCV
```
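A sketch of Lasso-based selection, assuming standardized numeric features in `X_train` and `y_train` from the earlier split:

```python
from sklearn.linear_model import LassoCV

# Cross-validated Lasso picks the regularization strength automatically.
lasso = LassoCV(cv=5, random_state=42).fit(X_train, y_train)

# Features with non-zero coefficients are the ones Lasso keeps.
selected = X_train.columns[lasso.coef_ != 0]
print(selected)
```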
Remember that feature selection methods are not one-size-fits-all. Depending on the dataset
and the problem at hand, different methods might yield better results. As such, it's worth
trying different methods and combinations of methods to find what works best.
In more detail, the main assumptions to check are the following:
1. Linear relationship:
GLMs assume that there's a linear relationship between the predictors and the response
variable (on the scale of the link function). If you're dealing with a single predictor, you can check this assumption by plotting
the predictor against the response and checking whether the data appears to follow a linear trend.
If you're dealing with multiple predictors, it's more challenging to check this assumption, but
one approach is to look at the residuals (the differences between the observed and predicted
responses) against the predicted responses. If there's a clear pattern in the residuals (like a
curve), it might suggest that the relationship is not linear.
2. Independence of errors:
The residuals should be independent, which means that the residuals for one observation
shouldn't predict the residuals for another. This assumption can be tricky to test, but if your
data is a time series or has some sort of grouping, you should be particularly careful about it.
A Durbin-Watson test can be used to check for autocorrelation in the residuals.
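For example, a quick check with statsmodels, reusing the residuals computed earlier (values near 2 suggest little autocorrelation):

```python
from statsmodels.stats.stattools import durbin_watson

# Durbin-Watson statistic of the residuals: ~2 means no autocorrelation,
# values toward 0 or 4 suggest positive or negative autocorrelation.
print(durbin_watson(residuals))
```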
3. Homoscedasticity:
The variance of the residuals should be constant. This means that the spread of the residuals
should be about the same across all levels of the predictors. This can be checked with a
scatter plot of residuals vs predicted values as well. If the plot shows a funnel shape, it
suggests that the model suffers from heteroscedasticity.
4. Distribution of errors:
For a GLM, you assume that the response follows a certain distribution (e.g., normal,
binomial, Poisson, etc.) and that the residuals should therefore also follow this distribution.
This can be checked with a Q-Q plot (quantile-quantile plot). The points in this plot should
roughly follow a straight line if the assumption holds.
Checking these assumptions can give you a lot of insight into your model and can guide you
in refining your model if necessary. If the assumptions are not met, you might need to
consider data transformations, adding interaction terms, or using a different type of model.
7. Hyperparameter Tuning:
The two most common methods for hyperparameter tuning are grid search and random search.
- Grid Search: involves specifying a subset of the hyperparameter space as a grid, and then trying out
every single combination of parameters in the grid.
- Random Search: involves specifying a distribution for each hyperparameter, and then randomly
sampling parameters from these distributions.
Random search can often find good parameters faster than grid search, but there's a chance it might
miss the optimal parameters if you're unlucky with the random sampling.
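A sketch of both approaches with scikit-learn, using a regularized Poisson regressor as a stand-in for a tunable GLM; the estimator, the parameter grid, and the scoring choice are all assumptions to adapt to your problem:

```python
from sklearn.linear_model import PoissonRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from scipy.stats import loguniform

# Grid search: try every combination in an explicit grid.
grid = GridSearchCV(PoissonRegressor(max_iter=1000),
                    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]},
                    scoring="neg_mean_absolute_error", cv=5)
grid.fit(X_train, y_train)

# Random search: sample parameter values from a distribution instead.
rand = RandomizedSearchCV(PoissonRegressor(max_iter=1000),
                          param_distributions={"alpha": loguniform(1e-3, 1e2)},
                          n_iter=20, scoring="neg_mean_absolute_error",
                          cv=5, random_state=42)
rand.fit(X_train, y_train)

print(grid.best_params_, rand.best_params_)
```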
You'll need a performance metric to decide which set of parameters is best. The choice of metric will
depend on your problem. For a regression task like predicting insurance claims, common metrics are
Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), or R-squared.
After finding the best parameters according to your tuning, re-train your model on the entire training
set using these parameters, and evaluate its performance on the test set.
Remember, you should only use your test set once, after all tuning and model selection has been
done, to get an unbiased estimate of your model's performance on new, unseen data.
Remember, achieving great results often depends on understanding your data well and feature
engineering more than on tweaking algorithm parameters. Also, a strong understanding of the
problem domain can help guide your feature selection and engineering efforts, improving the
performance of your model.