CRISP-DM
By:
ELGOUNIDI Hajar
SAFSAFI Aya
EL MALKI Ikram
AQAABICH Reda
Supervised by:
Mr Hicham RAFFAK
Overview
Introduction to CRISP-DM
Phases of CRISP-DM
Example
Advantages
Limitations
Conclusion
Introduction
Data mining has become an essential tool for
businesses to analyze and extract valuable
insights from their data. However, data mining
projects can be complex and challenging,
requiring a systematic and structured approach
to ensure success.
The CRISP-DM framework provides a clear
and flexible process model that helps teams
execute data mining projects successfully.
Let's look at the framework: an overview of
the process model, its benefits, and how it
can be applied to real-world scenarios.
What is Data Mining?
Data mining is the process of discovering patterns, trends,
correlations, or valuable information from large sets of data. It
involves analyzing and extracting useful knowledge or insights from
datasets, often using various techniques from statistics, machine
learning, and database systems. The goal of data mining is to
uncover hidden patterns and relationships within data that can be
used to make informed decisions.
Data Mining: Focuses on the overarching goal of knowledge discovery from data.
CRISP-DM: Provides a standardized framework to guide practitioners through the stages of a data mining
project.
What is the CRISP-DM Framework?
The CRISP-DM framework provides a structured and
systematic approach to data mining. It is a widely
used process model that offers several benefits for
data mining projects: it gives a clear, standardized
process for project execution, helps manage risks and
uncertainties, improves communication and collaboration
among team members, enhances transparency and builds
trust, and increases efficiency while ensuring quality.
It consists of six phases, each with its own objectives,
tasks, and deliverables.
Business understanding – What does the
business need?
Data understanding – What data do we
have / need? Is it clean?
Data preparation – How do we organize
the data for modeling?
Modeling – What modeling techniques
should we apply?
Evaluation – Which model best meets the
business objectives?
Deployment – How do stakeholders access
the results?
03 CRISP-DM’S STEPS
Business understanding
Not knowing how to frame the business problem is a problem in itself.
Company Background:
You are working with an online retail platform that sells a variety of products, ranging from electronics to
clothing.
Business Problem:
The company has noticed a plateau in its sales growth and wants to understand customer purchasing
patterns more deeply.
The goal is to identify trends, preferences, and potential areas for improvement in order to create
targeted sales and marketing strategies.
Business Objective:
Understand customer purchasing patterns to improve sales strategies.
Data understanding
In this phase, you need to identify what data you
already have, where to get it, what tools to use to
obtain it, and how much data is available.
Understanding your data from this initial phase makes
the rest of the data science project far more coherent.
What is Data?
A collection of objects defined by attributes. An attribute is a property or
characteristic of an object.
Examples: eye color of a person, temperature, etc.
• Other names: variable, field, characteristic, feature, predictor, etc.
A collection of attributes describes an object.
• Other names: record, point, case, sample, entity, instance, etc.
Data understanding
Types of data
1. Quantitative Data:
Quantitative data refers to information that can be expressed in numerical terms and can be
measured or counted. It deals with quantities and is often associated with variables that have a
numeric value.
Examples: Age, height, weight, temperature, income, and the number of products sold.
2. Qualitative Data:
Qualitative data is descriptive and non-numeric. It deals with qualities or characteristics that
cannot be measured in numerical terms. This type of data is often categorical and represents
attributes or labels.
Examples: Gender, color, and the type of material used in a product.
Data understanding
Data Representation:
Data Sources:
First-party data is information that a company or organization collects directly from its own
interactions with its customers or users. This data is typically gathered through direct interactions,
such as customer purchases, website visits, feedback forms, and other first-hand experiences.
Data understanding
Data Sources:
Second-party data is essentially someone else's first-party data. It is obtained through a mutually
beneficial relationship or partnership between two organizations. In this arrangement, one
organization shares its first-party data directly with another, often for strategic or collaborative
purposes.
Third-party data is information that is acquired from external sources that are not directly
connected to the collecting organization or its users. This data is typically purchased from data
providers or vendors who aggregate and sell data from various sources.
1. Collect Initial Data: Collect the available data from the various sources relevant to the project
and load it into the analysis tool.
2. Describe Data: Examine the data and document its structure, including format, type, and
potential quality issues. Describe the nature of the data and any patterns or anomalies that may
be present.
3. Explore Data: Assess the data to identify patterns, trends, and relationships that may exist
between different data points. This is typically done with data visualization tools or exploratory
data analysis techniques.
4. Verify Data Quality: Inspect the quality of the data to identify potential issues such as missing
data, outliers, or data entry errors. This step is important to ensure that the data is suitable for
use in the data mining process.
Data understanding
Example: Online Retail Platform
Importing Libraries:
• pandas is a powerful data manipulation library.
• matplotlib.pyplot is a library for creating static, animated, and interactive visualizations in Python.
• seaborn is a data visualization library built on top of Matplotlib. It provides a high-level interface for
drawing attractive and informative statistical graphics.
• StringIO (from the io module) acts as an in-memory file, letting you use functions designed to read
from files with a string variable.
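A minimal sketch of the import cell described above (the original code is not shown in the slides):

    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    from io import StringIO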
Data understanding
Example: Online Retail Platform
df.fillna(df.mean(), inplace=True) fills missing values in the DataFrame (df) with the mean of each respective
column.
X is created by dropping the 'SpendingScore' column from the DataFrame, leaving only the features.
y is created to represent the target variable, which, in this case, is 'SpendingScore'.
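A minimal sketch of the step described above, assuming a DataFrame df that contains a 'SpendingScore' column (the other column names are not shown in the slides):

    # Fill missing values with each column's mean
    # (numeric_only avoids errors on non-numeric columns in recent pandas)
    df.fillna(df.mean(numeric_only=True), inplace=True)

    # Separate the features (X) from the target variable (y)
    X = df.drop('SpendingScore', axis=1)
    y = df['SpendingScore']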
Modeling
This phase is focused on building and
validating the data mining models.
It involves selecting the best modeling
techniques, training and testing the
models, and evaluating their effectiveness.
Supervised Learning: In supervised learning, the model is trained on a labeled dataset, where the
input data is paired with corresponding output labels. The goal is for the model to learn the mapping
from inputs to outputs, making predictions on new, unseen data (e.g. linear regression, classification).
Unsupervised Learning: Unsupervised learning involves modeling without labeled output data.
The algorithm tries to find patterns or relationships within the data. Common tasks include
clustering (grouping similar data points) and dimensionality reduction (simplifying the data while
retaining important information).
1. Select Modeling Techniques: Select the appropriate modeling techniques based on the
business goals, the available data, and the problem at hand. The choice of algorithm depends on
the type of data, such as numeric, categorical, or text data; regression is one example.
2. Generate Test Design: Design a test plan that outlines the criteria for evaluating the
performance of the models. This includes deciding how to divide the available data into training,
test, and validation sets, and which metrics to use, such as accuracy, precision, recall, and F1-score.
3. Build Model: Build the model using the selected technique. This involves training the model
on the prepared data and adjusting its parameters.
4. Assess Model Performance: Assess the performance of the model by testing it on a
hold-out dataset. This helps determine how well the model generalizes to new data.
Modeling
train_test_split: This function is used to divide the dataset into training and test sets.
LinearRegression: This is the linear regression model you will use.
mean_squared_error: This function is used to calculate the mean squared error, which measures the
average of the squared differences between the actual values and the predicted values.
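A minimal sketch of these imports plus the split-and-train step, assuming the X and y created in the data preparation phase; the test_size and random_state values are assumptions:

    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error

    # Divide the dataset into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Train a linear regression model on the training data
    model = LinearRegression()
    model.fit(X_train, y_train)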
Modeling
The trained model is used to make predictions on the test set (X_test).
Modeling
Real values (y_test) are compared with predictions (predictions) using the mean_squared_error
function and the result is displayed.
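A minimal sketch of the prediction and evaluation step, assuming the model and test set from the previous slides:

    # Use the trained model to predict on the test set
    predictions = model.predict(X_test)

    # Compare the real values with the predictions
    mse = mean_squared_error(y_test, predictions)
    print("Mean Squared Error:", mse)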
Evaluation
This code reads a CSV file into a DataFrame, sets the 'id' column as the index, prints the shape of the
DataFrame, and displays the first row of the dataset.
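A minimal sketch of this loading step; the file name 'stroke_data.csv' is a placeholder, not the actual path used in the project:

    import pandas as pd

    # Read the CSV file and use the 'id' column as the index
    df = pd.read_csv('stroke_data.csv', index_col='id')

    print(df.shape)     # shape of the DataFrame (rows, columns)
    print(df.head(1))   # first row of the dataset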
2-Data Understanding
2. Explore data:
Here we get a quick overview of the distribution and scale of the numerical data in the
DataFrame. This summary describes the central tendency and spread of the numerical features
and is used to identify potential outliers, understand the range of values, and gain insights
into the distribution of the data.
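One plausible way to produce this summary, assuming the DataFrame df loaded above:

    # Count, mean, standard deviation, min, quartiles and max
    # for every numerical column
    print(df.describe())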
2-Data Understanding
2. Explore data:
Now we need to understand the distribution of ages in the dataset.
(Density plot of the 'age' column.)
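One plausible way to produce a plot like the one referenced above (the exact plotting call is not shown in the slides):

    import seaborn as sns
    import matplotlib.pyplot as plt

    # Density plot of the 'age' column to inspect its distribution
    sns.kdeplot(data=df, x='age')
    plt.show()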
The code handles missing values in the 'bmi' column by filling them with the mean of the
non-missing 'bmi' values. After this operation, the code checks and prints the count of missing
values in the 'bmi' column to confirm that there are no longer any missing values in that column.
The output "0" indicates that after the imputation process there are no longer any missing values
in the 'bmi' column: the count of missing values has been reduced to zero. This confirms that the
missing values were successfully replaced with the mean of the non-missing 'bmi' values, and the
data in the 'bmi' column is now complete.
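A minimal sketch of this imputation step, assuming the DataFrame df used above:

    # Replace missing 'bmi' values with the mean of the non-missing values
    df['bmi'] = df['bmi'].fillna(df['bmi'].mean())

    # Confirm that no missing values remain (expected output: 0)
    print(df['bmi'].isnull().sum())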
3-Data Preparation
2. Construct Data:
The goal of this step is to gain a preliminary understanding of the 'gender' column, which is essential for
making informed decisions in subsequent stages of the modeling process.
The code converts the 'hypertension' and 'heart_disease' columns to 'uint8', optimizing memory usage
for binary categorical variables (0 or 1).
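A minimal sketch of these two steps, assuming the DataFrame df used above:

    # Preliminary look at the distribution of the 'gender' column
    print(df['gender'].value_counts())

    # Convert binary (0/1) columns to uint8 to reduce memory usage
    df['hypertension'] = df['hypertension'].astype('uint8')
    df['heart_disease'] = df['heart_disease'].astype('uint8')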
3-Data Preparation
3. Integrate Data:
The goal of this step in data preparation is to transform the categorical 'smoking_status' column into a
numerical format suitable for machine learning models.
The following code prepares these columns for machine learning models by converting categorical
information into a numerical format, making them suitable for inclusion in predictive modeling tasks.
For 'ever_married':
'Yes' is replaced with 1, and 'No' is replaced with 0, transforming the column into a binary representation of
marital status.
The data type is then converted to 'uint8' for optimized memory usage.
3-Data Preparation
3. Integrate Data:
For 'Residence_type':
'Rural' is replaced with 1, and 'Urban' is replaced with 0, converting the column into a binary representation of
residence type.
The data type is converted to 'uint8' for memory efficiency.
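A minimal sketch of these encodings; the one-hot encoding of 'smoking_status' is an assumption, since the exact call is not shown in the slides:

    # Encode 'ever_married': 'Yes' -> 1, 'No' -> 0
    df['ever_married'] = df['ever_married'].replace({'Yes': 1, 'No': 0}).astype('uint8')

    # Encode 'Residence_type': 'Rural' -> 1, 'Urban' -> 0
    df['Residence_type'] = df['Residence_type'].replace({'Rural': 1, 'Urban': 0}).astype('uint8')

    # One plausible numeric encoding of 'smoking_status' (one-hot / dummy variables)
    df = df.join(pd.get_dummies(df['smoking_status'], prefix='smoking_status'))
    df = df.drop('smoking_status', axis=1)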
3-Data Preparation
3. Integrate Data:
The goal is to gather descriptive statistics about the distribution of work types, providing a foundation for
further analysis, modeling, and decision-making in the data science or business context.
the goal is to prepare the 'work_type' information in a format suitable for machine learning models
while preserving the categorical nature of the original data.
This approach achieves the same result as using pd.get_dummies() alone but integrates the dummy
variables back into the original DataFrame and drops the original categorical column. This is a
common practice in preparing data for machine learning.
3-Data Preparation
4. Format Data:
The code is handling the categorical variable 'work_type' by converting it into dummy variables(
binary variables), creating a binary representation for each work type.
The resulting DataFrame X now includes these dummy variables for each unique work type, and
the original 'work_type' column has been dropped.
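A minimal sketch of this step, assuming the DataFrame df prepared above; the 'work_type' prefix for the dummy columns is an assumption:

    # Summary of the work-type categories
    print(df['work_type'].value_counts())

    # One-hot encode 'work_type', join the dummy columns back into the
    # DataFrame, then drop the original categorical column
    dummies = pd.get_dummies(df['work_type'], prefix='work_type')
    X = df.join(dummies).drop('work_type', axis=1)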
3-Data Preparation
This kind of information is crucial for understanding the structure of the dataset, ensuring that the data
types are appropriate for the analysis or modeling task, and identifying any potential issues or
necessary preprocessing steps.
The heatmap visually represents the correlations between different columns in the DataFrame X.
Positive correlations are represented by warmer colors, negative correlations by cooler colors, and the
intensity of color indicates the strength of the correlation.
This visualization is useful for identifying relationships and dependencies between variables, which is
valuable in understanding the dataset and making decisions about feature selection or engineering.
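A minimal sketch of these two inspection steps, assuming the DataFrame X built above; the figure size and colour map are assumptions:

    import matplotlib.pyplot as plt
    import seaborn as sns

    # Column names, dtypes and non-null counts of the prepared data
    X.info()

    # Correlation heatmap between the columns of X
    plt.figure(figsize=(12, 8))
    sns.heatmap(X.corr(), annot=True, cmap='coolwarm')
    plt.show()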
3-Data Preparation
In general, if the correlation between two
variables is greater than 0.7, we say they
have a high correlation; between 0.3 and
0.7, a moderate correlation; and below 0.3,
a low correlation.
Unfortunately, the variables in the heatmap
do not reveal anything particularly useful:
no variable has a high correlation with the
'stroke' variable, and the correlations that
are high between other variables are all
just common sense. Thus, we take all the
features as input data to build our model.
3-Data Preparation
This step prepares the data for machine learning, specifically for training a predictive model.
The goal of these operations is to separate the target variable ('stroke') from the feature variables in
preparation for a machine learning model.
The y variable now contains the target variable, and X contains the remaining features that will be
used to predict the target.
This separation is common in machine learning workflows to clearly distinguish between the input
features and the output variable during model training.
After this operation, y would typically be used as the target variable when training a machine learning
model, and X would be used as the feature matrix. For example, in a classification task, the model
might be trained to predict whether an individual had a stroke (y) based on the other features in the
dataset (X).
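A minimal sketch of this separation, assuming the DataFrame X built above contains the 'stroke' column:

    # Separate the target variable from the features
    y = X['stroke']
    X = X.drop('stroke', axis=1)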
3-Data Preparation
This preprocessing step is particularly beneficial for algorithms that are sensitive to the scale of input
features, such as gradient-based optimization algorithms commonly used in machine learning models.
The MinMaxScaler scales each feature independently, rescaling them to a specified range. By
default, it scales features to a range between 0 and 1.
Scaling is important in machine learning to ensure that features with different scales or units do not
disproportionately influence the model training process.
The fit_transform method is used to both compute the scaling parameters (using fit) and apply the
scaling to the data (using transform).
After this operation, the feature variables in X have been scaled, and the transformed X can be
used for training machine learning models.
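A minimal sketch of the scaling step, assuming the feature matrix X from the previous slide:

    from sklearn.preprocessing import MinMaxScaler

    # Rescale every feature to the [0, 1] range
    scaler = MinMaxScaler()
    X = scaler.fit_transform(X)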
3-Data Preparation
The goal here is to create separate datasets for training and validation in order to assess the
performance of a machine learning model.
train_x and train_y represent the feature matrix and target variable for training, respectively.
val_x and val_y represent the feature matrix and target variable for validation, respectively.
The test_size parameter determines the proportion of data allocated for validation, and
random_state ensures reproducibility.
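A minimal sketch of the split described above; test_size=0.2 and random_state=42 are assumptions:

    from sklearn.model_selection import train_test_split

    # Hold out part of the data for validation
    train_x, val_x, train_y, val_y = train_test_split(X, y, test_size=0.2, random_state=42)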
05 CRISP-DM BENEFITS
What are its advantages?
Benefits of using CRISP-DM Framework: