4 Data Preprocessing

Uploaded by

umadataengg
Copyright
© © All Rights Reserved
Machine Learning

Data Preprocessing

• Data preprocessing is the process of preparing raw data and making it suitable for a machine learning model.
• It is the first, and a crucial, step in creating a machine learning model.
• Before doing any operation on data, it must be cleaned and put into a consistent format.
Why do we need Data Preprocessing?

• Real-world data generally contains noise and missing values, and may be in a format that cannot be used directly by machine learning models.
• Data preprocessing covers the tasks required to clean the data and make it suitable for a machine learning model.
• It can also increase the accuracy and efficiency of the model.
Why do we need Data Preprocessing?
It involves the following steps:
• Getting the dataset
• Importing libraries
• Importing datasets
• Finding Missing Data
• Encoding Categorical Data
• Splitting dataset into training and test set
• Feature scaling
1) Get the Dataset

• To create a machine learning model, the first thing we require is a dataset, because a machine learning model works entirely on data.
• Data collected for a particular problem and stored in a proper format is known as the dataset.
• The dataset may be stored as a CSV file, an HTML file, or an xlsx file.
2) Importing Libraries

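• To perform data preprocessing in Python, we need to import some predefined Python libraries.
• Each library is used for a specific job.
• Numpy: numerical arrays and mathematical operations.
• Matplotlib: plotting charts.
• Pandas: importing and managing datasets.
• import numpy as np
• import matplotlib.pyplot as plt
• import pandas as pd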
3) Importing the Datasets

• Now we need to import the dataset that we have collected for our machine learning project.
• df = pd.read_csv('Dataset.csv')
Extracting dependent and independent variables:

• X = df.iloc[:, :-1].values
• y = df.iloc[:, -1].values
• print(X)
• print(y)
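
As a minimal sketch of the import and extraction steps above, the following reads a small in-memory CSV in place of Dataset.csv. The column names (Country, Age, Salary, Purchased) and every value are made up for illustration; the only assumption that matters is that the last column holds the dependent variable.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for Dataset.csv (values are made up).
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,61000,No
"""

df = pd.read_csv(io.StringIO(csv_text))

# Every column except the last one: the independent variables (features).
X = df.iloc[:, :-1].values
# The last column only: the dependent variable (target).
y = df.iloc[:, -1].values

print(X.shape)  # (4, 3)
print(y)
```

Because X mixes text and numbers, .values returns an object array here, which is why the later steps slice out specific columns before applying numeric operations.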
4) Handling Missing data

• The next step of data preprocessing is to handle missing data in the dataset.
• If the dataset contains missing data, it can cause serious problems for the machine learning model.
• Hence it is necessary to handle any missing values present in the dataset.
Ways to handle missing data:

• There are mainly two ways to handle missing data:
• By deleting the particular row: this is a common way of dealing with null values.
• Here we simply delete the specific row or column that contains null values.
• However, this approach is not very efficient, and removing data may lose information, which can make the output less accurate.
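
The deletion approach can be sketched with pandas' dropna. The DataFrame below is hypothetical and its values are made up; it only serves to show how much data a single missing entry can cost.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing entries (values are made up).
df = pd.DataFrame({
    "Age": [44.0, 27.0, np.nan, 38.0],
    "Salary": [72000.0, np.nan, 54000.0, 61000.0],
})

# Drop every row that contains at least one null value.
rows_kept = df.dropna(axis=0)
# Drop every column that contains at least one null value.
cols_kept = df.dropna(axis=1)

print(len(df), len(rows_kept))  # 4 rows before, 2 rows after
print(cols_kept.shape)          # both columns had a null, so no columns remain
```

Two missing cells removed half of the rows, and dropping by column removed everything, which illustrates the information loss described above.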
Ways to handle missing data:

• By calculating the mean: here we calculate the mean of the column or row that contains a missing value and put it in place of the missing value.
• This strategy is useful for features with numeric data, such as age, salary, or year.
• Here, we will use this approach.
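
Before turning to a library class, the idea can be sketched by hand with pandas. The Age column and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric column with one missing value (values are made up).
df = pd.DataFrame({"Age": [40.0, 30.0, np.nan, 50.0]})

# mean() skips NaN by default: (40 + 30 + 50) / 3 = 40.0
age_mean = df["Age"].mean()

# Put the column mean in place of the missing value.
df["Age"] = df["Age"].fillna(age_mean)

print(df["Age"].tolist())
```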
4) Handling Missing data:

• To handle missing values, we will use the Scikit-learn library in our code, which contains various modules for building machine learning models.
• Here we will use the SimpleImputer class of the sklearn.impute module.
• Below is the code for it:
4) Handling Missing data:

• from sklearn.impute import SimpleImputer
• imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
• imputer.fit(X[:, 1:3])
• X[:, 1:3] = imputer.transform(X[:, 1:3])
• print(X)
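
A self-contained version of the imputation step might look like the sketch below. The feature matrix is hypothetical (made-up values, with a text column first and two numeric columns after it), and the explicit astype(float) is an extra precaution added here, not part of the slide code.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical feature matrix: column 0 is text, columns 1-2 are numeric.
X = np.array([
    ["France", 40.0, 70000.0],
    ["Spain", np.nan, 50000.0],
    ["Germany", 30.0, np.nan],
    ["Spain", 50.0, 60000.0],
], dtype=object)

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# Work on the numeric columns only; the text column cannot be averaged.
numeric = X[:, 1:3].astype(float)
X[:, 1:3] = imputer.fit_transform(numeric)

print(imputer.statistics_)  # the per-column means that were learned
print(X)
```

fit_transform combines the separate fit and transform calls shown on the slide; the learned means stay available in statistics_.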
5) Encoding Categorical data:
• Categorical data is data that falls into categories. In our dataset there are two categorical variables, Country and Purchased.
• A machine learning model works entirely on mathematics and numbers, so a categorical variable in the dataset may cause trouble while building the model.
• So it is necessary to encode these categorical variables into numbers.
5) Encoding Categorical data

• For Country variable:
• First, we will convert the Country categories into numeric codes.
• To do this, we will use the LabelEncoder class from the sklearn.preprocessing module.
5) Encoding Categorical data

• # Encoding the Country column (column 0 of X)
• from sklearn.preprocessing import LabelEncoder
• le = LabelEncoder()
• X[:, 0] = le.fit_transform(X[:, 0])
• print(X)
5) Encoding Categorical data
• Explanation:
• In the code above, we imported the LabelEncoder class from the sklearn library.
• This class encodes the categories as integers.
• In our case there are three countries, and as the output above shows, they are encoded as 0, 1, and 2.
• From these values, the machine learning model may assume that there is some order or correlation between the categories, which can produce wrong output.
• To remove this issue, we will use dummy encoding.
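
The ordering problem is easy to see in a small sketch. The country names below are illustrative assumptions, since the slides do not list the actual categories.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical Country column (names chosen for illustration).
countries = np.array(["France", "Spain", "Germany", "Spain"])

le = LabelEncoder()
codes = le.fit_transform(countries)

print(le.classes_)  # categories in sorted order: France, Germany, Spain
print(codes)        # [0 2 1 2]
```

The codes follow alphabetical order, so Spain (2) looks "larger" than France (0) even though the countries have no natural ranking.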
5) Encoding Categorical data:

• Dummy Variables:
• Dummy variables are variables that take only the value 0 or 1.
• A 1 marks the category a row belongs to in that category's column, and the remaining dummy columns are 0.
• With dummy encoding, the number of columns equals the number of categories.
5) Encoding Categorical data:

• In our dataset we have 3 categories, so dummy encoding will produce three columns containing 0 and 1 values.
• For dummy encoding, we will use the OneHotEncoder class of the sklearn.preprocessing module.
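
To make the dummy-variable layout concrete, here is a hand-rolled sketch using only NumPy. It is not how OneHotEncoder is implemented, just an illustration of the resulting columns, and the country names are assumed for the example.

```python
import numpy as np

# Hypothetical Country column (names chosen for illustration).
countries = np.array(["France", "Spain", "Germany", "Spain"])

# categories: sorted unique values; codes: index of each row's category.
categories, codes = np.unique(countries, return_inverse=True)

# Row i of the identity matrix is the dummy vector for category i.
dummies = np.eye(len(categories), dtype=int)[codes]

print(categories)  # column order of the dummy matrix
print(dummies)
```

Each row contains exactly one 1, and there are as many columns as categories (three here).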
5) Encoding Categorical data:

• # Encoding the Independent Variable
• from sklearn.compose import ColumnTransformer
• from sklearn.preprocessing import OneHotEncoder
• ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
• X = np.array(ct.fit_transform(X))
• print(X)
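
The same transformer can be tried on a small stand-alone matrix. The data is made up, and sparse_threshold=0 is an addition in this sketch to force a dense result regardless of how many zeros the output contains.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical feature matrix: Country, Age, Salary (values are made up).
X = np.array([
    ["France", 44, 72000],
    ["Spain", 27, 48000],
    ["Germany", 30, 54000],
    ["Spain", 38, 61000],
], dtype=object)

ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [0])],  # one-hot encode column 0
    remainder="passthrough",                            # keep Age and Salary as they are
    sparse_threshold=0,                                 # always return a dense array
)
X_encoded = np.array(ct.fit_transform(X))

print(X_encoded.shape)  # (4, 5): three dummy columns, then Age and Salary
print(X_encoded)
```

The encoded columns come first and the passed-through columns follow, which is why later steps that target the numeric features start at column 3.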
5) Encoding Categorical data:

• For Purchased Variable:
• labelencoder_y = LabelEncoder()
• y = labelencoder_y.fit_transform(y)
• For the second categorical variable, we use only a LabelEncoder object.
• We do not use the OneHotEncoder class here because the Purchased variable has only two categories, yes and no, which are encoded directly as 0 and 1.
5) Encoding Categorical data
• # Encoding the Dependent Variable
• from sklearn.preprocessing import LabelEncoder
• le = LabelEncoder()
• y = le.fit_transform(y)
• print(y)
6) Splitting the dataset into the Training set and Test set

• from sklearn.model_selection import train_test_split
• X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
• print(X_train)
• print(X_test)
• print(y_train)
• print(y_test)
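
In the call above, test_size=0.2 holds back 20% of the rows for testing and random_state fixes the shuffle so the split is reproducible. A minimal sketch on made-up data (ten hypothetical samples) shows the resulting sizes:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten hypothetical samples with two features each; row i is [2*i, 2*i + 1].
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
print(y_train.shape, y_test.shape)  # (8,) (2,)
```

Rows of X and entries of y are shuffled together, so each feature row stays paired with its own target.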
7) Feature Scaling

• Feature scaling is a technique for bringing the independent variables of the dataset onto a common scale.
• In feature scaling, we put our variables in the same range and on the same scale so that no variable dominates the others.
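
One common form of feature scaling is standardization, which subtracts each column's mean and divides by its standard deviation, giving every column mean 0 and standard deviation 1 (rather than fixed minimum and maximum bounds). A hand-written sketch on made-up Age and Salary values:

```python
import numpy as np

# Hypothetical features on very different scales: Age and Salary.
X = np.array([
    [20.0, 30000.0],
    [30.0, 50000.0],
    [40.0, 70000.0],
])

mean = X.mean(axis=0)  # per-column mean
std = X.std(axis=0)    # per-column (population) standard deviation

# z = (x - mean) / std, applied column by column.
X_scaled = (X - mean) / std

print(X_scaled)
```

After scaling, the Age and Salary columns hold identical values in this example, so neither dominates simply because of its units.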
7) Feature Scaling

• For feature scaling, we will import the StandardScaler class of the sklearn.preprocessing module.
• Next, we will create a StandardScaler object for the independent variables (features).
• Then we will fit and transform the training dataset.
• For the test dataset, we will apply the transform() function directly instead of fit_transform(), because the scaler has already been fitted on the training set.
7) Feature Scaling

• from sklearn.preprocessing import StandardScaler
• sc = StandardScaler()
• X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])
• X_test[:, 3:] = sc.transform(X_test[:, 3:])
• print(X_train)
• print(X_test)
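
The fit-on-train, transform-on-test pattern can be checked on a tiny made-up example. The single feature and its values are hypothetical. In the slide code, the 3: slice presumably skips the three dummy columns produced by one-hot encoding, which do not need scaling.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature training and test sets (values are made up).
X_train = np.array([[20.0], [30.0], [40.0]])
X_test = np.array([[30.0], [50.0]])

sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)  # learns mean and std from the training data
X_test_scaled = sc.transform(X_test)        # reuses the training mean and std

print(sc.mean_, sc.scale_)
print(X_test_scaled)
```

The test value 30.0 maps to 0 because it equals the training mean; the test set's own statistics are never used, which keeps information from the test data out of the preprocessing.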
