MODULE 2
Introduction: End-to-End Machine Learning Project
This chapter works through an example project end to end, imagining that you are a recently
hired data scientist at a real estate company. Each of the following questions helps in framing
and understanding the machine learning project more effectively.
1. What exactly is the business objective? - This question aims to clarify the ultimate
goal of the project and how the company expects to benefit from the model.
2. How does the company expect to use and benefit from this model? - This is important
because it determines how to frame the problem, which algorithms to select, which
performance measure to use to evaluate the model, and how much effort to spend
tweaking it.
Here, the model’s output (a prediction of a district’s median housing price) will be fed
to another Machine Learning system (Figure 2-2), along with many other signals. This
downstream system will determine whether it is worth investing in a given area or not.
Getting this right is critical, as it directly affects revenue.
3. What does the current solution look like (if any)? - It will give a reference
performance, as well as insights on how to solve the problem.
4. Should you use batch learning or online learning techniques? - Batch learning should
be chosen because there is no continuous flow of new data, no immediate need to
adjust to changing data, and the data is small enough to fit in memory.
Both the RMSE and the MAE are ways to measure the distance between two vectors: the
vector of predictions and the vector of target values.
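As a minimal sketch of the two measures, the functions below compute the RMSE and the MAE with NumPy. The function names and the toy price vectors are illustrative assumptions, not part of the original project code.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between targets and predictions."""
    errors = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.sqrt(np.mean(errors ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error between targets and predictions."""
    errors = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.mean(np.abs(errors)))

# Made-up targets and predictions; the last prediction is a large miss.
targets = [200_000, 150_000, 300_000, 250_000]
predictions = [210_000, 140_000, 300_000, 170_000]

print("RMSE:", rmse(targets, predictions))
print("MAE: ", mae(targets, predictions))
```

Because the errors are squared before averaging, the single large miss pulls the RMSE well above the MAE, which is why the RMSE is the more outlier-sensitive of the two.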
• A number of Python modules are needed: Jupyter, NumPy, Pandas, Matplotlib, and
Scikit-Learn.
• They can be installed with the system’s packaging system (e.g., apt-get on Ubuntu, or
MacPorts or Homebrew on macOS), with a Scientific Python distribution such as Anaconda
and its packaging system, or with Python’s own packaging system, pip.
• With pip, all the required modules and their dependencies can be installed using a
single pip command.
• For this project, just download a single compressed file, housing.tgz, which contains a
comma-separated values (CSV) file called housing.csv with all the data.
• A simple method is to use a web browser to download it, decompress the file, and extract
the CSV file.
• It is preferable, however, to write a small function or script that downloads the data.
This is particularly useful if the data changes regularly, since the script can be run
whenever you need to fetch the latest data.
This function returns a Pandas DataFrame object containing all the data.
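A download-and-load helper of the kind described above might look like the following sketch. The function names, the default directory, and the housing_url parameter are assumptions; the actual download URL is not given in these notes and must be supplied by the caller.

```python
import tarfile
import urllib.request
from pathlib import Path

import pandas as pd

def fetch_housing_data(housing_url, housing_path="datasets/housing"):
    """Download housing.tgz from housing_url and extract it into housing_path."""
    housing_dir = Path(housing_path)
    housing_dir.mkdir(parents=True, exist_ok=True)
    tgz_path = housing_dir / "housing.tgz"
    urllib.request.urlretrieve(housing_url, str(tgz_path))
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_dir)

def load_housing_data(housing_path="datasets/housing"):
    """Read the extracted housing.csv into a Pandas DataFrame."""
    return pd.read_csv(Path(housing_path) / "housing.csv")

# Typical use (housing_url is whatever location hosts housing.tgz):
#   fetch_housing_data(housing_url)
#   housing = load_housing_data()
```

Keeping the download and the load in separate functions means the data can be re-read many times without fetching the archive again.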
1. head(): Let’s take a look at the top five rows using the DataFrame’s head() method.
Each row represents one district. There are 10 attributes: longitude, latitude,
housing_median_age, total_rooms, total_bedrooms, population, households,
median_income, median_house_value, and ocean_proximity.
2. info(): The info() method is useful to get a quick description of the data, in particular
the total number of rows, and each attribute’s type and number of non-null values.
3. value_counts(): You can find out what categories exist and how many districts belong
to each category by using the value_counts() method.
4. describe(): The describe() method shows a summary of the numerical attributes.
• The count, mean, min, and max rows are self-explanatory. Note that null values are
ignored (so, for example, the count of total_bedrooms is 20,433, not 20,640).
• The std row shows the standard deviation, which measures how dispersed the values
are.
• The 25%, 50%, and 75% rows show the corresponding percentiles: a percentile
indicates the value below which a given percentage of observations in a group of
observations falls.
• For example, 25% of the districts have a housing_median_age lower than 18, while
50% are lower than 29 and 75% are lower than 37. These are often called the 25th
percentile (or 1st quartile), the median, and the 75th percentile (or 3rd quartile).
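The inspection methods above can be tried on a small frame. The sketch below uses a made-up six-row stand-in for housing.csv (the values are illustrative only) with one missing total_bedrooms entry, so that the effect of null values on count is visible.

```python
import numpy as np
import pandas as pd

# Made-up miniature stand-in for the housing data.
housing = pd.DataFrame({
    "housing_median_age": [41.0, 21.0, 52.0, 52.0, 18.0, 29.0],
    "total_bedrooms": [129.0, 1106.0, np.nan, 235.0, 280.0, 213.0],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND",
                        "INLAND", "<1H OCEAN", "INLAND"],
})

print(housing.head())           # first five rows
housing.info()                  # row count, dtypes, non-null counts
category_counts = housing["ocean_proximity"].value_counts()
print(category_counts)          # districts per category
summary = housing.describe()    # numerical attributes only
print(summary)
```

In the summary, the count for total_bedrooms is one lower than for housing_median_age because the missing value is ignored, and the 50% row is the median of each column.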
When splitting the data into training and test sets, it is important to ensure that the test
set remains consistent across different runs of the program.
The Problem: If the dataset is randomly split into training and test sets each time the
program is run, a different test set is generated each time. Over time, the model will get to
see the entire dataset, which defeats the purpose of having a separate test set.
One way to address this is to save the test set when it is first created and load the saved
test set in subsequent runs.
Another option is to set the random number generator’s seed (e.g., np.random.seed(42))
before shuffling, so that it always generates the same shuffled indices.
But both of these solutions will break the next time you fetch an updated dataset.
A more robust approach is to use each instance’s unique identifier to decide whether it
should go in the test set, for example by hashing the identifier and placing the instance in
the test set if the hash falls below a chosen fraction of the maximum hash value. This way,
even if the dataset is refreshed, the split remains consistent: instances that were in the
training set never move to the test set.
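A minimal sketch of an identifier-based split follows. It assumes every instance has a stable identifier and uses a CRC32 checksum as the hash; the helper names and the 20% test ratio are illustrative choices.

```python
from zlib import crc32

def is_in_test_set(identifier, test_ratio):
    """True if the identifier's checksum falls in the lowest test_ratio of the hash range."""
    checksum = crc32(str(identifier).encode("utf-8")) & 0xFFFFFFFF
    return checksum < test_ratio * 2**32

def split_by_id(ids, test_ratio=0.2):
    """Split identifiers into (train_ids, test_ids) deterministically."""
    train_ids, test_ids = [], []
    for identifier in ids:
        if is_in_test_set(identifier, test_ratio):
            test_ids.append(identifier)
        else:
            train_ids.append(identifier)
    return train_ids, test_ids

ids = list(range(1000))
train_ids, test_ids = split_by_id(ids)

# Simulate a refreshed dataset: 200 new instances are appended.
_, test_ids_after = split_by_id(ids + list(range(1000, 1200)))
old_part_of_new_test = [i for i in test_ids_after if i < 1000]
print(old_part_of_new_test == test_ids)
```

Since the decision depends only on the identifier, rerunning the program or appending new rows never moves an existing instance from one set to the other.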
Since the dataset has geographical information (latitude and longitude), it is a good idea to
create a scatterplot of all districts to visualize the data.
The resulting plot looks like California, but other than that it is hard to see any particular
pattern. Setting the alpha option to 0.1 makes it much easier to visualize the places where
there is a high density of data points.
Now let’s look at the housing prices. The radius of each circle represents the district’s
population (option s), and the color represents the price (option c). We will use a predefined
color map (option cmap) called jet, which ranges from blue (low prices) to red (high prices).
To compute the standard correlation coefficient (also called Pearson’s r) between every pair
of attributes use the corr() method:
Now let’s look at how much each attribute correlates with the median house value:
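As a sketch of this step, the snippet below calls corr() on a made-up six-row frame (the numbers are illustrative, not taken from the housing data) and then sorts the correlations with the target column.

```python
import pandas as pd

# Made-up numerical attributes; all columns must be numeric for corr().
housing = pd.DataFrame({
    "median_income": [1.5, 2.0, 3.0, 4.5, 6.0, 8.0],
    "housing_median_age": [30.0, 45.0, 20.0, 35.0, 15.0, 25.0],
    "median_house_value": [90_000.0, 120_000.0, 170_000.0,
                           240_000.0, 330_000.0, 450_000.0],
})

corr_matrix = housing.corr()   # Pearson's r for every pair of columns
target_corr = corr_matrix["median_house_value"].sort_values(ascending=False)
print(target_corr)
```

The coefficient ranges from -1 to 1; the target always correlates perfectly with itself, so it appears at the top of the sorted list.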
The figure below shows various plots along with the correlation coefficient between their
horizontal and vertical axes.
Another way to check for correlation between attributes is to use Pandas’ scatter_matrix
function, which plots every numerical attribute against every other numerical attribute.
The main diagonal (top left to bottom right) would be full of straight lines if Pandas plotted
each variable against itself, which would not be very useful. So instead, Pandas displays a
histogram of each attribute.
The most promising attribute to predict the median house value is the median income, so let’s
zoom in on their correlation scatterplot.
Data Cleaning
Start by cleaning the training set. Let’s separate the predictors and the labels, since we
don’t necessarily want to apply the same transformations to the predictors and the target values.
Missing Features: Most Machine Learning algorithms cannot work with missing features. If
an attribute has some missing values, there are three options to handle them:
• Get rid of the corresponding districts.
• Get rid of the whole attribute.
• Set the missing values to some value (zero, the mean, the median, etc.).
These can be accomplished easily using DataFrame’s dropna(), drop(), and fillna() methods,
respectively:
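A minimal sketch of the three options on a made-up four-row frame (the values are illustrative) with one missing total_bedrooms entry:

```python
import numpy as np
import pandas as pd

housing = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
})

# Option 1: drop the districts (rows) where the value is missing.
option_1 = housing.dropna(subset=["total_bedrooms"])

# Option 2: drop the whole attribute (column).
option_2 = housing.drop("total_bedrooms", axis=1)

# Option 3: fill the missing values with the median of the column.
median = housing["total_bedrooms"].median()
option_3 = housing.assign(total_bedrooms=housing["total_bedrooms"].fillna(median))

print(option_1, option_2, option_3, sep="\n\n")
```

Each call returns a new DataFrame and leaves housing unchanged, so the three options can be compared side by side.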
If option 3 is chosen, compute the median value on the training set and use it to fill the
missing values in the training set. Save the computed median value, as it will be needed later
to replace missing values in the test set for system evaluation, and also to handle missing
values in new data once the system goes live.
Since the median can only be computed on numerical attributes, we need to create a copy of
the data without the text attribute ocean_proximity:
Now, fit the imputer instance to the training data using the fit() method:
The imputer has simply computed the median of each attribute and stored the result in its
statistics_ instance variable.
Now you can use this “trained” imputer to transform the training set by replacing missing
values by the learned medians:
The result is a plain NumPy array containing the transformed features. If you want to put it
back into a Pandas DataFrame, it’s simple:
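The imputation steps described above can be sketched as follows, assuming Scikit-Learn’s SimpleImputer with the median strategy (the class named later in these notes) and a made-up four-row frame in place of the real training set.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

housing = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND", "INLAND"],
})

# The median only exists for numerical attributes, so drop the text column.
housing_num = housing.drop("ocean_proximity", axis=1)

imputer = SimpleImputer(strategy="median")
imputer.fit(housing_num)               # learns one median per column
print(imputer.statistics_)

X = imputer.transform(housing_num)     # plain NumPy array
housing_tr = pd.DataFrame(X, columns=housing_num.columns,
                          index=housing_num.index)
print(housing_tr)
```

Fitting on the training set and reusing the same fitted imputer on the test set or on new data ensures that all of them are filled with the training medians.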
Another way is to create one binary attribute per category: one attribute equal to 1 when the
category is “<1H OCEAN” (and 0 otherwise), another attribute equal to 1 when the category
is “INLAND” (and 0 otherwise), and so on. This is called one-hot encoding, because only
one attribute will be equal to 1 (hot), while the others will be 0 (cold). The new attributes are
sometimes called dummy attributes. Scikit-Learn provides a OneHotEncoder class to convert
categorical values into one-hot vectors.
By default, the OneHotEncoder class returns a sparse matrix, but we can convert it to a dense
array if needed by calling the toarray() method:
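A short sketch of one-hot encoding on a made-up five-row ocean_proximity column (the rows are illustrative):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

housing_cat = pd.DataFrame({
    "ocean_proximity": ["<1H OCEAN", "INLAND", "NEAR BAY", "INLAND", "<1H OCEAN"],
})

cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)  # sparse by default
dense = housing_cat_1hot.toarray()                         # dense NumPy array

print(cat_encoder.categories_)   # learned categories, one array per column
print(dense)
```

The sparse form stores only the positions of the nonzero entries, which saves memory when a categorical attribute has many categories.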
Feature Scaling
Machine Learning algorithms don’t perform well when the input numerical attributes have
very different scales.
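There are two common ways to get all attributes to have the same scale:
1. Min-max scaling: In min-max scaling (normalization), the values are shifted and
rescaled so that they end up ranging from 0 to 1. This is done by subtracting the
minimum value and dividing by the difference between the maximum and the minimum.
2. Standardization: The mean is subtracted from each value and the result is divided by
the standard deviation, so that the resulting distribution has zero mean and unit
variance. Unlike min-max scaling, standardization does not bound values to a range.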
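The arithmetic behind feature scaling can be written in a few lines of NumPy. The sketch below applies min-max scaling to a made-up vector and, for comparison, standardization (the rescaling performed by the StandardScaler mentioned later in these notes).

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0, 10.0])   # made-up attribute values

# Min-max scaling: shift by the minimum, divide by the range -> values in [0, 1].
min_max = (values - values.min()) / (values.max() - values.min())

# Standardization: subtract the mean, divide by the standard deviation.
standardized = (values - values.mean()) / values.std()

print(min_max)
print(standardized)
```

Note how the outlier (10.0) squeezes the other min-max values toward 0, whereas standardization is much less affected by it.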
Transformation Pipelines
• There are many data transformation steps that need to be executed in the right order.
Scikit-Learn provides the Pipeline class to help with such sequences of
transformations.
• Here is a small pipeline for the numerical attributes:
The first line imports the necessary classes from Scikit-Learn. Pipeline is used to create a
sequence of data processing steps.
StandardScaler is used to standardize features by removing the mean and scaling to unit
variance.
This code defines a pipeline named num_pipeline consisting of three steps:
1. 'imputer': Uses SimpleImputer to handle missing values by replacing them with the
median value of the column. This is specified by strategy="median".
2. 'attribs_adder': Uses a custom transformer CombinedAttributesAdder(), which is
assumed to be defined elsewhere. This step adds new attributes to the dataset based on
existing ones.
3. 'std_scaler': Uses StandardScaler to standardize the numerical attributes.
Standardization is the process of rescaling the features so that they have the properties
of a standard normal distribution with a mean of 0 and a standard deviation of 1.
The last line applies the pipeline to the housing_num data. The fit_transform() method first
fits the pipeline to the data (that is, it computes the necessary statistics, such as the median
values for imputation and the mean and standard deviation for scaling) and then transforms
the data according to the fitted pipeline.
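A pipeline with the three steps described above might be sketched as follows. CombinedAttributesAdder is not defined in these notes, so a hypothetical stand-in transformer (RatioAdder, which appends the ratio of the first two columns) is used instead, and housing_num is a made-up array.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

class RatioAdder(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for CombinedAttributesAdder: appends column 0 / column 1."""

    def fit(self, X, y=None):
        return self   # nothing to learn

    def transform(self, X):
        ratio = X[:, 0] / X[:, 1]
        return np.c_[X, ratio]

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("attribs_adder", RatioAdder()),
    ("std_scaler", StandardScaler()),
])

# Made-up numerical data: total_rooms, households (one value missing).
housing_num = np.array([[880.0, 126.0],
                        [7099.0, 1138.0],
                        [1467.0, np.nan],
                        [1274.0, 219.0]])
housing_num_tr = num_pipeline.fit_transform(housing_num)
print(housing_num_tr)
```

The order matters: imputation runs first so the ratio is never computed from a missing value, and scaling runs last so the added attribute is standardized too.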
Let’s measure this regression model’s RMSE on the whole training set using Scikit-Learn’s
mean_squared_error function:
• This score is better than nothing but clearly not a great score: in most districts,
median_house_value ranges between $120,000 and $265,000, so a typical prediction
error of $68,628 is not very satisfying.
• This is an example of a model underfitting the training data. When this happens, it can
mean that the features do not provide enough information to make good predictions, or
that the model is not powerful enough.
• The main ways to fix underfitting are to select a more powerful model, to feed the
training algorithm with better features, or to reduce the constraints on the model.
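The training-set RMSE discussed above can be computed along the following lines. This is a sketch on synthetic data, not the housing data: the target is deliberately quadratic so that a linear model underfits it, and all variable names are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 1))
noise_std = 10_000
# Quadratic target plus noise: a straight line cannot capture it.
y = 5_000 * X[:, 0] ** 2 + rng.normal(0, noise_std, size=200)

lin_reg = LinearRegression()
lin_reg.fit(X, y)

predictions = lin_reg.predict(X)
lin_rmse = np.sqrt(mean_squared_error(y, predictions))
print(lin_rmse)
```

The RMSE on the training set ends up well above the noise level, which is the signature of underfitting: the remaining error comes from the model being too simple, not from the noise.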
One way to evaluate the Decision Tree model would be to use the train_test_split function to
split the training set into a smaller training set and a validation set, then train the model
against the smaller training set and evaluate it against the validation set.
K-Fold Cross-Validation
• A more efficient alternative is to use Scikit-Learn’s K-fold cross-validation.
• This method randomly splits the training set into k distinct subsets, called folds.
• The model is trained and evaluated k times, each time using a different fold for
evaluation and the remaining k-1 folds for training.
• This results in an array of k evaluation scores.
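A sketch of K-fold cross-validation with cross_val_score follows, using a DecisionTreeRegressor on synthetic data (an assumption, since the prepared housing data is not reproduced here) and k = 10.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(300, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 1.0, size=300)

tree_reg = DecisionTreeRegressor(random_state=42)

# Scikit-Learn expects a utility (higher is better), hence the negated MSE.
scores = cross_val_score(tree_reg, X, y,
                         scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)

print("Scores:", tree_rmse_scores)
print("Mean:", tree_rmse_scores.mean())
print("Standard deviation:", tree_rmse_scores.std())
```

Besides the mean score, the standard deviation across folds gives an idea of how precise the estimate is, which a single validation split cannot provide.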
The goal is to identify a shortlist of two to five promising models without spending too much
time on hyperparameter tweaking. By following these steps, you can ensure a thorough
evaluation of machine learning models, leading to better performance and reliability in real-
world applications.
Grid Search
• With Scikit-Learn’s GridSearchCV, you specify which hyperparameters you want it to
experiment with and what values to try out, and it evaluates all the possible
combinations of hyperparameter values using cross-validation.
• For example, the following code searches for the best combination of hyperparameter
values for the RandomForestRegressor:
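A grid search of this kind can be sketched as below. The parameter grid values, the three-fold setting, and the synthetic housing_prepared and housing_labels arrays are illustrative assumptions chosen to keep the example fast; they are not the grid used in the original project.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(42)
housing_prepared = rng.normal(size=(200, 4))        # synthetic features
housing_labels = (housing_prepared @ np.array([3.0, -2.0, 0.5, 0.0])
                  + rng.normal(0, 0.1, size=200))   # synthetic target

param_grid = [
    # 2 x 2 = 4 combinations
    {"n_estimators": [3, 10], "max_features": [2, 4]},
    # 2 x 2 = 4 more combinations, with bootstrapping turned off
    {"bootstrap": [False], "n_estimators": [3, 10], "max_features": [2, 3]},
]

forest_reg = RandomForestRegressor(random_state=42)
grid_search = GridSearchCV(forest_reg, param_grid, cv=3,
                           scoring="neg_mean_squared_error")
grid_search.fit(housing_prepared, housing_labels)

print(grid_search.best_params_)
n_combinations = len(grid_search.cv_results_["params"])
print(n_combinations)
```

With 8 combinations and 3 folds, the search trains 24 models plus one final refit of the best combination on the whole training set, so the cost grows quickly as the grid gets larger.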
Randomized Search
Ensemble Methods
Another way to fine-tune your system is to try to combine the models that perform best. The
group (or “ensemble”) will often perform better than the best individual model, especially if
the individual models make very different types of errors.
With this information, you may want to try dropping some of the less useful features. You
should also look at the specific errors that your system makes, then try to understand why it
makes them and what could fix the problem.
• Production Readiness: Integrate the production input data sources into your system
and write necessary tests to ensure everything functions correctly.
• Performance Monitoring: Develop code to monitor your system’s live performance
regularly and trigger alerts if there is a performance drop, to catch both sudden
breakage and gradual performance degradation.
• Human Evaluation: Implement a pipeline for human analysis of your system’s
predictions, involving field experts or crowdsourcing platforms, to evaluate and
improve system accuracy.
• Input Data Quality Check: Regularly evaluate the quality of the system’s input data
to detect issues early, preventing minor problems from escalating and affecting system
performance.
• Automated Training: Automate the process of training models with fresh data
regularly to maintain consistent performance and save snapshots of the system's state
for easy rollback in online learning systems.
GridSearchCV: This is a tool from Scikit-Learn that performs an exhaustive search over
specified parameter values for an estimator. It helps in finding the best combination of
hyperparameters for a given model.
fit: This method runs the search on the prepared housing data (housing_prepared) and the
corresponding labels (housing_labels), training the model for every hyperparameter
combination and cross-validation fold.
Classification
MNIST
• The MNIST dataset is a set of 70,000 small images of digits handwritten by high
school students and employees of the US Census Bureau.
• Each image is labeled with the digit it represents.
• This set has been studied so much that it is often called the “Hello World” of Machine
Learning: whenever people come up with a new classification algorithm, they are
curious to see how it will perform on MNIST.
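from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1, as_frame=False) # assumed source: OpenML copy of MNIST, as NumPy arrays
X, y = mnist["data"], mnist["target"]
X.shape
y.shape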
Output
(70000, 784)
(70000,)
There are 70,000 images, and each image has 784 features. This is because each image is
28×28 pixels, and each feature simply represents one pixel’s intensity, from 0 (white) to 255
(black).
Let’s look at one digit from the dataset. Fetch an instance’s feature vector, reshape it to a
28×28 array, and display it using Matplotlib’s imshow() function:
The figure below shows a few more images from the MNIST dataset to give you a feel for the
complexity of the classification task.
The MNIST dataset is actually already split into a training set (the first 60,000 images) and a
test set (the last 10,000 images):
• To simplify the problem, focus on identifying a single digit, such as the number 5.
• This '5-detector' will serve as an example of a binary classifier, distinguishing
between two classes: 5 and not-5.
• Let's create the target vectors for this classification task:
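# Assumes the labels are integers; if they were loaded as strings, cast them first (e.g., y = y.astype(np.uint8)).
y_train_5 = (y_train == 5) # True for all 5s, False for all other digits.
y_test_5 = (y_test == 5)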
• Let’s pick a classifier and train it. Consider the Stochastic Gradient Descent (SGD)
classifier, using Scikit-Learn’s SGDClassifier class.
• This classifier has the advantage of being capable of handling very large datasets
efficiently.
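Training such a 5-detector can be sketched as follows. To keep the example small and offline, it assumes Scikit-Learn’s bundled 8×8 digits dataset (load_digits) as a stand-in for MNIST, with an arbitrary split after the first 1,500 images.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier

# Stand-in for MNIST: 1,797 images of 8x8 pixels, labels 0 to 9.
digits = load_digits()
X, y = digits.data, digits.target
X_train, X_test = X[:1500], X[1500:]
y_train, y_test = y[:1500], y[1500:]

y_train_5 = (y_train == 5)   # True for all 5s, False for all other digits
y_test_5 = (y_test == 5)

# SGD relies on randomness during training; fix the seed for reproducibility.
sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_5)

some_digit = X_train[y_train_5][0]        # first image labeled 5
prediction = sgd_clf.predict([some_digit])
print(prediction)
```

The classifier returns one Boolean per input image, which is why the prediction for a single image is a one-element array.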
Output
array([ True])
Performance Measures
• Let’s use the cross_val_score() function to evaluate your SGDClassifier model using
K-fold cross-validation, with three folds.
• K-fold cross-validation means splitting the training set into K folds (here, three), then
making predictions and evaluating them on each fold using a model trained on the
remaining folds.
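A sketch of this evaluation with three folds is shown below, again assuming the bundled load_digits dataset as a small stand-in for MNIST.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score

X, y = load_digits(return_X_y=True)
y_5 = (y == 5)

sgd_clf = SGDClassifier(random_state=42)

# One accuracy value per fold: train on two folds, evaluate on the third.
scores = cross_val_score(sgd_clf, X, y_5, cv=3, scoring="accuracy")
print(scores)
```

Accuracy should be read with care here: only about one image in ten is a 5, so the classes are heavily skewed.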
Output
array([0.95035, 0.96035, 0.9604 ])
Confusion Matrix
Output
array([[53892, 687],
[ 1891, 3530]])
• Each row in a confusion matrix represents an actual class, while each column
represents a predicted class.
• The first row of this matrix considers non-5 images (the negative class): 53,892 of
them were correctly classified as non-5s (they are called true negatives), 687 were
wrongly classified as 5s (false positives).
• The second row considers the images of 5s (the positive class): 1,891 were wrongly
classified as non-5s (false negatives), while the remaining 3,530 were correctly
classified as 5s (true positives).
• A perfect classifier would have only true positives and true negatives, so its confusion
matrix would have nonzero values only on its main diagonal (top left to bottom right).
Output
array([[54579, 0],
[ 0, 5421]])
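Confusion matrices like the two above can be produced as in this sketch. It assumes out-of-fold predictions from cross_val_predict and the bundled load_digits dataset as a stand-in for MNIST, so the counts differ from those shown.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

X, y = load_digits(return_X_y=True)
y_5 = (y == 5)

# Each prediction is made by a model that never saw that instance in training.
sgd_clf = SGDClassifier(random_state=42)
y_pred = cross_val_predict(sgd_clf, X, y_5, cv=3)

cm = confusion_matrix(y_5, y_pred)   # rows: actual class, columns: predicted class
tn, fp, fn, tp = cm.ravel()
print(cm)

# Pretending the predictions were perfect leaves only the main diagonal.
perfect_cm = confusion_matrix(y_5, y_5)
print(perfect_cm)
```

Unpacking with ravel() follows the row order of the matrix: true negatives, false positives, false negatives, true positives.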
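Precision
Precision is the accuracy of the positive predictions:
precision = TP / (TP + FP)
Where:
• TP (True Positives) is the number of correctly predicted positive instances.
• FP (False Positives) is the number of instances incorrectly predicted as positive.
Recall
Recall is the ratio of positive instances that are correctly detected by the classifier:
recall = TP / (TP + FN)
Where:
• TP (True Positives) is the number of correctly predicted positive instances.
• FN (False Negatives) is the number of actual positive instances that were
incorrectly predicted as negative.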
precision_score(y_train_5, y_train_pred)
Output
0.8370879772350012
recall_score(y_train_5, y_train_pred)
Output
0.6511713705958311
When the classifier claims an image represents a 5, it is correct only 83.7% of the time.
Moreover, it detects only 65.1% of the 5s.
F1 score
• The F1 score is a metric used to evaluate the performance of a binary classification
model.
• It is the harmonic mean of precision and recall, providing a single metric that balances
both the false positives and false negatives.
• The F1 score is useful when you need to take both precision and recall into account
and is helpful when dealing with imbalanced datasets.
Output
0.7325171197343846
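As a worked check, precision, recall, and the F1 score can be recomputed by hand from the four counts of the confusion matrix shown earlier in this section. Only plain arithmetic is used; the variable names are illustrative.

```python
# Counts from the confusion matrix of the 5-detector shown earlier.
tn, fp = 53892, 687    # first row: actual non-5s
fn, tp = 1891, 3530    # second row: actual 5s

precision = tp / (tp + fp)   # share of predicted 5s that really are 5s
recall = tp / (tp + fn)      # share of actual 5s that were detected

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)
```

The results agree with the reported scores (about 0.837, 0.651, and 0.733). Because the harmonic mean gives more weight to low values, the F1 score sits closer to the recall than a simple average would.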
Area under the curve (AUC): One way to compare classifiers is to measure the area under the
ROC curve. The AUC value ranges from 0 to 1: a perfect classifier has a ROC AUC equal to 1,
whereas a purely random classifier has a ROC AUC of 0.5.
Let’s consider a RandomForestClassifier and compare its ROC curve and ROC AUC score to
those of the SGDClassifier.
The RandomForestClassifier’s ROC curve looks much better than the SGDClassifier’s: it
comes much closer to the top-left corner.
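The comparison can be sketched with roc_auc_score as below. The example assumes the bundled load_digits dataset as a stand-in for MNIST and a forest of 50 trees, so the resulting scores are not those of the original experiment.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

X, y = load_digits(return_X_y=True)
y_5 = (y == 5)

# SGDClassifier exposes decision scores through decision_function().
sgd_scores = cross_val_predict(SGDClassifier(random_state=42), X, y_5,
                               cv=3, method="decision_function")

# RandomForestClassifier has predict_proba() instead; use the probability
# of the positive class as the score.
forest_probas = cross_val_predict(
    RandomForestClassifier(n_estimators=50, random_state=42), X, y_5,
    cv=3, method="predict_proba")
forest_scores = forest_probas[:, 1]

sgd_auc = roc_auc_score(y_5, sgd_scores)
forest_auc = roc_auc_score(y_5, forest_scores)
print(sgd_auc, forest_auc)
```

The ROC AUC only needs a score that ranks instances, which is why decision scores and class probabilities can both be used.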
Multiclass Classification
Consider a system that can classify the digit images into 10 classes (from 0 to 9). There are
several strategies for building such a system from multiple binary classifiers.
Multiclass Strategies
One-versus-All (OvA) Strategy:
• Train 10 binary classifiers, one for each digit (0 to 9).
• Classify an image by selecting the class with the highest score.
• Example: Train a 0-detector, 1-detector, etc.
One-versus-One (OvO) Strategy:
• Train a binary classifier for every pair of digits: one to distinguish 0s and 1s,
another to distinguish 0s and 2s, another for 1s and 2s, and so on.
• If there are N classes, you need N×(N−1)/2 classifiers.
• For 10 classes, train 45 classifiers. Classify an image by determining which class
wins the most pairwise duels.
Algorithm Selection
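• OvO Preferred: For algorithms like SVM that scale poorly with the size of the training
set, since each OvO classifier is trained only on the instances of the two classes it
must distinguish.
• OvA Preferred: For most binary classification algorithms.
• Scikit-Learn Default: When a binary classification algorithm is used for a multiclass
task, Scikit-Learn automatically applies OvA, except for SVM classifiers, for which it
uses OvO.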
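Either strategy can also be forced explicitly with Scikit-Learn’s OneVsRestClassifier (its name for OvA) and OneVsOneClassifier wrappers. The sketch below assumes the bundled load_digits dataset and an SGDClassifier as the underlying binary classifier, and counts the classifiers each strategy trains.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

X, y = load_digits(return_X_y=True)
n_classes = len(set(y))                       # 10 digits

ovr_clf = OneVsRestClassifier(SGDClassifier(random_state=42)).fit(X, y)
ovo_clf = OneVsOneClassifier(SGDClassifier(random_state=42)).fit(X, y)

n_ovr = len(ovr_clf.estimators_)              # one classifier per class
n_ovo = len(ovo_clf.estimators_)              # one classifier per pair of classes
print(n_ovr, n_ovo, n_classes * (n_classes - 1) // 2)

print(ovo_clf.predict(X[:1]), y[:1])          # predicted vs. actual label
```

Although OvO trains many more classifiers, each one only sees the images of two digits, which is what makes it attractive for algorithms that scale poorly with the training set size.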
Error Analysis
Assume we have a promising model and aim to improve it by analysing the errors it makes.
Start by looking at the confusion matrix.
This confusion matrix looks fairly good, since most images are on the main diagonal, which
means that they were classified correctly. The 5s look slightly darker than the other digits,
which could mean that there are fewer images of 5s in the dataset or that the classifier does
not perform as well on 5s as on other digits.
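One way to tell these two explanations apart is to divide each row of the confusion matrix by the number of images in that class, so that error rates rather than absolute counts are compared. The sketch below uses a made-up 3-class matrix purely for illustration.

```python
import numpy as np

# Made-up confusion matrix (rows: actual class, columns: predicted class).
conf_mx = np.array([
    [90,  5,  5],
    [10, 80, 10],
    [ 4,  6, 30],   # this class has fewer images overall
])

row_sums = conf_mx.sum(axis=1, keepdims=True)   # images per actual class
norm_conf_mx = conf_mx / row_sums               # each row now sums to 1

# Zero the diagonal to keep only the error rates.
error_rates = norm_conf_mx.copy()
np.fill_diagonal(error_rates, 0)

print(norm_conf_mx.diagonal())   # per-class rate of correct classification
print(error_rates)
```

In this toy matrix the last class has both fewer images and a lower rate of correct classifications, so both effects contribute to its lower raw count on the diagonal.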
Analyzing individual errors can also give insight into what the classifier is doing and why it
is failing, but it is more difficult and time-consuming.
For example, let’s plot examples of 3s and 5s.
The two 5×5 blocks on the left show digits classified as 3s, and the two 5×5 blocks on the
right show images classified as 5s. Some of the digits that the classifier gets wrong (i.e., in
the bottom-left and top-right blocks) are so badly written that even a human would have
trouble classifying them (e.g., the 5 on the 1st row and 2nd column truly looks like a badly
written 3).
Multilabel Classification
• Multilabel Classification is a type of classification where each instance can belong to
multiple classes simultaneously.
• For example, in a face-recognition system, a picture with multiple known faces
should result in multiple outputs. If a classifier can recognize Alice, Bob, and Charlie,
and sees a picture of Alice and Charlie, it should output [1, 0, 1], meaning "Alice yes,
Bob no, Charlie yes".
3. Make Predictions:
• Predict using the trained classifier and output multiple labels.
knn_clf.predict([some_digit])
Output
array([[False, True]])
The digit 5 is indeed not large (False) and odd (True).
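The steps leading up to this prediction (building the two labels and training the classifier) might be sketched as follows. The example assumes that "large" means a digit of 7 or more, uses a KNeighborsClassifier as suggested by the name knn_clf, and relies on the bundled load_digits dataset as a stand-in for MNIST.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)

# 1. Create the labels: two Boolean targets per image.
y_large = (y >= 7)                 # assumed definition of "large"
y_odd = (y % 2 == 1)
y_multilabel = np.c_[y_large, y_odd]

# 2. Train a classifier that supports multiple target labels.
knn_clf = KNeighborsClassifier()
knn_clf.fit(X, y_multilabel)

# 3. Make predictions: one [large, odd] pair per input image.
some_digit = X[y == 5][0]
prediction = knn_clf.predict([some_digit])
print(prediction)
```

Each column of the prediction answers one of the two questions independently, so a single call returns both labels at once.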
Multioutput Classification
• Multioutput Classification (multioutput-multiclass classification) is a generalization of
multilabel classification. In this type of classification, each label can have multiple
values, not just binary options.
• For example, each label can represent different pixel intensities ranging from 0 to 255.
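As an illustration, a noise-removal task fits this description: the input is a noisy image, and the output is one label per pixel, each taking one of several intensity values. The sketch below assumes the bundled load_digits dataset (8×8 images with intensities from 0 to 16 rather than 0 to 255), synthetic random noise, and a KNeighborsClassifier.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.neighbors import KNeighborsClassifier

X, _ = load_digits(return_X_y=True)
X = X.astype(int)                          # pixel intensities 0..16

rng = np.random.default_rng(42)
X_noisy = X + rng.integers(0, 4, size=X.shape)   # add synthetic noise

# Inputs are noisy images; targets are the clean images (64 labels each).
X_train_noisy, X_test_noisy = X_noisy[:1500], X_noisy[1500:]
y_train_clean, y_test_clean = X[:1500], X[1500:]

knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train_noisy, y_train_clean)

clean_digit = knn_clf.predict([X_test_noisy[0]])  # one value per pixel
print(clean_digit.reshape(8, 8))
```

The classifier produces 64 outputs per image, and each output is a multiclass label (a pixel intensity), which is what distinguishes this setting from multilabel classification with binary labels.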
Deepak D, Asst. Prof., Dept. of AIML, Canara Engineering College, Mangaluru