ML 5
Monitoring learning performance using a training set helps you keep an eye on how the algorithm is doing. Training results are always too optimistic because, as learning occurs, some data memorization happens as well. The following advice helps you achieve a better result when using a test set. Only a k-fold cross-validation estimate or out-of-sample results are predictive of how well your solution will work when using new data.
As a first step to improving your results, you need to determine the problems with your model. Learning curves
require you to verify against a test set as you vary the number of training instances. (A cross-validation estimate
will suffice when you don’t have an out-of-sample set available.) You’ll immediately notice whether you find much
difference between your in-sample and out-of-sample errors. A wide initial difference (low in-sample error and high
out-of-sample error) is a sign of estimate variance; conversely, having errors that are both high and similar is a sign that
you’re working with a biased model.
Learning Curve: Line plot of learning (y-axis) over experience (x-axis).
Train Learning Curve: Learning curve calculated from the training dataset that gives an idea of how well the
model is learning.
Validation Learning Curve: Learning curve calculated from a hold-out validation dataset that gives an idea of
how well the model is generalizing.
Optimization Learning Curves: Learning curves calculated on the metric by which the parameters of the model
are being optimized, e.g. loss.
You also need to understand how the model behaves when the sample size increases. When variance is the problem, the
in-sample error should increase as the number of training examples grows because the model will find it harder to
memorize all the new instances you are feeding to it. On the other hand, the out-of-sample error should decrease as
the model learns more rules in light of the increased evidence provided by a larger sample.
Variance problems react well when you supply more data. To correct this problem, determine the training set size where the out-of-sample and in-sample error curves converge, and check whether you can obtain enough data. When getting enough data isn’t viable (because, for instance, the curves are too far away), you have to introduce model corrections based on variable selection or hyper-parameter tuning.
There are three common dynamics that you are likely to observe in learning curves: an underfit model (both errors stay high), an overfit model (a persistent gap between a low in-sample error and a high out-of-sample error), and a good fit (both errors converge toward a low value).
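To make the diagnosis concrete, here is a minimal sketch of how such learning curves can be computed and plotted, assuming scikit-learn and matplotlib with an illustrative dataset and estimator (none of which are prescribed by the text):

# Minimal learning-curve sketch for diagnosing bias vs. variance.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_breast_cancer(return_X_y=True)

# Train and validation scores at increasing training-set sizes, estimated by 5-fold CV
train_sizes, train_scores, valid_scores = learning_curve(
    LogisticRegression(max_iter=5000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 8), cv=5, scoring="accuracy")

plt.plot(train_sizes, train_scores.mean(axis=1), "o-", label="in-sample (train)")
plt.plot(train_sizes, valid_scores.mean(axis=1), "o-", label="out-of-sample (CV)")
plt.xlabel("Number of training examples")
plt.ylabel("Accuracy")
plt.legend()
plt.show()

# A wide, persistent gap between the two curves suggests variance;
# two low, similar curves suggest bias.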
Cross-validation provides you with hints when the steps you take (data preparation, data and feature selection, hyper-
parameter fixing, or model selection) are correct. It’s important, but not critical, that CV estimates precisely replicate out-of-sample error measurements. However, it is crucial that CV estimates correctly reflect improvement or
worsening in the test phase due to your modelling decisions. Generally, there are two reasons that the cross-validation
estimates can vary from the true error results:
» Snooping: Information is leaking from the response to the model. The problem will also affect your test set when leaking is generalized, but it won’t affect any new data you use out-of-sample. This problem often happens when you apply preprocessing on pooled training and test data. (When pooling, you vertically stack the training and test matrices so that you can work on a single matrix without having to replicate the same operations twice.)
You shouldn’t use the test data for missing-data imputation, parameter normalization, or dimensionality reduction. When you use the pooled data for any of these three purposes, information from the test set easily and unnoticeably leaks into your training process, making your work unreliable (the sketch after this list shows how to keep such preprocessing inside the training folds).
» Incorrect sampling: When your response classes aren’t evenly distributed, simple random sampling may prove inadequate. You should test stratified sampling, a statistical sampling method that ensures the response classes are drawn into the sample in the same proportions in which they appear in the full dataset.
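As a minimal sketch of avoiding the snooping problem, preprocessing can be kept inside a scikit-learn Pipeline so that imputation and scaling are fitted only on the training folds; the dataset and estimator here are illustrative assumptions:

# Keeping preprocessing inside the cross-validation loop to avoid leakage.
from sklearn.datasets import load_breast_cancer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

leak_free = Pipeline([
    ("impute", SimpleImputer(strategy="median")),   # fitted per training fold
    ("scale", StandardScaler()),                    # fitted per training fold
    ("model", LogisticRegression(max_iter=5000)),
])

# cross_val_score refits the whole pipeline on each training fold, so no
# information from the held-out fold reaches the preprocessing steps.
print(cross_val_score(leak_free, X, y, cv=5).mean())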
Methods used for Cross-Validation
Validation Set Approach : In the validation set approach, we divide our input dataset into a training set and a test or validation set, with each subset receiving 50% of the data. Because the model learns from only half of the available data, this approach also tends to give an underfitted model.
Leave-P-out cross-validation : In this approach, p data points are left out of the training data. If there are n total data points in the original input dataset, then n − p data points are used as the training set and the p data points as the validation set. Because this is repeated for every possible subset of p points, it can be computationally expensive for large p.
Leave-one-out cross-validation : This method is similar to leave-p-out cross-validation, but with p = 1. In each learning round, only one data point is reserved for validation and the remaining data is used to train the model. The process repeats once for each data point.
K-Fold Cross-Validation : The k-fold cross-validation approach divides the input dataset into K groups of samples of equal size. These samples are called folds. This approach is a very popular CV approach because it is easy to understand, and the output is less biased than other methods.
Stratified k-fold cross-validation : This technique is similar to k-fold cross-validation with a few small changes. It works on the concept of stratification, the process of rearranging the data to ensure that each fold or group is a good representative of the complete dataset. It is one of the best approaches for dealing with bias and variance. It can be understood with the example of housing prices, where the price of some houses can be much higher than that of others; stratification keeps such cases evenly represented across the folds.
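The following minimal sketch, assuming scikit-learn and an illustrative imbalanced toy dataset, contrasts plain k-fold with stratified k-fold cross-validation:

# Comparing k-fold and stratified k-fold on imbalanced classes.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

# Imbalanced toy problem: roughly 90% of one class, 10% of the other
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
model = LogisticRegression(max_iter=1000)

plain = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
strat = cross_val_score(model, X, y, cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0))

# Stratified folds preserve the class proportions in every fold, so the
# estimates are usually more stable on imbalanced data.
print("k-fold:           ", round(plain.mean(), 3))
print("stratified k-fold:", round(strat.mean(), 3))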
Trying to optimize an error metric based on the median error by using a learning algorithm based on the mean error
won’t provide you with the best results unless you manage the optimization process in a fashion that works in favor
of your chosen metric. When solving a problem using data and machine learning, you need to analyze the
problem and determine the ideal metric to optimize. You can get many of them from academic papers and from
public machine learning contests that carefully define specific problems in terms of data and error/score metric.
Look for a contest whose objective and data are similar to yours, and then check the requested metric.
Contests can provide great inspiration on more than just error metrics. You can easily learn how to manage data
applications, use machine learning tricks, and perform smart feature creation.
Check whether the machine learning algorithm that you want to use supports your chosen metric. If the algorithm uses another metric, try to influence it by searching for the best combination of hyper-parameters that can maximize your metric. You can achieve that goal by doing a grid search for the best cross-validated result using your own metric as the target.
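As a minimal sketch of using your own metric as the grid-search target, assuming scikit-learn and an illustrative estimator and parameter grid, the median absolute error can be wrapped with make_scorer:

# Grid search driven by a custom metric (median absolute error).
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import make_scorer, median_absolute_error
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# greater_is_better=False because a smaller median error is better
median_scorer = make_scorer(median_absolute_error, greater_is_better=False)

search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [2, 3]},
    scoring=median_scorer, cv=5)
search.fit(X, y)
print(search.best_params_, -search.best_score_)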
Regression Metrics
RMSE (Root Mean Square Error) : It represents the sample standard deviation of the differences between predicted values and observed values (called residuals). Mathematically, it is calculated as RMSE = sqrt( (1/N) Σ (y_i − ŷ_i)² ), where y_i is an observed value, ŷ_i the corresponding prediction, and N the number of observations.
MAE (Mean Absolute Error) : MAE is the average of the absolute differences between the predicted values and the observed values. The MAE is a linear score, which means that all the individual differences are weighted equally in the average. For example, the difference between 10 and 0 counts twice as much as the difference between 5 and 0. Mathematically, it is calculated as MAE = (1/N) Σ |y_i − ŷ_i|.
R² (R-squared) : R² expresses how much of the variance in the observed values the model explains and can be written as R² = 1 − MSE / Var(Y). The numerator is the MSE (the average of the squares of the residuals) and the denominator is the variance in the Y values. The higher the MSE, the smaller the R² and the poorer the model.
Adjusted R² : Just like R², adjusted R² also shows how well terms fit a curve or line, but it adjusts for the number of terms in a model. It is given by Adjusted R² = 1 − (1 − R²)(n − 1) / (n − k − 1), where n is the total number of observations and k is the number of predictors. Adjusted R² will always be less than or equal to R².
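A minimal sketch of computing these metrics on held-out data, assuming scikit-learn and numpy with an illustrative model and dataset:

# RMSE, MAE, R², and adjusted R² on a held-out test set.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=400, n_features=8, noise=15.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pred = LinearRegression().fit(X_train, y_train).predict(X_test)

rmse = np.sqrt(mean_squared_error(y_test, pred))
mae = mean_absolute_error(y_test, pred)
r2 = r2_score(y_test, pred)

# Adjusted R² penalizes R² for the number of predictors k
n, k = X_test.shape
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(f"RMSE={rmse:.2f}  MAE={mae:.2f}  R2={r2:.3f}  adjusted R2={adj_r2:.3f}")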
Most algorithms perform fairly well out of the box using the default parameter settings. However, you can always
achieve better results by testing different hyper-parameters. All you have to do is to create a grid search among
possible values that your parameters can take and evaluate the results using the right error or score metric. The
search takes time, but it can improve your results (not drastically, but significantly).
When a search takes too long to complete, you can often achieve the same results by working on a sample of your
original data. Fewer examples chosen at random require fewer computations, but they usually hint at the same
solution. Another trick that can save time and effort is to do a randomized search thus limiting the number of
hyper-parameter combinations to test.
We have four main strategies available for searching for the best configuration:
Babysitting : Babysitting is also known as Trial & Error or Grad Student Descent in the academic field. This approach is 100% manual and the most widely adopted by researchers, students, and hobbyists. It is very educational, but it doesn't scale inside a team or a company where the time of the data scientist is really valuable.
Grid Search : Picture a simple grid search over two dimensions, dropout and learning rate: every combination of candidate values is evaluated. This strategy is embarrassingly parallel because it doesn't take the computation history into account. The real pain point of this approach is known as the curse of dimensionality: the more dimensions we add, the more the search explodes in time complexity (usually by an exponential factor), ultimately making this strategy unfeasible! It's common to use this approach when the number of dimensions is less than or equal to 4.
Random Search : Compare the two strategies when searching for the best configuration over a space of two hyperparameters. With a grid layout, it's easy to notice that even though we train 9 models, we use only 3 values per variable! With a random layout, it's extremely unlikely that we select the same value more than once, so with the second approach we end up training 9 models using 9 different values for each variable.
Bayesian Optimization : This search strategy builds a surrogate model that tries to predict the metric we care about from the hyperparameter configuration, and uses that surrogate to decide which configurations to try next.
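A minimal sketch contrasting grid search and random search, assuming scikit-learn and scipy with an illustrative estimator and parameter ranges (a surrogate-based Bayesian search would need an extra library and is not shown):

# Grid search tries every combination; random search samples a fixed budget.
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
model = RandomForestClassifier(random_state=0)

grid = GridSearchCV(model, param_grid={"max_depth": [3, 6, 9],
                                       "min_samples_leaf": [1, 3, 5]}, cv=5)

rand = RandomizedSearchCV(model, param_distributions={
    "max_depth": randint(2, 12),
    "min_samples_leaf": randint(1, 10)}, n_iter=9, cv=5, random_state=0)

grid.fit(X, y)   # 3 x 3 = 9 combinations, only 3 distinct values per dimension
rand.fit(X, y)   # 9 combinations, up to 9 distinct values per dimension
print(grid.best_params_, round(grid.best_score_, 3))
print(rand.best_params_, round(rand.best_score_, 3))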
Representing the performance of different models using the same chart is helpful before choosing the best
one to solve your problem. You can place models used to predict consumer behavior, such as a response to a
commercial offer, in special gain charts and lift charts. These charts show how your model performs by partitioning its results into deciles or smaller parts. Because you may be interested only in the consumers who are most likely to respond to your offer, ordering predictions from most to least likely will emphasize how good your models are at predicting the most promising customers.
A lift curve is a way of visualizing the performance of a classification model. Lift curves are closely related to, and frequently confused with, cumulative gains charts. A cumulative gains chart shows the total number of events captured by a model over a given number of samples. Such a chart can be used to compare a model against a theoretically ideal model (one that perfectly predicts events given a sample) and against a model that is no better than random guessing.
A lift curve shows the ratio of a model to a random guess ('model cumulative sum' / 'random guess' from above).
Cumulative gains charts are a bit more useful, as they show "how many events can I expect given X number of samples", while a lift chart simply shows "how does my model compare to random guessing given X number of samples".
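A minimal sketch of building cumulative gains and lift values from predicted probabilities, assuming numpy and scikit-learn with an illustrative classifier:

# Cumulative gains and lift: sort customers from most to least likely to respond.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

proba = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict_proba(X_test)[:, 1]

order = np.argsort(proba)[::-1]                   # most likely responders first
captured = np.cumsum(y_test[order])               # events captured so far
fraction_contacted = np.arange(1, len(y_test) + 1) / len(y_test)

gain = captured / y_test.sum()                    # cumulative gains curve
lift = gain / fraction_contacted                  # lift over random guessing

# e.g. gains and lift after contacting the top decile of customers
top_decile = int(0.1 * len(y_test))
print("gain@10%:", round(gain[top_decile - 1], 3), " lift@10%:", round(lift[top_decile - 1], 2))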
Testing multiple models and introspecting them (understanding which features work better with them) can also
provide suggestions as to which features to transform for feature creation, or which feature to leave out when you
make feature selections.
Machine learning involves building many models and creating many different predictions, all with different
expected error performances. It may surprise you to know that you can get even better results by averaging the
models together. The principle is this: estimate variance is random, so by averaging many different models, you can enhance the signal (the correct prediction) and rule out the noise, which will often cancel itself out (opposite errors sum to zero).
Sometimes the results from an algorithm that performs well, mixed with the results from a simpler algorithm that
doesn’t work as well, can create better predictions than using a single algorithm. Don’t underestimate contributions
delivered by simpler models, such as linear models, when you average their results with the output from more
sophisticated algorithms, such as gradient boosting.
It’s the same principle you seek when applying ensembles of learners, such as tree bagging and boosting ensembles.
However, this time you use the technique on complete and heterogeneous models that you prepare for evaluation.
In this case, if the result needs to guess a complex target function, different models may catch different parts of that
function. Only by averaging results output by different simple and complex models can you approximate a model
that you can’t build otherwise.
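A minimal sketch of averaging a simple linear model with a gradient boosting model, assuming scikit-learn; the base models and dataset are illustrative:

# Averaging a simple and a sophisticated model via VotingRegressor.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=600, n_features=15, noise=20.0, random_state=0)

linear = LinearRegression()
boosted = GradientBoostingRegressor(random_state=0)
averaged = VotingRegressor([("linear", linear), ("gbm", boosted)])  # averages predictions

for name, model in [("linear", linear), ("boosting", boosted), ("average", averaged)]:
    score = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name:>9}: R2 = {score:.3f}")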
For the same reasons that averaging works, stacking can also provide you with better performance. In stacking,
you build your machine learning models in two (or sometimes even more) stages. Initially this technique predicts
multiple results using different algorithms, with all of them learning from the features present in your data. During
the second phase, instead of providing features that a new model will learn, you provide that model with the
predictions of the other, previously trained models.
Using a two-stage approach is justified when guessing complex target functions. You can approximate them only by using multiple models together and then by combining their results in a smart way. You can use a simple logistic regression or a complex tree ensemble as the second-stage model.
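A minimal sketch of two-stage stacking, assuming scikit-learn with illustrative first-stage learners and a logistic regression as the second stage:

# Stacking: the second stage learns from the cross-validated predictions of the first stage.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=800, n_features=20, random_state=0)

stack = StackingClassifier(
    estimators=[("forest", RandomForestClassifier(random_state=0)),
                ("svm", SVC(probability=True, random_state=0))],
    final_estimator=LogisticRegression(),   # second stage learns from first-stage predictions
    cv=5)

print(round(cross_val_score(stack, X, y, cv=5).mean(), 3))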
If you believe that bias is still affecting your model, you have little choice but to create new features that improve
the model’s performance. Every new feature can make guessing the target response easier. For instance, if classes
aren’t linearly separable, feature creation is the only way to change a situation that your machine learning algorithm
cannot properly deal with.
Automatic feature creation is possible using polynomial expansion or the support vector machines class of machine
learning algorithms. Support vector machines can automatically look for better features in higher-dimensional
feature spaces in a way that’s both computationally fast and memory optimal.
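A minimal sketch of feature creation through polynomial expansion, assuming scikit-learn and an illustrative dataset whose classes aren't linearly separable:

# Polynomial expansion turns a non-linearly-separable problem into a linear one.
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Classes arranged in concentric circles are not linearly separable
X, y = make_circles(n_samples=500, noise=0.1, factor=0.5, random_state=0)

plain = LogisticRegression(max_iter=1000)
expanded = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                         LogisticRegression(max_iter=1000))

print("raw features:       ", round(cross_val_score(plain, X, y, cv=5).mean(), 3))
print("polynomial features:", round(cross_val_score(expanded, X, y, cv=5).mean(), 3))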
However, nothing can really substitute for your expertise and understanding of the method needed to solve the
data problem that the algorithm is trying to learn. You can create features based on your knowledge and ideas of
how things work in the world. Feature creation is more art than science, and an undoubtedly human art.
Feature creation is always the best way to improve the performance of an algorithm, not just when bias is the problem, but also when your model is complex and has high variance.
In Machine Learning, representation learning or feature learning is a set of methods or techniques that lets a system find out, automatically, the representations required for feature recognition or grouping from raw data. Feature learning can be classified into two types: supervised feature learning and unsupervised feature learning.
2) Discretization :
Discretization involves taking a set of data values and grouping them together in some logical fashion into bins (or buckets). Binning can apply to numerical values as well as to categorical values. It can help prevent overfitting, but at the cost of losing granularity in the data. The grouping of data can be done as follows (a short sketch after this list of techniques illustrates binning, encoding, and outlier handling):
1. Grouping of equal intervals
2. Grouping based on equal frequencies (of observations in the bin)
3. Grouping based on decision tree sorting (to establish a relationship with target)
3) Categorical Encoding :
Categorical encoding is the technique used to encode categorical features into numerical values, which are usually simpler for an algorithm to understand.
4) Feature Splitting :
Splitting features into parts can sometimes improve the value of the features toward the target to be learned. For instance, splitting a timestamp into separate Date and Time parts may reveal that Date contributes more to the target function than the combined Date and Time value.
5) Handling Outliers :
Outliers are unusually high or low values in the dataset that are unlikely to occur in normal scenarios. Because these outliers could adversely affect your predictions, they must be handled appropriately.
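Here is the promised sketch, assuming pandas and an illustrative toy table, of three of the techniques above: binning a numerical column, encoding a categorical column, and capping outliers:

# Discretization, categorical encoding, and outlier capping on a toy table.
import pandas as pd

df = pd.DataFrame({
    "age": [22, 35, 47, 58, 64, 29, 41, 120],          # 120 is an outlier
    "city": ["Rome", "Paris", "Rome", "Berlin", "Paris", "Berlin", "Rome", "Paris"],
})

# Discretization: equal-width bins for age
df["age_bin"] = pd.cut(df["age"], bins=4, labels=["young", "adult", "senior", "elder"])

# Categorical encoding: one-hot encode the city column
df = pd.get_dummies(df, columns=["city"], prefix="city")

# Outlier handling: cap age at the 1st and 99th percentiles
low, high = df["age"].quantile([0.01, 0.99])
df["age_capped"] = df["age"].clip(lower=low, upper=high)

print(df.head())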
If estimate variance is high and your algorithm is relying on many features (tree-based algorithms choose features
they learn from), you need to prune some features for better results. In this context, reducing the number of features in your data matrix by picking those with the highest predictive value is advisable.
When working with linear models, linear support vector machines, or neural networks, regularization is always an option. Both L1 and L2 regularization can reduce the influence of redundant variables, and L1 can even remove them from the model. Stability selection leverages this ability of the L1 penalty to exclude less useful variables; the technique resamples the training data to confirm the exclusion.
Feature Selection Techniques :There are mainly two types of Feature Selection techniques, which are:
o Supervised Feature Selection technique : Supervised Feature selection techniques consider the target
variable and can be used for the labelled dataset.
o Unsupervised Feature Selection technique : Unsupervised Feature selection techniques ignore the
target variable and can be used for the unlabelled dataset.
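A minimal sketch of supervised feature selection through an L1 penalty, assuming scikit-learn and an illustrative dataset with mostly uninformative features; SelectFromModel keeps only the variables whose Lasso coefficients are non-zero:

# L1-based feature selection: keep only variables with non-zero Lasso coefficients.
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# 20 features, but only 5 actually drive the response
X, y = make_regression(n_samples=500, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

selector = SelectFromModel(Lasso(alpha=1.0, random_state=0)).fit(X, y)
X_reduced = selector.transform(X)

print("kept features:", X_reduced.shape[1], "out of", X.shape[1])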
After trying all the previous suggestions, you may still have a high variance of predictions to deal with. In this case,
your only option is to increase your training set size. Try increasing your sample by providing new data, which
could translate into new cases or new features.
If you want to add more cases, just look to see whether you have similar data at hand. You can often find more data,
but it may lack labelling, that is, the response variable. Spending time labelling new data or asking other people to
label it for you may prove a great investment. Complex models can improve a lot from additional training examples
because adding data makes parameter estimation much more reliable and disambiguates cases in which the machine
learning algorithm can’t determine which rule to extract.
If you want to add new features, locate an open source data source, if possible, to match your data with its entries.
Another great way to obtain both new cases and new features is by scraping data from the web. Often, data is available from different sources or through an application programming interface (API). For instance, Google
APIs offer many geographical and business information sources. By scripting a scraping session, you can obtain
new data that can put a different perspective on your learning problem. New features help by offering alternative
ways to separate your classes, which you do by making the relationship between the response and the predictors
more linear and less ambiguous.