Data Science Interview Preparation
Please use search (Ctrl + F) / left side pane to find the below topics:
1. Linear Regression
2. Assumptions of Linear Regression
3. Maximum Likelihood method
4. LASSO & Ridge Regression
5. Logistic Regression
6. Probabilistic Model Selection with AIC, BIC, and MDL
7. Statistics
8. Central Limit Theorem
9. Confidence Interval
10. T-statistics
11. Hypothesis Testing & P-Value
12. Chi-Squared Statistic, Test and Distribution
13. ANOVA (Analysis of Variance)
14. Metrics for ML model performance - Classification
15. What’s the Bias-Variance trade-off ?
16. R2 and adjusted-R2
17. Regression Losses
18. Principal Component Analysis (PCA):
19. KNN
20. SVM
21. NAIVE BAYES
22. CLUSTERING
23. Class Imbalance
24. SQL
25. BOOK: Hands on ML with Scikit Learn and Tensorflow
26. TABLEAU
27. Miscellaneous Questions
28. SELF PROJECTS
29. COURSERA Deep Learning Specialization Notes:
30. DECISION TREES
1. Linear Regression:
Reducible error can be of two kinds. Error due to variance and error due to bias.
Find the coefficients:
This cost function corresponds to batch gradient descent, where at each iteration the
cost function is calculated using all the observations present in the training set.
If the training set is large, the batch size m can be reduced to a small subset
(mini-batch gradient descent) or to one observation at a time (stochastic gradient
descent or online learning) so that each update is cheaper and the algorithm converges faster.
Will this always converge?: Notice that the cost function for linear regression is
always in the form of a quadratic function. Hence, it is a convex function and so the
algorithm will always converge to the global minimum, unlike more complex models
where the gradient descent algorithm can get stuck at a local minimum.
Choose the right learning rate: However, choosing the right value of learning rate is
critical. If the learning rate is too small, gradient descent can be too slow to converge.
While if it is too large, gradient descent can overshoot the minimum and may fail to
converge, or may even diverge.
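A minimal sketch of batch gradient descent for simple linear regression (illustrative only; the data and the learning rate alpha are made up to show the update rule and the effect of the step size):

import numpy as np

# toy data: y ~ 4 + 3x plus noise (made-up example)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=100)
y = 4 + 3 * X + rng.normal(0, 1, size=100)

w, b = 0.0, 0.0          # initial coefficients
alpha = 0.01             # learning rate: too small -> slow, too large -> may diverge
m = len(X)
for _ in range(2000):    # batch gradient descent: uses all m observations per step
    y_hat = w * X + b
    dw = (2 / m) * np.sum((y_hat - y) * X)   # gradient of the MSE cost w.r.t. w
    db = (2 / m) * np.sum(y_hat - y)         # gradient of the MSE cost w.r.t. b
    w -= alpha * dw
    b -= alpha * db
print(w, b)              # should approach roughly 3 and 4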
Read more about Linear Regression: All about Linear Regression. This blog post
is intended to the… | by Supreeth Manyam | Medium
https://www.youtube.com/watch?v=0MFpOQRY0rw
2. Assumptions of Linear Regression:
a) (L) Linearity
b) (N) Normal Errors
c) No multicollinearity between independent variables (the Xs)
d) (I) Independent Error Terms: No autocorrelation in error
e) (E) Equal/Constant Error Variance: Residuals/errors have same variance
(Homoscedasticity)
f) Exogeneity
g) Random - Data has no bias
If these assumptions are not followed, the coefficients or Standard Error values
would not be reliable for Hypothesis Testing.
Relationship between the predictors (X) and the response variable (Y) is additive and
linear.
b) Normal Errors: The residuals (errors), which capture the variation in Y not explained
by the model, are normally distributed (with mean = 0 and constant variance; see point e).
The Gaussian (normal) seems like a good choice because the errors look symmetric about
where the line would be, and small errors are more likely than large errors.
But, as sample sizes increase, the normality assumption for the residuals is not needed.
More precisely, if we consider repeated sampling from our population, for large sample
sizes, the distribution (across repeated samples) of the ordinary least squares estimates
of the regression coefficients follow a normal distribution. (by the Central Limit
Theorem) That is -> All variables should be multivariate normal (i.e linear combination
of r.v.s should be normally distributed). Checks for normality:
I. Histogram
II. QQ or Quantile Quantile Plots (A Q-Q plot is a scatterplot created by plotting
two sets of quantiles against one another. If both sets of quantiles came from the same
distribution, we should see the points forming a line that’s roughly straight.)
III. Kolmogorov-Smirnov test (goodness-of-fit test): tests for differences in
shape between two sample distributions, or checks whether a variable follows a given
distribution in a population
Note: This is a weak assumption - violating normality is not seen as a big problem,
especially for a large number of observations.
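A short sketch of these normality checks on a vector of residuals (hypothetical residuals generated here just for illustration; assumes scipy is available):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
residuals = rng.normal(0, 2, size=200)          # stand-in for model residuals

# I. Histogram (counts per bin; plot them if matplotlib is available)
counts, bin_edges = np.histogram(residuals, bins=20)

# II. Q-Q plot data: theoretical vs. ordered sample quantiles (roughly a straight line if normal)
(theoretical_q, ordered_vals), (slope, intercept, r) = stats.probplot(residuals, dist="norm")

# III. Kolmogorov-Smirnov test against a normal with the sample's mean and std
ks_stat, ks_p = stats.kstest(residuals, "norm", args=(residuals.mean(), residuals.std()))
print(ks_stat, ks_p)    # a large p-value -> no evidence against normality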
While a scatterplot allows you to check for autocorrelations, you can test the linear
regression model for autocorrelation with the Durbin-Watson test. The null hypothesis
of the test is that there is no serial correlation. The Durbin-Watson test statistic is
defined as DW = Σ (e_t − e_{t−1})² / Σ e_t² (the first sum over t = 2..T, the second over t = 1..T,
where e_t are the residuals).
The test statistic is approximately equal to 2*(1-r), where r is the sample autocorrelation
of the residuals. Thus, for r = 0, indicating no serial correlation, the test statistic equals
2. This statistic will always be between 0 and 4. The closer to 0 the statistic, the more
evidence for positive serial correlation. The closer to 4, the more evidence for negative
serial correlation.
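A quick way to compute this in practice (a sketch assuming statsmodels is installed, with made-up data):

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(100, 2)))   # toy predictors plus intercept
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=100)

resid = sm.OLS(y, X).fit().resid
print(durbin_watson(resid))   # ~2 means no serial correlation; <2 positive, >2 negative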
Q) How to check?
Ans) A scatter plot of residual values (Y-axis) vs predicted values (X-axis) is a
good way to check for homoscedasticity. There should be no clear pattern in the
distribution (constant variance); if there is a specific pattern, the data is heteroscedastic.
f) Exogeneity (no omitted variable bias): If there is a variable that is omitted from the
model, but affects both X and Y, then it could cause omitted variable bias.
Q) How to detect?
Ans) Using intuition
Recommendations to improve the performance of the linear regression model:
2. If the distributions of the predictors are skewed, the estimated coefficients will also be
skewed. Hence, it is recommended practice to check for skewness in numerical predictors
and transform them to be roughly symmetric (see the sketch after this list) to obtain improved results.
3. Outlier removal: An outlier is a point for which the actual response value is far
from the value predicted by the regression model. Removing an outlier usually does not
change the regression line much, but outliers do affect the interpretation of the model,
since the RSE and the p-values of the model can change drastically.
Residual plots are used to detect the outliers. Most of the statistical tools use the
residuals to detect outliers.
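A sketch of the skewness check and transform mentioned in point 2 (a made-up right-skewed predictor; np.log1p is one common choice, not the only one):

import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)   # right-skewed toy predictor

print(skew(x))            # clearly positive -> right-skewed
x_sym = np.log1p(x)       # log(1 + x) pulls in the long right tail
print(skew(x_sym))        # much closer to 0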
Model Diagnostics:
1. Test for overall model (R² and adjusted-R²).
2. Test for statistical significance of the overall model (F-statistic and its p-value).
3. Test for statistical significance of the coefficient of each predictor (t-statistic and its
corresponding p-value).
4. Test for normality(QQ-plot) and homoscedasticity(residual plot) of residuals and
auto correlation(Durbin-watson statistic).
5. Test for multicollinearity.
6. Test for outliers, high leverage points and influential observations.
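A minimal sketch of how most of these diagnostics can be pulled from a fitted statsmodels OLS model (toy data; the attribute names are standard statsmodels ones):

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = 2 + X @ np.array([1.5, 0.0, -2.0]) + rng.normal(size=200)
Xc = sm.add_constant(X)

res = sm.OLS(y, Xc).fit()
print(res.rsquared, res.rsquared_adj)      # 1. overall fit
print(res.fvalue, res.f_pvalue)            # 2. overall significance
print(res.params, res.pvalues)             # 3. per-coefficient t-tests (p-values)
print(durbin_watson(res.resid))            # 4. autocorrelation of residuals
vifs = [variance_inflation_factor(Xc, i) for i in range(1, Xc.shape[1])]  # 5. multicollinearity
print(vifs)
# 6. res.get_influence() exposes leverage and influence measures (e.g., Cook's distance)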
Model Selection:
Subset selection: Out of k predictors, only a subset of predictors might contribute to
the Linear regression model and rest all could be noise to the data which might actually
underestimate or overestimate the errors.
To select the best subset of variables, test error needs to be estimated either by directly
estimating the test error through cross validation or by indirectly making an adjustment
to the training error using the following metrics as discussed below.
Metrics to select the best subset:
● Mallow’s Cp: Estimate of the size of bias introduced into the predicted response.
Lower the Cp, better the model.
● Akaike information criterion (AIC): Amount of information lost due to the
predictions on the response variable. AIC's main goal is to build a model that
effectively predicts the response variable.
● Bayesian information criterion (BIC): BIC is similar to AIC, but BIC penalizes
additional (noise) predictors more heavily. BIC's main goal is to extract the features
that are actually influencing the response variable.
● Adjusted-R²: Proportion of the response variable explained by the independent
variables.
3. Maximum Likelihood method:
● We first have to decide which model we think best describes the process of
generating the data. This part is very important. At the very least, we should have
a good idea about which model to use. For these data we’ll assume that the data
generation process can be adequately described by a Gaussian (normal)
distribution. We want to know which curve was most likely responsible for
creating the data points that we observed? (See figure below). Maximum
likelihood estimation is a method that will find the values of μ and σ that result in
the curve that best fits the data.
● After deciding which curve was responsible for creating the data points that we
observed, we want to calculate the parameter values.
Again we’ll demonstrate this with an example. Suppose we have three data
points this time and we assume that they have been generated from a process
that is adequately described by a Gaussian distribution. These points are 9, 9.5
and 11. How do we calculate the maximum likelihood estimates of the
parameter values of the Gaussian distribution μ and σ?
What we want to calculate is the total probability of observing all of the data, i.e. the joint
probability distribution of all observed data points. To do this we would need to calculate
some conditional probabilities, which can get very difficult. So it is here that we’ll make
our first assumption. The assumption is that each data point is generated
independently of the others. (iids: independent and identically distributed )
We just have to figure out the values of μ and σ that result in giving the maximum
value of the above expression.
Figure: Finding the parameter Mu (mean) which maximizes the likelihood of observing
the weight (or any other variable/feature) values
The above expression for the total probability is actually quite a pain to differentiate, so
it is almost always simplified by taking the natural logarithm of the expression. This is
absolutely fine because the natural logarithm is a monotonically increasing function.
This means that if the value on the x-axis increases, the value on the y-axis also
increases (see figure below). This is important because it ensures that the maximum
value of the log of the probability occurs at the same point as the original probability
function. Therefore we can work with the simpler log-likelihood instead of the original
likelihood.
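A tiny numeric sketch of this for the three points (9, 9.5 and 11), assuming numpy and scipy are available; for a Gaussian the MLEs have the closed form "sample mean" and "standard deviation with n (not n − 1) in the denominator":

import numpy as np
from scipy.stats import norm

data = np.array([9.0, 9.5, 11.0])

# closed-form maximum likelihood estimates for a Gaussian
mu_hat = data.mean()                   # 9.833...
sigma_hat = data.std(ddof=0)           # MLE uses n, not n - 1, in the denominator

# log-likelihood of the data under any (mu, sigma); the MLE maximizes this
def log_likelihood(mu, sigma):
    return norm.logpdf(data, loc=mu, scale=sigma).sum()

print(mu_hat, sigma_hat)
print(log_likelihood(mu_hat, sigma_hat))      # higher than for any other (mu, sigma)
print(log_likelihood(10.5, 1.0))              # e.g., a worse candidate gives a lower value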
What kind of distribution are we going to use in linear regression? With that question,
we can talk about the main assumption in linear regression:
"The dependent variable (the y values) is assumed to be normally distributed around the fitted line"
Hence:
Log Likelihood:
Ans) Least squares minimisation is another common method for estimating parameter
values for a model in machine learning. It turns out that when the model is assumed to
be Gaussian as in the examples above, the MLE estimates are equivalent to the
least squares method.
Intuitively we can interpret the connection between the two methods by understanding
their objectives. For least squares parameter estimation we want to find the line that
minimises the total squared distance between the data points and the regression line
(see the figure below). In maximum likelihood estimation we want to maximise the total
probability of the data. When a Gaussian distribution is assumed, the maximum
probability is found when the data points get closer to the mean value. Since the
Gaussian distribution is symmetric, this is equivalent to minimising the distance between
the data points and the mean value.
https://www.youtube.com/watch?v=Q81RR3yKn30&list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF&index=18
5. Logistic Regression:
Logistic (sigmoid) function: 1 / (1 + e^(-value))
A linear model does not output probabilities, but it treats the classes as numbers (0
and 1) and fits the best hyperplane (for a single feature, it is a line) that minimizes the
distances between the points and the hyperplane. So it simply interpolates between the
points, and you cannot interpret it as probabilities.
A linear model also extrapolates and gives you values below zero and above one.
This is a good sign that there might be a smarter approach to classification.
Since the predicted outcome is not a probability, but a linear interpolation between
points, there is no meaningful threshold at which you can distinguish one class
from the other.
Linear models do not extend to classification problems with multiple classes. You
would have to start labeling the next class with 2, then 3, and so on. The classes might
not have any meaningful order, but the linear model would force a weird structure on the
relationship between the features and your class predictions. The higher the value of a
feature with a positive weight, the more it contributes to the prediction of a class with a
higher number, even if classes that happen to get a similar number are not closer than
other classes.
The step from linear regression to logistic regression is kind of straightforward. In the
linear regression model, we have modelled the relationship between outcome and
features with a linear equation:
For classification, we prefer probabilities between 0 and 1, so we wrap the right side
of the equation into the logistic function. This forces the output to assume only
values between 0 and 1.
The loss function computes the error for a single training example; the cost
function is the average of the loss functions of the entire training set.
Logit or log(odds)
ln(p(X) / 1 – p(X)) = b0 + b1 * X
This is useful because we can see that the calculation of the output on the right is linear
again (just like linear regression), and the input on the left is a log of the probability of
the default class.
This ratio on the left is called the odds of the default class (it’s historical that we use
odds, for example, odds are used in horse racing rather than probabilities). Odds are
calculated as a ratio of the probability of the event divided by the probability of not
the event, e.g. 0.8/(1-0.8) which has the odds of 4. So we could instead write:
ln(odds) = b0 + b1 * X
Utility: Logit / log(odds) can be used to find out the coefficients/parameters of the
model.
Gradient Descent
We want to find the w and b that minimize the cost function. Our cost function is
convex. First we initialize w and b to 0 (or to random values) and then iteratively
improve them until we reach the minimum. In logistic regression, people usually
initialize to 0 rather than to random values.
The gradient descent algorithm repeats: w = w - alpha * dw, where alpha is the learning
rate and dw is the derivative of the cost with respect to w (the change to apply to w);
this derivative is also the slope in the w direction.
w = w - alpha * dJ(w,b)/dw (how much the function slopes in the w direction)
b = b - alpha * dJ(w,b)/db (how much the function slopes in the b direction)
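A compact sketch of these updates for logistic regression with the cross-entropy cost (vectorized numpy, made-up data; w and b start at 0 as described above):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))                       # toy features
y = (X[:, 0] + 2 * X[:, 1] > 0).astype(float)       # toy labels

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.zeros(X.shape[1])
b = 0.0
alpha = 0.1
m = len(y)
for _ in range(1000):
    a = sigmoid(X @ w + b)               # predicted probabilities
    dw = (X.T @ (a - y)) / m             # dJ/dw for the cross-entropy cost
    db = np.mean(a - y)                  # dJ/db
    w -= alpha * dw
    b -= alpha * db
print(w, b)                              # w roughly proportional to [1, 2], b near 0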
Convex Function:
Important Questions and Answers:
Q) Which of the following methods do we use to best fit the data in Logistic
Regression?
Ans) Logistic regression uses maximum likelihood estimation (MLE) to fit the model
coefficients.
Q) Why can’t we use Mean Square Error (MSE) as a cost function for logistic
regression?
Ans) In logistic regression, we use the sigmoid function and perform a non-linear
transformation to obtain the probabilities. Squaring this non-linear transformation leads
to a non-convex cost function with local minima. Finding the global minimum in such
cases using gradient descent is not guaranteed. Due to this reason, MSE is not suitable for
logistic regression. Cross-entropy or log loss is used as a cost function for logistic
regression.
Q) How will you deal with the multiclass classification problem using logistic
regression?
Ans) The most famous method of dealing with multiclass classification using logistic
regression is using the one-vs-all approach. Under this approach, a number of models
are trained, which is equal to the number of classes. The models work in a specific way.
For example, the first model classifies the datapoint depending on whether it belongs to
class 1 or some other class; the second model classifies the datapoint into class 2 or
some other class. This way, each data point can be checked over all the classes.
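A sketch of the one-vs-all idea with scikit-learn's OneVsRestClassifier wrapping a binary logistic regression (iris is just a convenient 3-class toy dataset):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(len(ovr.estimators_))   # 3 binary models, one per class
print(ovr.predict(X[:5]))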
Q) Explain the use of ROC curves and the AUC of an ROC Curve.
Ans) An ROC (Receiver Operating Characteristic) curve illustrates the performance of a
binary classification model. It is basically a TPR versus FPR (true positive rate versus
false-positive rate) curve for all the threshold values ranging from 0 to 1. In a ROC
curve, each point in the ROC space will be associated with a different confusion matrix.
A diagonal line from the bottom-left to the top-right on the ROC graph represents
random guessing. The Area Under the Curve (AUC) signifies how good the classifier
model is. If the value for AUC is high (near 1), then the model is working satisfactorily,
whereas if the value is low (around 0.5), then the model is not working properly and just
guessing randomly.
Stepwise Regression
Stepwise Regression - Statistics How To
The simplest reliable method of model selection involves fitting candidate models on a
training set, tuning them on the validation dataset, and selecting a model that performs
the best on the test dataset according to a chosen metric, such as accuracy or error. A
problem with this approach is that it requires a lot of data.
A third approach to model selection attempts to combine the complexity of the model
with the performance of the model into a score, then select the model that minimizes or
maximizes the score.
Models are scored both on their performance on the training dataset and based on the
complexity of the model.
a. Model Performance. How well a candidate model has performed on the training
dataset.
b. Model Complexity. How complicated the trained candidate model is after
training.
A benefit of probabilistic model selection methods is that a test dataset is not required,
meaning that all of the data can be used to fit the model, and the final model that will be
used for prediction in the domain can be scored directly.
The limitations:
- the same general statistic cannot be calculated across a range of different types of
models. i.e. the metric must be carefully derived for each model.
- do not take the uncertainty of the model into account, and in practice they tend to
favour overly simple models.
Q) What are the advantages and disadvantages of solely using correlation values of
features with target value for feature selection ?
7. Statistics:
Types of Stats
1. Descriptive
2. Predictive
3. Prescriptive
Causality vs Correlation
Sampling Bias-
Response Bias
Convenience bias
Voluntary response bias
Random Sampling-
Simple Random Sample
Stratification
Clustering
Types of Studies-
1. Sampling Study: Estimate a parameter of the population using a sample
2. Observational study: Correlations
3. Experiment: Establish Causality -> Control and treatment group
Matched pair experiment design
Sampling Distributions:
If np >= 10 AND n(1-p) >= 10 then the sampling distribution is approximately normal
8. Central Limit Theorem:
In the study of probability theory, the central limit theorem (CLT) states that the
distribution of sample means approximates a normal distribution (also known as a "bell
curve") as the sample size becomes larger, assuming that all samples are identical in
size, and regardless of the population distribution shape.
Sample sizes equal to or greater than 30 are considered sufficient for the CLT to hold.
A key aspect of the CLT is that the average of the sample means will equal the
population mean, while the standard deviation of the sample means (the standard error)
equals the population standard deviation divided by √n.
Standard Error of mean: The standard deviation (SD) measures the amount of
variability, or dispersion, from the individual data values to the mean, while the standard
error of the mean (SEM) measures how far the sample mean (average) of the data is
likely to be from the true population mean. The SEM is always smaller than the SD.
The Standard Error ("Std Err" or "SE"), is an indication of the reliability of the mean. A
small SE is an indication that the sample mean is a more accurate reflection of the
actual population mean. A larger sample size will normally result in a smaller SE (while
SD is not directly affected by sample size).
SD tells us about the shape of our distribution, how close the individual data values are
from the mean value. SE tells us how close our sample mean is to the true mean of
the overall population. Together, they help to provide a more complete picture than the
mean alone can tell us.
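A quick simulation sketch of these two ideas (the sampling distribution of the mean becoming roughly normal, and SE = SD/√n), using a deliberately skewed exponential population:

import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=2.0, size=1_000_000)   # clearly non-normal population

n = 50
sample_means = np.array([rng.choice(population, size=n).mean() for _ in range(5000)])

print(population.mean(), sample_means.mean())             # mean of sample means ~ population mean
print(population.std() / np.sqrt(n), sample_means.std())  # SD of sample means ~ sigma / sqrt(n)
# A histogram of sample_means looks roughly bell-shaped even though the population is skewed.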
9. Confidence Interval:
https://www.khanacademy.org/math/statistics-probability/
In practice, however, we select one random sample and generate one confidence
interval, which may or may not contain the true mean. The observed interval may over-
or underestimate μ. Consequently, the 95% CI is the likely range of the true, unknown
parameter.
The confidence level refers to the long-term success rate of the method, that is, how
often this type of interval will capture the parameter of interest.
When we want to carry out inferences on one proportion (build a confidence interval or
do a significance test), the accuracy of our methods depend on a few conditions.
Before doing the actual computations of the interval or test, it's important to check
whether or not these conditions have been met, otherwise the calculations and
conclusions that follow aren't actually valid. The conditions we need for inference on
one proportion are:
Random: The data needs to come from a random sample or randomized experiment.
To use the formula for the standard deviation of p̂, we need individual observations to be
independent. When we are sampling without replacement, individual observations aren't
technically independent, since removing each item changes the population.
But the 10% condition says that if we sample 10% or less of the
population, we can treat individual observations as independent, since removing each
observation doesn't significantly change the population as we sample. For instance, if
our sample size is n = 150, there should be at least N = 1500 members in the population.
This allows us to use the formula for the standard deviation of p̂.
So the confidence interval is: (sample statistic) ± z* × (standard deviation of the sampling
distribution of the statistic, i.e., the standard error). For a proportion this is p̂ ± z* × √(p̂(1 − p̂)/n).
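A small sketch of a 95% confidence interval for a proportion (made-up counts; the random, normal and 10% conditions are assumed to hold):

import numpy as np
from scipy.stats import norm

successes, n = 84, 150                 # hypothetical sample
p_hat = successes / n

# normal condition: n*p_hat and n*(1 - p_hat) should both be >= 10
assert n * p_hat >= 10 and n * (1 - p_hat) >= 10

z_star = norm.ppf(0.975)               # ~1.96 for a 95% interval
se = np.sqrt(p_hat * (1 - p_hat) / n)  # standard deviation of the sampling distribution of p_hat
print(p_hat - z_star * se, p_hat + z_star * se)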
10. T-statistics:
In statistics, the t-statistic is the ratio of the departure of the estimated value of a
parameter from its hypothesized value (usually the value hypothesized in null
hypothesis) to its standard error. It is used in hypothesis testing via Student's t-test.
The T-statistic is used in a T test to determine if you should support or reject the null
hypothesis.
It is very similar to the Z-score but with the difference that T-statistic is used when
the sample size is small (<30) and the population standard deviation is unknown.
For a small sample size (<30), the standard deviation of the population cannot be
reliably estimated using the standard deviation of the sample. For example, the T-statistic is used in
estimating the population mean from a sampling distribution of sample means if the
population standard deviation is unknown. It is also used along with the p-value when
running hypothesis tests, where the p-value tells us how likely it is to observe results at
least this extreme if the null hypothesis were true.
When sample size or n is very small (<30), we assume that the sampling distribution is
a t-distribution (similar to normal dist. with fatter tails), not a normal distribution.
For sample size >30, we assume a normal distribution (according to the central limit
theorem) and hence we can use the z-statistic.
For finding the t value from t-table we need the degrees of freedom which is equal to
(n-1)
Q) When and why can’t z-stat be used?
Ans) If the population std deviation (sigma) is known, we can find the std deviation of
the sampling distribution (which is = Std. Error = Population Std. Deviation / (root(n) ).
But in most cases, population std deviation is unknown, and hence we cannot estimate
the std. deviation of sampling distribution (also called std. error) unless n>30 (then acc
to CLT, we can use z-stat as the dist. will be normal).
In this case, we use the std. deviation of just the sample (denoted by “s”) (just the
sample, not sampling dist.) along with the t-stat (not z-stat) because t-stat gives a better
probability than z-stat in this case.
Conclusion: If n>30, use z-stat (dist. becomes normal acc to CLT and hence
population std deviation can be estimated using sample std deviation)
Else if n<30, the dist is not normal but a t-dist (fat tails), and hence use the t-stat
not z-stat.
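A small sketch contrasting the manual t-statistic with scipy's one-sample t-test (a hypothetical small sample, testing H0: mu = 50):

import numpy as np
from scipy import stats

sample = np.array([52.1, 48.3, 55.0, 49.8, 53.2, 51.1, 47.9, 54.4])   # n = 8 < 30
mu0 = 50.0

t_manual = (sample.mean() - mu0) / (sample.std(ddof=1) / np.sqrt(len(sample)))
t_scipy, p_value = stats.ttest_1samp(sample, popmean=mu0)
print(t_manual, t_scipy, p_value)      # the two t values match; df = n - 1 = 7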
11. Hypothesis Testing & P-Value:
Hypothesis testing evaluates two mutually exclusive statements about population data using sample data.
Significance Testing:
Power:
Increasing the alpha -> Increases the Power -> Decreases Type 2 Error
Increasing the alpha -> Increases Type 1 Error
Increasing sample size (n) -> Makes curve more normal / narrower -> Less Overlap ->
Increases the power
Transformations such as log, square and square root should be tried to obtain a linear
model. Because so many tools have been developed around linear models, it is almost
always easier to work with a linear model than with an exponential or other nonlinear model.
12. Chi-Squared Statistic, Test and Distribution
The chi-squared statistic is χ²_c = Σ (O − E)² / E, summed over all cells, where the
subscript "c" is the degrees of freedom, "O" is your observed value and "E" is your expected value.
A very small chi square test statistic means that your observed data fits your expected
data extremely well. In other words, there is a relationship.
A very large chi square test statistic means that the data does not fit very well. In other
words, there isn’t a relationship.
if your observed and expected values were equal (“no difference -> high correlation”)
then chi-square would be zero — an event that is unlikely to happen in real life.
You could take your calculated chi-square value and compare it to a critical value from
a chi-square table. If the chi-square value is more than the critical value, then there is a
significant difference. You could also use a p-value. First state the null hypothesis and
the alternate hypothesis. Then generate a chi-square curve for your results along with a
p-value (See: Calculate a chi-square p-value Excel). Small p-values (under 5%) usually
indicate that a difference is significant (or “small enough”).
Chi-Squared Distribution: Let's say you have random samples taken from a standard
normal distribution. The chi-squared distribution is the distribution of the sum of these
squared random samples. The degrees of freedom (k) are equal to the number of samples
being summed.
The degrees of freedom in a chi square distribution is also its mean. Chi square
distributions are always right skewed. However, the greater the degrees of freedom,
the more the chi square distribution looks like a normal distribution.
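A sketch of a chi-squared test of independence on a made-up 2x2 contingency table (scipy computes the expected counts under independence for us):

import numpy as np
from scipy.stats import chi2_contingency

# rows: group A / group B, columns: outcome yes / no (hypothetical counts)
observed = np.array([[30, 70],
                     [45, 55]])

chi2, p, dof, expected = chi2_contingency(observed)
print(chi2, p, dof)        # a small p-value -> observed counts differ from independence
print(expected)            # expected counts used in sum((O - E)^2 / E)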
13. ANOVA (Analysis of Variance):
ANOVA tests the non-specific null hypothesis that all population means are equal.
An ANOVA conducted on a design in which there is only one factor (one independent
variable) is called a one-way ANOVA
If an experiment has two factors, then the ANOVA is called a two-way ANOVA.
When different subjects are used for the levels of a factor, the factor is called a
between-subjects factor or a between-subjects variable.
When the same subjects are used for the levels of a factor, the factor is called a
within-subjects factor or a within-subjects variable.
Q) What Does "One-Way" or "Two-Way" Mean?
One-way or two-way refers to the number of independent variables (IVs) in your
Analysis of Variance test.
● One-way has one independent variable (with 2 levels). For example: brand of
cereal,
● Two-way has two independent variables (it can have multiple levels). For
example: brand of cereal, calories.
Types of Tests: There are two main types: one-way and two-way. Two-way tests can
be with or without replication.
● One-way ANOVA between groups: used when you want to test two groups to see
if there’s a difference between them.
● Two way ANOVA without replication: used when you have one group and you’re
double-testing that same group. For example, you’re testing one set of
individuals before and after they take a medication to see if it works or not.
● Two way ANOVA with replication: Two groups, and the members of those groups
are doing more than one thing. For example, two groups of patients from different
hospitals trying two different therapies.
For example, you might want to find out if there is an interaction between income and
gender for anxiety level at job interviews. The anxiety level is the outcome, or the
variable that can be measured. Gender and Income are the two categorical variables.
These categorical variables are also the independent variables, which are called factors
in a Two Way ANOVA.
The factors can be split into levels. In the above example, income level could be split
into three levels: low, middle and high income. Gender could be split into three levels:
male, female, and transgender. Treatment groups are all possible combinations of the
factors. In this example there would be 3 x 3 = 9 treatment groups.
Degrees Of Freedom: The term number of degrees of freedom means the total number
of observations in the sample (= n) less the number of independent (linear) constraints
or restrictions put on them.
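A sketch of a one-way ANOVA with scipy (three hypothetical treatment groups; H0: all group means are equal):

import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)
group1 = rng.normal(10.0, 2.0, size=30)
group2 = rng.normal(10.5, 2.0, size=30)
group3 = rng.normal(13.0, 2.0, size=30)

f_stat, p_value = f_oneway(group1, group2, group3)
print(f_stat, p_value)     # a small p-value -> at least one group mean differs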
For medical use cases, type 2 errors are more dangerous than type 1 errors.
Example: the COVID-19 test report is negative but the patient is actually positive (a false negative).
Sensitivity matters more when classifying the 1’s correctly is more important than
classifying the 0’s.
Example: Just like in the breast cancer case, where you don't want any malignant cases
to be misclassified as 'benign'.
Example: For predicting loan default, we don't want any bad loans (1s) to be
misclassified as good loans (0s).
Specificity matters more when classifying the 0’s correctly is more important than
classifying the 1’s.
Example: Maximizing specificity is more relevant in cases like spam detection, where
you strictly don’t want genuine messages (0’s) to end up in spam (1’s).
4. Precision
5. F1-Score
A high precision score gives more confidence to the model’s capability to classify 1’s.
Combining this with Recall gives an idea of how many of the total 1’s it was able to
cover.
A good model should have a good precision as well as a high recall. So ideally, I want to
have a measure that combines both these aspects in one single metric – the F1 Score.
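A sketch of these metrics with scikit-learn on made-up labels and predictions:

from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print(confusion_matrix(y_true, y_pred))        # rows: actual 0/1, columns: predicted 0/1
print(precision_score(y_true, y_pred))         # of the predicted 1's, how many were right
print(recall_score(y_true, y_pred))            # sensitivity: of the actual 1's, how many were found
print(f1_score(y_true, y_pred))                # harmonic mean of precision and recall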
6. Cohen's Kappa
classification - Cohen's kappa in plain English - Cross Validated
(stackexchange.com)
Cohen’s Kappa statistic is a very useful, but under-utilised, metric. Sometimes in
machine learning we are faced with a multi-class classification problem. In those
cases, measures such as the accuracy, or precision/recall do not provide the complete
picture of the performance of our classifier.
In some other cases we might face a problem with imbalanced classes. E.g. we have
two classes, say A and B, and A shows up on 5% of the time. Accuracy can be
misleading, so we go for measures such as precision and recall. There are ways to
combine the two, such as the F-measure, but the F-measure does not have a very
good intuitive explanation, other than it being the harmonic mean of precision and
recall.
The Kappa statistic (or value) is a metric that compares an Observed Accuracy with an
Expected Accuracy (random chance). The kappa statistic is used not only to evaluate a
single classifier, but also to evaluate classifiers amongst themselves. In addition, it
takes into account random chance (agreement with a random classifier), which
generally means it is less misleading than simply using accuracy as a metric (an
Observed Accuracy of 80% is a lot less impressive with an Expected Accuracy of
75% versus an Expected Accuracy of 50%).
Example:
Assume that a model was built using supervised machine learning on labeled data. This
doesn't always have to be the case; the kappa statistic is often used as a measure
of reliability between two human raters.
Expected accuracy = This value is defined as the accuracy that any random classifier
would be expected to achieve based on the confusion matrix
= ((15 * 17 / 30) + (15 * 13 / 30)) / 30 = 0.50
(For any balanced two-class dataset, with an equal number of ground truths in each
class, the expected accuracy is 0.50.)
Not only can this kappa statistic shed light into how the classifier itself performed, the
kappa statistic for one model is directly comparable to the kappa statistic for any other
model used for the same classification task.
So, in answer to your question about a 0.40 kappa, it depends. If nothing else, it means
that the classifier achieved a rate of classification 2/5 of the way between whatever the
expected accuracy was and 100% accuracy. If expected accuracy was 80%, that means
that the classifier performed 40% (because kappa is 0.4) of 20% (because this is the
distance between 80% and 100%) above 80% (because this is a kappa of 0, or random
chance), or 88%. So, in that case, each increase in kappa of 0.10 indicates a 2%
increase in classification accuracy. If accuracy was instead 50%, a kappa of 0.4 would
mean that the classifier performed with an accuracy that is 40% (kappa of 0.4) of 50%
(distance between 50% and 100%) greater than 50% (because this is a kappa of 0, or
random chance), or 70%. Again, in this case that means that an increase in kappa of
0.1 indicates a 5% increase in classification accuracy.
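scikit-learn exposes this metric directly; a tiny sketch with made-up labels (works equally for classifier-vs-truth or rater-vs-rater agreement):

from sklearn.metrics import accuracy_score, cohen_kappa_score

rater_a = ["cat", "cat", "dog", "cat", "dog", "dog", "cat", "dog"]
rater_b = ["cat", "dog", "dog", "cat", "dog", "cat", "cat", "dog"]

print(accuracy_score(rater_a, rater_b))     # raw observed agreement
print(cohen_kappa_score(rater_a, rater_b))  # agreement corrected for chance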
7. AUC ROC:
Tells us how well can our model distinguish between the two classes for different
threshold values. They can help us to find the best threshold.
Often, choosing the best model is a balance between predicting the ones accurately and
predicting the zeroes accurately; in other words, between sensitivity and specificity.
But it would be great to have something that captures both these aspects in one single
metric. This is nicely captured by the 'Receiver Operating Characteristics' curve, also
called the ROC curve. In fact, the area under the ROC curve can be used as an
evaluation metric to compare the efficacy of the models.
The 45-degree line shows TPR = FPR. Any point on this line means the proportion of
correctly classified positive samples equals the proportion of incorrectly classified negative samples.
So, if we trace the curve from bottom left, the value of probability cutoff (threshold)
decreases from 1 towards 0.
8. Gini Coefficient
Gini Coefficient is an indicator of how well the model outperforms random predictions. It
can be computed from the area under the ROC curve using the following formula:
Gini Coefficient = (2 * AUROC) - 1
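A sketch tying the ROC curve, AUC and Gini together with scikit-learn (toy imbalanced data and a simple logistic model, purely for illustration):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, probs)   # one (FPR, TPR) point per threshold
auc = roc_auc_score(y_te, probs)
gini = 2 * auc - 1
print(auc, gini)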
15. What's the Bias-Variance trade-off?
The prediction error for any machine learning algorithm can be broken down into three
parts:
a. Bias Error
b. Variance Error
c. Irreducible Error: It is a measure of the amount of noise in our data. Here it is
important to understand that no matter how good we make our model, our data
will have a certain amount of noise or irreducible error that can not be removed.
a. Bias: the simplifying assumptions made by a model to make the target
function easier to learn.
Generally, linear algorithms have a high bias, making them fast to learn and
easier to understand but generally less flexible. In turn, they have lower
predictive performance on complex problems that fail to meet the simplifying
assumptions of the algorithm's bias.
Examples of low-bias machine learning algorithms include: Decision Trees,
k-Nearest Neighbors and Support Vector Machines.
Examples of high-bias machine learning algorithms include: Linear Regression,
Linear Discriminant Analysis and Logistic Regression.
Model with high bias pays very little attention to the training data and
oversimplifies the model. It always leads to high error on training and test data.
f’(X) = Model
Y = f(X) + e, Y = target variable
Bias[f'(X)] = E[f'(X) – f(X)]
In the simplest terms, Bias is the difference between the Predicted Value and the
Expected Value.
b. Variance: Machine learning algorithms that have a high variance are strongly influenced by
the specifics of the training data. Model with high variance pays a lot of attention
to training data and does not generalize on the data which it hasn’t seen before.
As a result, such models perform very well on training data but have high error
rates on test data.
Examples of low-variance machine learning algorithms include: Linear
Regression, Linear Discriminant Analysis and Logistic Regression.
Variance[f'(X)] = E[f'(X)²] − (E[f'(X)])²
The goal of any supervised machine learning algorithm is to achieve low bias and
low variance. In turn the algorithm should achieve good prediction performance.
Underfitting: model unable to capture the underlying pattern of the data. These
models usually have high bias and low variance. It happens when we have very little
data to build an accurate model or when we try to build a linear model with nonlinear
data. Also, these kinds of models are too simple to capture the complex patterns in the
data, e.g., linear and logistic regression.
In supervised learning, overfitting happens when our model captures the noise along
with the underlying pattern in the data. It happens when we train our model too much on a
noisy dataset. These models have low bias and high variance. Such models are often very
complex, like decision trees, which are prone to overfitting.
Fortunately, if you have a low R-squared value but the independent variables are
statistically significant, you can still draw important conclusions about the relationships
between the variables. Statistically significant coefficients continue to represent the
mean change in the dependent variable given a one-unit shift in the independent
variable. Clearly, being able to draw conclusions like this is vital.
Are High R-squared Values Always Great? The data in the fitted line plot follow a
very low noise relationship, and the R-squared is 98.5%, which seems fantastic.
However, the regression line consistently under and over-predicts the data along the
curve, which is bias. The Residuals versus Fits plot emphasizes this unwanted pattern.
An unbiased model has residuals that are randomly scattered around zero.
Non-random residual patterns indicate a bad fit despite a high R2. Always check
your residual plots!
At first glance, R-squared seems like an easy to understand statistic that indicates how
well a regression model fits a data set. However, it doesn’t tell us the entire story. To get
the full picture, you must consider R2 values in combination with residual plots,
other statistics, and in-depth knowledge of the subject area.
Q) Can R2 be negative?
Ans) Yes. When the best-fit line is worse than the average/mean line, then SS_Total <
SS_Residual, and since R² = 1 − (SS_Residual / SS_Total), R² < 0.
Degrees of Freedom
DoF = n − k − 1, where n = number of observations and k = number of explanatory variables (X)
● The adjusted R-squared can be negative, but it’s usually not. It is always
lower than the R-squared. Use the adjusted R-square to compare models with
different numbers of predictors
● Use the predicted R-square to determine how well the model predicts new
observations and whether the model is too complicated
If the chosen model fits worse than a straight line, R2 can be negative
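A small sketch computing R² and adjusted R² from predictions (made-up data; adjusted R² uses the n and k defined above, via 1 − (1 − R²)(n − 1)/(n − k − 1)):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 1 + X @ np.array([2.0, 0.0, -1.0]) + rng.normal(size=100)

y_pred = LinearRegression().fit(X, y).predict(X)
r2 = r2_score(y, y_pred)
n, k = X.shape
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(r2, adj_r2)          # adjusted R² is always <= R²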
As you might expect, the mean of the sampling distribution of the difference between
means is the difference between the population means, μ1 − μ2.
How to choose between RMSE, MAE and R2 for regression model evaluation?
● MAE is easy to understand and interpret because it directly averages the offsets,
whereas RMSE penalizes larger differences more heavily than MAE.
● RMSE will be higher than or equal to MAE. The only case where it equals MAE is
when all the differences are equal or zero
● However, even after being more complex and biased towards higher deviation,
RMSE is still the default metric of many models because loss function defined in
terms of RMSE is smoothly differentiable and makes it easier to perform
mathematical operations.
● Minimizing the squared error over a set of numbers results in finding its mean,
and minimizing the absolute error results in finding its median. This is the reason
why MAE is robust to outliers whereas RMSE is not.
● The absolute value of RMSE does not actually tell how bad a model is. It can
only be used to compare across two models whereas Adjusted R² easily does
that. For example, if a model has adjusted R² equal to 0.05 then it is definitely
poor.
● However, if you care only about prediction accuracy then RMSE is best. It is
computationally simple, easily differentiable and present as default metric for
most of the models.
- Mean Square loss/Quadratic loss/L2 loss - To calculate the MSE, you take the
difference between your model’s predictions and the ground truth, square it, and
average it out across the whole dataset.
- Advantage: The MSE is great for ensuring that our trained model has no outlier
predictions with huge errors, since the MSE puts larger weight on these errors due to
the squaring part of the function.
- Disadvantage: If our model makes a single very bad prediction, the squaring part of
the function magnifies the error.
- Mean Absolute loss/L1 loss - To calculate the MAE, you take the difference between
your model’s predictions and the ground truth, apply the absolute value to that
difference, and then average it out across the whole dataset
- Advantage : The beauty of the MAE is that its advantage directly covers the MSE
disadvantage. Since we are taking the absolute value, all of the errors will be weighted
on the same linear scale. Thus, unlike the MSE, we won’t be putting too much weight on
our outliers and our loss function provides a generic and even measure of how well our
model is performing
- Disadvantage: If we do in fact care about the outlier predictions of our model, then the
MAE won’t be as effective. The large errors coming from the outliers end up being
weighted the exact same as lower errors. This might result in our model being great
most of the time, but making a few very poor predictions every-so-often
- Huber Loss - Now we know that the MSE is great for learning outliers while the MAE
is great for ignoring them. But what about something in the middle?
- The Huber Loss offers the best of both worlds by balancing the MSE and MAE
together
- What this equation essentially says is: for loss values less than delta, use the MSE; for
loss values greater than delta, use the MAE
- You’ll want to use the Huber loss any time you feel that you need a balance
between giving outliers some weight, but not too much. For cases where outliers
are very important to you, use the MSE! For cases where you don’t care at all
about the outliers, use the MAE!
- There may be regression problems in which the target value has a spread of values
and when predicting a large value, you may not want to punish a model as heavily as
mean squared error.
- Instead, you can first calculate the natural logarithm of each of the predicted values,
then calculate the mean squared error. This is called the Mean Squared Logarithmic
Error loss, or MSLE for short.
- It has the effect of relaxing the punishing effect of large differences in large predicted
values. As a loss measure, it may be more appropriate when the model is predicting
unscaled quantities directly.
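A sketch of these regression losses as plain numpy functions (delta for the Huber loss is a tunable threshold; the MSLE version assumes non-negative targets; the toy arrays are made up):

import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def huber(y_true, y_pred, delta=1.0):
    err = y_true - y_pred
    small = np.abs(err) <= delta
    # quadratic (MSE-like) for small errors, linear (MAE-like) for large errors
    return np.mean(np.where(small, 0.5 * err ** 2, delta * (np.abs(err) - 0.5 * delta)))

def msle(y_true, y_pred):
    return np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)

y_true = np.array([3.0, 5.0, 2.5, 100.0])     # the last point acts like an outlier
y_pred = np.array([2.5, 5.0, 4.0, 10.0])
print(mse(y_true, y_pred), mae(y_true, y_pred), huber(y_true, y_pred), msle(y_true, y_pred))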
Q) What are some common ways of dealing with heavily imbalanced datasets ?
A. Downsampling: Remove some points from majority class. Disadv: Loss of data points
which may contain vital info
B. Upsampling: Repeat some points from minority class. Disadv: Overfitting
C. Upsampling using Artificial/synthetic points: Use extrapolation to create artificial
points
D. Class Weight: Assign more weight to minority class
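A sketch of options B and D above with scikit-learn (upsampling the minority class with resample, and class weights; synthetic/SMOTE-style points as in option C would come from the separate imbalanced-learn package, not shown here):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

# D. class weights: penalize mistakes on the minority class more heavily
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# B. upsampling: repeat minority-class rows until the classes are balanced
X_min, y_min = X[y == 1], y[y == 1]
X_min_up, y_min_up = resample(X_min, y_min, replace=True,
                              n_samples=(y == 0).sum(), random_state=0)
X_bal = np.vstack([X[y == 0], X_min_up])
y_bal = np.concatenate([y[y == 0], y_min_up])
print(np.bincount(y_bal))   # classes are now balanced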
18. Principal Component Analysis (PCA):
First it identifies the hyperplane that lies closest to the data, and then it projects the data
onto it.
● Create the PCs using only the X. Later you will see, we draw a scatter plot
using the first two PCs and color the points based on the actual Y. Typically, if the
X’s were informative enough, you should see clear clusters of points belonging to
the same category.
● The key thing to understand is that, each principal component is the dot product
of its weights (in pca.components_) and the mean-centered data (X). What I
mean by 'mean-centered' is that the mean of each column of X is subtracted from
that column, so that the mean of each column becomes zero.
The PCs are usually arranged in the descending order of the variance(information)
explained. PC1 contributed 22%, PC2 contributed 10% and so on. The further you go,
the lesser is the contribution to the total variance.
Understanding Concepts behind PCA
To simplify things, let’s imagine a dataset with only two columns. Using these two
columns, I want to find a new column that better represents the ‘data’ contributed by
these two columns. This new column can be thought of as a line that passes through
these points. Such a line can be represented as a linear combination of the two
columns and explains the maximum variation present in these two columns. In
what direction do you think the line should lie so that it covers the maximum variation
of the data points? Such a line should be in a direction that minimizes the
perpendicular distance of each point from the line. It may look something like this:
But how to determine this line? I am only interested in determining the direction (u1) of
this line. Because, by knowing the direction u1, I can compute the projection of any
point on this line. This line u1 is of length 1 unit and is called a unit vector. This unit
vector eventually becomes the weights of the principal components, also called
loadings. The PCA weights (Ui) are actually unit vectors of length 1, because they are
meant to represent only the direction.
Geometrically speaking, principal components represent the directions of the data that
explain a maximal amount of variance (the average of the squared distances from the
projected points (red dots) to the origin), that is to say, the lines that capture most
information of the data. The relationship between variance and information here, is that,
the larger the variance carried by a line, the larger the dispersion of the data points
along it, and the larger the dispersion along a line, the more the information it has. To
put all this simply, just think of principal components as new axes that provide the best
angle to see and evaluate the data, so that the differences between the observations
are better visible.
Objective function:
The objective is to determine u1 so that the mean perpendicular distance from the line
for all points is minimized. Here is the objective function (using Pythagoras Theorem):
It can be proved that the above equation reaches a minimum when the value of u1
equals the EigenVector of the covariance matrix of X. This EigenVector is same as
the PCA weights that we got earlier inside pca.components_ object. We’ll see what
Eigen Vectors are shortly.
Alright. But there can be a second PC to this data. The next best direction to explain the
remaining variance is perpendicular to the first PC. Actually, there can be as many
Eigen Vectors as there are columns in the dataset. And they are orthogonal to each
other.
Q) Why Standardize?
Ans) PCA is quite sensitive regarding the variances of the initial variables. That is, if
there are large differences between the ranges of initial variables, those variables with
larger ranges will dominate over those with small ranges (For example, a variable
that ranges between 0 and 100 will dominate over a variable that ranges between 0 and
1), which will lead to biased results. So, transforming the data to comparable scales can
prevent this problem.
Step 2 Compute Covariance Matrix: Covariance measures how two variables are
related to each other, that is, if two variables are moving in the same direction with
respect to each other or not. When covariance is positive, it means, if one variable
increases, the other increases as well. The opposite is true when covariance is negative.
The aim of this step is to understand how the variables of the input data set are varying
from the mean with respect to each other, or in other words, to see if there is any
relationship between them. Because sometimes, variables are highly correlated in
such a way that they contain redundant information. So, in order to identify these
correlations, we compute the covariance matrix.
The covariance matrix is a p × p symmetric matrix (where p is the number of
dimensions) that has as entries the covariances associated with all possible pairs of
the initial variables. For example, for a 3-dimensional data set with 3 variables x, y,
and z, the covariance matrix is a 3×3 matrix of this form:
Step 3: Compute Eigen values and Eigen Vectors
Eigenvalues (the amount of variance) and eigenvectors (the directions of the axes with
the most variance, i.e., the most information, which we call the principal components)
represent the amount of variance explained and how the columns are related to each
other. The length of each eigenvector is one.
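A sketch of these steps end to end, plus a check that scikit-learn's PCA gives the same directions (toy 3-column data; eigenvector signs may be flipped between the two, which is fine):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=200)      # make two columns correlated

# Step 1: standardize; Step 2: covariance matrix; Step 3: eigen decomposition
Xs = StandardScaler().fit_transform(X)
cov = np.cov(Xs, rowvar=False)                      # 3 x 3 symmetric matrix
eigvals, eigvecs = np.linalg.eigh(cov)              # eigh: for symmetric matrices
order = np.argsort(eigvals)[::-1]                   # sort by variance explained, descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
print(eigvals / eigvals.sum())                      # proportion of variance per PC

pca = PCA(n_components=3).fit(Xs)
print(pca.explained_variance_ratio_)                # matches the line above
print(pca.components_)                              # rows are the eigenvectors (loadings), up to sign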
Disadvantage of PCA:
An important thing to realize here is that, the principal components are less
interpretable and don’t have any real meaning since they are constructed as linear
combinations of the initial variables.
Incremental PCA
One problem with the preceding implementation of PCA is that it requires the whole
training set to fit in memory in order for the SVD algorithm to run. Fortunately,
Incremental PCA (IPCA) algorithms have been developed: you can split the training
set into mini-batches and feed an IPCA algorithm one mini-batch at a time. This is
useful for large training sets.
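A sketch of feeding mini-batches to IncrementalPCA (np.array_split just simulates data that would not fit in memory all at once):

import numpy as np
from sklearn.decomposition import IncrementalPCA

X = np.random.default_rng(0).normal(size=(10_000, 20))

ipca = IncrementalPCA(n_components=5)
for batch in np.array_split(X, 100):   # 100 mini-batches
    ipca.partial_fit(batch)
X_reduced = ipca.transform(X)
print(X_reduced.shape)                 # (10000, 5)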
Kernel PCA
In Chapter 5 we discussed the kernel trick, a mathematical technique that implicitly
maps instances into a very high-dimensional space (called the feature space), enabling
nonlinear classification and regression with Support Vector Machines. Recall that a
linear decision boundary in the high-dimensional feature space corresponds to a
complex nonlinear decision boundary in the original space. It turns out that the same
trick can be applied to PCA, making it possible to perform complex nonlinear projections
for dimensionality reduction. This is called Kernel PCA (kPCA). It is often good at
preserving clusters of instances after projection, or sometimes even unrolling datasets
that lie close to a twisted manifold.
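A sketch of kPCA with an RBF kernel on a nonlinear toy dataset (gamma is a kernel hyperparameter and the value here is arbitrary):

from sklearn.datasets import make_moons
from sklearn.decomposition import KernelPCA

X, y = make_moons(n_samples=300, noise=0.05, random_state=0)
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=15)
X_kpca = kpca.fit_transform(X)
print(X_kpca.shape)    # a projection in which the two moons are much easier to separate linearly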
Q) What is the basic principle behind PCA ? What is the criteria used for reducing the
dimension in PCA ?
Q) What is SVD ? What are some of its areas of applications ?
19. KNN
● KNN stores the entire training dataset which it uses as its representation.
● KNN does not learn any model.
● KNN makes predictions just-in-time by calculating the similarity between an input
sample and each training instance.
● There are many distance measures to choose from to match the structure of your
input data.
● That it is a good idea to rescale your data, such as using normalization, when
using KNN.
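A sketch of KNN with rescaling in a pipeline, following the advice above (iris is just a convenient dataset and n_neighbors=5 an arbitrary starting point):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
print(cross_val_score(knn, X, y, cv=5).mean())   # no training beyond storing the (scaled) data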
20. SVM
kernel: We have already discussed it. Here, we have various options available with
kernel like, “linear”, “rbf”,”poly” and others (default value is “rbf”). Here “rbf” and “poly”
are useful for non-linear hyper-plane.
gamma: Kernel coefficient for 'rbf', 'poly' and 'sigmoid'. A high value of gamma will try to
fit the training data set exactly and can cause overfitting.
C: Penalty parameter C of the error term. It also controls the trade-off between smooth
decision boundaries and classifying the training points correctly.
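A sketch showing where these hyperparameters go in scikit-learn's SVC (the kernel, gamma and C values here are arbitrary; in practice they would be tuned, e.g., with grid search):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

svm = make_pipeline(StandardScaler(),
                    SVC(kernel="rbf", gamma=0.1, C=1.0))   # kernel, gamma, C as described above
print(svm.fit(X_tr, y_tr).score(X_te, y_te))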
21. NAIVE BAYES:
Bayes theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x)
and P(x|c): P(c|x) = [P(x|c) × P(c)] / P(x), where:
● P(c|x) is the posterior probability of class (c, target) given predictor (x, attributes).
● P(c) is the prior probability of class.
● P(x|c) is the likelihood which is the probability of predictor given class.
● P(x) is the prior probability of predictor
Step 2: Create Likelihood table by finding the probabilities like Overcast probability =
0.29 and probability of playing is 0.64.
Here we have P (Sunny |Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, P( Yes)= 9/14 =
0.64
Now, P (Yes | Sunny) = 0.33 * 0.64 / 0.36 = 0.60, which has higher probability.
Naive Bayes uses a similar method to predict the probability of different classes based
on various attributes. This algorithm is mostly used in text classification and with
problems having multiple classes.
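A sketch of Naive Bayes in scikit-learn (GaussianNB for numeric features here; MultinomialNB is the usual choice for text/word-count features):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

nb = GaussianNB().fit(X_tr, y_tr)
print(nb.score(X_te, y_te))
print(nb.predict_proba(X_te[:3]))   # posterior P(class | features) for a few samples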
Types of normalization:
1. Standard Scaler: Standardize features by removing the mean and scaling to unit
variance. The standard score of a sample x is calculated as: z = (x - u) / s
2. Min-max scaler: In this approach, the data is scaled to a fixed range - usually 0
to 1. The cost of having this bounded range - in contrast to standardization - is
that we will end up with smaller standard deviations, which can suppress the
effect of outliers. A Min-Max scaling is typically done via the following equation:
Xsc = (X − Xmin) / (Xmax − Xmin)
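A sketch of both scalers side by side (the toy column includes an outlier to show min-max squashing everything into [0, 1]):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])   # last value is an outlier

print(StandardScaler().fit_transform(x).ravel())   # z = (x - u) / s
print(MinMaxScaler().fit_transform(x).ravel())     # (x - min) / (max - min), bounded in [0, 1]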
22. CLUSTERING
24. SQL:
https://sqlzoo.net/
http://thedatamonk.com/
https://www.hackerrank.com/domains/sql?badge_type=sql
https://www.w3schools.com/sql/default.asp
1. Difference between UNIQUE (all values in col are diff), DISTINCT (removes
duplicate records) and DIFFERENT
2. When to use WHERE , HAVING (with groupby objects)
3. LPAD, RPAD
4. INSERT INTO (old table) and SELECT INTO (new table)
5. IFNULL, ISNULL
6. Create PROCEDURES
7. SELF JOIN and other JOINS
8. Wildcards
9. Aliases
10. ANY, ALL
11. EXISTS
12. CASE
13. UNION
14. Primary key, unique key and foreign key.
15. Windows Function
REGULARIZATION
FEATURE ENGINEERING
FEATURE SELECTION
TIME SERIES
25. BOOK: Hands on ML with Scikit Learn and Tensorflow
https://www.amazon.in/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1491962291
Batch Learning (Offline Learning) - train on entire data and use the model. Update
model offline, after certain intervals of time - incapable of learning incrementally
vs
Online Learning (Mini-Batches) - on the fly - fast and cheap - requires less computation
resources
Overfitting:
Overgeneralizing - it means that the model performs well on the training data, but it
does not generalize well - detect patterns in the noise itself.
The possible solutions are:
• To simplify the model by selecting one with fewer parameters (e.g., a linear model
rather than a high-degree polynomial model), by reducing the number of attributes in the
training data or by constraining the model
• To gather more training data
• To reduce the noise in the training data (e.g., fix data errors and remove outliers)
Constraining a model to make it simpler and reduce the risk of overfitting is called
regularization.
A hyperparameter is a parameter of a learning algorithm (not of the model). As such, it
is not affected by the learning algorithm itself; it must be set prior to training and
remains constant during training.
You train multiple models with various hyperparameters using the training set, you
select the model and hyperparameters that perform best on the validation set, and
when you’re happy with your model you run a single final test against the test set to
get an estimate of the generalization error.
To avoid “wasting” too much training data in validation sets, a common technique is to
use cross-validation: the training set is split into complementary subsets, and each
model is trained against a different combination of these subsets and validated against
the remaining parts
If the dataset is large enough, train_test_split works fine, else we need to use
StratifiedShuffleSplit to ensure that the train/test set is representative of the entire
dataset.
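A sketch of both options (a plain stratified train_test_split, and StratifiedShuffleSplit) on an imbalanced toy label vector:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# option 1: stratify directly in train_test_split
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

# option 2: StratifiedShuffleSplit (useful for repeated or custom splits)
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(sss.split(X, y))
print(np.bincount(y_te) / len(y_te), np.bincount(y[test_idx]) / len(test_idx))
# both test sets preserve the ~90/10 class ratio of the full dataset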
The F1 score is the harmonic mean of precision and recall (Equation 3-3). Whereas the
regular mean treats all values equally, the harmonic mean gives much more weight to
low values. As a result, the classifier will only get a high F1 score if both recall and
precision are high.
The receiver operating characteristic (ROC) curve is another common tool used with
binary classifiers. It is very similar to the precision/recall curve, but instead of plotting
precision versus recall, the ROC curve plots the true positive rate (another name for
recall) against the false positive rate.
Since the ROC curve is so similar to the precision/recall (or PR) curve, you may wonder
how to decide which one to use. As a rule of thumb, you should prefer the PR curve
whenever the positive class is rare or when you care more about the false
positives than the false negatives, and the ROC curve otherwise. For example,
looking at the previous ROC curve (and the ROC AUC score), you may think that the
classifier is really good. But this is mostly because there are few positives (5s)
compared to the negatives (non-5s). In contrast, the PR curve makes it clear that the
classifier has room for improvement (the curve could be closer to the top-right corner).
PR Curve: When imbalanced dataset
ROC: Balanced Dataset
Another difference: Use of true negatives in the False Positive Rate in the ROC Curve
and the careful avoidance of this rate in the Precision-Recall curve.
So should you use Gini impurity or entropy? The truth is, most of the time it does
not make a big difference: they lead to similar trees. Gini impurity is slightly faster to
compute, so it is a good default. However, when they differ, Gini impurity tends to isolate
the most frequent class in its own branch of the tree, while entropy tends to produce
slightly more balanced trees.
Reducing dimensionality does lose some information (just like compressing an image to
JPEG can degrade its quality), so even though it will speed up training, it may also
make your system perform slightly worse. It also makes your pipelines a bit more
complex and thus harder to maintain. So you should first try to train your system with
the original data before considering using dimensionality reduction if training is too slow.
In some cases, however, reducing the dimensionality of the training data may filter out
some noise and unnecessary details and thus result in higher performance (but in
general it won’t; it will just speed up training).
Apart from speeding up training, dimensionality reduction is also extremely useful for
data visualization.
This fact implies that high dimensional datasets are at risk of being very sparse: most
training instances are likely to be far away from each other. Of course, this also means
that a new instance will likely be far away from any training instance, making predictions
much less reliable than in lower dimensions, since they will be based on much larger
extrapolations. In short, the more dimensions the training set has, the greater the risk
of overfitting it.
In theory, one solution to the curse of dimensionality could be to increase the size of
the training set to reach a sufficient density of training instances. Unfortunately, in
practice, the number of training instances required to reach a given density grows
exponentially with the number of dimensions.
Projection
In most real-world problems, training instances are not spread out uniformly across all
dimensions. Many features are almost constant, while others are highly correlated (as
discussed earlier for MNIST). As a result, all training instances actually lie within (or
close to) a much lower-dimensional subspace of the high-dimensional space.
Manifold Learning
Many dimensionality reduction algorithms work by modeling the manifold on which the
training instances lie; this is called Manifold Learning. It relies on the manifold
assumption, also called the manifold hypothesis, which holds that most real-world
high-dimensional datasets lie close to a much lower-dimensional manifold. This assumption is
very often empirically observed.
Once again, think about the MNIST dataset: all handwritten digit images have some
similarities. They are made of connected lines, the borders are white, they are more or
less centered, and so on. If you randomly generated images, only a ridiculously tiny
fraction of them would look like handwritten digits. In other words, the degrees of
freedom available to you if you try to create a digit image are dramatically lower
than the degrees of freedom you would have if you were allowed to generate any
image you wanted. These constraints tend to squeeze the dataset into a lower
dimensional manifold.
Exercises
1. What are the main motivations for reducing a dataset’s dimensionality? What are the
main drawbacks?
2. What is the curse of dimensionality?
3. Once a dataset’s dimensionality has been reduced, is it possible to reverse the
operation? If so, how? If not, why?
4. Can PCA be used to reduce the dimensionality of a highly nonlinear dataset?
5. Suppose you perform PCA on a 1,000-dimensional dataset, setting the explained
variance ratio to 95%. How many dimensions will the resulting dataset have?
6. In what cases would you use vanilla PCA, Incremental PCA, Randomized PCA,
or Kernel PCA?
7. How can you evaluate the performance of a dimensionality reduction algorithm
on your dataset?
8. Does it make any sense to chain two different dimensionality reduction algorithms?
1. Motivations and drawbacks:
• The main motivations for dimensionality reduction are:
- To speed up a subsequent training algorithm (in some cases it may even remove noise and redundant features, making the training algorithm perform better).
- To visualize the data and gain insights on the most important features.
- Simply to save space (compression).
• The main drawbacks are:
- Some information is lost, possibly degrading the performance of subsequent training algorithms.
- It can be computationally intensive.
- It adds some complexity to your Machine Learning pipelines.
- Transformed features are often hard to interpret.
2. The curse of dimensionality refers to the fact that many problems that do not exist in
low-dimensional space arise in high-dimensional space. In Machine Learning, one
common manifestation is the fact that randomly sampled high-dimensional vectors are
generally very sparse, increasing the risk of overfitting and making it very difficult to
identify patterns in the data without having plenty of training data.
3. Once a dataset’s dimensionality has been reduced using one of the algorithms we
discussed, it is almost always impossible to perfectly reverse the operation, because
some information gets lost during dimensionality reduction. Moreover, while some
algorithms (such as PCA) have a simple reverse transformation procedure that can
reconstruct a dataset relatively similar to the original, other algorithms (such as t-SNE)
do not.
4. PCA can be used to significantly reduce the dimensionality of most datasets, even if
they are highly nonlinear, because it can at least get rid of useless dimensions.
However, if there are no useless dimensions—for example, the Swiss roll—then
reducing dimensionality with PCA will lose too much information. You want to unroll the
Swiss roll, not squash it.
5. That’s a trick question: it depends on the dataset. Let’s look at two extreme examples. First, suppose the dataset is composed of points that are almost perfectly aligned.
In this case, PCA can reduce the dataset down to just one dimension while still
preserving 95% of the variance. Now imagine that the dataset is composed of perfectly
random points, scattered all around the 1,000 dimensions. In this case all 1,000
dimensions are required to preserve 95% of the variance. So the answer is, it depends
on the dataset, and it could be any number between 1 and 1,000. Plotting the explained
variance as a function of the number of dimensions is one way to get a rough idea of
the dataset’s intrinsic dimensionality.
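A sketch of that "plot the explained variance" idea with scikit-learn PCA; X is an assumed feature matrix:
import numpy as np
from sklearn.decomposition import PCA
pca = PCA().fit(X)                               # keep all components first
cumvar = np.cumsum(pca.explained_variance_ratio_)
d = np.argmax(cumvar >= 0.95) + 1                # smallest number of dimensions preserving 95% of the variance
# Equivalently, let scikit-learn pick the dimension directly:
X_reduced = PCA(n_components=0.95).fit_transform(X)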
6. Regular PCA is the default, but it works only if the dataset fits in memory. Incremental
PCA is useful for large datasets that don’t fit in memory, but it is slower than regular
PCA, so if the dataset fits in memory you should prefer regular PCA. Incremental PCA
is also useful for online tasks, when you need to apply PCA on the fly, every time a new
instance arrives. Randomized PCA is useful when you want to considerably reduce
dimensionality and the dataset fits in memory; in this case, it is much faster than regular
PCA. Finally, Kernel PCA is useful for nonlinear datasets.
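Hedged sketches of the variants mentioned above; X is an assumed (large, high-dimensional) feature matrix and the component counts are arbitrary:
import numpy as np
from sklearn.decomposition import PCA, IncrementalPCA, KernelPCA
# Randomized PCA: fast approximation when reducing dimensionality a lot
rnd_pca = PCA(n_components=100, svd_solver="randomized").fit(X)
# Incremental PCA: feed the data in mini-batches when it does not fit in memory
inc_pca = IncrementalPCA(n_components=100)
for X_batch in np.array_split(X, 10):
    inc_pca.partial_fit(X_batch)
# Kernel PCA: nonlinear projections via the kernel trick
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=0.04).fit(X)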
7. Intuitively, a dimensionality reduction algorithm performs well if it eliminates a lot of
dimensions from the dataset without losing too much information. One way to measure
this is to apply the reverse transformation and measure the reconstruction error.
However, not all dimensionality reduction algorithms provide a reverse transformation.
Alternatively, if you are using dimensionality reduction as a preprocessing step before
another Machine Learning algorithm (e.g., a Random Forest classifier), then you can
simply measure the performance of that second algorithm; if dimensionality reduction
did not lose too much information, then the algorithm should perform just as well as
when using the original dataset.
8. It can absolutely make sense to chain two different dimensionality reduction
algorithms. A common example is using PCA to quickly get rid of a large number of
useless dimensions, then applying another much slower dimensionality reduction
algorithm, such as LLE. This two-step approach will likely yield the same performance
as using LLE only, but in a fraction of the time.
26. TABLEAU
https://www.youtube.com/playlist?list=PLWPirh4EWFpGXTBu8ldLZGJCUeTMBpJFK
Live vs Extract
6 Data Types:
1. Numbers (decimal) and Numbers (whole)
2. Date and Time
3. Date
4. String
5. Boolean
6. Geographic Data Types
Scatter Plot
Line Graph
Reference lines
Bubble Chart
Bar and Stacked chart
Tree Map
Bump Chart: Line chart where rank changes with time
Funnel Chart
Waterfall (Gantt Chart)
Pie Chart
Maps
1.Suppose your dataset has a lot of missing values. How would you address this issue?
2.What kind of imputation can be used in case of missing categorical values and
missing continuous values?
3.Suppose your features are highly skewed (either left or right). Would this affect the
model training ? If yes, how to address the issue ?
4.What are outliers ? What are some common techniques used to detect outliers in the
data ?
5.Suppose Pearson correlation between V1 and V2 is zero. In such a case, is it right to
conclude that V1 and V2 do not have any relation between them?
6.What are some forms of Data Leakages ? How can they be prevented ?
7.What is prior probability and posterior probability ?
8.Can normal stratified K-fold cross validation be used for a time series ? Why ?
9.If not, what cross validation technique can be used as an alternative ?
10.Differentiate between a model’s parameters and hyperparameters.
11. Differentiate between factor analysis and cluster analysis.
12. You are provided with a very big dataset which goes into the order of millions of rows
and thousands of features and you have a lack of computing resources to process the
entire dataset at once. How would you approach this problem?
13.Differentiate between K-Means Clustering and Hierarchical clustering.
14. What is A/B Testing in ML?
15.Given a dataset, how would you apply KNN on it ?
16.Feature Engineering techniques
17. What is the loss function used in Random Forest?
https://github.com/ashishpatel26/500-AI-Machine-learning-Deep-learning-Computer-visi
on-NLP-Projects-with-code
https://github.com/mbadry1/DeepLearning.ai-Summary
30. DECISION TREES
1. Root Node:
2. Splitting:
3. Decision Node:
4. Leaf/ Terminal Node: Nodes that do not split further are called Leaf or Terminal nodes.
5. Pruning: Removing sub-nodes of a decision node is called pruning. It is the opposite of splitting.
6. Branch / Sub-Tree:
7. Parent and Child Node:
Advantages:
1. Easy to Understand: Decision Tree looks like simple if-else statements which are very
easy to understand and can be visualized easily.
2. Useful in Data exploration: one of the fastest ways to identify the most significant variables and the relation between two or more variables.
3. Less data cleaning required: It is not heavily influenced by outliers or missing values, and no feature scaling is required.
4. Data type is not a constraint: It can handle both numerical and categorical
variables.
5. Can be used for both classification and regression problems.
6. Non Parametric Method: Decision tree is considered to be a non-parametric
method. This means that decision trees have no assumptions about the space
distribution and the classifier structure.
Disadvantages
1. Over fitting. This problem gets solved by setting constraints on model parameters
and pruning.
2. High variance: Because of overfitting, a single tree tends to have high variance in its output, which leads to large errors in the final estimates. Driving the bias towards zero (overfitting) comes at the cost of high variance.
3. Not fit for continuous variables: While working with continuous numerical variables, a decision tree loses information when it discretizes them into categories.
4. Unstable: Adding a new data point can lead to re-generation of the overall tree and
all nodes need to be recalculated and recreated.
5. Not suitable for large datasets: If data size is large, then one single tree may grow
complex and lead to overfitting. So in this case, we should use Random Forest
instead of a single Decision Tree.
3. How does a tree decide where to split / which feature to split on?
Decision trees use multiple algorithms to decide to split a node in two or more sub-nodes.
The creation of sub-nodes increases the homogeneity of resultant sub-nodes. In other
words, we can say that purity of the node increases with respect to the target variable.
Decision tree splits the nodes on all available variables and then selects the split
which results in most homogeneous sub-nodes.
The algorithm selection is also based on type of target variables. Let’s look at the four
most commonly used algorithms in decision tree:
1. Gini
Gini says: if we select two items from a population at random, then they must be of the same class, and the probability of this is 1 if the population is pure.
1. Calculate Gini for sub-nodes, using the formula sum of square of probability for
success and failure (p^2+q^2).
2. Calculate Gini for split using weighted Gini score of each node of that split
Also, Gini Impurity = 1-Gini
The Gini coefficient measures the inequality among values of a frequency distribution
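A small sketch of the two steps above, using the Gender split numbers from the example that follows (Female node: 2 play / 8 don't; Male node: 13 play / 7 don't):
def gini_score(success, failure):
    # p^2 + q^2 as defined above; Gini impurity would be 1 minus this
    n = success + failure
    p, q = success / n, failure / n
    return p ** 2 + q ** 2
g_female = gini_score(2, 8)                                # 0.68
g_male = gini_score(13, 7)                                 # 0.545
weighted_gini = (10 / 30) * g_female + (20 / 30) * g_male  # ~0.59
print(weighted_gini)   # the split with the highest weighted Gini (lowest impurity) is chosen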
2. Chi-Square
It is an algorithm to find out the statistical significance of the differences between sub-nodes and the parent node. We measure it by the sum of squares of standardized differences between the observed and expected frequencies of the target variable.
1. Calculate Chi-square for each individual node by calculating the deviation for both Success and Failure.
2. Calculate Chi-square of the split as the sum of all Chi-square values of Success and Failure of each node of the split.
Example, Split on Gender:
1. First, for the Female node, populate the actual values for “Play Cricket” and “Not Play Cricket”; here these are 2 and 8 respectively.
2. Calculate the expected values for “Play Cricket” and “Not Play Cricket”; here it is 5 for both, because the parent node has a probability of 50% and we apply the same probability to the Female count (10).
3. Calculate the deviations using Actual – Expected: for “Play Cricket” it is (2 – 5 = -3) and for “Not Play Cricket” it is (8 – 5 = 3).
4. Calculate the Chi-square of the node for “Play Cricket” and “Not Play Cricket” using the formula ((Actual – Expected)^2 / Expected)^(1/2) (see the short sketch after this list).
5. Follow similar steps for calculating Chi-square value for Male node.
6. Now add all Chi-square values to calculate Chi-square for split Gender.
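A quick numeric check of the worked example above:
def chi_component(actual, expected):
    # "standardized difference": ((Actual - Expected)^2 / Expected)^(1/2)
    return ((actual - expected) ** 2 / expected) ** 0.5
# Female node: 2 play / 8 don't, expected 5 / 5 (parent is 50/50, node size 10)
female = chi_component(2, 5) + chi_component(8, 5)        # ~1.34 + 1.34
# Male node: 13 play / 7 don't, expected 10 / 10 (node size 20)
male = chi_component(13, 10) + chi_component(7, 10)       # ~0.95 + 0.95
print(female + male)                                      # Chi-square for the Gender split ~4.58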
3. Information Gain:
Detailed Explanation: Entropy: How Decision Trees Make Decisions | by Sam T | Towards
Data Science
Information theory defines a measure of this degree of disorganization in a system, known as Entropy. If the sample is completely homogeneous, the entropy is zero, and if the sample is equally divided (50% – 50%), it has an entropy of one.
Entropy can be calculated using the formula: Entropy = -p*log2(p) - q*log2(q), where p and q are the probability of success and failure respectively in that node.
Entropy is also used with categorical target variables. It chooses the split which has lowest
entropy compared to parent node and other splits. The lesser the entropy, the better it is.
1. Entropy for parent node = -(15/30) log2 (15/30) – (15/30) log2 (15/30) = 1. Here 1 shows that it is an impure node.
2. Entropy for Female node = -(2/10) log2 (2/10) – (8/10) log2 (8/10) = 0.72 and for
male node, -(13/20) log2 (13/20) – (7/20) log2 (7/20) = 0.93
3. Entropy for split Gender = Weighted entropy of sub-nodes = (10/30)*0.72 +
(20/30)*0.93 = 0.86
4. Entropy for Class IX node, -(6/14) log2 (6/14) – (8/14) log2 (8/14) = 0.99 and for
Class X node, -(9/16) log2 (9/16) – (7/16) log2 (7/16) = 0.99.
5. Entropy for split Class = (14/30)*0.99 + (16/30)*0.99 = 0.99
Above, you can see that the entropy for the split on Gender is the lowest of all, so the tree will split on Gender. We can derive information gain from entropy as: Information Gain = Entropy of Parent (= 1 here) - Weighted Entropy of the Split Nodes.
We simply subtract the entropy of Y given X from the entropy of just Y to calculate the
reduction of uncertainty about Y given an additional piece of information X about Y. This is
called Information Gain. The greater the reduction in this uncertainty, the more
information is gained about Y from X.
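A short sketch reproducing the entropy numbers worked out above and the resulting information gain:
import math
def entropy(p, q):
    return -p * math.log2(p) - q * math.log2(q)
e_parent = entropy(15/30, 15/30)                                             # 1.0
e_gender = (10/30) * entropy(2/10, 8/10) + (20/30) * entropy(13/20, 7/20)    # ~0.86
information_gain = e_parent - e_gender                                       # ~0.14
print(e_gender, information_gain)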
Example:
CR = Credit Rating
Reduction in Variance
Reduction in variance is an algorithm used for continuous target variables (regression
problems). This algorithm uses the standard formula of variance to choose the best split.
The split with lower variance is selected as the criteria to split the population:
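A tiny sketch of the variance criterion on made-up target values (the numbers are hypothetical, only to illustrate the computation):
import numpy as np
y_parent = np.array([10, 12, 18, 20, 30, 34])     # hypothetical continuous targets at a node
y_left, y_right = y_parent[:3], y_parent[3:]      # one candidate split
n = len(y_parent)
weighted_child_var = (len(y_left) / n) * y_left.var() + (len(y_right) / n) * y_right.var()
reduction = y_parent.var() - weighted_child_var
print(reduction)   # the candidate split with the largest reduction (lowest child variance) wins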
5. What are the key parameters of tree modeling and how can we
avoid over-fitting in decision trees?
Preventing overfitting is pivotal while modeling a decision tree and it can be done in 2 ways:
● max_features: The number of features to consider while searching for the best split. These will be randomly selected.
● As a thumb rule, the square root of the total number of features works great, but we should check up to 30–40% of the total number of features.
● Higher values can lead to over-fitting, but this depends on the case.
This is exactly the difference between a normal decision tree and pruning. A decision tree with constraints won’t see the truck ahead and will adopt a greedy approach by taking a left. On the other hand, if we use pruning, we in effect look a few steps ahead and make a choice.
So we know pruning is better. But how to implement it in decision tree? The idea is simple.
#Import Library
#Import other necessary libraries like pandas, numpy...
from sklearn import tree
#Assumed you have X (predictors) and y (target) for the training set and x_test (predictors) for the test set
# Create tree object
model = tree.DecisionTreeClassifier(criterion='gini')  # for classification; the criterion can be 'gini' (default) or 'entropy' (information gain)
# model = tree.DecisionTreeRegressor() for regression
# Train the model using the training set and check the score
model.fit(X, y)
model.score(X, y)
# Predict output
predicted = model.predict(x_test)
This way, the high variance of a single decision tree is reduced to low variance.
In Random Forest, we grow multiple trees as opposed to a single tree in the CART model.
To classify a new object based on attributes, each tree gives a classification and we say the
tree “votes” for that class. The forest chooses the classification having the most votes (over
all the trees in the forest) and in case of regression, it takes the average of outputs by
different trees
The most important settings are the number of trees in the forest (n_estimators) and the number of features considered for splitting at each node (max_features).
It works in the following manner. Each tree is planted & grown as follows:
1. Assume the number of cases in the training set is N. A sample of these N cases is taken at random, but with replacement. This sample will be the training set for growing the tree.
2. If there are M input variables, a number m<M is specified such that at each node,
m variables are selected at random out of the M. The best split on these m is used
to split the node. The value of m is held constant while we grow the forest.
3. Each tree is grown to the largest extent possible and there is no pruning.
4. Predict new data by aggregating the predictions of the ntree trees (i.e., majority
votes for classification, average for regression).
# Import Library
from sklearn.ensemble import RandomForestClassifier
# Assumed you have X (predictors) and y (target) for the training set and x_test (predictors) for the test set
# Create Random Forest object
model = RandomForestClassifier(n_estimators=100)
# Train the model using the training set and check the score
model.fit(X, y)
model.score(X, y)
# Predict output
predicted = model.predict(x_test)
To find a weak rule, we apply base learning (ML) algorithms, each with a different distribution. Each time a base learning algorithm is applied, it generates a new weak prediction rule. This is an iterative process. After many iterations, the boosting algorithm combines these weak rules into a single strong prediction rule.
Step 1: The base learner takes all the distributions and assigns equal weight or attention to each observation.
Step 2: If there is any prediction error caused by first base learning algorithm, then we pay
higher attention to observations having prediction error. Then, we apply the next base
learning algorithm.
Step 3: Iterate Step 2 till the limit of base learning algorithm is reached or higher accuracy
is achieved.
Boosting puts more focus on examples which are misclassified or have higher errors under the preceding weak rules.
There are many boosting algorithms which impart additional boost to model’s accuracy.
In this tutorial, we’ll learn about the two most commonly used algorithms, i.e. Gradient Boosting (GBM) and XGBoost.
5. Errors are calculated using the actual values and predicted values.
6. The observations which are incorrectly predicted, are given higher weights.
(Here, the three misclassified blue-plus points will be given higher weights)
7. Another model is created and predictions are made on the dataset.
(This model tries to correct the errors from the previous model)
8. Similarly, multiple models are created, each correcting the errors of the previous
model.
9. The final model (strong learner) is the weighted mean of all the models (weak
learners).
Math Behind Gradient Boosting:
4.Handling Missing Values: XGBoost has an in-built routine to handle missing values.
User is required to supply a different value than other observations and pass that as a
parameter. XGBoost tries different things as it encounters a missing value on each
node and learns which path to take for missing values in future.
5. Tree Pruning: A GBM would stop splitting a node when it encounters a negative loss
in the split. Thus it is more of a greedy algorithm.
XGBoost, on the other hand, makes splits up to the max_depth specified and then starts pruning the tree backwards, removing splits beyond which there is no positive gain.
6. Built-in Cross-Validation
● XGBoost allows user to run a cross-validation at each iteration of the boosting
process and thus it is easy to get the exact optimum number of boosting
iterations in a single run.
● This is unlike GBM, where we have to run a grid search and only a limited number of values can be tested.
2.1 Update the weights for targets based on previous run (higher for the ones mis-classified)
2.4 Update the output with current results taking into account the learning rate
Lets consider the important GBM parameters used to improve model performance in
Python:
1. learning_rate
○ This determines the impact of each tree on the final outcome (step 2.4). GBM
works by starting with an initial estimate which is updated using the output of
each tree. The learning parameter controls the magnitude of this change
in the estimates.
○ Lower values are generally preferred as they make the model robust to the
specific characteristics of a tree and thus allowing it to generalize well.
○ Lower values would require higher number of trees to model all the relations
and will be computationally expensive.
2. n_estimators
○ The number of sequential trees to be modeled (step 2)
○ Though GBM is fairly robust at higher number of trees but it can still overfit at
a point. Hence, this should be tuned using CV for a particular learning
rate.
3. Subsample
The fraction of observations to be selected for each tree. Selection is done by
random sampling.
Values slightly less than 1 make the model robust by reducing the variance.
Typical values ~0.8 generally work fine but can be fine-tuned further.
4. loss: It refers to the loss function to be minimized in each split.
5. Init: This affects initialization of the output. This can be used if we have made
another model whose outcome is to be used as the initial estimates for GBM.
6. Random_state: The random number seed so that same random numbers are
generated every time. This is important for parameter tuning. If we don’t fix the
random number, then we’ll have different outcomes for subsequent runs on the same
parameters and it becomes difficult to compare models.
7. Verbose: The type of output to be printed when the model fits. The different values
can be:
GBM in Python
#import libraries
clf.fit(X_train, y_train)
Its ability to do parallel computing makes it at least 10 times faster than existing gradient boosting implementations. It supports various objective functions, including regression, classification and ranking.
2. GBM Parameters
A. min_samples_split
Defines the minimum number of samples (or observations) which are required in a
node to be considered for splitting. Too high values can lead to under-fitting hence, it
should be tuned using CV.
B. min_samples_leaf:
Defines the minimum samples (or observations) required in a terminal node or leaf.
Generally lower values should be chosen for imbalanced class problems because
the regions in which the minority class will be in majority will be very small.
C. min_weight_fraction_leaf: Similar to min_samples_leaf, but defined as a fraction of the total number of observations instead of an integer.
E. max_leaf_nodes
The maximum number of terminal nodes or leaves in a tree.
Can be defined in place of max_depth. Since binary trees are created, a depth of
‘n’ would produce a maximum of 2^n leaves.
F. max_features
The number of features to consider while searching for a best split. These will be
randomly selected.
As a thumb-rule, square root of the total number of features works great but we
should check upto 30-40% of the total number of features.
Higher values can lead to over-fitting but depends on case to case.
2. Boosting parameters:
A. Learning_rate: This determines the impact of each tree on the final outcome (step
2.4). GBM works by starting with an initial estimate which is updated using the output
of each tree. The learning parameter controls the magnitude of this change in the
estimates.
Lower values are generally preferred as they make the model robust to the specific
characteristics of tree and thus allowing it to generalize well.
Lower values would require higher number of trees to model all the relations and will
be computationally expensive.
B. n_estimators: The number of sequential trees to be modeled. Should be tuned
using CV for a particular learning rate.
C. Subsample
The fraction of observations to be selected for each tree. Selection is done by
random sampling.
Values slightly less than 1 make the model robust by reducing the variance.
Typical values ~0.8 generally work fine but can be fine-tuned further.
3. Miscellaneous parameters:
loss
● The loss function to be minimized in each split (see above).
random_state
● The random number seed so that same random numbers are generated every time.
● This is important for parameter tuning. If we don’t fix the random number, then we’ll
have different outcomes for subsequent runs on the same parameters and it
becomes difficult to compare models.
● It can potentially result in overfitting to a particular random sample selected. We can
try running models for different random samples, which is computationally expensive
and generally not used.
verbose
● The type of output to be printed when the model fits. The different values can be:
○ 0: no output generated (default)
○ 1: output generated for trees in certain intervals
○ >1: output generated for all trees
warm_start
● This parameter has an interesting application and can help a lot if used judiciously.
● Using this, we can fit additional trees on previous fits of a model. It can save a lot of
time and you should explore this option for advanced applications
presort
#Import libraries:
import pandas as pd
import numpy as np
%matplotlib inline
train = pd.read_csv('train_modified.csv')
target = 'Disbursed'
IDcol = 'ID'
alg.fit(dtrain[predictors], dtrain['Disbursed'])
dtrain_predictions = alg.predict(dtrain[predictors])
dtrain_predprob = alg.predict_proba(dtrain[predictors])[:,1]
#Perform cross-validation:
if performCV:
    # cv_score is assumed to have been computed with cross_val_score on the training data
    print("CV Score : Mean - %.7g | Std - %.7g | Min - %.7g | Max - %.7g" %
          (np.mean(cv_score), np.std(cv_score), np.min(cv_score), np.max(cv_score)))
if printFeatureImportance:
1. Choose a relatively high learning rate. Generally the default value of 0.1 works but
somewhere between 0.05 to 0.2 should work for different problems
2. Determine the optimum number of trees for this learning rate. This should range
around 40-70. Remember to choose a value on which your system can work fairly
fast. This is because it will be used for testing various scenarios and determining the
tree parameters.
3. Tune tree-specific parameters for decided learning rate and number of trees. Note
that we can choose different parameters to define a tree and I’ll take up an example
here.
4. Lower the learning rate and increase the estimators proportionally to get more
robust models.
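A hedged sketch of step 2 above, fixing a fairly high learning rate and grid-searching the number of trees (the tree-parameter values shown are illustrative starting points, not tuned results; X_train, y_train are assumed):
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
param_grid = {"n_estimators": list(range(20, 81, 10))}
gbm = GradientBoostingClassifier(learning_rate=0.1, min_samples_split=500, min_samples_leaf=50,
                                 max_depth=8, max_features="sqrt", subsample=0.8, random_state=10)
gsearch = GridSearchCV(estimator=gbm, param_grid=param_grid, scoring="roc_auc", cv=5)
gsearch.fit(X_train, y_train)
print(gsearch.best_params_, gsearch.best_score_)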
Fix learning rate and number of estimators for tuning tree-based
parameters
Booster Parameters
Though there are 2 types of boosters, I’ll consider only a tree booster here because it
always outperforms the linear booster and thus the latter is rarely used.
1. eta [default=0.3]
○ Analogous to learning rate in GBM
○ Makes the model more robust by shrinking the weights on each step
○ Typical final values to be used: 0.01-0.2
2. min_child_weight [default=1]
○ Here, min_child_weight means something like "stop trying to split once you
reach a certain degree of purity in a node and your model can fit it".
○ Defines the minimum sum of weights of all observations required in a child.
○ This is similar to min_samples_leaf in GBM, but not exactly: it refers to the minimum “sum of weights” of observations, while GBM uses the minimum “number of observations”.
○ Used to control over-fitting. Higher values prevent a model from learning
relations which might be highly specific to the particular sample selected for a
tree.
○ Too high values can lead to under-fitting hence, it should be tuned using CV.
3. max_depth [default=6]
○ The maximum depth of a tree, same as GBM.
○ Used to control over-fitting as higher depth will allow model to learn relations
very specific to a particular sample.
○ Should be tuned using CV.
○ Typical values: 3-10
4. max_leaf_nodes
○ The maximum number of terminal nodes or leaves in a tree.
○ Can be defined in place of max_depth. Since binary trees are created, a
depth of ‘n’ would produce a maximum of 2^n leaves.
○ If this is defined, GBM will ignore max_depth.
5. gamma [default=0]
○ A node is split only when the resulting split gives a positive reduction in the
loss function. Gamma specifies the minimum loss reduction required to make
a split.
○ Makes the algorithm conservative. The values can vary depending on the
loss function and should be tuned.
6. max_delta_step [default=0]
○ The maximum delta step we allow each tree’s weight estimation to be. If the value is set to 0, it means there is no constraint. If it is set to a positive value, it can help make the update step more conservative.
○ Usually this parameter is not needed, but it might help in logistic regression
when class is extremely imbalanced.
○ This is generally not used but you can explore further if you wish.
7. subsample [default=1]
○ Same as the subsample of GBM. Denotes the fraction of observations to be randomly sampled for each tree.
○ Lower values make the algorithm more conservative and prevent overfitting, but too small values might lead to under-fitting.
○ Typical values: 0.5-1
8. colsample_bytree [default=1]
○ Similar to max_features in GBM. Denotes the fraction of columns to be randomly sampled for each tree.
○ Typical values: 0.5-1
9. colsample_bylevel [default=1]
○ Denotes the subsample ratio of columns for each split, in each level.
○ I don’t use this often because subsample and colsample_bytree will do the job for you, but you can explore it further if you wish.
10. lambda [default=1]
○ L2 regularization term on weights (analogous to Ridge regression)
○ This is used to handle the regularization part of XGBoost. Though many data scientists don’t use it often, it should be explored to reduce overfitting.
11. alpha [default=0]
○ L1 regularization term on weight (analogous to Lasso regression)
○ Can be used in case of very high dimensionality so that the algorithm runs
faster when implemented
12. scale_pos_weight [default=1]
○ A value greater than 0 should be used in case of high class imbalance as it
helps in faster convergence.
These parameters are used to define the optimization objective and the metric to be calculated at each step.
1. objective [default=reg:linear]
○ This defines the loss function to be minimized. Mostly used values are:
■ binary:logistic –logistic regression for binary classification, returns
predicted probability (not class)
■ multi:softmax –multiclass classification using the softmax objective,
returns predicted class (not probabilities)
■ you also need to set an additional num_class (number of
classes) parameter defining the number of unique classes
■ multi:softprob –same as softmax, but returns predicted probability of
each data point belonging to each class.
eval_metric [ default according to objective ]
You must be wondering why we have defined everything except something similar to the “n_estimators” parameter in GBM. Well, this exists as a parameter in XGBClassifier. However, in the native xgboost implementation it has to be passed as num_boost_round when calling xgb.train.
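A hedged XGBoost sketch using the scikit-learn wrapper with the parameters discussed above (values are illustrative, not tuned; X_train, y_train, X_test are assumed):
from xgboost import XGBClassifier
xgb_clf = XGBClassifier(
    objective="binary:logistic",
    n_estimators=200,        # plays the role of num_boost_round in the native API
    learning_rate=0.1,       # eta
    max_depth=5,
    min_child_weight=1,
    gamma=0,
    subsample=0.8,
    colsample_bytree=0.8,
    scale_pos_weight=1,
)
xgb_clf.fit(X_train, y_train)
y_proba = xgb_clf.predict_proba(X_test)[:, 1]   # predicted probability of the positive class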
1. Bootstrap sample: The idea behind bagging is combining the results of multiple models
(for instance, all decision trees) to get a generalized result. Here’s a question: If you
create all the models on the same set of data and combine it, will it be useful? There is a
high chance that these models will give the same result since they are getting the same
input. So how can we solve this problem? One of the techniques is bootstrapping.
2. Bootstrapping is a sampling technique in which we create subsets of observations from
the original dataset, with replacement. The size of the subsets is the same as the size of
the original set.
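A minimal bootstrap-sampling sketch with NumPy (the toy data is hypothetical):
import numpy as np
rng = np.random.default_rng(42)
data = np.arange(10)                                        # stand-in for the original dataset
bootstrap_idx = rng.integers(0, len(data), size=len(data))  # sample with replacement, same size
bootstrap_sample = data[bootstrap_idx]
print(bootstrap_sample)   # some rows repeat, others never appear (those are the "out-of-bag" rows)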
Light GBM beats all the other algorithms when the dataset is extremely large.
Compared to the other algorithms, Light GBM takes lesser time to run on a huge dataset.
LightGBM is a gradient boosting framework that uses tree-based algorithms and grows trees leaf-wise, while most other algorithms grow them level-wise.
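A hedged sketch, assuming the lightgbm package is installed and X_train, y_train exist (parameter values are illustrative):
from lightgbm import LGBMClassifier
lgbm = LGBMClassifier(
    n_estimators=500,
    learning_rate=0.05,
    num_leaves=31,     # caps the leaf-wise growth instead of fixing a depth
)
lgbm.fit(X_train, y_train)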
CatBoost: Handling categorical variables is a tedious process, especially when you have a
large number of such variables. When your categorical variables have too many labels (i.e. they have high cardinality), performing one-hot encoding on them drastically increases the dimensionality and it becomes really difficult to work with the dataset.
CatBoost can automatically deal with categorical variables and does not require
extensive data preprocessing like other machine learning algorithms
AdaBoost
One way for a new predictor to correct its predecessor is to pay a bit more attention to the
training instances that the predecessor underfitted. This results in new predictors
focusing more and more on the hard cases. This is the technique used by AdaBoost. For
example, to build an AdaBoost classifier, a first base classifier (such as a Decision Tree) is
trained and used to make predictions on the training set. The relative weight of misclassified
training instances is then increased. A second classifier is trained using the updated
weights and again it makes predictions on the training set, weights are updated, and so on
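A minimal AdaBoost sketch with shallow trees (decision stumps) as base learners; X_train, y_train are assumed:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),   # decision stumps as the base classifier
    n_estimators=200,
    learning_rate=0.5,
)
ada_clf.fit(X_train, y_train)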
Another very popular Boosting algorithm is Gradient Boosting. Just like AdaBoost,
Gradient Boosting works by sequentially adding predictors to an ensemble, each one
correcting its predecessor. However, instead of tweaking the instance weights at every
iteration like AdaBoost does, this method tries to fit the new predictor to the residual
errors made by the previous predictor.
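A minimal sketch of this residual-fitting idea with three regression trees; X, y and X_new are assumed placeholders:
from sklearn.tree import DecisionTreeRegressor
tree_reg1 = DecisionTreeRegressor(max_depth=2)
tree_reg1.fit(X, y)
y2 = y - tree_reg1.predict(X)              # residual errors of the first predictor
tree_reg2 = DecisionTreeRegressor(max_depth=2)
tree_reg2.fit(X, y2)
y3 = y2 - tree_reg2.predict(X)             # residuals of the second predictor
tree_reg3 = DecisionTreeRegressor(max_depth=2)
tree_reg3.fit(X, y3)
# The ensemble prediction is the sum of the predictions of all the trees
y_pred = sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))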
Stacking
The last Ensemble method we will discuss in this chapter is called stacking (short for stacked generalization). It is based on a simple idea: instead of using trivial functions (such as hard
voting) to aggregate the predictions of all predictors in an ensemble, why don’t we train a model
to perform this aggregation? Figure 7-12 shows such an ensemble performing a regression task
on a new instance. Each of the bottom three predictors predicts a different value (3.1, 2.7, and
2.9), and then the final predictor (called a blender, or a meta learner) takes these predictions as
inputs and makes the final prediction (3.0).
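A hedged sketch of a stacking ensemble where a Logistic Regression acts as the blender/meta learner; X_train, y_train are assumed:
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
stack_clf = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=100)),
                ("svc", SVC(probability=True))],
    final_estimator=LogisticRegression(),   # the blender, trained on the base predictions
    cv=5,
)
stack_clf.fit(X_train, y_train)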
LIGHTGBM
Boosting: Sequential ensemble - weak learners are produced sequentially during training. The performance of the model is improved by assigning higher weight to incorrectly classified samples. Feed the entire data -> make predictions -> some examples get misclassified -> give more weight to these misclassified examples.
Exercises
1. If you have trained five different models on the exact same training data, and they all
achieve 95% precision, is there any chance that you can combine these models to get
better results? If so, how? If not, why?
2. What is the difference between hard and soft voting classifiers?
3. Is it possible to speed up training of a bagging ensemble by distributing it across
multiple servers? What about pasting ensembles, boosting ensembles, random forests,
or stacking ensembles?
4. What is the benefit of out-of-bag evaluation?
5. What makes Extra-Trees more random than regular Random Forests? How can this
extra randomness help? Are Extra-Trees slower or faster than regular Random Forests?
6. If your AdaBoost ensemble underfits the training data, what hyperparameters should
you tweak and how?
7. If your Gradient Boosting ensemble overfits the training set, should you increase or
decrease the learning rate?
1. If you have trained five different models and they all achieve 95% precision, you can
try combining them into a voting ensemble, which will often give you even better results.
It works better if the models are very different (e.g., an SVM classifier, a Decision Tree
classifier, a Logistic Regression classifier, and so on). It is even better if they are trained
on different training instances (that’s the whole point of bagging and pasting
ensembles), but if not it will still work as long as the models are very different.
2. A hard voting classifier just counts the votes of each classifier in the ensemble and
picks the class that gets the most votes. A soft voting classifier computes the average
estimated class probability for each class and picks the class with the highest
probability. This gives high-confidence votes more weight and often performs better, but
it works only if every classifier is able to estimate class probabilities (e.g., for the SVM
classifiers in Scikit-Learn you must set probability=True).
3. It is quite possible to speed up training of a bagging ensemble by distributing it across
multiple servers, since each predictor in the ensemble is independent of the others. The
same goes for pasting ensembles and Random Forests, for the same reason. However,
each predictor in a boosting ensemble is built based on the previous predictor, so
training is necessarily sequential, and you will not gain anything by distributing training
across multiple servers. Regarding stacking ensembles, all the predictors in a given
layer are independent of each other, so they can be trained in parallel on multiple
servers. However, the predictors in one layer can only be trained after the predictors in
the previous layer have all been trained.
4. With out-of-bag evaluation, each predictor in a bagging ensemble is evaluated using
instances that it was not trained on (they were held out). This makes it possible to
have a fairly unbiased evaluation of the ensemble without the need for an additional
validation set. Thus, you have more instances available for training, and your ensemble
can perform slightly better.
5. When you are growing a tree in a Random Forest, only a random subset of the
features is considered for splitting at each node. This is true as well for Extra-Trees, but
they go one step further: rather than searching for the best possible thresholds, like
regular Decision Trees do, they use random thresholds for each feature. This extra
randomness acts like a form of regularization: if a Random Forest overfits the training
data, Extra-Trees might perform better. Moreover, since Extra-Trees don’t search for the
best possible thresholds, they are much faster to train than Random Forests. However,
they are neither faster nor slower than Random Forests when making predictions.
6. If your AdaBoost ensemble underfits the training data, you can try increasing the
number of estimators or reducing the regularization hyperparameters of the base
estimator. You may also try slightly increasing the learning rate.
7. If your Gradient Boosting ensemble overfits the training set, you should try
decreasing the learning rate. You could also use early stopping to find the right number
of predictors (you probably have too many).
Q.What is pruning and what are the different types of pruning ?
Q.For a specific dataset, it is observed that a linear regression model outperforms all
tree based models. What conclusions can be drawn about the dataset from this ?
Q.Why are decision trees prone to overfitting ? How can this issue be addressed?