Machine Learning Project
DESCRIPTION
Reduce the time a Mercedes-Benz spends on the test bench.
Problem
Since the first automobile, the Benz Patent Motor Car in 1886, Mercedes-Benz has
stood for important automotive innovations. These include the passenger safety cell
with a crumple zone, the airbag, and intelligent assistance systems. Mercedes-Benz
applies for nearly 2000 patents per year, making the brand the European leader
among premium carmakers. Mercedes-Benz is the leader in the premium car industry.
With a huge selection of features and options, customers can choose the customized
Mercedes-Benz of their dreams.
To ensure the safety and reliability of every unique car configuration before it hits
the road, the company’s engineers have developed a robust testing system. For one
of the world’s biggest manufacturers of premium cars, safety and efficiency are
paramount on the production lines. However, optimizing the speed of the testing
system for so many possible feature combinations is complex and time-consuming
without a powerful algorithmic approach.
You are required to reduce the time that cars spend on the test bench. You will work
with a dataset representing different permutations of features in a Mercedes-Benz car
to predict the time it takes to pass testing. Optimal algorithms will contribute to faster
testing, resulting in lower carbon dioxide emissions without reducing Mercedes-Benz’s
standards.
The following actions should be performed:
• If the variance of any column is equal to zero, remove that column.
• Check for null and unique values in the train and test sets.
• Apply a label encoder to the categorical columns.
• Perform dimensionality reduction.
• Predict your test_df values using XGBoost.
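The steps listed above could be sketched in Python roughly as follows. This is a minimal sketch, not the notebook's actual code: the helper names, the use of PCA for the dimensionality-reduction step, the number of components, and the XGBoost hyperparameters are all assumptions made for illustration. train_df and test_df are assumed to be pandas DataFrames, with the target in column y of train_df.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import LabelEncoder


def drop_zero_variance(train_df, test_df, target="y"):
    """Drop feature columns that take a single value in the training set."""
    features = train_df.drop(columns=[target])
    constant = [c for c in features.columns if features[c].nunique() == 1]
    return train_df.drop(columns=constant), test_df.drop(columns=constant), constant


def null_and_unique_summary(df):
    """Per-column count of missing values and of distinct values."""
    return pd.DataFrame({"nulls": df.isnull().sum(), "unique": df.nunique()})


def label_encode(train_df, test_df):
    """Label-encode non-numeric columns.

    The encoder is fitted on train and test values together so that a
    category seen only in the test set still receives a code.
    """
    train_df, test_df = train_df.copy(), test_df.copy()
    for col in test_df.columns:
        if pd.api.types.is_numeric_dtype(train_df[col]):
            continue
        encoder = LabelEncoder()
        both = pd.concat([train_df[col], test_df[col]]).astype(str).to_numpy()
        encoder.fit(both)
        train_df[col] = encoder.transform(train_df[col].astype(str).to_numpy())
        test_df[col] = encoder.transform(test_df[col].astype(str).to_numpy())
    return train_df, test_df


def reduce_dimensions(X_train, X_test, n_components=12, seed=0):
    """Project both sets onto principal components fitted on the training set.

    PCA and n_components=12 are assumptions; the brief only asks for
    "dimensionality reduction" without naming a method.
    """
    pca = PCA(n_components=n_components, random_state=seed)
    return pca.fit_transform(X_train), pca.transform(X_test)


def fit_and_predict(X_train, y_train, X_test, seed=0):
    """Fit an XGBoost regressor and predict the test set.

    The hyperparameters are illustrative starting values, not tuned ones.
    """
    from xgboost import XGBRegressor  # imported here so the helpers above work without xgboost

    model = XGBRegressor(
        n_estimators=200,
        learning_rate=0.05,
        max_depth=4,
        subsample=0.8,
        random_state=seed,
    )
    model.fit(X_train, y_train)
    return model.predict(X_test)
```

One possible order of use is: drop_zero_variance, then null_and_unique_summary on both sets, then label_encode, then reduce_dimensions on the feature columns (with y, and possibly ID, removed), and finally fit_and_predict on the reduced features.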
Objective:
This dataset contains an anonymized set of variables that describe different Mercedes
cars. The ground truth is labelled 'y' and represents the time (in seconds) that the car
took to pass testing.
Target Variable:
The "y" variable is the one to be predicted, so some analysis is done on it first.
Next, look at the data types of all the variables present in the dataset.
The majority of the columns are integers, with 8 categorical columns and 1 float column.
All the integer columns are binary, and some of them have only one unique value, 0.
Those columns can be excluded from this modelling activity.
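A data-type check of this kind could look like the sketch below. The function name summarize_columns is hypothetical, and df is assumed to be the training DataFrame; the code only counts columns per dtype and lists the integer columns that are binary or constant.

```python
import pandas as pd


def summarize_columns(df):
    """Return (columns per dtype, binary integer columns, constant binary columns)."""
    dtype_counts = df.dtypes.astype(str).value_counts()
    integer_cols = [c for c in df.columns if pd.api.types.is_integer_dtype(df[c])]
    # A column counts as binary if every value it holds is 0 or 1.
    binary_cols = [c for c in integer_cols if set(df[c].unique()) <= {0, 1}]
    # Binary columns with a single distinct value carry no information.
    constant_cols = [c for c in binary_cols if df[c].nunique() == 1]
    return dtype_counts, binary_cols, constant_cols
```

The third returned list gives the candidates for exclusion mentioned above.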
Next, explore the categorical columns present in the dataset.
Binary Variables:
Now, looking into the binary variables. There are quite a few of them, as seen before.
Start by getting the number of 0's and 1's in each of these variables.
Then check the mean y value for each value of every binary variable.
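Both checks can be computed as tables before plotting. The sketch below assumes a training DataFrame df with target column y and a list binary_cols of 0/1 columns; the function names and the abs_diff column are illustrative additions, not part of the original notebook.

```python
import pandas as pd


def binary_value_counts(df, binary_cols):
    """Number of 0's and 1's in each binary column."""
    return pd.DataFrame({
        "zeros": (df[binary_cols] == 0).sum(),
        "ones": (df[binary_cols] == 1).sum(),
    })


def mean_target_by_value(df, binary_cols, target="y"):
    """Mean of the target for rows where each binary column is 0 and where it is 1."""
    rows = {}
    for col in binary_cols:
        # reindex keeps both keys even if a column never takes one of the values
        means = df.groupby(col)[target].mean().reindex([0, 1])
        rows[col] = {"mean_y_0": means.loc[0], "mean_y_1": means.loc[1]}
    result = pd.DataFrame.from_dict(rows, orient="index")
    result["abs_diff"] = (result["mean_y_1"] - result["mean_y_0"]).abs()
    return result
```

A large abs_diff together with reasonably balanced counts is the tabular counterpart of a strong colour difference in a heatmap of these means.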
Binary variables which show a good colour difference between 0 and 1 in the above
graphs are likely to be more predictive, provided the counts are also reasonably
balanced between the two classes. The important variables are examined in more detail
in a later part of the notebook.
ID variable:
One more important thing to look at is the ID variable. This gives an idea of how the
split between train and test was done, and also shows whether ID has any predictive
capability.
There seems to be a slight decreasing trend in y with respect to the ID variable, and the
IDs are distributed across both train and test.
It seems like a random split of the ID variable between the train and test samples.
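These observations can also be checked numerically. The sketch below is only one possible approach, under the assumption that both DataFrames have an ID column and that the training set has the target y; the function name id_diagnostics and the choice of a straight-line fit and quantile comparison are illustrative.

```python
import numpy as np
import pandas as pd


def id_diagnostics(train_df, test_df, id_col="ID", target="y"):
    """Return (slope of y against ID, ID quantiles per set, IDs shared by both sets)."""
    ids = train_df[id_col].to_numpy(dtype=float)
    y = train_df[target].to_numpy(dtype=float)
    # Slope of a least-squares line; a negative value means y tends to fall as ID grows.
    slope = np.polyfit(ids, y, deg=1)[0]
    # Similar quantiles in both sets are consistent with a random split on ID.
    quantiles = [0.0, 0.25, 0.5, 0.75, 1.0]
    spread = pd.DataFrame({
        "train": train_df[id_col].quantile(quantiles),
        "test": test_df[id_col].quantile(quantiles),
    })
    shared = set(train_df[id_col]) & set(test_df[id_col])
    return slope, spread, shared
```

A slope close to zero would suggest that ID adds little predictive value on its own.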
Important Variables: