RIP Correlation. Introducing The Predictive Power Score
Florian Wetschoreck
Apr 23 · 13 min read
. . .
Too many scenarios where the correlation is 0. This makes me wonder if I missed something… (Excerpt
from the image by Denis Boigelot)
If you are a little bit too well educated, you know that the correlation matrix is symmetric. So you can basically throw away one half of it. Great, we saved ourselves some work there! Or did we? Symmetry means that the correlation is the same whether you calculate the correlation of A and B or the correlation of B and A. However, relationships in the real world are rarely symmetric. More often, relationships are asymmetric. Here is an example: the last time I checked, my zip code of 60327 tells strangers quite reliably that I live in Frankfurt, Germany. But when I only tell them my city, somehow they are never able to deduce the correct zip code. Pff ... amateurs. Another example: a column with 3 unique values will never be able to perfectly predict another column with 100 unique values. But the opposite might be true. Clearly, asymmetry is important because it is so common in the real world.
. . .
Let’s say we have two columns and want to calculate the predictive power score of A predicting B. In this case, we treat B as our target variable and A as our (only) feature. We can now train a cross-validated Decision Tree and calculate a suitable evaluation metric. When the target is numeric, we can use a Decision Tree Regressor and calculate the Mean Absolute Error (MAE). When the target is categorical, we can use a Decision Tree Classifier and calculate the weighted F1. You might also use other scores like ROC AUC, but let’s put those doubts aside for a second because we have another problem, which we will get to in a moment.
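Before getting to that problem, here is a minimal sketch of the evaluation step just described, assuming scikit-learn. The helper name raw_score and the estimator settings are illustrative, not necessarily identical to what ppscore does internally.

import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

def raw_score(df: pd.DataFrame, feature: str, target: str, cv: int = 4) -> float:
    """Cross-validated score of a Decision Tree predicting `target` from `feature`."""
    # Assumes `feature` is numeric; a categorical feature would first need to be encoded.
    X = df[[feature]]
    y = df[target]
    if pd.api.types.is_numeric_dtype(y):
        # Numeric target: Decision Tree Regressor, evaluated with MAE (lower is better).
        scores = cross_val_score(DecisionTreeRegressor(), X, y, cv=cv,
                                 scoring="neg_mean_absolute_error")
        return -scores.mean()  # flip the sign back to a plain MAE
    # Categorical target: Decision Tree Classifier, evaluated with weighted F1.
    scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=cv, scoring="f1_weighted")
    return scores.mean()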
I guess you all know the situation: you tell your grandma that your new model has an F1 score of 0.9, and somehow she is not as excited as you are. In fact, this is very smart of her, because she does not know whether anyone can score 0.9 or whether you are the first human being who ever scored higher than 0.5 after millions of awesome Kagglers tried. So, we need to “normalize” our evaluation score. And how do you normalize a score? You define a lower and an upper limit and put the score into perspective. So what should the lower and upper limits be? Let’s start with the upper limit because this is usually easier: a perfect F1 is 1. A perfect MAE is 0. Boom! Done. But what about the lower limit? Actually, we cannot answer this in absolute terms: the lower limit depends on the dataset at hand and is given by the score of a naive baseline model, for example one that always predicts the most common class or the median of the target.
Please note: the normalization formula for the MAE is different from the one for the F1, because for MAE lower is better and the best possible value is 0.
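As a rough sketch of how such a normalization could look (the helper names and the exact formulas are assumptions in the spirit of the article, not necessarily the precise implementation used by ppscore):

def normalized_f1(model_f1: float, baseline_f1: float) -> float:
    """Map a weighted F1 to [0, 1]: 0 means no better than the naive baseline, 1 means perfect."""
    if model_f1 <= baseline_f1:
        return 0.0
    return (model_f1 - baseline_f1) / (1.0 - baseline_f1)

def normalized_mae(model_mae: float, baseline_mae: float) -> float:
    """Map an MAE to [0, 1]; lower MAE is better and a perfect MAE is 0, so the formula is inverted."""
    if model_mae >= baseline_mae:
        return 0.0
    return 1.0 - model_mae / baseline_mae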
. . .
Let’s use a typical quadratic relationship: the feature x is a uniform variable ranging from -2 to 2, and the target y is the square of x plus some error. In this case, x can predict y very well because there is a clear non-linear, quadratic relationship; after all, that is how we generated the data. However, this is not true in the other direction, from y to x. For example, if y is 4, it is impossible to predict whether x was roughly 2 or -2. Thus, the predictive relationship is asymmetric and the scores should reflect this.
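To make this concrete, here is a small sketch that generates such a dataset (the sample size and the uniform noise term are assumptions; any small error works):

import numpy as np
import pandas as pd

# x is uniform in [-2, 2]; y is x squared plus some error
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.uniform(-2, 2, 1_000)})
df["y"] = df["x"] ** 2 + rng.uniform(-0.5, 0.5, 1_000)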
What are the values of the scores in this example? If you don’t already know what you are looking for, the correlation will leave you hanging, because the correlation is 0, both from x to y and from y to x: the correlation is symmetric. However, the PPS from x to y is 0.67, detecting the non-linear relationship and saving the day. Nevertheless, the PPS is not 1 because there is some error in the relationship. In the other direction, the PPS from y to x is 0 because the prediction cannot be better than the naive baseline, and thus the score is 0.
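Using ppscore on the DataFrame generated above, you can check these numbers yourself (the exact scores will vary a little with the random noise and the library version):

import ppscore as pps

print(df.corr())                # Pearson correlation of x and y is roughly 0
print(pps.score(df, "x", "y"))  # PPS from x to y: clearly above 0, thanks to the quadratic pattern
print(pps.score(df, "y", "x"))  # PPS from y to x: (close to) 0, no better than the naive baseline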
Example 2: Comparing the Pearson correlation matrix (left) with the PPS matrix (right) for the Titanic dataset.
2. The correlation matrix shows a negative correlation of medium strength (-0.55) between TicketPrice and Class. We can double-check this relationship by having a look at the PPS. We will see that the TicketPrice is a strong predictor for the Class (0.9 PPS) but not vice versa: the Class only predicts the TicketPrice with a PPS of 0.2. This makes sense because whether your ticket cost $5,000 or $10,000, you were most likely in the highest class. In contrast, if you know that someone was in the highest class, you cannot say whether they paid $5,000 or $10,000 for their ticket. In this scenario, the asymmetry of the PPS shines again.
3. If you have a look at the column for TicketID, you can see that TicketID is a fairly good predictor for a range of columns. If you dig further into this pattern, you will find out that multiple persons had the same TicketID. Thus, the TicketID actually references a latent group of passengers who bought the ticket together, for example the big Italian Rossi family that turns any evening into a spectacle. So the PPS helped me to detect a hidden pattern.
4. What’s even more surprising than the strong predictive power of TicketID is the strong predictive power of TicketPrice across a wide range of columns, especially the fact that the TicketPrice is fairly good at predicting the TicketID (0.67) and vice versa (0.64). Upon further research you will find out that tickets often had unique prices. For example, only the Italian Rossi family paid a price of $72.50. This is a critical insight! It means that the TicketPrice contains information about the TicketID and thus about our Italian family: information that you need to have when considering potential information leakage.
5. Looking at the PPS matrix, we can see effects that might be explained by causal chains. (Did he just say causal? Of course, those causal hypotheses have to be treated carefully, but this is beyond the scope of this article.) For example, you might be surprised that the TicketPrice has predictive power for the survival rate (PPS 0.39). But if you know that the Class influences your survival rate (PPS 0.36) and that the TicketPrice is a good predictor for your Class (PPS 0.9), then you might have found an explanation.
. . .
Now that we have learned about the advantages of the PPS, let’s see where we can use the PPS in real life.
Disclaimer: There are use cases for both the PPS and the correlation. The PPS clearly has some advantages over correlation for finding predictive patterns in the data. However, once the patterns are found, the correlation is still a great way of communicating the linear relationships that were found.
Find patterns in the data: The PPS finds every relationship that the correlation
finds — and more. Thus, you can use the PPS matrix as an alternative to the
correlation matrix to detect and understand linear or nonlinear patterns in your
data. This is possible across data types using a single score that always ranges from
0 to 1.
Feature selection: In addition to your usual feature selection mechanism, you can use the predictive power score to find good predictors for your target column. Also, you can eliminate features that just add random noise; those features sometimes still score high in feature importance metrics. In addition, you can eliminate features that can be predicted by other features because they do not add new information. Besides, you can identify pairs of mutually predictive features in the PPS matrix; this includes strongly correlated features but will also detect non-linear relationships. (A small sketch of this use case follows after this list.)
Detect information leakage: Use the PPS matrix to detect information leakage
between variables — even if the information leakage is mediated via other
variables.
Data Normalization: Find entity structures in the data by interpreting the PPS matrix as a directed graph. This might be surprising when the data contains latent structures that were previously unknown. For example, the TicketID in the Titanic dataset is often an indicator for a family.
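Here is the feature-selection sketch mentioned above: a small helper that ranks candidate features by their PPS towards a target column. It is built on top of pps.score; the function name rank_features_by_pps and the Titanic column names are illustrative, not part of the library.

import pandas as pd
import ppscore as pps

def rank_features_by_pps(df: pd.DataFrame, target: str) -> pd.Series:
    """Return all other columns sorted by their PPS towards `target` (highest first)."""
    scores = {
        feature: pps.score(df, feature, target)["ppscore"]
        for feature in df.columns
        if feature != target
    }
    return pd.Series(scores).sort_values(ascending=False)

# Example with the Titanic columns from above (hypothetical DataFrame name):
# rank_features_by_pps(titanic_df, "Survived")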
. . .
If you want to calculate the PPS on your own data, we have some good news for you: we open-sourced an implementation of the PPS as a Python library named ppscore.
Before using the Python library, please take a moment to read through the calculation details.
import ppscore as pps
pps.matrix(df)
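If you only care about a single column pair, the library also exposes a single-pair function. A quick sketch, reusing the Titanic column names from the example above (the file path is a placeholder, and the exact fields of the returned dictionary may vary between library versions):

import pandas as pd
import ppscore as pps

df = pd.read_csv("titanic.csv")  # placeholder path; any DataFrame works
# PPS of TicketPrice predicting Class (returns a dict with the score and some metadata)
print(pps.score(df, "TicketPrice", "Class"))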
. . .
Limitations
We made it: you are excited and want to show the PPS to your colleagues. However, you know they are always so critical of new methods. That’s why you had better be prepared and know the limitations of the PPS:
1. The calculation of the PPS matrix is slower than the calculation of the correlation matrix.
2. The score cannot be interpreted as easily as the correlation because it does not tell you anything about the type of relationship that was found. Thus, the PPS is better for finding patterns, but the correlation is better for communicating the linear relationships that were found.
3. You cannot compare the scores for different target variables in a strict
mathematical way because they are calculated using different evaluation metrics.
The scores are still valuable in the real world, but you need to keep this in mind.
4. There are limitations to the components used under the hood. Please remember: you might exchange the components, e.g. using a GLM instead of a Decision Tree, or ROC AUC instead of F1 for binary classification.
5. If you use the PPS for feature selection, you still want to perform forward and backward selection in addition. Also, the PPS cannot detect interaction effects between features with respect to your target.
. . .
Conclusion
After years of using the correlation, we were so bold (or crazy?) as to suggest an alternative that can detect linear and non-linear relationships. The PPS can be applied to numeric and categorical columns, and it is asymmetric. We proposed an implementation and open-sourced a Python package. In addition, we showed the differences to the correlation with some examples and discussed some new insights that can be derived from the PPS matrix.
Now it is up to you to decide what you think about the PPS and whether you want to use it in your own projects. We have been using the PPS for over a year as part of the library bamboolib, where the PPS is essential for some advanced features, and thus we wanted to share the PPS with the broader community. Therefore, we hope to receive your feedback about the concept, and we would be thrilled if you try the PPS on your own data. If the reception is positive, we are happy to hear your requests for adjustments or improvements to the implementation. As we mentioned before, there are many ways to improve the speed and to adjust the PPS for more specific use cases.
Github: https://github.com/8080labs/ppscore
Newsletter: if you want to hear more about the PPS and our other upcoming Data
Science projects and tools, you can subscribe here. We will not write about paid
products, you can unsubscribe anytime and — sad that we even have to mention this
— we will never give away your email.
. . .