Introduction To Data Mining Using Orange

Welcome to the course on Introduction to Data Mining! You will see how common data mining tasks can be accomplished without programming. We will use Orange to construct visual data mining workflows. Many similar data mining environments exist, but the authors of these notes prefer Orange for one simple reason—they are its authors. For the courses offered by Orange developers, please visit https://orange.biolab.si/training.

These notes include the Orange workflows and visualizations we will construct during the course. The working notes were prepared by Blaž Zupan and Janez Demšar with help from the members of the Bioinformatics Lab in Ljubljana that develops and maintains Orange.

If you haven’t already installed Orange, please download the installation package from http://orange.biolab.si.

Attribution-NonCommercial-NoDerivs
CC BY-NC-ND

University of Ljubljana
Zupan, Demsar: Introduction to Data Mining, May 2018
We construct workflows by dragging widgets onto the canvas and connecting them by drawing a line from the transmitting widget to the receiving widget. The widget’s outputs are on the right and the inputs on the left. In the workflow above, the File widget sends data to the Data Table widget.

The screenshot above shows a simple workflow with two connected widgets and one widget without connections. The outputs of a widget appear on the right, while the inputs appear on the left.
The File widget reads data from your local disk. Open the File widget by double-clicking its icon. Orange comes with several preloaded data sets. From these (“Browse documentation data sets…”), choose brown-selected.tab, a yeast gene expression data set.
After you load the data, open the other widgets. In the Scatter Plot widget, select a few data points and watch them appear in the Data Table (1) widget. Use a combination of two Scatter Plot widgets, where the second scatter plot shows a detail from a smaller region selected in the first scatter plot.
We can connect the output of the Data Table widget to the Scatter
Plot widget to highlight the chosen data instances (rows) in the
scatter plot.
How does Orange distinguish between the primary data source and the data selection? It uses the first connected signal as the entire data set and the second one as its subset. To make changes or to check what is happening under the hood, double-click the line connecting the two widgets.
The rows in the data set we are exploring in this lesson are gene profiles. We can use the Gene Info widget from the Bioinformatics add-on to get more information on the genes we selected in any of the Orange widgets.

Orange comes with a basic set of widgets for data input, preprocessing, visualization and modeling. For other tasks, like text mining, network analysis, and bioinformatics, there are add-ons. Check them out by selecting “Add-ons…” from the Options menu.
Now again, add the File widget and open another documentation
data set: heart_disease. How does the data look?
In the Select Rows widget, we choose the female patients. You can also add other conditions. Selecting data instances works well with visualizations of data distributions. Try having at least two widgets open at the same time and explore the data.
You can save the workflow (File→Save) and share it with your colleagues. Just don’t forget to put the data files in the same directory as the file with the workflow.
Widgets also have a Report button, which you can use to keep a log of your analysis. When you find something interesting, like an unexpected Sieve Diagram, just click Report to add the graph to your log. You can also add reports from the widgets on the path to this one, to make sure you don’t forget anything relevant.

You can save the report as HTML or PDF, or to a file that includes all workflows that are related to the report items and which you can later open in Orange. In this way, you and your colleagues can reproduce your analysis results.

One more trick: pressing Ctrl-C (or ⌘-C on a Mac) copies a visualization to the clipboard, so you can paste it into another application.
Looks OK. Orange has correctly guessed that student names are character strings and that this column in the data set is special, meant to provide additional information and not to be used for modeling (more about this in the coming lectures). All other columns are numeric features.
Double-click the Data Table widget to see the data in spreadsheet format.
Lesson 5: Classification

In one of the previous lessons, we explored the heart disease data. We wanted to predict which persons have clogged arteries — but we did not make any predictions. Let’s try it now.

This won’t do: the Predictions widget shows the data, but makes no predictions. It can’t. For this, it needs a model. Like this.

The data is fed into the Tree widget, which uses it to infer a predictive model. The Predictions widget now gets the data from the File widget and also a predictive model from the Tree widget. This is something new: in our past workflows, widgets passed only data to each other, but here we have a channel that carries a model.
Trees place the most useful feature at the root. What would the most useful feature be? It is the feature that splits the data into the two purest possible subsets. These are then split further, again by the most informative features. This process of breaking the data into smaller and smaller subsets repeats until we reach subsets where all data belongs to the same class. These subsets are represented by leaf nodes in strong blue or red. The process of data splitting can also terminate when it runs out of data instances or out of useful features (the two leaf nodes in white).

We still have not been very explicit about what we mean by “the most useful” feature. There are many ways to measure this. We can compute some such scores in Orange using the Rank widget, which estimates the quality of data features and ranks them according to how much information they carry. We can compute the scores from the whole data set or from the data corresponding to some node of the classification tree in the Tree Viewer.

The Rank widget could also be used on its own. Say, to figure out which genes are the best predictors of the phenotype in some gene expression data set. Or which experimental conditions to consider to profile the genes and assign their function. Oh, but we have already worked with a data set of this kind. What does Rank tell us about it?
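The notion of a feature that “carries information” can be made concrete. One common scorer (also one of those offered by Rank) is information gain, the drop in class entropy after splitting on a feature. A minimal sketch in plain Python; the function names are ours, not Orange’s:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Drop in class entropy after splitting on a discrete feature."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# A feature that splits the classes perfectly carries maximal information.
feature = ["a", "a", "b", "b"]
labels = ["yes", "yes", "no", "no"]
print(information_gain(feature, labels))  # → 1.0
```

A feature whose values are unrelated to the class would score 0: splitting on it leaves the subsets exactly as impure as the whole data set.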
Just for fun, we have included a few other widgets in this workflow.
In a way, the Tree Viewer widget behaves like the Select Rows
widget, except that the rules used to filter the data are inferred
from the data itself and optimized to obtain purer data subsets.
Let us try this schema with the brown-selected data set. The Predictions widget outputs a data table augmented with a column that includes the predictions. In the Data Table widget, we can sort the data by either of these two columns, and manually select the data instances where the values of these two columns are different (this would not work on big data). Roughly estimating the accuracy of the predictions visually is straightforward in the Distributions widget, if we set the features in view appropriately.
Fine. There can be no classifier that can model this mess, right? Let us test this. We will build a classification tree and check its performance on the messed-up data set.
Let’s check how the Distributions widget looks after testing the classifier on the test data.

The first two classes are a complete fail. The predictions for ribosomal genes are a bit better, but still with lots of mistakes. On the class-randomized training data, our classifier fails miserably. Finally, this is just as we would expect.

It turns out that for every class value, the majority of data instances has been predicted to be in the ribosomal class (green). Why? Green again (like the green from the Scatter Plot of the messed-up data)? Here is a hint: use the Box Plot widget to answer this question.
Note that in each iteration, Test & Score will pick part of the data for training, learn the predictive model on this data using some machine learning method, and then test the accuracy of the resulting model on the remaining, test data set. For this, the widget will need on its input a data set from which it will sample the data for training and testing, and a learning method which it will use on the training data set to construct a predictive model. In Orange, the learning method is simply called a learner. Hence, Test & Score needs a learner on its input. A typical workflow with this widget is as follows.

This is another way to use the Tree widget. In the workflows from the previous lessons we have used another of its outputs, called Model: its construction required the data. This time, no data is needed for Tree, because all that we need from it is a learner.

For geeks: a learner is an object that, given the data, outputs a classifier. Just what Test & Score needs.

Here is what the Test & Score widget looks like. CA stands for classification accuracy, and this is what we really care about for now. We will talk about other measures, like AUC, later.

Cross validation splits the data set into, say, 10 non-overlapping subsets we call folds. In each iteration, one fold will be used for testing, while the data from all other folds will be used for training. In this way, each data instance will be used for testing exactly once.
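The cross-validation procedure just described can be sketched in a few lines of plain Python. This is an illustration, not Orange’s implementation; `majority_learner` is a made-up example of a learner in the Orange sense (given data, it returns a classifier):

```python
from collections import Counter

def cross_validate(data, labels, fit, k=10):
    """k-fold cross-validation: each instance is tested exactly once."""
    n = len(data)
    folds = [list(range(i, n, k)) for i in range(k)]  # non-overlapping index sets
    correct = 0
    for test_idx in folds:
        train_idx = [i for i in range(n) if i not in set(test_idx)]
        model = fit([data[i] for i in train_idx], [labels[i] for i in train_idx])
        correct += sum(model(data[i]) == labels[i] for i in test_idx)
    return correct / n  # classification accuracy (CA)

# A "learner": given data, it returns a classifier (here, a trivial one
# that always predicts the most frequent class seen in training).
def majority_learner(xs, ys):
    majority = Counter(ys).most_common(1)[0][0]
    return lambda x: majority
```

For example, `cross_validate(list(range(20)), ["a"] * 15 + ["b"] * 5, majority_learner, k=5)` returns 0.75: every fold’s training set has "a" in the majority, so only the "a" instances in each test fold are predicted correctly.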
Let us see if this is really so. We give two learners to the Test & Score widget and check if the cross-validated classification accuracy is indeed higher for random forest. Choose different classification data sets for this comparison, starting with those we already know (heart disease, iris, brown-selected).
There are other classifiers we can try. We will briefly mention a few
more, but instead of diving into what they do (we could spend a
semester on this!), we’ll pass on to other important topics in data
mining. At this point, just add them to the workflow above and see
how they perform.
Alternatively, we can include a preprocessor in a learning method. The preprocessor is then called on the training data set just before this learner performs inference of the predictive model.

The Preprocess widget does not necessarily require a data set on its input. An alternative use of this widget is to output a method for data preprocessing, which we can then pass to either a learning method or to a widget for cross validation.
Not necessarily. His specialty, in fact, is rare diseases (2 out of 100 of his patients have one) and, being lazy, he always dismisses everybody as healthy. His predictions are worthless — although extremely accurate. Classification accuracy is not an absolute measure that can be judged out of context. At the very least, it has to be compared with the frequency of the majority class, which is, in the case of rare diseases, quite … major.

For instance, on the GEO data set GDS 4182, the classification tree achieves 78% accuracy in cross validation, which may seem reasonably good. Let us compare this with the Constant model, which implements Dr. Smith’s strategy by always predicting the majority class. It gets 83%. Classification trees are not so good after all, are they?
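Dr. Smith’s situation is easy to reproduce: with 2 sick patients out of 100, the always-healthy strategy scores 98% accuracy while detecting nobody. A tiny sketch (the names are ours):

```python
from collections import Counter

def class_accuracy(predicted, actual):
    """Fraction of instances where the prediction matches the true class."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

# Dr. Smith's "Constant" strategy: always predict the majority class.
patients = ["healthy"] * 98 + ["sick"] * 2
majority = Counter(patients).most_common(1)[0][0]
predictions = [majority] * len(patients)

print(class_accuracy(predictions, patients))  # → 0.98
```

Any classifier we propose for this data has to beat 0.98 before its accuracy means anything at all.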
Maybe not, again. Say you fall down the stairs and your leg hurts. You open Orange, enter some data into your favorite model and compute a 20% chance of having your leg broken. So you assume your leg is not broken and you take an aspirin. Or perhaps not?

What if the chance of a broken leg was just 10%? 5%? 0.1%?

Say we decide that any leg with a 1% chance of being broken will be classified as broken. What will this do to our classification accuracy? It is going to decrease badly — but we apparently do not care. What do we care about then? What kind of “accuracy” is important?

Classes versus probabilities estimated by logistic regression. Can you replicate this image?
So, if you’re classified as OK, you have a 90% chance of actually being OK? No, it’s the other way around: 90% is the chance of being classified as OK if you are OK. (Think about it, it’s not as complicated as it sounds.) If you’re interested in your chance of being OK if the classifier tells you so, you look for the negative predictive value. Then there’s also precision, the probability of being positive if you’re classified as such. And the fall-out and negative likelihood ratio and … a whole list of other fancy, barely distinguishable names, each useful for some purpose.

If you are interested in a complete list, see the Wikipedia page on the Receiver operating characteristic, https://en.wikipedia.org/wiki/Receiver_operating_characteristic.
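All of these scores are simple ratios over the binary confusion matrix. A small sketch, with made-up counts (18 true positives, 4 false positives, 2 false negatives, 76 true negatives):

```python
def metrics(tp, fp, fn, tn):
    """Common scores derived from a binary confusion matrix."""
    return {
        "sensitivity": tp / (tp + fn),  # recall: P(classified positive | positive)
        "specificity": tn / (tn + fp),  # P(classified negative | negative)
        "precision":   tp / (tp + fp),  # P(positive | classified positive)
        "npv":         tn / (tn + fn),  # negative predictive value
    }

m = metrics(tp=18, fp=4, fn=2, tn=76)
print(m)  # sensitivity 0.9, specificity 0.95, precision ≈ 0.82, NPV ≈ 0.97
```

Note how sensitivity (0.9) and negative predictive value (≈ 0.97) answer the two different questions from the paragraph above, even though both are easily read as “90% chance of being OK”.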
Here are the curves for logistic regression, SVM with linear
kernels and naive Bayesian classifier on the same ROC plot.
The curves show how the sensitivity (y-axis) and specificity (x-axis,
but from right to left) change with different thresholds.
There is a popular score derived from the ROC curve, called Area Under Curve, AUC. It measures, well, the area under the curve. This curve. If the curve goes straight up and then right, the area is 1; such an optimal AUC is not reached in practice. If the classifier guesses at random, the curve follows the diagonal and the AUC is 0.5. Anything below that is equivalent to guessing + bad luck.

AUC also has a nice probabilistic interpretation. Say that we are given two data instances and we are told that one is positive and the other is negative. We use the classifier to estimate the probability of being positive for each instance, and decide that the one with the higher probability is positive. It turns out that the probability that such a decision is correct equals the AUC of this classifier. Hence, AUC measures how well the classifier discriminates between the positive and negative instances.

ROC curves and AUC are fascinating tools. To learn more, read T. Fawcett: ROC Graphs: Notes and Practical Considerations for Researchers.
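This pairwise interpretation gives a direct, if slow, way to compute AUC: compare every positive against every negative and count the fraction of pairs the classifier ranks correctly. A sketch (ties count one half, as on the ROC diagonal):

```python
def auc(scores_pos, scores_neg):
    """AUC as the probability that a randomly chosen positive instance
    gets a higher score than a randomly chosen negative one."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

# A classifier that ranks every positive above every negative has AUC 1.
print(auc([0.9, 0.8], [0.3, 0.1]))  # → 1.0
print(auc([0.9, 0.2], [0.6, 0.1]))  # → 0.75
```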
The question above requires us to define what a good fit is. Say, this could be the error the fitted model (the line) makes when it predicts the value of y for a given data point (value of x). The prediction is h(x), so the error is h(x) − y. We should treat the negative and positive errors equally, plus, let us agree, we would prefer punishing larger errors more severely than smaller ones. Therefore, it is perfectly fine if we square the errors for each data point and then sum them up. We have our objective function! It turns out that there is only one line that minimizes this function. The procedure that finds it is called linear regression. For cases where we have only one input feature, Orange has a special widget in the educational add-on called Polynomial Regression.

Do not worry about the strange name of the widget Polynomial Regression; we will get there in a moment.

Looks OK. Except that these data points do not appear exactly on the line. We could say that the linear model is perhaps too simple for our data sets. Here is a trick: besides the column x, the Polynomial Regression widget can add columns x², x³, …, xⁿ to our data set. The number n is the degree of the polynomial expansion the widget performs. Try setting this number to higher values, say to 2, and then 3, and then, say, to 9. With a degree of 3, we are then fitting the data with the polynomial h(x) = 𝜃₀ + 𝜃₁x + 𝜃₂x² + 𝜃₃x³.
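The whole trick (expand the features, then run ordinary least squares) fits in a few lines of plain Python. A sketch using the normal equations; the helper names are ours, and real implementations use numerically safer solvers:

```python
def design_matrix(xs, degree):
    """Polynomial expansion: one row [1, x, x**2, ..., x**degree] per point."""
    return [[x ** d for d in range(degree + 1)] for x in xs]

def solve(A, b):
    """Gauss-Jordan elimination with partial pivoting for a small system."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def fit_poly(xs, ys, degree):
    """Least squares: minimize the sum of squared errors (normal equations)."""
    X = design_matrix(xs, degree)
    XtX = [[sum(X[i][a] * X[i][b] for i in range(len(X)))
            for b in range(degree + 1)] for a in range(degree + 1)]
    Xty = [sum(X[i][a] * ys[i] for i in range(len(X))) for a in range(degree + 1)]
    return solve(XtX, Xty)  # the thetas

# Points on y = 1 + 2x are recovered exactly by a degree-1 fit.
theta = fit_poly([0, 1, 2, 3], [1, 3, 5, 7], degree=1)
print([round(t, 6) for t in theta])  # → [1.0, 2.0]
```

Raising `degree` adds columns to the design matrix, exactly as the Polynomial Regression widget adds columns to the data table.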
The trick we have just performed (adding the higher order features
to the data table and then performing linear regression) is called
Polynomial Regression. Hence the name of the widget. We get
something reasonable with polynomials of degree 2 or 3, but then
the results get really wild. With higher degree polynomials, we
totally overfit our data.
More complex models can fit the training data better. The fitted curve can wiggle sharply. The derivatives of such functions are high, and so the coefficients 𝜃 need to be high as well. If only we could force the linear regression to infer models with small values of the coefficients. Oh, but we can. Remember, we started with the optimization function that linear regression minimizes, the sum of squared errors. We could simply add to this the sum of all 𝜃 squared, and ask the linear regression to minimize both terms. Perhaps we should weigh the part with 𝜃 squared with some coefficient λ, just to control the level of regularization.

Which inference of the linear model would overfit more, the one with a high λ or the one with a low λ? What should the value of λ be to cancel regularization? What if the value of λ is really high, say 1000?
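The regularized objective we just wrote down can be minimized directly. A sketch with plain gradient descent (just to make the objective concrete; practical ridge regression solvers use a closed form instead, and all names here are ours):

```python
def ridge_loss(theta, X, y, lam):
    """Sum of squared errors plus lambda times the sum of squared thetas."""
    sse = sum((sum(t * xj for t, xj in zip(theta, row)) - yi) ** 2
              for row, yi in zip(X, y))
    return sse + lam * sum(t * t for t in theta)

def ridge_gd(X, y, lam, lr=0.01, steps=5000):
    """Minimize the ridge objective by plain gradient descent."""
    theta = [0.0] * len(X[0])
    for _ in range(steps):
        grad = [2 * lam * t for t in theta]  # gradient of the penalty term
        for row, yi in zip(X, y):
            err = sum(t * xj for t, xj in zip(theta, row)) - yi
            for j, xj in enumerate(row):
                grad[j] += 2 * err * xj      # gradient of the squared error
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta

X = [[1, 0], [1, 1], [1, 2], [1, 3]]  # columns: intercept, x
y = [1, 3, 5, 7]                       # exactly y = 1 + 2x
print(ridge_gd(X, y, lam=0.0))   # ≈ [1.0, 2.0], no regularization
print(ridge_gd(X, y, lam=10.0))  # coefficients shrink toward zero
```

With λ = 0 the penalty vanishes and we are back to ordinary least squares; with a very large λ, the penalty dominates and all coefficients are pushed toward zero.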
Here we go: we have just reinvented regularization, a procedure that helps machine learning models not to overfit the training data. To observe the effects of the regularization, we can give Polynomial Regression our own learner, which supports this kind of setting.

Internally, if no learner is present on its input, the Polynomial Regression widget uses ordinary, non-regularized linear regression.

Now for the test. Increase the degree of the polynomial to the max. Use Ridge Regression. Does the inferred model overfit the data? How does the degree of overfitting depend on the regularization strength?
The core of this lesson is to compare the error on the training and test set while varying the level of regularization. Remember, regularization controls overfitting - the more we regularize, the less tightly we fit the model to the training data. So for the training set, we expect the error to drop with less regularization and more overfitting, and to increase with more regularization and less fitting. No surprises expected there. But how does this play out on the test set? Which side minimizes the test-set error? Or is the optimal level of regularization somewhere in between? How do we estimate this level of regularization from the training data alone?
We can play around with this workflow by painting the data such that the regression performs well on the blue data points and fails on the red outliers. In the scatter plot we can check if the difference between the predicted and true class was indeed what we expected.
A similar workflow would work for any data set. Take, for instance,
the housing data set (from Orange distribution). Say, just like
above, we would like to plot the relation between true and
predicted continuous class, but would like to add information on
the absolute error the predictor makes. Where is the error coming
from? We need a new column. The Feature Constructor widget
(albeit being a bit geekish) comes to the rescue.
Now to the core of this lesson. Our workflow reads the data, continuizes it such that we also normalize all the features to bring them to the same scale, then feeds the data into the Linear Regression widget, and we check out the feature coefficients in the Data Table.
We need to start with a definition of “similar”. One simple measure of similarity for such data is the Euclidean distance: square the differences across every dimension, sum them and take the square root, just like in the Pythagorean theorem. So, we would like to group data instances with small Euclidean distances.

Now we need to define a clustering algorithm. We will start with each data instance being in its own cluster. Next, we merge the clusters that are closest together - like the closest two points - into one cluster. Repeat. And repeat. And repeat. And repeat until you end up with a single cluster containing all points.

This procedure constructs a hierarchy of clusters, which explains why we call it hierarchical clustering. After it is done, we can cut the hierarchy at any level to obtain the clusters.

How do we measure the similarity between clusters if we only know the similarities between points? By default, Orange computes the average distance between all their pairs of data points; this is called average linkage. We could instead take the distance between the two closest points in each cluster (single linkage), or the two points that are furthest away (complete linkage).
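The merge-the-closest-clusters procedure, with average linkage, can be sketched in plain Python (a naive illustration that recomputes all pairwise linkages at every step, not Orange’s implementation):

```python
from math import dist  # Euclidean distance, as in the Pythagorean theorem

def hierarchical_clustering(points, n_clusters=1):
    """Merge the two closest clusters (average linkage) until n_clusters remain."""
    clusters = [[p] for p in points]  # every point starts in its own cluster
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # average linkage: mean distance over all cross-cluster pairs
                d = sum(dist(a, b) for a in clusters[i] for b in clusters[j])
                d /= len(clusters[i]) * len(clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return clusters

two = hierarchical_clustering([(0, 0), (0, 1), (5, 5), (5, 6)], n_clusters=2)
print(sorted(map(sorted, two)))  # → [[(0, 0), (0, 1)], [(5, 5), (5, 6)]]
```

Swapping the averaged linkage line for `min(...)` or `max(...)` over the cross-cluster distances would give single or complete linkage, respectively.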
Let us see how this works. Load the data, compute the distances and cluster the data. In the Hierarchical Clustering widget, cut the hierarchy at a certain distance score and observe the corresponding clusters in the Scatter Plot.
For this data set, though, we can do something even better. The data already contains some predefined groups. Let us check how well the clusters match the classes - which we know, but clustering did not.
Use the Paint widget to paint some data - maybe five groups of
points. Feed it to Interactive k-means and set the number of
centroids to 5. You may get something like this.
For this, we abandon our educational toy and connect Paint to the
widget k-Means. We tell it to find the optimal number of clusters
between 2 and 8, as scored by the Silhouette.
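The Interactive k-Means widget animates the two alternating steps of the algorithm. A bare-bones sketch of those steps (Lloyd’s iterations) in plain Python; the function name and toy points are ours, not Orange’s API:

```python
from math import dist
from statistics import mean

def k_means(points, centroids, steps=10):
    """Lloyd's iterations: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    for _ in range(steps):
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: dist(p, centroids[i]))
            groups[nearest].append(p)
        centroids = [tuple(mean(c) for c in zip(*g)) if g else c0
                     for g, c0 in zip(groups, centroids)]
    return centroids

pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(sorted(k_means(pts, centroids=[(0, 0), (10, 10)])))
# → [(0, 0.5), (10, 10.5)]
```

The result depends on the starting centroids; this is why k-Means is usually restarted several times, and why scoring the resulting clusterings (for instance, by the Silhouette) is needed to pick the best run and the best number of clusters.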
The data points in the green cluster are well separated from those in the other two. Not so for the blue and red points, where several points are on the border between the clusters. We would like to quantify how well a data point belongs to the cluster to which it is assigned.

For a given data point (say the blue point in the image on the left), we can measure the distance to all the other points in its cluster and compute the average. Let us denote this average distance with A. The smaller A is, the better.
On the other hand, we would like a data point to be far away from the points in the closest neighboring cluster. The closest cluster to our blue data point is the red cluster. We can measure the distances between the blue data point and all the points in the red cluster, and again compute the average. Let us denote this average distance as B. The larger B is, the better.

The point is well rooted within its own cluster if the distance to the points from the neighboring cluster (B) is much larger than the distance to the points from its own cluster (A), hence we compute B − A. We normalize it by dividing by the larger of these two numbers, S = (B − A) / max{A, B}. Voilà, S is our silhouette score.

Orange has a Silhouette Plot widget that displays the values of the silhouette score for each data instance. We can also choose a particular data instance in the silhouette plot and check out its position in the scatter plot.

C3 is the green cluster, and all its points have large silhouettes. Not so for the other two.
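The score S for a single point can be written down directly from the definitions of A and B above (a sketch; the function name is ours, not the widget’s API):

```python
from math import dist

def silhouette(point, own, others):
    """Silhouette score of one point: own is the rest of its cluster,
    others is a list of the remaining clusters."""
    A = sum(dist(point, p) for p in own) / len(own)
    B = min(sum(dist(point, p) for p in c) / len(c) for c in others)
    return (B - A) / max(A, B)

# A point snugly inside its cluster scores close to 1 ...
print(round(silhouette((0, 0), [(0, 1), (1, 0)], [[(10, 10), (10, 11)]]), 3))
# → 0.931
# ... while a point exactly halfway between two clusters scores 0.
print(round(silhouette((5, 0), [(0, 0), (0, 1)], [[(10, 0), (10, 1)]]), 3))
# → 0.0
```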
Ah, one more thing: Silhouette Plot can be used on any data, not
just on data sets that are the output of clustering. We could use it
with the iris data set and figure out which class is well separated
from the other two and, conversely, which data instances from one
class are similar to those from another.
We don’t have to group the instances by the class. For instance, the silhouette on the left would suggest that the patients from the heart disease data with typical anginal pain are similar to each other (with respect to the distance/similarity computed from all features), while those with other types of pain, especially non-anginal pain, are not clustered together at all.
How much sense does it make? Austin and San Antonio are closer
to each other than to Houston; the tree is then joined by Dallas.
On the other hand, New Orleans is much closer to Houston than
to Miami. And, well, good luck hitchhiking from Anchorage to
Honolulu.
The real problem is New Orleans and San Antonio: New Orleans is close to Atlanta and Memphis, Miami is close to Jacksonville and Tampa. And these two clusters are suddenly more similar to each other than to some distant cities in Texas.

In general, two points from different clusters may be more similar to each other than to some points from their corresponding clusters.

We can’t run k-means clustering on this data, since we only have distances, and k-means runs on real (tabular) data. Yet, k-means would have the same problem as hierarchical clustering.

To get a better impression of the physical layout of the cities, people have invented a better tool: a map! Can we reconstruct a map from a matrix of distances? Sure. Take any pair of cities and put them on paper at a distance corresponding to some scale. Add the third city and put it at the corresponding distances from the two. Continue until done. Excluding, for the sake of scale, Anchorage, we get the following map.
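The pencil-and-paper procedure for the first three cities can be coded directly: fix two points on a line, then place the third at the intersection of two circles. A sketch; the distances below are made-up round numbers forming a right triangle:

```python
from math import sqrt, dist

def place_third(d_ab, d_ac, d_bc):
    """Put A at the origin and B on the x-axis, then find C from its
    distances to A and B (intersection of two circles)."""
    a, b = (0.0, 0.0), (d_ab, 0.0)
    x = (d_ab ** 2 + d_ac ** 2 - d_bc ** 2) / (2 * d_ab)
    y = sqrt(max(d_ac ** 2 - x ** 2, 0.0))  # pick the solution above the axis
    return a, b, (x, y)

# Distances chosen to form a 3-4-5 right triangle.
a, b, c = place_third(d_ab=3, d_ac=4, d_bc=5)
print(c)                     # → (0.0, 4.0)
print(round(dist(b, c), 6))  # → 5.0
```

Beyond three points, the distances are generally no longer perfectly consistent with a flat map, which is why methods like MDS instead minimize the overall disagreement between the map distances and the given ones.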
Does the map make any sense? Are similar animals together? Color
the points by the types of animals and you should see.
The map of the US was accurate: one can put the points in a plane
so that the distances correspond to actual distances between cities.
For most data, this is usually impossible. What we get is a projection (a non-linear projection, if you care about mathematical finesse) of the data. You lose something, but you get a picture.
The MDS algorithm does not always find the optimal map. You
may want to restart the MDS from random positions. Use the
slider “Show similar pairs” to see whether the points that are
placed together (or apart) actually belong together. In the above
case, the honeybee belongs closer to the wasp, but could not fly
there as in the process of optimization it bumped into the hostile
region of flamingos and swans.
Yes, the first scatter plot looks very useful: it tells us that x and y are highly correlated and that we have three clusters of somewhat irregular shape. But remember: this data is three-dimensional. What if we saw it from another, perhaps better perspective? (No spoilers here, but we’ll add the figure after the lecture!)

Spoiler figures successfully removed :-).

Think about what we’ve done. What are the properties of the best projection?
We again talk about a two-dimensional projection only for the sake of illustration. Imagine that we have ten-thousand-dimensional data and we would like, for some reason, to keep just ten features.
Yes, we can rank the features and keep the most informative, but
what if these are correlated and tell us the same thing? Or what if
our data does not have any target variable: with what should the
"good features" be correlated? And what if the optimal projection
is not aligned with the axes at all, so "good" features are
combinations of the original ones?
Imagine you are observing a swarm of flies; your data are their
exact coordinates in the room, so the position of each fly is
described by three numbers. Then you discover that your flies
actually fly in a formation: they are (almost) on the same line. You
could then describe the position of each fly with a single number
that represents the fly's position along the line. Plus, you need to
know where in the space the line lies. We call this line the first
principal component. By using it, we reduce the three-dimensional
space into a single dimension.
After some careful observation, you notice the flies are a bit spread
in one other direction, so they do not fly along a line but along a
band. Therefore, we need two numbers, one along the first and one
along the — you guessed it — second principal component.
It turns out the flies are actually also spread in the third direction.
Thus you need three numbers after all.
Or do you? It all depends on how spread they are in the second and
in the third direction. If the spread along the second is relatively
small in comparison with the first, you are fine with a single
dimension. If not, you need two, but perhaps still not three.
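The “line through the swarm” can be found numerically: center the data, build the covariance matrix, and repeatedly multiply a vector by it until it settles on the direction of largest variance (power iteration). A sketch with a made-up swarm; real PCA implementations use an eigendecomposition instead:

```python
from math import sqrt

def first_principal_component(data, steps=200):
    """Direction of largest variance, via power iteration on the
    covariance matrix of the (centered) data."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    X = [[row[j] - means[j] for j in range(d)] for row in data]  # center
    cov = [[sum(X[i][a] * X[i][b] for i in range(n)) / n
            for b in range(d)] for a in range(d)]
    v = [1.0] * d
    for _ in range(steps):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]  # keep the direction, drop the length
    return v

# Flies that (almost) fly along the line y = x:
swarm = [(0, 0.1), (1, 0.9), (2, 2.1), (3, 2.9), (4, 4.1)]
pc = first_principal_component(swarm)
print([round(x, 2) for x in pc])  # both coordinates ≈ 0.71, the 45° line
```

Projecting each fly onto this direction gives the single number that describes its position along the line; the spread left over in the perpendicular directions decides whether one dimension is enough.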
Let’s step back a bit: why would one who carefully measured expressions of ten thousand genes want to throw most data away and reduce it to a dozen dimensions? Because the data, in general, need not have as many intrinsic dimensions as there are features. Say you have an experiment in which you spill different amounts of …
The data separated so well that these two dimensions alone may
suffice for building a good classifier. No, wait, it gets even better.
The data classes are separated well even along the first
component. So we should be able to build a classifier from a
single feature!
In the above schema we use the ordinary Test & Score widget, but have renamed it to “Test on original data” for better understanding of the workflow.

PCA is thus useful for multiple purposes. It can simplify our data by combining the existing features into a much smaller number of features without losing much information. The directions of these features may tell us something about the data. Finally, it can find us good two-dimensional projections that we can observe in scatter plots.