Data Science: CLASS 11: Clustering and Dimensionality Reduction
CLASS 11: CLUSTERING AND DIMENSIONALITY REDUCTION
AGENDA
I. CLUSTERING
CLUSTERING
• Clustering is an unsupervised method where observations are grouped together based on
their feature similarity, but not in a way optimized to predict a particular class or feature.
• You can think of clustering as just another form of dimensionality reduction: we are
reducing the k features used to build the clusters down to a single feature, the cluster labels themselves.
• K-means clustering creates clusters of data around centroids, the ‘average’ points of all
the observations assigned to each cluster.
• Hierarchical clustering groups data together by absolute distance, and then further
groups up the hierarchy when distances cross a given threshold. (Both approaches are sketched in code below.)
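For concreteness, here is a minimal scikit-learn sketch of both approaches; the two-feature blob data and the choice of three clusters are purely illustrative, not from the class materials:

```python
# Minimal sketch of k-means vs. hierarchical clustering with scikit-learn.
# The blob data and parameter choices here are illustrative only.
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, n_features=2, random_state=0)

# K-means: clusters form around centroids (the 'average' point of each cluster).
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Hierarchical (agglomerative) clustering: merges points/groups by distance.
hier_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

# Either way, the k original features are reduced to a single cluster-label feature.
print(kmeans_labels[:10], hier_labels[:10])
```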
K-MEANS IMPLEMENTATION
• The most popular implementation of K-means (called Lloyd’s algorithm) uses
the following process to ‘lock in’ on the data’s centroids (a from-scratch sketch appears below):
1. Pick k initial centroids (e.g. at random).
2. Assign each observation to its nearest centroid.
3. Recompute each centroid as the mean of the observations assigned to it.
4. Repeat steps 2–3 until the assignments stop changing.
• Cons:
• Tendency to converge to local minima or dense regions of data (especially if you
pick your starting centroids at random)
• Produces nonsensical centroids if the data do not form compact, well-separated groups
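Below is a from-scratch sketch of Lloyd’s algorithm in NumPy (my own minimal implementation, with random initialization, so it exhibits exactly the local-minimum risk noted above; it also assumes no cluster ever ends up empty):

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm: assign to nearest centroid, then recompute means."""
    rng = np.random.default_rng(seed)
    # Random initialization: this is exactly what can pull k-means into local minima.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points
        # (assumes every cluster keeps at least one point).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # assignments have stabilized
        centroids = new_centroids
    return labels, centroids

# Example usage on random 2-D data:
X_demo = np.random.default_rng(1).normal(size=(100, 2))
labels, centroids = lloyd_kmeans(X_demo, k=3)
```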
K-MEANS EXTENSIONS
• Due to the limitations of K-means, a number of related methods are more commonly
used to derive cluster meaning (three of them are sketched in code below):
• One simple extension of K-means is to repeatedly run the algorithm with different
initialization sets, and average the results.
• K-medoids assigns centroids to actual observations (medoids) in your dataset.
• Expectation-Maximization (EM) clustering derives clusters by calculating the
confidence that the found centroids are the ‘true’ centroids for the dataset, giving
each observation a soft (probabilistic) cluster membership.
• Density-Based Spatial Clustering of Applications with Noise (DBSCAN) groups points
that are densely packed together, based on the distances between neighboring points,
so it is robust to non-convex cluster shapes.
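Here is a rough scikit-learn sketch of three of these alternatives. One assumption to flag: scikit-learn’s KMeans(n_init=...) keeps the best-scoring of several random restarts rather than averaging them, and the eps / min_samples values for DBSCAN are made up:

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.mixture import GaussianMixture

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# Many random restarts: scikit-learn keeps the run with the lowest inertia.
km = KMeans(n_clusters=2, n_init=25, random_state=0).fit(X)

# EM clustering via a Gaussian mixture: soft, probabilistic cluster membership.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
soft_assignments = gmm.predict_proba(X)   # confidence per cluster, per point

# DBSCAN: density-based, so it recovers the non-convex 'moon' shapes.
db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)
```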
K-MEDOIDS
EM CLUSTERING
HIERARCHICAL CLUSTERING
• While k-means clustering tries to assign observations to a fixed number of discrete
clusters, hierarchical clustering attempts to determine the relationship between every
observation and every cluster, building a full hierarchy of merges that can be cut at any
distance threshold (see the sketch below).
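A small SciPy sketch of that idea: build the full hierarchy, then cut it wherever merge distances cross a threshold (the synthetic data and the threshold of 5.0 are arbitrary choices):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(4, 0.5, (50, 2))])

# Build the full hierarchy: each merge records which groups joined and at what distance.
Z = linkage(X, method="ward")

# Cut the hierarchy wherever merge distances cross a chosen threshold.
labels = fcluster(Z, t=5.0, criterion="distance")

# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree of merges (needs matplotlib).
```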
II. PRINCIPAL COMPONENT ANALYSIS
PRINCIPAL COMPONENT ANALYSIS
• Recall that in previous classes, you learned feature selection, i.e. a recursive process to
determine the subset of variables that allows your model to maximize predictive accuracy.
• However, the simpler methods you will learn today (such as PCA) will only
decompose your data properly if the variables have linear relationships with one another.
• In addition, PCA is dependent on the initial scaling of the variables: if one variable
is scaled to have a larger magnitude, it will dominate your decomposition (see the scaling sketch below).
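A common guard against this, sketched below with scikit-learn: standardize every variable before running PCA. The wine dataset and the two retained components are stand-ins, not class choices:

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, _ = load_wine(return_X_y=True)  # features on very different scales

# Without scaling, the largest-magnitude feature dominates the components;
# StandardScaler puts every feature on mean 0 / unit variance first.
pca = make_pipeline(StandardScaler(), PCA(n_components=2))
X_reduced = pca.fit_transform(X)
print(pca.named_steps["pca"].explained_variance_ratio_)
```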
PRINCIPAL COMPONENT ANALYSIS
• Principal Component Analysis (PCA) decomposes your (sometimes-correlated)
variables into a set of linearly uncorrelated variables called principal
components.
• Here’s how principal component analysis works (a worked NumPy sketch follows the reference below):
1. Scale each feature around 0 by subtracting its mean from each observation.
2. Calculate the covariance matrix between the scaled variables in the data.
3. Calculate the eigenvalues and eigenvectors of that matrix.
• Eigenvectors effectively work like a series of best-fit lines (similar in spirit to OLS
regression): each subsequent eigenvector fits the variance left over by the previous ones.
4. Sort the eigenvectors by the size of their corresponding eigenvalues and
determine a cutoff (typically near zero) below which you discard an eigenvector.
5. Project your original data onto each retained eigenvector. The projection onto the
first eigenvector (largest eigenvalue) is your first principal component.
• See http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf for
more information.
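The five steps above written out in NumPy, as a minimal sketch; the toy data (with one deliberately redundant column) and the near-zero eigenvalue cutoff are my own choices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X = np.column_stack([X, X[:, 0] + X[:, 1]])  # 4th column is redundant (correlated)

# 1. Center each feature around 0.
Xc = X - X.mean(axis=0)

# 2. Covariance matrix of the centered variables.
cov = np.cov(Xc, rowvar=False)

# 3. Eigenvalues/eigenvectors (eigh works on symmetric matrices like a covariance).
eigvals, eigvecs = np.linalg.eigh(cov)

# 4. Sort by eigenvalue, largest first, and discard near-zero eigenvalues.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
keep = eigvals > 1e-10            # the redundant direction gets dropped here

# 5. Project the centered data onto the retained eigenvectors:
#    the first column of `scores` is the first principal component.
scores = Xc @ eigvecs[:, keep]
```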
WHAT’S AN EIGENVALUE AND AN EIGENVECTOR?
• An eigenvector of a matrix A is a (nonzero) vector whose direction the matrix does not
change: A·v = λ·v. The eigenvalue λ is the scalar that stretches or shrinks that eigenvector;
for a covariance matrix, it measures how much variance lies along that direction.
• Use a scree plot (eigenvalues plotted in descending order) to determine how many
principal components you should keep (sketched below).
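A small sketch of both points: numerically checking A·v = λ·v, then drawing a scree plot from PCA’s explained variance (the wine dataset is just a stand-in):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Eigenvector/eigenvalue definition: A @ v equals lambda * v.
A = np.array([[2.0, 1.0], [1.0, 2.0]])
eigvals, eigvecs = np.linalg.eig(A)
v, lam = eigvecs[:, 0], eigvals[0]
print(np.allclose(A @ v, lam * v))  # True

# Scree plot: eigenvalues (explained variance) in descending order.
X, _ = load_wine(return_X_y=True)
pca = PCA().fit(StandardScaler().fit_transform(X))
plt.plot(range(1, len(pca.explained_variance_) + 1), pca.explained_variance_, "o-")
plt.xlabel("Principal component")
plt.ylabel("Eigenvalue (explained variance)")
plt.title("Scree plot: keep components before the curve flattens")
plt.show()
```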
III. SUPPORT VECTOR MACHINES
SUPPORT VECTOR MACHINES
• Support Vector Machines (SVMs) are a family of classifiers (and regressors) that use
techniques similar to PCA and other matrix-based dimensionality reduction methods.
• SVMs apply a function (called a kernel function) to the independent features and find
the combination of the function’s outputs that best separates the classes (for classification)
or best follows the variance of the response feature (for regression).
• SVMs work best when you have multiple ‘weak inputs’ that together carry a strong
underlying signal (a minimal usage sketch follows).
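A minimal classification sketch with scikit-learn’s SVC; the digits dataset, the RBF kernel, and C=1.0 are stand-in choices rather than anything prescribed by the class:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# SVMs are distance-based, so scaling the inputs first usually matters.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```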
SUPPORT VECTOR MACHINES
• Pros:
  • Accuracy on par with RFs and GBMs
  • Very good for image and text analysis
  • Works well with trending data (unlike RFs!)
  • No ‘jagged edges’ in regression
• Cons:
  • Computationally intensive
  • Hard to debug
  • Typically no intuition on which kernel function works best
KERNEL FUNCTIONS
• SVMs include the option of many different kernel functions, including:
• Linear
• Polynomial
• RBF
• Sigmoid
• Any Python function you want!
• You’ll typically have to use a grid search to determine which kernel is best, which can take
a very long time! (See the sketch below.)
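A sketch of that grid search over kernels with scikit-learn’s GridSearchCV; the small parameter grid and the digits dataset are illustrative only, and even this can be slow:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC())])
param_grid = {
    "svc__kernel": ["linear", "poly", "rbf", "sigmoid"],
    "svc__C": [0.1, 1, 10],
}
# Cross-validated search over every kernel/C combination.
search = GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```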
IN-CLASS EXERCISE: SVMS
• Run an SVM with a polynomial kernel.
• Perform a grid search to find the optimal kernel for our use case.
• Compare the accuracy of the final model to the PCA result and to the raw random forest.
(A starter sketch follows.)
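One possible starter for the exercise, with two assumptions flagged: the digits dataset stands in for whatever dataset the class is using, and “the PCA result” is interpreted here as a random forest trained on PCA-reduced features:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)  # stand-in for the class dataset

# 1. SVM with a polynomial kernel.
poly_svm = make_pipeline(StandardScaler(), SVC(kernel="poly", degree=3))
print("poly SVM:", cross_val_score(poly_svm, X, y, cv=5).mean())

# 2. Grid search over kernels to find the best one for this data.
grid = GridSearchCV(
    make_pipeline(StandardScaler(), SVC()),
    {"svc__kernel": ["linear", "poly", "rbf", "sigmoid"]},
    cv=5,
)
grid.fit(X, y)
print("best kernel:", grid.best_params_, grid.best_score_)

# 3. Baselines: random forest on PCA-reduced features vs. on the raw features.
pca_rf = make_pipeline(PCA(n_components=20), RandomForestClassifier(random_state=0))
raw_rf = RandomForestClassifier(random_state=0)
print("PCA + RF:", cross_val_score(pca_rf, X, y, cv=5).mean())
print("raw RF:  ", cross_val_score(raw_rf, X, y, cv=5).mean())
```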
‣CLUSTERING AND DIMENSIONALITY REDUCTION
ADIOS, AMIGOS!