Lecture 10 Clustering and Classification
Alexander Zhigalov / Dept. of CS, University of Helsinki and Dept. of NBE, Aalto University
Outline / overview
Section 1. Machine learning
Section 2. Clustering
Section 3. Classification
Section 4. Regression
Section 1. Machine learning
Section 2. Clustering
K-means clustering
How it works
Partitions the data into k mutually exclusive clusters. How well a
point fits into a cluster is determined by the distance from that
point to the cluster’s center.
Result
Cluster centers
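Not from the lecture script: a minimal, self-contained sketch of the k-means workflow, assuming scikit-learn and synthetic 2-D data; the names K, Y, Z and labels follow the slides.

import numpy as np
from sklearn import cluster

K = 3
rng = np.random.default_rng(0)
# three well-separated 2-D blobs, 50 points each
Y = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in ([0, 0], [4, 0], [0, 4])])

# fit k-means and inspect the result
model = cluster.KMeans(n_clusters=K, n_init=10)
model.fit(Y)
labels = model.labels_            # cluster index of each point
Z = model.cluster_centers_        # K x 2 array of cluster centers
print(model.inertia_)             # within-cluster sum-of-squares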
# measurements
Y = X[np.random.permutation(M*R), :]
# fit k-means with K clusters (fit call assumed; the slide lists only the outcome)
model = cluster.KMeans(n_clusters=K)
model.fit(Y)
# clustering outcome
labels = model.labels_
Z = model.cluster_centers_
inertia = model.inertia_
print(inertia)  # within-cluster sum-of-squares
# output: inertia = 0.0
See, “L10_clustering_kmeans.py”
# create noisy copies of each source pattern
X[i] = np.tile(S[i, :], (R, 1)) + np.random.randn(R, N) * 0.5
# measurements
Y = X[np.random.permutation(M*R), :]
# output: inertia = 2550.0
See, “L10_clustering_kmeans.py”
Hierarchical clustering
How it works
Produces nested sets of clusters by analyzing similarities
between pairs of points and grouping objects into a binary,
hierarchical tree.
Result
Dendrogram showing the hierarchical relationship
between clusters
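Not from the lecture script: a minimal sketch of hierarchical clustering on synthetic data, assuming scikit-learn for the flat clustering and SciPy for the dendrogram.

import numpy as np
import matplotlib.pyplot as plt
from sklearn import cluster
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
# two synthetic groups of 2-D points
Y = np.vstack([rng.normal(0.0, 0.3, (30, 2)),
               rng.normal(3.0, 0.3, (30, 2))])

# flat clustering into two groups
model = cluster.AgglomerativeClustering(n_clusters=2)
labels = model.fit_predict(Y)

# full hierarchy shown as a dendrogram (Ward linkage)
dendrogram(linkage(Y, method='ward'))
plt.show()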
# clustering
model = cluster.AgglomerativeClustering()
model.fit(Y)
labels = model.labels_
children = model.children_
See, “L10_clustering_hierarchical.py”
Noisy measurements
# measurements
Y = X[np.random.permutation(M*R), :]
See, “L10_clustering_hierarchical.py”
# clustering
model = cluster.AgglomerativeClustering(n_clusters=K)
model.fit(Y)
labels = model.labels_
# sort matrix
indices = np.squeeze(np.argsort(labels))
CZ = CY
CZ = CZ[indices, :]
CZ = CZ[:, indices]
See, “L10_clustering_hierarchical_2D.py”
# covariance
CX = np.cov(X)
CY = np.cov(Y)
# clustering
model = cluster.AgglomerativeClustering(n_clusters=K)
model.fit(Y)
labels = model.labels_
# sort matrix
indices = np.squeeze(np.argsort(labels))
CZ = CY
CZ = CZ[indices, :]
CZ = CZ[:, indices]
See, “L10_clustering_hierarchical_2D.py”
Gaussian mixture models
How it works
Partition-based clustering where data points are assumed to
come from different multivariate normal distributions with
certain probabilities.
Result
A model of Gaussian distributions that gives the probability of
a point belonging to each cluster
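Not from the lecture script: a minimal sketch of fitting a Gaussian mixture model with scikit-learn on synthetic 1-D data; K and X follow the slides' naming.

import numpy as np
from sklearn import mixture

K = 2
rng = np.random.default_rng(0)
# 1-D samples from two Gaussians with different means and variances
X = np.concatenate([rng.normal(0.0, 0.5, 200),
                    rng.normal(4.0, 1.25, 200)]).reshape(-1, 1)

model = mixture.GaussianMixture(n_components=K)
model.fit(X)
labels = model.predict(X)         # hard cluster assignments
probs = model.predict_proba(X)    # per-point cluster probabilities
print(model.means_, model.covariances_)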
# generate data
Z = [np.random.randn(1, N) * 0.5 + 0.0,
     np.random.randn(1, N) * 1.25 + 4.0]
# Gaussian PDF
b = np.linspace(-3, 10, 1000)
p0 = norm.pdf(b, np.mean(Z[0]), np.std(Z[0]))
p1 = norm.pdf(b, np.mean(Z[1]), np.std(Z[1]))
See, “L10_clustering_gmm.py”
# fit model
model = mixture.GaussianMixture(n_components=K)
model.fit(X)
# model properties
Y = model.predict(X)
model_mu = model.means_
model_cov = model.covariances_
See, “L10_clustering_gmm.py”
Section 3. Classification
Support vector machines (SVM)
How it works
Classifies data by finding the linear decision boundary
(hyperplane) that separates all data points of one class from
those of the other class. The best hyperplane for an SVM is
the one with the largest margin between the two classes.
Result
Training/fitting transforms data and labels to coefficients,
while testing/prediction transforms data and coefficients to
labels.
When it works
A basic SVM is a binary classifier: it requires data with exactly two
classes. Multiclass problems are typically handled by combining
several binary SVMs.
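Not from the lecture script: a minimal binary-classification sketch with a linear SVM, assuming scikit-learn and a synthetic two-class data set with a train/test split.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# two classes of 2-D points separated along the first axis
X = np.vstack([rng.normal(-1.0, 1.0, (100, 2)),
               rng.normal(+1.0, 1.0, (100, 2))])
y = np.r_[np.zeros(100), np.ones(100)]

# random split: train on one half, test on the other
idx = rng.permutation(200)
train, test = idx[:100], idx[100:]

model = SVC(kernel='linear')
model.fit(X[train], y[train])
coef, intercept = model.coef_, model.intercept_   # hyperplane parameters
acc = np.mean(model.predict(X[test]) == y[test])
print('accuracy: %1.2f' % acc)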
# binary labels
Y = get_sequence(5, 0.8, N)
# train classifier
model = SVC(kernel='linear')
model.fit(XY.T, Y)
See, “L10_classification_svm_2_signals.py”
# decision function
Z = np.zeros(N)
for i in range(0, N):
    Z[i] = np.sum(X[:, i] * coef) + intercept
# testing
v = U
u = model.predict(XU.T)
u = u > 0.5
# accuracy
a = np.mean(v == u)
print('accuracy: %1.2f' % (a))
# output: accuracy = 99%
See, “L10_classification_svm_2_signals.py”
# binary labels
Y = get_sequence(5, 0.8, N)
# train classifier
model = SVC(kernel='linear')
model.fit(XY.T, Y)
See, “L10_classification_svm_2_signals.py”
# decision function
Z = np.zeros(N)
for i in range(0, N):
    Z[i] = np.sum(X[:, i] * coef) + intercept
# testing
v = U
u = model.predict(XU.T)
u = u > 0.5
# accuracy
a = np.mean(v == u)
print('accuracy: %1.2f' % (a))
# output: accuracy = 76%
See, “L10_classification_svm_2_signals.py”
# binary labels
y = get_sequence(5, 0.8, N)
# output: accuracy = 100%
See, “L10_classification_svm.py”
# binary labels
y = get_sequence(5, 0.8, N)
# output: accuracy = 67%
See, “L10_classification_svm.py”
Discriminant analysis
How it works
Discriminant analysis classifies data by finding linear
combinations of features that separate the classes. It assumes
that each class generates data from a Gaussian distribution.
Result
Coefficients
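Not from the lecture material: a minimal sketch of linear discriminant analysis on synthetic two-class Gaussian data, assuming scikit-learn's LinearDiscriminantAnalysis.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# two Gaussian classes with shifted means
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)),
               rng.normal(2.0, 1.0, (100, 2))])
y = np.r_[np.zeros(100), np.ones(100)]

model = LinearDiscriminantAnalysis()
model.fit(X, y)
print(model.coef_, model.intercept_)   # linear combination of features
print('training accuracy: %1.2f' % np.mean(model.predict(X) == y))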
Logistic regression
How it works
Fits a model that can predict the probability of a binary
response belonging to one class or the other. Because of its
simplicity, logistic regression is commonly used as a starting
point for binary classification problems.
Result
Coefficients
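Not from the lecture material: a minimal logistic-regression sketch on a synthetic binary response, assuming scikit-learn.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, (200, 1))
# noisy binary response driven by the single predictor
y = (X[:, 0] + rng.normal(0.0, 0.5, 200) > 0).astype(int)

model = LogisticRegression()
model.fit(X, y)
print(model.coef_, model.intercept_)   # fitted coefficients
p = model.predict_proba(X)[:, 1]       # probability of class 1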
Section 4. Regression
Linear regression
How it works
Linear regression is a statistical modeling technique used to
describe a continuous response variable as a linear function
of one or more predictor variables.
Result
Coefficients
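Not from the lecture material: a minimal linear-regression sketch on synthetic data, assuming scikit-learn; the fitted coefficients are the slope and intercept.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100).reshape(-1, 1)
y = 2.0 * x[:, 0] + 1.0 + rng.normal(0.0, 1.0, 100)   # linear trend plus noise

model = LinearRegression()
model.fit(x, y)
print(model.coef_, model.intercept_)   # estimated slope and intercept
y_hat = model.predict(x)               # fitted continuous response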
Nonlinear regression
How it works
Nonlinear regression is a statistical modeling technique that
helps describe nonlinear relationships in experimental data.
Result
Coefficients
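Not from the lecture material: a minimal nonlinear-regression sketch, assuming SciPy's curve_fit and a hypothetical exponential-decay model.

import numpy as np
from scipy.optimize import curve_fit

def decay(x, a, b, c):
    # hypothetical exponential-decay model of the response
    return a * np.exp(-b * x) + c

rng = np.random.default_rng(0)
x = np.linspace(0, 5, 100)
y = decay(x, 2.5, 1.3, 0.5) + rng.normal(0.0, 0.1, 100)

# least-squares fit of the nonlinear model; popt holds the coefficients
popt, pcov = curve_fit(decay, x, y, p0=[1.0, 1.0, 0.0])
print(popt)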
SVM regression
How it works
SVM regression algorithms work like SVM classification
algorithms, but are modified to be able to predict a
continuous response.
Result
Coefficients
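Not from the lecture material: a minimal SVM-regression sketch on a noisy sine wave, assuming scikit-learn's SVR.

import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 5, 100)).reshape(-1, 1)
y = np.sin(x[:, 0]) + rng.normal(0.0, 0.1, 100)   # noisy nonlinear response

model = SVR(kernel='rbf', C=1.0, epsilon=0.1)
model.fit(x, y)
y_hat = model.predict(x)                          # continuous predictions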
Literature
• Data analysis
- Andreas Müller and Sarah Guido “Introduction to Machine Learning with Python: A Guide for Data Scientists”