Clustering in Python
Vijay Kumar Dwivedi
Clustering: Concept
• Given a set of records (instances, examples, objects, observations, …), organize them into clusters (groups, classes).
• Clustering: the process of grouping physical or abstract objects into classes of similar objects.

What is a Cluster?
• A cluster is a subset of objects which are "similar".
• A subset of objects such that the distance between any two objects in the cluster is less than the distance between any object in the cluster and any object outside it.
• A connected region of a multidimensional space containing a relatively high density of objects.

What is Clustering?
• Clustering is the process of partitioning a set of data (or objects) into a set of meaningful sub-classes, called clusters.
• It helps users understand the natural grouping or structure in a data set.
• Clustering is unsupervised classification: there are no predefined classes.
• It is used either as a stand-alone tool to get insight into the data distribution or as a preprocessing step for other algorithms.

What is Good Clustering?
• A good clustering method will produce high-quality clusters in which:
 – the intra-class (that is, intra-cluster) similarity is high.
 – the inter-class similarity is low.
• The quality of a clustering result also depends on both the similarity measure used by the method and its implementation.
• The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.
• However, objective evaluation of clustering quality is problematic.

Clustering: Applications
• Economic science (especially market research)
• WWW
 – Document classification
 – Clustering weblog data to discover groups of similar access patterns
• Pattern recognition
• Spatial data analysis
• Image processing

Main Categories of Clustering Methods
• Partitioning algorithms: construct various partitions and then evaluate them by some criterion.
• Hierarchical algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion.
• Density-based: based on connectivity and density functions.
• Grid-based: based on a multiple-level granularity structure.
• Model-based: a model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to the given model.

Partitioning Algorithms: Basic Concept
• Partitioning method: construct a partition of a database D of n objects into a set of k clusters.
• Given k, find a partition of k clusters that optimizes the chosen partitioning criterion.
 – Global optimum: exhaustively enumerate all partitions.
 – Heuristic methods: k-means and k-medoids algorithms.
 – k-means (MacQueen, 1967): each cluster is represented by the center of the cluster.
 – k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw, 1987): each cluster is represented by one of the objects in the cluster.

Simple Clustering: K-means
• The basic version works with numeric data only.
 1. Pick a number (K) of cluster centers (centroids) at random.
 2. Assign every item to its nearest cluster center (e.g. using Euclidean distance).
 3. Move each cluster center to the mean of its assigned items.
 4. Repeat steps 2 and 3 until convergence (change in cluster assignments less than a threshold).

Illustrating K-Means: Working
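The four numbered steps above map directly onto a few lines of code. The following is a minimal NumPy sketch of the basic K-means loop (not scikit-learn's implementation); the kmeans function name and its arguments are illustrative only.

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Basic K-means: pick centroids, assign points, recompute means, repeat."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k data points at random as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(n_iter):
        # Step 2: assign every point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: stop once the assignments no longer change
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of the points assigned to it
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels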
KMeans: Numerical Example
• Cluster the following eight points (with (x, y) representing locations) into three clusters:
 – A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9)
• Solution:
 – Initial cluster centers are A1(2, 10), A4(5, 8) and A7(1, 2), chosen at random.
 – In practice, select the initial cluster centers so that they are as far apart from each other as possible.
 – Calculate the distance between each data point and each cluster center.
 – The distance may be calculated either by using a given distance function or by using the Manhattan distance formula.
• Calculating the distance between A1(2, 10) and C1(2, 10):
 – ρ(A1, C1) = |x2 – x1| + |y2 – y1| = |2 – 2| + |10 – 10| = 0
• Calculating the distance between A1(2, 10) and C2(5, 8):
 – ρ(A1, C2) = |x2 – x1| + |y2 – y1| = |5 – 2| + |8 – 10| = 3 + 2 = 5
• Calculating the distance between A1(2, 10) and C3(1, 2):
 – ρ(A1, C3) = |x2 – x1| + |y2 – y1| = |1 – 2| + |2 – 10| = 1 + 8 = 9
• Similarly, compute the distances for the remaining points.
• After the first assignment, the three clusters are:
 – C1: A1(2, 10)
 – C2: A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A8(4, 9)
 – C3: A2(2, 5), A7(1, 2)

Recompute New Cluster Centers
• The new cluster center is computed by taking the mean of all the points contained in that cluster.
• Cluster-01 contains only one point, A1(2, 10), so its center remains the same.
• Center of Cluster-02 = ((8 + 5 + 7 + 6 + 4)/5, (4 + 8 + 5 + 4 + 9)/5) = (6, 6)
• Center of Cluster-03 = ((2 + 1)/2, (5 + 2)/2) = (1.5, 3.5)

Iteration 2
• Again compute the distance of all points to the newly computed cluster centers C1(2, 10), C2(6, 6), C3(1.5, 3.5).
• After the second iteration, the cluster centers are:
 – C1(3, 9.5)
 – C2(6.5, 5.25)
 – C3(1.5, 3.5)
• This process continues until no point changes its cluster assignment.
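For reference, these two iterations can be reproduced with a short NumPy sketch using the same Manhattan (city-block) distance; the variable names here are illustrative.

import numpy as np

# The eight points A1..A8 and the initial centers A1, A4, A7
points = np.array([[2, 10], [2, 5], [8, 4], [5, 8], [7, 5], [6, 4], [1, 2], [4, 9]])
centers = np.array([[2, 10], [5, 8], [1, 2]], dtype=float)

for it in (1, 2):
    # Manhattan distance from every point to every center
    dists = np.abs(points[:, None, :] - centers[None, :, :]).sum(axis=2)
    labels = dists.argmin(axis=1)   # nearest center for each point
    # Recompute each center as the mean of the points assigned to it
    centers = np.array([points[labels == j].mean(axis=0) for j in range(3)])
    print(f"After iteration {it}: centers = {centers.tolist()}")

# Matches the hand calculation above: (2, 10), (6, 6), (1.5, 3.5) after iteration 1
# and (3, 9.5), (6.5, 5.25), (1.5, 3.5) after iteration 2.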
Kmeans: Advantages

Kmeans: Disadvantages
• Choosing K manually.
• Being dependent on initial values.
• Clustering data of varying sizes and density.
• Clustering outliers.
• Scaling with the number of dimensions.

K-means Clustering
• Importing packages
 – import pandas as pd
 – from sklearn.cluster import KMeans
• Loading the data set
 – df = pd.read_csv('iris.csv')
• Extracting columns
 – X = df[['sepal_length', 'sepal_width']]
 – Y = df['species']
• Creating a K-means based clustering model
 – kmeans = KMeans(n_clusters=3, random_state=0).fit(X)

K-means Clustering Contd…
• Printing the cluster centroids
 – print(kmeans.cluster_centers_)
• Getting the cluster labels
 – print(kmeans.labels_)
• Plotting the cluster centers and the data points on a 2D plane
 – import matplotlib.pyplot as plt
 – plt.scatter(X['sepal_length'], X['sepal_width'])
 – plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], c='red', marker='x')
 – plt.title('Data points and cluster centroids')
 – plt.show()

K-Means Clustering Contd…
• Checking cluster quality using the silhouette score
 – from sklearn.metrics import silhouette_score
 – print(silhouette_score(X, kmeans.labels_))

Hierarchical Clustering: Agglomerative
• In this technique, each data point is initially considered an individual cluster.
• At each iteration, the most similar clusters are merged until one cluster (or K clusters) remains.

Hierarchical Clustering: Agglomerative
• The basic agglomerative algorithm is as follows:
 – Compute the proximity matrix.
 – Let each data point be a cluster.
 – Repeat: merge the two closest clusters and update the proximity matrix.
 – Until only a single cluster remains.

Hierarchical Clustering: Agglomerative Example

Hierarchical Clustering: Visualization
• The hierarchical clustering technique can be visualized using a dendrogram.
• A dendrogram is a tree-like diagram that records the sequences of merges or splits.

Agglomerative Clustering
• Importing packages
 – import scipy.cluster.hierarchy as shc
 – from sklearn.cluster import AgglomerativeClustering
• Plotting dendrograms
 – dend = shc.dendrogram(shc.linkage(X, method='ward'))
• Creating the model
 – cluster = AgglomerativeClustering(n_clusters=5, affinity='euclidean', linkage='ward')

Agglomerative Clustering Contd…
• Performing predictions
 – cluster.fit_predict(X)
• Plotting clusters
 – plt.scatter(X['sepal_length'], X['sepal_width'], c=cluster.labels_, cmap='rainbow')
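Putting the agglomerative fragments together, one possible runnable sketch is shown below. It assumes the same iris.csv file and sepal columns used for K-means above; note that recent scikit-learn releases rename the affinity parameter to metric (with ward linkage the distance is Euclidean either way), so the sketch simply omits it.

import pandas as pd
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as shc
from sklearn.cluster import AgglomerativeClustering

# Assumes an iris.csv with sepal_length and sepal_width columns, as in the K-means slides
df = pd.read_csv('iris.csv')
X = df[['sepal_length', 'sepal_width']]

# Dendrogram showing the sequence of Ward-linkage merges
plt.figure(figsize=(8, 5))
shc.dendrogram(shc.linkage(X, method='ward'))
plt.title('Dendrogram')
plt.show()

# Agglomerative model with Ward linkage (Euclidean distance is implied)
cluster = AgglomerativeClustering(n_clusters=5, linkage='ward')
labels = cluster.fit_predict(X)

# Scatter plot of the points coloured by cluster label
plt.scatter(X['sepal_length'], X['sepal_width'], c=labels, cmap='rainbow')
plt.title('Agglomerative clusters')
plt.show()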