SUBMITTED TO
Prof. Swapna Dutta Khan

Submitted by
LAVLESH UPADHYAY, MAMTA SINGH, MADHURIMA ROY, MAHENDER SUYAL, MAHESH KUMAR
CLUSTER ANALYSIS
INTRODUCTION
Cluster analysis is a major technique for classifying a mountain of information into manageable, meaningful piles. It is a data-reduction tool that creates subgroups that are more manageable than individual data points. Like factor analysis, it examines the full complement of interrelationships between variables. Both cluster analysis and discriminant analysis are concerned with classification. However, the latter requires prior knowledge of the membership of each cluster in order to classify new cases; in cluster analysis there is no prior knowledge about which elements belong to which clusters. The groupings or clusters are defined through an analysis of the data. Subsequent multivariate analyses can be performed on the clusters as groups.
In medicine, for example, cluster analysis has been used to develop taxonomies of illnesses. In the field of business, clusters of consumer segments are often sought for successful marketing strategies. Cluster analysis (CA) is an exploratory data analysis tool for organizing observed data (e.g. people, things, events, brands, companies) into meaningful taxonomies, groups, or clusters, based on combinations of independent variables, maximizing the similarity of cases within each cluster while maximizing the dissimilarity between groups that are initially unknown. In this sense, CA creates new groupings without any preconceived notion of what clusters may arise, whereas discriminant analysis classifies people and items into already known groups. CA provides no explanation as to why the clusters exist, nor is any interpretation made. Each cluster thus describes, in terms of the data collected, the class to which its members belong. Items in each cluster are similar in some ways to each other and dissimilar to those in other clusters.
In retailing, cluster analysis has been used to group purchase patterns as a basis for store layouts, maximizing spontaneous purchasing opportunities. Banking institutions have used hierarchical cluster analysis to develop a typology of customers, for two purposes:
1 To retain the loyalty of members by designing the best possible new financial products to meet the needs of different groups (clusters), i.e. new product opportunities.
2 To capture more market share by identifying which existing services are most profitable for which type of customer, and so improve market penetration.
One major bank completed a cluster analysis on a representative sample of its members, according to 16 variables chosen to reflect the characteristics of their financial transaction patterns. From this analysis, 30 types of members were identified. The results were useful for marketing, enabling the bank to focus on products which had the best financial performance; to reduce direct mailing costs and increase response rates by targeting product promotions at those customer types most likely to respond; and, consequently, to achieve better branding and customer retention. This facilitated differential direct advertising of services and products to the various clusters, which differed inter alia by age, income, risk-taking levels, and self-perceived financial needs. In this way, the bank could retain and win the business of more profitable customers at lower costs. Cluster analysis, like factor analysis, makes no distinction between dependent and independent variables: the entire set of interdependent relationships is examined. Cluster analysis is the obverse of factor analysis. Whereas factor analysis reduces the number of variables by grouping them into a smaller set of factors, cluster analysis reduces the number of observations or cases by grouping them into a smaller set of clusters.
Because we usually don't know the number of groups or clusters that will emerge in our sample, and because we want an optimum solution, a two-stage sequence of analysis occurs as follows:
1 We carry out a hierarchical cluster analysis using Ward's method, applying the squared Euclidean distance as the distance or similarity measure. This helps to determine the optimum number of clusters we should work with.
2 The next stage is to rerun the analysis as a k-means clustering with our selected number of clusters, which enables us to allocate every case in our sample to a particular cluster.
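Outside SPSS, this two-stage sequence can be sketched in Python with scipy and scikit-learn. This is a minimal illustration, not the SPSS procedure itself; the data array X and all parameter choices are hypothetical.

import numpy as np
from scipy.cluster.hierarchy import linkage
from sklearn.cluster import KMeans

# Hypothetical data: 20 cases measured on 4 variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))

# Stage 1: hierarchical clustering, Ward's method.
# scipy's "ward" linkage is based on Euclidean distances.
Z = linkage(X, method="ward")

# Inspect the last few fusion distances: a large jump between
# successive fusions suggests a sensible number of clusters.
print(Z[-5:, 2])

# Stage 2: rerun with the chosen number of clusters (say k = 3),
# allocating every case to a particular cluster.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_)  # cluster membership for each case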
As a simple illustration, consider a dendrogram of four cases, A to D, in which clusters B and C are combined at a fusion value of 2, and BC is combined with A at 4. The fusion values or linkage distances are calculated by SPSS. The goal of the clustering algorithm is to join objects together into successively larger clusters, using some measure of similarity or distance. At the left of the dendrogram we begin with each object or case in a class by itself (in our example there are only four cases). In very small steps, we relax our criterion as to what is and is not unique; put another way, we lower our threshold regarding the decision when to declare two or more objects to be members of the same cluster.
As a result, we link more and more objects together and amalgamate larger and larger clusters of increasingly dissimilar elements. Finally, in the last step, all objects are joined together as one cluster. In these plots, the horizontal axis denotes the fusion or linkage distance. For each node in the graph (where a new cluster is formed) we can read off the criterion distance at which the respective elements were linked together into a new single cluster. As a general process, clustering can be summarized as follows:
1 The distance is calculated between all initial clusters. In most analyses, initial clusters will be made up of individual cases.
2 The two most similar clusters are fused and the distances recalculated.
3 Step 2 is repeated until all cases are eventually in one cluster.
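The fusion process and the dendrogram that records it can be reproduced with scipy. A sketch follows; the four cases A to D and their coordinates are invented purely to echo the small example above.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Four cases, as in the A-D example; coordinates are illustrative only.
X = np.array([[0.0, 0.0],   # A
              [3.0, 0.0],   # B
              [3.0, 2.0],   # C
              [9.0, 2.0]])  # D

Z = linkage(X, method="ward")
print(Z)  # each row: the two clusters fused, their fusion distance, cluster size

# Horizontal dendrogram: the axis shows the fusion (linkage) distance.
dendrogram(Z, labels=["A", "B", "C", "D"], orientation="right")
plt.xlabel("Fusion distance")
plt.show()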
DISTANCE MEASURES
Distance can be measured in a variety of ways. There are distances that are Euclidean (can be measured with a ruler) and there are other distances based on similarity. For example, in terms of kilometer distance (a Euclidean distance) Perth, Australia is closer to Jakarta, Indonesia, than it is to Sydney, Australia. However, if distance is measured in terms of the cities' characteristics, Perth is closer to Sydney (e.g. both lie on a big river estuary, straddle both sides of the river, have surfing beaches, and are English speaking, etc.). A number of distance measures are available within SPSS. The most commonly used is the squared Euclidean distance, which is preferred to the simple Euclidean distance because it places progressively greater weight on objects that are further apart. Euclidean (and squared Euclidean) distances are usually computed from raw data, and not from standardized data. Having selected how we will measure distance, we must now choose the clustering algorithm, i.e. the rules that govern between which points distances are measured to determine cluster membership. There are many methods available; the criteria they use differ, and hence different classifications may be obtained for the same data. This is important, since it tells us that although cluster analysis may provide an objective method for the clustering of cases, there can be subjectivity in the choice of method. SPSS provides five clustering algorithms, the most commonly used being Ward's method.
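As a concrete example, the simple and squared Euclidean distances between two cases can be computed with scipy. The two three-variable profiles below are invented for illustration.

from scipy.spatial.distance import euclidean, sqeuclidean

# Two hypothetical cases measured on three variables.
case_a = [4.0, 1.0, 7.0]
case_b = [5.0, 3.0, 6.0]

d = euclidean(case_a, case_b)     # sqrt((4-5)^2 + (1-3)^2 + (7-6)^2) = 2.449...
d2 = sqeuclidean(case_a, case_b)  # (4-5)^2 + (1-3)^2 + (7-6)^2 = 6.0

# Squaring places progressively greater weight on objects further apart.
print(d, d2)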
Ward's method
This method is distinct from other methods because it uses an analysis of variance approach to evaluate the distances between clusters. In general, this method is very efficient. Cluster membership is assessed by calculating the total sum of squared deviations from the mean of a cluster. The criterion for fusion is that it should produce the smallest possible increase in the error sum of squares.
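The criterion can be made concrete in a few lines of Python: for any candidate fusion we compute the error sum of squares (ESS) before and after merging, and prefer the fusion with the smallest increase. This is a bare-bones sketch of the criterion only, not SPSS's implementation.

import numpy as np

def ess(cluster):
    """Error sum of squares: total squared deviation from the cluster mean."""
    cluster = np.asarray(cluster, dtype=float)
    return ((cluster - cluster.mean(axis=0)) ** 2).sum()

def ward_increase(c1, c2):
    """Increase in ESS caused by fusing clusters c1 and c2."""
    return ess(np.vstack([c1, c2])) - ess(c1) - ess(c2)

# Two nearby clusters fuse cheaply; a distant one is postponed.
a = np.array([[1.0, 1.0], [1.2, 0.9]])
b = np.array([[1.1, 1.1], [0.9, 1.0]])
c = np.array([[8.0, 8.0], [8.1, 7.9]])
print(ward_increase(a, b))  # small increase -> fused first
print(ward_increase(a, c))  # large increase -> fused last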
K-means clustering
This method of clustering is very different from hierarchical clustering with Ward's method, which is applied when there is no prior knowledge of how many clusters there may be or what characterizes them. K-means clustering is used when you already have a hypothesis concerning the number of clusters in your cases or variables: you may want to tell the computer to form exactly three clusters that are to be as distinct as possible. This is the type of research question that can be addressed by the k-means clustering algorithm. In general, the k-means method will produce exactly the k clusters demanded, of the greatest possible distinction. Very frequently, both the hierarchical and the k-means techniques are used successively: the former (Ward's method) is used to get some sense of the possible number of clusters and the way they merge, as seen from the dendrogram; then the clustering is rerun with only the chosen optimum number of clusters, in which to place all the cases (k-means clustering). One of the biggest problems with cluster analysis is identifying the optimum number of clusters. As the fusion process continues, increasingly dissimilar clusters must be fused, i.e. the classification becomes increasingly artificial.
Deciding upon the optimum number of clusters is largely subjective, although looking at a dendrogram (see Fig. 23.1) may help. Clusters are interpreted solely in terms of the variables included in them. Clusters should also contain at least four elements; once we drop to three or two elements, a cluster ceases to be meaningful.

Example
A keep-fit gym group wants to determine the best grouping of their customers with regard to the type of fitness work programs they want, in order to facilitate timetabling and staffing by specially qualified staff. A hierarchical analysis is run, and three major clusters stand out on the dendrogram between everyone being initially in a separate cluster and the final single cluster. This is then quantified using a k-means cluster analysis with three clusters, which reveals that the means of the different measures of physical fitness do indeed produce the three clusters (i.e. customers in cluster 1 are high on measure 1, low on measure 2, etc.).

Interpretation of results
The cluster centroids produced by SPSS are essentially the means of the cluster scores for the elements of each cluster. We then usually examine the means for each cluster on each dimension using ANOVA to assess how distinct our clusters are. Ideally, we would obtain significantly different means for most, if not all, dimensions used in the analysis. The magnitude of the F values performed on each dimension is an indication of how well the respective dimension discriminates between clusters. It is useful to create in SPSS, as you will see below, a new variable on the data view file which indicates the cluster to which each case has been assigned; this cluster membership variable can be used in further analyses. Techniques for determining the reliability and validity of clusters are as yet undeveloped. However, one could conduct cluster analysis using several different distance measures provided by SPSS and compare the results. Alternatively, if the sample is large enough, it can be split in half, with clustering performed on each half and the results compared.
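The gym example can be sketched end to end in Python: run k-means with three clusters, store the cluster membership as a new variable, and run a one-way ANOVA on each dimension to see how well it discriminates between the clusters. The data and column names here are invented for illustration.

import numpy as np
import pandas as pd
from scipy.stats import f_oneway
from sklearn.cluster import KMeans

# Hypothetical fitness data: 90 customers on three measures.
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(90, 3)),
                  columns=["strength", "endurance", "flexibility"])

km = KMeans(n_clusters=3, n_init=10, random_state=1).fit(df)
df["cluster"] = km.labels_   # new cluster-membership variable
print(km.cluster_centers_)   # cluster centroids (means on each measure)

# One-way ANOVA per dimension: larger F values indicate dimensions
# that discriminate better between the clusters.
for col in ["strength", "endurance", "flexibility"]:
    groups = [g[col].to_numpy() for _, g in df.groupby("cluster")]
    f, p = f_oneway(*groups)
    print(f"{col}: F = {f:.1f}, p = {p:.3f}")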
The use of the cluster membership variable is illustrated with a marketing data set of shoppers, described by the following survey variables:

1. ANNUAL INCOME OF HOUSEHOLD (PERSONAL INCOME IF SINGLE) 1. Less than $10,000 2. $10,000 to $14,999 3. $15,000 to $19,999 4. $20,000 to $24,999 5. $25,000 to $29,999 6. $30,000 to $39,999 7. $40,000 to $49,999 8. $50,000 to $74,999 9. $75,000 or more
2. AGE
3. 25 thru 34
4. 35 thru 44
7. 65 and Over
5. EDUCATION 1. Grade 8 or less 2. Grades 9 to 11 3. Graduated high school 4. 1 to 3 years of college 5. College graduate 6. Grad Study
6. OCCUPATION 1. Professional/Managerial 2. Sales Worker 3. Factory Worker/Laborer/Driver 4. Clerical/Service Worker 5. Homemaker 6. Student, HS or College 7. Military 8. Retired 9. Unemployed
7. LIVED HOW LONG IN THE SAN FRANCISCO /OAKLAND/SAN JOSE AREA 1. Less than one year 2. One to three years 3. Four to six years 4. Seven to ten years 5. More than ten years
9. PERSONS IN YOUR HOUSEHOLD 1. One 2. Two 3. Three 4. Four 5. Five 6. Six 7. Seven 8. Eight 9. Nine or more
13. ETHNIC CLASSIFICATION 1. American Indian 2. Asian 3. Black 4. East Indian 5. Hispanic 6. Pacific Islander 7. White 8. Other
An important variable in determining clusters for the k-means algorithm is Annual Income. Plotting the frequencies of each income category within the two clusters shows that lower-income individuals are more often classified into Cluster 2, while higher-income individuals are classified into Cluster 1.
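With cluster membership stored as a variable, the frequencies behind such a plot can be tabulated directly. This sketch assumes a DataFrame with hypothetical "income" (category 1-9) and "cluster" columns; the values are invented.

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical survey extract: income category and assigned cluster.
df = pd.DataFrame({
    "income":  [1, 2, 8, 9, 1, 3, 8, 7, 2, 9],
    "cluster": [2, 2, 1, 1, 2, 2, 1, 1, 2, 1],
})

# Frequency of each income category within each cluster.
table = pd.crosstab(df["income"], df["cluster"])
print(table)

# The same table as a bar chart reproduces the plot described above.
table.plot(kind="bar")
plt.show()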
CONCLUSION
Summarizing the analyses of the best predictors into a short table, we get the following results. The table indicates the respective dominant class or category for each of the variables listed in the rows, i.e. the class or category that occurred with the greatest frequency within the respective cluster.

Variable           Cluster 1                  Cluster 2
Marital status     Married (dual income)      Single
Age                25 thru 44                 18 thru 24
Home ownership     Own                        Rent
Occupation         Professional/Managerial    Student (HS or College)
Annual income      $50,000 to $74,999         Less than $10,000
We can conclude from this table that Cluster 1 includes married individuals from dual-income households, between the ages of 25 and 44, who own houses, hold professional/managerial positions, and have annual incomes between $50,000 and $75,000. Cluster 2 members are single, between the ages of 18 and 24, rent their homes, are in high school or college, and earn less than $10,000. This clearly indicates that the clustering technique has helped us identify two meaningful, distinct groups of shoppers in our marketing data set. This information could be exploited to better serve the needs of the customers visiting the mall, improving sales and paving the way for higher profitability. For example, special marketing campaigns could be designed based on this better understanding of who visits the mall (e.g. special promotions for college students); the information could also be used to market store locations in the mall to prospective retailers who specifically cater to the groups we identified. In general, the better we know and understand our customers, the better prepared we are to serve them and, hence, to ensure a successful retail enterprise.