Cluster Analysis 2: Theory and Practice

Category: Education

Published on May 11, 2008

Author: untellectualism

Source: authorstream.com


Slide 1: Cluster Analysis: Theory and Practice
- Discuss cluster analysis for different kinds of data
- Obtain clusters
- Evaluate clusters with respect to internal/external criteria
- Combine results arising from different approaches
- Discuss the influence of outliers on cluster analysis

Slide 2: Cluster Analysis: Numerical Variables
Example 1. Information and communications data – the world's countries.

Slide 3: Cluster Analysis for Numerical Variables
Dissimilarity measures for numerical variables.
Euclidean distance: $d_{ik} = \sqrt{\sum_{j=1}^{p} (x_{ij} - x_{kj})^2}$
Statistical distance: $d_{ik} = \sqrt{\sum_{j=1}^{p} (z_{ij} - z_{kj})^2}$, where $z_{ij}$ is the standardized value corresponding to $x_{ij}$.
When clusters have to be obtained on the basis of a vector of measurements on p variables (a data matrix), the dissimilarity between two cases may be calculated by referring to the standard Euclidean distance or to the statistical distance. Notice that squared deviations are involved: as a consequence, extreme values on a given variable have a great influence on the resulting dissimilarity. Moreover, extreme observations will be very dissimilar from all the others, so the regular observations may end up clustered together independently of their differences (clusters of regular vs clusters of extreme observations).

Slide 4: Cluster Analysis for Numerical Variables
An alternative criterion, based on absolute rather than squared deviations, is the Manhattan (or city block) distance: $d_{ik} = \sum_{j=1}^{p} |x_{ij} - x_{kj}|$.
Also in this case a transformation similar to standardization may be applied: the absolute deviation relative to the j-th variable may be divided by
- the range, $R_j$ = (highest value – lowest value) of the j-th variable, or
- the MAD, the median of the absolute deviations from the median.
The second criterion is less sensitive to outliers (which may strongly influence the range) and resembles standardization: we divide by a 'standard' deviation, here a synthesis (the median) of the deviations from the median. A small SAS sketch of both distances follows.
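To make these measures concrete, here is a minimal SAS sketch. The mini-dataset and the three indicator variables are hypothetical stand-ins for the Example 1 data; PROC DISTANCE is the same procedure used in the slides that follow.

/* Hypothetical mini-dataset: three ICT indicators per country */
data ict;
   input country $ tel_lines cell_subs internet_users;
   datalines;
A 45 80 60
B 12 55 10
C 48 85 65
;
run;

/* Statistical distance: Euclidean distance on standardized values */
proc distance data=ict method=euclid out=d_stat;
   copy country;
   var interval(tel_lines cell_subs internet_users / std=std);
run;

/* Manhattan distance with absolute deviations divided by the MAD */
proc distance data=ict method=cityblock out=d_city;
   copy country;
   var interval(tel_lines cell_subs internet_users / std=mad);
run;

proc print data=d_stat noobs; run;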
Slide 5: Cluster Analysis for Numerical Variables – SAS Procedures
Hierarchical cluster analysis for numerical variables.

Euclidean or statistical distance (the STD option must be specified to work with the statistical distance):

PROC CLUSTER data=dataset STD other_options;
   id label_obs;
   var list_of_numerical_vars;
run;

Manhattan (city block) distance, where method is the (optional) standardization method, RANGE or MAD:

PROC DISTANCE data=dataset method=cityblock out=d_cityblock;
   copy label_obs;
   var interval(list_of_numerical_vars / std=method);
run;

PROC CLUSTER data=d_cityblock(type=distance) other_options;
   id label_obs;
   var dist: ;
run;

The partitioning algorithm may be applied only to a data matrix (not to a dissimilarity matrix).

Slide 6: Cluster Analysis for Numerical Variables
Example 1. Information and communications data – the world's countries.
Preliminary analysis. A factor analysis was applied to the data in order to obtain meaningful factors describing the variables. These factors will not be used to obtain the clusters, but only later, to describe the clusters' main characteristics.
Factor1 = 'ICT Level 2006 - Not recent tel. growth';
Factor2 = 'Old-Recent cell/tel/internet growth';
Factor3 = 'Recent Infrastr. cell/tel growth';
Factor4 = 'Old Infrastr/internet growth';

Slide 7: Cluster Analysis for Numerical Variables
We notice that the outliers are really extreme. In order to avoid an excessive influence on the results, we remove the outliers from the dataset.

Slide 8: Cluster Analysis for Numerical Variables
Example 1. Ward's algorithm (statistical distance).
Two is the minimum number of clusters to be considered; the 3-cluster solution is also evidenced as a suitable one. If we do not consider the very high dissimilarity between the clusters joined at the last stages, a greater number of clusters could also be considered. The 2- and 3-cluster solutions are suggested by all the internal and external criteria (the F statistic provides no information); the height (semipartial R2) indicates 3 clusters, the other indications being 'masked' by the strongest one. Thus, we also present plots of the criteria limiting attention to numbers of clusters between 4 and 20.

Slide 9: Cluster Analysis for Numerical Variables
Example 1. Internal/external criteria (number of clusters from 4 to 20).
- Height (semipartial R2): indicates 7/8 clusters, and also 12.
- R square: starting from the 8-cluster solution, a more and more consistent decrease is observed.
- Pseudo F: no indications.
- Pseudo T: 6, 10 clusters.

Slide 10: Cluster Analysis for Numerical Variables
Example 1. The number of clusters.
According to the internal and external criteria, we should consider at least 3 clusters. By analyzing the other solutions, we see that a larger number of clusters could be selected, for example 6 (suggested by the T statistic) or 8 (height of the dendrogram and R2). One possibility to choose the number of clusters, i.e., to define the partition of the observations, is to compare partitions of different degree; as we will see later, another possibility is to combine results arising from different algorithms. The cross-tabulation between the 6- and 8-cluster solutions (reproduced in the sketch below) shows that moving from 6 to 8 clusters simply disaggregates cluster 4, which is probably the most complex and heterogeneous one. Thus, in the following we consider the 8-cluster solution, keeping in mind the clusters which would be joined in the 6-cluster solution. Notice that, due to the hierarchy of the procedure, the refinement of the 6-cluster solution can only occur by splitting one or more of its clusters.
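The 6-versus-8 cross-tabulation can be reproduced by cutting the same Ward tree twice. A hedged sketch, reusing the names from the toy dataset above (with the full country dataset, not the three-row toy); _NAME_ in the PROC TREE output holds the id value passed to PROC CLUSTER:

/* Ward's algorithm on standardized variables, saving the dendrogram */
proc cluster data=ict std method=ward outtree=tree noprint;
   id country;
   var tel_lines cell_subs internet_users;
run;

/* Cut the tree at 6 and at 8 clusters */
proc tree data=tree ncl=6 out=clus6 noprint; run;
proc tree data=tree ncl=8 out=clus8 noprint; run;

proc sort data=clus6; by _name_; run;
proc sort data=clus8; by _name_; run;

data both;
   merge clus6(rename=(cluster=ward6)) clus8(rename=(cluster=ward8));
   by _name_;
run;

/* Rows: 6-cluster solution; columns: 8-cluster solution */
proc freq data=both;
   tables ward6*ward8 / norow nocol nopercent;
run;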
Slide 11: Cluster Analysis for Numerical Variables
Example 1. Analysis of the 8-cluster solution.
We illustrate how to analyze clusters, even if 'final' considerations about the clusters will be made later. The first thing we consider is the dispersion within clusters and the R2, for the whole set of variables and for each single variable. We consider standardized variables.
- RMS Std Dev: a measure of the dispersion within each cluster with respect to all the variables. Notice that cluster 1 is the biggest but has the smallest dispersion.
- Within STD: the dispersion (heterogeneity) within clusters evaluated with respect to each single variable.
- R-square: the proportion of the variance of each variable explained by the clusters. A variable with a high R2 is well explained by the clusters, i.e., its means conditional on the clusters are quite different, and the clusters will be characterized by this variable. Instead, a variable with a low R2, for example Internet_users_incr 2003-06, is not well explained by the clusters, so possible differences between its conditional means are not very informative.

Slide 12: Cluster Analysis for Numerical Variables
Example 1. Analysis of the 8-cluster solution.
Means of the variables at the basis of the clustering process and/or of the factors (useful syntheses giving easier-to-read indications – sensible only if the factors describe all the variables well). We postpone the detailed analysis of these means and first present some graphical tools which may prove useful in inspecting the clusters' characteristics. We here illustrate how to analyze/compare clusters by referring to their centroids, presenting the main results even if 'final' considerations about the clusters will be made later.

Slide 13: Cluster Analysis for Numerical Variables
Example 1. Analysis of the 8-cluster solution.
Means of the variables at the basis of the clustering process and/or of the factors (useful syntheses to simplify the analysis – sensible only if the factors describe all the variables well).

Slide 14: Cluster Analysis for Numerical Variables
Example 1. Analysis of the 8-cluster solution.
Clusters 1 and 2 (beginners) are characterized by a developed ICT sector (higher tendency for cluster 1). The development started before 1999, and during the period 1999-2006 we do not observe relevant increases in the ICT indicators. Cluster 3 (static) is also characterized by low increases, even if its degree of development is slightly lower than the mean. Clusters 4-8 (followers) show the inverse characteristics on average, with a much less developed ICT sector but with different development patterns. In particular, cluster 4 is characterized by a generalized development in 2003-2006 not involving digitalization (recent telephone development). Cluster 5 is the least developed cluster, but it shows a continuous increase from 1999 to 2006 with respect to users, not infrastructures (enduring increasing users). Cluster 6 is characterized by relatively high increases, especially in 1999-2003, even if the development does not involve the digitalization of the population (internet) (relatively decreasing telephone development). Cluster 7, showing a level of development similar to that of cluster 3, is characterized by a high and enduring increase with respect to all the indicators (enduring high increase). Cluster 8 is characterized by development below the average but also by enduring infrastructure (telephone lines) and digitalization development (infrastructures and internet enduring increase).
We here illustrate how to analyze/compare clusters by referring to their centroids; 'final' considerations about the clusters will be made later.

Slide 15: Cluster Analysis for Numerical Variables
Example 1. Analysis of the 8-cluster solution.
- Clusters 1 and 2 (beginners – more or less developed)
- Cluster 3 (static)
- Cluster 4*** (follower – recent telephone development)
- Cluster 5 (follower – enduring increasing users)
- Cluster 6*** (follower – relatively decreasing telephone development)
- Cluster 7 (follower – enduring high increase)
- Cluster 8*** (follower – infrastructures and internet enduring increase)

Slide 16: Cluster Analysis for Numerical Variables
Example 1. Analysis of the 8-cluster solution – factorial maps.
[Factorial maps on the four factors: 'ICT Level 2006 - Not recent tel. growth', 'Old-Recent cell/tel/internet growth', 'Recent Infrastr. cell/tel growth', 'Old Infrastr/internet growth']

Legend:
Cluster | 7    | 5   | 3     | 1    | 8***    | 2      | 6*** | 4***
Color   | blue | red | green | cyan | magenta | orange | gold | lilac

A sketch of how such a map can be drawn follows.
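The factorial maps are ordinary scatter plots of factor scores with the cluster membership as grouping variable. A possible sketch, under these assumptions: a dataset scores holds the factor scores Factor1-Factor4, a country label and the 8-cluster membership cl8, and PROC SGPLOT (SAS 9.2 or later) is available.

/* One factorial map: first two factors, points colored by cluster */
proc sgplot data=scores;
   scatter x=Factor1 y=Factor2 / group=cl8 datalabel=country;
   xaxis label="ICT Level 2006 - Not recent tel. growth";
   yaxis label="Old-Recent cell/tel/internet growth";
run;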
Slide 17: Cluster Analysis for Numerical Variables
Example 1. Stability of Ward's results.
Instead of comparing (nested) partitions of different degree, another possibility to evaluate the stability, or strength, of a solution consists in comparing it:
1. with that/those obtained using different agglomerative methods;
2. with that obtained using a different measure of dissimilarity (for example the Manhattan distance);
3. with that obtained using another algorithm (a partitioning, k-means-type algorithm).
Our baseline is the 6-cluster partition obtained using Ward's method and the statistical distance. For the sake of simplicity, we here consider the combination of Ward's result with the partitions obtained using options 1 and 3.

Slide 18: Cluster Analysis for Numerical Variables
Example 1. Complete-linkage algorithm – statistical distance.
- Height (max distance): 7/2 clusters
- Pseudo F: 4/2
- R square: 4/2
- Semipartial R2: 4/2
- Pseudo T: 4, 2, 10

Slide 19: Cluster Analysis for Numerical Variables
Example 1. Partitioning algorithm (random seeds – 6 clusters).
[Output: Cluster Summary and Statistics for Variables tables, and the cross-tabulation with Ward's 6-cluster membership]

Slide 20: Cluster Analysis for Numerical Variables
[Factorial maps of 'ICT Level 2006 - Not recent tel. growth' vs 'Old-Recent cell/tel/internet growth' for Ward's (6) clusters, complete-linkage (4) clusters and the partitioning algorithm's (6) clusters]
The clusters obtained with the three approaches are all quite well represented on the factorial maps. We observe some 'stable' groups of observations, grouped together by all the algorithms, i.e., 'recognized as similar' whatever the algorithm used to obtain the clusters. Instead, some observations change cluster membership depending on the criterion used to explore the data.

Slide 21: Cluster Analysis for Numerical Variables
[Factorial maps of 'Old Infrastr/internet growth' vs 'Recent Infrastr. cell/tel growth' for the same three partitions]
We may observe here a weaker 'relationship' with the less relevant factors, and some clusters in particular/extreme positions with respect to these factors. Notice that Ward's cluster 4 (magenta) – which, as we already know, is the most dispersed one – is split by the other criteria. Also notice that some observations in Ward's cluster 5 (red) are combined with other observations by the other algorithms.

Slide 22: Cluster Analysis for Numerical Variables
In order to analyze the clusters' stability, we analyze the combination of the results of Ward's algorithm (6-cluster solution), the complete-linkage algorithm and the partitioning algorithm (the bookkeeping is sketched after Slide 25). We find some groups of observations which are 'strongly related' and are placed in the same cluster by every algorithm. Notice once more that cluster 4 is the most disaggregated: it may be considered a residual cluster, that is, a cluster containing observations which are far from the observations in the other clusters but are not very similar to each other.

Slide 23: Cluster Analysis for Numerical Variables
Example 1. Analysis of the stable clusters.
The centroids of these stable clusters provide a description of the most relevant tendencies in the data set.

Slide 24: Cluster Analysis for Numerical Variables
Example 1. Analysis of the stable clusters.
[Factorial maps on the four factors: 'ICT Level 2006 - Not recent tel. growth', 'Old-Recent cell/tel/internet growth', 'Recent Infrastr. cell/tel growth', 'Old Infrastr/internet growth']

Legend:
Stable cluster | 1-1-6 | 2-1-6 | 3-2-4 | 4-2-1 | 5-4-2   | 6-3-3
Color          | blue  | red   | green | cyan  | magenta | orange

Slide 25: Cluster Analysis for Numerical Variables
Example 1. Analysis of the stable clusters. [Figure: stable clusters on the factorial maps]
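A sketch of the bookkeeping behind the stable clusters, under the same naming assumptions as before: ward6 and compl4 are assumed to be memberships from PROC TREE cuts of the Ward and complete-linkage trees, and a dataset memberships is assumed to gather them, per country, together with the k-means-type membership computed here.

/* Partitioning (k-means-type) algorithm, 6 clusters, random initial seeds */
proc stdize data=ict out=ict_std method=std;
   var tel_lines cell_subs internet_users;
run;

proc fastclus data=ict_std maxclusters=6 replace=random
              out=kmout(rename=(cluster=km6)) noprint;
   id country;
   var tel_lines cell_subs internet_users;
run;

/* Each non-empty cell of the three-way table is a candidate stable cluster;
   its label (e.g. 1-1-6) lists the Ward, complete and k-means memberships */
proc freq data=memberships;
   tables ward6*compl4*km6 / list nopercent nocum;
run;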
Slide 26: Cluster Analysis for Numerical Variables
Example 1. Analysis of the stable clusters.
These are our typical profiles (the number indicates the 'stable component' of Ward's partition):
1. Developed countries: the development started before 1999, and since 1999 we observe increases well below the average with respect to the selected ICT indicators.
2. Medium-developed countries: less developed than the countries in profile 1, particularly as concerns infrastructures. These countries show low increases in the indicators, even if not as low as for the first profile.
3. Less developed countries showing no increase.
4. Less developed countries which experienced an improvement in 1999-2003 but also show a slowdown of the development process during 2003-2006, especially with respect to infrastructures (maybe this means that the development process started before 1999, if we assume that the presence of an improvement is related to a delay in the development process). Moreover, in these countries the main development did not involve digitalization.
5. Least developed countries: despite a great, enduring increase in 1999-2006 with respect to digitalization (telephone subscriptions, internet users), their situation in 2006 is the worst one. Within the group of less developed countries, this is the cluster with the lowest increase with respect to infrastructures (main lines).
6. Less developed countries showing a high, enduring improvement in 1999-2006.
Of course, not all the available observations/countries have been associated to one of these profiles (only the observations in the stable clusters). Nevertheless, once these profiles have been identified, we may associate each country to the centroid it is closest to.

Slide 27: Cluster Analysis for Numerical Variables
Example 1. Final partition.
We may now analyze the dispersion within clusters and the R2 either by referring only to the observations in the stable clusters, or by considering a partition defined on the whole dataset, in order to draw conclusions on the whole sample. Here we follow the second approach.

Slide 28: Cluster Analysis – Some Concluding Remarks
Weight of correlated variables. If the dataset contains highly correlated variables, that is, variables sharing information, these variables may carry a higher relative weight in the creation of the clusters than variables related to other information captured by fewer variables. This is because the dissimilarity between two observations with respect to the 'first kind of information' is 'replicated' more often than the dissimilarity relative to the other kind of information. In these situations, we may select one variable per group of correlated variables, or we may apply cluster analysis to factors rather than to the original variables.
Outliers. In our example we removed the outliers. Nevertheless, the illustrated procedure, by possibly emphasizing observations which are weakly related to the clusters, should itself evidence the presence of outliers. Severe outliers may be removed: the aim of this deletion is not to discard information but only to avoid an excessive weight of the outliers on the results. Of course, outliers should be analyzed (why are these observations extreme?) and possibly assigned to the stable clusters if we are interested in a classification of all the observations; a nearest-centroid assignment is sketched below.
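One way to carry out the nearest-centroid assignment mentioned above is to run PROC FASTCLUS with fixed seeds and zero iterations, so that each observation (outliers included) is simply attached to its closest centroid. The centroids dataset is an assumption: it should contain the stable-profile means of the standardized variables.

/* Assign every observation to the nearest stable-cluster centroid:
   seeds are read from 'centroids' and never updated (maxiter=0) */
proc fastclus data=ict_std seed=centroids maxclusters=6
              maxiter=0 replace=none out=final_partition;
   id country;
   var tel_lines cell_subs internet_users;
run;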
Slide 29: Cluster Analysis: Qualitative or Mixed Data

Slide 30: Cluster Analysis for Qualitative Data
We now consider the case when the p variables are not all numerical: variables may be numerical, qualitative (binary, nominal or ordinal) or mixed. We introduce criteria to properly measure the dissimilarity between two observations on the basis of non-numerical variables (notice that we are not talking about methods to measure the dissimilarity between clusters).

Slide 31: Dissimilarity Measures for Qualitative or Mixed Data
Binary (dummy) variables indicate the presence (1) or absence (0) of a characteristic. Examples: Group (the enterprise is part of a group): Yes/No; Intramural R&D: Yes/No; UK: Yes/No.
Multinomial variables are qualitative variables merely distinguishing classes. Examples: Innovation (1/2/3 = in-house, 4/5/6 = cooperation; 1/4 = product, 2/5 = process, 3/6 = product/process, 0 = no); In-house innovation (1 = product, 2 = process, 3 = product/process, 0 = no); Country (1 = UK, 2 = DE, 3 = DK, 0 = other).
Similarity and dissimilarity with respect to nominal variables may be calculated by referring to the number of matches (matching characteristics) or mismatches (non-matching characteristics). An important distinction is that between symmetric and asymmetric nominal variables.

Slide 32: Dissimilarity Measures for Qualitative or Mixed Data
Symmetric nominal variables (SAS: nominal). Examples: Group (the enterprise is part of a group): Yes/No; Innovation (coded as above).
For a symmetric variable, the absence label "0" is a category in its own right and measures a characteristic. Group 0-0: the two firms are both not part of a group, so they are similar. Innovation 0-0: the two firms are similar with respect to innovation; neither innovated product nor process, neither in-house nor in cooperation with other enterprises or institutions. In this case, a 0-0 pair counts as a match.

Slide 33: Dissimilarity Measures for Qualitative or Mixed Data
Asymmetric nominal variables (SAS: anominal). Examples: UK: Yes/No; In-house innovation (1 = product, 2 = process, 3 = product/process, 0 = no).
Here "0" indicates the absence of a characteristic but NOT the presence of the opposite characteristic. UK 0-0: two firms not located in the UK are not necessarily similar with respect to location. In-house innovation 0-0: the two firms are not necessarily similar (0 covers both no innovation at all and innovation in cooperation). In this case, a 0-0 pair can be treated neither as a match nor as a mismatch. Notice that in some applications we may decide to treat such variables as symmetric, if we consider the '0' itself as a characteristic.

Slide 34: Dissimilarity Measures for Qualitative or Mixed Data
Pairs of values fall into three groups: non-0-0 matches, 0-0 matches and mismatches. With p variables:
Simple matching – for (variables treated as) symmetric nominal variables. Dissimilarity: the proportion of non-matching attributes,
$d_{ik}$ = mismatches / (matches_non-0-0 + matches_0-0 + mismatches) = mismatches / p.
Jaccard coefficient – for (variables treated as) asymmetric nominal variables. Dissimilarity: the proportion of non-matching attributes, excluding 0-0 matches,
$d_{ik}$ = mismatches / (matches_non-0-0 + mismatches) = mismatches / (p – matches_0-0).
A sketch of both coefficients in SAS follows.
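Both coefficients are available in PROC DISTANCE. A sketch on hypothetical firm data; the method names DMATCH and DJACCARD (the dissimilarity forms of simple matching and Jaccard) follow the same naming scheme as the DGOWER method used later in these slides, but please verify them against your SAS release.

data firms;
   input firm $ group innovation inhouse uk;
   datalines;
F1 1 3 3 1
F2 0 0 0 0
F3 1 5 0 1
;
run;

/* Simple matching: all variables symmetric, 0-0 pairs count as matches */
proc distance data=firms method=dmatch out=d_match;
   copy firm;
   var nominal(group innovation inhouse uk);
run;

/* Jaccard: asymmetric treatment, 0-0 pairs are ignored */
proc distance data=firms method=djaccard out=d_jacc;
   copy firm;
   var anominal(group innovation inhouse uk);
run;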
Slide 35: Dissimilarity Measures for Qualitative or Mixed Data
Weighting of variables. If a variable has a high number of categories, a match on it is 'more difficult' to obtain; hence, the coding choices may have an impact on the dissimilarity. To avoid this problem, some authors suggest assigning a lower weight to mismatches on variables with a higher number of categories. This operation is similar to standardization.
Weighting of matches. If a variable is characterized by a rare (unusual) state, a match on that state is 'more difficult'; this is why some authors suggest assigning a higher weight (for a given variable) to matches corresponding to rare states.
Both coefficients of Slide 34 can be written in the common form
$d_{ik} = \frac{\sum_{j=1}^{p} \delta^{(j)}_{ik}\, d^{(j)}_{ik}}{\sum_{j=1}^{p} \delta^{(j)}_{ik}}$
where $d^{(j)}_{ik} = 1$ if there is a mismatch with respect to the j-th variable (0 otherwise), and $\delta^{(j)}_{ik} = 1$ when both values are not missing and 0 otherwise. If the j-th variable is asymmetric (and we decide to treat it as asymmetric), $\delta^{(j)}_{ik}$ is also set to 0 in the case of a 0-0 match. In this way the numerator coincides with the number of mismatches and the denominator with (p – number of 0-0 matches).

Slide 36: Dissimilarity Measures for Qualitative or Mixed Data
Ordinal variables: qualitative variables whose levels can be ordered. Examples: Innovated product (1 = significantly improved, 2 = new, 3 = new to the market); Level of digitalization (1 = none, 2 = low, 3 = medium, 4 = high).
If we treat the levels as nominal, the dissimilarity between observations will not depend on the 'distance' between the levels (1 and 2 would be as dissimilar as 1 and 3). Treating the levels as numerical is not a good idea either (the levels could be coded using different numbers). One possibility is to substitute the levels with their ranks. In SAS: order the observations according to the levels and calculate the rank of each observation; obtain the median rank of the observations characterized by the same level; substitute the original levels with these median ranks (called ranks for simplicity).

Slide 37: Cluster Analysis for Qualitative Data – Dissimilarity
Once each level has been substituted with its rank, ordinal variables may be treated as numerical variables. Usually the Manhattan distance is calculated for these data, and the absolute deviations between observations are divided (similarly to standardization) by the range (with ordinal variables we do not have the problem of outliers). Alternatively, the statistical or the Euclidean distance may be applied to the transformed variables. Notice that a binary symmetric variable may also be treated as an ordinal variable (the dissimilarity calculated with respect to such a variable will always be 0 or 1). This transformation procedure is very useful, since it lets us treat ordinal and binary symmetric variables as numerical variables; a concrete sketch follows.
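A concrete version of the rank substitution, with hypothetical variable names. PROC RANK with TIES=MEAN gives each group of tied observations the mean of their ranks, which for a consecutive run of ranks coincides with the median rank described above.

/* Levels -> ranks; tied levels receive their mean (= median) rank */
proc rank data=firms out=firms_rk ties=mean;
   var digit_level innov_prod;
run;

/* Divide by the range, as suggested for ordinal variables */
proc stdize data=firms_rk out=firms_sdz method=range;
   var digit_level innov_prod;
run;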
Slide 38: Cluster Analysis for Qualitative Data – Dissimilarity
Dissimilarity measures for mixed variables. In some applications we may be interested in evaluating the dissimilarity between two observations on the basis of a set of variables of different natures (nominal – symmetric or asymmetric – ordinal, numerical). In this case, we usually refer to the so-called Gower coefficient. According to Gower's proposal, the dissimilarity between two observations is calculated as
$d_{ik} = \frac{\sum_{j=1}^{p} \delta^{(j)}_{ik}\, d^{(j)}_{ik}}{\sum_{j=1}^{p} \delta^{(j)}_{ik}}$
where $\delta^{(j)}_{ik} = 1$ when both values are not missing and 0 otherwise; if the j-th variable is nominal asymmetric, it is set to 0 in the case of a 0-0 match. The per-variable contributions are:
- nominal variables: $d^{(j)}_{ik} = 1$ if the two observations present a mismatch, 0 otherwise;
- ordinal variables: $d^{(j)}_{ik} = |r_{ij} - r_{kj}| / R_j$, where $r_{ij}$ is the ranked level of the i-th observation on the j-th variable and $R_j$ is the range of the ranked values;
- numerical variables: $d^{(j)}_{ik} = |x_{ij} - x_{kj}| / R_j$, where $R_j$ is the range of the j-th variable.

Slide 39: Cluster Analysis for Qualitative Data – SAS Procedures
Hierarchical cluster analysis for nominal or mixed variables (a concrete instance is given at the end of this section):

proc distance data=dataset method=dgower out=d_gower;
   copy label_obs;
   var interval(numerical_vars_list)
       anominal(anominal_vars_list)
       nominal(nominal_vars_list)
       ordinal(ordinal_vars_list);
run;

PROC CLUSTER data=d_gower(type=distance) options;
   id label_obs;
   var dist: ;
run;

The partitioning algorithm cannot be applied in this case, since we do not have a data matrix, only a dissimilarity matrix. As noted before, in the particular case when only ordinal or binary symmetric variables are considered, a data matrix can be obtained using rank-transformed values.

Slide 40: Cluster Analysis for Ordinal and Binary Data – SAS Procedures
Cluster analysis for ordinal and binary (symmetric) variables:

/* substitute variables with ranks */
PROC RANK data=dataset out=rank;
   var list_of_variables;
run;

/* standardization using the range */
proc stdize data=rank out=sdz method=range;
   var vars_list;
run;

This new dataset can now be analyzed with the procedures illustrated for numerical variables. Of course, the Manhattan distance can also be computed from these data (PROC DISTANCE). Since a (numerical) data matrix is available in this case, it is also possible to apply the partitioning algorithm.

Slide 41: Cluster Analysis and Factorial Maps
In our application we used cluster analysis to group observations, factor/PCA analysis to 'group' variables and to describe the relationships between them, and factorial maps to combine and visualize the results arising from the two techniques. When cluster analysis is based on statistical distances, the two procedures are related, since we are trying to explain/find groups of observations close to one another, and these groups of observations will possibly induce 'tendencies'. But when we work with a dissimilarity matrix, for example with the Manhattan distance, the relationship between clusters and factors is less strong. Things are even more complicated in the case of nominal or mixed variables (Gower coefficient).
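Finally, a concrete instance of the Slide 39 template on hypothetical mixed data: one numerical, one symmetric nominal, one asymmetric nominal and one ordinal variable (for the asymmetric variable, 0 is the default 'absence' value).

data firms_mix;
   input firm $ turnover group $ innovation digit_level;
   datalines;
F1 120 Y 3 3
F2  15 N 0 1
F3  80 Y 2 4
;
run;

/* Gower dissimilarity, each variable declared with its proper level */
proc distance data=firms_mix method=dgower out=d_gower;
   copy firm;
   var interval(turnover)
       nominal(group)
       anominal(innovation)
       ordinal(digit_level);
run;

/* Hierarchical clustering on the resulting dissimilarity matrix */
proc cluster data=d_gower(type=distance) method=average;
   id firm;
   var dist: ;
run;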
