8323 Stats - Simple Correspondence Analysis

Education

Published on May 11, 2008

Author: untellectualism

Source: authorstream.com

Slide 1: Simple Correspondence Analysis. Outline: motivation (why correspondence analysis); association; row and column profiles; introduction to correspondence analysis.

Slide 2: Example. 4th Framework Programme for Research and Technological Development. The data cover all the research and technological development (RTD) activities funded by the European Commission during the period 1994-1998. These are the data we will consider. Is it possible to analyze the relationships between two of the (categorical) variables considered? If they are associated, is it possible to describe what the association is due to? We are interested in describing the main attractions/avoidances between the levels of the variables involved. For example, if we consider Resp_country and Resp_org, can we say which types of organisation are more active in the different countries? Likewise, if we consider Resp_country and Topic, can we draw conclusions about the research directions preferred by European countries (organisations)?

Slide 3: Analysis of Association – Contingency table. Example 1. We first consider the organisation responsible for each project. We want to study the association between nationality (resp_country) and type of organisation (resp_org). For teaching convenience, we start with a limited set of countries. The information on the joint distribution of the two variables is contained in the so-called contingency table (frequencies). A contingency table of frequencies displays the classification of n cases according to two nominal variables. Its elements, nij, are the joint frequencies. The last row and the last column report the marginal frequencies, nCj and nRi, obtained as the sums of the elements in each column and each row respectively.

Slide 4: A contingency table of percentages displays the classification of n cases according to two nominal variables. Its elements, pij, are the joint percentages.
The last row and the last column report the marginal percentages, Cj and Ri, obtained as the sums of the elements in each column and each row respectively. It is also possible to work with the contingency table of proportions instead of percentages; we will use the same notation for both.
Analysis of Association – Contingency table

Slide 5: Analysis of Association – Conditional frequencies. The two variables may also be jointly analyzed through the conditional frequencies, i.e., percentages of the row totals (or percentages of the column totals). These frequencies are useful to compare the relative importance/incidence of each category of one variable in the sub-populations induced by the categories of the other one. Row conditional frequency: rij = nij / nRi = pij / Ri. Column conditional frequency: cij = nij / nCj = pij / Cj.

Slide 6: Analysis of Association – Row Profiles. The row profiles are the frequencies conditioned on the row categories. In this case, these profiles describe the distribution of the column variable (Resp_org) conditional on the row variable (Resp_country). The row profiles may be compared with the marginal row profile, Cj, which describes the distribution of the column variable in the whole population. Observe, for example, that among organisations in Belgium the proportion of Education (0.2948) is higher than in the whole population (0.1612). Also observe that among organisations in Denmark the proportion of Industry (0.4167) is lower than in the whole population (0.5805).

Slide 7: Analysis of Association – Column Profiles. The column profiles are the frequencies conditioned on the column categories. In this case, these profiles describe the distribution of the row variable (Resp_country) conditional on the column variable (Resp_org). The column profiles may be compared with the marginal column profile, Ri, which describes the distribution of the row variable in the whole population.
Observe, for example, that among Education organisations the proportion of UK (0.3236) is higher than in the whole population (0.2083). Also observe that among Research organisations the proportion of Italy (0.0991) is lower than in the whole population (0.1122).

Slide 8: Analysis of Association. Two categorical variables are associated if it is possible to observe attraction or avoidance between the levels of the two variables. This is a very general definition. It is easier to describe the situation in which two variables are not associated, i.e., independent. Consider the following simplified example (row profiles and column profiles). Notice that in this situation there is no attraction or avoidance between the levels of the two variables. In fact, the row profiles are all identical to one another (and to the marginal profile), and the same holds for the column profiles. Thus, the incidence of one level of the row (resp. column) variable is exactly the same in all the sub-populations induced by the levels of the column (resp. row) variable. In this case, we say that the two variables are independent.

Slide 9: From the previous discussion, we may conclude the following.
Analysis of Association – Independence
Two categorical variables are independent if and only if rij = Cj for every i, j or, equivalently, cij = Ri for every i, j (the two conditions are equivalent) or, equivalently, pij = Ri × Cj for every i, j. The three conditions are equivalent.

Slide 10: Analysis of Association – Independence. Observed joint frequencies / proportions: nij, pij. Expected joint frequencies / proportions under independence: n*ij, p*ij. Two categorical variables are independent if and only if pij = p*ij = Ri × Cj for every i, j, that is, nij = n*ij = n × Ri × Cj for every i, j. Once independence has been defined, we can measure the extent of the association by quantifying the deviation of the observed frequencies (proportions) from those expected under the hypothesis of independence. This is the Chi-square index, $\chi^2 = n \sum_i \sum_j (p_{ij} - p^*_{ij})^2 / p^*_{ij}$ (with pij taken as proportions). The index may also be rewritten as a function of the observed frequencies nij and of those expected under independence n*ij: $\chi^2 = \sum_i \sum_j (n_{ij} - n^*_{ij})^2 / n^*_{ij}$. It may also be rewritten as a function of the deviations of the row/column profiles from the respective marginal profiles: $\chi^2 = n \sum_i R_i \sum_j (r_{ij} - C_j)^2 / C_j = n \sum_j C_j \sum_i (c_{ij} - R_i)^2 / R_i$.

Slide 11: Analysis of Association. Observed and expected frequencies.

Slide 12: Analysis of Association. Observed and expected frequencies. Attraction: frequency (and proportion) greater than expected. Avoidance: frequency (and proportion) lower than expected. Categories with (relatively) high attraction/avoidance contribute heavily to the Chi-square, which measures deviation in any direction from the hypothesis of independence.

Slide 13: Contribution to the Chi-square: cells, rows, columns.

Slide 14: Simple correspondence analysis. In our application, the two variables are associated, since Chi-square = 278.110 (p-value < 0.0001). By analyzing the table of contributions to the Chi-square and the attraction/avoidance between pairs of categories, we can understand the structure of this association, i.e., which categories are responsible for the deviation from independence.
As with the other factorial techniques, our aim is to extract factors describing the structure of the association, i.e., to synthesize as well as possible the information contained in the contingency table (with respect to the association). The resulting factorial maps (correspondence maps) allow the researcher to visualize relationships among the categories of categorical variables (particularly useful for large data sets). To understand the logic underlying simple correspondence analysis, recall that two variables are associated if and only if the row (resp. column) profiles differ from one another or, equivalently, differ from the marginal row (resp. column) profile; the larger these differences, the stronger the association.

Slide 15: Simple correspondence analysis – Row and column profiles. The matrix of row profiles has I rows r1, r2, …, rI. The i-th row is the i-th row profile, ri. The marginal row profile, C, describes the distribution of the column variable in the whole population (that is, disregarding the categories of the row variable). The marginal row profile can be obtained as an average of the row profiles, each weighted by the marginal proportion of its row, also called the row mass, Ri. The same holds for the marginal column profile, which is an average of the column profiles, each weighted by the marginal proportion of its column, also called the column mass, Cj. Hence, if we extend the concept of centroid to include weights, C may be regarded as the centroid of the row profiles, and R as the centroid of the column profiles. The matrix of column profiles has J columns c1, c2, …, cJ.
The j-th column is the j-th column profile, cj. The marginal column profile, R, describes the distribution of the row variable in the whole population (that is, disregarding the categories of the column variable). The weighted averages are $C = \sum_i R_i \, r_i$ and $R = \sum_j C_j \, c_j$.

Slide 16: Simple correspondence analysis – Row profiles. Each row profile ri may be regarded as a vector in the J-dimensional space, J being the number of elements of ri, i.e., the number of columns (in our example J = 6). Hence we can calculate the distances between the row profiles in the J-dimensional space or, equivalently, the distance between each row profile and the centroid of the row profiles, i.e., the marginal row profile (remember: the larger these distances, the stronger the association between the two variables). As in Principal Component Analysis, we are thus interested in studying the dispersion of the vectors around their centroid.

Slide 17: Simple correspondence analysis – Row profiles. Distance between one row profile and the marginal row profile. In SCA the distance between the i-th row profile, ri, and the marginal row profile, C, is calculated using a weighted Euclidean distance called the Chi-square distance: $d^2(r_i, C) = \sum_j (r_{ij} - C_j)^2 / C_j$. Each squared deviation is weighted by 1/Cj, i.e., each component of the row profile (each column) is inversely weighted by its "importance", measured by the column marginal proportion (column mass). In this way we prevent the high-frequency elements of ri and C from having an excessive influence on the distance (an operation similar to standardization). Consider the following example (we have percentages here, so the final result should be divided by 100). Observe that the squared deviation for Education is lower than that for Industry. However, this is because Industry has a higher frequency in the dataset. Hence, the deviation is weighted to take this into account.
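The weighting just described can be illustrated with a minimal sketch. The profile and centroid below are hypothetical proportions (not the deck's figures), chosen so that two columns have the same raw squared deviation but different masses; the function name is an assumption made here.

```python
def chi_square_distance_sq(profile, centroid):
    """Squared chi-square distance: sum_j (r_ij - C_j)^2 / C_j."""
    return sum((r - c) ** 2 / c for r, c in zip(profile, centroid))

# Hypothetical marginal row profile C (column masses) and one row profile r_i.
centroid = [0.15, 0.60, 0.25]
profile = [0.30, 0.45, 0.25]

# Raw squared deviations: identical for the first two columns (0.0225 each).
raw = [(r - c) ** 2 for r, c in zip(profile, centroid)]

# Weighted terms: the deviation on the rare column (mass 0.15) counts four
# times as much as the same deviation on the frequent column (mass 0.60).
weighted = [dev / c for dev, c in zip(raw, centroid)]

d2 = chi_square_distance_sq(profile, centroid)
```

With these numbers the weighted terms are about 0.15, 0.0375 and 0, so the squared distance is about 0.1875, dominated by the low-mass column.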
Slide 18: Simple correspondence analysis – Row profiles. Synthesis of the distances of the row profiles from their centroid. The I distances between the row profiles and their centroid are synthesized as follows: $G^2_R = n \sum_i R_i \, d^2(r_i, C)$. In this synthesis (which is similar to the total variance in PCA), each row profile is weighted by the corresponding row mass. In this way, a higher weight is assigned to the most relevant row profiles (importance here is measured by frequency). This synthesis coincides with the Chi-square index, which measures the association between the two variables. The quantity $G^2_R / n$ is called TOTAL INERTIA. The total inertia, which is related to the strength of the association, is the information we want to preserve as well as possible in SCA.

Slide 19: Simple correspondence analysis – Row profiles. Simple correspondence analysis consists of a Principal Component Analysis applied to the row profiles. In this particular application of PCA we do not use the standard Euclidean distance but a weighted Euclidean distance. Our aim is to project the row profiles onto a factorial space so as to reproduce as well as possible the distances between the row profiles and the marginal row profile. As mentioned before, in this context the information to be reproduced in the factorial space is the total inertia, which is related to the strength of the association. As in PCA, we extract eigenvalues (principal inertias), eigenvectors and factors. The sum of the eigenvalues in this case coincides with the total inertia. Also, as in PCA, the factors are extracted in decreasing order of importance, importance being measured by the proportion of total inertia explained.
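A minimal NumPy sketch of this extraction is given below. It uses the singular value decomposition of the standardized residuals, which is a common way to compute correspondence analysis; the slides do not show the computation, so the formulation, the function name `correspondence_analysis` and the toy table are all assumptions made here for illustration.

```python
import numpy as np

def correspondence_analysis(table):
    """Simple correspondence analysis via SVD of the standardized residuals."""
    N = np.asarray(table, dtype=float)
    P = N / N.sum()                 # joint proportions p_ij
    r = P.sum(axis=1)               # row masses R_i
    c = P.sum(axis=0)               # column masses C_j
    # Standardized residuals: (p_ij - R_i C_j) / sqrt(R_i C_j)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    k = min(N.shape) - 1            # at most min(I - 1, J - 1) non-trivial factors
    U, sv, Vt = U[:, :k], sv[:k], Vt[:k]
    rows = (U * sv) / np.sqrt(r)[:, None]     # row principal coordinates
    cols = (Vt.T * sv) / np.sqrt(c)[:, None]  # column principal coordinates
    return {"row_mass": r, "col_mass": c, "inertias": sv ** 2,
            "rows": rows, "cols": cols}

# Hypothetical 3x3 table (NOT the data from the slides).
table = [[20, 30, 10],
         [10, 40, 30],
         [30, 10, 20]]
result = correspondence_analysis(table)

total_inertia = result["inertias"].sum()             # equals chi-square / n
explained = result["inertias"] / total_inertia       # proportion per factor
```

The squared singular values play the role of the eigenvalues (principal inertias): they come out in decreasing order and their sum is the total inertia.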
The maximum number of factors that can be extracted from a contingency table with I rows and J columns is min(I − 1, J − 1).

Slide 20: Simple correspondence analysis – Column profiles. Using the same arguments as before, each column profile cj may be regarded as a vector in the I-dimensional space, I being the number of elements of cj, i.e., the number of rows (in our example I = 11). We can calculate the distances between the column profiles in the I-dimensional space or, equivalently, the distance between each column profile and the centroid of the column profiles, i.e., the marginal column profile (again, large distances mean strong association). In this situation too, a Principal Component Analysis may be applied to study and describe the dispersion of the vectors around their centroid.

Slide 21: Simple correspondence analysis – Column profiles. Distance between one column profile and the marginal column profile. The Chi-square distance between the j-th column profile, cj, and the marginal column profile, R, is $d^2(c_j, R) = \sum_i (c_{ij} - R_i)^2 / R_i$. Synthesis of the distances of the column profiles from their centroid. The J distances between the column profiles and their centroid are synthesized as $G^2_C = n \sum_j C_j \, d^2(c_j, R)$. Hence: total inertia of the row profiles, $G^2_R / n$, = total inertia of the column profiles, $G^2_C / n$. If we apply PCA to the column profiles using the weighting procedures described, we extract eigenvalues, eigenvectors and factors relative to the columns. Again, the maximum number of factors that can be extracted is min(I − 1, J − 1), and the factors are extracted in decreasing order of importance, importance being measured by the proportion of total inertia explained. It can be shown that the eigenvalues, eigenvectors and factors extracted by analyzing the rows are in correspondence with those extracted by analyzing the columns.

Slide 22: Simple correspondence analysis. Independence: the joint proportions coincide with the product of the marginal proportions.
The row profiles all coincide with the marginal row profile, and the column profiles all coincide with the marginal column profile. Chi-square statistic: the observed frequencies (or row/column profiles) are compared with those expected under the hypothesis of independence. The differences between observed and expected all contribute to the association and hence to the Chi-square. Simple correspondence analysis: the total dispersion of the row profiles around their centroid is analyzed and attributed to min(I − 1, J − 1) factors, extracted in decreasing order of importance. Using standard criteria, attention is limited to a subset of factors explaining a satisfactory proportion of the inertia. It can be shown that the factors extracted for the row profiles are in correspondence with those extracted for the column profiles. These factors describe the association, since they describe the distance of the row/column profiles from their centroids, and thus explain why the observed frequencies differ from those expected under the hypothesis of independence. The row/column profiles that are farthest from their centroid can be considered the most responsible for the association, since they influence the total dispersion (total inertia) we are explaining with the factors. The corresponding row/column categories are attracted to, or avoid, some column/row categories.

Slide 23: Simple Correspondence Analysis. Theory and practice: choose the number of SCA factors; evaluate the SCA factors with respect to the row and column profiles; obtain and analyze the correspondence maps; comment on the position of the row/column profiles; extend the concept of "outliers" to the present case, and discuss their influence on the SCA factors.

Slide 24: Simple correspondence analysis. Example 1.
We want to study the association between nationality (resp_country) and type of organisation (resp_org) of the organisation responsible for the project. Resp_country: 11 categories. Resp_org: 6 categories. Association: Chi-square and total inertia. χ² = 278.110 (significant). Total inertia = χ²/n = 278.110 / 3182 = 0.0874. The total inertia can be decomposed along min(I − 1, J − 1) = min(11 − 1, 6 − 1) = 5 dimensions (factors).

Slide 25: Simple correspondence analysis. This table reports: singular values (square roots of the eigenvalues), eigenvalues (principal inertias), contribution to the Chi-square (Chi-Square), % of inertia explained by each factor (percent), and cumulative proportion of inertia explained by the factors (cumulative percent). The last column is a sort of scree plot, a graphical representation of the % of inertia explained by each factor. Total inertia: the factors are extracted so as to reproduce as well as possible the distances between the row/column profiles and their centroids in the original spaces. The 1st dimension explains 66.28% of the total inertia, the 2nd explains 22%, and the two dimensions together explain 88.28%. The association can therefore be analyzed using 2 dimensions.

Slide 26: Simple correspondence analysis. We decide to extract only 2 dimensions. As in the other factorial techniques, our final goal is to obtain a factorial map describing the dispersion of the row/column profiles. In our previous lessons we usually started with an evaluation of the explanatory power of the factors and/or of the quality of the representation of each point, postponing the analysis of the maps. Here, instead, we will first describe the maps in order to understand their meaning. Correspondence maps for row and for column profiles. Analysis of the position of the points. These plots are shown here only for teaching convenience.
Combination of the maps. Then we will "start again" and analyze the quality of the representation of the row/column profiles; this analysis should always precede the analysis of the map.

Slide 27: Simple correspondence analysis. Correspondence map of row profiles (only for teaching convenience). The origin here represents the marginal row profile, C. In the case of independence the row profiles are similar to the marginal row profile, and the points are clustered around the origin. Some caution is necessary, however, since even small deviations may be emphasized on the map (so check the Chi-square carefully to be sure there is some association to be explained). If a point is relatively close to the origin (here Germany), the row profile, i.e., the distribution of the column variable (Resp_org) conditional on that row, is similar to C. If a point is far from the origin (here, for example, Belgium/Italy along the 1st dimension and Denmark/Greece along the 2nd), the distribution of the column variable conditional on that row differs from C. Of course, distances along the 1st dimension are "more important" with respect to the association. Notice that from this map we cannot understand why these row profiles differ from the marginal one.

Slide 28: Simple correspondence analysis. Correspondence map of row profiles (only for teaching convenience). Closeness between row profiles. Two row profiles that are close in the factorial space are similar: the distribution of the column variable (Resp_org) conditional on the two rows is similar. Here, for example, Denmark and Netherlands are close (with respect to both dimensions), and so are France and Italy (especially with respect to the 1st dimension, but they are relatively close on the 2nd as well). In the same way, two points that are distant on the map represent row profiles that are different, i.e., two different conditional distributions.
Here, for example, Belgium/UK and France/Italy are far from each other along the 1st dimension, and Denmark and Greece are distant along the 2nd dimension. Again, from this map, which represents only row profiles, it is not possible to understand why two profiles are similar or dissimilar.

Slide 29: Simple correspondence analysis. Closeness and distance of row profiles (only for teaching convenience). Inspection of the row profiles gives ideas about the reasons for closeness/distance between points. Notice that Germany is the row profile most similar to C. UK and France are distant because they deviate from the marginal profile in different ways. For UK we observe a higher % of Education (25.04 vs 16.12), for France a lower one (6.5 vs 16.12). UK has a lower % of Industry and the opposite holds for France (49.92 and 67.67 vs 58.05). Notice that for the other column levels the two countries are not so different. Denmark and Netherlands are similar (with respect to both dimensions) since their deviations from the marginal profile are similar. Compared with C, they both show an average % of Education, a lower than average % of Industry, and a higher than average % of Non commercial and of Research.

Slide 30: Simple correspondence analysis. Correspondence map of row profiles (only for teaching convenience). Of course, we are interested in interpreting the deviations from the origin. Along the 1st dimension, which is the most important, we observe the opposition of Belgium and Portugal (and, less strongly, Denmark, Netherlands, Finland, UK and Greece) to France and Italy (and, less strongly, Spain and Germany). Along the 2nd dimension, we observe the opposition of Denmark and Netherlands (and, less strongly, France and Finland) to Portugal and Greece (and, less strongly, Italy, Spain and UK).
From the analysis of the row profiles in the previous slide, we understand that the distance from the centroid along one dimension is related to attraction/avoidance for particular column categories. To analyze this aspect in more detail we have to consider the map of the column profiles.

Slide 31: Simple correspondence analysis. Correspondence map of column profiles (only for teaching convenience). The origin here represents the marginal column profile, R. As for the column profiles, we observe that the 1st dimension opposes Industry (on the left) to all the other types of organisation, in particular to Education. Along the 2nd dimension, Non commercial (and, less strongly, Research) is opposed to Education. In this case too, we see that the deviation of the points from the origin along a particular dimension is related to attraction/avoidance for some countries.

Slide 32: Simple correspondence analysis. Correspondence map of row and column profiles: combination. In order to analyze and interpret the distance of the row profiles from their centroid and of the column profiles from their centroid, i.e., to understand which attractions/avoidances are responsible for the deviation from the hypothesis of independence, the two correspondence maps of row and column profiles are combined. This is possible thanks to the correspondence between the factors extracted for the two sets of vectors. It is important to remember that row and column profiles lie in different original spaces. Consequently, while closeness between row (resp. column) profiles can be interpreted as similarity of the conditional distributions, closeness between a row profile and a column profile cannot be interpreted in this way. In this second case, closeness reflects attraction and distance reflects avoidance.
Before proceeding, please notice that we have not yet considered the reliability of the maps, that is, possible projection errors. Hence, the next map is shown only to illustrate the main output of correspondence analysis; in practice it is analyzed after a detailed inspection of the quality of the representation of the row and column profiles on the map itself.

Slide 33: Simple correspondence analysis. Correspondence map. The opposition between Industry and the other types (in particular Education) along the 1st dimension corresponds to the opposition of France, Italy, Germany and Spain to the other states, in particular to Belgium. This means that the profiles most relevant to the association (the 1st dimension is the most important) differ mainly in the % of Industry. The opposition between Non commercial / Research and Education corresponds to an opposition between Denmark/Netherlands and Portugal/Greece. Notice that the 2nd dimension in this case provides a more detailed description of the types of organisation other than Industry. We may expect that Denmark/Netherlands are attracted to Non commercial and avoid Education, and that the reverse holds for Greece and Portugal.

Slide 34: Simple correspondence analysis. Analysis of the quality of the factorial solution. As mentioned before, in order to interpret and properly analyze the dimensions, the main oppositions, and the "structure" of the association in terms of attraction/avoidance, it is important to evaluate the quality of the solution obtained. In particular, the position of a point (row or column) on the map may be due to a projection error. It may also happen that a particular profile is not well described by the most important map(s) because it is attracted to, or avoids, "uncommon" categories.
In the following we thus introduce some criteria for evaluating the quality of the solution with respect to each row/column profile, so that a more reliable analysis of the map is possible.

Slide 35: Simple correspondence analysis. Quality of the factorial solution – Row profiles. Quality: the quality of the representation of the profile in the selected factorial space (2-dimensional in our application). It is a global measure of quality and may be interpreted as the squared cosine of the angle between the row profile in the original space and its projection onto the factorial space. Low quality means that the selected dimensions do not provide a good representation of the profile; the main oppositions described on the map are then not relevant for the profile. Here we observe a very low quality for Finland and a medium quality for Greece and Spain. Mass and Inertia: these are general measures and do not depend on the number of selected dimensions. Mass: marginal row proportions (see slide 4, where the percentages are reported). Inertia: proportional contribution of each row to the inertia (see slide 13, where the percentages are reported). Notice that here the points with high inertia are all well represented, and the same is true for the points with high mass.

Slide 36: Simple correspondence analysis. Quality of the factorial solution – Row profiles. Remember that each dimension accounts for a certain % of the total inertia. In this application, the 1st and the 2nd dimension explain 66.28% and 22% of the total inertia respectively. The inertia explained by each dimension can be decomposed and attributed to the row profiles. Thus, each row profile is "responsible" for a certain % of the inertia explained by the dimension (partial contribution to inertia). Points with a high contribution are those defining the dimension itself. Points with a low contribution are not necessarily badly represented on the dimension (see the next slide).
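The indicators above (mass, inertia share, quality, partial contributions) can be computed from the principal coordinates as in the sketch below. It is an illustration under stated assumptions: the 4×4 table is hypothetical, the coordinates come from an SVD of the standardized residuals (a common formulation that the slides do not spell out), and the variable names are chosen here, not taken from SAS output.

```python
import numpy as np

# Hypothetical 4x4 contingency table (NOT the data from the slides).
table = np.array([[40, 10,  5,  5],
                  [ 8, 35, 12,  5],
                  [ 5, 10, 30, 10],
                  [ 6,  4, 10, 45]], dtype=float)

P = table / table.sum()
r, c = P.sum(axis=1), P.sum(axis=0)               # row and column masses
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sv, _ = np.linalg.svd(S, full_matrices=False)
k = min(table.shape) - 1                          # non-trivial dimensions
F = (U[:, :k] * sv[:k]) / np.sqrt(r)[:, None]     # row principal coordinates
inertias = sv[:k] ** 2                            # principal inertias

# Squared chi-square distance of each row profile from the centroid.
dist2 = (F ** 2).sum(axis=1)

# Mass and inertia do not depend on how many dimensions are retained.
mass = r
inertia_share = r * dist2 / inertias.sum()

# Squared cosine of each row with each dimension; quality sums the first m.
sqcos = F ** 2 / dist2[:, None]
m = 2
quality = sqcos[:, :m].sum(axis=1)

# Partial contribution of each row to the inertia of each dimension.
contribution = r[:, None] * F ** 2 / inertias
```

Each column of `contribution` sums to 1, so its entries can be read as the share of a dimension's inertia due to each row; `quality` lies between 0 and 1 and reaches 1 only when all dimensions are retained.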
SAS also produces a Best table indicating which points best explain the inertia of each dimension. We will not consider it here. The most relevant profiles, which thus dominate the 1st dimension, are Belgium, France, Italy and UK (remember that these row points were the most extreme on the 1st dimension). The most relevant profiles, which thus dominate the 2nd dimension, are Denmark, Greece and Netherlands.

Slide 37: Simple correspondence analysis. Quality of the factorial solution – Row profiles. The squared cosine with a given dimension is the squared cosine of the angle between the row profile in the original space and its projection onto that dimension. The sum of the squared cosines over the selected dimensions equals the Quality. A squared cosine close to 1 means that the profile is well represented on the dimension considered. The position of the profile on the dimension is then reliable and can be interpreted and commented on: the dimension "describes well" the main features of the profile. Usually, points with a high partial contribution to inertia also have high squared cosines. The reverse is not true. 1st dimension: Belgium, France, Italy and UK (also dominating the dimension), Germany and Portugal. 2nd dimension: Denmark, Greece, Netherlands (also dominating the dimension), Portugal and Spain. (Please note that the values of the cosines should be evaluated in relative terms: the 2nd dimension is less relevant than the 1st and may consequently have a lower explanatory power.)

Slide 38: Simple correspondence analysis. Quality of the factorial solution – Row profiles. We will consider a profile Critical if its quality is < 0.2. Dim 1: 0.4 < sqcos1 < 0.7 gives M1; sqcos1 >= 0.7 gives H1. Dim 2: 0.2 < sqcos2 < 0.4 gives M2; sqcos2 >= 0.4 gives H2. The dominant profiles will be flagged by hand on the plot.

Slide 39: Simple correspondence analysis. Quality of the factorial solution – Column profiles. Which considerations? Education, Industry and Non commercial are the categories best represented in the factorial space.
They are also characterized by high mass and high inertia. Education and Industry dominate the 1st dimension and are (of course) well represented on it. Education also has a medium influence on the 2nd dimension. Non commercial instead dominates the 2nd dimension. Consultancy has low mass/inertia and the quality of its representation is medium, as for Research (characterized by medium mass and inertia). The explanation of Consultancy is related only to the 1st dimension, whilst the explanation of Research is mostly related to the 2nd dimension. Other is not well represented (maybe it is related to Finland, which is not adequately represented on this map?).

Slide 40: Simple correspondence analysis. Map coloured according to the quality of the representation.
Legend | CR    H1   H2     M1M2  M2       M1
Colors | blue  red  green  cyan  magenta  orange
The position of Other and Finland should not be commented on here. The two profiles are projected in the best possible position, once the dimensions have been defined on the basis of the most relevant profiles. Nevertheless, their closeness to/distance from the other profiles only describes marginal characteristics of these profiles.

Slide 41: Simple correspondence analysis. Analysis of the unexplained profiles: increase the number of dimensions. We now extract 4 dimensions (98% of the total inertia) in order to understand the attractions/avoidances characterizing the unexplained profiles. This analysis is seldom carried out. Here we only show how it is possible to describe the association in detail. As in PCA, the first map remains unchanged (the extraction of more factors has no influence on the first ones).
Legend | NE    MH3MH4  MH3
Colors | blue  red     green
With respect to the 3rd and 4th dimensions, we flagged as MH3 and MH4 the profiles with a squared cosine > 0.1. The profiles well explained on one dimension here all dominate it.
Notice that Spain is opposed to Research (they are close on the 1st dimension), and Other is opposed to Finland (they are close on the 1st dimension).

Slide 42: Simple correspondence analysis – Rare categories/outliers. We consider Resp_country and Resp_org again, but we no longer limit attention to the most active countries. From the contingency table we notice some rare categories.

Slide 43: Simple correspondence analysis – Rare categories/outliers. Consider now the row profiles. Notice that the rare row categories (countries) have strongly peculiar profiles. This is due to the low number of observations for these categories. The few cases will of course be attracted by a few columns and will show a strong repulsion for the others. Thus these rare categories, which are not substantively relevant, turn out to be the most influential ones simply because of their rarity.

Slide 44: Simple correspondence analysis – Rare categories/outliers. Correspondence analysis. Here 2 dimensions explain 80% of the total inertia. Nevertheless, for reasons that will become clear shortly, we consider 4 dimensions, which together explain 97% of the total inertia. Consider the quality for the row profiles.

Slide 45: Simple correspondence analysis – Rare categories/outliers. Observe that the least frequent categories are those that influence the map most. Of course we could simply remove these profiles from the maps, but the structure of the association captured by the dimensions would still be clearly and strongly influenced by the rare categories. These influential categories are those with a low mass and a high quality (for example Switzerland in this application). As with outliers, we may be interested in obtaining the maps without taking these row/column profiles into account. In correspondence analysis, profiles that are excluded from the analysis and only projected onto the factorial maps are called supplementary profiles. In some applications all the profiles with low mass are projected as supplementary profiles.
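How a supplementary row profile is projected can be sketched as follows. This is a minimal illustration, assuming the usual SVD formulation of correspondence analysis and the standard transition formula (a row's coordinates are its centred profile applied to the column standard coordinates). The table, the rare row and the function names `fit_ca` and `project_supplementary_row` are hypothetical.

```python
import numpy as np

def fit_ca(active_table):
    """Fit CA on the active rows only; return what is needed for projection."""
    N = np.asarray(active_table, dtype=float)
    P = N / N.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    k = min(N.shape) - 1
    row_coords = (U[:, :k] * sv[:k]) / np.sqrt(r)[:, None]  # principal coordinates
    col_std = Vt[:k].T / np.sqrt(c)[:, None]                # column standard coordinates
    return c, row_coords, col_std

def project_supplementary_row(counts, col_mass, col_std):
    """Place a row on the map without letting it influence the dimensions."""
    profile = np.asarray(counts, dtype=float)
    profile = profile / profile.sum()          # row profile of the supplementary row
    return (profile - col_mass) @ col_std      # centred profile times column axes

# Hypothetical active table and one rare row kept out of the fit.
active = [[20, 30, 10],
          [10, 40, 30],
          [30, 10, 20]]
rare_row = [0, 3, 0]

col_mass, row_coords, col_std = fit_ca(active)
rare_coords = project_supplementary_row(rare_row, col_mass, col_std)
```

As a sanity check, projecting one of the active rows with `project_supplementary_row` returns the same coordinates the fit assigned to it, which is why supplementary points can be read on the same map as the active ones.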
Influence of rare categories on the maps (4 dimensions). Slide 46: 46 Simple correspondence analysis Example 2. We consider the association between the nationality of the organisation responsible for the projects (resp_country) and the topic of the project (topic). Statistics for Table of resp_country by topic The Chi-square is significant. The two variables are associated and correspondence analysis may be applied, even if the association is not very strong (look at the relative indices). Slide 47: 47 Simple correspondence analysis Inertia and Chi-Square Decomposition The first 3 dimensions are the most relevant (look at the ‘scree plot’) and together explain 76% of the total inertia. Nevertheless (guided by ex-post considerations) we extract 4 dimensions so as to have a clear idea of the most relevant tendencies (and without losing too much information). Slide 48: 48 Simple correspondence analysis Analysis of row profiles – quality Observe that the most frequent profiles are explained quite well. Nevertheless, some rare profiles are characterized by high quality and dominate dimensions. Slide 49: 49 Simple correspondence analysis Analysis of column profiles – quality Observe here that the 1° dim is weakly dominated and that it explains only 2 profiles. Also notice that Bio_tech, which is the least frequent profile, is characterized by a very high quality and also dominates the 2° dim. In order to visualise the impact of the rare (row/column) profiles on the results of correspondence analysis, we consider the correspondence maps. Slide 50: 50 Simple correspondence analysis Correspondence maps Also in this case, as in the previous application, it is preferable to project the low mass profiles as supplementary points, since they influence the structure of association captured by the dimensions too much. Actually, we could project as supplementary points only the 3 rare countries, which have a strong impact on the dimensions.
Nevertheless, we will initially also project Bio_tech as a supplementary point. Slide 51: 51 Simple correspondence analysis Analysis with supplementary profiles Inertia and Chi-Square Decomposition Notice that the Chi-square has changed here, but we cannot compare it with the previous one, since the contingency table on which correspondence analysis is based has changed. Also notice that the number of dimensions is lower now, since we projected one column profile (bio_tech) and 3 row profiles (Iceland, Luxembourg, Switzerland) as supplementary points. 2 dimensions seem appropriate in this case. Slide 52: 52 Simple correspondence analysis – Row profiles This is just to let you notice that bio_tech, the supplementary column profile, is no longer considered here. Slide 53: 53 Simple correspondence analysis – Column profiles Again, notice that the supplementary row profiles are not taken into account in the definition of the column profiles. We observed the existence of a strong attraction between Switzerland and Bio_tech. By projecting these points as supplementary, we do not consider the attraction/repulsion between the other countries and Bio_tech and/or between the other topics and the excluded countries. Slide 54: 54 Simple correspondence analysis Analysis with supplementary profiles – Row profiles / quality Slide 55: 55 Simple correspondence analysis Analysis with supplementary profiles – Column profiles / quality Slide 56: 56 Simple correspondence analysis Legend (colors): CR = blue, H1 = red, M2 = green, H2 = cyan, M1M2 = magenta, H1M2 = orange, M1 = gold, SUP_COL = lilac. Since Switzerland and Bio_tech are projected as supplementary points, the attraction between these two profiles is not accounted for in this map.
We also considered the solution obtained when only countries are projected as supplementary points. Slide 57: 57 Simple correspondence analysis Switzerland, Luxembourg, Iceland If we project Bio_tech as an active profile, its relationship with Switzerland is correctly described.
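The remark that the Chi-square and the number of dimensions change once profiles become supplementary can be checked with a small pure-Python sketch. It computes the Pearson Chi-square, the total inertia (Chi-square divided by n) and the maximum number of dimensions, min(rows, columns) - 1, for a full table and for the same table after dropping a rare row and a rare column. The table and the helper names are hypothetical and only meant to show the mechanics.

```python
def chi_square(table):
    """Pearson chi-square statistic and total count of a contingency table."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n   # count under independence
            chi2 += (observed - expected) ** 2 / expected
    return chi2, n

def summarize(table):
    """Chi-square, total inertia and maximum number of CA dimensions."""
    chi2, n = chi_square(table)
    return {"chi2": chi2,
            "n": n,
            "total_inertia": chi2 / n,
            "max_dims": min(len(table), len(table[0])) - 1}

def drop(table, rows=(), cols=()):
    """Active table left after removing the given row and column indices."""
    return [[v for j, v in enumerate(row) if j not in cols]
            for i, row in enumerate(table) if i not in rows]

# Hypothetical counts: 5 countries x 4 topics; the last row and last column are rare
full_table = [[120, 80, 40, 2],
              [90, 110, 30, 1],
              [60, 40, 70, 3],
              [30, 50, 45, 2],
              [0, 1, 0, 4]]

full = summarize(full_table)
active = summarize(drop(full_table, rows={4}, cols={3}))

print("full table:  ", full)
print("active table:", active)
```

Because the two statistics are computed on different tables with different totals, they are not directly comparable, which is the point made on Slide 51.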
