# 8323 Stats - Simple Multiple Correspondence


Published on May 11, 2008

Author: untellectualism

Source: authorstream.com

### Slide 1: Multiple Correspondence Analysis

- Motivation: why multiple correspondence analysis
- Indicator and Burt matrices
- Introduction to multiple correspondence analysis

### Slide 2: Example

4th Framework Programme for Research and Technological Development: data covering all the research and technological development (RTD) activities funded by the European Commission during the period 1994-1998.

We are now interested in the association between more than two categorical variables. For example, we want to study the main attractions/avoidances between the levels of three characteristics of the funded projects: topic, duration and size of the team. To this aim, we have to extend simple correspondence analysis to more than two variables.

### Slide 3: Multiple Correspondence Analysis

Multiple correspondence analysis (MCA) is an extension of simple correspondence analysis (SCA) which allows one to analyze the pattern of relationships among several categorical variables. As such, it can also be seen as a generalization of principal component analysis to the case where the variables to be analyzed are categorical instead of quantitative. Technically, MCA is obtained by applying a standard correspondence analysis to an indicator matrix (i.e., a matrix whose entries are 0 or 1) whose rows are the cases and whose columns are all the categories of the variables involved. This analysis is equivalent to a correspondence analysis of a particular matrix, called the Burt matrix, obtained as the inner product of the indicator matrix with itself.

### Slide 4: Indicator matrix

Each nominal variable comprises several levels, and in the indicator matrix each of these levels is coded as a binary variable. For example, in our application dur_class (the duration of the project) is a categorical variable with 5 levels, dur1-dur5; 5 binary variables will then be defined, one for each level.
The pattern for a project (row) with duration dur3 will be 0 0 1 0 0 (with respect to this variable). The complete data table is composed of binary columns, with one and only one column per nominal variable taking the value "1". This coding scheme implies that each row has the same total, which for SCA means that each row has the same mass; the column masses, instead, coincide with the frequencies of the levels. (Please notice that, for the sake of convenience, we limit attention here to only 8 out of the 9 levels of the variable topic.)

### Slide 5: Indicator matrix

The indicator matrix is treated as a contingency table, with the cases (here, the projects) on the rows and all the possible categories on the columns, and an SCA is applied to this table. The interpretation in MCA is often based upon proximities between points in a low-dimensional map. Remember that proximities are meaningful only between points from the same set (i.e., rows with rows, columns with columns). Specifically, when two row points are close to each other they tend to select the same levels of the variables. Usually, row profiles are not presented on the map; the focus is on the column profiles (the distributions of the rows conditioned on the categories). For the proximity between column profiles we need to distinguish two cases. Proximity between levels of different variables, say dur2 and team3, means that these levels tend to appear together in the observations (cases characterized by dur2 are also characterized by team3); the reverse is true if two column profiles are far from each other. Levels of the same variable cannot occur together, so a different interpretation is needed: here, proximity between two levels means that the groups of observations associated with them are themselves similar with respect to the other variables.
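The one-binary-column-per-level coding described above is easy to sketch in Python (the slides use SAS; the mini-dataset and its level labels below are invented for illustration):

```python
import pandas as pd

# Hypothetical mini-dataset with the three variables of the example;
# rows and level labels are invented, not the course data.
projects = pd.DataFrame({
    "topic":      ["energy", "telecom", "energy"],
    "dur_class":  ["dur1", "dur3", "dur3"],
    "team_class": ["team2", "team5", "team2"],
})

# Indicator matrix Z: one binary (0/1) column per level of each variable.
Z = pd.get_dummies(projects).astype(int)

# Each row contains exactly one "1" per variable, so every row total
# equals p = 3 and all rows get the same mass in the subsequent SCA.
print(Z.sum(axis=1).tolist())   # [3, 3, 3]

# The column totals are the level frequencies (the column masses).
print(Z.sum(axis=0).tolist())
```

The same coding is what SAS builds internally when MCA is requested on raw categorical data.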
Hence we are analyzing attraction/avoidance between all the possible levels of the considered categorical variables.

### Slide 6: Burt table

The so-called Burt table is obtained as the inner product of the indicator matrix with itself. On the main diagonal we have diagonal matrices whose elements are the frequencies of the categories; the off-diagonal blocks are all the possible two-way contingency tables between the considered variables. It is similar to a correlation matrix, with the correlations replaced by contingency tables, and it is symmetric.

### Slide 7: Burt table

The Burt table associated with an indicator matrix is important in MCA because applying SCA to the Burt table gives the same factors as the analysis of the indicator matrix, but is often computationally easier. The number of dimensions which can be extracted from the Burt table equals q - p, where q is the total number of categories (the number of columns of the Burt table and, also, of the indicator matrix) and p is the number of variables taken into account. In this case the total inertia (the sum of the principal inertias, i.e. of the eigenvalues) equals (q - p)/p. It can be shown that the sum of the squared eigenvalues coincides with the average of the total inertias calculated over all the p² two-way tables in the Burt table. MCA thus analyzes the deviation from the hypothesis of independence for all the possible contingency tables in the Burt matrix, so that multi-way association is based upon the two-way associations (think of the correlation matrix in PCA). The most relevant dimensions are then those which best describe all the possible associations. Notice that the tables on the main diagonal of the Burt table are not proper contingency tables (they describe the marginal distributions of the variables).
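The construction B = ZᵀZ and the properties just listed (symmetry, level frequencies on the diagonal) can be checked directly; a minimal numpy sketch, again on an invented mini-dataset:

```python
import numpy as np
import pandas as pd

# Invented mini-dataset: three projects, three categorical variables.
projects = pd.DataFrame({
    "topic":      ["energy", "telecom", "energy"],
    "dur_class":  ["dur1", "dur3", "dur3"],
    "team_class": ["team2", "team5", "team2"],
})
Z = pd.get_dummies(projects).astype(int).to_numpy()

# Burt table: inner product of the indicator matrix with itself.
B = Z.T @ Z

# Symmetric, with the level frequencies on the main diagonal; the
# off-diagonal blocks are the two-way contingency tables.
assert (B == B.T).all()
assert (np.diag(B) == Z.sum(axis=0)).all()
print(B.shape)   # (6, 6)
```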
The total inertia here is therefore related to the chi-squares describing the two-way associations between all the possible pairs of variables, but it also includes some irrelevant information (the inertias of the tables on the main diagonal).

### Slide 8: Inertias and correction

The Burt matrix also plays an important theoretical role, because the eigenvalues obtained from its analysis give a better approximation of the inertia explained by the factors than the eigenvalues of the indicator matrix. MCA codes the data by creating several binary columns for each variable, with the constraint that one and only one of the columns gets the value 1. This coding scheme creates artificial additional dimensions, because one categorical variable is coded with several columns (the same problem already mentioned for the Burt table, which contains the tables on the main diagonal, not proper contingency tables). As a consequence, the inertia of the solution space is artificially inflated, and therefore the percentage of inertia explained by the first dimensions is severely underestimated. In fact, it can be shown that all the factors with an eigenvalue less than or equal to 1/p simply code these additional dimensions (p being the number of variables). This is the reason why Benzécri introduced a correction formula for the principal inertias obtained from the Burt table: eigenvalues (principal inertias) lower than 1/p are considered not significant, the significant principal inertias are adjusted (we do not enter into the details here), and the contribution of each dimension is calculated by referring only to the modified principal inertias.

### Slide 9: Theory and Practice

- Choose the number of MCA factors
- Evaluate the MCA factors with respect to the column profiles
- Obtain and analyze correspondence maps
- Comment on the position of the column profiles
- Supplementary profiles and variables in MCA

### Slide 10: Example

We want to study the association between the topic of a project (topic, 9 categories), its duration (dur_class, 5 categories) and the size of the team (team_class, 5 categories), so p = 3 and q = 19. As an exploratory analysis we evaluate the two-way associations: the variables are related, so we proceed with multiple correspondence analysis.

### Slide 11: Inertia and chi-square decomposition (Burt table)

Number of dimensions: q - p = 19 - 3 = 16. Total inertia: (q - p)/p = 16/3 = 5.3333. Notice the slow decay of the principal inertias, and the low proportion of inertia explained by the first (most important) dimensions. This tendency, which would usually be interpreted as a lack of structure, is here related to the redundant information included in the total inertia, so we should apply a correction in order to obtain a clearer indication of the relative importance of the dimensions. Benzécri correction: 1/p = 1/3 = 0.33333. Only principal inertias higher than 0.33333 should be considered (the 8th inertia is excluded due to rounding).

### Slide 12: Benzécri-adjusted inertia decomposition (Burt table)

This correction adjusts for the redundant information and, in this case, suggests considering only 2 dimensions. Nevertheless, it usually gives an optimistic estimate of the percentage of explained inertia, so it is worth retaining a larger number of dimensions, explaining a fairly high proportion of the total adjusted inertia. Here, for example, we consider 4 dimensions.

### Slide 13: Quality of the solution

Summary statistics for the column points; partial contributions to inertia for the column points. Before proceeding, let us look at the maps.
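The counts above are simple arithmetic, and the adjustment the slides leave out has a standard closed form (given by Benzécri for the eigenvalues of the indicator-matrix analysis; treat the function below as a sketch of that formula, not of SAS's output):

```python
# Dimensions and total inertia for the example: p = 3 variables,
# q = 19 categories in total.
p, q = 3, 19
n_dims = q - p               # 16 extractable dimensions
total_inertia = (q - p) / p  # 16/3 = 5.3333...
threshold = 1 / p            # Benzécri cut-off: 0.3333...

def benzecri_adjust(eigenvalues, p):
    """Benzécri-adjusted principal inertias: eigenvalues at or below
    1/p are treated as coding artefacts and set to zero; the others
    are rescaled by the standard closed form."""
    return [((p / (p - 1)) * (ev - 1 / p)) ** 2 if ev > 1 / p else 0.0
            for ev in eigenvalues]

print(n_dims, round(total_inertia, 4), round(threshold, 5))
# 16 5.3333 0.33333
```

The contribution of each retained dimension is then computed relative to the sum of these adjusted inertias only, which is why the adjusted percentages look so much more favorable.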
Here we observe some low-mass profiles with quite high quality which dominate the dimensions.

### Slide 14: Correspondence maps

Envir_protect is a low-mass profile with a relatively high influence on the dimensions. We decide to project this profile as a supplementary point.

### Slide 15: Supplementary points

In SAS, supplementary points can be projected only when MCA is applied to the indicator matrix. This is not a problem since, as we said, the factors obtained from the indicator matrix coincide with those obtained from the Burt table. Nevertheless, in the indicator-matrix case the Benzécri correction cannot be applied. Hence we first apply MCA to the Burt table with the supplementary point simply removed, in order to check whether the number of dimensions to be retained changes. We again consider 4 dimensions.

### Slide 16: Supplementary points

Summary statistics for the column points; partial contributions to inertia for the column points; squared cosines for the column points.

### Slide 17: Supplementary points

| Class | CR | H1 | H1H2 | M1H2 | H1M2 | H2 | NE | M1 | SUP_COL |
|-------|----|----|------|------|------|----|----|----|---------|
| Color | blue | red | green | cyan | magenta | orange | gold | lilac | olive |

1st dimension (H1: red, green, magenta; M1: cyan, lilac): opposes projects with low duration and size (team1) to projects with high duration and size (team5 and team4); team2 and team3 are only weakly related to this dimension. Correspondingly, it opposes projects related to Energy, Electronics and Safety to projects related to Natural resources and Envir_protection, even though we know that the latter point has a low quality of representation. 2nd dimension (H2: orange, green; M2: magenta): provides detail with respect to small team sizes.
On the upper side of the plot we find projects developed by small teams, possibly with high duration, concerning Materials technology (which very weakly attracts Energy; see the squared cosines). On the bottom side we find projects with a low-medium team size (team2) concerning Electronics (which very weakly attracts Safety and Natural resources).

### Slide 18: Supplementary points

| Class | CR | H3 | NE | M4 | H4 | H3M4 | M3 | H3H4 | SUP_COL |
|-------|----|----|----|----|----|------|----|------|---------|
| Color | blue | red | green | cyan | magenta | orange | gold | lilac | olive |

3rd dimension (H3: red, orange, lilac; M3: gold): opposes dur1 to dur3, dur5 and team2. Standards is opposed to dur1 along this dimension, as are Energy and, more weakly, Safety and Nat_res. In the previous map Safety was close to dur1 and dur3; this map emphasizes that the closeness to dur1 should not be taken into account. Nat_res, instead, was already opposed to dur1 in the previous map, so this map emphasizes its relation with team2. Telecom and Electronics are instead associated with dur1 along the 3rd dimension. 4th dimension (H4: magenta, lilac; M4: cyan, orange): opposes team5 to team3/dur5, and Telecom to Mat_tech and Electronics. Mat_tech was already close to dur5 in the previous map, but there it was associated with team1; here the relation with team3/dur5 is described. As for Electronics, the 4th dimension describes its attraction to dur5/team3.

### Slide 19: Supplementary points

Squared cosines for the column points. Consider for example Mat_tech. The most relevant dimension for this profile is the 2nd: Mat_tech is projected close to team1, dur2 and dur5, and opposed to dur3, dur4 and team2. The second most relevant dimension for this profile is the 4th: along it, Mat_tech is projected close to team3/dur5 and opposed to team5. Hence projects with this topic are attracted by team1 with dur2 or dur5, but also by team3 with dur5.
It is interesting to observe that dimensions 1 and 2 are mainly dominated by duration and team size, while the later dimensions better explain the topics. Thus, in the first map only the topics strongly attracted by the particular combinations of duration/size inducing the factors are represented in a satisfactory way.

### Slide 20: Supplementary column points and "multiple regression" for categorical variables

In SCA and in the previous application of MCA, we evaluated the opportunity of plotting some categories/profiles as supplementary points. Usually this is done for low-mass profiles, since rare profiles may influence the results too strongly. We may also want to obtain the map from the most relevant (high-mass) profiles and then project the low-mass profiles onto it as supplementary points, so as to describe the most relevant characteristics as well as possible. A different approach consists in projecting an entire variable as supplementary. In some applications this is done to improve the interpretation of the map, adding information without using it explicitly to build the map. This practice also allows you to perform the equivalent of a multiple regression for categorical variables: (1) the summary statistics for the quality of representation of the columns of the supplementary variables indicate how well the supplementary variable can be "explained" as a function of the variables active in the MCA; (2) the display of the column points in the final coordinate system indicates the nature (e.g., the direction) of the relationships between the categories of the active variables and those of the supplementary variables. This technique (adding supplementary variables to an MCA) is also sometimes called predictive mapping.
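The mechanics behind a supplementary projection can be sketched in plain numpy: run the CA (SVD of the standardized residuals) on the active columns only, then place each supplementary column through the transition formula, i.e. its profile times the row standard coordinates. This is a generic sketch with an invented function name, not SAS's implementation:

```python
import numpy as np

def ca_with_supplementary(N, n_sup, k=2):
    """Simple CA of the first columns of N; the last n_sup columns are
    projected as supplementary points (they get coordinates on the
    first k axes but do not influence them)."""
    active, sup = N[:, :-n_sup], N[:, -n_sup:]
    P = active / active.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    # Standardized residuals from the independence model, and their SVD.
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    # Principal coordinates of the active column points.
    G = Vt.T[:, :k] / np.sqrt(c)[:, None] * sv[:k]
    # Row standard coordinates, used by the transition formula.
    phi = U[:, :k] / np.sqrt(r)[:, None]
    # Supplementary columns: profile times row standard coordinates.
    G_sup = (sup / sup.sum(axis=0)).T @ phi
    return G, G_sup
```

A convenient sanity check: projecting an exact copy of an active column as supplementary reproduces that column's principal coordinates, confirming that supplementary points cannot move the map.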
### Slide 21: Supplementary variables

In the previous application we studied the relationships between the variables describing the "structural" characteristics of the projects: topic, duration and team size. Now we are interested in evaluating whether, and to what extent, these structural characteristics are related to the nationality (resp_country) and/or to the type (resp_org) of the organisation responsible for the project. We already noted the opportunity of plotting one topic (envir_protect) as a supplementary point. The new map will not change at all, since the categories of the supplementary variables are simply projected onto the map without influencing it. Do you expect a relationship between the structure of the project and resp_country and resp_org?

### Slide 22: Supplementary variables

As expected, we have a low explanatory power. Nevertheless, we may observe some attractions.

### Slide 23: Supplementary variables

Would the projection of the categories of resp_country and resp_org change if we projected them together as supplementary variables? No, since their categories do not influence the map. Note, also, that their mutual association is not described by the map. We do not report the squared cosines for the country profiles here, but they are quite low.
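As a reminder of what these diagnostics measure: the squared cosine of a point on an axis is the share of its squared chi-square distance from the centroid that the axis recovers, and over all axes the shares sum to 1. A minimal sketch (the coordinates below are invented, not the course data):

```python
import numpy as np

def squared_cosines(G):
    """Squared cosines (quality of representation) for each point,
    given its principal coordinates on all axes (points in rows,
    axes in columns). Each row sums to 1 by construction."""
    return G ** 2 / (G ** 2).sum(axis=1, keepdims=True)

# Invented coordinates for two profiles on three axes.
coords = np.array([[0.9, 0.1, 0.05],
                   [0.1, 0.6, 0.40]])
print(squared_cosines(coords).round(3))
```

When only the first few axes are retained, summing the squared cosines over those axes gives the retained quality of the point; "quite low" values, as for the country profiles above, mean the point is poorly represented in the map.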