# bioprez1


Published on November 17, 2007

Author: Pravez

Source: authorstream.com

## Correspondence analysis applied to microarray data

Kurt Fellenberg, Nicole C. Hauser, Benedikt Brors, Albert Neutzner, Jörg D. Hoheisel, Martin Vingron

- http://www.dkfz-heidelberg.de/funct_genome/PDF-Files/PNAS-98-(2001)-10781.pdf
- www.pnas.org/cgi/doi/10.1073/pnas.181597298

## Principal Component Analysis

- Given N data vectors in k dimensions, find c <= k orthogonal vectors that best represent the data.
- The original data set is reduced to one consisting of N data vectors on c principal components (reduced dimensions).
- Each data vector is a linear combination of the c principal component vectors.
- Project onto the subspace that preserves most of the data variability.

## Correspondence analysis

CA = PCA for categorical variables.

Example: dataset X, 27 dog breeds, 7 categorical variables. Each variable is coded with one indicator column per level (Function levels: C = Company, H = Hunt, U = Utility).

| Name | Height (- + ++) | Weight (- + ++) | Speed (- + ++) | Intelligence (- + ++) | Affection (- +) | Aggressivity (- +) | Function (C H U) |
|---|---|---|---|---|---|---|---|
| 1. Boxer | 0 1 0 | 0 1 0 | 0 1 0 | 0 1 0 | 0 1 | 0 1 | 1 0 0 |
| … | | | | | | | |
| 27. Caniche | 1 0 0 | 1 0 0 | 0 1 0 | 0 0 1 | 0 1 | 1 0 | 1 0 0 |

CA studies the dependence between 2 categorical variables, here Height and Function. It works on the crosstable N:

| Height / Function | C | H | U | marginals |
|---|---|---|---|---|
| - | 6 | 1 | 0 | 7 |
| + | 3 | 2 | 0 | 5 |
| ++ | 1 | 6 | 8 | 15 |
| marginals | 10 | 9 | 8 | 27 |

In general notation the cells are n_ij, with row marginals n_i+, column marginals n_+j and grand total n.

- Rows of the crosstable (categories of the first variable) are seen as distributions over the categories of the second variable (dimension = number of categories of the second variable).
- Distance between points (distributions): mutual information (KL distance).
- Projection onto the subspace that preserves most of the "variability".

Divide each row by its total, so that entry (i, j) becomes n_ij / n_i+:

| Height / Function | C | H | U | row total |
|---|---|---|---|---|
| - | 6/7 | 1/7 | 0/7 | 7 |
| + | 3/5 | 2/5 | 0/5 | 5 |
| ++ | 1/15 | 6/15 | 8/15 | 15 |

- Each row becomes a point in probability space over the categories of the second variable (the conditional distribution given the category of the first variable).
- Number of points = number of categories of the first variable = m.
- Dimension of the space = number of categories of the second variable = l.
- Distance between points (probabilities): weighted Euclidean distance, low when the variables are independent (the data can be transformed so that the usual Euclidean distance applies).
- Each row i is considered with a weight (the formula is not preserved in this transcript).
- Measure of total variability: I = chi-square statistic = measure of dependence between the two variables.
- CA visualizes the cells that contribute most to the dependence: if a cell n_ih has an outstanding value, then both row i and column h lie far from g, the centre of gravity, in the same direction.

Dimension reduction:

- Project (in the chi-square norm) onto the subspace that preserves most of the variability (dependence).
- New variables = linear combinations of the initial ones.
- As in PCA, the solution is expressed in terms of eigenvalues/eigenvectors of a matrix derived from N:
  - each eigenvalue gives the proportion of variability preserved;
  - measures of how well each point is represented in the subspace;
  - measures of the contribution of each point/category to determining the optimal subspace;
  - the "meaning" of the subspace.

| Height / Function | C | H | U | row total | CA1 | CA2 |
|---|---|---|---|---|---|---|
| - | 6/7 | 1/7 | 0/7 | 7 | 1.10 | -0.92 |
| + | 3/5 | 2/5 | 0/5 | 5 | 0.85 | 1.23 |
| ++ | 1/15 | 6/15 | 8/15 | 15 | -0.84 | 0.02 |

- Close CA1 values (together with a good representation of the points in the subspace) mean similar categories. This makes it easy to identify similar categories of the first variable (Height) in the low-dimensional plot: the - and + heights have similar functions.
- If two "identical categories" are joined, the chi-square distances do not change.

## Repeat everything for transpose(N)

Each column of N is divided by its total (10, 9, 8):

| Function / Height | - | + | ++ | CA1 | CA2 |
|---|---|---|---|---|---|
| C | 6/10 | 3/10 | 1/10 | 1.04 | -0.10 |
| H | 1/9 | 2/9 | 6/9 | -0.32 | 0.43 |
| U | 0/8 | 0/8 | 8/8 | -0.94 | -0.37 |

- Each column becomes a point in probability space over the categories of the first variable (the conditional distribution given the category of the second variable).
- Number of points = number of categories of the second variable = l.
- Dimension of the space = number of categories of the first variable = m.
- CA: new variables = linear combinations of the initial ones that best preserve the dependence.
- Close CA1 values (together with a good representation of the points in the subspace) mean similar categories. This makes it easy to identify similar categories of the second variable (Function) in the low-dimensional plot: the U and H functions have similar heights.

## Overlap the two plots

| Function / Height | - | + | ++ | CA1 | CA2 |
|---|---|---|---|---|---|
| C | 6/10 | 3/10 | 1/10 | 1.04 | -0.10 |
| H | 1/9 | 2/9 | 6/9 | -0.32 | 0.43 |
| U | 0/8 | 0/8 | 8/8 | -0.94 | -0.37 |
| CA1 (Height) | 1.10 | 0.85 | -0.84 | | |
| CA2 (Height) | -0.92 | 1.23 | 0.02 | | |

- The CA values in one plot are (up to a scale) weighted means of the CA values in the other plot, with weights given by the conditional probabilities. For C on the first axis: 1.04 = (6/10 × 1.10 + 3/10 × 0.85 + 1/10 × (-0.84)) × constant.
- Include "standard coordinates" = virtual rows concentrated on one column: (1 0 0), (0 1 0), (0 0 1).
- Categories of different variables that lie close to the extremes of the axes and close to each other are highly correlated: Utility dogs are big; Company dogs are small (see also gene shaving classification).

## "close to the extremes of the axes"

If the rows and columns are reordered by their first CA coordinate, cells with high values generally end up on the diagonal.

| Height / Function | C | H | U | CA1 |
|---|---|---|---|---|
| - | 6 | 1 | 0 | 1.10 |
| + | 3 | 2 | 0 | 0.85 |
| ++ | 1 | 6 | 8 | -0.84 |
| CA1 | 1.04 | -0.32 | -0.94 | |

## Extension

Treat X as N: a crosstable for two variables with 27 and 6 categories respectively.

| Name | Height (- + ++) | Function (C H U) | Row profile (- + ++ C H U) | CA1 | CA2 |
|---|---|---|---|---|---|
| 1. Boxer | 0 1 0 | 1 0 0 | 0 1/2 0 1/2 0 0 | 0.45 | 0.88 |
| … | | | | | |
| 27. Caniche | 1 0 0 | 1 0 0 | 1/2 0 0 1/2 0 0 | 0.91 | 0.02 |
| CA1 (columns) | | | 1.2 0.85 -0.84 1.04 -0.32 -0.4 | | |

- The plot from transpose(X) is identical to the overlapped plots above.
- The new plot from X adds a point for each dog breed.
- Relationship between Height/Function and dog breed: the Caniche is a small company dog.

## Multiple correspondence analysis

Use the whole X (all the variables, i.e. the 19 indicator columns of the example table above) as the crosstable.

Column coordinates as listed in the source (ten values per axis):

- CA1: 0.32, 0.60, -0.89, -0.35, 0.37, -0.34, -0.84, 0.78, 0.40, -0.43
- CA2: -1.04, 0.89, 0.37, -0.81, 0.29, 0.46, -0.29, 0.27, 0.19, -0.21

Discovering association rules (based on correlation):

- Company dogs are small, with high affection.
- Utility dogs are big, fast, aggressive.
- Hunt dogs are very intelligent.

Use them for classification.

## What Is Association Mining?
Association rule mining: finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.

Frequent pattern: a pattern (set of items, sequence, etc.) that occurs frequently in a database [AIS93].

Motivation: finding regularities in data.

- What products were often purchased together? Beer and diapers?!
- What are the subsequent purchases after buying a PC?
- What kinds of DNA are sensitive to this new drug?
- Can we automatically classify web documents?

## Basic Concepts: Frequent Patterns and Association Rules

| Transaction-id | Items bought |
|---|---|
| 10 | A, B, C |
| 20 | A, C |
| 30 | A, D |
| | B, E, F |

- Itemset X = {x1, …, xk}.
- Find all the rules X → Y with minimum confidence and support:
  - support, s: probability that a transaction contains both X and Y;
  - confidence, c: conditional probability that a transaction containing X also contains Y.

Let min_support = 50%, min_conf = 50%:

- A → C (50%, 66.7%)
- C → A (50%, 100%)

## Algorithms for scalable mining of (single-dimensional Boolean) association rules in transactional databases

Apriori pruning principle: if any itemset is infrequent, its supersets should not be generated or tested.
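The support and confidence definitions and the pruning principle above can be illustrated on the four example transactions. The following is a minimal sketch, not a scalable implementation: the function names `support`, `apriori` and `rules` are illustrative, and the transaction ids are left out because only the item sets matter (the fourth id does not appear in the source).

```python
from itertools import combinations

# Item sets of the four example transactions (ids omitted).
transactions = [
    {"A", "B", "C"},
    {"A", "C"},
    {"A", "D"},
    {"B", "E", "F"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def apriori(transactions, min_support):
    """Return {frequent itemset: support}, growing candidates level by level."""
    items = sorted({item for t in transactions for item in t})
    level = [frozenset([i]) for i in items]
    frequent = {}
    k = 1
    while level:
        # Count support and keep only candidates that reach the threshold.
        current = {c: support(c, transactions) for c in level}
        current = {c: s for c, s in current.items() if s >= min_support}
        frequent.update(current)
        # Join frequent k-itemsets into (k+1)-candidates ...
        candidates = {a | b for a in current for b in current if len(a | b) == k + 1}
        # ... and prune every candidate that has an infrequent k-subset.
        level = [
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k))
        ]
        k += 1
    return frequent


def rules(frequent, min_conf):
    """Yield (X, Y, support, confidence) for rules X -> Y from frequent itemsets."""
    for itemset, s in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), r):
                lhs = frozenset(lhs)
                conf = s / frequent[lhs]  # support(X and Y) / support(X)
                if conf >= min_conf:
                    yield lhs, itemset - lhs, s, conf


frequent = apriori(transactions, min_support=0.5)
found = {
    (tuple(sorted(x)), tuple(sorted(y))): (s, round(c, 3))
    for x, y, s, c in rules(frequent, min_conf=0.5)
}
```

With both thresholds at 50%, `found` holds the two rules from the example: A → C with support 0.5 and confidence 0.667, and C → A with support 0.5 and confidence 1.0.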
Method: generate length-(k+1) candidate itemsets from length-k frequent itemsets, then test the candidates against the database.

Challenges:

- Multiple scans of the transaction database
- Huge number of candidates
- Tedious workload of support counting for candidates

Construct an FP-tree from a transaction database:

- For each frequent item, construct its conditional pattern base, and then its conditional FP-tree.
- Repeat the process on each newly created conditional FP-tree.
- Stop when the resulting FP-tree is empty or contains only one path; a single path generates all the combinations of its sub-paths, each of which is a frequent pattern.

## Association-Based Classification

Several methods for association-based classification:

- ARCS: quantitative association mining and clustering of association rules (Lent et al'97). It beats C4.5 mainly in scalability, and also in accuracy.
- Associative classification (Liu et al'98): mines high-support, high-confidence rules of the form "cond_set => y", where y is a class label.
- CAEP (classification by aggregating emerging patterns) (Dong et al'99):
  - Emerging patterns (EPs): the itemsets whose support increases significantly from one class to another.
  - Mine EPs based on minimum support and growth rate.

## Table 1. Cell-cycle data as used in analysis

- The raw intensity data, as obtained from AIS imaging software (Imaging Research, St. Catharines, ON, Canada), were normalized.
- The normalized data matrix was filtered for genes with positive min-max separation for at least one of the conditions under study (2).
- The data were shifted to a positive range by adding the minimum + 1.

| Gene | alpha0 (M/G1) | alpha7 (M/G1) | alpha14 (G1) | alpha21 (G1) | alpha28 (S) | alpha35 (S) | … |
|---|---|---|---|---|---|---|---|
| YHR126C | 5.81 | 5.73 | 6.01 | 5.48 | 5.37 | 5.23 | … |
| YOR066W | 5.62 | 5.81 | 6.02 | 5.28 | 5.02 | 5.23 | … |
| hxt4 | 5.78 | 6.21 | 6.02 | 5.5 | 5.58 | 5.21 | … |
| PCL9 | 4.64 | 5.39 | 4.89 | 5.19 | 4.96 | 5.62 | … |
| mcm3 | 5.38 | 5.8 | 6.13 | 5.74 | 4.52 | 5.22 | … |
| … | … | … | … | … | … | … | … |

- 800 genes × 73 hybridizations.
- 4 cell-cycle arrest methods (hybridizations per method: 18 alpha, 24 cdc15, 17 cdc28, 14 elu).
- Samples from each method were drawn and their cell-cycle phase was classified into 5 classes.
- A link to a database with information (meaning, functionality, etc.) is provided for each gene.

## Slide 20

Each cell-cycle phase is colored differently: (M/G1), (G1), (S), (G2), (M).

- The hybridizations separate according to their cell-cycle phase (one phase = one region of the plot).
- The G1 phase is strongly associated with the histone gene cluster.
- The cdc15-30 hybridization, classified yellow, behaves green (it is located in the green region).
- cdc15-70
- cdc15-80
- This suggests an improper phase classification for these samples (a check against the profiles proves this correct).
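The correspondence analysis steps from the dog example earlier in the deck (row and column profiles, chi-square statistic as total variability, overlapped row and column plots) can be sketched with NumPy on the Height/Function crosstable. This is a minimal sketch under stated assumptions: it uses the standard formulation of simple CA through the SVD of standardized residuals, with row weights n_i+/n and total inertia equal to the chi-square statistic divided by n. The slides do not say how their CA1/CA2 values were computed, so signs and scaling may differ from the figures quoted there. The function name `correspondence_analysis` is illustrative.

```python
import numpy as np

# Height (rows: -, +, ++) by Function (columns: C, H, U) counts from the slides.
N = np.array([[6, 1, 0],
              [3, 2, 0],
              [1, 6, 8]], dtype=float)


def correspondence_analysis(N):
    """Simple CA of a two-way contingency table via SVD of standardized residuals."""
    n = N.sum()
    P = N / n                         # correspondence matrix
    r = P.sum(axis=1)                 # row masses n_i+ / n (assumed row weights)
    c = P.sum(axis=0)                 # column masses n_+j / n
    # Standardized residuals: (observed - expected) / sqrt(expected).
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    keep = sv > 1e-12                 # drop the trivial zero dimension
    U, sv, Vt = U[:, keep], sv[keep], Vt[keep]
    rows = (U / np.sqrt(r)[:, None]) * sv      # principal row coordinates
    cols = (Vt.T / np.sqrt(c)[:, None]) * sv   # principal column coordinates
    return r, c, sv, rows, cols


r, c, sv, rows, cols = correspondence_analysis(N)

inertia = (sv ** 2).sum()                      # total inertia = chi-square / n
explained = sv ** 2 / inertia                  # share of variability per axis
row_profiles = N / N.sum(axis=1, keepdims=True)

# Transition formula: each row coordinate is a weighted mean of the column
# coordinates (weights = the row profile), rescaled by 1 / singular value.
rows_from_cols = row_profiles @ cols / sv
```

`rows_from_cols` reproduces `rows`, which is the "weighted means up to a scale" relation that allows the row and column plots to be overlapped; `explained` gives the proportion of variability carried by each axis.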
