Published on February 22, 2014
Metabolomic Data Analysis for the Study of Diseases Dmitry Grapov, PhD
State-of-the-art facility producing massive amounts of biological data… >13,000 samples/yr, >160 studies, ~32,000 data points/study
Analysis at the Metabolomic Scale
Univariate vs. Multivariate • Univariate: hypothesis testing (t-Test, ANOVA, etc.) • Multivariate: predictive modeling (PCA, O-/PLS/-DA) [Figure: Group 1 vs. Group 2]
Univariate vs. Multivariate: univariate/bivariate vs. multivariate views. Outliers? Mixed up samples?
Data Complexity • Data: n samples x m variables • Variable # = dimensionality (1-D, 2-D, m-D) • Meta Data: Experimental Design = complexity
Statistical Analysis • Identify differences in sample population means • sensitive to distribution shape • parametric = assumes normality • error in Y, not in X (Y = mX + error) • optimal for long data • assumed independence • false discovery rate (FDR) [Figure: data shapes: wide, long, n-of-one]
Achieving “significance” is a function of: • significance level (α) and power (1-β) • effect size (standardized difference in means) • sample size (n)
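A minimal sketch of how these four quantities trade off, assuming a two-sided, two-sample comparison of means and the usual normal approximation (the function name `samples_per_group` and the default values are illustrative, not taken from the slides):

```python
from math import ceil
from statistics import NormalDist


def samples_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided, two-sample comparison of means.

    Normal approximation: n = 2 * ((z_(1-alpha/2) + z_power) / d)^2,
    where d is the standardized difference in means (effect size).
    """
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


# Halving the effect size roughly quadruples the required sample size.
for d in (1.0, 0.5, 0.25):
    print(d, samples_per_group(d))
```

Because it uses the normal rather than the t distribution, this approximation slightly underestimates n for small samples.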
False Discovery Rate (FDR) • Type I Error: False Positives • Type II Error: False Negatives • Type I risk = 1 - (1 - p.value)^m, where m = number of variables tested. FDR correction • p-value adjustment or estimate of FDR (Fdr, q-value) Bioinformatics (2008) 24 (12):1461-1462
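The Type I risk formula, 1 - (1 - p.value)^m, can be evaluated directly to see how quickly false positives become near certain as the number of tested variables grows. This sketch assumes the m tests are independent; the function name `type_i_risk` is illustrative.

```python
def type_i_risk(p_value, m):
    """Chance of at least one false positive among m independent tests,
    each carried out at the given per-test threshold."""
    return 1 - (1 - p_value) ** m


# Risk at a per-test threshold of 0.05 for an increasing number of variables.
for m in (1, 10, 148):
    print(m, round(type_i_risk(0.05, m), 4))
```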
FDR correction • Benjamini & Hochberg (1995) (“BH”): accepted standard • Bonferroni: very conservative; adjusted p-value = p-value * # of tests (e.g. 0.005 * 148 = 0.74) [Figure: FDR adjusted p-value vs. p-value]
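Both adjustments are short enough to sketch in plain Python. The function names are illustrative; the BH version follows the standard step-up procedure (each sorted p-value is multiplied by m/rank, then monotonicity is enforced from the largest p-value down).

```python
def bonferroni(p_values):
    """Bonferroni: multiply each p-value by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that an adjusted
    # value never exceeds the adjusted value of a larger p-value.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


p = [0.01, 0.04, 0.03, 0.005]
print(bonferroni(p))
print(benjamini_hochberg(p))
```

On the same input, the BH values are never larger than the Bonferroni values, which is why Bonferroni is described as very conservative.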
Multivariate Analysis • Clustering: grouping based on similarity/dissimilarity • Principal Components Analysis (PCA): identify modes of variance in the data • Partial Least Squares (PLS): identify modes of variance in the data correlated with a hypothesis
Cluster Analysis: Use similarity/dissimilarity to group a collection of samples or variables. Approaches • hierarchical (HCA) • non-hierarchical (k-NN, k-means) • distribution (mixture models) • density (DBSCAN) • self organizing maps (SOM) [Figure panels: Linkage, k-means, Distribution, Density]
Hierarchical Cluster Analysis similarity/dissimilarity defines “nearness” or distance objects are grouped based on linkage methods
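A minimal sketch of the two ingredients named here, assuming Euclidean distance as the measure of "nearness" and a simple agglomerative (bottom-up) merge loop. The function name and the choice of passing `min` or `max` as the linkage rule are illustrative; real analyses would normally use an established library.

```python
from itertools import combinations
from math import dist  # Euclidean distance (Python 3.8+)


def hierarchical_clusters(points, n_clusters, linkage=min):
    """Agglomerative clustering on Euclidean distance.

    linkage=min gives single linkage, linkage=max gives complete linkage.
    Returns clusters as lists of indices into `points`.
    """
    clusters = [[i] for i in range(len(points))]  # start: one cluster per object
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: linkage(
                dist(points[i], points[j])
                for i in clusters[ab[0]]
                for j in clusters[ab[1]]
            ),
        )
        clusters[a] += clusters.pop(b)  # b > a, so index a stays valid
    return clusters


samples = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(hierarchical_clusters(samples, n_clusters=2))
```

Changing the distance measure or the linkage rule can change the resulting grouping, so both choices should be reported with the result.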
Hierarchy of Similarity: How does my metadata match my data structure? Hierarchy of effect sizes. [Figure: similarity hierarchy]
Projection of Data: raw data vs. PCA dimensions (PC1, PC2). The algorithm defines the position of the light source. Principal Components Analysis (PCA) • unsupervised • maximize variance (X). Partial Least Squares Projection to Latent Structures (PLS) • supervised • maximize covariance (Y ~ X). Image sources: http://www.scholarpedia.org/article/Eigenfaces; James X. Li, 2009, VisuMap Tech.
Interpreting PCA Results Variance explained (eigenvalues) Row (sample) scores and column (variable) loadings
How are scores and loadings related?
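One way to see the relationship, sketched with NumPy's singular value decomposition (an assumption; the slides do not name an implementation): the mean-centered data equal scores times the transpose of the loadings, and each sample's scores are its centered values projected onto the loadings. The function name `pca` and the random example data are illustrative.

```python
import numpy as np


def pca(X, n_components=2):
    """PCA via SVD of the mean-centered data matrix (rows = samples)."""
    Xc = X - X.mean(axis=0)                      # center each variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]   # one row per sample
    loadings = Vt[:n_components].T                    # one row per variable
    explained = (s ** 2 / np.sum(s ** 2))[:n_components]  # variance explained
    return scores, loadings, explained


rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))                      # 6 samples x 3 variables
scores, loadings, explained = pca(X, n_components=3)

Xc = X - X.mean(axis=0)
print(np.allclose(scores @ loadings.T, Xc))      # data = scores x loadings^T
print(np.allclose(Xc @ loadings, scores))        # scores = projected data
print(explained)
```

With fewer components than variables, scores times loadings transposed gives a low-rank approximation of the centered data rather than an exact reconstruction.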
Centering and Scaling PMID: 16762068
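A small sketch of three common pretreatments for a single variable: mean centering, autoscaling (divide by the standard deviation) and Pareto scaling (divide by the square root of the standard deviation). The function name `scale_variable` and the method labels are illustrative, and the choice of these three options is an assumption rather than something listed on the slide.

```python
from math import sqrt
from statistics import mean, stdev


def scale_variable(values, method="auto"):
    """Center one variable and optionally scale it.

    method: "center" (mean centering only), "auto" (unit variance),
            or "pareto" (divide by the square root of the standard deviation).
    """
    if method not in ("center", "auto", "pareto"):
        raise ValueError(f"unknown method: {method}")
    mu = mean(values)
    centered = [v - mu for v in values]
    if method == "center":
        return centered
    sd = stdev(values)  # sample standard deviation
    divisor = sd if method == "auto" else sqrt(sd)
    return [c / divisor for c in centered]


abundant = [1000.0, 1200.0, 900.0, 1100.0]   # high-intensity metabolite
trace = [1.0, 1.2, 0.9, 1.1]                 # low-intensity metabolite
for method in ("center", "auto", "pareto"):
    print(method, stdev(scale_variable(abundant, method)),
          stdev(scale_variable(trace, method)))
```

After centering alone the abundant variable still dominates the variance; autoscaling gives both variables equal weight, and Pareto scaling sits in between.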
Use PLS to test a hypothesis. Partial Least Squares (PLS) is used to identify planes of maximum correlation between X measurements and Y (hypothesis). [Figure: PCA vs. PLS; time = 0 vs. 120 min.]
PLS model validation is critical. Determine in-sample (Q2) and out-of-sample error (RMSEP) and compare to a random model: • permutation tests • training/testing
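The permutation idea can be sketched independently of PLS: compute a fit statistic on the real data, then recompute it many times after shuffling Y, and ask how often a random model does at least as well. In this sketch the "model" is a stand-in (the squared Pearson correlation between one predictor and Y), not a PLS model, and all names and data are illustrative.

```python
import random


def r_squared(x, y):
    """Squared Pearson correlation, used here as a stand-in fit statistic."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


def permutation_p_value(x, y, score=r_squared, n_permutations=999, seed=0):
    """Compare the observed score with scores obtained after shuffling y."""
    rng = random.Random(seed)
    observed = score(x, y)
    y_perm = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(y_perm)                 # break the X-Y relationship
        if score(x, y_perm) >= observed:
            hits += 1
    # Count the observed labeling as one of the possible outcomes.
    return observed, (hits + 1) / (n_permutations + 1)


x = list(range(20))
y = [2 * v + (-1) ** v for v in x]          # strong, slightly noisy trend
print(permutation_p_value(x, y))
```

For a real PLS model the same loop would wrap the full fitting and cross-validation procedure, so that every permuted model is built exactly like the real one.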
Biochemical domain information Databases for organism specific biochemical information: Multiple organisms •KEGG •BioCyc •Reactome Human •HMDB •SMPDB
Pathway Enrichment Analysis • enrichment • topological importance http://www.metaboanalyst.ca/MetaboAnalyst/faces/UploadView.jsp
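The enrichment part is often formulated as an over-representation test: given how many altered metabolites fall in a pathway, how surprising is that count under random selection? The sketch below uses a hypergeometric tail probability, which is one common formulation and an assumption here; it does not cover topological importance, and the function and parameter names are illustrative.

```python
from math import comb


def enrichment_p_value(hits, pathway_size, selected, background):
    """P(X >= hits) under the hypergeometric distribution.

    background:   number of metabolites that could have been selected
    pathway_size: number of background metabolites in the pathway
    selected:     number of altered (selected) metabolites
    hits:         number of selected metabolites that are in the pathway
    """
    upper = min(pathway_size, selected)
    total = comb(background, selected)
    tail = sum(
        comb(pathway_size, k) * comb(background - pathway_size, selected - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 6 of 20 altered metabolites fall in a 15-member
# pathway, out of 200 measured metabolites.
print(enrichment_p_value(hits=6, pathway_size=15, selected=20, background=200))
```

When many pathways are tested, the resulting p-values need the same multiple-testing correction as any other set of hypothesis tests.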
Network Mapping Biochemical Structural Similarity doi:10.1186/1471-2105-13-99
Data visualization as a form of analysis. [Figure: dextromethorphan (DM) is converted to dextrorphan by liver CYP2D6; additives in DM: • high fructose corn syrup • antioxidants • flavor]
Identification of relationships between altered metabolites [Figure labels: urea cycle, protein glycosylation, nucleotide synthesis]
Identification of treatment effects
Analysis of differential metabolic responses Treatment 1 Treatment 2
Resources • DeviumWeb: Dynamic multivariate data analysis and visualization platform, url: https://github.com/dgrapov/DeviumWeb • imDEV: Microsoft Excel add-in for multivariate analysis, url: http://sourceforge.net/projects/imdev/ • MetaMapR: Network analysis tools for metabolomics, url: https://github.com/dgrapov/MetaMapR • TeachingDemos: Tutorials and demonstrations, url: http://sourceforge.net/projects/teachingdemos/?source=directory and url: https://github.com/dgrapov/TeachingDemos • CDS Blog: Data analysis case studies, url: http://imdevsoftware.wordpress.com/
firstname.lastname@example.org metabolomics.ucdavis.edu This research was supported in part by NIH 1 U24 DK097154