Information about High Dimensional Biological Data Analysis and Visualization

Examples of data analysis and visualization of high dimensional metabolomic data.

State of the art facility producing massive amounts of biological data… >13,000 samples/yr, >160 studies, ~32,000 data points/study

Goals?

Analysis at the Metabolomic Scale

Univariate vs. Multivariate
• Univariate: hypothesis testing (t-Test, ANOVA, etc.)
• Multivariate: predictive modeling (PCA, O-/PLS/-DA)
[Figure: Group 1 vs. Group 2]

Univariate vs. Multivariate
• univariate/bivariate vs. multivariate
• outliers?
• mixed up samples?

Data Complexity
• Data: n samples × m variables (1-D, 2-D, m-D); variable # = dimensionality
• Meta data: experimental design = complexity

Statistical Analysis
• Identify differences in sample population means
• sensitive to distribution shape
• parametric = assumes normality
• error in Y, not in X (Y = mX + error)
• optimal for long data
• assumed independence
• false discovery rate (FDR)
[Figure: data shapes: wide, long, n-of-one]

Achieving “significance” is a function of:
• significance level (α) and power (1-β)
• effect size (standardized difference in means)
• sample size (n)
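
The interplay of α, power, effect size and n can be sketched with a normal approximation. This is a minimal illustration, not part of the original slides: the function name `two_sample_power` and the example values (effect size 0.8, α = 0.05) are assumptions, and it treats the test as a two-sided, two-sample z-test with equal group sizes.

```python
from statistics import NormalDist


def two_sample_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Standardized difference in means, scaled by the per-group sample size
    shift = effect_size * (n_per_group / 2) ** 0.5
    # Probability of rejecting the null in either tail
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Power grows with sample size for a fixed effect size and alpha
for n in (5, 10, 20, 50):
    print(n, round(two_sample_power(0.8, n), 3))
```

With an effect size of zero the function returns α itself, which is a quick sanity check on the approximation.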

False Discovery Rate (FDR)
• Type I Error: False Positives
• Type II Error: False Negatives
• Type I risk = $1-(1-\text{p.value})^{m}$, where m = number of variables tested
FDR correction
• p-value adjustment or estimate of FDR (Fdr, q-value)
Bioinformatics (2008) 24 (12):1461-1462
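
The Type I risk formula above can be evaluated directly. A minimal sketch: the function name `type_i_risk` and the 0.05 threshold are illustrative assumptions, and the formula itself assumes the m tests are independent.

```python
def type_i_risk(p_value, m):
    """Chance of at least one false positive across m independent tests."""
    return 1 - (1 - p_value) ** m


# Risk climbs quickly as the number of tested variables grows
for m in (1, 10, 148):
    print(m, round(type_i_risk(0.05, m), 3))
```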

FDR correction
Benjamini & Hochberg (1995) (“BH”)
• Accepted standard
Bonferroni
• Very conservative
• adjusted p-value = p-value * # of tests (e.g. 0.005 * 148 = 0.74)
[Figure: FDR adjusted p-value vs. p-value]
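
Both adjustments are short enough to write out. The sketch below is illustrative only: the function names and the example p-values are assumptions, and in practice an established implementation (e.g. R's `p.adjust`) would normally be used.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """BH step-up adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, keeping the adjusted values monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


p = [0.001, 0.008, 0.039, 0.041, 0.6]  # hypothetical raw p-values
print(bonferroni(p))
print(benjamini_hochberg(p))
```

BH scales each p-value by m/rank rather than by m, which is why it is less conservative than Bonferroni for all but the smallest p-value.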

Multivariate Analysis
Clustering
• Grouping based on similarity/dissimilarity
Principal Components Analysis (PCA)
• Identify modes of variance in the data
Partial Least Squares (PLS)
• Identify modes of variance in the data correlated with a hypothesis

Cluster Analysis
Use similarity/dissimilarity to group a collection of samples or variables
Approaches
• hierarchical (HCA)
• non-hierarchical (k-NN, k-means)
• distribution (mixture models)
• density (DBSCAN)
• self organizing maps (SOM)
[Figure panels: Linkage, Distribution, k-means, Density]

Hierarchical Cluster Analysis
• similarity/dissimilarity defines “nearness” or distance
• objects are grouped based on linkage methods
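
To make the roles of distance and linkage concrete, here is a minimal agglomerative sketch. Euclidean distance and single linkage are assumptions picked for brevity, and the function name and sample coordinates are hypothetical; the slides do not say which distance or linkage was used.

```python
from math import dist


def single_linkage(points, n_clusters):
    """Agglomerative clustering: repeatedly merge the two nearest clusters."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members
                d = min(dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical samples described by two variables each
samples = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9), (9.0, 0.0)]
print(single_linkage(samples, 3))
```

Swapping `min` for `max` in the linkage step gives complete linkage, which tends to produce more compact clusters.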

Hierarchy of Similarity
• How does my metadata match my data structure?
• Hierarchy of effect sizes
[Figure: similarity]

Projection of Data
• The algorithm defines the position of the light source
[Figure: raw data vs. PCA dimensions, http://www.scholarpedia.org/article/Eigenfaces]
Principal Components Analysis (PCA)
• unsupervised
• maximize variance (X)
Partial Least Squares Projection to Latent Structures (PLS)
• supervised
• maximize covariance (Y ~ X)
[Figure: PC1 vs. PC2; James X. Li, 2009, VisuMap Tech.]

Interpreting PCA Results
• Variance explained (eigenvalues)
• Row (sample) scores and column (variable) loadings

How are scores and loadings related?
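
One standard way to write the relationship, using notation that is assumed here rather than taken from the slides: X is the centered data matrix (n samples × m variables), T holds the scores (n × k), P holds the loadings (m × k, orthonormal columns) and E is the residual left after k components.

```latex
X = T P^{\top} + E, \qquad T = X P
```

Each sample's scores are its coordinates along the loading vectors, and multiplying scores by loadings reconstructs the part of the data captured by the first k components.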

Centering and Scaling PMID: 16762068
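
A minimal sketch of mean centering plus two scalings commonly described as autoscaling (unit variance) and Pareto scaling. The function name, the `method` labels and the example values are assumptions for illustration; the cited reference should be consulted for guidance on when to use each.

```python
from statistics import mean, stdev


def scale_column(values, method="auto"):
    """Mean-center one variable, then optionally scale it."""
    mu = mean(values)
    sd = stdev(values)  # sample standard deviation; zero for a constant column
    centered = [v - mu for v in values]
    if method == "center":
        return centered
    if method == "auto":    # divide by the standard deviation
        return [c / sd for c in centered]
    if method == "pareto":  # divide by the square root of the standard deviation
        return [c / sd ** 0.5 for c in centered]
    raise ValueError(f"unknown method: {method}")


# Two hypothetical metabolites measured on very different scales
abundant = [1200.0, 1500.0, 900.0, 1800.0]
trace = [0.8, 1.1, 0.7, 1.4]
print(scale_column(abundant, "auto"))
print(scale_column(trace, "auto"))
```

After autoscaling both variables have unit variance, so the abundant metabolite no longer dominates a variance-based method such as PCA.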

Use PLS to test a hypothesis
• Partial Least Squares (PLS) is used to identify planes of maximum covariance between X measurements and Y (hypothesis)
[Figure: PLS vs. PCA scores; time = 0 vs. 120 min.]

PLS model validation is critical
Determine in-sample (Q²) and out-of-sample error (RMSEP) and compare to a random model
• permutation tests
• training/testing
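
The permutation idea can be sketched independently of PLS. In the illustration below the "model" is only a stand-in (squared correlation between one predictor and Y), not a PLS fit; the function names and example data are assumptions. The logic carries over: refit after shuffling Y many times and ask how often a random labelling does as well as the real one.

```python
import random


def r_squared(x, y):
    """Squared Pearson correlation, used here as a stand-in fit statistic."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def permutation_p_value(x, y, n_permutations=999, seed=0):
    """Fraction of shuffled-Y fits that match or beat the observed fit."""
    rng = random.Random(seed)
    observed = r_squared(x, y)
    shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if r_squared(x, shuffled) >= observed:
            hits += 1
    # Add one so the observed labelling counts as one of the permutations
    return (hits + 1) / (n_permutations + 1)


x = list(range(12))                  # hypothetical predictor
y = [v + (-1) ** v for v in x]       # hypothetical response tracking x
print(permutation_p_value(x, y))
```

For a real PLS model the statistic inside the loop would be the cross-validated Q² or the test-set RMSEP rather than a simple correlation.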

Biochemical domain information
Databases for organism specific biochemical information:
Multiple organisms
• KEGG
• BioCyc
• Reactome
Human
• HMDB
• SMPDB

Pathway Enrichment Analysis
• enrichment
• topological importance
http://www.metaboanalyst.ca/MetaboAnalyst/faces/UploadView.jsp
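
One common way to score enrichment is an over-representation test based on the hypergeometric distribution. The slides do not state which test was used, so treat this as an assumed example; the function name and the counts are hypothetical.

```python
from math import comb


def enrichment_p_value(hits, changed, pathway_size, background):
    """P(X >= hits) when `changed` metabolites are drawn from `background`."""
    total = comb(background, changed)
    upper = min(changed, pathway_size)
    tail = sum(
        comb(pathway_size, k) * comb(background - pathway_size, changed - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Hypothetical study: 200 measured metabolites, 20 changed,
# and 6 of the changed ones fall in a pathway with 10 measured members
print(enrichment_p_value(hits=6, changed=20, pathway_size=10, background=200))
```

Only one hit would be expected by chance in this example (20 × 10 / 200), so six hits give a small p-value. Topological importance is a separate score and is not covered by this sketch.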

Network Mapping
• Biochemical
• Structural Similarity
doi:10.1186/1471-2105-13-99

Data visualization as a form of analysis
• Dextromethorphan (DM) → dextrorphan (liver CYP2D6)
• Additives in DM: high fructose corn syrup, antioxidants, flavor

Identification of relationships between altered metabolites
• urea cycle
• protein glycosylation
• nucleotide synthesis

Identification of treatment effects

Analysis of differential metabolic responses
[Figure panels: Treatment 1, Treatment 2]

Resources
• DeviumWeb: Dynamic multivariate data analysis and visualization platform
  url: https://github.com/dgrapov/DeviumWeb
• imDEV: Microsoft Excel add-in for multivariate analysis
  url: http://sourceforge.net/projects/imdev/
• MetaMapR: Network analysis tools for metabolomics
  url: https://github.com/dgrapov/MetaMapR
• TeachingDemos: Tutorials and demonstrations
  url: http://sourceforge.net/projects/teachingdemos/?source=directory
  url: https://github.com/dgrapov/TeachingDemos
• CDS Blog: Data analysis case studies
  url: http://imdevsoftware.wordpress.com/

dgrapov@ucdavis.edu
metabolomics.ucdavis.edu
This research was supported in part by NIH 1 U24 DK097154
