Published on February 17, 2014
Biology, Chemistry, Informatics, Statistics
Evaluation of sample processing protocols for the analysis of pumpkin leaf metabolites
Goals: Compare different extraction and drying protocols to identify the “optimal” sample processing approach
Topics:
1. Data quality overview
2. Statistical comparisons
3. Power analysis
Data Quality Overview
Goal: Calculate and visualize the summary statistics for each metabolite/treatment (Use DATA: Pumpkin data 1.csv)
Calculate:
1. Mean and standard deviation (sd)
2. The percent relative standard deviation, %RSD = (sd/mean)*100
Visualize:
1. The relationship between mean and sd, and between mean and %RSD
2. Compare mean metabolite values for all treatments
Exercises:
1. Describe the relationship between analyte mean and sd, and between mean and %RSD.
2. Describe what constitutes an “optimal” method.
3. Which extraction/treatment should be chosen to process further samples?
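The calculations listed above can be sketched in a few lines of Python. This is a minimal illustration only: the treatment names, metabolite names, and intensities are hypothetical and do not come from Pumpkin data 1.csv, and the sample standard deviation (n - 1 denominator) is an assumption.

```python
import statistics

# Hypothetical replicate intensities: {treatment: {metabolite: values}}.
# These numbers are made up for illustration, not taken from the pumpkin data.
data = {
    "treatment_1": {"metabolite_A": [980.0, 1020.0, 1000.0],
                    "metabolite_B": [40.0, 60.0, 50.0]},
    "treatment_2": {"metabolite_A": [990.0, 1010.0, 1000.0],
                    "metabolite_B": [45.0, 55.0, 50.0]},
}

def summarize(values):
    """Return mean, sample sd, and %RSD = (sd / mean) * 100."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample sd, n - 1 denominator (assumed)
    return {"mean": mean, "sd": sd, "rsd_percent": sd / mean * 100}

summary = {
    treatment: {name: summarize(values) for name, values in metabolites.items()}
    for treatment, metabolites in data.items()
}

for treatment, metabolites in summary.items():
    for name, stats in metabolites.items():
        print(treatment, name,
              f"mean={stats['mean']:.1f}",
              f"sd={stats['sd']:.1f}",
              f"%RSD={stats['rsd_percent']:.1f}")
```

In this toy data the high-abundance metabolite has the larger sd but the smaller %RSD, which is the pattern the exercises ask about.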
Summary statistics
Mean vs. SD
• Mean and sd are highly correlated
• Larger means have larger sd
• This effect is also called heteroscedasticity
[Figure: SD vs. mean]
Mean vs. %RSD
• %RSD is minimally correlated with the mean
Can be used as a criterion for:
• Comparing method reproducibility
• Identifying data quality
[Figure: %RSD vs. mean]
Qualities of %RSD
• %RSD (also called the coefficient of variation or CV) is the sd (variation) scaled by the mean (magnitude)
• Removes the relationship between variation and magnitude
• Provides a single value which can be used to compare the variation of a measurement among different treatments/samples
[Figure: the mean and sd of the %RSD for all metabolites for a given treatment]
Data quality
[Figure: %RSD vs. mean, with regions labeled Bad, Moderate, and Good; marked values: %RSD ~40%, mean ~10,000; annotation: Below LOQ (sensitivity)]
Selecting the “optimal” method
Optimal can be:
1. Lowest average %RSD for all measurements
2. Lowest %RSD for measurements of interest
3. Largest number of metabolites passing %RSD cutoff
4. Lowest average %RSD for all measurements passing %RSD cutoff
Using strategy #4 for metabolites with %RSD ≤ 40, Method #2 (ACN/IPA/water 3:3:2) looks optimal…
[Figure: count and %RSD (mean ± sd) by method]
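As a rough sketch of strategies #3 and #4, the snippet below scores each protocol by how many metabolites pass a %RSD cutoff of 40 and by the average %RSD among those that pass. The protocol names and %RSD values are hypothetical placeholders, not the results behind the slide.

```python
import statistics

RSD_CUTOFF = 40.0  # %RSD cutoff quoted in the slide

# Hypothetical per-metabolite %RSD values for three protocols (illustrative only).
rsd_by_protocol = {
    "protocol_A": [12.0, 35.0, 55.0, 80.0],
    "protocol_B": [8.0, 20.0, 30.0, 65.0],
    "protocol_C": [10.0, 25.0, 45.0, 90.0],
}

def score(rsds, cutoff=RSD_CUTOFF):
    """Count metabolites passing the cutoff and average their %RSD."""
    passing = [r for r in rsds if r <= cutoff]
    mean_passing = statistics.mean(passing) if passing else float("inf")
    return {"n_passing": len(passing), "mean_rsd_passing": mean_passing}

scores = {name: score(rsds) for name, rsds in rsd_by_protocol.items()}

# Strategy #3: largest number of metabolites passing the cutoff.
best_by_count = max(scores, key=lambda name: scores[name]["n_passing"])
# Strategy #4: lowest average %RSD among metabolites passing the cutoff.
best_by_mean = min(scores, key=lambda name: scores[name]["mean_rsd_passing"])

print(scores)
print("strategy #3:", best_by_count, "| strategy #4:", best_by_mean)
```

With these made-up values the two strategies pick different protocols, which is why the choice of criterion should be stated before comparing methods.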
Based on Method #2
Analytes with high signal and high %RSD should be further interrogated for explanations of low reproducibility
[Figure labels: Mean, %RSD, %RSD ≤ 40; x-axis: Log Mean]
Statistical comparison of the effects of sample drying
Goals: Identify the effect of treatment (fresh/lyophilized) on the performance of Methods #3-4 (Use DATA: Pumpkin data 2.csv)
[Figure: count and %RSD (mean ± sd)]
Steps:
1. Use a t-test to compare metabolite means for each treatment
2. Adjust the p-values for the false discovery rate (FDR adjusted p-value)
3. Estimate FDR (q-value)
Visualize:
1. Relationship between p-value and FDR adjusted p-value
2. Relationship between FDR adjusted p-value and q-value
3. Box plots for highest and lowest p-value metabolites
Questions:
1. When should you use a one-sample, two-sample or paired t-test, or ANOVA?
*Return to 0-introduction
Hypothesis Testing Strategies
• One-sample t-test is used to compare the mean of one sample to a single reference value (a hypothesized population mean)
• Two-sample t-test is used to compare 2 independent populations
• Paired t-test is used to compare two sets of measurements on the same population (intervention, repeated measures)
• One-way ANOVA (analysis of variance) is used to compare n populations for one factor
• Two-way ANOVA is used to compare n populations for 2 factors
• ANCOVA (analysis of covariance) is used to adjust n populations for a covariate (typically continuous) prior to testing for n factors
• Mixed effects models are a versatile analogue to the linear model or ANOVA/ANCOVA and are typically used to adjust for covariates or variance due to repeated measures
*All of the above are parametric tests, some of which have non-parametric analogues
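To make the difference between the two-sample and paired tests concrete, here is a minimal sketch that computes only the t statistic and degrees of freedom for each. Two assumptions are worth flagging: the two-sample version uses the Welch (unequal variance) form, which the slides do not specify, and the measurements are invented. Converting t to a p-value needs a t-distribution CDF from a statistics library, which is left out here.

```python
import math
import statistics

def welch_t(a, b):
    """Two-sample t statistic (Welch form) and Welch-Satterthwaite df."""
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

def paired_t(a, b):
    """Paired t statistic: a one-sample test on the pairwise differences."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    return t, n - 1

# Hypothetical intensities for one metabolite (illustrative values only).
fresh = [10.0, 12.0, 11.0, 13.0]
lyophilized = [12.0, 15.0, 13.0, 16.0]

t_two_sample, df_two_sample = welch_t(fresh, lyophilized)
t_paired, df_paired = paired_t(fresh, lyophilized)

print(f"two-sample: t={t_two_sample:.3f}, df={df_two_sample:.1f}")
print(f"paired:     t={t_paired:.3f}, df={df_paired}")
```

The same numbers give a much larger paired statistic because pairing removes the between-sample variation; the paired form is only valid when each fresh value really is matched to one lyophilized value.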
p-value vs. FDR adjusted p-value
Benjamini & Hochberg (1995) (“BH”)
• Accepted standard
Bonferroni
• Very conservative
• adjusted p-value = p-value * number of tests (e.g. 0.005 * 148 = 0.74)
[Figure: FDR adjusted p-value vs. p-value]
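Both adjustments are short enough to write out by hand. The sketch below implements the Bonferroni correction and the Benjamini-Hochberg step-up adjustment in plain Python; the list of p-values is hypothetical, while the 0.005 * 148 example comes from the slide.

```python
def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]

def benjamini_hochberg(pvalues):
    """BH adjusted p-values: p * m / rank, made monotone from the largest p down."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value to the smallest
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values (illustrative only).
pvalues = [0.001, 0.008, 0.039, 0.041, 0.20]
bonf = bonferroni(pvalues)
bh = benjamini_hochberg(pvalues)

for p, b, q in zip(pvalues, bonf, bh):
    print(f"p={p:.3f}  Bonferroni={b:.3f}  BH={q:.5f}")

# The slide's Bonferroni example: one p-value of 0.005 among 148 tests.
slide_example = min(1.0, 0.005 * 148)
print(f"slide example: {slide_example:.2f}")
```

On the same input, BH leaves more p-values below 0.05 than Bonferroni does, which is the sense in which Bonferroni is the more conservative of the two.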
p-value vs. q-value
• The q-value can be used to select an appropriate p-value cutoff for an acceptable FDR across the multiple hypotheses tested
• q=0.05 nicely matches the assumptions of p=0.05 for multiple hypotheses tested
• q-value≤0.2 can be acceptable
[Figure: FDR adjusted p-value vs. q-value]
Change in metabolites due to treatment
[Figure legend: effect size, small to large]
Effect of drying is minimal
7 significantly different metabolites out of 148 (5%)
[Figure: -log p-value vs. fold change (relative to fresh); threshold at FDR p-value = 0.05]
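The two axes of such a plot are simple to compute per metabolite. The sketch below is illustrative only: the metabolite names, intensities, and adjusted p-values are hypothetical, the fold change is taken as mean(lyophilized) / mean(fresh), and a base-10 logarithm is assumed because the slide does not state the base.

```python
import math
import statistics

FDR_CUTOFF = 0.05  # FDR adjusted p-value threshold quoted in the slide

# Hypothetical intensities and FDR adjusted p-values (illustrative only).
metabolites = {
    "metabolite_A": {"fresh": [100.0, 110.0, 90.0],
                     "lyophilized": [200.0, 210.0, 190.0],
                     "fdr_p": 0.01},
    "metabolite_B": {"fresh": [50.0, 55.0, 45.0],
                     "lyophilized": [52.0, 48.0, 50.0],
                     "fdr_p": 0.60},
}

points = {}
for name, d in metabolites.items():
    # Fold change relative to fresh (assumed definition).
    fold_change = statistics.mean(d["lyophilized"]) / statistics.mean(d["fresh"])
    points[name] = {
        "fold_change": fold_change,
        "neg_log10_p": -math.log10(d["fdr_p"]),  # base 10 is an assumption
        "significant": d["fdr_p"] <= FDR_CUTOFF,
    }

significant = [name for name, p in points.items() if p["significant"]]
print(points)
print(f"{len(significant)} of {len(points)} metabolites pass the FDR cutoff")
```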
Power analysis
Goals: Use power analysis to plan a follow-up experiment to detect differences in metabolites due to treatment
Steps:
1. Calculate effect size and power for three metabolites
2. Given the observed effect size, calculate the number of samples needed to reach 80% power
Questions:
1. How would you take FDR into account?
Power analysis
• Effect size: scaled difference in means between treatments
• Power: ability to detect a difference when it exists (controls the false negative rate)
• Significance level: probability of being wrong when spotting a difference (controls the false positive rate)
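A minimal sketch of these quantities follows, under stated assumptions: the effect size is computed as Cohen's d with a pooled standard deviation, and the sample size uses the normal approximation for a two-sided, two-sample comparison rather than an exact t-based calculation, so it can come out slightly smaller than dedicated power software reports. The function names and measurements are hypothetical; the effect size of 1.2 is the value quoted on the following slide.

```python
import math
import statistics
from statistics import NormalDist

def cohens_d(a, b):
    """Difference in means scaled by the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(b) - statistics.mean(a)) / math.sqrt(pooled_var)

def samples_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate n per group: 2 * (z_(1-alpha/2) + z_power)^2 / d^2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)

# Hypothetical intensities for one metabolite (illustrative values only).
fresh = [10.0, 12.0, 11.0, 13.0]
lyophilized = [12.0, 15.0, 13.0, 16.0]

d = cohens_d(fresh, lyophilized)
n_observed = samples_per_group(d)
n_slide = samples_per_group(1.2)  # effect size quoted in the slides

print(f"observed effect size d = {d:.2f} -> about {n_observed} samples per group")
print(f"effect size 1.2 -> about {n_slide} samples per group")
```

One simple way to account for multiple testing in this calculation is to pass a stricter `alpha` (for example the p-value cutoff that corresponds to the acceptable FDR), which raises the required sample size.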
Power analysis
The minimum fold change (FC) in means observable by the study can be calculated using the RSD and the estimated effect size needed to reach 0.8 (80%) power given the population size.
RSD = 0.21 and effect size (EF) = 1.2
We can observe a minimum of a 38% change in means at 0.8 power (p = 0.05).