Some Thoughts on Replicability in Science
Yoav Benjamini, Tel Aviv University
www.math.tau.ac.il/~ybenja
SAMSI, July 2006

Based on joint work with:
- Ilan Golani, Department of Zoology, Tel Aviv University
- Greg Elmer and Neri Kafkafi, Behavioral Neuroscience Branch, National Institute on Drug Abuse/IRP, Baltimore, Maryland
- Dani Yekutieli, Anat Sakov, Ruth Heller and Rami Cohen, Department of Statistics, Tel Aviv University
- Dani Yekutieli and Yosi Hochberg, Department of Statistics, Tel Aviv University

Outline of Lecture
- Prolog
- The replicability problems in behavior genetics: addressing strain*lab interaction; addressing multiple endpoints
- The replicability problems in medical statistics
- The replicability problems in functional magnetic resonance imaging (fMRI)
- Epilog

1. Prolog

J. W. Tukey's last paper (with Jones and Lewis) was an entry on Multiple Comparisons for the International Encyclopedia of Statistics. It started with a general discussion: multiple comparisons addresses "a diversity of issues ... that tend to be important, difficult, and often unresolved." Multiple comparisons; multiple determinations; selection of one or more candidates; selection of variables; selecting their transformations; etc. (... his usual advice: there need not be a single best ...)

The Mixed Puzzle

The Encyclopedia entry then included two issues in detail:
- The False Discovery Rate (FDR) approach in pairwise comparisons
- Random effects vs fixed effects ANOVA

From the entry (Jones, Lewis and Tukey):

"Two alternatives, 'fixed' and 'variable', are not enough. A good way to provide a reasonable amount of realism is to define 'c' by

  appropriate error term = f-error term + c [r-error term - f-error term]

... It pays then to learn as much as possible about values of c in the real world."

But what has that to do with multiple comparisons?
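Tukey's compromise is easy to state concretely. A minimal Python sketch (the function name is illustrative; the two sample error terms are the residual and interaction mean squares from the mouse-phenotyping ANOVA table shown later in the talk, and c = 1.6 echoes an example value from a later slide):

```python
def compromise_error_term(f_error: float, r_error: float, c: float) -> float:
    """Jones, Lewis and Tukey's compromise: c = 0 recovers the
    fixed-effects error term, c = 1 the random-effects one."""
    return f_error + c * (r_error - f_error)

# Illustrative: residual MS (fixed yardstick) vs interaction MS
# (random yardstick) from the ANOVA table appearing later.
for c in (0.0, 0.5, 1.0, 1.6):
    print(c, compromise_error_term(2.29, 6.87, c))
```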
2. Behavior genetics

Study the genetics of behavioral traits: hearing, sight, smell, alcoholism, locomotion, fear, exploratory behavior. Compare behavior between inbred strains, crosses, knockouts... Number of behavioral endpoints: ~200 and growing.

The entry Tukey wrote was about replicability.

The search for replicable scientific methods

Fisher's The Design of Experiments (1935): "In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us statistically significant results." (p. 14) That is, the significance level is interpreted directly in terms of replications of the experiment. The discussion motivates the inclusion of results more extreme than the observed one in the rejection region.

Replicability

"Behavior Genetics in Transition" (Mann, Science '94): "...jumping too soon to discoveries..." (and press discoveries) raises the issue of replicability. Mann identifies statistics as a major source of troubles, yet did not mention the two main themes we'll address. The common cry: lack of standardization (e.g. Köln 2000).

Does it work?

Crabbe et al (Science 1999) ran the same experiment at 3 labs. In spite of strict standardization, they found a strain effect, a lab effect, and a lab*strain interaction. From their conclusions: "Thus, experiments characterizing mutants may yield results that are idiosyncratic to a particular laboratory." "...differences between labs... can contribute to failures to replicate results of genetic experiments" (Wahlsten 2001).

A concrete example: exploratory behavior

NIH: Phenotyping Mouse Behavior - high-throughput screening of mutant mice, comparing 8 inbred strains of mice. Dr. Ilan Golani, TAU; Dr. Elmer, MPRC; Dr. Kafkafi, NIDA. Behavior tracking.

[Slides 12-13: behavior-tracking displays. Using sophisticated data-analytic tools we get, for "segment acceleration" (log-transformed), a lab*strain pattern.]
[Slide 14: the display supporting this claim for Distance Traveled (m).]

The statistical analysis supporting this claim, for proportion of time in center (logit):

  Source        df    MS      F      p-value
  Strain         7    102.5   44.8   0.00001
  Lab            2    6.35    2.77   0.065
  Lab*Strain    14    6.87    3.00   0.00028
  Residuals    264    2.29

And it is a common problem (Kafkafi, YB et al, PNAS '05).

Our statistical diagnosis of the replicability problem
- Part I: using the wrong "yardstick for variability" - a fixed-model analysis treating labs' effects as fixed.
- Part II: multiplicity problems - many endpoints; repeated testing (screening). (Kafkafi, YB et al, PNAS '05)

3. Part I: The mixed model

The existence of lab*strain interaction does not diminish the credibility of a behavioral endpoint - in this sense it is not a problem. The interaction should be recognized as a fact of life. The interaction's size is the right "yardstick" against which genetic differences should be compared. Statistically speaking: Lab is a random factor, as is its interaction with Strain. A mixed model should be used (rather than a fixed one).

The formal mixed model

$Y_{LSI}$ is the value of an endpoint for laboratory L and strain S, where index I represents the repetition within each group:

  $Y_{LSI} = \mu_S + a_L + b_{L \times S} + \varepsilon_{LSI}$

where $\mu_S$ is the strain effect, which is considered fixed; $a_L \sim N(0, \sigma^2_{LAB})$ is the laboratory random effect; $b_{L \times S} \sim N(0, \sigma^2_{LAB \times STRAIN})$ is the interaction random effect; and $\varepsilon_{LSI} \sim N(0, \sigma^2)$ is the individual variability.

Implications of the mixed model

Technically, strain differences are now assessed against the lab*strain interaction rather than the residual variability, so the threshold for significant strain differences can be much higher. On the slide, the fixed-model ANOVA table above is overlaid with the mixed-model figures - for Lab, F = 0.9 (p = 0.43); for Strain, F = 14.8 (p = 0.0028) - together with estimates of $\sigma^2_{LAB}$ and $\sigma^2_{LAB \times STRAIN}$.

Practically:
1. For screening new mutants (against a database), significance is assessed against the variance $\sigma^2_{LAB} + \sigma^2_{LAB \times STRAIN} + \sigma^2/n$.
2. For screening new mutants vs a locally measured background (a control group of size m), significance is assessed against $2\sigma^2_{LAB \times STRAIN} + \sigma^2(1/n + 1/m)$.

Unfortunately, even as sample sizes increase the interaction term does not disappear.
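A minimal numerical sketch of the change of yardstick, using the mean squares from the table above. The F reference distributions with (7, 264) and (7, 14) degrees of freedom are an assumption about how the slide's numbers were produced, and the screening-variance helper simply encodes formula 1 above:

```python
from scipy.stats import f as f_dist

ms_strain, df_strain = 102.5, 7
ms_inter,  df_inter  = 6.87, 14      # Lab*Strain mean square
ms_resid,  df_resid  = 2.29, 264

# Fixed model: Strain tested against the residual mean square.
F_fixed = ms_strain / ms_resid                 # ~44.8, as in the table
p_fixed = f_dist.sf(F_fixed, df_strain, df_resid)

# Mixed model: Strain tested against the interaction mean square.
F_mixed = ms_strain / ms_inter                 # ~14.9, near the 14.8 shown
p_mixed = f_dist.sf(F_mixed, df_strain, df_inter)

# Yardstick for screening a new mutant mean (n animals) against a
# database, per formula 1 above: a variance that never shrinks to 0
# as n grows, because the lab and interaction components remain.
def screening_var(s2_lab, s2_inter, s2, n):
    return s2_lab + s2_inter + s2 / n

print(F_fixed, p_fixed, F_mixed, p_mixed)
```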
A third practical implication concerns developing new endpoints. A single lab cannot suffice for the development of a new endpoint - no yardstick is available. Thus it is the developer's responsibility to offer estimates of the interaction variability and to put them in a public dataset (such as the Jackson Laboratory's).

[Slide 23: significance of 8 strain differences.]

In summary of Part I: practically, the threshold for making discoveries, in all aspects, is set at a higher level. Is this a drawback? It is a way to weed out non-replicable differences.

What about the warning "never use a random factor unless your levels are a true random sample"? We do not agree: replicability in a new lab is at least partially captured by a random-effects model. Revisit Jones, Lewis and Tukey (2002):

"Two alternatives, 'fixed' and 'variable', are not enough. A good way to provide a reasonable amount of realism is to define 'c' by

  appropriate error term = f-error term + c [r-error term - f-error term]

so that 'everything fixed' corresponds to c = 0 and 'random' is a particular case of c = 1. It pays then to learn as much as possible about values of c in the real world."

[The slide illustrates c = 0.5 for fixed columns with illustrative weights, and c = 1.6 for illustrative columns.] A challenge: can we estimate c?

Back to the significance of the 8 strain differences: should we believe all p-values ≤ 0.05? Not necessarily - beware of multiplicity!

4. Part II: The multiplicity problem

The more statistical tests in a study, the larger the probability of making a type I error; stricter control means less power to discover a real effect. Traditional approaches:
- "Don't worry, be happy": conduct each test at the usual .05 level (e.g. Ioannidis' PLoS paper).
- Panic! Control the probability of making even a single type I error in the entire study at the usual level (e.g. Bonferroni). Panic causes severe loss of power to discover in large problems.

Again: should we believe all p-values ≤ 0.05? Not necessarily - beware of multiplicity! Should we use Bonferroni? 0.05 × (1/17) = 0.0029.

But panic causes severe loss of power in large problems. From "Genetic dissection of complex traits: guidelines for interpreting..." (Lander and Kruglyak): "Adopting too lax a standard guarantees a burgeoning literature of false positive linkage claims, each with its own symbol... Scientific disciplines erode their credibility when a substantial proportion of claims cannot be replicated..." "On the other hand, adopting too high a hurdle for reporting results runs the risk that a nascent field will be stillborn." Is there an in-between approach?

The False Discovery Rate (FDR) criterion

The FDR approach takes seriously the concern of Lander & Kruglyak. The error in the entire study is measured by Q = the proportion of false discoveries among the discoveries (= 0 if none are found), and FDR = E(Q). If nothing is "real", controlling the FDR at level q guarantees that the probability of making even one false discovery is less than q; this is why we choose usual levels of q, say 0.05. But otherwise there is room for improving detection power. This error rate is scalable, adaptive, and economically interpretable.
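To make Q concrete, here is a toy simulation (all parameters illustrative): m = 100 one-sided tests at the unadjusted .05 level when only 10 effects are real. Each test is individually valid, yet the proportion of false discoveries among the discoveries is substantial:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
m, m0, shift, reps = 100, 90, 3.0, 2000   # 90 true nulls, 10 real effects

qs = []
for _ in range(reps):
    z = rng.standard_normal(m)
    z[m0:] += shift                 # the 10 non-null test statistics
    p = norm.sf(z)                  # one-sided p-values
    discoveries = p <= 0.05         # "don't worry, be happy" testing
    R = discoveries.sum()
    V = discoveries[:m0].sum()      # false discoveries (true nulls rejected)
    qs.append(V / R if R > 0 else 0.0)

print("estimated FDR = E(Q):", np.mean(qs))   # roughly 1/3 in this setup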
Our motivating work was Sorić (JASA 1989): if we use size-0.05 tests to decide upon statistical discoveries, then "there is danger that a large part of science is not true". We define

  Q = V/R if R > 0; Q = 0 if R = 0.

Sorić used E(V)/R for his demonstrations. More recently, Ioannidis (PLoS Medicine '05) essentially repeated the argument using the Positive Predictive Value (PPV), PPV = 1 - Q, stating that "most published research findings are false". For the demonstration in his model he used PPV = 1 - E(V)/E(R) = 1 - FDR. Control of the FDR assures a large PPV under most of Ioannidis' scenarios (except biases such as omission, publication, and interest).

[Slide 33: significance of 8 strain differences.]

Addressing multiplicity by controlling the FDR

FDR control is a very active area of current research, mainly because of its scalability (even into millions of tests...): types of dependency; resampling procedures; FDR-adjusted p-values; adaptive procedures; Bayesian interpretations; related error rates; model selection.

Is there a mixed-multiplicity connection?

Recall that the puzzling Encyclopedia entry addressed only two issues in detail: the FDR approach in pairwise comparisons, and the random-effects vs fixed-effects problem.

The mixed-multiplicity connection: in the fixed framework we selected three labs and made our analysis as if this were our entire world of reference. When this is not the case - as when the experiment is repeated in a different lab - the fixed point of view is overly optimistic. This is also the essence of the multiplicity problem - say, selecting the maximal difference (the one with the smallest p-value) and treating it as if it were our only comparison. In both cases, conclusions from a naive point of view have too great a chance to be non-replicable.
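A toy illustration of the selection side (parameters illustrative; 17 echoes the number of comparisons in the strain example above): under a complete null, the smallest of 17 two-sided p-values falls below .05 far more often than 5% of the time, so treating the best comparison as if it were the only one is overly optimistic:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
reps, m = 20000, 17                  # 17 comparisons, all nulls true

z = rng.standard_normal((reps, m))
p = 2 * norm.sf(np.abs(z))           # two-sided p-values
p_min = p.min(axis=1)                # the "best" comparison in each study

print("P(min p <= .05) =", (p_min <= 0.05).mean())  # ~1 - 0.95**17 ~ 0.58
```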
5. Replicability in medical research: hormone therapy in postmenopausal women

A very large and long randomized controlled study (the Women's Health Initiative; Rossouw, Anderson, Prentice, LaCroix, JAMA 2002). The study was not performed for drug approval, and was stopped before completion because the expected effects were reversed. Both Bonferroni-adjusted and marginal (nominal) CIs were reported, and the conclusions were contradictory: the decision to stop the trial was based on the marginal CIs.

The editorial: "The authors present both nominal and rarely used adjusted CIs to take into account multiple testing, thus widening the CIs. Whether such adjustment should be used has been questioned, ..." (Fletcher and Colditz, 2002). Our puzzle: US and European regulatory bodies require adjusting results in clinical trials for multiplicity. So, is the statement true? A small meta-analysis of methods (with Rami Cohen) - check with the flagship of medical research.

Sampling the New England Journal of Medicine

Period: 3 half-years - 2000, 2002, 2004. All articles of length > 6 pages containing at least one "p=". A sample of 20 from each period: 60 articles. No differences between periods, so results are reported pooled over periods. 44/60 reported clinical trials' results.

How was multiplicity addressed? [Slide: breakdown out of the 60 articles.]

A success: all studies define primary endpoints.

Multiple endpoints: no article had a single endpoint; only 2 articles corrected for multiple endpoints; 80% define a single primary endpoint. In many cases there is no clear distinction between primary and secondary endpoints. Even when a correction was made, it adjusted for only a partial list. (Note: Rami vs Yoav)

Multiple confidence intervals: two different concerns

- The effect of simultaneity: Pr(all intervals cover their parameters) < 0.95. The goal of simultaneous CIs, such as Bonferroni-adjusted CIs, is to assure that Pr(all cover) = 0.95.
- The effect of selection: when only a subset of the parameters is selected for highlighting, for example the significant findings, even the average coverage is < 0.95.

Implications of selection on average coverage: in the example displayed, 2/11 intervals do not cover with no selection, but 2/3 do not cover when selecting the significant coefficients. (BY & Yekutieli '05: FDR ideas for confidence intervals)
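A sketch of the selection effect on coverage (the effect sizes and selection rule are illustrative, not the data behind the 2/11 and 2/3 figures): marginal 95% intervals cover on average, but coverage among the intervals selected as significant drops sharply:

```python
import numpy as np

rng = np.random.default_rng(2)
m, reps = 100, 1000
theta = np.zeros(m)
theta[:20] = 1.0                      # illustrative: 20 modest effects, 80 nulls

cover_all, cover_sel = [], []
for _ in range(reps):
    x = theta + rng.standard_normal(m)       # estimates with unit s.e.
    covered = np.abs(x - theta) <= 1.96      # marginal 95% CI covers?
    cover_all.append(covered.mean())
    selected = np.abs(x) > 1.96              # highlight the "significant" ones
    if selected.any():
        cover_sel.append(covered[selected].mean())

print("average coverage, all intervals:     ", np.mean(cover_all))  # ~0.95
print("average coverage, selected intervals:", np.mean(cover_sel))  # far below
```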
So what? At the MCP2005 conference in Shanghai, the head of a statistical unit in the American FDA presented amazing numbers: more than half of Phase III studies fail to show the effect they were designed to show. Is it at least partly because clinical trials are analyzed "loosely" in terms of multiplicity before standing up to the regulatory agencies, and thus their results are not replicable? More comments in view of Ioannidis' paper at a later time.

6. Functional Magnetic Resonance Imaging (fMRI)

The study of the functioning brain: where is the brain active when we perform a mental task? [Example shown.] The unit of data is the volume pixel - the voxel: 64 x 64 per slice x 16 slices, so a comparison of the experimental factor per voxel means inference on ~64K voxels.

Assuring replicability in fMRI analysis: Part II

Multiplicity was addressed early on, by controlling the probability of making a false discovery even in a single voxel:
- FWE control using the random field theory of level sets (Worsley & Friston '95, Adler's theory)
- FWE control with resampling (Nichols & Holmes '96)
- Extra power by limiting the number of voxels tested, using Regions Of Interest (ROI) from an independent study

Genovese, Lazar & Nichols ('02) introduced the FDR into fMRI analysis. FDR voxel-based analysis is available in most software packages, e.g. Brain Voyager, SPM, fMRIstat.

fMRI: more to do on the multiplicity front - working with regions rather than voxels: utilizing the fact that activity is in regions; defining an appropriate FDR on regions; trimming non-active voxels from active regions; using adaptive FDR procedures; etc. (Pacifico et al '04; Heller et al '05; Heller & YB '06)

[Slides 51-52: "The Good, the Bad and the Ugly".]

Assuring replicability in fMRI analysis: Part I

Initially, results were reported for each subject separately. Then fixed-model ANOVA was used to analyze the multiple subjects - a within-subject "yardstick for variability" only. Concern about between-subject variability was raised more recently. Mixed-model analysis, with random effects for subjects, is now available in the main software tools; this is called multi-subject analysis. Obviously, the number of degrees of freedom is much smaller and the variability is larger.

[Slide 54: multi-subject analysis using random effects vs a single-subject analysis.]

Tricks of the trade: using the correlation at the first session as a pilot at q1 = 2/3, then testing at the second session at q2 = .075 (note that q1 × q2 = .05).

[Slides 56-57: activation maps.]

Why is random-effects (mixed-model) analysis insensitive?
1. Variability between subjects in the location of activity, and task-specific variability of location between subjects
2. Variability in the size of the activity
3. Problems in mapping different subjects to a single map of the brain
4. The use of uniform smoothing across the brain to solve problem (3), which reduces the signal per voxel
5. The pattern of change in the signal over time differs between individuals

Epilog

The fixed-vs-random debate is fierce in the community of neuroimagers. Acceptance to the best journals seems to depend on chance: will the article meet a "Random Effects Referee"? One can read that "the multi-subject analysis is less sensitive, so results were not corrected for multiplicity". Researchers sometimes resort to more questionable ways to control for multiplicity and across-subject variability, e.g. conjunction analysis.

Using the statistic $T_{vi}$ to test, for each subject i, the hypothesis $H_{0vi}$: there is no effect at voxel v for subject i, conjunction analysis intersects the individual subjects' maps. Friston, Worsley and others compare $T_v = \min_{1 \le i \le n} T_{vi}$ to a (lower) random-field-theory-based threshold. Nichols et al ('05): the "complete null" is tested at each voxel, so a rejection merely indicates that for at least one subject there is an effect at the voxel. (Is that enough to assure replicability?) Instead, $T_v$ should be compared to the regular threshold to test the hypothesis that all subjects have an effect, and then multiplicity strictly controlled.

Friston et al ('05) object to this proposal because of the loss of power; they suggest testing "there is an effect in at least u out of n subjects" as the alternative, and then strictly controlling multiplicity. It is clear that a compromise is needed that addresses both replicability - multiplicity and between-subject variability - and sensitivity. Is this the case where Tukey's ideas about 0 < c < 1 may become essential? It may very well be!
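A back-of-the-envelope contrast of the two thresholds, under strong simplifications (independent Gaussian subject statistics at a single voxel, no spatial multiplicity - the slide's actual argument uses random field theory):

```python
from scipy.stats import norm

n, alpha = 10, 0.05                   # illustrative: 10 subjects, level .05

# Complete-null conjunction: under H0 for ALL subjects,
# P(min_i T_i > t) = P(T > t)**n, so the min statistic clears a
# much lower bar -- but a rejection only says SOME subject is active.
t_complete_null = norm.isf(alpha ** (1.0 / n))    # ~ -0.65

# To conclude that ALL n subjects show an effect (Nichols et al. '05),
# the min must instead clear the regular single-test threshold:
t_all_subjects = norm.isf(alpha)                  # ~ 1.64

print(t_complete_null, t_all_subjects)
```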
In summary

Assuring the replicability of the results of an experiment is at the heart of the scientific dogma. Watch out for two statistical dangers to replicability:
- Ignoring the variability in those selected to be studied - thus using the wrong "yardstick for variability"
- Selecting to emphasize your best results

The second problem emerges naturally when multiple inferences are made and multiplicity is ignored.

The FDR website: www.math.tau.ac.il/~ybenja

[Appendix slide: further details about the failures to adjust, out of the 60 articles.]

The False Discovery Rate (FDR) criterion (Benjamini and Hochberg '95)

R = the number of rejected hypotheses = the number of discoveries; V of these may be in error = the number of false discoveries. The error (type I) in the entire study is measured by

  Q = V/R if R > 0; Q = 0 if R = 0,

i.e. Q is the proportion of false discoveries among the discoveries (0 if none are found), and FDR = E(Q). Does it make sense? Inspecting 20 features: 1 false among 20 discovered is bearable; 1 false among 2 discovered is unbearable. So this error rate is adaptive and also has an economic interpretation. Inspecting 100 features, the above remains the same, so this error rate is also scalable. If nothing is "real", controlling the FDR at level q guarantees that the probability of making even one false discovery is less than q; this is why we choose usual levels of q, say 0.05. But otherwise there is room for improving detection power.

FDR controlling procedures

The linear step-up procedure (BH): let $P_i$ be the observed p-value of a test for $H_i$, i = 1, 2, ..., m. Order the p-values $P_{(1)} \le P_{(2)} \le \dots \le P_{(m)}$. Let

  $k = \max\{ i : P_{(i)} \le \tfrac{i}{m} q \}$,

and reject $H_{(1)}, \dots, H_{(k)}$ (reject none if no such i exists).

FDR control of the linear step-up procedure (BH): suppose $m_0 \le m$ of the hypotheses are true. If the test statistics are independent, or positively dependent (e.g. positively correlated normally distributed test statistics), then $FDR \le (m_0/m)\,q \le q$. In general, $FDR \le (m_0/m)\,q\,(1 + 1/2 + \dots + 1/m)$.
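The linear step-up procedure is a few lines of code. A minimal sketch (the function name and example p-values are illustrative):

```python
import numpy as np

def bh_stepup(pvals, q=0.05):
    """Benjamini-Hochberg linear step-up: find the largest k with
    P_(k) <= (k/m) q and reject the k hypotheses with the smallest
    p-values. Returns the indices of the rejected hypotheses."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)                        # ascending p-values
    below = p[order] <= q * np.arange(1, m + 1) / m
    if not below.any():
        return np.array([], dtype=int)           # no rejections
    k = np.nonzero(below)[0].max() + 1           # largest i meeting the bound
    return order[:k]

# Example: rejects the two smallest p-values at q = 0.05.
print(bh_stepup([0.001, 0.012, 0.03, 0.04, 0.20, 0.50]))
```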
