Probability/Statistics Lecture Notes 4: Hypothesis Testing

A. Spanos (PHIL 6334)


(θ = θ∗), where the statement 'θ∗ denotes the true value of θ' is shorthand for saying that 'data x0 constitute a realization of the sample X with distribution f(x; θ∗)'. In testing, the reasoning is hypothetical: the relevant error probabilities are evaluated under various hypothetical scenarios revolving around different values of θ in Θ associated with the null (H0) and alternative (H1) hypotheses. The main objective in statistical testing is to use the sample information, as summarized by f(x; θ), x∈X, in conjunction with data x0 := (x1, x2, ..., xn), to narrow down the set of possible values of the unknown parameter θ∈Θ using hypothetical values of θ. Ideally, the narrowing down reduces Mθ(x) to a single point M∗(x) = {f(x; θ∗)}, x∈X. That is, hypothesis testing is all about learning from data x0 about the 'true' statistical Data Generating Mechanism (DGM) M∗(x). Instead of asking the data to pinpoint the value θ∗, in testing one assumes a specific value, say θ = θ0, and poses the question to data x0 whether θ0 is 'close enough' to θ∗ or not. In general, hypothetical reasoning enables statistical testing to pose sharper questions to the data and elicit more informative answers.

Section 2 provides a bare-bones account of Fisher's significance testing, focusing on the simple Normal model. Section 3 gives a basic account of the Neyman-Pearson (N-P) approach to testing, paying particular attention to its elaboration of the initial Fisher set-up. Section 4 raises briefly some of the foundational problems bedeviling both approaches to testing since the 1930s.

2 Fisher's significance testing

Example 1. Let us assume that the appropriate statistical model for data x0 is the simple (one-parameter) Normal model, where σ² is known (table 1).

Table 1 - Simple Normal (one-parameter) Model
Statistical GM: Xt = μ + ut, t∈N:={1, 2, ...}
[1] Normality: Xt ~ N(., .), xt∈R,
[2] Constant mean: E(Xt) = μ, for all t∈N,
[3] Constant variance: Var(Xt) = σ² [known],
[4] Independence: {Xt, t∈N} is an independent process.

Note that in this case X := R^n and Θ = (−∞, ∞).

A typical Fisher-type null hypothesis takes the form:

H0: μ = μ0,   (1)

where the particular value μ0 is given. The question posed by the null hypothesis in (1) is the extent to which data x0 accord with the pre-specified value μ0, i.e. Θ = (−∞, ∞) := R is narrowed down to a single value. Fisher used common sense to construct a test statistic that aims to evaluate the accordance between μ0 and data x0 based on the distance between a good estimator of μ and μ0; a good estimator is the next best thing to knowing μ. As in the case of a Confidence Interval, a 'good' (optimal) test begins with a 'good' (optimal) estimator. In lecture notes 3 it was shown that X̄n = (1/n)∑ᵗ₌₁ⁿ Xt provides a best estimator of μ because it has a sampling distribution:

X̄n ~ N(μ∗, σ²/n) under μ = μ∗,   (2)

and enjoys the optimal properties of unbiasedness (E(X̄n) = μ∗), full efficiency (Var(X̄n) = CR(μ∗)) and strong consistency (P(lim n→∞ X̄n = μ∗) = 1), where μ∗ denotes the 'true' value of μ, whatever that value happens to be. Remember that an optimal (1−α) Confidence Interval is based on a standardized version of (2), defined in terms of the pivotal quantity:

√n(X̄n − μ∗)/σ ~ N(0, 1) under μ = μ∗.   (3)

The question posed by H0 in (1) amounts to asking the data x0 whether the distance (μ∗ − μ0) is 'large enough' to indicate discordance with H0 or not. In light of the fact that μ∗ is unknown, it makes intuitive sense to use the best estimator X̄n of μ∗ to define a test statistic in terms of the difference (X̄n − μ0), which after standardization takes the form:

d(X) = √n(X̄n − μ0)/σ.

In view of the fact that σ is known, d(X) constitutes a statistic: a function of the form d(.): X → R that involves no unknown parameters. The question now is, 'how does one decide whether this distance is large enough?' Since d(X) is a random variable (being a function of the sample X), that question can only be answered in terms of its sampling distribution. But what is it? R.A. Fisher realized that one cannot use factual reasoning to derive it, because that would give rise to:

d(X) = √n(X̄n − μ0)/σ ~ N(δ0, 1) under μ = μ∗, where δ0 = √n(μ∗ − μ0)/σ,   (4)

which is non-operational because δ0 involves the unknown μ∗. Upon reflection, it became clear to Fisher that if the evaluation was hypothetical, under H0: μ = μ0, then:

d(X) = √n(X̄n − μ0)/σ ~ N(0, 1) under H0,   (5)

which is operational! Fisher (1956): "In general, tests of significance are based on hypothetical probabilities calculated from their null hypotheses." (p. 47)

Fisher used the fact that the sampling distribution of d(X) in (5) is completely known to define a measure of discordance between μ0 and μ∗ in terms of the p-value:

P(d(X) > d(x0); H0 true) = p(x0),   (6)

viewing it as an indicator of discordance between data x0 and H0; the bigger the observed test statistic d(x0), the smaller the p-value.

Definition. The p-value is the probability of getting an outcome x∈X such that d(x) is more discordant with H0 than d(x0), when H0 is true.

Note: It is crucial to avoid the highly misleading notation P(d(X) > d(x0) | H0) = p(x0), with a vertical line (|) instead of a semi-colon (;), since conditioning on H0 makes no sense in frequentist inference; θ is not a random variable. Hence, the p-value is NOT a conditional probability of any type!
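The hypothetical reasoning behind (5) can be illustrated with a small Monte Carlo sketch: when the data are generated with μ = μ0, the statistic d(X) = √n(X̄n − μ0)/σ behaves like a N(0, 1) draw. The particular values μ0=12, σ=1, n=100 anticipate the numerical example below; the simulation size is an arbitrary choice.

```python
# Monte Carlo sketch of (5): under H0 (mu = mu0) the statistic
# d(X) = sqrt(n)*(X̄n - mu0)/sigma is distributed N(0, 1).
import random
import statistics
from math import sqrt

random.seed(1)
mu0, sigma, n, reps = 12.0, 1.0, 100, 20000

ds = []
for _ in range(reps):
    sample = [random.gauss(mu0, sigma) for _ in range(n)]
    xbar = sum(sample) / n
    ds.append(sqrt(n) * (xbar - mu0) / sigma)

# Under H0 the empirical mean of d(X) should be near 0 and its
# standard deviation near 1.
print(round(statistics.mean(ds), 2), round(statistics.stdev(ds), 2))
```

Repeating the exercise with data generated under some μ ≠ μ0 shifts the empirical mean of d(X) to δ0 = √n(μ − μ0)/σ, which is exactly the non-operational quantity in (4).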

When the p-value is smaller than a certain threshold, say α = .05, it suggests that data x0 indicate a significant discordance with H0. What about when the p-value is greater than the selected threshold? Fisher was a strict falsificationist and rejected any interpretation of that as indicating accordance with H0.

Numerical examples
(i) Consider the case where σ = 1, n = 100, μ0 = 12, and data x0 gave rise to X̄n = 12.2. The observed value of the test statistic is d(x0) = √100(12.2 − 12)/1 = 2.0, which yields:
P(d(X) > d(x0); H0 true) = .023.
This p-value indicates a clear discordance with H0: μ0 = 12.
(ii) Assuming the same values as above except X̄n = 12.1 yields d(x0) = √100(12.1 − 12)/1 = 1, which gives rise to a p-value:
P(d(X) > d(x0); H0 true) = .159.
In this case Fisher would say that data x0 (X̄n = 12.1) indicate no discordance with the null, but that should NOT be interpreted as indicating accordance with H0! He was a strict falsificationist.

The main components of a Fisher test are given in table 2.

Table 2 - Fisher test: main components
(i) a statistical model: Mθ(x) = {f(x; θ), θ∈Θ}, x∈X,
(ii) a null hypothesis H0,
(iii) a test statistic d(X),
(iv) the distribution of d(X) under H0 [ascertainable],
(v) the p-value P(d(X) > d(x0); H0 true) = p(x0),
(vi) a threshold α below which the p-value is significant.
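The two numerical examples can be checked with a few lines of standard-library Python; here Φ, the N(0, 1) CDF, is built from math.erf, so no external packages are assumed.

```python
# A sketch of Fisher's p-value computation for Example 1.
from math import erf, sqrt

def phi(z):
    """Standard Normal CDF, Phi(z) = P(Z <= z)."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def p_value(xbar, mu0, sigma, n):
    """One-sided p-value for d(X) = sqrt(n)*(X̄n - mu0)/sigma under H0."""
    d = sqrt(n) * (xbar - mu0) / sigma
    return d, 1.0 - phi(d)

# (i) X̄n = 12.2: d(x0) = 2.0 and p ≈ .023
d1, p1 = p_value(12.2, 12.0, 1.0, 100)
# (ii) X̄n = 12.1: d(x0) = 1.0 and p ≈ .159
d2, p2 = p_value(12.1, 12.0, 1.0, 100)
print(round(d1, 1), round(p1, 3))  # 2.0 0.023
print(round(d2, 1), round(p2, 3))  # 1.0 0.159
```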

3 Neyman-Pearson (N-P) testing

Neyman and Pearson (1933) pointed out two crucial weaknesses in Fisher's approach to testing: (i) the arbitrariness of the p-value evaluation, and (ii) the ad hoc choice of a test statistic. Indeed, they viewed their primary contribution to statistical testing to be one of proposing a theory of optimal testing, analogous to Fisher's optimal estimation, by improving upon two interrelated aspects of his significance testing. The first was to replace the p-value with pre-data error probabilities, type I and II, in terms of which one can define the optimality of tests. The second was to justify the choice of a test statistic itself in terms of the same notion of optimality. The key to their approach to testing is given by the notion of an alternative hypothesis.

3.1 Archetypal Neyman-Pearson (N-P) hypotheses

In the case where the set Θ0 or Θ1 contains a single point, say Θi = {θi}, i = 0, 1, that determines f(x; θi) completely (no unknowns), the corresponding hypothesis is said to be simple; otherwise it is composite. For a particular null hypothesis H0 in the context of a statistical model:
Mθ(x) = {f(x; θ), θ∈Θ}, x∈X := R^n,
the default alternative hypothesis, denoted by H1, is always defined as the complement of the null with respect to the particular parameter space. That is, the null and the alternative hypotheses should constitute a partition of the parameter space. The archetypal way to specify the null and alternative hypotheses for N-P testing is:

H0: θ∈Θ0 vs. H1: θ∈Θ1,   (7)

where Θ0 and Θ1 constitute a partition of the parameter space Θ: Θ0∩Θ1 = ∅, Θ0∪Θ1 = Θ. This is because for statistical purposes the whole of the parameter space is relevant, despite the fact that only a few values of the unknown parameter are often of interest for substantive purposes.

Example 1 (continued). In the case of (1) the alternative is the rest of the parameter space:

H0: μ = μ0 vs. H1: μ ≠ μ0.   (8)

Similarly, the one-sided hypotheses:

H0: μ ≤ μ0 vs. H1: μ > μ0,   (9)
H0: μ ≥ μ0 vs. H1: μ < μ0,   (10)

constitute proper partitions of the parameter space. N-P testing, in effect, partitions Θ into Θ0 and Θ1, and poses the question whether data x0 accord better with one or the other subset. It is often insufficiently appreciated that this question pertains directly to the 'true' statistical DGM M∗(x) = {f∗(x; θ∗)}, x∈X, where f∗(x; θ∗) denotes the distribution of the sample evaluated at θ = θ∗. This is because (7) can be written in the equivalent form:

H0: f∗(x; θ∗)∈M0(x) vs. H1: f∗(x; θ∗)∈M1(x),
M0(x) = {f(x; θ), θ∈Θ0}, M1(x) = {f(x; θ), θ∈Θ1}, x∈X.

To answer that question the N-P approach uses a test statistic d(X), which maps the partition of Θ into Θ0 and Θ1 into a corresponding partition of the sample space X, say C0 and C1, where C0∩C1 = ∅, C0∪C1 = X, known as the acceptance and rejection regions, respectively:

C0 ↔ Θ0, C1 ↔ Θ1.

Fig. 1 places N-P testing in a broader context, where P(x) denotes the set of all possible models that could have given rise to data x0.

[Fig. 1: Neyman-Pearson testing; within P(x), the prespecified model Mθ(x) is partitioned between H0 and H1]

In particular, N-P reduces hypothesis testing to a decision to accept or reject the null hypothesis (H0) according to where data x0 happen to belong:
if x0∈C0, accept H0; if x0∈C1, reject H0.
However, in view of the probabilistic nature of the underlying statistical model, each of these decisions carries with it the possibility of error. In particular, viewing this choice as a decision, there are two types of error one can perpetrate:

Decision \ True State of Nature:  H0 is true     H0 is false
Accept H0                         correct        type II error
Reject H0                         type I error   correct

type I error: reject H0 when true; type II error: accept H0 when false.

The key to applying the N-P approach to testing is to be able to evaluate these two types of error probabilities using hypothetical reasoning associated with H0 being true or false:

type I error probability: P(x0∈C1; H0 true) = α,
type II error probability: P(x0∈C0; H0 false) = β.

This depends crucially on knowing the sampling distribution of the test statistic under these two scenarios. It is very important to emphasize that there is nothing conditional about these error probabilities; they are evaluated under different scenarios.

Example 1 (continued). For simplicity, let us consider the hypotheses of interest:

H0: μ = μ0 vs. H1: μ > μ0,   (11)

in the context of the simple Normal model (table 1). It turns out that the optimal N-P test for (11) coincides with that for the hypotheses:

H0: μ ≤ μ0 vs. H1: μ > μ0,   (12)

which constitute a partition of the relevant parameter space. Provisionally, let us adopt Fisher's test statistic d(X) = √n(X̄n − μ0)/σ as the appropriate distance function.

In light of the fact that departures from the null are associated with large values of d(X), an obvious rejection region is:

C1(α) = {x : d(x) > cα},   (13)

where cα > 0 is the threshold rejection value. Given that the sampling distribution of d(X) under H0 is given in (5), and is completely known, one can evaluate the type I error probability for different rejection values cα using:

P(d(X) > cα; H0 true) = α,

where α is the type I error probability; 0 < α < 1. Similarly, an obvious acceptance region is:

C0(α) = {x : d(x) ≤ cα},

since small values of d(x) indicate accordance with H0. To evaluate the type II error probability one needs to know the sampling distribution of d(X) when H0 is false. However, since 'H0 is false' refers to H1: μ > μ0, this evaluation will involve all values of μ greater than μ0 (i.e. μ1 > μ0):

β(μ1) = P(d(X) ≤ cα; H0 false) = P(d(X) ≤ cα; μ = μ1), for all μ1 > μ0.

The relevant sampling distribution takes the form:

d(X) = √n(X̄n − μ0)/σ ~ N(δ1, 1) under μ = μ1, where δ1 = √n(μ1 − μ0)/σ, for all μ1 > μ0.   (14)

That is, under H1 the sampling distribution of d(X) is Normal, but its mean δ1 is non-zero and changes with μ1. Equivalently:

d1(X) = √n(X̄n − μ1)/σ ~ N(0, 1) under μ = μ1, where d(X) = d1(X) + √n(μ1 − μ0)/σ.   (15)
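The type II error probability in (14) follows directly from the shifted Normal distribution: β(μ1) = P(d(X) ≤ cα; μ = μ1) = Φ(cα − δ1). A short standard-library sketch, using the running example's illustrative values (n=100, σ=1, μ0=12, cα=1.96):

```python
# Type II error probability beta(mu1) = Phi(c_alpha - delta1), per (14).
from math import erf, sqrt

def phi(z):
    """Standard Normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def beta(mu1, mu0=12.0, sigma=1.0, n=100, c_alpha=1.96):
    """P(d(X) <= c_alpha; mu = mu1): accept H0 although mu1 > mu0 holds."""
    delta1 = sqrt(n) * (mu1 - mu0) / sigma
    return phi(c_alpha - delta1)

# beta shrinks as mu1 moves away from mu0 = 12:
print(round(beta(12.1), 3), round(beta(12.2), 3), round(beta(12.5), 3))
```

Note that β(μ1) depends on the unknown μ1, which is why it must be evaluated for every value μ1 > μ0 rather than at a single point.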

A closer look at the above two types of error probabilities reveals that both of them are evaluated from a certain threshold cα. It turns out that as cα increases, the type I error probability [rejecting H0 when true] decreases, because we make it more and more difficult to reject the null. In contrast, the type II error probability [accepting H0 when false] increases as cα increases, because we make it easier to accept; the reverse holds when cα decreases. Hence, there is a trade-off between the two types of error probabilities as the threshold cα changes, which means that one cannot keep both error probabilities low at the same time. Very low error probabilities are desirable because they render the test more effective, in the sense that it will make fewer errors (more reliable). How can one address this trade-off?

Neyman and Pearson (1929) suggested that a natural way to address this trade-off is as follows:
(a) Specify the null and alternative hypotheses in such a way as to render the type I error the more serious of the two potential errors.
(b) Fix the type I error probability (significance level) at a small number:
P(d(X) > cα; H0 true) = α, say α = .05 or α = .01,
where the choice of α depends on the particular circumstances.
(c) For a given α, choose the optimal test {d(X), C1(α)} that minimizes the type II error probability for all values μ1 > μ0.

The last step is often replaced with an equivalent step:
(c)* For a given α, choose a test {d(X), C1(α)} that maximizes the power of the test in question for all values μ1 > μ0:

π(μ1) = P(d(X) > cα; H1(μ1) true), for all μ1 > μ0.   (16)

Uniformly Most Powerful (UMP). A test Tα := {d(X), C1(α)} is said to be UMP if it has higher power than any other α-level test T̃α for all values θ∈Θ1, i.e.
π(θ; Tα) ≥ π(θ; T̃α), for all θ∈Θ1.
That is, a UMP N-P test is one whose power curve dominates that of every other possible test, in the sense that for all μ1 > μ0 its power is greater than or equal to that of the other tests.

Standard Normal tables
One-sided values: α=.100: cα=1.28; α=.050: cα=1.645; α=.025: cα=1.96; α=.010: cα=2.33.
Two-sided values: α=.100: cα/2=1.645; α=.050: cα/2=1.96; α=.025: cα/2=2.24; α=.010: cα/2=2.58.
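The tabulated thresholds cα can be recovered from the N(0, 1) quantile function; statistics.NormalDist is in the Python standard library (3.8+), so nothing beyond the stdlib is assumed.

```python
# Recovering the Standard Normal table thresholds from the quantile function.
from statistics import NormalDist

z = NormalDist().inv_cdf  # inverse of the N(0,1) CDF

# one-sided: c_alpha solves P(Z > c_alpha) = alpha, i.e. c_alpha = z(1 - alpha)
one_sided = {a: round(z(1 - a), 3) for a in (0.10, 0.05, 0.025, 0.01)}
# two-sided: c_{alpha/2} solves P(|Z| > c) = alpha, i.e. c = z(1 - alpha/2)
two_sided = {a: round(z(1 - a / 2), 3) for a in (0.10, 0.05, 0.025, 0.01)}
print(one_sided)
print(two_sided)
```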

Example 1 (continued). In the above explication of the basic components of an N-P test we have taken the Fisher test statistic as given, but it is not obvious that it gives rise to a UMP test in the case of the hypotheses in (11). Does it? Before we answer that question, let us consider the problem of evaluating the power of the test defined by {d(X), C1(α)} for α = .025 and different values of μ1 > 12.

Consider the case where σ = 1, n = 100, μ0 = 12, and data x0 gave rise to X̄n = 12.27. The observed value of the test statistic is d(x0) = √100(12.27 − 12)/1 = 2.7, which, in view of the fact that cα = 1.96 for α = .025, leads to a rejection of H0: μ0 = 12. The power of this test is defined by:

π(μ1) = P(d(X) > 1.96; H1(μ1) true), for all μ1 > 12.

In light of the fact that the relevant sampling distribution for the evaluation of π(μ1) is (14), the use of the N(0, 1) tables requires one to split the test statistic into:

√n(X̄n − μ0)/σ = √n(X̄n − μ1)/σ + δ1, where δ1 = √n(μ1 − μ0)/σ,

so that, to evaluate the power for different values of μ1, one can use (15):

π(μ1) = P(d1(X) > cα − √n(μ1 − μ0)/σ), for all μ1 > μ0.   (17)

Table 3 - Power of {d(X), C1(.025)}
γ = μ1−μ0 : δ1 = √n(μ1−μ0)/σ : π(μ1) = P(Z > 1.96 − δ1)
γ = 0   : δ1 = 0  : π(12.00) = P(Z > 1.96) = .025
γ = .05 : δ1 = .5 : π(12.05) = P(Z > 1.46) = .072
γ = .1  : δ1 = 1  : π(12.10) = P(Z > .96) = .169
γ = .2  : δ1 = 2  : π(12.20) = P(Z > −.04) = .516
γ = .3  : δ1 = 3  : π(12.30) = P(Z > −1.04) = .851
γ = .4  : δ1 = 4  : π(12.40) = P(Z > −2.04) = .979
γ = .5  : δ1 = 5  : π(12.50) = P(Z > −3.04) = .999
where Z is a generic standard Normal r.v., i.e. Z ~ N(0, 1).

The power of this test is typical of an optimal (UMP) test, since π(μ1) increases with the non-zero mean δ1 = √n(μ1 − μ0)/σ:
(a) the power increases with the sample size n,
(b) the power increases with the discrepancy γ = (μ1 − μ0),
(c) the power decreases with σ.
The features (a)-(b) are often used to decide pre-data on how large n should be to detect departures (μ1 − μ0) of interest, as part of the pre-data design of the study.

Returning to the original intentions of Neyman and Pearson (1933) to improve upon Fisher's significance testing, we can see that, by bringing into the set-up the notion of an alternative hypothesis defined as the complement of the null, they proposed to:
(i) replace the post-data p-value with the pre-data type I and II error probabilities, and
(ii) define an optimal test in terms of the notion of an α-level UMP test.
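Table 3 can be reproduced with a few lines of standard-library Python, applying (17) with the running example's values n = 100, σ = 1 and cα = 1.96:

```python
# Reproducing Table 3: pi(mu1) = P(Z > c_alpha - delta1),
# with delta1 = sqrt(n)*gamma/sigma and gamma = mu1 - mu0.
from math import erf, sqrt

def phi(z):
    """Standard Normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def power(gamma, sigma=1.0, n=100, c_alpha=1.96):
    delta1 = sqrt(n) * gamma / sigma
    return 1.0 - phi(c_alpha - delta1)

for gamma in (0.0, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5):
    print(gamma, round(power(gamma), 3))
```

Increasing n or decreasing σ in the call shifts the whole power column upward, which is exactly features (a) and (c) above.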

The notion of optimality renders the choice of the test statistic and the associated rejection region a matter of mathematical optimization, replacing Fisher's intuition about what test statistic makes sense to use in different cases. It turned out that in most cases Fisher's initial intuition coincided with the notion of an optimal test.

Example 1 (continued). The question that one might naturally raise at this stage is whether the same test statistic d(X) = √n(X̄n − μ0)/σ can be used to specify a UMP test for the two-sided hypotheses:

H0: μ = μ0 vs. H1: μ ≠ μ0.   (18)

The rejection region in this case should naturally allow for discrepancies on either side of μ0 and would take the form:

C1∗(α) = {x : |d(x)| > cα/2}.   (19)

It turns out that the test defined by {d(X), C1∗(α)} is not UMP, but it is UMP Unbiased; see Spanos (1999).

The main components of an N-P test are given in table 4.

Table 4 - N-P test: main components
(i) a statistical model: Mθ(x) = {f(x; θ), θ∈Θ}, x∈X,
(ii) a null (H0) and an alternative (H1) hypothesis within Mθ(x),
(iii) a test statistic d(X),
(iv) the distribution of d(X) under H0 [ascertainable],
(v) the significance level (or size) α,
(vi) the rejection region C1(α) = {x : d(x) > cα},
(vii) the distribution of d(X) under H1 [ascertainable].

Remarks. (i) It is very important to remember that an N-P test is not just a test statistic! It is at least a pair Tα := {d(X), C1(α)}.

(ii) The optimality of an N-P test is inextricably bound up with the optimality of the estimator the test statistic is based on. Hence, it is no accident that most optimal N-P tests are based on consistent, fully efficient and sufficient estimators. For instance, if one were to replace X̄n in d(X) = √n(X̄n − μ0)/σ with the unbiased (but inconsistent) estimator μ̌ = (X1 + Xn)/2 ~ N(μ, σ²/2), the resulting test Ťα based on:

ď(X) = √2(μ̌ − μ0)/σ and C1(α) = {x : ď(x) > cα}

will not be optimal, because its power is much lower than that of Tα. Worse, Ťα is an inconsistent test: its power does not approach one as n → ∞ for all discrepancies (μ1 − μ0) > 0. Consistency is considered a minimal property for tests, as in the case of optimal estimators.

(iii) Also, by changing the rejection region one can render an optimal N-P test useless! For instance, replacing the rejection region of Tα = {d(X), C1(α)} with:

C̃1(α) = {x : d(x) < −cα},

the resulting test T̃α := {d(X), C̃1(α)} is practically useless because it is a biased test, i.e. its power is less than the significance level:

P(d(X) < −cα; μ = μ1) ≤ α, for all μ1 > μ0,

and it decreases as the discrepancy increases.
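The contrast in remark (ii) can be made concrete: under μ = μ1, the statistic based on X̄n has mean √n(μ1 − μ0)/σ, which grows with n, while the one based on (X1 + Xn)/2 has mean √2(μ1 − μ0)/σ, which does not depend on n at all. A stdlib sketch, with illustrative values γ = .2 and σ = 1:

```python
# Power of the optimal test (based on X̄n) vs. the test based on the
# inconsistent estimator (X1 + Xn)/2, both with c_alpha = 1.96.
from math import erf, sqrt

def phi(z):
    """Standard Normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def power_mean_test(gamma, n, sigma=1.0, c=1.96):
    # under mu = mu1 the shift of d(X) is sqrt(n)*gamma/sigma
    return 1.0 - phi(c - sqrt(n) * gamma / sigma)

def power_check_test(gamma, sigma=1.0, c=1.96):
    # (X1 + Xn)/2 ~ N(mu, sigma^2/2): the shift is sqrt(2)*gamma/sigma,
    # independent of n, so the power never improves with more data
    return 1.0 - phi(c - sqrt(2.0) * gamma / sigma)

gamma = 0.2
for n in (100, 1000, 10000):
    print(n, round(power_mean_test(gamma, n), 3), round(power_check_test(gamma), 3))
```

The first column of powers tends to 1 as n grows, while the second stays flat (and below .05 here), which is precisely what makes Ťα an inconsistent test.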

4 The fallacies of acceptance and rejection

It turned out that, despite these obvious technical developments in statistical testing, both the Fisher and N-P approaches suffered from serious foundational problems due to the lack of a coherent philosophy of testing. In particular, neither account gave a satisfactory answer to the basic question (see Mayo, 1996): when do data x0 provide evidence for (or against) a hypothesis or a claim H?

Fisher's notion of the p-value as a measure of discordance gave rise to several foundational questions that have remained largely unanswered since the 1930s:
(a) how does one interpret a small p-value as it pertains to evidence against H0?
(b) how does one interpret a large p-value as it pertains to evidence for H0?
(c) since the p-value depends on the sample size n, how does one interpret a small p-value when n is very large?
Similarly, the N-P approach to hypothesis testing gave rise to several foundational questions of its own:
(d) does one interpret accepting H0 as evidence for H0? If not, why?
(e) does one interpret rejecting H0 as evidence for H1? If not, why?
These questions are associated with two fundamental fallacies that have bedeviled frequentist testing since the 1930s:

(I) The fallacy of acceptance: (mis)interpreting accept H0 [no evidence against H0] as evidence for H0; e.g. the test may have had low power to detect an existing discrepancy.
(II) The fallacy of rejection: (mis)interpreting reject H0 [evidence against H0] as evidence for a particular H1; e.g. the test may have had very high power to detect even tiny discrepancies.

The best example of the fallacy of acceptance is when H0 is false but the test applied did not have enough power to detect the particular discrepancy present. N-P testing is clearly vulnerable to this fallacy, especially when the power of the test is not evaluated.

Example 1 (continued). In the case where α = .025, σ = 1, n = 100, μ0 = 12, and data x0 gave rise to X̄n = 12.18, the observed value of the test statistic is d(x0) = √100(12.18 − 12)/1 = 1.8, which leads to the acceptance of H0: μ0 = 12, since cα = 1.96. If, however, the substantive discrepancy of interest in this case is (μ1 − μ0) = .2, then according to table 3 the power of the test {d(X), C1(α)} is only π(12.2) = .516, which is rather low, and thus the test might not be able to detect such a discrepancy even if it is present.

A good example of the fallacy of rejection is conflating statistical with substantive significance. It could easily be that the test has very high power (e.g. when the sample size n is very large) and the detected discrepancy, while statistically significant due to the over-sensitivity of the test in question, is substantively tiny. Both Fisher and N-P testing are vulnerable to this fallacy.

The question that naturally arises at this stage is whether there is a way to address the fallacies of acceptance and rejection, and the misleading interpretations associated with observed CIs, and at the same time provide a coherent interpretation of statistical testing that offers an inferential construal pertaining to substantive hypotheses of interest. Error statistics provides such a coherent interpretation of testing and enables one to address these fallacies using a post-data evaluation of the accept/reject decisions based on severe testing reasoning; see Mayo (1996), Mayo and Spanos (2006). The error statistical approach, viewed narrowly at the statistical level, blends the Fisher and Neyman-Pearson (N-P) testing perspectives to weave a coherent frequentist inductive reasoning anchored firmly on error probabilities, both pre-data and post-data. The key to this coalescing is provided by recognizing that Fisher's p-value reasoning is based on a post-data error probability, while Neyman-Pearson's type I and II error reasoning is based on pre-data error probabilities, and that they play complementary, not contradictory, roles.

5 Summary and conclusions

In frequentist inference, learning from data x0 about the stochastic phenomenon of interest is accomplished by applying optimal inference procedures with ascertainable error probabilities in the context of a statistical model:

Mθ(x) = {f(x; θ), θ∈Θ}, x∈R^n.   (20)

Hypothesis testing gives rise to learning from data x0 by partitioning Mθ(x) into two subsets framed in terms of the parameter(s):

H0: θ∈Θ0 vs. H1: θ∈Θ1,   (21)

and using x0 to pose questions, based on hypothetical reasoning, in order to learn about M∗(x) = {f(x; θ∗)}, x∈X, where f∗(x) = f(x; θ∗) denotes the 'true' distribution of the sample. A more perceptive way to specify (21) is:

H0: f∗(x)∈M0(x) = {f(x; θ), θ∈Θ0} vs. H1: f∗(x)∈M1(x) = {f(x; θ), θ∈Θ1}.

This notation elucidates the basic question posed in hypothesis testing:
• Assuming that the true M∗(x) lies within the boundaries of Mθ(x), can x0 be used to narrow it down to a smaller subset M0(x)?

A test Tα := {d(X), C1(α)} is defined in terms of a test statistic (d(X)) and a rejection region (C1(α)), and its optimality is calibrated in terms of the relevant error probabilities:

type I: P(x0∈C1; H0(θ) true) ≤ α, for θ∈Θ0,
type II: P(x0∈C0; H1(θ1) true) = β(θ1), for θ1∈Θ1.

These error probabilities specify how often these procedures lead to erroneous inferences. For a given significance level α, the optimal N-P test is the one whose pre-data capacity (power):

π(θ1) = P(x0∈C1; θ = θ1), for all θ1∈Θ1,

is maximum for all θ1∈Θ1.

As they stand, neither the p-value nor the accept/reject H0 rules provide an evidential interpretation pertaining to 'when data x0 provide evidence for (or against) a hypothesis (or claim) H'.
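The fallacy-of-acceptance example from section 4 can be checked numerically with the standard library: X̄n = 12.18 yields d(x0) = 1.8 < 1.96, so H0 is 'accepted', yet the power against the substantive discrepancy γ = .2 is only about .516.

```python
# Numeric sketch of the fallacy-of-acceptance example (section 4).
from math import erf, sqrt

def phi(z):
    """Standard Normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

sigma, n, mu0, xbar, c = 1.0, 100, 12.0, 12.18, 1.96
d_obs = sqrt(n) * (xbar - mu0) / sigma
accept = d_obs <= c
# power against the substantive discrepancy gamma = .2
power_at_gamma = 1.0 - phi(c - sqrt(n) * 0.2 / sigma)
print(round(d_obs, 1), accept, round(power_at_gamma, 3))  # 1.8 True 0.516
```

With power barely above a coin flip, 'accept H0' here licenses no claim that the discrepancy γ = .2 is absent, which is exactly the point of fallacy (I).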
