Information about Regression-through-the-origin: Ratio of Means or Medians

This vignette considers using a ratio of medians, instead of a ratio of means (the least-squares estimator), to estimate the slope of a best-fit-line.

Rationale and Objectives for the Vignette “Ratio of Means or Medians”

Rationale:
1) The vignette may be used as a vehicle to inform the students about the circumstances surrounding the discovery of the principle of least squares and about the tendency of some scientists, for good reasons, to set aside newly discovered results for later publication.
2) The vignette may also be used to emphasize the concept of least squares using a very meager dataset (three data points), and hence to emphasize the principle of parsimony and simplicity in data analysis. In so doing, the students are aided in appreciating the concept to be learned by visualizing it instead of just memorizing formulas and the principle in words. The paucity of the dataset should not be a source of discouragement considering that, as they would later discover in the vignette, Gauss predicted the position of the asteroid Ceres based only on three data points and using the principle of least squares!
3) Together with the use of alternative methods of averages to come up with alternative estimators, the students experience the value of creativity, ingenuity, and common sense in discovery. The use of the ratio of medians instead of the means, as mentioned in the vignette, may also be used to inform the students of the principle of using robust methods for a first analysis of datasets before having recourse to more complicated methods.

Objectives: After studying the vignette, the students should be able to:
1) give an account of the discovery of least squares;
2) explain the principle of least squares by using a simple diagram;
3) appreciate the value of simplicity, parsimony, common sense, ingenuity, creativity, communication, and accuracy in the process of discovery;
4) identify linear regression problems wherein the ratio of medians may be more appropriate than the ratio of means; and,
5) appreciate the value of statistics as a scientific methodology.
Scientific attitudes addressed (based on Roach (1993); see footnote 4): Skepticism, Communication, Accuracy, Parsimony, Common Sense

Teacher Notes: This vignette should be discussed after the lecture on linear regression.

Footnote 4: Roach, Linda E. (1993). I Have A Story About That: Historical Vignettes to Enhance the Teaching of the Nature of Science, p. 13.

Regression-through-the-origin: Ratio of Means or Medians
by Justine Leon A. Uro
e-mail: justineuro@yahoo.com

An experimenter wants to determine an equation of a line that can be used to describe a possible linear relationship between two hypothetical variables, say X and Y. It is known beforehand that Y = 0 when X = 0. He was also able to obtain three additional pairs of values for X and Y: (2,3), (3,2), and (4,7).

There are a number of ways of obtaining such a “best-fit-line.” A common (maybe the most common) method is through the use of the least-squares (LS) criterion, in which case the estimated line is called the least-squares (LS) line.

What do you recall about the least-squares criterion? In what sense is an LS line a “least-squares” line?

For this particular type of dataset, since the LS line has to pass through the origin (0,0) (regression-through-the-origin method), the LS line is given by $y = bx$, where $b = \bar{y}_w / \bar{x}_w$ and $\bar{x}_w$ and $\bar{y}_w$ are the weighted arithmetic means (with weighting factor $w = X$) of the variables X and Y, respectively (see footnote 5). That is to say,

$b = \left( \sum w_i Y_i / \sum w_i \right) / \left( \sum w_i X_i / \sum w_i \right) = \left( \sum X_i Y_i / \sum X_i \right) / \left( \sum X_i X_i / \sum X_i \right)$,

or simply $b = \sum X_i Y_i / \sum X_i^2$.

Construct a scatterplot of Y vs. X. Find the weighted means $\bar{x}_w$, $\bar{y}_w$, and an equation of the LS line. Overlay the graph of the LS line on the scatterplot of Y vs. X. Aside from the LS line, what other “best-fit-lines” have you previously encountered? How does one obtain them? You may want to graph one of these alternative “best-fit-lines” on the scatterplot of Y vs. X.

Footnote 5: This is a more precise formula for the LS estimator of b than that appearing in an earlier version of this paper, in which this author inadvertently used $b = \bar{y} / \bar{x}$, where $\bar{x}$ and $\bar{y}$ were meant to denote the simple or unweighted arithmetic means of X and Y, respectively. It should, however, be pointed out that using $b = \bar{y} / \bar{x}$ is of particular interest (for regression-through-the-origin) since it is the zero deviation or ZD estimator: the sum of the deviations of the predicted from the observed Y values becomes zero for this estimator of b.
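The weighted-mean formula for the LS slope and the zero-deviation (ZD) estimator of footnote 5 can be checked by hand on the three data points. The following is a minimal sketch in Python, not part of the original vignette; the function names (`weighted_mean`, `ls_slope_through_origin`, `zd_slope`) are illustrative, and exact fractions are used so the arithmetic can be followed step by step.

```python
from fractions import Fraction

# The three observed (X, Y) pairs given in the vignette.
X = [2, 3, 4]
Y = [3, 2, 7]

def weighted_mean(values, weights):
    """Weighted arithmetic mean sum(w*v) / sum(w), kept as an exact fraction."""
    return Fraction(sum(w * v for v, w in zip(values, weights)), sum(weights))

def ls_slope_through_origin(x, y):
    """LS slope for y = b*x: ratio of the weighted means with weights w = x."""
    return weighted_mean(y, x) / weighted_mean(x, x)

def zd_slope(x, y):
    """Zero-deviation slope: ratio of the unweighted means, i.e. sum(y) / sum(x)."""
    return Fraction(sum(y), sum(x))

x_w = weighted_mean(X, X)             # (4 + 9 + 16) / 9 = 29/9
y_w = weighted_mean(Y, X)             # (6 + 6 + 28) / 9 = 40/9
b_ls = ls_slope_through_origin(X, Y)  # 40/29
b_zd = zd_slope(X, Y)                 # 12/9 = 4/3

# Sum of deviations (observed minus predicted) under the ZD slope.
zd_deviation_sum = sum(yi - b_zd * xi for xi, yi in zip(X, Y))

print("weighted means:", x_w, y_w)
print("LS slope:", b_ls, "approx.", round(float(b_ls), 4))
print("ZD slope:", b_zd, "approx.", round(float(b_zd), 4))
print("sum of deviations under ZD slope:", zd_deviation_sum)
```

On these data the LS line is y = (40/29)x, the same value obtained from the short form b = ΣXY / ΣX² = 40/29, while the ZD line is y = (4/3)x and its deviations sum to exactly zero.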

The earliest publicized use of the LS method is probably that by Karl Friedrich Gauss who in 1801, at 23 or 24 years old, used this method to predict the position of the asteroid Ceres as it emerged from behind the sun, based on only three celestial observations of this asteroid [1], [2], [3]. Based on the calculations of Gauss, Hungarian astronomer Franz Xaver von Zach and German astronomer Heinrich Olbers rediscovered it on December 31, 1801 in Gotha and January 1, 1802 in Bremen, respectively [1]. The asteroid was initially discovered by the Italian Giuseppe Piazzi on January 1, 1801, but he was able to watch it along its path for only 40 days before the glare of the sun got in the way [2]. Apparently Gauss had been using the LS method since 1795 but did not publish it until 1809. By then, Frenchman Adrien-Marie Legendre and Irish-American Robert Adrain had discovered it independently of each other and of Gauss. Legendre published his results in 1805 and Adrain in 1808.

Do you know of other scientists whose publication of a discovery was preceded by that of colleagues who made the same discovery, albeit later? Please cite examples. Why do you think these discoverers delayed the publication of their discoveries?

Recall that there are three common measures of average: the mean, median, and mode. Although the most common is the mean, there are instances wherein the median or the mode is preferred. For example, the median is preferred over the mean when the dataset is skewed to the right in that it includes some extremely large values.

Recall that the LS estimator for the slope for the regression-through-the-origin is $b = \bar{y}_w / \bar{x}_w$ and is therefore a ratio of two averages, these averages being weighted means. Suggest other estimators for b based on your knowledge of averages (see the contents of the previous paragraph). Can you think of instances of datasets wherein it is better to use this kind of average instead of the weighted mean? Cite examples and justify your answer.
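The remark that the median is preferred over the mean for right-skewed data can be illustrated with a small numerical sketch. The dataset below is hypothetical (it does not come from the vignette) and was chosen only so that a single extremely large value is present.

```python
from statistics import mean, median

# Hypothetical right-skewed dataset: four moderate values plus one extreme value.
data = [2, 3, 4, 5, 100]
data_without_extreme = data[:-1]

mean_all = mean(data)                  # pulled strongly toward the extreme value
median_all = median(data)              # stays near the bulk of the data
mean_trimmed = mean(data_without_extreme)
median_trimmed = median(data_without_extreme)

print("with extreme value:    mean =", mean_all, " median =", median_all)
print("without extreme value: mean =", mean_trimmed, " median =", median_trimmed)
```

Dropping the extreme value moves the mean from 22.8 to 3.5, while the median only moves from 4 to 3.5, which is the sense in which the median is the more robust average here.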
In connection with the determination of the equation of a best-fit-line to describe the relationship between two variables, H. Theil (1950; in Daniel (1991, pp. 621-2, 630)) and E.J. Dietz (1989; in Daniel (1991, pp. 622, 630)) used the median instead of the mean. A possible estimator for the slope of the regression-through-the-origin line based on their theory would be b = median(y/x). A similar estimator would be b = median(y)/median(x).

Obtain the regression lines based on these two estimators of b, then graph them on the scatterplot obtained earlier. Compare the three regression lines obtained to each other based on their graphs.

Gauss in 1829 was able to prove that the least-squares method for obtaining a best-fit-line is optimal in that, among unbiased estimators for the coefficients of the regression line, the LS method gives those with least variance, assuming that the errors are independently and identically normally distributed (Gaussian distribution) with zero mean and a common variance. On the other hand, the two methods derived from the work of Theil (1950; in Daniel (1991, pp. 621-2, 630)) and Dietz (1989; in Daniel (1991, pp. 622, 630)) mentioned previously are robust in that they are nonparametric (no particular distribution is assumed for the errors). It can thus be seen that the three methods mentioned above have their advantages and disadvantages.

What can you say about the relative number of calculations involved in the three different methods? Cite instances wherein a nonparametric method is better than a parametric method, especially when dealing with datasets that arise in the physical sciences.

References
[1] Carl Friedrich Gauss. (2007, April 28, 3:27). In Wikipedia, The Free Encyclopedia. Retrieved April 29, 2007 from http://en.wikipedia.org/wiki/Carl_Friedrich_Gauss.
[2] Least squares. (2007, June 30, 7:59). In Wikipedia, The Free Encyclopedia. Retrieved July 9, 2007 from http://en.wikipedia.org/wiki/Least_squares.
[3] Statistics. (2007, July 6, 8:05). In Wikipedia, The Free Encyclopedia. Retrieved July 7, 2007 from http://en.wikipedia.org/wiki/Statistics.
[4] Probability. (2007, June 20, 2:37). In Wikipedia, The Free Encyclopedia. Retrieved July 7, 2007 from http://en.wikipedia.org/wiki/Probability.
[5] Daniel, W. (1991). Biostatistics: a foundation for analysis in the health sciences, 5th ed., pp. 621-2, 630.
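As a supplementary sketch (not part of the original vignette), the two median-based slope estimators named in the text can be compared with the LS slope on the three data points (2,3), (3,2), and (4,7). It is assumed here that median(y/x) means the median of the pairwise ratios Y_i/X_i; the function names are illustrative. The sum of squared residuals is also computed for each line, since that is the quantity the LS criterion minimizes.

```python
from fractions import Fraction
from statistics import median

# The three observed (X, Y) pairs given in the vignette.
X = [2, 3, 4]
Y = [3, 2, 7]

def slope_ls(x, y):
    """LS slope through the origin: sum(x*y) / sum(x^2)."""
    return Fraction(sum(a * b for a, b in zip(x, y)), sum(a * a for a in x))

def slope_median_of_ratios(x, y):
    """Median of the pairwise ratios y_i / x_i (assumed reading of median(y/x))."""
    return median(Fraction(b, a) for a, b in zip(x, y))

def slope_ratio_of_medians(x, y):
    """Ratio of the medians: median(y) / median(x)."""
    return median(Fraction(v) for v in y) / median(Fraction(v) for v in x)

def sum_squared_residuals(b, x, y):
    """Sum of squared vertical deviations of the points from the line y = b*x."""
    return sum((yi - b * xi) ** 2 for xi, yi in zip(x, y))

slopes = {
    "LS": slope_ls(X, Y),                                   # 40/29
    "median of ratios": slope_median_of_ratios(X, Y),       # median of 3/2, 2/3, 7/4
    "ratio of medians": slope_ratio_of_medians(X, Y),       # 3/3
}

for name, b in slopes.items():
    ssr = sum_squared_residuals(b, X, Y)
    print(f"{name}: b = {b} (approx. {float(b):.4f}), sum of squared residuals = {float(ssr):.4f}")
```

On these data the median of ratios gives b = 3/2 and the ratio of medians gives b = 1, against the LS slope of 40/29. The LS line has the smallest sum of squared residuals of the three, as the criterion requires; the median-based lines trade some of that fit for not depending on any distributional assumption.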
