statistics by ravikotak


Published on September 8, 2009

Author: aSGuest25517

Source: authorstream.com

Statistics and Data Analysis
Professor William Greene
Stern School of Business
IOMS Department, Department of Economics

Part 2 – Descriptive Statistics

Basic Descriptive Statistics Agenda
  Populations and Samples
  Descriptive Statistics for a Variable
    Measures of location: mean, median, mode
    Measure of dispersion: standard deviation
  Measures of Covariation for Two Variables
    Understanding covariation
    Measuring covariance and correlation
    Scatter plots and regression

Populations and Samples
  Population: Collection of all possible observations (data points) on a variable.
  Sample: A subset of the data points in the population.
  Random sample: Defined by the way the sample data are obtained. All points in the population are equally likely to be drawn in any particular sample.
  What is the purpose of obtaining a sample? To learn about the population.
  The sample is observed. The population is assumed.
  See HOG, Sec. 1.5.

Random Sampling
  A production process produces circuit boards, each with several dozen soldering connections. 20 boards are produced per hour, with an average of 2 defects per board when the process is in control. Over the course of a particular 30-hour week, the following averages are obtained: 1.45, 1.65, 1.50, …, 2.35.
  What is the population?
  What could be learned from this sample?
  From HOG, Ex. 2.40, p. 64.

Samples of House Listings and Per Capita Incomes (Bivariate)

Questions About the Income Data
  Are they a population or a sample?
    Population? Drawn from all 50 states (plus DC).
    Sample? Could have been drawn at a different point in time.
  Are they a random sample?
    It is all 50 states + DC and all incomes within the states, so no.
    They would vary “randomly” at different points in time, so yes. The variation across states is random.
  What if the per capita incomes were each based on a “sample” of state residents?
    Then the values would be random within states. Still all 50 states.

Nonrandom Samples
  Nonrandom samples produce tainted, sometimes not believable results.
  Biased with respect to the population.
  Results reflect only the subpopulation from which the data are obtained.

(Non)Randomness of Samples
  Sources of bias in samples:
    Bad sample design: e.g., home phone surveys conducted during working hours
    Survey (non)response bias: e.g., hotel opinion surveys about service quality
    Participation bias: e.g., voluntary participation in the Literary Digest poll below
    Attrition bias from clinical trials: e.g., if the drug works, the subject does not come back
    Self selection: volunteering for a trial or an opinion sample (see below; Shere Hite’s cultural revolution)

Nonrandom Sampling – THE Classic Case
  Literary Digest, 1936, Alf Landon vs. Franklin Roosevelt. Survey result based on a HUGE sample.
  Prediction? “Final Returns in the Digest’s Poll of Ten Million Voters”:
    Landon, 1,293,669
    Roosevelt, 972,897
  Sample drawn from Literary Digest subscribers, telephone registrations, and drivers’ license registrations; the latter two both overrepresented the Republican side.
  Election result: Roosevelt by a landslide, 62%-38%.

Nonscientific, Nonrandom “(non)Sampling”
  A Cultural Revolution … “3000 women, ages 14 to 78 describe in their own words …”

The Lesson…
  In both cases: Having a really big sample does not assure you of an accurate result. It may assure you of a really solid, really bad (inaccurate) result.

A Descriptive Statistic
  Is … ?
  Describes what?
    The sample
    The process that produced the sample
    The population that the data came from

Measures of Location
  Location and central tendency
    There exists a distribution of values
    This is the “center” of the distribution
  The mean
  Symmetry and the median
  The mode and qualitative data
  These are the 30 hours of defect data on circuit boards. Roughly where do these data fall on the line?
    1.45 1.65 1.50 2.25 1.65 1.60 2.30 2.20 2.70 1.70
    2.35 1.70 1.90 1.45 1.40 2.60 2.05 1.70 1.05 2.35
    1.90 1.55 1.95 1.60 2.05 2.05 1.70 2.30 1.30 2.35

The Sample Mean
  The sample mean is the sum of the observations divided by their number: \( \bar{y} = \frac{1}{N} \sum_{i=1}^{N} y_i \)
  Applied to the same 30 hours of defect data shown under Measures of Location.

Average of Average Home Listings

Averaging Averages?
  Hawaii’s average listing = $896,800; Hawaii’s population = 1,275,194
  Illinois’ average listing = $377,683; Illinois’ population = 12,763,371
  Anything wrong here? Looks like Hawaii is getting too much influence.

A Properly Weighted Average
  New average is 409,234 compared to 369,687 without weights, an error of 11%.
  Don’t average averages!
  State populations from http://www.factmonster.com/ipka/A0004986.html

Averaging Time Series Observations Is Usually Not Informative
  Values labeled on the plot: 589230, 994320, 1414852, 1080732
  Does the mean over the entire observation period mean anything? (Does it estimate anything meaningful?)
  Note how the mean changes completely depending on what time interval is used to compute it.

The Sample Median
  Median = the middle observation after data are sorted.
    Odd number: Central observation. Med[1,2,4,6,8,9,17] = 6
    Even number: Midpoint between the two central observations. Med[1,2,4,6,8,9,14,17] = (6+8)/2 = 7

Sample Median of Defects
    1.05 1.30 1.40 1.45 1.45 1.50 1.55 1.60 1.60 1.65
    1.65 1.70 1.70 1.70 1.70 1.90 1.90 1.95 2.05 2.05
    2.05 2.20 2.25 2.30 2.30 2.35 2.35 2.35 2.60 2.70
  Median = 1.8000
  Mean = 1.8767

Extreme Observations Distort Means but Not Medians
  Outlying observations distort the mean:
    Med [1,2,4,6,8,9,17] = 6
    Mean[1,2,4,6,8,9,17] = 6.714
    Med [1,2,4,6,8,9,17000] = 6 (still)
    Mean[1,2,4,6,8,9,17000] = 2432.9 (!)
  This typically occurs when there are some outlying observations, such as in cross sections of income or wealth, and/or when the sample is not very large.

Trimming the Sample to Remove Distortions by Extreme Observations
  Note: Means of ordinal survey data. See HOG, p. 32.

Asymmetric Earnings Distribution: Mean vs. Median in Skewed Data
  Weekly Earnings, N = 595 (NLS, 1984)
  Mean = 1150
  Median = 1080
  The mean will exceed the median when the distribution is skewed to the right. (The skewness is in the direction of the long tail.)

Symmetric Distribution: Log Weekly Earnings
  Log Weekly Earnings, N = 595 (NLS, 1984)
  Mean = 6.95
  Median = 6.98
  Logs are often used to remove asymmetry from sample data.

Sample Mode
  Most frequently occurring value in the sample
  Not useful for continuous (measurement) data
  Possibly informative for discrete measurements (counts). (Usually not.)
  Use for qualitative data.

Unordered Qualitative Data: Travel Between Sydney and Melbourne
  Use the Mode.
  The mean and median make no sense, even if the responses are given numerical values. The values are just labels.
  Modal outcome is CAR for men, TRAIN for women.

Dispersion of the Observations
  These are the 30 hours of defect data on circuit boards.
    1.45 1.65 1.50 2.25 1.65 1.60 2.30 2.20 2.70 1.70
    2.35 1.70 1.90 1.45 1.40 2.60 2.05 1.70 1.05 2.35
    1.90 1.55 1.95 1.60 2.05 2.05 1.70 2.30 1.30 2.35
  We quantify the variation of the values around the mean.
  Note the range is from 1.05 to 2.70. This gives an idea where the data lie. The mean plus a measure of the variation do the same job (better).

A Measure of Dispersion
  Variation: \( S_{yy} = \sum_{i=1}^{N} (y_i - \bar{y})^2 \)
  Variance: \( s_y^2 = S_{yy} / (N - 1) \)
  Standard deviation: \( s_y = \sqrt{s_y^2} \)
  Note the units of measurement. The standard deviation has the same units as the mean.
  The standard deviation is the standard measure for the dispersion (spread) of a set of values (sample of observations).

WHY N-1 in the Denominator?
  Everyone else does it.
  Minitab does it.
  I have totally no idea.
  Tendency of the variance to be too small when computed using 1/N when the sample size, N, is itself small. When N is large, it won’t matter.
  See HOG, p. 37.

Computing a Standard Deviation
    Y    Deviation from mean    Squared deviation
    1           -2.1                  4.41
    4            0.9                  0.81
    6            2.9                  8.41
    0           -3.1                  9.61
    3           -0.1                  0.01
    2           -1.1                  1.21
    6            2.9                  8.41
    4            0.9                  0.81
    4            0.9                  0.81
    1           -2.1                  4.41
  Sum of Y = 31; Mean = 31/10 = 3.1
  Sum of squared deviations = 38.90
  Variance = 38.90/(10-1) = 4.322
  Standard Deviation = 2.079

Standard Deviation
  Applied to the 30 hours of defect data on circuit boards listed under Dispersion of the Observations.

Distribution of values

Reliable Rules of Thumb
  Almost always, 66% of the observations in a sample will lie in the range [mean – 1 s.d., mean + 1 s.d.].
  Almost always, 95% of the observations in a sample will lie in the range [mean – 2 s.d., mean + 2 s.d.].
  Almost always, 99.5% of the observations in a sample will lie in the range [mean – 3 s.d., mean + 3 s.d.].
  (For exactly normal data the corresponding figures are about 68%, 95%, and 99.7%.)

A Reliable Empirical Rule
  Mean ± 1 s = (1.47 to 2.28) includes 18/30 = 60%
  Mean ± 2 s = (1.06 to 2.69) includes 28/30 = 93%
  Minitab: Graph > Dotplot …

A Statistical Rule (Chebyshev Inequality)
  For any k > 1, at least \( 1 - 1/k^2 \) of the observations lie within k standard deviations of the mean.

A Convenient Computation
  \( S_{yy} = \sum_{i=1}^{N} y_i^2 - N \bar{y}^2 \)

Using the Variance Shortcut
  Note: modern computer programs never do this when they have the raw data. It can be wildly inaccurate when observations have many digits and differ widely.

Rules For Transformations
  Mean of a + bY = a + b\( \bar{y} \)
  Standard deviation of a + bY = |b| \( s_y \)
  Standard deviation of log(Y) is generally not even close to equal to the log of the standard deviation of Y.

Application – Cost of Defects
  These are the 30 hours of defect data on circuit boards.
    1.45 1.65 1.50 2.25 1.65 1.60 2.30 2.20 2.70 1.70
    2.35 1.70 1.90 1.45 1.40 2.60 2.05 1.70 1.05 2.35
    1.90 1.55 1.95 1.60 2.05 2.05 1.70 2.30 1.30 2.35
  Suppose the cost to repair defects is $25 + .10*Defects, i.e., a $25 setup cost plus 10 cents per defect.
  Mean defects = 1.8767
  Standard Deviation = 0.407205
  Mean Cost = $25 + 0.10(1.8767) = $25.18767
  Standard Deviation Cost = $0.10(.407205) = $0.0407205
  Std.Dev(log of defects) = 0.2229; log of Std.Dev of defects = -0.8984!

Covariation
  Variables Y and X vary together
  Causality vs. covariation: Does movement in X “cause” movement in Y in some metaphysical sense?
  Covariance
    Simultaneous movement through a statistical relationship
    Simultaneous variation “induced” by the variation of a common third effect

Scatter Plots Suggest Covariation

Regression Measures Covariation
  Regression Line: Listing = a + b IncomePC

Covariation Is Not Causation
  Price and Income seem to be “positively” related.
  The U.S. Gasoline Market. Data are yearly from 1953 to 2004. Plot of per capita income vs. gasoline price index.

The Hidden Relationship
  Not positively “related” to each other; both positively related to “time.”

Measuring Covariation with Correlation
  Covariance: \( s_{XY} = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y}) \)
    Units are X times Y, e.g., Price times Income. Not intuitive or natural.
  Correlation: \( r_{XY} = s_{XY} / (s_X s_Y) \), with \( -1 \le r_{XY} \le +1 \)
    Units free. A pure number. Variation around means is measured in standard deviation units.

Correlation
  r(Income, Listing) = +0.591

Correlations
  Example scatter plots with r = +1.0, r = 0.0, and r = +0.5

Sample Statistics and Population Parameters
  Sample has a sample mean, \( \bar{y} \), and standard deviation, \( s_Y \).
  Population has a mean, μ, and standard deviation, σ.
  The sample “looks like” the population.
  The sample statistics resemble the population features. The bigger the RANDOM sample, the closer the resemblance will be.

Populations and Samples
  Sometimes the sample is the population. “Sample” statistics are population quantities. E.g., an average of all 50 states.
  Other times the population is infinite and the sample is a trivial proportion of the population.
  How can a population be infinite? The subject of the rest of this course.

Summary
  Statistics to describe location (mean) and spread (standard deviation) of a sample of values
    Interpretations
    Computations
    Complications
  Statistics and graphical tools to describe bivariate (two variable) relationships
    Scatter plots
    Covariance and correlations
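
Worked example (not part of the original slides): the location and dispersion figures quoted for the circuit-board data can be reproduced with a few lines of Python. This is a minimal sketch, assuming the 30 hourly defect averages are exactly the values listed in the deck and using Python's standard statistics module in place of Minitab. The helper count_within and all variable names are illustrative.

```python
import statistics

# The 30 hourly defect averages quoted in the slides.
defects = [
    1.45, 1.65, 1.50, 2.25, 1.65, 1.60, 2.30, 2.20, 2.70, 1.70,
    2.35, 1.70, 1.90, 1.45, 1.40, 2.60, 2.05, 1.70, 1.05, 2.35,
    1.90, 1.55, 1.95, 1.60, 2.05, 2.05, 1.70, 2.30, 1.30, 2.35,
]

mean = statistics.mean(defects)
median = statistics.median(defects)
# statistics.stdev divides the sum of squared deviations by N - 1.
sd = statistics.stdev(defects)


def count_within(values, center, spread, k):
    """Count observations within k spreads of the center (illustrative helper)."""
    low, high = center - k * spread, center + k * spread
    return sum(1 for v in values if low <= v <= high)


within_1 = count_within(defects, mean, sd, 1)
within_2 = count_within(defects, mean, sd, 2)

# Chebyshev lower bound on the share of observations within 2 s.d.
chebyshev_2 = 1 - 1 / 2**2

# Linear transformation rule: cost = a + b * defects.
a, b = 25.0, 0.10
mean_cost = a + b * mean
sd_cost = abs(b) * sd

n = len(defects)
print(f"mean = {mean:.4f}, median = {median:.4f}, s.d. = {sd:.6f}")
print(f"within 1 s.d.: {within_1}/{n}, within 2 s.d.: {within_2}/{n}")
print(f"Chebyshev bound for 2 s.d.: at least {chebyshev_2:.0%}")
print(f"mean cost = {mean_cost:.5f}, s.d. of cost = {sd_cost:.7f}")
```

The counts within one and two standard deviations can be compared with the 18/30 and 28/30 reported in the deck, and the cost figures with the "Application – Cost of Defects" slide.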

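
Worked example (not part of the original slides): the warning on the "Using the Variance Shortcut" slide can be illustrated numerically. The sketch below assumes the shortcut is the usual identity, sum of squares minus N times the squared mean, and compares it with the two-pass computation based on deviations from the mean. The function names are illustrative, and the "large" data set is invented purely to show the loss of precision; only the "small" data come from the deck's worked table.

```python
def variance_two_pass(values):
    """Sample variance from deviations around the mean (N - 1 denominator)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


def variance_shortcut(values):
    """Sample variance from sum(y^2) - N * mean^2 (assumed shortcut form)."""
    n = len(values)
    total = 0.0
    total_sq = 0.0
    for v in values:
        total += v
        total_sq += v * v
    mean = total / n
    return (total_sq - n * mean * mean) / (n - 1)


# Small data from the "Computing a Standard Deviation" table: both methods agree.
small = [1, 4, 6, 0, 3, 2, 6, 4, 4, 1]

# Hypothetical data with many digits: the spread (variance 30) is tiny
# relative to the magnitude of the values.
large = [1e9 + d for d in (4.0, 7.0, 13.0, 16.0)]

print("small:", variance_two_pass(small), variance_shortcut(small))
print("large:", variance_two_pass(large), variance_shortcut(large))
```

With the small data the two results match to many digits. With the large values the squared terms exceed what double precision can hold exactly, so the shortcut result lands far from 30 while the two-pass result does not, which is consistent with the slide's remark that programs avoid the shortcut when they have the raw data.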