Simple Linear Regression
AMS 572
11/29/2010

Outline
1. Brief History and Motivation (Zhen Gong)
2. Simple Linear Regression Model (Wenxiang Liu)
3. Ordinary Least Squares Method (Ziyan Lou)
4. Goodness of Fit of LS Line (Yixing Feng)
5. OLS Example (Lingbin Jin)
6. Statistical Inference on Parameters (Letan Lin)
7. Statistical Inference Example (Emily Vo)
8. Regression Diagnostics (Yang Liu)
9. Correlation Analysis (Andrew Candela)
10. Implementation in SAS (Joseph Chisari)

Brief History and Introduction
Legendre published the earliest form of the method of least squares in 1805; Gauss published the same method in 1809. Francis Galton extended the method in the 19th century, using the term "regression" to describe a biological phenomenon. Karl Pearson and Udny Yule later extended regression to a more general statistical context around the 20th century.

Motivation for Regression Analysis
Regression analysis is a statistical methodology for estimating the relationship of a response variable to a set of predictor variables. When there is just one predictor variable, we use simple linear regression; when there are two or more, we use multiple linear regression. Given a newly observed value of the predictor X, we can then predict the response Y.

An example:
2010 Camry: horsepower at 6000 rpm: 169; highway gasoline consumption: 0.03125 gallon per mile
2010 Milan: horsepower at 6000 rpm: 175; highway gasoline consumption: 0.0326 gallon per mile
2010 Fusion: horsepower at 6000 rpm: 263; highway gasoline consumption: ?
Response variable (Y): highway gasoline consumption. Predictor variable (X): horsepower at 6000 rpm.

Simple Linear Regression Model
The model is a summary of the relationship between a dependent variable (or response variable) Y and an independent variable (or covariate) X. Y is assumed to be a random variable, while X, even if it is a random variable, is conditioned on (assumed fixed). Essentially, we are interested in the behavior of Y given that we know X = x.

Good Model
Regression models attempt to minimize the distance, measured vertically, between each observation point and the model line (or curve). The length of this line segment is called the residual, modeling error, or simply error. Requiring only that the negative and positive errors cancel out (zero overall error) is not enough: many lines satisfy that criterion, so a stronger criterion is needed.

Probabilistic Model
In simple linear regression, the population regression line is given by E(Y) = β₀ + β₁x. The actual values of Y are assumed to be the sum of the mean value E(Y) and a random error term ε:
Y = E(Y) + ε = β₀ + β₁x + ε.
At any given value of x, the dependent variable Y ~ N(β₀ + β₁x, σ²).

Least Squares (LS) Fit
Boiling point of water in the Alps:

Pressure  Boiling Pt    Pressure  Boiling Pt
20.79     194.5         24.01     201.3
20.79     194.3         25.14     203.6
22.40     197.9         26.57     204.6
22.67     198.4         28.49     209.5
23.15     199.4         27.76     208.6
23.35     199.9         29.04     210.7
23.89     200.9         29.88     211.9
23.99     201.1         30.06     212.2
24.02     201.4

We want to find a line that

represents the best linear relationship:
yᵢ = β₀ + β₁xᵢ + εᵢ,  i = 1, 2, ..., n.
The problem is that the data do not all lie on a single line, so we look for the line that minimizes the sum of squared errors
Q = Σᵢ₌₁ⁿ (yᵢ − β₀ − β₁xᵢ)².

To get the parameters that make Q a minimum, take the partial derivative with respect to each parameter and equate it to zero:
∂Q/∂β₀ = −2 Σ (yᵢ − β₀ − β₁xᵢ) = 0
∂Q/∂β₁ = −2 Σ (yᵢ − β₀ − β₁xᵢ)xᵢ = 0,
which gives the normal equations
Σ yᵢ = nβ₀ + β₁ Σ xᵢ
Σ xᵢyᵢ = β₀ Σ xᵢ + β₁ Σ xᵢ².

Solving the equations, we get
β̂₀ = [(Σ xᵢ²)(Σ yᵢ) − (Σ xᵢ)(Σ xᵢyᵢ)] / [n Σ xᵢ² − (Σ xᵢ)²]
β̂₁ = [n Σ xᵢyᵢ − (Σ xᵢ)(Σ yᵢ)] / [n Σ xᵢ² − (Σ xᵢ)²].

To simplify, we introduce
S_xy = Σ (xᵢ − x̄)(yᵢ − ȳ) = Σ xᵢyᵢ − (1/n)(Σ xᵢ)(Σ yᵢ)
S_xx = Σ (xᵢ − x̄)² = Σ xᵢ² − (1/n)(Σ xᵢ)²
S_yy = Σ (yᵢ − ȳ)² = Σ yᵢ² − (1/n)(Σ yᵢ)²,
so that
β̂₁ = S_xy / S_xx,  β̂₀ = ȳ − β̂₁x̄.

The resulting equation ŷ = β̂₀ + β̂₁x is known as the least squares line, which is an estimate of the true regression line.

Goodness of Fit of the LS Line
The fitted values are ŷᵢ = β̂₀ + β̂₁xᵢ, and the residuals eᵢ = yᵢ − (β̂₀ + β̂₁xᵢ) are used to evaluate the goodness of fit of the LS line. The error sum of squares is SSE = Σ eᵢ².
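The closed-form estimates above take only a few lines of code. A minimal sketch in plain Python (the toy data are made up for illustration and lie exactly on the line y = 1 + 2x, so the fit recovers it exactly):

```python
def ls_fit(x, y):
    """Least squares estimates via b1 = Sxy/Sxx and b0 = ybar - b1*xbar."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx          # slope estimate
    b0 = ybar - b1 * xbar   # intercept estimate
    return b0, b1

# Made-up data lying exactly on y = 1 + 2x
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]
b0, b1 = ls_fit(x, y)
print(b0, b1)  # 1.0 2.0
```

The same two-pass computation (means first, then the centered sums) is numerically safer than the raw-sums formula when the xᵢ are large.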

The total sum of squares is SST = Σ (yᵢ − ȳ)², and the regression sum of squares is SSR = Σ (ŷᵢ − ȳ)². Decomposing,
SST = Σ (yᵢ − ȳ)² = Σ (ŷᵢ − ȳ)² + Σ (yᵢ − ŷᵢ)² + 2 Σ (yᵢ − ŷᵢ)(ŷᵢ − ȳ),
and since the cross-product term equals 0, we get SST = SSR + SSE.

The coefficient of determination r² = SSR/SST is always between 0 and 1. The sample correlation coefficient between X and Y is r = S_xy / √(S_xx S_yy); for simple linear regression, r² coincides with the coefficient of determination.

Estimation of σ²
The variance σ² measures the scatter of the Yᵢ around their means. An unbiased estimate of σ² is given by
s² = SSE/(n − 2) = Σ eᵢ² / (n − 2).
This estimate of σ² has n − 2 degrees of freedom.
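The decomposition SST = SSR + SSE and the estimate s² are easy to verify numerically. A sketch with made-up toy data (plain Python, no libraries):

```python
# Toy data (made up for illustration)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
syy = sum((yi - ybar) ** 2 for yi in y)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * xi for xi in x]

sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))  # error sum of squares
ssr = sum((yh - ybar) ** 2 for yh in yhat)            # regression sum of squares
sst = syy                                             # total sum of squares
r2 = ssr / sst                                        # coefficient of determination
s2 = sse / (n - 2)                                    # unbiased estimate of sigma^2
print(round(sst, 6), round(ssr + sse, 6), round(r2, 2))  # 6.0 6.0 0.6
```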

Implementing the OLS Method on Problem 10.4
OLS method: minimize Q = Σᵢ₌₁ⁿ [yᵢ − (β₀ + β₁xᵢ)]².
The time between eruptions of the Old Faithful geyser in Yellowstone National Park is random but is related to the duration of the last eruption. The table below shows these times for 21 consecutive eruptions (LAST = duration of the last eruption, NEXT = time to the next eruption).

Obs  Last  Next    Obs  Last  Next    Obs  Last  Next
1    2.0   50      8    2.8   57      15   4.0   77
2    1.8   57      9    3.3   72      16   4.0   70
3    3.7   55      10   3.5   62      17   1.7   43
4    2.2   47      11   3.7   63      18   1.8   48
5    2.1   53      12   3.8   70      19   4.9   70
6    2.4   50      13   4.5   85      20   4.2   79
7    2.6   62      14   4.7   75      21   4.3   72

[Scatter plot of NEXT vs. LAST]

Summary statistics and estimates:
x̄ = 3.238,  ȳ = 62.714
S_xx = Σ (xᵢ − x̄)² = 22.230
S_yy = Σ (yᵢ − ȳ)² = 2844.286
S_xy = Σ (xᵢ − x̄)(yᵢ − ȳ) = 217.629
SSE = Σ (yᵢ − ŷᵢ)² = 713.687
SSR = Σ (ŷᵢ − ȳ)² = 2130.599
SST = S_yy = 2844.286
β̂₁ = S_xy / S_xx = 9.790
β̂₀ = ȳ − β̂₁x̄ = 31.013
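As a cross-check, all of these summary statistics can be reproduced directly from the table. A minimal sketch in plain Python (data keyed in from the slide above):

```python
# LAST (x) and NEXT (y) for the 21 Old Faithful eruptions
last = [2.0, 1.8, 3.7, 2.2, 2.1, 2.4, 2.6, 2.8, 3.3, 3.5, 3.7,
        3.8, 4.5, 4.7, 4.0, 4.0, 1.7, 1.8, 4.9, 4.2, 4.3]
next_ = [50, 57, 55, 47, 53, 50, 62, 57, 72, 62, 63,
         70, 85, 75, 77, 70, 43, 48, 70, 79, 72]
n = len(last)
xbar = sum(last) / n
ybar = sum(next_) / n
sxx = sum((x - xbar) ** 2 for x in last)
syy = sum((y - ybar) ** 2 for y in next_)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(last, next_))
b1 = sxy / sxx              # slope
b0 = ybar - b1 * xbar       # intercept
sse = syy - sxy ** 2 / sxx  # SSE = SST - SSR
print(round(xbar, 3), round(sxx, 3), round(sxy, 3))  # 3.238 22.23 217.629
print(round(b1, 3), round(b0, 3), round(sse, 3))     # 9.79 31.013 713.687
```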

The LS line is ŷ = β̂₀ + β̂₁x = 31.013 + 9.790x; when x = 3, ŷ ≈ 60. Moreover r = √(SSR/SST) = 0.865, so we could say that LAST is a good predictor of NEXT.

Statistical Inference on β₀ and β₁
Final result: β̂₀ and β̂₁ are normally distributed, with
E(β̂₀) = β₀,  E(β̂₁) = β₁
SD(β̂₀) = σ √(Σ xᵢ² / (n S_xx)),  SD(β̂₁) = σ / √S_xx,
so that
(β̂₀ − β₀)/SD(β̂₀) ~ N(0, 1)  and  (β̂₁ − β₁)/SD(β̂₁) ~ N(0, 1).

Derivation:

Treat the xᵢ as fixed and use Σ (xᵢ − x̄) = Σ xᵢ − n x̄ = 0. Then
β̂₁ = Σ (xᵢ − x̄)(Yᵢ − Ȳ)/S_xx = Σ (xᵢ − x̄)Yᵢ/S_xx
(the Ȳ term drops out because Σ (xᵢ − x̄) = 0), and β̂₀ = Ȳ − β̂₁x̄.

Expectation of β̂₁:
E(β̂₁) = Σ (xᵢ − x̄)E(Yᵢ)/S_xx = Σ (xᵢ − x̄)(β₀ + β₁xᵢ)/S_xx
      = [β₀ Σ (xᵢ − x̄) + β₁ Σ (xᵢ − x̄)xᵢ]/S_xx = β₁,
since Σ (xᵢ − x̄) = 0 and Σ (xᵢ − x̄)xᵢ = Σ (xᵢ − x̄)² = S_xx.

Variance of β̂₁:
Var(β̂₁) = Σ [(xᵢ − x̄)/S_xx]² Var(Yᵢ) = σ² Σ (xᵢ − x̄)²/S_xx² = σ²/S_xx.

Expectation of β̂₀:
E(β̂₀) = E(Ȳ − β̂₁x̄) = E(Ȳ) − x̄ E(β̂₁) = (1/n) Σ (β₀ + β₁xᵢ) − β₁x̄ = β₀ + β₁x̄ − β₁x̄ = β₀.

Variance of β̂₀:
Var(β̂₀) = Var(Ȳ − β̂₁x̄) = Var(Ȳ) + x̄² Var(β̂₁) = σ²/n + x̄²σ²/S_xx = σ² Σ xᵢ² / (n S_xx),
using Σ xᵢ² = Σ (xᵢ − x̄)² + n x̄² = S_xx + n x̄².

Since (n − 2)s²/σ² = SSE/σ² ~ χ²ₙ₋₂, replacing σ by s gives the standard errors
SE(β̂₀) = s √(Σ xᵢ² / (n S_xx)),  SE(β̂₁) = s/√S_xx,
and the pivotal quantities (P.Q.):
(β̂₀ − β₀)/SE(β̂₀) ~ tₙ₋₂,  (β̂₁ − β₁)/SE(β̂₁) ~ tₙ₋₂.
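Plugging the Old Faithful numbers into these SE formulas gives the values used in the worked example that follows. A sketch (plain Python; the data are re-entered so the block stands alone):

```python
import math

last = [2.0, 1.8, 3.7, 2.2, 2.1, 2.4, 2.6, 2.8, 3.3, 3.5, 3.7,
        3.8, 4.5, 4.7, 4.0, 4.0, 1.7, 1.8, 4.9, 4.2, 4.3]
next_ = [50, 57, 55, 47, 53, 50, 62, 57, 72, 62, 63,
         70, 85, 75, 77, 70, 43, 48, 70, 79, 72]
n = len(last)
xbar = sum(last) / n
ybar = sum(next_) / n
sxx = sum((x - xbar) ** 2 for x in last)
syy = sum((y - ybar) ** 2 for y in next_)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(last, next_))
s = math.sqrt((syy - sxy ** 2 / sxx) / (n - 2))       # s = sqrt(SSE/(n-2))
se_b1 = s / math.sqrt(sxx)                            # SE of the slope
se_b0 = s * math.sqrt(sum(x * x for x in last) / (n * sxx))  # SE of the intercept
print(round(s, 3), round(se_b1, 4), round(se_b0, 3))  # 6.129 1.2999 4.417
```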

Confidence intervals (CIs):
β̂₀ ± tₙ₋₂,α/₂ SE(β̂₀)  and  β̂₁ ± tₙ₋₂,α/₂ SE(β̂₁).

Hypothesis tests:
H₀: β₁ = β₁⁰ vs. H₁: β₁ ≠ β₁⁰. Reject H₀ at level α if
|t₀| = |β̂₁ − β₁⁰| / SE(β̂₁) > tₙ₋₂,α/₂.
A useful application is to show whether there is a linear relationship between x and y:
H₀: β₁ = 0 vs. H₁: β₁ ≠ 0; reject H₀ at level α if |t₀| = |β̂₁|/SE(β̂₁) > tₙ₋₂,α/₂.
One-sided alternative hypotheses can be tested using one-sided t-tests.

Analysis of Variance (ANOVA)
Mean square: a sum of squares divided by its degrees of freedom. Here MSR = SSR/1 and MSE = SSE/(n − 2) = s². Then
F₀ = MSR/MSE = SSR/s² = S_xx β̂₁²/s² = [β̂₁/(s/√S_xx)]² = t₀²,
so the F test of H₀: β₁ = 0 rejects when F₀ > f₁,ₙ₋₂,α = t²ₙ₋₂,α/₂.

ANOVA table:
Source      SS    d.f.   MS                F
Regression  SSR   1      MSR = SSR/1       F = MSR/MSE
Error       SSE   n − 2  MSE = SSE/(n − 2)
Total       SST   n − 1

Statistical Inference Example: Testing for a Linear Relationship (Problem 10.4)
At α = 0.05, is there a linear trend between the time to the NEXT eruption and the duration of the LAST eruption?
H₀: β₁ = 0 vs. H₁: β₁ ≠ 0. Reject H₀ if |t| > tₙ₋₂,α/₂, where t = β̂₁/SE(β̂₁).
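Before the hand calculation, here is a sketch of the same test in plain Python. The critical value t₁₉,₀.₀₂₅ = 2.093 is taken from a t table rather than computed, to keep the block dependency-free; the block also checks the ANOVA identity F₀ = t₀² from the previous slide:

```python
import math

last = [2.0, 1.8, 3.7, 2.2, 2.1, 2.4, 2.6, 2.8, 3.3, 3.5, 3.7,
        3.8, 4.5, 4.7, 4.0, 4.0, 1.7, 1.8, 4.9, 4.2, 4.3]
next_ = [50, 57, 55, 47, 53, 50, 62, 57, 72, 62, 63,
         70, 85, 75, 77, 70, 43, 48, 70, 79, 72]
n = len(last)
xbar, ybar = sum(last) / n, sum(next_) / n
sxx = sum((x - xbar) ** 2 for x in last)
syy = sum((y - ybar) ** 2 for y in next_)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(last, next_))
b1 = sxy / sxx
ssr = b1 * sxy                      # regression sum of squares
sse = syy - ssr                     # error sum of squares
s = math.sqrt(sse / (n - 2))
se_b1 = s / math.sqrt(sxx)

t = b1 / se_b1                      # t statistic for H0: beta1 = 0
f = (ssr / 1) / (sse / (n - 2))     # ANOVA F = MSR/MSE, equal to t^2
t_crit = 2.093                      # t_{19,0.025} from a t table
print(round(t, 3), abs(t) > t_crit)  # 7.531 True
```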

Solution:
β̂₁ = S_xy/S_xx = 217.629/22.230 = 9.790
SSE = Σ (yᵢ − ŷᵢ)² = 713.687,  s = √(SSE/(n − 2)) = √(713.687/19) = 6.129
SE(β̂₁) = s/√S_xx = 6.129/√22.230 = 1.2999
t = β̂₁/SE(β̂₁) = 9.790/1.2999 = 7.531
tₙ₋₂,α/₂ = t₁₉,₀.₀₂₅ = 2.093.
Since 7.531 > 2.093, we reject H₀ and conclude that there is a linear relationship between NEXT and LAST.

Statistical Inference Example: Confidence and Prediction Intervals
Problem 10.11 from Tamhane & Dunlop, Statistics and Data Analysis.
10.11(a): Calculate a 95% PI for the time to the next eruption if the last eruption lasted 3 minutes.
Solution: the formula for a 100(1 − α)% PI for a future observation Y*

is given by
Ŷ* ± tₙ₋₂,α/₂ · s √(1 + 1/n + (x* − x̄)²/S_xx).
Here
β̂₁ = S_xy/S_xx = 9.790,  β̂₀ = ȳ − β̂₁x̄ = 31.013
Ŷ* = β̂₀ + β̂₁x* = 31.013 + 9.790(3) = 60.383
s = √(SSE/(n − 2)) = 6.129,  tₙ₋₂,α/₂ = t₁₉,₀.₀₂₅ = 2.093,
so the 95% PI is
60.383 ± (2.093)(6.129) √(1 + 1/21 + (3 − 3.238)²/22.230) = [47.238, 73.529].

10.11(b): Calculate a 95% CI for the mean time to the next eruption for a last eruption lasting 3 minutes, and compare this confidence interval with the PI obtained in (a).
Solution: the formula for a 100(1 − α)% CI for the mean μ* = E(Y | x*) is given by
μ̂* ± tₙ₋₂,α/₂ · s √(1/n + (x* − x̄)²/S_xx),  where μ̂* = β̂₀ + β̂₁x*.
The 95% CI is [57.510, 63.257]. The CI is shorter than the PI.
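Both Problem 10.11 intervals can be reproduced in a few lines. A sketch (plain Python; 2.093 is t₁₉,₀.₀₂₅ from the t table):

```python
import math

last = [2.0, 1.8, 3.7, 2.2, 2.1, 2.4, 2.6, 2.8, 3.3, 3.5, 3.7,
        3.8, 4.5, 4.7, 4.0, 4.0, 1.7, 1.8, 4.9, 4.2, 4.3]
next_ = [50, 57, 55, 47, 53, 50, 62, 57, 72, 62, 63,
         70, 85, 75, 77, 70, 43, 48, 70, 79, 72]
n = len(last)
xstar, t_crit = 3.0, 2.093          # x* and t_{19,0.025} from a t table
xbar, ybar = sum(last) / n, sum(next_) / n
sxx = sum((x - xbar) ** 2 for x in last)
syy = sum((y - ybar) ** 2 for y in next_)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(last, next_))
b1 = sxy / sxx
b0 = ybar - b1 * xbar
s = math.sqrt((syy - sxy ** 2 / sxx) / (n - 2))
yhat = b0 + b1 * xstar              # point prediction at x* = 3

half_pi = t_crit * s * math.sqrt(1 + 1 / n + (xstar - xbar) ** 2 / sxx)
half_ci = t_crit * s * math.sqrt(1 / n + (xstar - xbar) ** 2 / sxx)
pi = (yhat - half_pi, yhat + half_pi)   # prediction interval
ci = (yhat - half_ci, yhat + half_ci)   # confidence interval for the mean
print([round(v, 3) for v in pi])    # [47.238, 73.529]
print([round(v, 3) for v in ci])    # [57.51, 63.257]
```

The PI's extra "1 +" under the square root accounts for the variance of a single future observation, which is why the PI is always wider than the CI.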

Regression Diagnostics
Checking the model assumptions:
1. E(Yᵢ) is a linear function of xᵢ.
2. Var(Yᵢ) = σ² is the same for all xᵢ.
3. The errors εᵢ are normally distributed.
4. The errors εᵢ are independent (important for time-series data).
We also check for outliers and influential observations.

Residuals: eᵢ = yᵢ − ŷᵢ. The eᵢ can be viewed as estimates of the random errors εᵢ, with
eᵢ ~ N(0, σ²[1 − 1/n − (xᵢ − x̄)²/S_xx]).

Checking for Linearity
If the regression of y on x is linear, then the plot of eᵢ vs. xᵢ should exhibit random scatter around zero.

Tire Wear Data:

i   xᵢ   yᵢ       ŷᵢ       eᵢ
1   0    394.33   360.64   33.69
2   4    329.50   331.51   -2.01
3   8    291.00   302.39   -11.39
4   12   255.17   273.27   -18.10
5   16   229.33   244.15   -14.82
6   20   204.83   215.02   -10.19
7   24   179.00   185.90   -6.90
8   28   163.83   156.78   7.05
9   32   150.33   127.66   22.67

[Scatter plot of y vs. x with the LS line, and residual plot of eᵢ vs. xᵢ]
The residuals are positive at both ends of the x range and negative in the middle: a systematic pattern, not random scatter, indicating that the relationship is not linear.
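The sign pattern of these residuals can be checked programmatically. A sketch that refits the tire wear data in plain Python (x and y keyed in from the table):

```python
x = [0, 4, 8, 12, 16, 20, 24, 28, 32]
y = [394.33, 329.50, 291.00, 255.17, 229.33, 204.83, 179.00, 163.83, 150.33]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
      / sum((xi - xbar) ** 2 for xi in x))
b0 = ybar - b1 * xbar
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print(round(b0, 2), round(b1, 2))    # 360.64 -7.28
# Positive at both ends, negative in the middle: a U shape, not random scatter
print([round(e, 2) for e in resid])
```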

Checking for Linearity: Data Transformations
When the plot shows a nonlinear pattern, transforming x and/or y can often linearize the relationship. Common transformations of x: x², x³, √x, log x, 1/x. Common transformations of y: y², y³, √y, log y, 1/y.

Checking for Constant Variance
If the constant variance assumption is correct, the dispersion of the eᵢ's is approximately constant with respect to the ŷᵢ's.
[Example from textbook Problem 10.21: plot of residuals vs. fitted values]

Checking for Normality
We can use the residuals to make a normal plot.
[Example from textbook Problem 10.21: normal probability plot of the residuals]

Checking for Outliers

Definition: an outlier is an observation that does not follow the general pattern of the relationship between y and x. A large standardized residual eᵢ* indicates an outlier:
eᵢ* = eᵢ/SE(eᵢ),  where SE(eᵢ) = s √(1 − 1/n − (xᵢ − x̄)²/S_xx).

Checking for Influential Observations
An observation can be influential because it has an extreme x-value, an extreme y-value, or both. Writing the fitted value as ŷᵢ = Σⱼ hᵢⱼ yⱼ, a large leverage hᵢᵢ indicates an influential observation:
hᵢᵢ = 1/n + (xᵢ − x̄)²/S_xx,  flagged when hᵢᵢ > 2(k + 1)/n,
where k is the number of predictors (k = 1 here).
[Scatter plot illustrating an influential observation]
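For the Old Faithful data from the earlier example, the leverages can be computed directly from this formula. A sketch (plain Python; k = 1 and the 2(k + 1)/n cutoff follow the rule of thumb above):

```python
last = [2.0, 1.8, 3.7, 2.2, 2.1, 2.4, 2.6, 2.8, 3.3, 3.5, 3.7,
        3.8, 4.5, 4.7, 4.0, 4.0, 1.7, 1.8, 4.9, 4.2, 4.3]
n, k = len(last), 1                          # one predictor in simple regression
xbar = sum(last) / n
sxx = sum((x - xbar) ** 2 for x in last)
h = [1 / n + (x - xbar) ** 2 / sxx for x in last]   # leverage h_ii of each obs
cutoff = 2 * (k + 1) / n                            # rule-of-thumb threshold
flagged = [i + 1 for i, hi in enumerate(h) if hi > cutoff]
print(round(max(h), 3), round(cutoff, 3), flagged)  # 0.172 0.19 []
```

No observation exceeds the cutoff here, which matches the fairly even spread of the LAST values. Note also that the leverages always sum to k + 1.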

Why Use Correlation Analysis?
If the nature of the relationship between X and Y is not known, we can investigate the correlation between them without making any assumption of causality. In order to do this, assume (X, Y) follows the bivariate normal distribution.

The Bivariate Normal Distribution
(X, Y) has joint density
f(x, y) = 1/(2π σ_X σ_Y √(1 − ρ²)) · exp{ −[(x − μ_X)²/σ_X² − 2ρ(x − μ_X)(y − μ_Y)/(σ_X σ_Y) + (y − μ_Y)²/σ_Y²] / [2(1 − ρ²)] }.

Why Can We Do This?
This assumption reduces to the probabilistic model for linear regression, since the conditional distribution of Y given X = x is normal with
E(Y | X = x) = μ_Y + ρ(σ_Y/σ_X)(x − μ_X)  and  Var(Y | X = x) = σ_Y²(1 − ρ²).
So when X = x, the mean of Y is a linear function of x and the variance is constant with respect to x.

So What?
Under these assumptions we can use the data available to make inferences about ρ. First we estimate ρ by the sample correlation coefficient
R = S_xy / √(S_xx S_yy).

How Can We Use This?
The exact distribution of R is very complicated, but we have some options. Under the null hypothesis H₀: ρ = 0 the distribution of R simplifies, and an exact test exists in this case. For arbitrary values of ρ₀ we can approximate a function of R with a normal distribution, thanks to R. A. Fisher.

Testing H₀: ρ = 0
Under H₀, the statistic
t = R √(n − 2) / √(1 − R²)
has the t(n − 2) distribution. This is somewhat surprising, but note that the statistic we used to test H₀: β₁ = 0 is also distributed as t(n − 2), and ρ = 0 if and only if β₁ = 0. That the two test statistics are equivalent is shown on pages 382-383 of the text.

Approximation of R
Fisher showed that, for n even as small as 10,
tanh⁻¹(R) = (1/2) ln[(1 + R)/(1 − R)] ≈ N(tanh⁻¹(ρ), 1/(n − 3)).
Now we can test H₀: ρ = ρ₀ vs. H₁: ρ ≠ ρ₀ for arbitrary ρ₀: just compute
z = [tanh⁻¹(R) − tanh⁻¹(ρ₀)] √(n − 3)
and compare it with the standard normal critical value z_{α/2}.

Almost Finished!
We now have the tools necessary for inference on ρ. For a confidence interval for ρ, compute
tanh⁻¹(R) ± z_{α/2}/√(n − 3)
and solve for ρ by applying tanh to both endpoints.
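The two procedures above can be sketched in a few lines of Python (r = 0.865 and n = 21 are taken from the Old Faithful example; 1.96 is z₀.₀₂₅):

```python
import math

r, n = 0.865, 21        # sample correlation and n from the Old Faithful example

# Exact test of H0: rho = 0 -- the same t statistic as the slope test
t = r * math.sqrt(n - 2) / math.sqrt(1 - r * r)
print(round(t, 2))      # 7.51, close to the slope t of 7.531 (r was rounded)

# Fisher: atanh(R) is approximately N(atanh(rho), 1/(n-3)); 95% CI for rho
z_crit = 1.96           # z_{0.025}
half = z_crit / math.sqrt(n - 3)
lo = math.tanh(math.atanh(r) - half)
hi = math.tanh(math.atanh(r) + half)
print(round(lo, 2), round(hi, 2))   # 0.69 0.94
```

Applying tanh to the endpoints is the "solve for ρ" step: the interval is built on the atanh scale, where the sampling distribution is approximately normal, then mapped back.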

Correlation: Conclusion
When we are not sure of the relationship between X and Y, assume (Xᵢ, Yᵢ) are observations from a bivariate normal distribution. To test H₀: ρ = ρ₀ vs. H₁: ρ ≠ ρ₀ at significance level α, compare |z| = |tanh⁻¹(R) − tanh⁻¹(ρ₀)| √(n − 3) with z_{α/2}; but if ρ₀ = 0, compare t = R √(n − 2)/√(1 − R²) with t(n − 2, α/2).

SAS: The Reg Procedure

Proc Reg Data=Regression_Example;
Title "Regression Example";
Model Next = Last;
Plot Next*Last;
Plot Residual.*Predicted.;
Output Out=Data_From_Regression Residual=R Predicted=PV;
Run;

[Proc Reg output]

[Plot of Next*Last]

SAS: Plotting the Regression Line

Symbol1 Value=Dot C=blue I=R;
Symbol2 Value=None C=red I=RLCLM95;
Proc Gplot Data=Regression_Example;
Title "Regression Line and CIs";
Plot Next*Last=1 Next*Last=2/Overlay;
Run;

[Plot of the regression line with 95% confidence limits]

SAS: Checking Homoscedasticity

Proc Reg Data=Regression_Example;
Title "Regression Example";
Model Next = Last;
Plot Next*Last;
Plot Residual.*Predicted.;
Output Out=Data_From_Regression Residual=R Predicted=PV;
Run;

[Plot of Residual.*Predicted.]

SAS: Checking Normality of Residuals

Proc Reg Data=Regression_Example;
Output Out=Data_From_Regression Residual=R Predicted=PV;
Proc Univariate Data=Data_From_Regression Normal;
Var R;
qqplot R / Normal(Mu=est Sigma=est);
Run;

[Normal Q-Q plot of the residuals]

Questions?