2014_Ch.17_Notes

8/17/2019 2014_Ch.17_Notes

1/6

QMDS 202 Data Analysis and Modeling

Chapter 17 Multiple Regression

Model and Required Conditions

For k independent variables (predicting variables) x1, x2, …, xk, the multiple linear regression model is represented by the following equation:

$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon $$

where β1, β2, …, βk are the population regression coefficients of x1, x2, …, xk respectively, β0 is the constant term, and ε (the Greek letter epsilon) represents the random term (also called the error variable): the difference between the actual value of Y and the estimated value of Y based on the values of the independent variables.

The random term thus accounts for all other independent variables that are not included in the model.

Required Conditions for the Error Variable:

1. The probability distribution of the error variable ε is normal.

2. The mean of the error variable is 0.

3. The standard deviation of ε is $\sigma_\varepsilon$, which is constant for each value of x.

4. The errors are independent.

The general form of the sample regression equation is expressed as follows:

$$ \hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k $$

where b1, b2, …, bk are the sample linear regression coefficients of x1, x2, …, xk respectively and b0 is the constant of the equation.

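For k = 2, the sample regression equation is $\hat{y} = b_0 + b_1 x_1 + b_2 x_2$, where b0, b1, and b2 can be found by solving a system of three normal equations:

$$
\begin{aligned}
\Sigma y &= n b_0 + b_1 \Sigma x_1 + b_2 \Sigma x_2 \\
\Sigma x_1 y &= b_0 \Sigma x_1 + b_1 \Sigma x_1^2 + b_2 \Sigma x_1 x_2 \\
\Sigma x_2 y &= b_0 \Sigma x_2 + b_1 \Sigma x_1 x_2 + b_2 \Sigma x_2^2
\end{aligned}
$$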

Example 1

x1     x2     y      y·x1    y·x2     x1·x2   x1²   x2²       ŷ
1      200    100    100     20000    200     1     40000     82.99
5      700    300    1500    210000   3500    25    490000    305.20
8      800    400    3200    320000   6400    64    640000    394.73
6      400    200    1200    80000    2400    36    160000    241.55
3      100    100    300     10000    300     9     10000     95.92
10     600    400    4000    240000   6000    100   360000    379.61
Total:
33     2800   1500   10300   880000   18800   235   1700000   1500

n = 6

$$
\begin{aligned}
1500 &= 6 b_0 + 33 b_1 + 2800 b_2 \\
10300 &= 33 b_0 + 235 b_1 + 18800 b_2 \\
880000 &= 2800 b_0 + 18800 b_1 + 1700000 b_2
\end{aligned}
$$

By solving the above system of normal equations, we should find the following:

b0 = 6.397    b1 = 20.492    b2 = 0.2805

∴ The sample multiple linear regression equation is:

$$ \hat{y} = 6.397 + 20.492 x_1 + 0.2805 x_2 $$
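Because the normal equations for k = 2 form a 3 × 3 linear system, they can be solved mechanically. The following is a minimal sketch using the Example 1 data, assuming plain Python and Cramer's rule; the helper names `det3` and `solve_normal_equations` are illustrative and do not come from the notes.

```python
# Example 1 data (n = 6).
x1 = [1, 5, 8, 6, 3, 10]
x2 = [200, 700, 800, 400, 100, 600]
y = [100, 300, 400, 200, 100, 400]


def det3(m):
    """Determinant of a 3x3 matrix given as a list of three rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def solve_normal_equations(x1, x2, y):
    """Solve the three normal equations for (b0, b1, b2) by Cramer's rule."""
    n = len(y)
    s1, s2, sy = sum(x1), sum(x2), sum(y)
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s1y = sum(a * c for a, c in zip(x1, y))
    s2y = sum(b * c for b, c in zip(x2, y))

    # Coefficient matrix and right-hand side of the normal equations.
    lhs = [[n, s1, s2], [s1, s11, s12], [s2, s12, s22]]
    rhs = [sy, s1y, s2y]

    d = det3(lhs)
    coeffs = []
    for col in range(3):
        # Replace one column of the coefficient matrix with the right-hand side.
        m = [row[:] for row in lhs]
        for r in range(3):
            m[r][col] = rhs[r]
        coeffs.append(det3(m) / d)
    return tuple(coeffs)


b0, b1, b2 = solve_normal_equations(x1, x2, y)
print(round(b0, 3), round(b1, 3), round(b2, 4))  # 6.397 20.492 0.2805
```

Cramer's rule is only practical for very small systems; for larger k a matrix routine from a numerical library would be the usual choice.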

Interpretation of the Regression Coefficients

b1: the approximate change in y if x1 is increased by 1 unit and x2 is held constant.

b2: the approximate change in y if x2 is increased by 1 unit and x1 is held constant.

In Example 1, if x1 is increased by 1 unit and x2 is held constant, then the approximate change in y will be 20.492 units.

Point Estimate

In Example 1, suppose x1 = 4 and x2 = 500; then the point estimate of y equals:

$$ \hat{y} = 6.397 + 20.492(4) + 0.2805(500) = 228.61 $$
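A point estimate is just the fitted equation evaluated at chosen inputs. As a small sketch, assuming the rounded Example 1 coefficients and the inputs x1 = 4 and x2 = 500 (the function name `point_estimate` is illustrative):

```python
def point_estimate(x1, x2, b0=6.397, b1=20.492, b2=0.2805):
    """Evaluate y-hat = b0 + b1*x1 + b2*x2 with the rounded Example 1 coefficients."""
    return b0 + b1 * x1 + b2 * x2


y_hat = point_estimate(4, 500)
print(y_hat)  # approximately 228.6
```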

The Standard Error of Estimate in Multiple Regression Model

$$ s_\varepsilon = \sqrt{\frac{\Sigma (y_i - \hat{y}_i)^2}{n - k - 1}} $$

where $y_i$ = the observed y value in the sample

$\hat{y}_i$ = the estimated y value calculated from the multiple regression equation

In Example 1, the values of $(y_i - \hat{y}_i)^2$ are:

(17.01)²
(−5.20)²
(5.27)²
(−41.55)²
(4.08)²
(20.39)²
Total: 2502.95

$$ s_\varepsilon = \sqrt{\frac{2502.95}{6 - 2 - 1}} = 28.88 $$

Note: $s_\varepsilon$ is the point estimate of $\sigma_\varepsilon$ (the standard deviation of the error variable ε).
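The standard error of estimate can be checked numerically. This sketch assumes the Example 1 data and the rounded coefficients b0 = 6.397, b1 = 20.492, b2 = 0.2805; because the residuals are not rounded to two decimals first, the sum of squared errors comes out slightly different from the hand calculation, while the standard error agrees to two decimals.

```python
import math

# Example 1 data (n = 6) and rounded coefficients.
x1 = [1, 5, 8, 6, 3, 10]
x2 = [200, 700, 800, 400, 100, 600]
y = [100, 300, 400, 200, 100, 400]
b0, b1, b2 = 6.397, 20.492, 0.2805
k = 2  # number of independent variables

# Fitted values and residuals e_i = y_i - y-hat_i.
y_hat = [b0 + b1 * a + b2 * b for a, b in zip(x1, x2)]
residuals = [obs - fit for obs, fit in zip(y, y_hat)]

# Sum of squared errors and the standard error of estimate.
sse = sum(e * e for e in residuals)
s_eps = math.sqrt(sse / (len(y) - k - 1))
print(round(s_eps, 2))  # 28.88
```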

Testing the Validity of the Model: The Analysis of Variance (ANOVA)

Let's consider a simple linear regression model:

[Figure: scatter plot of y against x, marking the mean of y, $\bar{y} = \Sigma y / n$.]

$$ (y_i - \bar{y}) = (\hat{y}_i - \bar{y}) + (y_i - \hat{y}_i) $$

$$ \Rightarrow \Sigma (y_i - \bar{y}) = \Sigma (\hat{y}_i - \bar{y}) + \Sigma (y_i - \hat{y}_i) $$

$\Sigma (y_i - \bar{y})$ = total deviations

$\Sigma (\hat{y}_i - \bar{y})$ = total deviations of estimated values from the mean

$\Sigma (y_i - \hat{y}_i)$ = error deviations = $\Sigma e_i$

$e_i = y_i - \hat{y}_i$ = the residual of the ith data point

$$ \Sigma (y_i - \bar{y})^2 = \Sigma (\hat{y}_i - \bar{y})^2 + \Sigma (y_i - \hat{y}_i)^2 $$

⇒ SST = SSR + SSE

SST = total sum of squared deviations = total variation

SSR = sum of squares resulting from regression = explained variation

SSE = sum of squares resulting from sampling error = unexplained variation

The ANOVA


(Refer to the associated computer output of this example)

H0: The regression model is not significant (β1 = β2 = … = βk = 0)

H1: The regression model is significant (at least one βi ≠ 0)

α = 0.05    df1 = k = 2    df2 = n − k − 1 = 6 − 2 − 1 = 3

Critical value = 9.55

Test statistic = 55.44 > 9.55 ⇒ Reject H0

We can also use the p-value provided by the output to arrive at the conclusion:

p-value = 0.0043 < α = 0.05 ⇒ Reject H0

∴ The regression model is significant. (There is at least one independent variable that can explain Y.)
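The F statistic for the overall test is the ratio of the mean square for regression to the mean square for error, F = (SSR / k) / (SSE / (n − k − 1)). The sketch below recomputes it for Example 1, assuming the rounded coefficients and taking SSR as SST − SSE. The p-value line uses a closed form that holds only when the numerator degrees of freedom equal 2; it is an addition for illustration and is not how the notes obtain the p-value (they read it from computer output).

```python
# Example 1 data (n = 6) and rounded coefficients.
x1 = [1, 5, 8, 6, 3, 10]
x2 = [200, 700, 800, 400, 100, 600]
y = [100, 300, 400, 200, 100, 400]
b0, b1, b2 = 6.397, 20.492, 0.2805
n, k = len(y), 2
df2 = n - k - 1

y_bar = sum(y) / n
y_hat = [b0 + b1 * a + b2 * b for a, b in zip(x1, x2)]

sst = sum((obs - y_bar) ** 2 for obs in y)               # total variation
sse = sum((obs - fit) ** 2 for obs, fit in zip(y, y_hat))  # unexplained variation
ssr = sst - sse                                          # explained variation

f_stat = (ssr / k) / (sse / df2)

# Upper-tail probability of an F(2, df2) variable; valid only for df1 = 2.
p_value = (1 + 2 * f_stat / df2) ** (-df2 / 2)

print(round(f_stat, 2), round(p_value, 4))  # 55.44 0.0043
```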

II The t-Tests for Regression Coefficients (Slopes)

A t-test is used to determine if there is a meaningful relationship between the dependent variable and one of the independent variables.

In Example 1, the t-test for X1 (again refer to the computer output of this example):

H0: X1 is not a significant independent variable (β1 = 0)

H1: X1 is a significant independent variable (β1 ≠ 0)

α = 0.05    α/2 = 0.025    df = n − k − 1 = 6 − 2 − 1 = 3

Critical values = ± 3.182

Reject H0 if TS < −3.182 or TS > 3.182

$$ TS = \frac{b_1 - \beta_1}{S_{b_1}} $$

where β1 takes its hypothesized value of 0 and $S_{b_1}$ = estimated standard deviation of b1

$$ TS = \frac{20.492 - 0}{5.882} = 3.484 > 3.182 \Rightarrow \text{Reject } H_0 $$

p-value approach:

p-value = 0.04 < α = 0.05 ⇒ Reject H0

∴ The slope β1 is significant, that is, there is a meaningful relationship between X1 and Y.

The t-test for X2:

H0: X2 is not a significant independent variable (β2 = 0)

H1: X2 is a significant independent variable (β2 ≠ 0)

α = 0.05    α/2 = 0.025    df = n − k − 1 = 6 − 2 − 1 = 3

Critical values = ± 3.182

Reject H0 if TS < −3.182 or TS > 3.182

$$ TS = \frac{b_2 - \beta_2}{S_{b_2}} $$

where β2 takes its hypothesized value of 0 and $S_{b_2}$ = estimated standard deviation of b2

$$ TS = \frac{0.2805 - 0}{0.0686} = 4.09 > 3.182 \Rightarrow \text{Reject } H_0 $$


p-value approach:

p-value = 0.0264 < α = 0.05 ⇒ Reject H0

∴ X2 is also a significant independent variable.
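The two test statistics can be reproduced from the data. The notes take the standard deviations of b1 and b2 from computer output; the sketch below instead assumes the standard two-predictor formulas based on sums of squares about the means, together with the rounded values b1 = 20.492, b2 = 0.2805 and the standard error of estimate 28.88 from Example 1.

```python
import math

# Example 1 data (n = 6) and rounded results.
x1 = [1, 5, 8, 6, 3, 10]
x2 = [200, 700, 800, 400, 100, 600]
n = len(x1)
b1, b2 = 20.492, 0.2805
s_eps = 28.88   # standard error of estimate
t_crit = 3.182  # t(0.025, df = 3) from the t table

# Sums of squares and cross-products about the means.
s11 = sum(a * a for a in x1) - sum(x1) ** 2 / n
s22 = sum(b * b for b in x2) - sum(x2) ** 2 / n
s12 = sum(a * b for a, b in zip(x1, x2)) - sum(x1) * sum(x2) / n
det = s11 * s22 - s12 ** 2

# Estimated standard deviations of b1 and b2 (two-predictor case only).
se_b1 = s_eps * math.sqrt(s22 / det)
se_b2 = s_eps * math.sqrt(s11 / det)

# Test statistics under H0: beta = 0.
t1 = (b1 - 0) / se_b1
t2 = (b2 - 0) / se_b2

print(round(se_b1, 3), round(t1, 2), abs(t1) > t_crit)
print(round(se_b2, 4), round(t2, 2), abs(t2) > t_crit)
```

Both statistics exceed the critical value, which matches the conclusions reached above for X1 and X2.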

In case there are some insignificant independent variables in the model (the p-values of some regression coefficients are bigger than α), we should take out the most insignificant variable from the model (the one with the highest p-value) and run the regression function once again by using only the remaining variables. Then we observe the p-values of the coefficients in this new model and repeat the same procedure (if necessary) until all the p-values are less than α.
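The elimination procedure described above is a simple loop. The following sketch shows its control flow only: `fit_p_values` is a hypothetical caller-supplied function that refits the regression on the given variables and returns their p-values, and the numbers in `fake_fit` are made up to exercise the loop (they are not from Example 1).

```python
def backward_eliminate(variables, fit_p_values, alpha=0.05):
    """Drop the least significant variable and refit until all p-values < alpha.

    fit_p_values(names) is assumed to fit the regression on the listed
    variables and return a dict mapping each name to its coefficient's p-value.
    """
    remaining = list(variables)
    while remaining:
        p_values = fit_p_values(remaining)
        worst = max(remaining, key=lambda name: p_values[name])
        if p_values[worst] < alpha:
            break  # every remaining variable is significant
        remaining.remove(worst)
    return remaining


# Made-up p-values standing in for real regression output.
def fake_fit(names):
    table = {
        ("x1", "x2", "x3"): {"x1": 0.01, "x2": 0.30, "x3": 0.20},
        ("x1", "x3"): {"x1": 0.02, "x3": 0.04},
    }
    return table[tuple(names)]


selected = backward_eliminate(["x1", "x2", "x3"], fake_fit)
print(selected)  # ['x1', 'x3']
```

Only one variable is removed per pass because the p-values of the remaining coefficients change once the model is refitted.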

The Coefficient of Multiple Determination (R²)

$$ R^2 = \frac{SSR}{SST} = \frac{\text{explained variation}}{\text{total variation}} $$

In Example 1,

$$ R^2 = \frac{92{,}497}{92{,}497 + 2{,}503} = 0.974 $$

We can conclude that 97.4% of the variation in Y is explained by using X1 and X2 as independent variables.

The Adjusted R²

The adjusted R² has been adjusted to take into account the sample size and the number of independent variables. The rationale for this statistic is that, if the number of independent variables k is large relative to the sample size n, the unadjusted R² value may be unrealistically high.

$$ \text{Adjusted } R^2 = 1 - \frac{SSE/(n-k-1)}{SST/(n-1)} $$

If n is considerably larger than k, the actual and adjusted R² values will be similar. But if SSE is quite different from 0 and k is large compared to n, the actual and adjusted values of R² will differ substantially.

$$ R^2_{adj} = 1 - \frac{SSE/(n-k-1)}{SST/(n-1)} = 1 - (1 - R^2)\left(\frac{n-1}{n-k-1}\right) $$

In Example 1,

$$ R^2_{adj} = 1 - \frac{2502.636/3}{95000/5} = 1 - \frac{834.212}{19{,}000} = 0.956 $$
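Both R² and the adjusted R² follow directly from the sums of squares. A short sketch, assuming the Example 1 values SSE = 2502.636 and SST = 95000, which also confirms that the two forms of the adjusted R² formula agree:

```python
# Sums of squares for Example 1.
sse = 2502.636
sst = 95000.0
n, k = 6, 2

ssr = sst - sse
r2 = ssr / sst

# Adjusted R-squared, written in both equivalent forms.
adj_from_ss = 1 - (sse / (n - k - 1)) / (sst / (n - 1))
adj_from_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(round(r2, 3), round(adj_from_ss, 3))  # 0.974 0.956
```

With only six observations and two independent variables, the adjustment lowers the figure from about 97.4% to about 95.6%.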

The Multicollinearity Problem in Multiple Regression Model


Multicollinearity is the name given to the situation in which two independent variables (e.g. Xi and Xj) are closely correlated. If this is the case, the values of the two regression coefficients (bi and bj) tend to be unreliable, and an estimate made with an equation that uses these values also tends to be unreliable. This is because, if Xi and Xj are closely correlated, values in Xj don't necessarily remain constant while Xi changes.

If two independent variables are closely correlated, that is, if their correlation coefficient (r) is close to ± 1, a simple solution to the multicollinearity problem is to use just one of them in a multiple regression model.

As a rule of thumb, if the absolute value of r of Xi and Xj is bigger than or equal to 0.8, then we should drop one of them from the regression model.

In Example 1, r of X1 and X2 = 0.741 is not bigger than 0.8

⇒ X1 and X2 can be used together in the model.
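The correlation check can be carried out before fitting the model. This sketch computes the correlation coefficient of X1 and X2 for the Example 1 data; the function name `correlation` is illustrative, and the 0.8 cut-off is the rule of thumb stated above.

```python
import math


def correlation(u, v):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(u)
    suu = sum(a * a for a in u) - sum(u) ** 2 / n
    svv = sum(b * b for b in v) - sum(v) ** 2 / n
    suv = sum(a * b for a, b in zip(u, v)) - sum(u) * sum(v) / n
    return suv / math.sqrt(suu * svv)


x1 = [1, 5, 8, 6, 3, 10]
x2 = [200, 700, 800, 400, 100, 600]

r = correlation(x1, x2)
keep_both = abs(r) < 0.8  # rule of thumb from the notes
print(round(r, 3), keep_both)  # 0.741 True
```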

Interval Estimates for Population Regression Coefficients

The confidence interval of βi is found by:

$$ b_i \pm t_{\alpha/2} \, S_{b_i}, \qquad df = n - k - 1 $$

In Example 1, the 95% confidence interval of β1 is:

20.492 ± 3.182 × 5.882

= (1.77 to 39.21)
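The interval arithmetic is easy to verify. A minimal sketch, assuming the Example 1 values b1 = 20.492 and S_b1 = 5.882 and the table value t = 3.182 for df = 3:

```python
b1 = 20.492      # sample slope for X1
se_b1 = 5.882    # estimated standard deviation of b1
t_crit = 3.182   # t(0.025, df = 3) from the t table

margin = t_crit * se_b1
lower, upper = b1 - margin, b1 + margin
print(round(upper, 2))  # 39.21
```

The lower limit is about 1.775, so the interval excludes 0; this is consistent with rejecting H0: β1 = 0 at the 5% level in the t-test for X1.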
