3. Linear Methods for Regression
Contents
• Least Squares Regression
• QR Decomposition for Multiple Regression
• Subset Selection
• Coefficient Shrinkage
1. Introduction
• Outline
• The simple linear regression model
• Multiple linear regression
• Model selection and shrinkage—the state of the art
Regression
[Scatter plot: Y (0 to 16) against X (0 to 10)]
How can we model the generative process for this data?
Linear Assumption
• A linear model assumes the regression function E(Y | X) is reasonably approximated as linear, i.e.
• The regression function f(x) = E(Y | X = x) is the result of minimizing squared expected prediction error
• Making this assumption gives high bias but low variance
f(X) = \beta_0 + \sum_{j=1}^{p} X_j \beta_j, \qquad X = (X_1, X_2, \ldots, X_p)
Least Squares Regression
• Estimate the parameters β from a set of training data (x_1, y_1), …, (x_N, y_N)
• Minimize the residual sum of squares:

RSS(\beta) = \sum_{i=1}^{N} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\,\beta_j \Bigr)^2

This is a reasonable criterion when the training samples are random, independent draws, OR when the y_i are conditionally independent given x_i.
Matrix Notation
 X is N  (p+1) of
input vectors
 y is the N-vector of
outputs (labels)
  is the (p+1)-
vector of
parameters
X = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{bmatrix}
  = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{N1} & x_{N2} & \cdots & x_{Np} \end{bmatrix},
\qquad
y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix},
\qquad
\beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix}
Perfectly Linear Data
• When the data is exactly linear, there exists β such that

y = X\beta

(the linear regression model in matrix form)
• Usually the data is not an exact fit, so…
Finding the Best Fit?
[Scatter plot with candidate fits: Y (-4 to 20) against X (0 to 10)]
Fitting data from Y = 1.5X + 0.35 + N(0, 1.2)
Minimize the RSS
• We can rewrite the RSS in matrix form:

RSS(\beta) = (y - X\beta)^T (y - X\beta)

• A least squares fit minimizes the RSS
• Solve for the parameters at which the first derivative of the RSS is zero
Solving Least Squares
• Derivative of a quadratic product:

\frac{d}{dx}\,(Ax + b)^T C (Dx + e) = A^T C (Dx + e) + D^T C^T (Ax + b)

• Then, writing RSS(\beta) = (y - X\beta)^T I_N (y - X\beta),

\frac{d\,RSS}{d\beta} = -2 X^T I_N (y - X\beta) = -2\left( X^T y - X^T X \beta \right)

• Setting the first derivative to zero gives the normal equations:

X^T X \beta = X^T y
Least Squares Solution
• Least squares coefficients:

\hat{\beta} = (X^T X)^{-1} X^T y

• Least squares predictions:

\hat{y} = X\hat{\beta} = X (X^T X)^{-1} X^T y

• Estimated variance:

\hat{\sigma}^2 = \frac{RSS(\hat{\beta})}{N - p - 1} = \frac{1}{N - p - 1} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2
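As a minimal NumPy sketch of this closed form (the function name and the simulated data, echoing the earlier Y = 1.5X + 0.35 + N(0, 1.2) slide, are illustrative, not from the slides):

```python
import numpy as np

def least_squares_fit(X, y):
    """OLS via the normal equations; X is N x (p+1) with a leading 1s column."""
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)      # (X^T X)^{-1} X^T y
    y_hat = X @ beta_hat
    resid = y - y_hat
    sigma2_hat = resid @ resid / (X.shape[0] - X.shape[1])  # RSS / (N - p - 1)
    return beta_hat, y_hat, sigma2_hat

# Illustrative data in the spirit of the earlier slide
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 1.5 * x + 0.35 + rng.normal(0, 1.2, size=50)
X = np.column_stack([np.ones_like(x), x])   # prepend the intercept column
beta_hat, y_hat, sigma2_hat = least_squares_fit(X, y)
```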
The N-dimensional Geometry of Least Squares Regression
Statistics of Least Squares
• We can draw inferences about the parameters β by assuming the true model is linear with additive noise, i.e.

Y = \beta_0 + \sum_{j=1}^{p} X_j \beta_j + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2)

• Then

\hat{\beta} \sim N\!\left(\beta,\; (X^T X)^{-1} \sigma^2\right), \qquad (N - p - 1)\,\hat{\sigma}^2 \sim \sigma^2 \chi^2_{N-p-1}
Significance of One Parameter
• Can we eliminate one parameter X_j, i.e. test whether β_j = 0?
• Look at the standardized coefficient

z_j = \frac{\hat{\beta}_j}{\hat{\sigma}\sqrt{v_j}} \sim t_{N-p-1}

where v_j is the jth diagonal element of (X^T X)^{-1}
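A short NumPy sketch of this computation; `coefficient_z_scores` is a name chosen here for illustration:

```python
import numpy as np

def coefficient_z_scores(X, y):
    """Standardized coefficients z_j = beta_j / (sigma_hat * sqrt(v_j)).

    X is N x (p+1) with a leading column of ones."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    sigma2_hat = resid @ resid / (X.shape[0] - X.shape[1])  # RSS / (N - p - 1)
    v = np.diag(XtX_inv)                                    # v_j
    return beta_hat / np.sqrt(sigma2_hat * v)
```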
Significance of Many Parameters
• We may want to test many features at once
• Compare model M_1 with p_1 + 1 parameters to a nested model M_0 built from p_0 + 1 of those parameters (p_0 < p_1)
• Use the F statistic:

F = \frac{(RSS_0 - RSS_1)/(p_1 - p_0)}{RSS_1/(N - p_1 - 1)} \sim F_{p_1 - p_0,\; N - p_1 - 1}
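A hedged sketch of the F statistic in NumPy; `f_statistic` is an illustrative name, and X0's columns are assumed to be a subset of X1's:

```python
import numpy as np

def f_statistic(X1, X0, y):
    """F statistic comparing a full model X1 (N x (p1+1)) to a nested model X0."""
    def rss(X):
        beta = np.linalg.solve(X.T @ X, X.T @ y)
        r = y - X @ beta
        return r @ r
    rss0, rss1 = rss(X0), rss(X1)
    d1, d0 = X1.shape[1], X0.shape[1]                # p1 + 1 and p0 + 1
    return ((rss0 - rss1) / (d1 - d0)) / (rss1 / (X1.shape[0] - d1))
```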
Confidence Interval for Beta
• We can find a confidence interval for β_j
• Confidence interval for a single parameter (a 1 − 2α confidence interval for β_j):

\left( \hat{\beta}_j - z^{(1-\alpha)} v_j^{1/2} \hat{\sigma},\;\; \hat{\beta}_j + z^{(1-\alpha)} v_j^{1/2} \hat{\sigma} \right)

• Confidence set for the entire parameter vector (bounds on β):

C_\beta = \left\{ \beta : (\hat{\beta} - \beta)^T X^T X (\hat{\beta} - \beta) \le \hat{\sigma}^2\, \chi^{2\,(1-\alpha)}_{p+1} \right\}
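A sketch of the single-parameter interval, using SciPy's normal quantile for z^{(1−α)}; `beta_confidence_interval` is a hypothetical helper name:

```python
import numpy as np
from scipy import stats

def beta_confidence_interval(X, y, j, alpha=0.025):
    """1 - 2*alpha confidence interval for beta_j; X is N x (p+1)."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    sigma_hat = np.sqrt(resid @ resid / (X.shape[0] - X.shape[1]))
    z = stats.norm.ppf(1 - alpha)                 # z^(1-alpha)
    half = z * np.sqrt(XtX_inv[j, j]) * sigma_hat
    return beta_hat[j] - half, beta_hat[j] + half
```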
2.1: Prostate Cancer (Example)
• Data
  • lcavol: log cancer volume
  • lweight: log prostate weight
  • age: age
  • lbph: log of benign prostatic hyperplasia amount
  • svi: seminal vesicle invasion
  • lcp: log of capsular penetration
  • gleason: Gleason score
  • pgg45: percent of Gleason scores 4 or 5
Technique for Multiple Regression
• Computing (X^T X)^{-1} directly has poor numerical properties
• Instead, use the QR decomposition of X
• Decompose X = QR, where
  • Q is N × (p+1) with orthonormal columns (Q^T Q = I_{p+1})
  • R is a (p+1) × (p+1) upper triangular matrix
• Then

\hat{\beta} = (X^T X)^{-1} X^T y = (R^T Q^T Q R)^{-1} R^T Q^T y = (R^T R)^{-1} R^T Q^T y = R^{-1} Q^T y

\hat{y} = X\hat{\beta} = Q Q^T y

• Column by column, X = QR reads

x_1 = r_{11} q_1
x_2 = r_{12} q_1 + r_{22} q_2
x_3 = r_{13} q_1 + r_{23} q_2 + r_{33} q_3
…
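A minimal sketch of the QR route, assuming NumPy's built-in reduced QR:

```python
import numpy as np

def least_squares_qr(X, y):
    """Least squares via QR, avoiding the poorly conditioned X^T X."""
    Q, R = np.linalg.qr(X)               # reduced QR: Q is N x (p+1)
    # R is triangular; scipy.linalg.solve_triangular could exploit that
    return np.linalg.solve(R, Q.T @ y)   # beta_hat = R^{-1} Q^T y
```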
Gram-Schmidt Procedure
1) Initialize z_0 = x_0 = 1
2) For j = 1 to p:
   For k = 0 to j−1, regress x_j on the z_k's, giving the univariate least squares estimates

\hat{\gamma}_{kj} = \frac{\langle z_k, x_j \rangle}{\langle z_k, z_k \rangle}

   Then compute the next residual

z_j = x_j - \sum_{k=0}^{j-1} \hat{\gamma}_{kj}\, z_k

3) Let Z = [z_0\; z_1 \cdots z_p] and let \Gamma be upper triangular with entries \hat{\gamma}_{kj}. Then

X = Z\Gamma = Z D^{-1} D \Gamma = QR

where D is diagonal with D_{jj} = \lVert z_j \rVert
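A sketch of the procedure in NumPy (classical Gram-Schmidt; `successive_orthogonalization` is a name chosen here):

```python
import numpy as np

def successive_orthogonalization(X):
    """Gram-Schmidt on the columns of X (including the leading 1s column).

    Returns Z (orthogonal residual columns) and Gamma (upper triangular,
    unit diagonal) with X = Z @ Gamma; normalizing Z by D gives Q and R."""
    n, m = X.shape
    Z = np.zeros((n, m))
    Gamma = np.eye(m)
    for j in range(m):
        Z[:, j] = X[:, j]
        for k in range(j):
            # gamma_kj = <z_k, x_j> / <z_k, z_k>
            Gamma[k, j] = (Z[:, k] @ X[:, j]) / (Z[:, k] @ Z[:, k])
            Z[:, j] -= Gamma[k, j] * Z[:, k]   # subtract the projection
    return Z, Gamma
```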
Subset Selection
• We want to eliminate unnecessary features
• Best subset regression
  • Choose the subset of size k with lowest RSS
  • The Leaps and Bounds procedure works for p up to about 40
• Forward stepwise selection
  • Repeatedly add the feature with the largest F-ratio
• Backward stepwise selection
  • Remove the feature with the smallest F-ratio
These are greedy techniques, not guaranteed to find the best model.
F = \frac{RSS_0 - RSS_1}{RSS_1/(N - p_1 - 1)} \sim F_{1,\; N - p_1 - 1}
Coefficient Shrinkage
• Use additional penalties to reduce the coefficients
• Ridge regression
  • Minimize least squares subject to \sum_{j=1}^{p} \beta_j^2 \le s
• The lasso
  • Minimize least squares subject to \sum_{j=1}^{p} |\beta_j| \le s
• Principal components regression
  • Regress on M < p principal components of X
• Partial least squares
  • Regress on M < p directions of X, weighted by y
4.2 Prostate Cancer Data Example (Continued)
Error Comparison
Shrinkage Methods (Ridge Regression)
• Minimize RSS(β) + λ β^T β
  • Use centered data, so β_0 is not penalized:

\hat{\beta}_0 = \bar{y} = \sum_{i=1}^{N} y_i / N, \qquad x_{ij} \leftarrow x_{ij} - \bar{x}_j

  • The inputs are now of length p, no longer including the initial 1
• The ridge criterion and estimates are:

RSS(\lambda) = (y - X\beta)^T (y - X\beta) + \lambda\, \beta^T \beta

\hat{\beta}^{\,ridge} = (X^T X + \lambda I_p)^{-1} X^T y
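A minimal NumPy sketch of the ridge closed form on centered data (`ridge_fit` is an illustrative name):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge regression: beta = (X^T X + lam * I_p)^{-1} X^T y.

    X here is N x p, without the column of ones."""
    Xc = X - X.mean(axis=0)          # center each input column
    beta0 = y.mean()                 # unpenalized intercept
    p = Xc.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ (y - beta0))
    return beta0, beta
```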
Shrinkage Methods (Ridge Regression)
The Lasso
• Use centered data, as before
• The L1 penalty makes the solutions nonlinear in the y_i
  • Quadratic programming is used to compute them

RSS(\beta) = \sum_{i=1}^{N} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Bigr)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le s
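The slide mentions quadratic programming; as a lighter-weight alternative sketch, cyclic coordinate descent with soft-thresholding solves the equivalent penalized form min ‖y − Xβ‖² + λ Σ|β_j| (each bound s corresponds to some λ):

```python
import numpy as np

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Lasso via cyclic coordinate descent on centered data (X is N x p)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    n, p = Xc.shape
    beta = np.zeros(p)
    col_sq = np.sum(Xc ** 2, axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # partial residual with coordinate j's contribution added back
            r = yc - Xc @ beta + Xc[:, j] * beta[j]
            rho = Xc[:, j] @ r
            # soft-threshold the univariate solution at lam/2
            beta[j] = np.sign(rho) * max(abs(rho) - lam / 2, 0) / col_sq[j]
    return beta
```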
Shrinkage Methods (Lasso Regression)
Principal Components Regression
• Singular value decomposition (SVD) of X:

X = U D V^T

  • U is N × p and V is p × p; both are orthogonal
  • D is a p × p diagonal matrix
• Use linear combinations z_j = X v_j of X as new features
  • v_j is the principal component (column of V) corresponding to the jth largest element of D
  • the v_j are the directions of maximal sample variance
  • use only M < p features; [z_1 … z_M] replaces X

\hat{y}^{\,pcr} = \bar{y}\,\mathbf{1} + \sum_{m=1}^{M} \hat{\theta}_m z_m, \qquad \hat{\theta}_m = \frac{\langle z_m, y \rangle}{\langle z_m, z_m \rangle}
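A minimal sketch of principal components regression via the SVD, assuming NumPy (`pcr_fit` is a name chosen here):

```python
import numpy as np

def pcr_fit(X, y, M):
    """Principal components regression on centered X (N x p), keeping M components."""
    Xc = X - X.mean(axis=0)
    U, d, Vt = np.linalg.svd(Xc, full_matrices=False)   # Xc = U D V^T
    Z = Xc @ Vt[:M].T                          # z_m = X v_m, m = 1..M
    theta = (Z.T @ y) / np.sum(Z ** 2, axis=0) # <z_m, y> / <z_m, z_m>
    return y.mean(), Vt[:M].T @ theta          # intercept, beta in original coords
```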
Partial Least Squares
• Construct linear combinations of the inputs that incorporate y
• Finds directions with high variance and high correlation with the output
• The variance aspect tends to dominate, so partial least squares behaves much like principal components regression
4.4 Methods Using Derived Input Directions (PLS)
• Partial Least Squares
Discussion: a comparison of the selection and shrinkage methods
4.5 Discussion: a comparison of the selection and shrinkage methods
A Unifying View
• We can view all of the linear regression techniques under a common framework
• λ introduces bias; q indicates a prior distribution on β
  • λ = 0: least squares
  • λ > 0, q = 0: subset selection (counts the number of nonzero parameters)
  • λ > 0, q = 1: the lasso
  • λ > 0, q = 2: ridge regression

\hat{\beta} = \arg\min_{\beta} \left\{ \sum_{i=1}^{N} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Bigr)^2 + \lambda \sum_{j=1}^{p} |\beta_j|^q \right\}
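As an illustration, the unified criterion can be evaluated directly; this hypothetical helper just scores a candidate (β_0, β) for a given λ and q:

```python
import numpy as np

def penalized_rss(beta0, beta, X, y, lam, q):
    """Unified criterion: RSS + lam * sum_j |beta_j|^q.

    q = 0 counts nonzero coefficients (subset selection),
    q = 1 gives the lasso penalty, q = 2 the ridge penalty."""
    resid = y - beta0 - X @ beta
    penalty = np.count_nonzero(beta) if q == 0 else np.sum(np.abs(beta) ** q)
    return resid @ resid + lam * penalty
```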
Discussion: a comparison of the selection and shrinkage methods
• Family of shrinkage regression