The Elements of Statistical Learning
Ch.2: Overview of Supervised Learning
4/13/2017 坂間 毅
2
• Supervised Learning
• Predict outputs from inputs
• Other names for inputs
• Predictors
• Independent variables
• Features
• Other names for outputs
• Responses
• Dependent variables
2.1 Introduction
3
• Outputs
1. Quantitative variable
• Continuous values, e.g., atmospheric measurements
• Quantitative prediction = Regression
2. Qualitative variable
• Also called categorical or discrete variables
• Values from a finite set, e.g., iris species
• Qualitative prediction = Classification
• Types of inputs
1. Quantitative variable
2. Qualitative variable
3. Ordered categorical variable (e.g., small, mid, large)
※ Note: interval and ratio scales appear to be lumped together as quantitative variables?
2.2 Variable Types and Terminology
4
• Notation
• Input
• Vector: $X$
• Component of vector: $X_j$
• $i$-th observation: $x_i$ (lowercase)
• Matrix: $\mathbf{X}$ (bold)
• All the observations on the $j$-th variable: $\mathbf{x}_j$ (bold)
• Output
• Quantitative output: $Y$
• Prediction of $Y$: $\hat{Y}$
• Qualitative output: $G$
• Prediction of $G$: $\hat{G}$
2.2 Variable Types and Terminology (contd.)
5
• Linear Model
• Including the intercept (bias) in the coefficient vector (a constant 1 in $X$), $\hat{Y} = X^T \hat{\beta}$
• Most popular fitting method: least squares (see the sketch below)
• $\mathrm{RSS}(\beta) = (\mathbf{y} - \mathbf{X}\beta)^T (\mathbf{y} - \mathbf{X}\beta)$
(RSS: residual sum of squares)
• Differentiating RSS w.r.t. $\beta$ and setting it to 0:
• $\mathbf{X}^T (\mathbf{y} - \mathbf{X}\beta) = 0$
• If $\mathbf{X}^T \mathbf{X}$ is nonsingular, the inverse exists, and
• $\hat{\beta} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$
2.3.1 Linear Models and Least Squares
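As a concrete illustration (not from the book), a minimal NumPy sketch of the closed-form fit above, on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: N observations, p inputs, plus an intercept column of 1s.
N, p = 100, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, p))])
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=N)

# Closed-form least squares: beta_hat = (X^T X)^{-1} X^T y.
# np.linalg.solve is preferred over forming the inverse explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

y_hat = X @ beta_hat                    # fitted values
rss = float((y - y_hat) @ (y - y_hat))  # residual sum of squares
print(beta_hat, rss)
```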
6
• Linear Model (Classification)
• $\hat{G}$ = ORANGE if $\hat{Y} > 0.5$, BLUE if $\hat{Y} \le 0.5$
• Two classes are separated by the decision boundary
• $\{x : x^T \hat{\beta} = 0.5\}$
• Two cases for generating 2-class data
1. Each class drawn from a bivariate Gaussian with uncorrelated components and a different mean
⇒ a linear decision boundary is optimal (see Ch.4)
2. Each class drawn from a mixture of 10 low-variance Gaussians whose means are themselves Gaussian-distributed
⇒ a nonlinear decision boundary is optimal (this chapter's example)
2.3.1 Linear Models and Least Squares (contd.)
7
• k-Nearest Neighbor
• $\hat{Y}(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i$
where $N_k(x)$ is the set of the $k$ (Euclidean-)closest points to $x$ in the training set (see the sketch below)
• $k = 1$: Voronoi tessellation
• Notice
• Effective number of parameters of k-NN = $N/k$
• “we will see”
• RSS is useless for choosing $k$
• With $k = 1$ the training data are classified with zero error, so $k = 1$ would always attain the smallest RSS
2.3.2 Nearest-Neighbor Methods
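A minimal sketch of the k-NN average above (assuming NumPy; the data is synthetic):

```python
import numpy as np

def knn_regress(x0, X, y, k):
    """k-NN estimate at x0: average y over the k Euclidean-closest training points."""
    dist = np.linalg.norm(X - x0, axis=1)
    nearest = np.argsort(dist)[:k]  # indices of N_k(x0)
    return y[nearest].mean()

# Toy usage on synthetic 2-D data (made-up data for illustration).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 2))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=50)
print(knn_regress(np.zeros(2), X, y, k=5))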
8
• Today’s popular techniques are variants of the linear model or k-nearest neighbors (or both)
2.3.3 From Least Squares to Nearest Neighbors
                      Variance  Bias
Linear Model          low       high
k-Nearest Neighbors   high      low
9
• Theoretical Framework
• Joint distribution $\Pr(X, Y)$
• Squared error loss function $L(Y, f(X)) = (Y - f(X))^2$
• Expected (squared) prediction error
• $\mathrm{EPE}(f) = \mathrm{E}(Y - f(X))^2$
$= \int [y - f(x)]^2 \Pr(dx, dy)$
$= \int\!\!\int [y - f(x)]^2 \Pr(x, y)\, dy\, dx$
$= \int\!\!\int [y - f(x)]^2 \Pr(y \mid x) \Pr(x)\, dy\, dx$
by $\Pr(X, Y) = \Pr(Y \mid X) \Pr(X)$
$= \int \mathrm{E}_{Y|X}([Y - f(X)]^2 \mid X = x) \Pr(x)\, dx$
$= \mathrm{E}_X \mathrm{E}_{Y|X}([Y - f(X)]^2 \mid X)$
2.4 Statistical Decision Theory
10
• The minimizing $f$ is the regression function
• The best prediction of 𝑌 at any point 𝑋 = 𝑥 is the conditional mean,
when best is measured by average squared error.
• $f(x) = \mathrm{argmin}_c\, \mathrm{E}_{Y|X}([Y - c]^2 \mid X = x)$
⇒ $\frac{\partial}{\partial f}\, \mathrm{E}_{Y|X}([Y - f(X)]^2 \mid X = x) = 0$
⇒ $\frac{\partial}{\partial f} \int [y - f(x)]^2 \Pr(y \mid x)\, dy = 0$
⇒ $\int (-2y + 2f(x)) \Pr(y \mid x)\, dy = 0$
⇒ $2 f(x) \int \Pr(y \mid x)\, dy = 2 \int y \Pr(y \mid x)\, dy$
⇒ $f(x) = \mathrm{E}(Y \mid X = x)$
2.4 Statistical Decision Theory (contd.)
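A small simulation (my illustration, not the book's) showing that the constant minimizing average squared error over the conditional distribution is the conditional mean:

```python
import numpy as np

rng = np.random.default_rng(0)

# Y = X + noise, so E(Y | X = x) = x. Condition on a narrow slice around x0.
x = rng.uniform(0, 1, size=200_000)
y = x + rng.normal(scale=0.3, size=x.size)
x0 = 0.5
slice_y = y[np.abs(x - x0) < 0.01]  # approximates the distribution of Y | X = x0

# Average squared error of constant predictions c; the minimum sits near E(Y|X=x0) = 0.5.
cs = np.linspace(0.0, 1.0, 101)
errors = [np.mean((slice_y - c) ** 2) for c in cs]
print("best c:", cs[int(np.argmin(errors))], "conditional mean:", slice_y.mean())
```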
11
• How to estimate the conditional mean E(𝑌|𝑋 = 𝑥)
• k-Nearest Neighbor
• $\hat{f}(x) = \mathrm{Ave}(y_i \mid x_i \in N_k(x))$
• Two approximations: expectation by the sample average (Ave), and conditioning at a point by the neighborhood $N_k(x)$
• Under mild regularity conditions on $\Pr(X, Y)$,
• If $N, k \to \infty$ with $k/N \to 0$, then $\hat{f}(x) \to \mathrm{E}(Y \mid X = x)$
• However, the curse of dimensionality becomes severe
2.4 Statistical Decision Theory (contd.)
12
• How to estimate the conditional mean E(𝑌|𝑋 = 𝑥)
• Linear Regression
• $f(x) \approx x^T \beta$ (or $f(x) = x^T \beta$?)
• Then,
• $\frac{\partial \mathrm{EPE}}{\partial \beta} = \frac{\partial}{\partial \beta} \int (y - x^T \beta)^2 \Pr(x, y)\, dx\, dy$
$= \int 2 (y - x^T \beta)(-x) \Pr(x, y)\, dx\, dy$
$= -2 \int (y - x^T \beta)\, x \Pr(x, y)\, dx\, dy$
$= -2 \int (yx - x x^T \beta) \Pr(x, y)\, dx\, dy$
⇒ $\int yx \Pr(x, y)\, dx\, dy = \int x x^T \beta \Pr(x, y)\, dx\, dy$
⇒ $\beta = [\mathrm{E}(X X^T)]^{-1} \mathrm{E}(XY)$
• This is not conditioned on $X$.
• Based on the $L_1$ loss function,
• $\mathrm{EPE}(f) = \mathrm{E}|Y - f(X)|$
• $\hat{f}(x) = \mathrm{median}(Y \mid X = x)$
2.4 Statistical Decision Theory (contd.)
13
• In classification
• Zero-one loss function 𝐿 is represented by matrix 𝐋:
• $\mathbf{L} = \begin{pmatrix} 0 & \cdots & \delta_{1K} \\ \delta_{21} & \ddots & \delta_{2K} \\ \vdots & & \vdots \\ \delta_{K1} & \cdots & 0 \end{pmatrix}$
where $\delta_{ij} \in \{0, 1\}$, $K = \mathrm{card}(\mathcal{G})$
• The expected prediction error:
• $\mathrm{EPE}(\hat{G}) = \mathrm{E}[L(G, \hat{G}(X))]$
$= \mathrm{E}_X \sum_{k=1}^{K} L(\mathcal{G}_k, \hat{G}(X)) \Pr(\mathcal{G}_k \mid X)$
2.4 Statistical Decision Theory (contd.)
14
• In classification
• The minimizing $\hat{G}$ (at a point $X = x$) is the Bayes classifier (see the sketch below).
• $\hat{G}(x) = \mathrm{argmin}_{g \in \mathcal{G}} \sum_{k=1}^{K} L(\mathcal{G}_k, g) \Pr(\mathcal{G}_k \mid X = x)$
$= \mathrm{argmin}_{g \in \mathcal{G}}\, [1 - \Pr(g \mid X = x)]$
$= \mathcal{G}_k$ if $\Pr(\mathcal{G}_k \mid X = x) = \max_{g \in \mathcal{G}} \Pr(g \mid X = x)$
• This classifies to the most probable class, using the
conditional distribution Pr(𝐺|𝑋).
• Many approaches to modeling Pr 𝐺 𝑋 are discussed in Ch.4.
2.4 Statistical Decision Theory (contd.)
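The Bayes classifier needs the true conditional distribution, so it is only computable when the generating model is known; a toy sketch with two assumed 1-D Gaussian classes and equal priors (all numbers made up):

```python
import math

# Two 1-D classes with known Gaussian class-conditional densities and priors.
priors = {"ORANGE": 0.5, "BLUE": 0.5}
means = {"ORANGE": -1.0, "BLUE": 1.0}

def normal_pdf(x, mu, sigma=1.0):
    return math.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

def bayes_classify(x):
    """Pick the class maximizing Pr(g | X = x), proportional to Pr(x | g) Pr(g)."""
    return max(priors, key=lambda g: normal_pdf(x, means[g]) * priors[g])

print(bayes_classify(-0.3), bayes_classify(0.7))  # ORANGE BLUE
```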
15
• The curse of dimensionality
1. To capture 10% of the data in a hypercubical neighborhood in 10 dimensions, the expected edge length is $e_{10}(0.1) = 0.1^{1/10} \approx 0.8$, i.e., 80% of each input's range
2. Suppose a nearest-neighbor estimate at the origin, with $N$ data points uniformly distributed in the $p$-dimensional unit ball
• The median distance to the closest data point (evaluated in the sketch below):
• $d(p, N) = \left(1 - \left(\tfrac{1}{2}\right)^{1/N}\right)^{1/p}$
• If $N = 500$, $p = 10$, then $d(p, N) \approx 0.52$
• the closest point is, in the median, more than halfway to the boundary
2.5 Local Methods in High Dimensions
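A direct evaluation of the median-distance formula above (a small sketch; the values match the N = 500, p = 10 example):

```python
import numpy as np

def median_nn_distance(p, N):
    """Median distance from the origin to the closest of N points
    uniform in the p-dimensional unit ball: (1 - (1/2)**(1/N))**(1/p)."""
    return (1 - 0.5 ** (1 / N)) ** (1 / p)

for p in (1, 2, 10):
    print(p, round(median_nn_distance(p, 500), 3))  # p=10 gives ~0.52
```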
16
• The curse of dimensionality
3. The sampling density is proportional to $N^{1/p}$
• To match the density of $N_1 = 100$ points in one dimension, 10 dimensions require $N_{10} = 100^{10}$ points
• Sparseness in high dimensions
4. Examples $x_i$ drawn uniformly from $[-1, 1]^p$
• Assume $Y = f(X) = e^{-8\|X\|^2}$
• Using a 1-nearest-neighbor estimate at $x_0 = 0$
• $\hat{f}(x_0) < f(x_0)$ unless the nearest neighbor lies exactly at 0, so the estimate is biased downward
• As the dimension increases, the nearest neighbor gets further
from the target point (see the simulation below)
2.5 Local Methods in High Dimensions (contd.)
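A quick simulation (illustrative setup of my choosing) of how the nearest neighbor drifts away from the target point as the dimension grows:

```python
import numpy as np

rng = np.random.default_rng(0)

# Distance from the origin to the nearest of N=1000 points uniform in [-1, 1]^p,
# averaged over trials: the nearest neighbor drifts away as p grows.
for p in (1, 2, 5, 10):
    d = [np.linalg.norm(rng.uniform(-1, 1, size=(1000, p)), axis=1).min()
         for _ in range(200)]
    print(p, round(float(np.mean(d)), 3))
```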
17
• The curse of dimensionality
5. In the linear model $Y = X^T \beta + \varepsilon$, $\varepsilon \sim N(0, \sigma^2)$
• For an arbitrary test point $x_0$,
• $\mathrm{EPE}(x_0) = \mathrm{E}_{y_0|x_0} \mathrm{E}_T (y_0 - \hat{y}_0)^2$
$= \sigma^2 + \mathrm{E}_T [x_0^T (\mathbf{X}^T \mathbf{X})^{-1} x_0]\, \sigma^2 + 0^2$
• If $N$ is large, $T$ was selected at random, and $\mathrm{E}(X) = 0$,
$\mathrm{E}_{x_0} \mathrm{EPE}(x_0) \sim \sigma^2 (p/N) + \sigma^2$
• EPE grows only linearly in $p$, with slope $\sigma^2/N$; if $N$ is large or $\sigma^2$ is small, the growth is negligible (see the check below)
⇒ Under this restriction (a correct linear model) we can avoid the curse of dimensionality.
2.5 Local Methods in High Dimensions (contd.)
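A Monte Carlo sketch (synthetic setup, my parameter choices) checking that the expected EPE of least squares grows like $\sigma^2 (p/N) + \sigma^2$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Compare simulated average squared prediction error against sigma^2 * (p/N) + sigma^2.
N, sigma = 500, 1.0
for p in (5, 20, 100):
    errs = []
    for _ in range(100):
        beta = rng.normal(size=p)
        X = rng.normal(size=(N, p))
        y = X @ beta + rng.normal(scale=sigma, size=N)
        beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
        x0 = rng.normal(size=p)                  # test point from the same law
        y0 = x0 @ beta + rng.normal(scale=sigma)
        errs.append((y0 - x0 @ beta_hat) ** 2)
    print(p, round(float(np.mean(errs)), 3), round(sigma**2 * (p / N) + sigma**2, 3))
```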
18
• Additive model
• $Y = f(X) + \varepsilon$
• Deterministic part: $f(x) = \mathrm{E}(Y \mid X = x)$
• Anything non-deterministic goes into the random error $\varepsilon$
• $\mathrm{E}(\varepsilon) = 0$
• $\varepsilon$ is independent of $X$
• The additive error model cannot generally be used for classification
• The target function is $p(X) = \Pr(G \mid X)$, the conditional probability
2.6.1 A Statistical Model for the Joint Distribution Pr(𝑋, 𝑌)
19
• Learn $f(X)$ by example through a teacher
• The training set is a set of input-output pairs
• $\mathcal{T} = \{(x_i, y_i)\}$ for $i = 1, \ldots, N$
• Learning by example
1. Produce $\hat{f}(x_i)$
2. Compute the differences $y_i - \hat{f}(x_i)$
3. Modify $\hat{f}$ accordingly
※ This idea has been in use all along, so why is it introduced only here?
2.6.2 Supervised Learning
20
• A data point $(x_i, y_i)$ is viewed as a point in $(p + 1)$-dimensional Euclidean space
• Approximation parameters $\theta$
• Linear model
• Linear basis expansions: $f_\theta(x) = \sum_{k=1}^{K} h_k(x)\, \theta_k$
• Criterion for approximation
1. The residual sum-of-squares
• $\mathrm{RSS}(\theta) = \sum_{i=1}^{N} (y_i - f_\theta(x_i))^2$
• For the linear model, we get
a simple closed-form solution
2.6.3 Function Approximation
21
• Criterion for approximation
2. Maximum likelihood estimation
• $L(\theta) = \sum_{i=1}^{N} \log \Pr_\theta(y_i)$
• The principle of maximum likelihood:
• The most reasonable $\theta$ are those for which the probability of the observed sample is largest
• In classification, use cross-entropy with $\Pr(G = \mathcal{G}_k \mid X = x) = p_{k,\theta}(x)$ (see the sketch below)
• $L(\theta) = \sum_{i=1}^{N} \log p_{g_i,\theta}(x_i)$
2.6.3 Function Approximation (contd.)
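A minimal sketch of the log-likelihood above for a classifier's predicted probabilities; the array names and toy numbers are mine:

```python
import numpy as np

def log_likelihood(probs, labels):
    """Multinomial log-likelihood L(theta) = sum_i log p_{g_i, theta}(x_i).
    probs[i, k] = model's Pr(G = k | X = x_i); labels[i] = observed class index."""
    return float(np.sum(np.log(probs[np.arange(len(labels)), labels])))

# Toy usage: 3 samples, 2 classes (made-up probabilities for illustration).
probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])
labels = np.array([0, 1, 0])
print(log_likelihood(probs, labels))  # maximizing this = minimizing cross-entropy
```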
22
• Infinitely many functions fit the training data
• The training set $(x_i, y_i)$ is finite, so infinitely many $f$ fit it
• The constraint comes from considerations outside of the data
• The strength of the constraint (complexity) can be viewed as the
neighborhood size
• The constraint also comes from the metric used for the neighborhoods
• In particular, to overcome the curse of dimensionality, we need
non-isotropic neighborhoods
2.7.1 Difficulty of the Problem
23
• Variety of nonparametric regression techniques
• Add a roughness penalty (regularization) term to RSS (see the sketch below)
• $\mathrm{PRSS}(f; \lambda) = \mathrm{RSS}(f) + \lambda J(f)$
• The penalty functional $J$ can be used to impose special structure
• Additive models with smooth coordinate (feature) functions
• $f(X) = \sum_{j=1}^{p} f_j(X_j)$ with penalty $J(f) = \sum_{j=1}^{p} J(f_j)$
• Projection pursuit regression
• $\mathrm{PPR}(X) = \sum_{m=1}^{M} g_m(\alpha_m^T X)$
• For more on penalties, see Ch.5
• For the Bayesian approach, see Ch.8
2.8.1 Roughness Penalty and Bayesian methods
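PRSS is a general recipe; as arguably its simplest concrete instance, a sketch of ridge regression, where $f$ is linear and the penalty is $J(f) = \|\beta\|^2$ (my choice of example, not the book's):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize PRSS(beta; lam) = ||y - X beta||^2 + lam * ||beta||^2.
    Closed form: beta = (X^T X + lam I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, -1.0]) + rng.normal(scale=0.1, size=50)
for lam in (0.0, 1.0, 100.0):
    print(lam, np.round(ridge_fit(X, y, lam), 2))  # larger lam shrinks beta toward 0
```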
24
• Kernel methods specify the nature of the local neighborhood
• The local neighborhood is specified by a kernel function (see the sketch below)
• Gaussian kernel: $K_\lambda(x_0, x) = \frac{1}{\lambda} \exp\left(-\frac{\|x - x_0\|^2}{2\lambda}\right)$
• In general, a local regression estimate is $f_{\hat{\theta}}(x_0)$, where
• $\hat{\theta} = \mathrm{argmin}_\theta\, \mathrm{RSS}(f_\theta, x_0)$
$= \mathrm{argmin}_\theta \sum_{i=1}^{N} K_\lambda(x_0, x_i)\, (y_i - f_\theta(x_i))^2$
• For more on this, see Ch.6
2.8.2 Kernel Methods and Local Regression
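A sketch of the locally weighted criterion above with $f_\theta$ a constant, in which case the minimizer is simply the kernel-weighted average (the Nadaraya-Watson estimator); data and bandwidth are made up:

```python
import numpy as np

def gaussian_kernel(x0, x, lam):
    """K_lambda(x0, x) = (1/lam) * exp(-||x - x0||^2 / (2*lam))."""
    return np.exp(-np.sum((x - x0) ** 2, axis=-1) / (2 * lam)) / lam

def local_constant_fit(x0, X, y, lam):
    """The minimizer of sum_i K(x0, x_i) (y_i - c)^2 over constants c
    is the kernel-weighted average of the y_i."""
    w = gaussian_kernel(x0, X, lam)
    return float(w @ y / w.sum())

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(100, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=100)
print(local_constant_fit(np.array([0.0]), X, y, lam=0.1))  # near sin(0) = 0
```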
25
• This class includes a wide variety of methods
1. The model for $f$ is a linear expansion of basis functions $h_m(x)$ (see the sketch below)
• $f_\theta(x) = \sum_{m=1}^{M} \theta_m h_m(x)$
• For more, see Sec.5.2, Ch.9
2. Radial basis functions are symmetric $p$-dimensional kernels
• $f_\theta(x) = \sum_{m=1}^{M} K_{\lambda_m}(\mu_m, x)\, \theta_m$
• For more, see Sec.6.7
3. Feed-forward neural network (single layer)
• $f_\theta(x) = \sum_{m=1}^{M} \beta_m\, \sigma(\alpha_m^T x + b_m)$ where $\sigma$ is the sigmoid function
• For more, see Ch.11
• Dictionary methods choose basis functions adaptively
2.8.3 Basis Functions and Dictionary methods
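Since each expansion above is linear in $\theta$ once the basis is fixed, least squares applies directly; a sketch of a radial-basis-function fit with centers $\mu_m$ fixed on a grid (all parameter choices are mine):

```python
import numpy as np

rng = np.random.default_rng(0)

# Fit f_theta(x) = sum_m K_lam(mu_m, x) * theta_m by least squares.
X = rng.uniform(-2, 2, size=200)
y = np.sin(2 * X) + rng.normal(scale=0.1, size=200)

mus, lam = np.linspace(-2, 2, 10), 0.2
H = np.exp(-(X[:, None] - mus[None, :]) ** 2 / (2 * lam))  # N x M basis matrix
theta = np.linalg.lstsq(H, y, rcond=None)[0]               # linear in theta

x_new = 0.5
h_new = np.exp(-(x_new - mus) ** 2 / (2 * lam))
print(h_new @ theta, np.sin(2 * x_new))                    # fit vs. truth
```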
26
• Many models have a smoothing or complexity parameter
• We cannot choose it by minimizing the residual sum-of-squares on the training data
• Residuals would be zero and the model would overfit
• The expected prediction error at $x_0$ (test or generalization error), for k-NN (simulated in the sketch below):
• $\mathrm{EPE}_k(x_0) = \mathrm{E}[(Y - \hat{f}_k(x_0))^2 \mid X = x_0]$
$= \sigma^2 + \mathrm{Bias}^2(\hat{f}_k(x_0)) + \mathrm{Var}_T(\hat{f}_k(x_0))$
$= \sigma^2 + \left[f(x_0) - \frac{1}{k} \sum_{l=1}^{k} f(x_{(l)})\right]^2 + \frac{\sigma^2}{k}$
$= T_1 + T_2 + T_3$
• 𝑇1: irreducible error, beyond our control
• 𝑇2: (Squared) Bias term of mean squared error
• 𝑇2 increases with 𝑘
• 𝑇3: Variance term of mean squared error
• 𝑇3 decreases with 𝑘
2.9 Model Selection and the Bias-Variance Tradeoff
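A simulation sketch (synthetic $f$ and noise of my choosing) estimating $T_2$ and $T_3$ for k-NN at a single point, showing the bias-variance tradeoff in $k$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Estimate bias^2 and variance of the k-NN fit at x0 = 0 over many training sets.
def f(x):
    return np.sin(4 * x)

x0, sigma, N = 0.0, 0.3, 100
for k in (1, 5, 25, 75):
    fits = []
    for _ in range(500):
        X = rng.uniform(-1, 1, size=N)
        y = f(X) + rng.normal(scale=sigma, size=N)
        nearest = np.argsort(np.abs(X - x0))[:k]
        fits.append(y[nearest].mean())       # k-NN estimate f_hat_k(x0)
    fits = np.array(fits)
    bias2 = (fits.mean() - f(x0)) ** 2       # T2: grows with k
    var = fits.var()                         # T3: ~ sigma^2 / k, shrinks with k
    print(k, round(bias2, 4), round(var, 4))
```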
27
• Model Complexity
• If model complexity increases,
• (Squared) Bias Term 𝑇2 decreases
• Variance Term 𝑇3 increases
• There is a trade-off between Bias and Variance
• The training error is not a good estimate of test error
• For more, see Ch.7.
2.9 Model Selection and the Bias-Variance Tradeoff (contd.)