Logistic regression is a classification algorithm that uses a linear regression approach to model the probability of discrete outcomes. It assumes the log-odds of each class is a linear combination of the input features. While logistic regression expects linear relationships, non-linear transformations of the features allow it to model non-linear data. It outputs coefficients whose scale indicates the impact, and whose sign indicates the direction, of each feature's influence.
BigML, Inc - Topic Models
Logistic Regression

Potential Confusion:
• Classification implies a discrete objective. How can this be a "regression"?
• Why do we need another classification algorithm?
• …and more questions

Logistic Regression is a classification algorithm.
Regression

Key Take-Away: Regression is the process of "fitting" a function to the data.

• Linear Regression: β₀ + β₁·(INPUT) ≈ OBJECTIVE
• Quadratic Regression: β₀ + β₁·(INPUT) + β₂·(INPUT)² ≈ OBJECTIVE
• Decision Tree Regression: DT(INPUT) ≈ OBJECTIVE
• Problem:
  • What if we want to do a classification problem: T/F or 1/0?
  • What function can we fit to discrete data?
Logistic Function

f(x) = 1 / (1 + e⁻ˣ)

• Goal: as x → −∞, f(x) → 0, and as x → ∞, f(x) → 1
• Looks promising, but still not "discrete"
• What about the "green" in the middle?
• Let's change the problem…
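As a quick sketch of the behavior described above, the logistic function is a one-liner in Python:

```python
import math

def logistic(x):
    """Logistic (sigmoid) function: squashes any real x into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

print(logistic(0.0))    # 0.5, the midpoint
print(logistic(10.0))   # very close to 1
print(logistic(-10.0))  # very close to 0
```

The output is never exactly 0 or 1, which is why the slide notes it is "still not discrete": the result is a probability, which we can threshold to get a class.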
Logistic Regression

Clarification: LR is a classification algorithm … that uses a regression … to model the probability of the discrete objective.

Caveats:
• Assumes that the output is linearly related to the "predictors"
  • What? (hang in there…)
  • Sometimes we can "fix" this with feature engineering
• Question: how do we "fit" the logistic function to real data?
Logistic Regression

• Given training data consisting of inputs x and probabilities P
• Solve for β₀ and β₁ to fit the logistic function:

  P(x) = 1 / (1 + e^(−(β₀ + β₁x)))

  where β₀ is the "intercept" and β₁ is the "coefficient"
• How? The inverse of the logistic function is called the "logit":

  ln( P(x) / (1 − P(x)) ) = β₀ + β₁x

• In which case solving is now a linear regression
• But this is only one dimension, that is, one feature x…
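The logit trick can be demonstrated end-to-end in a few lines of Python. This is a sketch, not BigML's actual solver: the training probabilities here are synthesized from known coefficients (β₀ = −2, β₁ = 0.5, chosen arbitrarily) so we can check that applying the logit and running ordinary least squares recovers them:

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    # The inverse of the logistic function
    return math.log(p / (1.0 - p))

# Hypothetical training data: probabilities generated from beta0 = -2, beta1 = 0.5
xs = [0.0, 2.0, 4.0, 6.0, 8.0]
ps = [logistic(-2.0 + 0.5 * x) for x in xs]

# Applying the logit linearizes the problem: logit(P(x)) = beta0 + beta1 * x,
# so simple least-squares linear regression on (x, logit(p)) recovers the betas
ys = [logit(p) for p in ps]
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
beta1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
beta0 = mean_y - beta1 * mean_x
print(round(beta0, 6), round(beta1, 6))  # recovers -2.0 and 0.5
```

Real data gives observed classes (0/1), not clean probabilities, so production solvers use maximum likelihood rather than this direct inversion; the sketch only illustrates why the logit makes the problem linear.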
LR Parameters

1. Default Numeric: Replaces missing numeric values
2. Missing Numeric: Adds a field flagging missing numerics
3. Stats: Extended statistics, ex: p-value (runs slower)
4. Bias: Enables/disables the intercept term β₀
   • Don't disable this…
5. Regularization: Reduces over-fitting by penalizing large coefficients βⱼ
   • L1: prefers driving individual coefficients to zero
   • L2 (default): prefers shrinking all coefficients together
6. Strength "C": Higher values mean weaker regularization
7. EPS: The minimum error between steps to stop; larger values stop earlier but may reduce quality
8. Auto-scaling: Ensures that all features contribute equally
   • Don't change this unless you have a specific reason
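The effect of regularization strength can be seen with a toy gradient-descent fit. This is a minimal sketch, not BigML's solver: a single-feature logistic regression with an L2 penalty `lam` on β₁ (larger `lam` corresponds to a smaller "C"), fit on a tiny made-up dataset:

```python
import math

def fit_l2(xs, ys, lam, steps=5000, lr=0.1):
    """Gradient descent for one-feature logistic regression with
    an L2 penalty lam * b1**2 (the intercept b0 is not penalized)."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += (p - y) / n
            g1 += (p - y) * x / n
        b0 -= lr * g0
        b1 -= lr * (g1 + 2.0 * lam * b1)  # penalty pulls b1 toward 0
    return b0, b1

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [0, 0, 0, 1, 1]
_, b1_weak = fit_l2(xs, ys, lam=0.001)   # weak regularization
_, b1_strong = fit_l2(xs, ys, lam=1.0)   # strong regularization
print(b1_weak, b1_strong)  # the strongly regularized coefficient is smaller
```

This also illustrates why regularization reduces over-fitting: on this perfectly separable data the unpenalized coefficient would grow without bound, while the penalty keeps it finite.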
LR Questions

Questions:
• How do we handle multiple classes?
  • A binary class True/False only needs to solve for one: P(True) ≡ 1 − P(False)
• What about non-numeric inputs?
  • Text/Items fields
  • Categorical fields
LR - Multi Class

• Instead of a binary class, ex: [true, false], we have multi-class, ex: [red, green, blue, …]
• "k" classes: C = [c₁, c₂, ⋯, c_k]
• Solve a one-vs-rest LR for each class
• Result: coefficients 𝛃ⱼ for each class cⱼ
• Apply a combiner to ensure all probabilities add to 1

ln( P(c₁) / (1 − P(c₁)) ) = β₁,₀ + 𝛃₁·X
ln( P(c₂) / (1 − P(c₂)) ) = β₂,₀ + 𝛃₂·X
⋯
ln( P(c_k) / (1 − P(c_k)) ) = β_k,₀ + 𝛃_k·X
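A sketch of the one-vs-rest combiner in Python, assuming the per-class fits are already done. The coefficient pairs below are made up for illustration (single feature x; each pair stands for one class-vs-rest model):

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical one-vs-rest coefficients (beta_j0, beta_j1) for one feature x;
# in practice each pair comes from fitting class j against all the rest
coeffs = {"red": (0.2, 1.0), "green": (-0.5, 0.3), "blue": (0.1, -0.8)}

def predict_proba(x):
    # Each one-vs-rest model produces its own probability estimate...
    raw = {c: logistic(b0 + b1 * x) for c, (b0, b1) in coeffs.items()}
    # ...and the combiner renormalizes so the k probabilities sum to 1
    total = sum(raw.values())
    return {c: p / total for c, p in raw.items()}

probs = predict_proba(1.5)
print(probs)
```

Renormalization is needed because the k independent one-vs-rest estimates do not sum to 1 on their own.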
LR - Field Codings

• LR expects numeric values to perform the regression.
• How do we handle categorical values, or text?

One-hot encoding
• Only one feature is "hot" for each class
• This is the default

Class     color=red  color=blue  color=green  color=NULL
red       1          0           0            0
blue      0          1           0            0
green     0          0           1            0
MISSING   0          0           0            1
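The one-hot table above can be sketched as a small Python function (the `color=` column naming mirrors the table; a real implementation would build the column set from the training data):

```python
def one_hot(value, classes):
    """One column per known class plus one for missing; exactly one column is 'hot'."""
    row = {f"color={c}": 0 for c in classes}
    row["color=NULL"] = 0
    key = f"color={value}" if value in classes else "color=NULL"
    row[key] = 1
    return row

print(one_hot("red", ["red", "blue", "green"]))
print(one_hot(None, ["red", "blue", "green"]))  # missing falls into the NULL column
```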
LR - Field Codings

Dummy encoding
• Chooses a *reference class* (here, red)
• Requires one fewer degree of freedom

Class     color_1  color_2  color_3
*red*     0        0        0
blue      1        0        0
green     0        1        0
MISSING   0        0        1
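A sketch of dummy encoding matching the table, in Python. The reference class encodes as all zeros, which is why one fewer column is needed than in one-hot:

```python
def dummy_encode(value, reference, others):
    """Dummy coding: the reference class is the all-zeros row;
    every other class (including MISSING) gets one indicator column."""
    row = {f"color_{i}": 0 for i in range(1, len(others) + 1)}
    if value != reference and value in others:
        row[f"color_{others.index(value) + 1}"] = 1
    return row

# Matching the table: red is the reference class
others = ["blue", "green", "MISSING"]
print(dummy_encode("red", "red", others))   # all zeros
print(dummy_encode("blue", "red", others))  # color_1 = 1
```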
LR - Field Codings

Contrast encoding
• Field values must sum to zero
• Allows comparison between classes

Class     field    "influence"
red        0.5     positive
blue      -0.25    negative
green     -0.25    negative
MISSING    0       excluded
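The contrast codes from the table can be expressed as a simple mapping; the sum-to-zero constraint is what makes the coefficients interpretable as comparisons between classes:

```python
# Contrast codes from the table above: red is contrasted against blue and green,
# and the codes must sum to zero
contrast = {"red": 0.5, "blue": -0.25, "green": -0.25}

print(sum(contrast.values()))  # 0.0 -- the sum-to-zero constraint

def encode(value):
    # MISSING is excluded from the contrast, so it contributes 0
    return contrast.get(value, 0.0)

print(encode("red"), encode("MISSING"))
```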
LR - Field Codings

Which one to use?
• One-hot is the default
  • Use this unless you have a specific need
• Dummy
  • Use when there is a control group in mind, which becomes the reference class
• Contrast
  • Allows for testing specific hypotheses about relationships
  • Ex: customers give a "rating" of bad / ok / good

rating   Contrast Encoding
bad      -0.66
ok        0.33
good      0.33
Hypothesis: a good and an ok review have the same impact, but a bad review has a negative impact twice as great.

rating   Contrast Encoding
bad      -0.5
ok        0
good      0.5
Hypothesis: a good and a bad review have an equal but opposite impact, while an ok rating has no impact.
LR - Field Codings

Text / Items
• Text/Items field types are handled by creating a field for each text token/item and setting it to 1 or 0

Text                                    "hippo"  "safari"  "zebra"
"we saw hippos and zebras…"             1        0         1
"The best safari for seeing zebras"     0        1         1
"The Oregon coast is rainy in winter"   0        0         0
"Have you ever tried a hippo burger"    1        0         0
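A sketch of this token-presence encoding in Python. The prefix match below is a crude stand-in for the stemming a real text analyzer performs (so that "hippos" and "zebras" count for "hippo" and "zebra"), not BigML's actual tokenizer:

```python
def encode_text(text, vocab):
    """1 if any token in the text starts with the vocab term, else 0.
    Prefix matching is a rough stemming approximation for illustration."""
    tokens = text.lower().split()
    return {term: int(any(t.startswith(term) for t in tokens)) for term in vocab}

vocab = ["hippo", "safari", "zebra"]
print(encode_text("we saw hippos and zebras", vocab))
print(encode_text("The Oregon coast is rainy in winter", vocab))
```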
Curvilinear LR

• Logistic Regression expects a linear relationship between the features and the objective
  • Remember - it's a linear regression under the hood
• But non-linear relationships are actually pretty common in natural datasets, and they will impact model quality
• This can be addressed by adding non-linear transformations of the features
• Knowing which transformations to use requires:
  • domain knowledge
  • experimentation
  • or both
Curvilinear LR

Instead of
  β₀ + β₁x₁
we could add a feature
  β₀ + β₁x₁ + β₂x₂
where x₂ is a non-linear transformation of x₁ (for example, x₂ ≡ x₁²).
It is possible to add any higher-order terms or other functions to match the shape of the data.
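A sketch of this feature-engineering idea in Python, with made-up coefficients; the squared term is just one possible transformation:

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def score_linear(x1, b0, b1):
    # Original model: the log-odds depend linearly on x1
    return logistic(b0 + b1 * x1)

def score_curvilinear(x1, b0, b1, b2):
    # Engineered feature x2 = x1**2 lets the model bend
    x2 = x1 ** 2
    return logistic(b0 + b1 * x1 + b2 * x2)

# With b1 = 0 and b2 > 0 the curvilinear model is symmetric in x1 --
# a U-shaped relationship no purely linear score can express
print(score_curvilinear(2.0, -1.0, 0.0, 1.0))
print(score_curvilinear(-2.0, -1.0, 0.0, 1.0))
```

Note that the model is still linear in its coefficients, so the same fitting machinery applies; only the inputs were transformed.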
LR vs DT

Logistic Regression:
• Expects a "smooth" linear relationship with predictors
• LR is concerned with the probability of a discrete outcome
• Lots of parameters to get wrong: regularization, scaling, codings
• Slightly less prone to over-fitting
• Because it fits a shape, it might work better when less data is available

Decision Tree:
• Adapts well to ragged, non-linear relationships
• No concern: classification, regression, multi-class are all fine
• Virtually parameter free
• Slightly more prone to over-fitting
• Prefers surfaces parallel to parameter axes, but given enough data will discover any shape
Summary

• Logistic Regression is a classification algorithm that models the probabilities of each class
• How the algorithm works and why this is important
• It expects a linear relationship between the features and the objective, and feature engineering can fix non-linearity
• Categorical encodings
• LR outputs a set of coefficients; how to interpret them:
  • Scale relates to impact
  • Sign relates to direction of impact
• Guidelines for comparing to Decision Trees