1st edition | July 8-11, 2019
BigML, Inc #DutchMLSchool
Supervised Learning I
Introduction to Machine Learning, Models, Evaluations and Ensembles
Poul Petersen
CIO, BigML, Inc
Machine Learning Motivation
Imagine:
• You are looking to buy a house
• Recently found a house you like
• Is the asking price fair?
What Next?
Machine Learning Motivation
Why not ask an expert?
• Experts can be rare / expensive
• Hard to validate experience:
• Experience with similar properties?
• Do they consider all relevant variables?
• Knowledge of market up to date?
• Hard to validate answer:
• How many times expert right / wrong?
• Probably can’t explain decision in detail
• Humans are not good at intuitive statistics
Data vs Expert
Replace the expert with data?
• Intuition: square footage relates to price.
• Collect data from past sales
SQFT   SOLD     PREDICT
2424   360000   400262
1785   307500   320195
1003   185000   222211
4135   600000   614651
1676   328500   306538
1012   247000   223339
3352   420000   516541
2825   435350   450508
PRICE = 125.3*SQFT + 96535
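The PREDICT column can be reproduced with a few lines of Python. This is a minimal sketch that only evaluates the line from the slide; the function name `predict_price` is an illustrative choice, and the coefficients are copied from the slide rather than re-fitted.

```python
# Coefficients copied from the slide: PRICE = 125.3*SQFT + 96535
SLOPE = 125.3
INTERCEPT = 96535


def predict_price(sqft):
    """Return the price the fitted line predicts for a given square footage."""
    return SLOPE * sqft + INTERCEPT


# (SQFT, SOLD) pairs from the table of past sales
sales = [(2424, 360000), (1785, 307500), (1003, 185000), (4135, 600000),
         (1676, 328500), (1012, 247000), (3352, 420000), (2825, 435350)]

for sqft, sold in sales:
    predicted = predict_price(sqft)
    # A positive difference means the line predicts more than the sale price
    print(sqft, sold, round(predicted), round(predicted - sold))
```

Comparing each prediction with the actual sale price shows how far off a single-variable line can be, which is what motivates adding more data.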
Data vs Expert
Replace the expert scorecard
• Experts can be rare / expensive
• Hard to validate experience:
• Experience with similar properties?
• Do they consider all relevant variables?
• Knowledge of market up to date?
• Hard to validate answer:
• How many times expert right / wrong?
• Probably can’t explain decision in detail
• Humans are not good at intuitive statistics
More Data!
SQFT  BEDS  BATHS  ADDRESS             LOCATION           LOT SIZE  YEAR BUILT  PARKING SPOTS  LATITUDE    LONGITUDE    SOLD
2424  4     3      1522 NW Jonquil     Timberhill SE 2nd  5227      1991        2              44,594828   -123,269328  360000
1785  3     2      7360 NW Valley Vw   Country Estates    25700     1979        2              44,643876   -123,238189  307500
1003  2     1      2620 NW Chinaberry  Tamarack Village   4792      1978        2              44,593704   -123,295424  185000
4135  5     3,5    4748 NW Veronica    Suncrest           6098      2004        3              44,5929659  -123,306916  600000
1676  3     2      2842 NW Monterey    Corvallis          8712      1975        2              44,5945279  -123,291523  328500
1012  3     1      2320 NW Highland    Corvallis          9583      1959        2              44,591476   -123,262841  247000
3352  4     3      1205 NW Ridgewood   Ridgewood 2        60113     1975        2              44,579439   -123,333888  420000
2825  3            411 NW 16th         Wilkins Addition   4792      1938        1              44,570883   -123,272113  435350
Uhhhh……..
• Can we still fit a line to 10 variables? (well, yes)
• Will fitting a line give good results? (unlikely)
• What about those text fields and categorical values?
Models
Mythical ML Model?
• High representational power
• Fitting a line is an example of low representational power
• A deep neural network is an example of high representational power
• High Ease-of-use
• Easy to configure - relatively few parameters
• Easy to interpret - how are decisions made?
• Easy to put into production
• Ability to work with real-world data
• Mixed data types: numeric, categorical, text, etc
• Handle missing values
• Resilient to outliers
• There are actually hundreds of possible choices…
Decision Trees
Last Bill > $180 and Support Calls > 0
Remember This?
Decision Tree Demo #1
What Just Happened?
• We started with housing data as a CSV from Redfin
• We uploaded the CSV to create a Source
• Then we created a Dataset from the Source and reviewed the summary statistics
• With 1-click we built a Model which can predict home prices based on all the housing features
• We explored the Model and used it to make a Prediction
Why Decision Trees
• Works for classification or regression
Why Decision Trees
• Works for classification or regression
• Easy to understand: splits are features and values
• Lightweight and super fast at prediction time
DT Predictions
[Figure: the tree asks Question 1, then Question 2, and ends at a Prediction]
Why Decision Trees
• Works for classification or regression
• Easy to understand: splits are features and values
• Lightweight and super fast at prediction time
• Relatively parameter free
• Data can be messy
• Useless features are automatically ignored
• Works with un-normalized data
• Works with missing data at Training
Training with Missing
Reason Missing?
Loan Amount?
Why Decision Trees
• Works for classification or regression
• Easy to understand: splits are features and values
• Lightweight and super fast at prediction time
• Relatively parameter free
• Data can be messy
• Useless features are automatically ignored
• Works with un-normalized data
• Works with missing data at Training & Prediction
Predictions with Missing
[Figure: the value for Question 1 is missing, so the tree returns the Last Prediction it reached]
Predictions with Missing
[Figure: the value for Question 1 is missing, so the tree skips it, follows both branches (Question 2 and Question 3), and returns the Avg Prediction]
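The two strategies for predicting with missing values can be sketched on a tiny hand-built tree. Everything here is hypothetical: the tree structure, the feature names `last_bill` and `support_calls`, the numbers, and the choice to weight the average by training counts are illustrative assumptions, not the behavior of any particular library.

```python
# A tiny hand-built tree with made-up values. Internal nodes hold a question
# (feature > threshold) plus the prediction for instances that reached them;
# "count" is the number of training instances that went down that branch.
tree = {
    "feature": "last_bill", "threshold": 180, "prediction": 0.32,
    "no": {"prediction": 0.10, "count": 60},
    "yes": {
        "feature": "support_calls", "threshold": 0,
        "prediction": 0.65, "count": 40,
        "no": {"prediction": 0.20, "count": 10},
        "yes": {"prediction": 0.80, "count": 30},
    },
}


def is_leaf(node):
    return "feature" not in node


def predict_last(node, instance):
    """'Last prediction': stop at the first question that cannot be answered."""
    while not is_leaf(node):
        value = instance.get(node["feature"])
        if value is None:
            return node["prediction"]
        node = node["yes"] if value > node["threshold"] else node["no"]
    return node["prediction"]


def predict_skip(node, instance):
    """'Skip': follow both branches of an unanswerable question and average
    the results (weighted by training counts, which is an assumption here)."""
    if is_leaf(node):
        return node["prediction"]
    value = instance.get(node["feature"])
    if value is None:
        yes, no = node["yes"], node["no"]
        total = yes["count"] + no["count"]
        return (yes["count"] * predict_skip(yes, instance)
                + no["count"] * predict_skip(no, instance)) / total
    branch = node["yes"] if value > node["threshold"] else node["no"]
    return predict_skip(branch, instance)


complete = {"last_bill": 200, "support_calls": 2}
missing_bill = {"support_calls": 2}  # the first question cannot be answered
print(predict_last(complete, complete) if False else predict_last(tree, complete))
print(predict_last(tree, missing_bill), predict_skip(tree, missing_bill))
```

With a complete instance both functions reach the same leaf. When the first question cannot be answered, the "last prediction" strategy falls back to the root's prediction, while the "skip" strategy still uses the later question.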
Why Decision Trees
• Works for classification or regression
• Easy to understand: splits are features and values
• Lightweight and super fast at prediction time
• Relatively parameter free
• Data can be messy
• Useless features are automatically ignored
• Works with un-normalized data
• Works with missing data at Training & Prediction
• Resilient to outliers
• High representational power
• Works easily with mixed data types
Data Types
numeric: 1, 2.0, 3, -5.4
categorical: true / false, yes / no, giraffe / zebra / ape
date-time: 2013-09-25 10:02 (YYYY-MM-DD HH:MM:SS) expands to
  YEAR: 2013
  MONTH: September
  DAY-OF-MONTH: 25
  DAY-OF-WEEK: Wednesday (M-T-W-T-F-S-D)
  HOUR: 10
  MINUTE: 02
text: “Be not afraid of greatness: some are born great, some achieve greatness, and some have greatness thrust upon 'em.”
  “great” appears 2 times
  “afraid” appears 1 time
  “born” appears 1 time
items: bread, sugar, coffee, milk; ice cream, hot fudge
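The date-time expansion shown on this slide can be sketched with Python's standard library. The function name `expand_datetime` and the field names are illustrative assumptions; this is not a description of how any particular platform implements the expansion.

```python
from datetime import datetime


def expand_datetime(text, fmt="%Y-%m-%d %H:%M"):
    """Split one date-time string into separate fields a model can use."""
    moment = datetime.strptime(text, fmt)
    return {
        "year": moment.year,
        # %B and %A give names in the current locale (English by default)
        "month": moment.strftime("%B"),
        "day-of-month": moment.day,
        "day-of-week": moment.strftime("%A"),
        "hour": moment.hour,
        "minute": moment.minute,
    }


fields = expand_datetime("2013-09-25 10:02")
for name, value in fields.items():
    print(name, value)
```

Each derived field becomes its own numeric or categorical column, so a tree can split on, say, the day of the week without having to parse the raw timestamp.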
Why Not Decision Trees
• Slightly prone to over-fitting. (what is that again?)
Learning Problems (fit)
Under-fitting Over-fitting
• Model fits the training data too well and does not “generalize”
• Captures the noise or outliers of the data
• Change algorithm or filter outliers
Why Not Decision Trees
• Slightly prone to over-fitting
• But we’ll fix this with ensembles
• Splitting prefers decision boundaries that are parallel
to feature axes
Splits Parallel to Axis
But not Possible!
Ideal split…
Splits Parallel to Axis
Will “discover” diagonal edge eventually
Why Not Decision Trees
• Slightly prone to over-fitting
• But we’ll fix this with ensembles
• Splitting prefers decision boundaries that are parallel
to feature axes
• More data!
• Predictions outside training data can be problematic
Outlier Predictions
?
Why Not Decision Trees
• Slightly prone to over-fitting
• But we’ll fix this with ensembles
• Splitting prefers decision boundaries that are parallel
to feature axes
• More data!
• Predictions outside training data can be problematic
• We can catch this with model competence
• Can be sensitive to small changes in training data
Outlier Predictions
Why Not Decision Trees
• Slightly prone to over-fitting
• But we’ll fix this with ensembles
• Splitting prefers decision boundaries that are parallel
to feature axes
• More data!
• Predictions outside training data can be problematic
• We can catch this with model competence
• Can be sensitive to small changes in training data
• What other models can we try?
• And how will we know which one works best?
Evaluations
Easy Right?
INTL MIN  INTL CALLS  INTL CHARGE  CUST SERV CALLS  CHURN  PREDICT CHURN (Model Prediction)
8,7       4           2,35         1                False  False
11,2      5           3,02         0                False  True
12,7      6           3,43         4                True   True
9,1       5           2,46         0                False  False
11,2      2           3,02         1                False  False
12,3      5           3,32         3                False  False
13,1      6           3,54         4                False  False
5,4       9           1,46         4                True   False
13,8      4           3,73         1                False  False
Count up mistakes!
Mistakes can be Costly
FUN!
+ = DANGER!
Insight: Labeling a yield sign as a stop sign is not as bad as labeling a stop sign as a yield sign… Need better metrics!
Evaluation Metrics
• Imagine we have a model that can predict a person’s dominant
hand, that is for any individual it predicts left / right
• Define the positive class
• This selection is arbitrary
• It is the class you are interested in!
• The negative class is the “other” class (or others)
• For this example, we choose: left
Evaluation Metrics
• We choose the positive class: left
• True Positive (TP)
• We predicted left and the correct answer was left
• True Negative (TN)
• We predicted right and the correct answer was right
• False Positive (FP)
• We predicted left but the correct answer was right
• False Negative (FN)
• We predicted right but the correct answer was left
Evaluation Metrics
True Positive: Correctly predicted the positive class
True Negative: Correctly predicted the negative class
False Positive: Incorrectly predicted the positive class
False Negative: Incorrectly predicted the negative class
Remember…
Accuracy
Accuracy = (TP + TN) / Total
• “Percentage correct” - like an exam
• If Accuracy = 1 then no mistakes
• If Accuracy = 0 then all mistakes
• Intuitive but not always useful
• Watch out for unbalanced classes!
• Ex: 90% of people are right-handed and 10% are left
• A silly model which always predicts right handed is
90% accurate
Accuracy
Positive Class = Left, Negative Class = Right
Classified as Left Handed: TP = 0, FP = 0
Classified as Right Handed: TN = 7, FN = 3
Accuracy = (TP + TN) / Total = 7 / 10 = 70%
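The counts on this slide can be reproduced with a short sketch. The helper `confusion_counts` and the toy lists are illustrative assumptions: ten people, three of them left-handed, scored against a "silly" model that always predicts right.

```python
def confusion_counts(actual, predicted, positive):
    """Count TP, FP, TN, FN for one chosen positive class."""
    tp = fp = tn = fn = 0
    for truth, guess in zip(actual, predicted):
        if guess == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


actual = ["left"] * 3 + ["right"] * 7
predicted = ["right"] * 10  # the silly model never predicts "left"

tp, fp, tn, fn = confusion_counts(actual, predicted, positive="left")
accuracy = (tp + tn) / len(actual)
print(tp, fp, tn, fn, accuracy)
```

The model scores 70% accuracy without identifying a single left-hander, which is why accuracy alone can mislead on unbalanced classes.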
Precision
Precision = TP / (TP + FP)
• “accuracy” or “purity” of positive class
• How well you did separating the positive class from the
negative class
• If Precision = 1 then no FP.
• You may have missed some left handers, but of the
ones you identified, all are left handed. No mistakes.
• If Precision = 0 then no TP
• None of the left handers you identified are actually left
handed. All mistakes.
Precision
Positive Class = Left, Negative Class = Right
Classified as Left Handed: TP = 2, FP = 2
Classified as Right Handed: TN = 5, FN = 1
Precision = TP / (TP + FP) = 2 / 4 = 50%
Recall
Recall = TP / (TP + FN)
• percentage of positive class correctly identified
• A measure of how well you identified all of the positive
class examples
• If Recall = 1 then no FN → All left handers identified
• There may be FP, so precision could be <1
• If Recall = 0 then no TP → No left handers identified
Recall
Positive Class = Left, Negative Class = Right
Classified as Left Handed: TP = 2, FP = 2
Classified as Right Handed: TN = 5, FN = 1
Recall = TP / (TP + FN) = 2 / 3 ≈ 66.7%
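Precision and recall for the counts on these slides can be checked with a minimal sketch. The function names are illustrative, and returning 0.0 when a denominator is zero is an assumed convention, not something the slides specify.

```python
def precision(tp, fp):
    """Share of positive predictions that were actually positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0  # 0.0 on empty is an assumption


def recall(tp, fn):
    """Share of actual positives that the model identified."""
    return tp / (tp + fn) if (tp + fn) else 0.0  # 0.0 on empty is an assumption


# Counts from the left-handed example: TP = 2, FP = 2, TN = 5, FN = 1
tp, fp, tn, fn = 2, 2, 5, 1
print(precision(tp, fp))
print(recall(tp, fn))
```

Precision only looks at the instances classified as left-handed, while recall only looks at the people who really are left-handed, so the two can differ considerably for the same model.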
f-Measure
f-measure = 2 * Recall * Precision / (Recall + Precision)
• harmonic mean of Recall & Precision
• If f-measure = 1 then Recall == Precision == 1
• If Precision OR Recall is small then the f-measure is small
Phi Coefficient
phi = (TP*TN - FP*FN) / SQRT[(TP+FP)(TP+FN)(TN+FP)(TN+FN)]
• Returns a value between -1 and 1
• If -1 then predictions are opposite reality
• =0 no correlation between predictions and reality
• =1 then predictions are always correct
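Both formulas are easy to try out on the left-handed example (TP = 2, FP = 2, TN = 5, FN = 1). This is a minimal sketch; the function names are illustrative, and returning 0.0 for a zero denominator is an assumed convention.

```python
import math


def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * recall * precision / total if total else 0.0


def phi(tp, fp, tn, fn):
    """Phi coefficient: correlation between predictions and reality."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


tp, fp, tn, fn = 2, 2, 5, 1
p = tp / (tp + fp)
r = tp / (tp + fn)
print(f_measure(p, r))
print(phi(tp, fp, tn, fn))
```

The two boundary cases from the slide can be checked the same way: a model with no mistakes gives phi = 1, and a model whose predictions are always the opposite of reality gives phi = -1.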
Evaluations Demo #1
What Just Happened?
• Starting with the Diabetes Source, we created a Dataset and
then a Model.
• Using both the Model and the original Dataset, we created an
Evaluation.
• We reviewed the metrics provided by the Evaluation:
• Confusion Matrix
• Accuracy, Precision, Recall, f-measure and
phi
• This Model seemed to perform really, really well…
Question: Can we trust this model?
Evaluation Danger!
• Never evaluate with the training data!
• Many models are able to “memorize” the training data
• This will result in overly optimistic evaluations!
“Memorizing” Training Data
Training
plasma glucose  bmi   diabetes pedigree  age  diabetes
148             33,6  0,627              50   TRUE
85              26,6  0,351              31   FALSE
183             23,3  0,672              32   TRUE
89              28,1  0,167              21   FALSE
137             43,1  2,288              33   TRUE
116             25,6  0,201              30   FALSE
78              31    0,248              26   TRUE
115             35,3  0,134              29   FALSE
197             30,5  0,158              53   TRUE
Evaluating
plasma glucose  bmi   diabetes pedigree  age  diabetes
148             33,6  0,627              50   ?
85              26,6  0,351              31   ?
• Exactly the same values!
• Who needs a model?
• What we want to know is how the model performs with values never seen at training:
124             22    0,107              46   ?
Evaluation Danger!
• Never evaluate with the training data!
• Many models are able to “memorize” the training data
• This will result in overly optimistic evaluations!
• If you only have one Dataset, use a train/test split
Train / Test Split
Train
plasma glucose  bmi   diabetes pedigree  age  diabetes
148             33,6  0,627              50   TRUE
183             23,3  0,672              32   TRUE
89              28,1  0,167              21   FALSE
78              31    0,248              26   TRUE
115             35,3  0,134              29   FALSE
197             30,5  0,158              53   TRUE
Test
plasma glucose  bmi   diabetes pedigree  age  diabetes
85              26,6  0,351              31   FALSE
137             43,1  2,288              33   TRUE
116             25,6  0,201              30   FALSE
• These instances were never seen at training time.
• Better evaluation of how the model will perform with “new” data
Evaluation Danger!
• Never evaluate with the training data!
• Many models are able to “memorize” the training data
• This will result in overly optimistic evaluations!
• If you only have one Dataset, use a train/test split
• Even a train/test split may not be enough!
• Might get a “lucky” split
• Solution is to repeat several times (formally to cross validate)
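A random train/test split, and its repeated form, can be sketched with the standard library only. The function names, the 1/3 test fraction, and the fold scheme are illustrative assumptions; the rows here are just the numbers 0 to 8 standing in for nine instances.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=None):
    """Shuffle a copy of the rows and hold out a fraction for testing."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def k_fold(rows, k=3, seed=None):
    """Yield k (train, test) pairs so every row is tested exactly once."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    for i in range(k):
        test = shuffled[i::k]
        train = [row for j, row in enumerate(shuffled) if j % k != i]
        yield train, test


rows = list(range(9))  # stand-ins for nine instances
train, test = train_test_split(rows, test_fraction=1 / 3, seed=42)
folds = list(k_fold(rows, k=3, seed=42))
print(len(train), len(test), len(folds))
```

Evaluating once per fold and averaging the scores reduces the chance that a single "lucky" split makes the model look better than it is.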
Evaluations Demo #2
What Just Happened?
• Starting with the Diabetes Dataset we created a train/test split
• We built a Model using the train set and evaluated it with the
test set
• The scores were much worse than before, showing the danger
of evaluating with training data.
• Then we built several other models with different parameters
and used the evaluation comparison tool to see which
performed the best.
Question:
Couldn’t we search for the best Model
or parameters?
STAY
TUNED
Evaluation
• Never evaluate with the training data!
• Many models are able to “memorize” the training data
• This will result in overly optimistic evaluations!
• If you only have one Dataset, use a train/test split
• Even a train/test split may not be enough!
• Might get a “lucky” split
• Solution is to repeat several times (formally to cross validate)
• Don’t forget that accuracy can be misleading!
• Mostly useless with unbalanced classes (left/right?)
• Use weighting, operating points, other tricks…
Operating Points
• The default probability threshold is 50%
• Changing the threshold can change the outcome for a
specific class
Rate    Payment    …  Actual Outcome  Probability PAID  Threshold @ 50%  Threshold @ 60%  Threshold @ 90%
8,4 %   US$456     …  PAID            95 %              PAID             PAID             PAID
9,6 %   US$134     …  PAID            87 %              PAID             PAID             DEFAULT
18 %    US$937     …  DEFAULT         36 %              DEFAULT          DEFAULT          DEFAULT
21 %    US$35      …  PAID            88 %              PAID             PAID             DEFAULT
17,5 %  US$1.044   …  DEFAULT         55 %              PAID             DEFAULT          DEFAULT
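The effect of moving the threshold can be reproduced from the probabilities in the table. This is a minimal sketch; the function name `classify` is illustrative, and treating a probability exactly equal to the threshold as PAID is an assumption.

```python
def classify(probability_paid, threshold=0.5):
    """Predict PAID only when the model is at least `threshold` confident."""
    return "PAID" if probability_paid >= threshold else "DEFAULT"


# "Probability PAID" column from the table, as fractions
probabilities = [0.95, 0.87, 0.36, 0.88, 0.55]

for threshold in (0.5, 0.6, 0.9):
    outcomes = [classify(p, threshold) for p in probabilities]
    print(threshold, outcomes)
```

Raising the threshold trades PAID predictions for DEFAULT predictions, so the right operating point depends on which kind of mistake costs more.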
What about Regressions?
• No classes:
• Not possible to count mistakes: TP, FP, TN, FN
• Predicted values are numeric: error is the amount “off”
• actual 200, predict 180 = error 20
• Mean Absolute Error / Mean Squared Error
• Both measure the average size of the error
• Note: value of the error is “unbounded”.
• When comparing models, lower values are “better”
• R-Squared
• Measure of how much better the model is than always predicting the mean
• < 0 model is worse than the mean
• = 0 model is no better than the mean
• ➞ 1 model fits the data “perfectly”
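The three regression metrics can be computed by hand on a toy example. This is a minimal sketch; the function names and the three actual/predicted pairs are illustrative (the first pair is the "actual 200, predict 180" case from the slide), and the R-squared helper assumes the actual values are not all identical.

```python
def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_squared_error(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """1 minus (model's squared error / squared error of predicting the mean).
    Assumes the actual values are not all equal (otherwise division by zero)."""
    mean = sum(actual) / len(actual)
    model_error = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    mean_error = sum((a - mean) ** 2 for a in actual)
    return 1 - model_error / mean_error


actual = [200, 150, 100]
predicted = [180, 150, 110]

print(mean_absolute_error(actual, predicted))
print(mean_squared_error(actual, predicted))
print(r_squared(actual, predicted))
print(r_squared(actual, [150, 150, 150]))  # always predicting the mean
```

Squaring the errors makes the Mean Squared Error more sensitive to large misses than the Mean Absolute Error, while R-squared rescales the error against the always-predict-the-mean baseline.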
Evaluations Demo #3
What just happened?
• We split the Redfin data into training and test Datasets
• We created a Model and Evaluation
• We examined the Evaluation metrics
Wait - What about Time Series?
Dependent Data
Year  Pineapple Harvest (Tons)
1986  50,74
1987  22,03
1988  50,69
1989  40,38
1990  29,80
1991  9,90
1992  73,93
1993  22,95
1994  139,09
1995  115,17
1996  193,88
1997  175,31
1998  223,41
1999  295,03
2000  450,53
[Chart: Pineapple Harvest in Tons by Year, 1986 to 2000, showing Trend and Error]
Dependent Data
[Chart: Pineapple Harvest in Tons by Year, 1986 to 2000, with the values rearranged]
Year  Pineapple Harvest (Tons)
1986  139,09
1987  175,31
1988  9,91
1989  22,95
1990  450,53
1991  73,93
1992  40,38
1993  22,03
1994  295,03
1995  50,74
1996  29,8
1997  223,41
1998  115,17
1999  193,88
2000  50,69
Rearranging Disrupts Patterns
Random Train / Test Split
Train
plasma glucose  bmi   diabetes pedigree  age  diabetes
148             33,6  0,627              50   TRUE
183             23,3  0,672              32   TRUE
89              28,1  0,167              21   FALSE
78              31    0,248              26   TRUE
115             35,3  0,134              29   FALSE
197             30,5  0,158              53   TRUE
Test
plasma glucose  bmi   diabetes pedigree  age  diabetes
85              26,6  0,351              31   FALSE
137             43,1  2,288              33   TRUE
116             25,6  0,201              30   FALSE
Linear Train / Test Split
Train
Year  Pineapple Harvest
1986  50,74
1987  22,03
1988  50,69
1989  40,38
1990  29,80
1991  9,90
1992  73,93
1993  22,95
1994  139,09
1995  115,17
1996  193,88
Test
Year  Pineapple Harvest
1997  175,31
1998  223,41
1999  295,03
2000  450,53
Forecast, then COMPARE with Test
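For dependent data the split has to respect time order rather than shuffle. A minimal sketch, using the harvest table from the slides; the function name `linear_split` and its `n_test` parameter are illustrative assumptions.

```python
# (year, tons) pairs from the Pineapple Harvest table
harvest = [(1986, 50.74), (1987, 22.03), (1988, 50.69), (1989, 40.38),
           (1990, 29.80), (1991, 9.90), (1992, 73.93), (1993, 22.95),
           (1994, 139.09), (1995, 115.17), (1996, 193.88), (1997, 175.31),
           (1998, 223.41), (1999, 295.03), (2000, 450.53)]


def linear_split(rows, n_test):
    """Keep time order: train on the earliest rows, test on the latest."""
    ordered = sorted(rows)  # sorts by year, the first element of each pair
    return ordered[:-n_test], ordered[-n_test:]


train, test = linear_split(harvest, n_test=4)
print([year for year, _ in train])
print([year for year, _ in test])
```

Because every training year comes before every test year, the evaluation mimics real forecasting, where the model never sees the future.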
Ensembles
What is an Ensemble?
• Rather than build a single model…
• Combine the output of several typically “weaker” models into
a powerful ensemble…
• Q1: Why is this necessary?
• Q2: How do we build “weaker” models?
• Q3: How do we “combine” models?
No Model is Perfect
• A given ML algorithm may simply not be able to exactly
model the “real solution” of a particular dataset.
• Try to fit a line to a curve
• Even if the model is very capable, the “real solution” may be
elusive
• DT/NN can model any decision boundary with enough training data, but finding the optimal model is NP-hard
• Practical algorithms involve random processes and may
arrive at different, yet equally good, “solutions” depending
on the starting conditions, local optima, etc.
• If that wasn’t bad enough…
No Data is Perfect
• Not enough data!
• Always working with finite training data
• Therefore, every “model” is an approximation of the “real
solution” and there may be several good approximations.
• Anomalies / Outliers
• The model is trying to generalize from discrete training
data.
• Outliers can “skew” the model, by overfitting
• Mistakes in your data
• Does the model have to do everything for you?
• But really, there are always mistakes in your data
Ensemble Techniques
• Key Idea:
• By combining several good “models”, the combination
may be closer to the best possible “model”
• We want to ensure diversity. It’s not useful to use an
ensemble of 100 models that are all the same
• Training Data Tricks
• Build several models, each with only some of the data
• Introduce randomness directly into the algorithm
• Add training weights to “focus” the additional models on
the mistakes made
• Prediction Tricks
• Model the mistakes
• Model the output of several different algorithms
Simple Example - Fit a Line
Partition the data… then model each partition…
For predictions, use the model for the same partition
?
Decision Forest
[Diagram]
DATASET → SAMPLE 1 → MODEL 1 → PREDICTION 1
DATASET → SAMPLE 2 → MODEL 2 → PREDICTION 2
DATASET → SAMPLE 3 → MODEL 3 → PREDICTION 3
DATASET → SAMPLE 4 → MODEL 4 → PREDICTION 4
PREDICTION 1 to 4 → COMBINER → PREDICTION
Random Decision Forest
[Diagram]
DATASET → SAMPLE 1 → MODEL 1 → PREDICTION 1
DATASET → SAMPLE 2 → MODEL 2 → PREDICTION 2
DATASET → SAMPLE 3 → MODEL 3 → PREDICTION 3
DATASET → SAMPLE 4 → MODEL 4 → PREDICTION 4
PREDICTION 1 to 4 → COMBINER → PREDICTION
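The sample, model, combine pipeline in these diagrams can be sketched in a few lines. To keep the sketch short, each "model" here is a least-squares line on square footage rather than a decision tree, which is a deliberate simplification; the function names, the seed, and the use of a plain average as the combiner are also illustrative assumptions.

```python
import random


def fit_line(points):
    """Least-squares line through (x, y) points; returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    if var_x == 0:  # degenerate sample: every x is the same
        return 0.0, mean_y
    slope = sum((x - mean_x) * (y - mean_y) for x, y in points) / var_x
    return slope, mean_y - slope * mean_x


def bagged_models(points, n_models=4, seed=0):
    """Train one model per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        sample = [rng.choice(points) for _ in points]
        models.append(fit_line(sample))
    return models


def combine(models, x):
    """Combiner: average the individual predictions."""
    predictions = [slope * x + intercept for slope, intercept in models]
    return sum(predictions) / len(predictions)


# (SQFT, SOLD) pairs from the housing table
sales = [(2424, 360000), (1785, 307500), (1003, 185000), (4135, 600000),
         (1676, 328500), (1012, 247000), (3352, 420000), (2825, 435350)]

models = bagged_models(sales, n_models=4, seed=0)
print(combine(models, 2000))
```

A random decision forest adds a second source of diversity on top of the sampling shown here: each tree also considers only a random subset of the features when choosing a split.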
Boosting
ADDRESS            BEDS  BATHS  SQFT  LOT SIZE  YEAR BUILT  LATITUDE    LONGITUDE    LAST SALE PRICE  MODEL 1 PREDICTED SALE PRICE  ERROR
1522 NW Jonquil    4     3      2424  5227      1991        44,594828   -123,269328  360000           360750                        750
7360 NW Valley Vw  3     2      1785  25700     1979        44,643876   -123,238189  307500           306875                        -625
4748 NW Veronica   5     3,5    4135  6098      2004        44,5929659  -123,306916  600000           587500                        -12500
411 NW 16th        3            2825  4792      1938        44,570883   -123,272113  435350           435350                        0

ADDRESS            BEDS  BATHS  SQFT  LOT SIZE  YEAR BUILT  LATITUDE    LONGITUDE    ERROR  MODEL 2 PREDICTED ERROR
1522 NW Jonquil    4     3      2424  5227      1991        44,594828   -123,269328  750    750
7360 NW Valley Vw  3     2      1785  25700     1979        44,643876   -123,238189  625    625
4748 NW Veronica   5     3,5    4135  6098      2004        44,5929659  -123,306916  12500  12393,83333
411 NW 16th        3            2825  4792      1938        44,570883   -123,272113  0      6879,67857
Why stop at one iteration?
"Hey Model 1, what do you predict is the sale price of this home?"
"Hey Model 2, how much error do you predict Model 1 just made?"
Boosting
[Diagram]
Iteration 1: DATASET → MODEL 1 → PREDICTION 1
Iteration 2: DATASET 2 → MODEL 2 → PREDICTION 2
Iteration 3: DATASET 3 → MODEL 3 → PREDICTION 3
Iteration 4: DATASET 4 → MODEL 4 → PREDICTION 4
etc…
PREDICTION 1 to 4 → SUM → PREDICTION
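The iterate-and-sum idea can be sketched with very weak models. Each model below is a one-split "stump" on square footage, fitted to whatever error the previous models left behind; this is an illustrative simplification, not the boosting implementation of any particular library. The function names are hypothetical, and the residual is taken as actual minus predicted so that the stump outputs can simply be summed (the ERROR column on the slide appears to use the opposite sign).

```python
def fit_stump(xs, targets):
    """Pick the single threshold on x that minimizes squared error.
    Returns (threshold, mean_if_x_at_or_below, mean_if_x_above)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((t - left_mean) ** 2 for t in left)
                 + sum((t - right_mean) ** 2 for t in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:  # every x is the same: predict the overall mean
        mean = sum(targets) / len(targets)
        return xs[0], mean, mean
    return best[1:]


def stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def boost(xs, ys, iterations=4):
    """Each iteration models the error left over by the previous ones."""
    models = []
    residuals = list(ys)
    for _ in range(iterations):
        stump = fit_stump(xs, residuals)
        models.append(stump)
        residuals = [r - stump_predict(stump, x)
                     for x, r in zip(xs, residuals)]
    return models


def boosted_predict(models, x):
    """Final prediction is the SUM of every model's output."""
    return sum(stump_predict(model, x) for model in models)


# SQFT and LAST SALE PRICE for the four homes in the boosting table
sqft = [2424, 1785, 4135, 2825]
price = [360000, 307500, 600000, 435350]

models = boost(sqft, price, iterations=4)
for x, y in zip(sqft, price):
    print(x, y, round(boosted_predict(models, x)))
```

Because every iteration depends on the errors of the ones before it, the models have to be built one after another, which is why boosting cannot be parallelized the way a decision forest can.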
Ensembles Demo #1
Which Ensemble Method
• The one that works best!
• Ok, but seriously. Did you evaluate?
• For "large" / "complex" datasets
• Use DF/RDF with deeper node threshold
• Even better, use Boosting with more iterations
• For "noisy" data
• Boosting may overfit
• RDF preferred
• For "wide" data
• Randomizing features (RDF) will be quicker
• For "easy" data
• A single model may be fine
• Bonus: also has the best interpretability!
• For classification with "large" number of classes
• Boosting will be slower
• For "general" data
• DF/RDF likely better than a single model or Boosting.
• Boosting will be slower since the models are processed serially
Ensembles Demo #2
Summary
• Models have shortcomings: ability to fit, NP-hard, etc
• Data has shortcomings: not enough, outliers, mistakes, etc
• Ensemble Techniques can improve on single models
• Sampling: partitioning, Decision Tree bagging
• Adding Randomness: RDF
• Modeling the Error: Boosting
• Modeling the Models: Stacking
• Guidelines for knowing which one might work best in a given
situation
Co-organized by: Sponsor:
Business Partners:

Weitere ähnliche Inhalte

Was ist angesagt?

DutchMLSchool. Your first BigML Project
DutchMLSchool. Your first BigML ProjectDutchMLSchool. Your first BigML Project
DutchMLSchool. Your first BigML ProjectBigML, Inc
 
DutchMLSchool. Opening Remarks
DutchMLSchool. Opening RemarksDutchMLSchool. Opening Remarks
DutchMLSchool. Opening RemarksBigML, Inc
 
DutchMLSchool. ML for Energy Trading and Automotive Sector
DutchMLSchool. ML for Energy Trading and Automotive SectorDutchMLSchool. ML for Energy Trading and Automotive Sector
DutchMLSchool. ML for Energy Trading and Automotive SectorBigML, Inc
 
Square's Machine Learning Infrastructure and Applications - Rong Yan
Square's Machine Learning Infrastructure and Applications - Rong YanSquare's Machine Learning Infrastructure and Applications - Rong Yan
Square's Machine Learning Infrastructure and Applications - Rong YanHakka Labs
 
MLSEV Virtual. Supervised vs Unsupervised
MLSEV Virtual. Supervised vs UnsupervisedMLSEV Virtual. Supervised vs Unsupervised
MLSEV Virtual. Supervised vs UnsupervisedBigML, Inc
 
Module 4: Model Selection and Evaluation
Module 4: Model Selection and EvaluationModule 4: Model Selection and Evaluation
Module 4: Model Selection and EvaluationSara Hooker
 
H2O World - Top 10 Data Science Pitfalls - Mark Landry
H2O World - Top 10 Data Science Pitfalls - Mark LandryH2O World - Top 10 Data Science Pitfalls - Mark Landry
H2O World - Top 10 Data Science Pitfalls - Mark LandrySri Ambati
 
Machine Learning: Understanding the Invisible Force Changing Our World
Machine Learning: Understanding the Invisible Force Changing Our WorldMachine Learning: Understanding the Invisible Force Changing Our World
Machine Learning: Understanding the Invisible Force Changing Our WorldKen Tabor
 
Module 1.2 data preparation
Module 1.2  data preparationModule 1.2  data preparation
Module 1.2 data preparationSara Hooker
 
How to Interview a Data Scientist
How to Interview a Data ScientistHow to Interview a Data Scientist
How to Interview a Data ScientistDaniel Tunkelang
 
Module 3: Linear Regression
Module 3:  Linear RegressionModule 3:  Linear Regression
Module 3: Linear RegressionSara Hooker
 
Data Science: A Mindset for Productivity
Data Science: A Mindset for ProductivityData Science: A Mindset for Productivity
Data Science: A Mindset for ProductivityDaniel Tunkelang
 
Machine learning the high interest credit card of technical debt [PWL]
Machine learning the high interest credit card of technical debt [PWL]Machine learning the high interest credit card of technical debt [PWL]
Machine learning the high interest credit card of technical debt [PWL]Jenia Gorokhovsky
 
[Webinar] How Big Data and Machine Learning Are Transforming ITSM
[Webinar] How Big Data and Machine Learning Are Transforming ITSM[Webinar] How Big Data and Machine Learning Are Transforming ITSM
[Webinar] How Big Data and Machine Learning Are Transforming ITSMSunView Software, Inc.
 
Module 7: Unsupervised Learning
Module 7:  Unsupervised LearningModule 7:  Unsupervised Learning
Module 7: Unsupervised LearningSara Hooker
 
Starting data science with kaggle.com
Starting data science with kaggle.comStarting data science with kaggle.com
Starting data science with kaggle.comNathaniel Shimoni
 
Learning to Learn Model Behavior ( Capital One: data intelligence conference )
Learning to Learn Model Behavior ( Capital One: data intelligence conference )Learning to Learn Model Behavior ( Capital One: data intelligence conference )
Learning to Learn Model Behavior ( Capital One: data intelligence conference )Pramit Choudhary
 
Mauritius Big Data and Machine Learning JEDI workshop
Mauritius Big Data and Machine Learning JEDI workshopMauritius Big Data and Machine Learning JEDI workshop
Mauritius Big Data and Machine Learning JEDI workshopCosmoAIMS Bassett
 
Data Workflows for Machine Learning - SF Bay Area ML
Data Workflows for Machine Learning - SF Bay Area MLData Workflows for Machine Learning - SF Bay Area ML
Data Workflows for Machine Learning - SF Bay Area MLPaco Nathan
 

Was ist angesagt? (20)

DutchMLSchool. Your first BigML Project
DutchMLSchool. Your first BigML ProjectDutchMLSchool. Your first BigML Project
DutchMLSchool. Your first BigML Project
 
DutchMLSchool. Opening Remarks
DutchMLSchool. Opening RemarksDutchMLSchool. Opening Remarks
DutchMLSchool. Opening Remarks
 
DutchMLSchool. ML for Energy Trading and Automotive Sector
DutchMLSchool. ML for Energy Trading and Automotive SectorDutchMLSchool. ML for Energy Trading and Automotive Sector
DutchMLSchool. ML for Energy Trading and Automotive Sector
 
Square's Machine Learning Infrastructure and Applications - Rong Yan
Square's Machine Learning Infrastructure and Applications - Rong YanSquare's Machine Learning Infrastructure and Applications - Rong Yan
Square's Machine Learning Infrastructure and Applications - Rong Yan
 
Managing machine learning
Managing machine learningManaging machine learning
Managing machine learning
 
MLSEV Virtual. Supervised vs Unsupervised
MLSEV Virtual. Supervised vs UnsupervisedMLSEV Virtual. Supervised vs Unsupervised
MLSEV Virtual. Supervised vs Unsupervised
 
Module 4: Model Selection and Evaluation
Module 4: Model Selection and EvaluationModule 4: Model Selection and Evaluation
Module 4: Model Selection and Evaluation
 
H2O World - Top 10 Data Science Pitfalls - Mark Landry
H2O World - Top 10 Data Science Pitfalls - Mark LandryH2O World - Top 10 Data Science Pitfalls - Mark Landry
H2O World - Top 10 Data Science Pitfalls - Mark Landry
 
Machine Learning: Understanding the Invisible Force Changing Our World
Machine Learning: Understanding the Invisible Force Changing Our WorldMachine Learning: Understanding the Invisible Force Changing Our World
Machine Learning: Understanding the Invisible Force Changing Our World
 
Module 1.2 data preparation
Module 1.2  data preparationModule 1.2  data preparation
Module 1.2 data preparation
 
How to Interview a Data Scientist
How to Interview a Data ScientistHow to Interview a Data Scientist
How to Interview a Data Scientist
 
Module 3: Linear Regression
Module 3:  Linear RegressionModule 3:  Linear Regression
Module 3: Linear Regression
 
Data Science: A Mindset for Productivity
Data Science: A Mindset for ProductivityData Science: A Mindset for Productivity
Data Science: A Mindset for Productivity
 
Machine learning the high interest credit card of technical debt [PWL]
Machine learning the high interest credit card of technical debt [PWL]Machine learning the high interest credit card of technical debt [PWL]
Machine learning the high interest credit card of technical debt [PWL]
 
[Webinar] How Big Data and Machine Learning Are Transforming ITSM
[Webinar] How Big Data and Machine Learning Are Transforming ITSM[Webinar] How Big Data and Machine Learning Are Transforming ITSM
[Webinar] How Big Data and Machine Learning Are Transforming ITSM
 
Module 7: Unsupervised Learning
Module 7:  Unsupervised LearningModule 7:  Unsupervised Learning
Module 7: Unsupervised Learning
 
Starting data science with kaggle.com
Starting data science with kaggle.comStarting data science with kaggle.com
Starting data science with kaggle.com
 
Learning to Learn Model Behavior ( Capital One: data intelligence conference )
Learning to Learn Model Behavior ( Capital One: data intelligence conference )Learning to Learn Model Behavior ( Capital One: data intelligence conference )
Learning to Learn Model Behavior ( Capital One: data intelligence conference )
 
Mauritius Big Data and Machine Learning JEDI workshop
Mauritius Big Data and Machine Learning JEDI workshopMauritius Big Data and Machine Learning JEDI workshop
Mauritius Big Data and Machine Learning JEDI workshop
 
Data Workflows for Machine Learning - SF Bay Area ML
Data Workflows for Machine Learning - SF Bay Area MLData Workflows for Machine Learning - SF Bay Area ML
Data Workflows for Machine Learning - SF Bay Area ML
 

DutchMLSchool. Models, Evaluations, and Ensembles

  • 1. 1st edition | July 8-11, 2019
  • 2. BigML, Inc #DutchMLSchool. Supervised Learning I: Introduction to Machine Learning, Models, Evaluations and Ensembles. Poul Petersen, CIO, BigML, Inc
  • 3. Machine Learning Motivation. Imagine:
      • You are looking to buy a house
      • Recently found a house you like
      • Is the asking price fair?
    What next?
  • 4. Machine Learning Motivation: Why not ask an expert?
      • Experts can be rare / expensive
      • Hard to validate experience:
          • Experience with similar properties?
          • Do they consider all relevant variables?
          • Is their knowledge of the market up to date?
      • Hard to validate answer:
          • How many times has the expert been right / wrong?
          • Probably can’t explain the decision in detail
      • Humans are not good at intuitive statistics
  • 5. Data vs Expert: Replace the expert with data?
      • Intuition: square footage relates to price.
      • Collect data from past sales

      SQFT    SOLD      PREDICT
      2424    360000    400262
      1785    307500    320195
      1003    185000    222211
      4135    600000    614651
      1676    328500    306538
      1012    247000    223339
      3352    420000    516541
      2825    435350    450508

      PRICE = 125.3*SQFT + 96535
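The PREDICT column on slide 5 can be reproduced by applying the slide's formula to each SQFT value. The following is a minimal sketch; the function and variable names are illustrative and do not come from the slides, and the error summary is an addition for comparison only.

```python
# Past sales from slide 5: (square feet, sold price, prediction shown on the slide).
SALES = [
    (2424, 360000, 400262),
    (1785, 307500, 320195),
    (1003, 185000, 222211),
    (4135, 600000, 614651),
    (1676, 328500, 306538),
    (1012, 247000, 223339),
    (3352, 420000, 516541),
    (2825, 435350, 450508),
]

SLOPE = 125.3      # dollars per square foot, taken from the slide's formula
INTERCEPT = 96535  # base price in dollars, taken from the slide's formula


def predict_price(sqft):
    """Apply PRICE = 125.3*SQFT + 96535."""
    return SLOPE * sqft + INTERCEPT


def mean_absolute_error(rows):
    """Average of |sold - predicted| over all rows."""
    return sum(abs(sold - predict_price(sqft)) for sqft, sold, _ in rows) / len(rows)


if __name__ == "__main__":
    for sqft, sold, _ in SALES:
        print(f"{sqft:>5} sqft  sold {sold:>7}  predicted {predict_price(sqft):>9.0f}")
    print(f"mean absolute error: {mean_absolute_error(SALES):.0f}")
```

Comparing the predicted and sold prices row by row is one simple way to judge how far a one-variable line can be trusted.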
  • 6. Data vs Expert: Replace the expert scorecard
      • Experts can be rare / expensive
      • Hard to validate experience:
          • Experience with similar properties?
          • Do they consider all relevant variables?
          • Knowledge of market up to date?
      • Hard to validate answer:
          • How many times expert right / wrong?
          • Probably can’t explain decision in detail
      • Humans are not good at intuitive statistics
  • 7. Data vs Expert: Replace the expert with data
      • Intuition: square footage relates to price.
      • Collect data from past sales

      SQFT    SOLD
      2424    360000
      1785    307500
      1003    185000
      4135    600000
      1676    328500
      1012    247000
      3352    420000
      2825    435350

      PRICE = 125.3*SQFT + 96535
  • 8. More Data!

      SQFT  BEDS  BATHS  ADDRESS             LOCATION           LOT SIZE  YEAR BUILT  PARKING SPOTS  LATITUDE    LONGITUDE    SOLD
      2424  4     3      1522 NW Jonquil     Timberhill SE 2nd  5227      1991        2              44.594828   -123.269328  360000
      1785  3     2      7360 NW Valley Vw   Country Estates    25700     1979        2              44.643876   -123.238189  307500
      1003  2     1      2620 NW Chinaberry  Tamarack Village   4792      1978        2              44.593704   -123.295424  185000
      4135  5     3.5    4748 NW Veronica    Suncrest           6098      2004        3              44.5929659  -123.306916  600000
      1676  3     2      2842 NW Monterey    Corvallis          8712      1975        2              44.5945279  -123.291523  328500
      1012  3     1      2320 NW Highland    Corvallis          9583      1959        2              44.591476   -123.262841  247000
      3352  4     3      1205 NW Ridgewood   Ridgewood 2        60113     1975        2              44.579439   -123.333888  420000
      2825  3            411 NW 16th         Wilkins Addition   4792      1938        1              44.570883   -123.272113  435350

    Uhhhh……..
      • Can we still fit a line to 10 variables? (well, yes)
      • Will fitting a line give good results? (unlikely)
      • What about those text fields and categorical values?
  • 10. Mythical ML Model?
      • High representational power
          • Fitting a line is an example of low representational power
          • Deep neural networks are an example of high representational power
      • High ease-of-use
          • Easy to configure - relatively few parameters
          • Easy to interpret - how are decisions made?
          • Easy to put into production
      • Ability to work with real-world data
          • Mixed data types: numeric, categorical, text, etc.
          • Handle missing values
          • Resilient to outliers
      • There are actually hundreds of possible choices…
  • 11. Decision Trees. Remember this? Last Bill > $180 and Support Calls > 0
  • 13. What Just Happened?
      • We started with Housing data as a CSV from Redfin
      • We uploaded the CSV to create a Source
      • Then we created a Dataset from the Source and reviewed the summary statistics
      • With 1-click we built a Model that can predict home prices based on all the housing features
      • We explored the Model and used it to make a Prediction
  • 14. Why Decision Trees
      • Works for classification or regression
  • 15. Why Decision Trees
      • Works for classification or regression
      • Easy to understand: splits are features and values
      • Lightweight and super fast at prediction time
  • 16. DT Predictions (figure: a path through the tree from Question 1 to Question 2 to the Prediction)
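Slide 16 shows a prediction as a walk from one question to the next until a leaf is reached. A minimal sketch of that walk follows. The tree is hypothetical: the two questions echo slide 11 (Last Bill > $180, Support Calls > 0), while the leaf labels and the dictionary layout are assumptions made for illustration, not BigML's model format.

```python
# Hypothetical tree: an internal node asks "feature > threshold?";
# a leaf only holds a prediction. Leaf labels are assumed for illustration.
TREE = {
    "feature": "last_bill", "threshold": 180,
    "no": {"prediction": "stays"},
    "yes": {
        "feature": "support_calls", "threshold": 0,
        "no": {"prediction": "stays"},
        "yes": {"prediction": "leaves"},
    },
}


def predict(node, row):
    """Answer one question per level until a leaf is reached.

    Returns the leaf's prediction and the list of questions asked on the way.
    """
    path = []
    while "prediction" not in node:
        answer = row[node["feature"]] > node["threshold"]
        path.append(f"{node['feature']} > {node['threshold']}? {'yes' if answer else 'no'}")
        node = node["yes"] if answer else node["no"]
    return node["prediction"], path


if __name__ == "__main__":
    label, questions = predict(TREE, {"last_bill": 200, "support_calls": 2})
    print(label)
    for question in questions:
        print("  ", question)
```

The returned path is what makes the model easy to explain: each prediction comes with the exact questions that led to it.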
  • 17. Why Decision Trees
      • Works for classification or regression
      • Easy to understand: splits are features and values
      • Lightweight and super fast at prediction time
      • Relatively parameter free
      • Data can be messy
          • Useless features are automatically ignored
          • Works with un-normalized data
          • Works with missing data at Training
  • 18. Training with Missing (figure labels: Reason Missing? Loan Amount?)
  • 19. Why Decision Trees
      • Works for classification or regression
      • Easy to understand: splits are features and values
      • Lightweight and super fast at prediction time
      • Relatively parameter free
      • Data can be messy
          • Useless features are automatically ignored
          • Works with un-normalized data
          • Works with missing data at Training & Prediction
  • 20. Predictions with Missing (figure labels: Missing? Question 1, Last Prediction)
  • 21. Predictions with Missing (figure labels: Missing? Question 1, Skip, Question 2, Question 3, Avg Prediction)
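Slides 20 and 21 name two ways to predict when a value needed by a question is missing. One reading, sketched below, is: "last prediction" stops at the node whose question cannot be answered and returns that node's own value, while "skip and average" follows both branches and averages the results. The tree, its values, and the weighting of the average by training-instance counts are all assumptions for illustration; the slides do not specify them.

```python
# Hypothetical regression tree. Every node stores the mean target ("value")
# and the number of training rows ("count") that reached it.
TREE = {
    "value": 300000, "count": 10,
    "feature": "sqft", "threshold": 2000,
    "no": {"value": 240000, "count": 6},
    "yes": {
        "value": 390000, "count": 4,
        "feature": "beds", "threshold": 3,
        "no": {"value": 360000, "count": 2},
        "yes": {"value": 420000, "count": 2},
    },
}


def predict_last(node, row):
    """Stop at the first question whose feature is missing; return that node's value."""
    while "feature" in node:
        value = row.get(node["feature"])
        if value is None:
            break
        node = node["yes"] if value > node["threshold"] else node["no"]
    return node["value"]


def predict_average(node, row):
    """Skip unanswerable questions by exploring both branches.

    Returns (prediction, count); branch results are averaged weighted by count.
    """
    if "feature" not in node:
        return node["value"], node["count"]
    value = row.get(node["feature"])
    if value is None:
        branches = [predict_average(node[side], row) for side in ("yes", "no")]
        total = sum(count for _, count in branches)
        return sum(pred * count for pred, count in branches) / total, total
    return predict_average(node["yes"] if value > node["threshold"] else node["no"], row)


if __name__ == "__main__":
    row = {"sqft": None, "beds": 4}  # square footage is missing
    print(predict_last(TREE, row))
    print(predict_average(TREE, row)[0])
```

The second strategy still uses the answers that are available further down the tree, which is why the two functions can disagree on the same row.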
  • 22. BigML, Inc #DutchMLSchool Why Decision Trees 22 • Works for classification or regression • Easy to understand: splits are features and values • Lightweight and super fast at prediction time • Relatively parameter free • Data can be messy • Useless features are automatically ignored • Works with un-normalized data • Works with missing data at Training & Prediction • Resilient to outliers • High representational power • Works easily with mixed data types
  • 23. BigML, Inc #DutchMLSchool Data Types 23 • numeric: 1, 2.0, 3, -5.4 • categorical: true / false, yes / no, giraffe / zebra / ape, A / B / C • date-time: 2013-09-25 10:02 expands to YEAR (2013), MONTH (September), DAY-OF-MONTH (25), DAY-OF-WEEK (Wednesday), HOUR (10), MINUTE (02) • text: “Be not afraid of greatness: some are born great, some achieve greatness, and some have greatness thrust upon 'em.” yields terms such as “great” (appears 2 times), “afraid” (appears 1 time), “born” (appears 1 time) • items: bread, sugar, coffee, milk; ice cream, hot fudge
  • 24. BigML, Inc #DutchMLSchool Why Not Decision Trees 24 • Slightly prone to over-fitting. (what is that again?)
  • 25. BigML, Inc #DutchMLSchool Learning Problems (fit) 25 Under-fitting Over-fitting • Model fits the training data too well and does not “generalize” • Captures the noise or outliers of the data • Change algorithm or filter outliers
  • 26. BigML, Inc #DutchMLSchool Why Not Decision Trees 26 • Slightly prone to over-fitting • But we’ll fix this with ensembles • Splitting prefers decision boundaries that are parallel to feature axes
  • 27. BigML, Inc #DutchMLSchool Splits Parallel to Axis 27 But not Possible! Ideal split…
  • 28. BigML, Inc #DutchMLSchool Splits Parallel to Axis 28 Will “discover” diagonal edge eventually
  • 29. BigML, Inc #DutchMLSchool Why Not Decision Trees 29 • Slightly prone to over-fitting • But we’ll fix this with ensembles • Splitting prefers decision boundaries that are parallel to feature axes • More data! • Predictions outside training data can be problematic
  • 31. BigML, Inc #DutchMLSchool Why Not Decision Trees 31 • Slightly prone to over-fitting • But we’ll fix this with ensembles • Splitting prefers decision boundaries that are parallel to feature axes • More data! • Predictions outside training data can be problematic • We can catch this with model competence • Can be sensitive to small changes in training data
  • 33. BigML, Inc #DutchMLSchool Why Not Decision Trees 33 • Slightly prone to over-fitting • But we’ll fix this with ensembles • Splitting prefers decision boundaries that are parallel to feature axes • More data! • Predictions outside training data can be problematic • We can catch this with model competence • Can be sensitive to small changes in training data • What other models can we try? • And how will we know which one works best?
  • 35. BigML, Inc #DutchMLSchool Easy Right? 35 Model Prediction: count up mistakes!
    INTL MIN | INTL CALLS | INTL CHARGE | CUST SERV CALLS | CHURN | PREDICT CHURN
    8,7 | 4 | 2,35 | 1 | False | False
    11,2 | 5 | 3,02 | 0 | False | True
    12,7 | 6 | 3,43 | 4 | True | True
    9,1 | 5 | 2,46 | 0 | False | False
    11,2 | 2 | 3,02 | 1 | False | False
    12,3 | 5 | 3,32 | 3 | False | False
    13,1 | 6 | 3,54 | 4 | False | False
    5,4 | 9 | 1,46 | 4 | True | False
    13,8 | 4 | 3,73 | 1 | False | False
  • 36. BigML, Inc #DutchMLSchool Mistakes can be Costly 36 FUN! + = DANGER! Insight: labeling a yield sign as a stop sign is not as bad as labeling a stop sign as a yield sign… Need better metrics!
  • 37. BigML, Inc #DutchMLSchool Evaluation Metrics 37 • Imagine we have a model that can predict a person’s dominant hand, that is for any individual it predicts left / right • Define the positive class • This selection is arbitrary • It is the class you are interested in! • The negative class is the “other” class (or others) • For this example, we choose : left
  • 38. BigML, Inc #DutchMLSchool Evaluation Metrics 38 • We choose the positive class: left • True Positive (TP) • We predicted left and the correct answer was left • True Negative (TN) • We predicted right and the correct answer was right • False Positive (FP) • We predicted left but the correct answer was right • False Negative (FN) • We predicted right but the correct answer was left
  • 39. BigML, Inc #DutchMLSchool Evaluation Metrics 39 True Positive: Correctly predicted the positive class True Negative: Correctly predicted the negative class False Positive: Incorrectly predicted the positive class False Negative: Incorrectly predicted the negative class Remember…
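The four definitions above can be turned into a small counting routine. This is a minimal sketch: the function name `confusion_counts` and the ten example labels are made up for illustration, chosen so that the counts match the handedness example used on the Precision and Recall slides (TP = 2, FP = 2, TN = 5, FN = 1).

```python
def confusion_counts(actual, predicted, positive):
    """Count TP, FP, TN and FN for the chosen positive class."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for truth, guess in zip(actual, predicted):
        if guess == positive:
            # We predicted the positive class: right or wrong?
            counts["TP" if truth == positive else "FP"] += 1
        else:
            # We predicted the negative class: right or wrong?
            counts["TN" if truth != positive else "FN"] += 1
    return counts

# Illustrative data: three left-handers and seven right-handers.
actual    = ["left", "left", "left"] + ["right"] * 7
predicted = ["left", "left", "right", "left", "left"] + ["right"] * 5

counts = confusion_counts(actual, predicted, positive="left")
print(counts)
```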
  • 40. BigML, Inc #DutchMLSchool Accuracy 40 Accuracy $= \frac{TP + TN}{\text{Total}}$ • “Percentage correct” - like an exam • If Accuracy = 1 then no mistakes • If Accuracy = 0 then all mistakes • Intuitive but not always useful • Watch out for unbalanced classes! • Ex: 90% of people are right-handed and 10% are left-handed • A silly model which always predicts right-handed is 90% accurate
  • 41. BigML, Inc #DutchMLSchool Accuracy 41 Positive Class = Left, Negative Class = Right • Classified as Left Handed: TP = 0, FP = 0 • Classified as Right Handed: TN = 7, FN = 3 • Accuracy $= \frac{TP + TN}{\text{Total}} = \frac{0 + 7}{10} = 70\%$
  • 42. BigML, Inc #DutchMLSchool Precision 42 Precision $= \frac{TP}{TP + FP}$ • “accuracy” or “purity” of positive class • How well you did separating the positive class from the negative class • If Precision = 1 then no FP. • You may have missed some left handers, but of the ones you identified, all are left handed. No mistakes. • If Precision = 0 then no TP • None of the left handers you identified are actually left handed. All mistakes.
  • 43. BigML, Inc #DutchMLSchool Precision 43 Positive Class = Left, Negative Class = Right • Classified as Left Handed: TP = 2, FP = 2 • Classified as Right Handed: TN = 5, FN = 1 • Precision $= \frac{TP}{TP + FP} = \frac{2}{2 + 2} = 50\%$
  • 44. BigML, Inc #DutchMLSchool Recall 44 Recall $= \frac{TP}{TP + FN}$ • percentage of positive class correctly identified • A measure of how well you identified all of the positive class examples • If Recall = 1 then no FN → All left handers identified • There may be FP, so precision could be <1 • If Recall = 0 then no TP → No left handers identified
  • 45. BigML, Inc #DutchMLSchool Recall 45 Positive Class = Left, Negative Class = Right • Classified as Left Handed: TP = 2, FP = 2 • Classified as Right Handed: TN = 5, FN = 1 • Recall $= \frac{TP}{TP + FN} = \frac{2}{2 + 1} \approx 67\%$
  • 46. BigML, Inc #DutchMLSchool f-Measure 46 f-measure $= \frac{2 \cdot \text{Recall} \cdot \text{Precision}}{\text{Recall} + \text{Precision}}$ • harmonic mean of Recall & Precision • If f-measure = 1 then Recall == Precision == 1 • If Precision OR Recall is small then the f-measure is small
  • 47. BigML, Inc #DutchMLSchool Phi Coefficient 47 $\phi = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$ • Returns a value between -1 and 1 • If -1 then predictions are opposite reality • =0 no correlation between predictions and reality • =1 then predictions are always correct
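As a worked check of the five metrics, the sketch below computes them from the confusion counts on the slides. The helper names (`safe_div`, `evaluation_metrics`) are illustrative, and returning `None` when a denominator is zero is an assumption of this sketch, not something the slides specify.

```python
import math

def safe_div(numerator, denominator):
    """Return None instead of raising when the metric is undefined."""
    return numerator / denominator if denominator else None

def evaluation_metrics(tp, fp, tn, fn):
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    if precision is None or recall is None or precision + recall == 0:
        f_measure = None
    else:
        f_measure = 2 * recall * precision / (recall + precision)
    phi = safe_div(tp * tn - fp * fn,
                   math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return {
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "phi": phi,
    }

# Counts from the Precision / Recall slides (positive class = left).
model = evaluation_metrics(tp=2, fp=2, tn=5, fn=1)

# Counts from the Accuracy slide: a model that always predicts "right".
always_right = evaluation_metrics(tp=0, fp=0, tn=7, fn=3)

print(model)
print(always_right)
```

Both models score 70% accuracy, but the always-right model has a recall of 0 and undefined precision and phi, which is why accuracy alone is not enough with unbalanced classes.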
  • 49. BigML, Inc #DutchMLSchool What Just Happened? 49 • Starting with the Diabetes Source, we created a Dataset and then a Model. • Using both the Model and the original Dataset, we created an Evaluation. • We reviewed the metrics provided by the Evaluation: • Confusion Matrix • Accuracy, Precision, Recall, f-measure and phi • This Model seemed to perform really, really well… Question: Can we trust this model?
  • 50. BigML, Inc #DutchMLSchool Evaluation Danger! 50 • Never evaluate with the training data! • Many models are able to “memorize” the training data • This will result in overly optimistic evaluations!
  • 51. BigML, Inc #DutchMLSchool “Memorizing” Training Data 51 plasma glucose bmi diabetes pedigree age diabetes 148 33,6 0,627 50 TRUE 85 26,6 0,351 31 FALSE 183 23,3 0,672 32 TRUE 89 28,1 0,167 21 FALSE 137 43,1 2,288 33 TRUE 116 25,6 0,201 30 FALSE 78 31 0,248 26 TRUE 115 35,3 0,134 29 FALSE 197 30,5 0,158 53 TRUE Training Evaluating plasma glucose bmi diabetes pedigree age diabetes 148 33,6 0,627 50 ? 85 26,6 0,351 31 ? • Exactly the same values! • Who needs a model? • What we want to know is how the model performs with values never seen at training: 124 22 0,107 46 ?
  • 52. BigML, Inc #DutchMLSchool Evaluation Danger! 52 • Never evaluate with the training data! • Many models are able to “memorize” the training data • This will result in overly optimistic evaluations! • If you only have one Dataset, use a train/test split
  • 53. BigML, Inc #DutchMLSchool Train / Test Split 53 plasma glucose bmi diabetes pedigree age diabetes 148 33,6 0,627 50 TRUE 183 23,3 0,672 32 TRUE 89 28,1 0,167 21 FALSE 78 31 0,248 26 TRUE 115 35,3 0,134 29 FALSE 197 30,5 0,158 53 TRUE Train Test plasma glucose bmi diabetes pedigree age diabetes 85 26,6 0,351 31 FALSE 137 43,1 2,288 33 TRUE 116 25,6 0,201 30 FALSE • These instances were never seen at training time. • Better evaluation of how the model will perform with “new” data
  • 54. BigML, Inc #DutchMLSchool Evaluation Danger! 54 • Never evaluate with the training data! • Many models are able to “memorize” the training data • This will result in overly optimistic evaluations! • If you only have one Dataset, use a train/test split • Even a train/test split may not be enough! • Might get a “lucky” split • Solution is to repeat several times (formally to cross validate)
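Repeating the train/test split several times can be sketched as k-fold cross-validation. This is a minimal illustration in plain Python: the function name `k_fold_indices`, the choice of 3 folds and the fixed seed are assumptions, and the sketch only produces row indices, leaving model training and evaluation to whatever tool is being used.

```python
import random

def k_fold_indices(n_rows, k, seed=0):
    """Yield (train, test) index lists; every row is tested exactly once."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)        # random split, as on the slide
    folds = [indices[i::k] for i in range(k)]   # k roughly equal folds
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# Nine rows, like the small diabetes table, split three different ways.
splits = list(k_fold_indices(n_rows=9, k=3))
for train, test in splits:
    print("train:", sorted(train), "test:", sorted(test))
```

Averaging the evaluation scores over the folds reduces the chance that one "lucky" split flatters the model.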
  • 56. BigML, Inc #DutchMLSchool What Just Happened? 56 • Starting with the Diabetes Dataset we created a train/test split • We built a Model using the train set and evaluated it with the test set • The scores were much worse than before, showing the danger of evaluating with training data. • Then we built several other models with different parameters and used the evaluation comparison tool to see which performed the best. Question: Couldn’t we search for the best Model or parameters? STAY TUNED
  • 57. BigML, Inc #DutchMLSchool Evaluation 57 • Never evaluate with the training data! • Many models are able to “memorize” the training data • This will result in overly optimistic evaluations! • If you only have one Dataset, use a train/test split • Even a train/test split may not be enough! • Might get a “lucky” split • Solution is to repeat several times (formally to cross validate) • Don’t forget that accuracy can be misleading! • Mostly useless with unbalanced classes (left/right?) • Use weighting, operating points, other tricks…
  • 58. BigML, Inc #DutchMLSchool Operating Points 58 • The default probability threshold is 50% • Changing the threshold can change the outcome for a specific class
    Rate | Payment | … | Actual Outcome | Probability PAID | Threshold @ 50% | Threshold @ 60% | Threshold @ 90%
    8,4 % | US$456 | … | PAID | 95 % | PAID | PAID | PAID
    9,6 % | US$134 | … | PAID | 87 % | PAID | PAID | DEFAULT
    18 % | US$937 | … | DEFAULT | 36 % | DEFAULT | DEFAULT | DEFAULT
    21 % | US$35 | … | PAID | 88 % | PAID | PAID | DEFAULT
    17,5 % | US$1.044 | … | DEFAULT | 55 % | PAID | DEFAULT | DEFAULT
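The effect of moving the operating point can be reproduced with a few lines. This is a sketch: the function name `classify` is made up, and treating a probability exactly equal to the threshold as PAID is an assumption. The five probabilities are the "Probability PAID" values from the slide.

```python
def classify(prob_paid, threshold=0.5):
    """Predict PAID only when the model is confident enough."""
    return "PAID" if prob_paid >= threshold else "DEFAULT"

# Probability of PAID for the five loans on the slide.
probabilities = [0.95, 0.87, 0.36, 0.88, 0.55]

outcomes = {
    threshold: [classify(p, threshold) for p in probabilities]
    for threshold in (0.5, 0.6, 0.9)
}
for threshold, labels in outcomes.items():
    print(f"threshold {threshold:.0%}: {labels}")
```

Raising the threshold makes the model more cautious about predicting PAID: the loan with a 55% probability, which actually defaulted, is only caught once the threshold moves above 55%.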
  • 59. BigML, Inc #DutchMLSchool What about Regressions? 59 • No classes: • Not possible to count mistakes: TP, FP, TN, FN • Predicted values are numeric: error is the amount “off” • actual 200, predict 180 = error 20 • Mean Absolute Error / Mean Squared Error • Both summarize the error across all predictions • Note: value of the error is “unbounded”. • When comparing models, lower values are “better” • R-Squared • Measure of how much better the model is than always predicting the mean • < 0 model is worse than the mean • = 0 model is no better than the mean • ➞ 1 model fits the data “perfectly”
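The regression metrics can be sketched directly from their definitions. The function names below are illustrative, and R-squared is computed with the common formula 1 − SS_res / SS_tot, which matches the "compared to always predicting the mean" description. The example prices are the four actual and predicted sale prices from the Boosting slide later in this deck.

```python
def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mean_squared_error(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def r_squared(actual, predicted):
    """1 - (model's squared error) / (squared error of predicting the mean)."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

actual    = [360000, 307500, 600000, 435350]
predicted = [360750, 306875, 587500, 435350]

mae = mean_absolute_error(actual, predicted)
mse = mean_squared_error(actual, predicted)
r2 = r_squared(actual, predicted)
print(mae, mse, r2)

# A "model" that always predicts the mean scores an R-squared of 0.
mean_price = sum(actual) / len(actual)
r2_mean = r_squared(actual, [mean_price] * len(actual))
```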
  • 61. BigML, Inc #DutchMLSchool What just happened? 61 • We split the RedFin data into training and test Datasets • We created a Model and Evaluation • We examined the Evaluation metrics Wait - What about Time Series?
  • 62. BigML, Inc #DutchMLSchool Dependent Data 62 Year / Pineapple Harvest: 1986 50,74; 1987 22,03; 1988 50,69; 1989 40,38; 1990 29,80; 1991 9,90; 1992 73,93; 1993 22,95; 1994 139,09; 1995 115,17; 1996 193,88; 1997 175,31; 1998 223,41; 1999 295,03; 2000 450,53 [Plot: Pineapple Harvest (tons) by Year, 1986 to 2000, annotated with Trend and Error]
  • 63. BigML, Inc #DutchMLSchool Dependent Data 63 Year / Pineapple Harvest: 1986 139,09; 1987 175,31; 1988 9,91; 1989 22,95; 1990 450,53; 1991 73,93; 1992 40,38; 1993 22,03; 1994 295,03; 1995 50,74; 1996 29,8; 1997 223,41; 1998 115,17; 1999 193,88; 2000 50,69 [Plot: Pineapple Harvest (tons) by Year, 1986 to 2000] Rearranging Disrupts Patterns
  • 64. BigML, Inc #DutchMLSchool Random Train / Test Split 64 plasma glucose bmi diabetes pedigree age diabetes 148 33,6 0,627 50 TRUE 183 23,3 0,672 32 TRUE 89 28,1 0,167 21 FALSE 78 31 0,248 26 TRUE 115 35,3 0,134 29 FALSE 197 30,5 0,158 53 TRUE Train Test plasma glucose bmi diabetes pedigree age diabetes 85 26,6 0,351 31 FALSE 137 43,1 2,288 33 TRUE 116 25,6 0,201 30 FALSE
  • 65. BigML, Inc #DutchMLSchool Linear Train / Test Split 65 Train (Year / Pineapple Harvest): 1986 50,74; 1987 22,03; 1988 50,69; 1989 40,38; 1990 29,80; 1991 9,90; 1992 73,93; 1993 22,95; 1994 139,09; 1995 115,17; 1996 193,88 Test (Year / Pineapple Harvest): 1997 175,31; 1998 223,41; 1999 295,03; 2000 450,53 Forecast COMPARE
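For dependent data the split has to respect time order. The sketch below uses the pineapple harvest table from the slide; the function name `linear_split` and its `last_train_year` parameter are illustrative choices, not part of any particular tool.

```python
# (year, harvest) rows from the slide, decimal commas written as points.
harvest = [
    (1986, 50.74), (1987, 22.03), (1988, 50.69), (1989, 40.38),
    (1990, 29.80), (1991, 9.90), (1992, 73.93), (1993, 22.95),
    (1994, 139.09), (1995, 115.17), (1996, 193.88), (1997, 175.31),
    (1998, 223.41), (1999, 295.03), (2000, 450.53),
]

def linear_split(rows, last_train_year):
    """Train on everything up to a cut-off year, test on what follows."""
    ordered = sorted(rows)                      # keep chronological order
    train = [row for row in ordered if row[0] <= last_train_year]
    test = [row for row in ordered if row[0] > last_train_year]
    return train, test

train, test = linear_split(harvest, last_train_year=1996)
print("train years:", train[0][0], "to", train[-1][0])
print("test years: ", test[0][0], "to", test[-1][0])
```

A forecast built from the training years is then compared against the held-out later years, so the model is never evaluated on data that precedes what it was trained on.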
  • 67. BigML, Inc #DutchMLSchool What is an Ensemble? 67 • Rather than build a single model… • Combine the output of several typically “weaker” models into a powerful ensemble… • Q1: Why is this necessary? • Q2: How do we build “weaker” models? • Q3: How do we “combine” models?
  • 68. BigML, Inc #DutchMLSchool No Model is Perfect 68 • A given ML algorithm may simply not be able to exactly model the “real solution” of a particular dataset. • Try to fit a line to a curve • Even if the model is very capable, the “real solution” may be elusive • DT/NN can model any decision boundary with enough training data, but finding the optimal solution is NP-hard • Practical algorithms involve random processes and may arrive at different, yet equally good, “solutions” depending on the starting conditions, local optima, etc. • If that wasn’t bad enough…
  • 69. BigML, Inc #DutchMLSchool No Data is Perfect 69 • Not enough data! • Always working with finite training data • Therefore, every “model” is an approximation of the “real solution” and there may be several good approximations. • Anomalies / Outliers • The model is trying to generalize from discrete training data. • Outliers can “skew” the model, by overfitting • Mistakes in your data • Does the model have to do everything for you? • But really, there are always mistakes in your data
  • 70. BigML, Inc #DutchMLSchool Ensemble Techniques 70 • Key Idea: • By combining several good “models”, the combination may be closer to the best possible “model” • We want to ensure diversity. It’s not useful to use an ensemble of 100 models that are all the same. • Training Data Tricks • Build several models, each with only some of the data • Introduce randomness directly into the algorithm • Add training weights to “focus” the additional models on the mistakes made • Prediction Tricks • Model the mistakes • Model the output of several different algorithms
  • 71. BigML, Inc #DutchMLSchool Simple Example - Fit a Line 71
  • 72. BigML, Inc #DutchMLSchool Simple Example - Fit a Line 72
  • 73. BigML, Inc #DutchMLSchool Simple Example - Fit a Line 73 Partition the data… then model each partition… For predictions, use the model for the same partition ?
  • 74. BigML, Inc #DutchMLSchool Decision Forest 74 MODEL 1 DATASET SAMPLE 1 SAMPLE 2 SAMPLE 3 SAMPLE 4 MODEL 2 MODEL 3 MODEL 4 PREDICTION 1 PREDICTION 2 PREDICTION 3 PREDICTION 4 PREDICTION COMBINER
  • 75. BigML, Inc #DutchMLSchool Random Decision Forest 75 MODEL 1 DATASET SAMPLE 1 SAMPLE 2 SAMPLE 3 SAMPLE 4 MODEL 2 MODEL 3 MODEL 4 PREDICTION 1 PREDICTION 2 PREDICTION 3 PREDICTION 4 SAMPLE 1 PREDICTION COMBINER
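The sampling and combining steps in the two forest diagrams can be sketched as follows. This is an illustration under stated assumptions: sampling rows with replacement (bootstrap), picking a random subset of candidate features, averaging for regression and majority vote for classification are common choices for decision forests, but the slides do not say exactly how the samples are drawn or how the combiner works. All function names are made up.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Decision Forest: each model sees a random sample of the rows."""
    return [rng.choice(rows) for _ in rows]

def random_candidates(features, k, rng):
    """Random Decision Forest: also restrict which features may be used."""
    return rng.sample(features, k)

def combine_regression(predictions):
    """Combiner for numeric predictions: the mean."""
    return sum(predictions) / len(predictions)

def combine_classification(predictions):
    """Combiner for class predictions: the majority vote."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(42)
rows = list(range(10))                      # stand-in for dataset rows
features = ["sqft", "beds", "baths", "lot size", "year built"]

samples = [bootstrap_sample(rows, rng) for _ in range(4)]       # SAMPLE 1..4
candidates = [random_candidates(features, 2, rng) for _ in range(4)]

print(combine_regression([360000, 362500, 355000, 358500]))
print(combine_classification(["True", "False", "True", "True"]))
```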
  • 76. BigML, Inc #DutchMLSchool Boosting 76 ADDRESS BEDS BATHS SQFT LOT SIZE YEAR BUILT LATITUDE LONGITUDE LAST SALE PRICE 1522 NW Jonquil 4 3 2424 5227 1991 44,594828 -123,269328 360000 7360 NW Valley Vw 3 2 1785 25700 1979 44,643876 -123,238189 307500 4748 NW Veronica 5 3,5 4135 6098 2004 44,5929659 -123,306916 600000 411 NW 16th 3 2825 4792 1938 44,570883 -123,272113 435350 MODEL 1 PREDICTED SALE PRICE 360750 306875 587500 435350 ERROR 750 -625 -12500 0 ADDRESS BEDS BATHS SQFT LOT SIZE YEAR BUILT LATITUDE LONGITUDE ERROR 1522 NW Jonquil 4 3 2424 5227 1991 44,594828 -123,269328 750 7360 NW Valley Vw 3 2 1785 25700 1979 44,643876 -123,238189 625 4748 NW Veronica 5 3,5 4135 6098 2004 44,5929659 -123,306916 12500 411 NW 16th 3 2825 4792 1938 44,570883 -123,272113 0 MODEL 2 PREDICTED ERROR 750 625 12393,83333 6879,67857 Why stop at one iteration? "Hey Model 1, what do you predict is the sale price of this home?" "Hey Model 2, how much error do you predict Model 1 just made?"
  • 77. BigML, Inc #DutchMLSchool Boosting 77 DATASET MODEL 1 DATASET 2 MODEL 2 DATASET 3 MODEL 3 DATASET 4 MODEL 4 PREDICTION 1 PREDICTION 2 PREDICTION 3 PREDICTION 4 PREDICTION SUM Iteration 1 Iteration 2 Iteration 3 Iteration 4 etc…
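The boosting loop in the diagram (each model predicts the error left by the previous ones, and the final prediction is the sum) can be sketched with very weak models. Assumptions in this sketch: each weak model is a one-split "stump" on a single feature, the error is taken as actual minus predicted so that predictions can simply be summed, and the data are the square footage and sale price of the four homes on the previous slide. None of this is claimed to be how any particular product implements boosting.

```python
def fit_stump(xs, targets):
    """Weak model: one split on x, predicting the mean target on each side."""
    best = None
    for threshold in sorted(set(xs))[1:]:            # needs 2+ distinct x values
        left = [t for x, t in zip(xs, targets) if x < threshold]
        right = [t for x, t in zip(xs, targets) if x >= threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x < threshold else right_mean

def boost(xs, ys, iterations=4):
    """Model 1 predicts the price; each later model predicts the remaining error."""
    models = []
    residuals = list(ys)
    for _ in range(iterations):
        model = fit_stump(xs, residuals)
        models.append(model)
        residuals = [r - model(x) for x, r in zip(xs, residuals)]
    return lambda x: sum(model(x) for model in models)   # PREDICTION SUM

def training_sse(predict, xs, ys):
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys))

sqft = [2424, 1785, 4135, 2825]
price = [360000, 307500, 600000, 435350]

one_round = boost(sqft, price, iterations=1)
four_rounds = boost(sqft, price, iterations=4)
print(training_sse(one_round, sqft, price), training_sse(four_rounds, sqft, price))
```

On this tiny training set, each additional iteration shrinks the remaining training error. With noisy data the same behaviour is what allows boosting to overfit, so the result should still be checked on a held-out test set.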
  • 79. BigML, Inc #DutchMLSchool Which Ensemble Method 79 • The one that works best! • Ok, but seriously. Did you evaluate? • For "large" / "complex" datasets • Use DF/RDF with deeper node threshold • Even better, use Boosting with more iterations • For "noisy" data • Boosting may overfit • RDF preferred • For "wide" data • Randomizing features (RDF) will be quicker • For "easy" data • A single model may be fine • Bonus: also has the best interpretability! • For classification with "large" number of classes • Boosting will be slower • For "general" data • DF/RDF likely better than a single model or Boosting. • Boosting will be slower since the models are processed serially
  • 81. BigML, Inc #DutchMLSchool Summary 81 • Models have shortcomings: ability to fit, NP-hard, etc • Data has shortcomings: not enough, outliers, mistakes, etc • Ensemble Techniques can improve on single models • Sampling: partitioning, Decision Tree bagging • Adding Randomness: RDF • Modeling the Error: Boosting • Modeling the Models: Stacking • Guidelines for knowing which one might work best in a given situation