This session took place in New York City on November 4th, 2019.
Speaker Bio:
Chemere is a Senior Data Science Training Specialist at H2O.ai. Chemere has a Master's in Business Administration with a focus in Marketing Analytics from the University of North Carolina at Charlotte. She is an experienced data scientist with a diverse background in transformational decision-making across industries including banking, manufacturing, logistics, and medical devices. Chemere joins us from Venus Concept/2two5, where she was the Lead Data Scientist focused on building predictive models with Internet of Things (IoT) data and on a subscription-based marketing product for B2B customers. Prior to that, Chemere worked as a Senior Data Scientist at Wells Fargo Bank, focused on a variety of applied predictive analytics solutions.
More details about the event can be found here: https://www.eventbrite.com/e/dive-into-h2o-new-york-tickets-76351721053
4. Confidential

H2O.ai Product Suite

H2O open source:
• 100% open source, Apache V2 licensed
• In-memory, distributed machine learning algorithms with the H2O Flow GUI
• Built for data scientists: interfaces in R, Python, Scala, and H2O Flow (an interactive notebook interface)
• H2O AI open source engine integration with Spark
• Lightning-fast machine learning on GPUs
• Enterprise support subscriptions

Driverless AI (enterprise software):
• Automatic feature engineering, machine learning, and interpretability
• Built for domain users, analysts, and data scientists: GUI-based interface for end-to-end data science
• Fully automated machine learning from ingest to deployment
• User licenses on a per-seat basis (annual subscription)
5. The Workflow of Driverless AI

Data sources: SQL, HDFS, Snowflake, Amazon S3, Google BigQuery, Azure Blob Storage

1 Drag and Drop Data: ingest a modeling dataset (features X, target Y) from any of the sources above.
2 Automatic Visualization
3 Bring Your Own Recipes: upload your own recipe(s) for transformations, algorithms, and scorers; Driverless AI executes its automation on your recipes.
4 Automatic Model Optimization: advanced feature engineering + algorithm + model tuning, selected by "survival of the fittest." Covers feature engineering, model selection, hyper-parameter tuning, and overfitting protection. Model recipes: i.i.d. data, time series, and more on the way.
5 Automatic Scoring Pipelines: Driverless AI automates model scoring and deployment using your recipes, deploying low-latency scoring to production together with model documentation.
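The "survival of the fittest" step can be pictured as a small evolutionary search over candidate features. The sketch below is a toy illustration in plain Python with made-up features and a made-up fitness function; it is not Driverless AI's actual optimization routine:

```python
import random

random.seed(0)

# Toy "survival of the fittest" loop in the spirit of automatic model
# optimization: evolve feature subsets, keep the fittest each generation.
# Illustrative only; not Driverless AI's real algorithm.

N_FEATURES = 6
INFORMATIVE = {0, 2}  # pretend only these features carry signal

def fitness(subset):
    # Reward informative features, lightly penalize complexity.
    return len(subset & INFORMATIVE) - 0.1 * len(subset)

def mutate(subset):
    # Flip one randomly chosen feature in or out of the subset.
    return frozenset(subset ^ {random.randrange(N_FEATURES)})

population = [frozenset({random.randrange(N_FEATURES)}) for _ in range(8)]
initial_fitness = max(fitness(s) for s in population)

for generation in range(30):
    population.sort(key=fitness, reverse=True)
    survivors = population[:4]                      # keep the fitter half
    population = survivors + [mutate(s) for s in survivors]

best = max(population, key=fitness)
print(sorted(best), round(fitness(best), 2))
```

Because survivors are carried over unchanged, the best fitness in the population can never decrease from one generation to the next.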
6. Driverless AI: Supervised Learning

Regression: How much will a customer spend? (predict a numeric target y from features X)
Classification: Will a customer make a purchase, yes or no? (predict a class label that separates the "yes" and "no" examples in feature space)
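To make the two task types concrete, here is a toy sketch on a hypothetical one-feature dataset (the numbers are invented, not from the talk): regression fits a line to predict spend, and classification learns a simple threshold to predict purchase.

```python
# Hypothetical data: one feature (e.g. number of site visits) with both
# a regression target (dollars spent) and a classification target.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
spend = [12.0, 19.0, 33.0, 41.0, 48.0]  # regression: "how much?"
bought = [0, 0, 1, 1, 1]                # classification: "yes or no?"

# Regression: fit y = a*x + b by ordinary least squares (closed form).
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(spend) / n
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, spend)) / \
    sum((x - mean_x) ** 2 for x in xs)
b = mean_y - a * mean_x
predicted_spend = a * 6.0 + b  # how much will this new customer spend?

# Classification: predict "yes" when x exceeds a learned threshold
# (here simply the midpoint between the two class means).
mean_no = sum(x for x, c in zip(xs, bought) if c == 0) / bought.count(0)
mean_yes = sum(x for x, c in zip(xs, bought) if c == 1) / bought.count(1)
threshold = (mean_no + mean_yes) / 2
will_buy = 1 if 6.0 > threshold else 0  # will this customer purchase?

print(round(predicted_spend, 1), will_buy)
```

The same feature matrix X supports both tasks; only the target y and the loss being optimized differ.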
8. Driverless AI Modeling

Features + Target → Modeling Table → Model Building → Model

Data Types
• Numeric
• Categorical
• Time/Date
• Text
• Missing values allowed

Model Types
• Regression
• Classification: binary or multinomial

Build Process
• Feature engineering, including NLP for text
• Automated hyperparameter tuning

Both i.i.d. and Time Series
• Single or grouped time series
• Gap between last observation and prediction
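As a rough illustration of what "missing values allowed" can mean in practice, here is a minimal imputation sketch (column mean for numerics, most frequent level for categoricals). This is an assumption for illustration, not Driverless AI's internal handling of missing data:

```python
from collections import Counter

# Hypothetical rows in the credit-card schema, with missing entries.
rows = [
    {"CreditLimit": 20000.0, "Marriage": "M"},
    {"CreditLimit": None,    "Marriage": "S"},
    {"CreditLimit": 90000.0, "Marriage": None},
    {"CreditLimit": 50000.0, "Marriage": "S"},
]

# Numeric column: fill missing entries with the column mean.
vals = [r["CreditLimit"] for r in rows if r["CreditLimit"] is not None]
mean_limit = sum(vals) / len(vals)

# Categorical column: fill missing entries with the most frequent level.
levels = Counter(r["Marriage"] for r in rows if r["Marriage"] is not None)
mode_marriage = levels.most_common(1)[0][0]

for r in rows:
    if r["CreditLimit"] is None:
        r["CreditLimit"] = mean_limit
    if r["Marriage"] is None:
        r["Marriage"] = mode_marriage

print(rows[1]["CreditLimit"], rows[2]["Marriage"])
```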
10. Top 10 Finish in BNP Kaggle Competition

Single run, fully automated: 2 hours on a DGX Station, 6 hours on a PC.
Driverless AI: 10th place on the private leaderboard at Kaggle (out of 2,926).
14. Confidential14
• Dataset:
– Comes from a lender in Taiwan (April – August, 2005)
– Information on default payments, demographic factors, credit data, history of
payment, etc.
– Source:
– UCI Machine Learning Library
– kaggle.com/uciml/default-of-credit-card-clients-dataset
• Our Goal:
– Predict whether someone will default on their next credit card payment.
Credit Card Payment Default
14
15. The Data

Column: Description
ID: ID of each customer
Default: Defaulted on next payment (1 = yes, 0 = no)
CreditLimit: Credit limit in NT dollars
Sex: Gender (M, F)
Education: Education level (1 = graduate school, 2 = university, 3 = high school, 4 = others, 5–6 = unknown)
Marriage: Marital status (M, S, D, O)
Age: Age in years
Status1 … Status6: Repayment status for September 2005 back through April 2005
BillAmt1 … BillAmt6: Amount of bill statement for September 2005 back through April 2005 (NT dollars)
PayAmt1 … PayAmt6: Amount of previous payment for September 2005 back through April 2005 (NT dollars)
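A sketch of loading data in this schema and computing the target rate, using only the Python standard library. The three rows below are made-up examples in the table's column layout, not real records from the dataset:

```python
import csv
import io

# Made-up sample rows following the schema in the table above.
sample = """ID,Default,CreditLimit,Sex,Education,Marriage,Age
1,1,20000,F,2,M,24
2,0,120000,F,2,S,26
3,0,90000,M,1,S,34
"""

reader = csv.DictReader(io.StringIO(sample))
rows = list(reader)

# Target: the binary Default column (1 = defaulted on next payment).
default_rate = sum(int(r["Default"]) for r in rows) / len(rows)
print(f"{default_rate:.2%}")
```

With the real file, `io.StringIO(sample)` would be replaced by an open file handle on the downloaded CSV.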
16. Payment History Data

Each look-back month carries three columns: StatusN, BillAmtN, and PayAmtN, where N = 1 is one month ago and N = 6 is six months ago.

Status codes:
-2: No balance
-1: Paid in full
0: Minimum balance paid
1: One month late
2: Two months late
etc.

The possible Status values grow with the look-back: Status1 takes ≤0 or 1; Status2 takes ≤0, 1, or 2; and so on up to Status6, which takes ≤0, 1, …, 6 (a customer cannot be more months late than the number of months elapsed).
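Hand-rolled lag features from these monthly columns illustrate the kind of signal automatic feature engineering might construct. The record and the derived features below are illustrative assumptions, not Driverless AI's actual transformations:

```python
# One hypothetical customer's six-month history, most recent month first
# (index 0 corresponds to Status1/BillAmt1/PayAmt1 in the schema above).
customer = {
    "Status":  [2, 2, 1, 0, 0, -1],
    "BillAmt": [3913.0, 3102.0, 689.0, 0.0, 0.0, 0.0],
    "PayAmt":  [0.0, 689.0, 0.0, 0.0, 0.0, 0.0],
}

# Months delinquent right now, and the worst delinquency in the window.
months_late_now = max(customer["Status"][0], 0)
worst_delinquency = max(customer["Status"])

# Payment coverage: what fraction of last month's bill was just paid?
bill, paid = customer["BillAmt"][1], customer["PayAmt"][0]
pay_ratio = paid / bill if bill > 0 else 1.0

# Trend: is the outstanding bill growing month over month?
bill_growth = customer["BillAmt"][0] - customer["BillAmt"][1]

print(months_late_now, worst_delinquency, round(pay_ratio, 3), bill_growth)
```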
22. Experiment Settings: Accuracy, Time, Interpretability

Accuracy
• Relative accuracy: higher values should lead to higher confidence in model performance (accuracy)
• Impacts things such as the level of data sampling, how many models are used in the final ensemble, and the parameter tuning level, among others

Time
• Relative time for completing the experiment
• Higher settings mean more iterations are performed to find the best set of features, and a longer "early stopping" threshold

Interpretability
• Relative interpretability: higher values favor more interpretable models
• The higher the interpretability setting, the lower the complexity of the engineered features and of the final model(s)
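The "early stopping threshold" mentioned under the Time dial is a patience parameter: how many rounds without improvement to tolerate before stopping. A minimal sketch with a made-up validation-score sequence (not a trace of a real experiment):

```python
# Made-up validation scores, one per training iteration.
scores = [0.61, 0.64, 0.66, 0.66, 0.67, 0.67, 0.67, 0.67, 0.67]

def iterations_run(scores, patience):
    """Return how many iterations run before patience-based early stopping."""
    best, since_best = float("-inf"), 0
    for i, s in enumerate(scores, start=1):
        if s > best:
            best, since_best = s, 0   # improvement: reset the counter
        else:
            since_best += 1
            if since_best >= patience:
                return i              # no improvement for `patience` rounds
    return len(scores)

# A higher Time setting behaves like a larger patience value:
# the search runs longer before giving up on further improvement.
print(iterations_run(scores, patience=2), iterations_run(scores, patience=4))
```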
28. Why is Machine Learning Interpretability Difficult?

Linear models: for a given well-understood dataset there is usually one best model.

Machine learning: for a given well-understood dataset there are usually many good models. This is often referred to as "the multiplicity of good models."

-- Leo Breiman. "Statistical Modeling: The Two Cultures (with comments and a rejoinder by the author)." Statistical Science, 2001. http://bit.ly/2pwz6m5
29. Interpretability

Complexity of learned functions:
• Linear, monotonic
• Nonlinear, monotonic
• Nonlinear, non-monotonic

Scope of interpretability: global vs. local

Application domain

Understanding: model-agnostic vs. model-specific

Trust: enhancing trust and understanding; the mechanisms and results of an interpretable model should be both transparent AND dependable.
30. Global and Local Interpretability

Linear models: exact explanations for approximate models.
Machine learning: approximate explanations for exact models.
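"Approximate explanations for exact models" can be made concrete with a local linear surrogate, in the spirit of LIME-style local explanations. The model below is a made-up nonlinear function, not one from the talk:

```python
def f(x):
    # An "exact" but nonlinear model: no single global coefficient
    # describes how its output responds to x.
    return x ** 2

def local_slope(f, x0, eps=1e-5):
    # Finite-difference slope: a local, linear (approximate)
    # explanation of the model's behavior around x0.
    return (f(x0 + eps) - f(x0 - eps)) / (2 * eps)

# The local explanation differs by region: the same model is explained
# by slope ~2 near x = 1 but by slope ~6 near x = 3.
print(round(local_slope(f, 1.0), 3), round(local_slope(f, 3.0), 3))
```

A linear model is the opposite case: its single coefficient is an exact explanation everywhere, but the model itself only approximates the data.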