This session took place in New York City on November 4th, 2019.
Speaker Bio:
Chemere is a Senior Data Science Training Specialist at H2O.ai. Chemere has a Master's in Business Administration with a focus in Marketing Analytics from the University of North Carolina at Charlotte. She is an experienced data scientist with a diverse background in transformational decision-making across industries including banking, manufacturing, logistics, and medical devices. Chemere joins us from Venus Concept/2two5, where she was the Lead Data Scientist focused on building predictive models with Internet of Things (IoT) data and on a subscription-based marketing product for B2B customers. Prior to that, Chemere worked as a Senior Data Scientist at Wells Fargo Bank, focused on a variety of applied predictive analytics solutions.
More details about the event can be found here: https://www.eventbrite.com/e/dive-into-h2o-new-york-tickets-76351721053
4. Confidential

H2O.ai Product Suite

H2O open source:
• 100% open source, Apache V2 licensed
• In-memory, distributed machine learning algorithms with the H2O Flow GUI
• Built for data scientists: interfaces in R, Python, Scala, and H2O Flow (an interactive notebook interface)
• H2O AI open source engine integration with Spark
• Lightning-fast machine learning on GPUs
• Enterprise support subscriptions

Driverless AI (enterprise software):
• Automatic feature engineering, machine learning, and interpretability
• Built for domain users, analysts, and data scientists: GUI-based interface for end-to-end data science
• Fully automated machine learning from ingest to deployment
• User licenses on a per-seat basis (annual subscription)
5. The Workflow of Driverless AI

Data sources: SQL, HDFS, Snowflake, Amazon S3, Google BigQuery, Azure Blob Storage

1 Drag and Drop Data: ingest a modeling dataset (features X, target Y) from any of the sources above.
2 Automatic Visualization
3 Bring Your Own Recipes: upload your own recipe(s) for transformations, algorithms, and scorers; Driverless AI executes its automation on your recipes.
4 Automatic Model Optimization: advanced feature engineering + algorithm + model tuning, selected by "survival of the fittest." Covers feature engineering, model selection, hyper-parameter tuning, and overfitting protection. Model recipes: i.i.d. data, time series, and more on the way.
5 Automatic Scoring Pipelines: Driverless AI automates model scoring and deployment using your recipes, deploying low-latency scoring to production together with model documentation.
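The "survival of the fittest" step can be pictured as a small evolutionary search over candidate features. The sketch below is a toy illustration in plain Python with made-up features and a made-up fitness function; it is not Driverless AI's actual optimization routine:

```python
import random

random.seed(0)

# Toy "survival of the fittest" loop in the spirit of automatic model
# optimization: evolve feature subsets, keep the fittest each generation.
# Illustrative only; not Driverless AI's real algorithm.

N_FEATURES = 6
INFORMATIVE = {0, 2}  # pretend only these features carry signal

def fitness(subset):
    # Reward informative features, lightly penalize complexity.
    return len(subset & INFORMATIVE) - 0.1 * len(subset)

def mutate(subset):
    # Flip one randomly chosen feature in or out of the subset.
    return frozenset(subset ^ {random.randrange(N_FEATURES)})

population = [frozenset({random.randrange(N_FEATURES)}) for _ in range(8)]
initial_fitness = max(fitness(s) for s in population)

for generation in range(30):
    population.sort(key=fitness, reverse=True)
    survivors = population[:4]                      # keep the fitter half
    population = survivors + [mutate(s) for s in survivors]

best = max(population, key=fitness)
print(sorted(best), round(fitness(best), 2))
```

Because survivors are carried over unchanged, the best fitness in the population can never decrease from one generation to the next.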
6. Driverless AI: Supervised Learning

Regression: How much will a customer spend? (predict a numeric target y from features X)
Classification: Will a customer make a purchase, yes or no? (predict a class label that separates the "yes" and "no" examples in feature space)
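To make the two task types concrete, here is a toy sketch on a hypothetical one-feature dataset (the numbers are invented, not from the talk): regression fits a line to predict spend, and classification learns a simple threshold to predict purchase.

```python
# Hypothetical data: one feature (e.g. number of site visits) with both
# a regression target (dollars spent) and a classification target.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
spend = [12.0, 19.0, 33.0, 41.0, 48.0]  # regression: "how much?"
bought = [0, 0, 1, 1, 1]                # classification: "yes or no?"

# Regression: fit y = a*x + b by ordinary least squares (closed form).
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(spend) / n
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, spend)) / \
    sum((x - mean_x) ** 2 for x in xs)
b = mean_y - a * mean_x
predicted_spend = a * 6.0 + b  # how much will this new customer spend?

# Classification: predict "yes" when x exceeds a learned threshold
# (here simply the midpoint between the two class means).
mean_no = sum(x for x, c in zip(xs, bought) if c == 0) / bought.count(0)
mean_yes = sum(x for x, c in zip(xs, bought) if c == 1) / bought.count(1)
threshold = (mean_no + mean_yes) / 2
will_buy = 1 if 6.0 > threshold else 0  # will this customer purchase?

print(round(predicted_spend, 1), will_buy)
```

The same feature matrix X supports both tasks; only the target y and the loss being optimized differ.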
8. Driverless AI Modeling

Features + Target → Modeling Table → Model Building → Model

Data Types
• Numeric
• Categorical
• Time/Date
• Text
• Missing values allowed

Model Types
• Regression
• Classification: binary or multinomial

Build Process
• Feature engineering, including NLP for text
• Automated hyperparameter tuning

Both i.i.d. and Time Series
• Single or grouped time series
• Gap between last observation and prediction
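As a rough illustration of what "missing values allowed" can mean in practice, here is a minimal imputation sketch (column mean for numerics, most frequent level for categoricals). This is an assumption for illustration, not Driverless AI's internal handling of missing data:

```python
from collections import Counter

# Hypothetical rows in the credit-card schema, with missing entries.
rows = [
    {"CreditLimit": 20000.0, "Marriage": "M"},
    {"CreditLimit": None,    "Marriage": "S"},
    {"CreditLimit": 90000.0, "Marriage": None},
    {"CreditLimit": 50000.0, "Marriage": "S"},
]

# Numeric column: fill missing entries with the column mean.
vals = [r["CreditLimit"] for r in rows if r["CreditLimit"] is not None]
mean_limit = sum(vals) / len(vals)

# Categorical column: fill missing entries with the most frequent level.
levels = Counter(r["Marriage"] for r in rows if r["Marriage"] is not None)
mode_marriage = levels.most_common(1)[0][0]

for r in rows:
    if r["CreditLimit"] is None:
        r["CreditLimit"] = mean_limit
    if r["Marriage"] is None:
        r["Marriage"] = mode_marriage

print(rows[1]["CreditLimit"], rows[2]["Marriage"])
```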
10. Top 10 Finish in BNP Kaggle Competition

Single run, fully automated: 2 hours on a DGX Station, 6 hours on a PC.
Driverless AI: 10th place on the private leaderboard at Kaggle (out of 2,926).
14. Confidential14
• Dataset:
– Comes from a lender in Taiwan (April – August, 2005)
– Information on default payments, demographic factors, credit data, history of
payment, etc.
– Source:
– UCI Machine Learning Library
– kaggle.com/uciml/default-of-credit-card-clients-dataset
• Our Goal:
– Predict whether someone will default on their next credit card payment.
Credit Card Payment Default
14
15. The Data

Column: Description
ID: ID of each customer
Default: Defaulted on next payment (1 = yes, 0 = no)
CreditLimit: Credit limit in NT dollars
Sex: Gender (M, F)
Education: Education level (1 = graduate school, 2 = university, 3 = high school, 4 = others, 5–6 = unknown)
Marriage: Marital status (M, S, D, O)
Age: Age in years
Status1 … Status6: Repayment status for September 2005 back through April 2005
BillAmt1 … BillAmt6: Amount of bill statement for September 2005 back through April 2005 (NT dollars)
PayAmt1 … PayAmt6: Amount of previous payment for September 2005 back through April 2005 (NT dollars)
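A sketch of loading data in this schema and computing the target rate, using only the Python standard library. The three rows below are made-up examples in the table's column layout, not real records from the dataset:

```python
import csv
import io

# Made-up sample rows following the schema in the table above.
sample = """ID,Default,CreditLimit,Sex,Education,Marriage,Age
1,1,20000,F,2,M,24
2,0,120000,F,2,S,26
3,0,90000,M,1,S,34
"""

reader = csv.DictReader(io.StringIO(sample))
rows = list(reader)

# Target: the binary Default column (1 = defaulted on next payment).
default_rate = sum(int(r["Default"]) for r in rows) / len(rows)
print(f"{default_rate:.2%}")
```

With the real file, `io.StringIO(sample)` would be replaced by an open file handle on the downloaded CSV.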
16. Payment History Data

Each look-back month carries three columns: StatusN, BillAmtN, and PayAmtN, where N = 1 is one month ago and N = 6 is six months ago.

Status codes:
-2: No balance
-1: Paid in full
0: Minimum balance paid
1: One month late
2: Two months late
etc.

The possible Status values grow with the look-back: Status1 takes ≤0 or 1; Status2 takes ≤0, 1, or 2; and so on up to Status6, which takes ≤0, 1, …, 6 (a customer cannot be more months late than the number of months elapsed).
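Hand-rolled lag features from these monthly columns illustrate the kind of signal automatic feature engineering might construct. The record and the derived features below are illustrative assumptions, not Driverless AI's actual transformations:

```python
# One hypothetical customer's six-month history, most recent month first
# (index 0 corresponds to Status1/BillAmt1/PayAmt1 in the schema above).
customer = {
    "Status":  [2, 2, 1, 0, 0, -1],
    "BillAmt": [3913.0, 3102.0, 689.0, 0.0, 0.0, 0.0],
    "PayAmt":  [0.0, 689.0, 0.0, 0.0, 0.0, 0.0],
}

# Months delinquent right now, and the worst delinquency in the window.
months_late_now = max(customer["Status"][0], 0)
worst_delinquency = max(customer["Status"])

# Payment coverage: what fraction of last month's bill was just paid?
bill, paid = customer["BillAmt"][1], customer["PayAmt"][0]
pay_ratio = paid / bill if bill > 0 else 1.0

# Trend: is the outstanding bill growing month over month?
bill_growth = customer["BillAmt"][0] - customer["BillAmt"][1]

print(months_late_now, worst_delinquency, round(pay_ratio, 3), bill_growth)
```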
22. Experiment Settings: Accuracy, Time, Interpretability

Accuracy
• Relative accuracy: higher values should lead to higher confidence in model performance (accuracy)
• Impacts things such as the level of data sampling, how many models are used in the final ensemble, and the parameter tuning level, among others

Time
• Relative time for completing the experiment
• Higher settings mean more iterations are performed to find the best set of features, and a longer "early stopping" threshold

Interpretability
• Relative interpretability: higher values favor more interpretable models
• The higher the interpretability setting, the lower the complexity of the engineered features and of the final model(s)
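The "early stopping threshold" mentioned under the Time dial is a patience parameter: how many rounds without improvement to tolerate before stopping. A minimal sketch with a made-up validation-score sequence (not a trace of a real experiment):

```python
# Made-up validation scores, one per training iteration.
scores = [0.61, 0.64, 0.66, 0.66, 0.67, 0.67, 0.67, 0.67, 0.67]

def iterations_run(scores, patience):
    """Return how many iterations run before patience-based early stopping."""
    best, since_best = float("-inf"), 0
    for i, s in enumerate(scores, start=1):
        if s > best:
            best, since_best = s, 0   # improvement: reset the counter
        else:
            since_best += 1
            if since_best >= patience:
                return i              # no improvement for `patience` rounds
    return len(scores)

# A higher Time setting behaves like a larger patience value:
# the search runs longer before giving up on further improvement.
print(iterations_run(scores, patience=2), iterations_run(scores, patience=4))
```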
28. Why is Machine Learning Interpretability Difficult?

Linear models: for a given well-understood dataset there is usually one best model.

Machine learning: for a given well-understood dataset there are usually many good models. This is often referred to as "the multiplicity of good models."

-- Leo Breiman. "Statistical Modeling: The Two Cultures (with comments and a rejoinder by the author)." Statistical Science, 2001. http://bit.ly/2pwz6m5
29. Interpretability

Complexity of learned functions:
• Linear, monotonic
• Nonlinear, monotonic
• Nonlinear, non-monotonic

Scope of interpretability: global vs. local

Application domain

Understanding: model-agnostic vs. model-specific

Trust: enhancing trust and understanding; the mechanisms and results of an interpretable model should be both transparent AND dependable.
30. Global and Local Interpretability

Linear models: exact explanations for approximate models.
Machine learning: approximate explanations for exact models.
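"Approximate explanations for exact models" can be made concrete with a local linear surrogate, in the spirit of LIME-style local explanations. The model below is a made-up nonlinear function, not one from the talk:

```python
def f(x):
    # An "exact" but nonlinear model: no single global coefficient
    # describes how its output responds to x.
    return x ** 2

def local_slope(f, x0, eps=1e-5):
    # Finite-difference slope: a local, linear (approximate)
    # explanation of the model's behavior around x0.
    return (f(x0 + eps) - f(x0 - eps)) / (2 * eps)

# The local explanation differs by region: the same model is explained
# by slope ~2 near x = 1 but by slope ~6 near x = 3.
print(round(local_slope(f, 1.0), 3), round(local_slope(f, 3.0), 3))
```

A linear model is the opposite case: its single coefficient is an exact explanation everywhere, but the model itself only approximates the data.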