Introduction to Driverless AI
Chemere Davis
Confidential
Please Create an Account on Aquarium
Please Sign Into Aquarium
H2O.ai Product Suite
Open Source
• In-memory, distributed machine learning algorithms with H2O Flow GUI
• H2O AI open source engine integration with Spark
• Lightning fast machine learning on GPUs
• 100% open source – Apache V2 licensed
• Built for data scientists – interface using R, Python, Scala, H2O Flow (interactive notebook interface)
• Enterprise support subscriptions
Driverless AI
• Automatic feature engineering, machine learning and interpretability
• Enterprise software
• Built for domain users, analysts and data scientists – GUI-based interface for end-to-end data science
• Fully automated machine learning from ingest to deployment
• User licenses on a per seat basis (annual subscription)
The Workflow of Driverless AI
1 Drag and Drop Data
• Sources: SQL, HDFS, Snowflake, Amazon S3, Google BigQuery, Azure Blob Storage
• Modelling dataset: features X and target Y
2 Automatic Visualization
3 Bring Your Own Recipes
• Upload your own recipe(s): transformations, algorithms, scorers
• Driverless AI executes automation on your recipes: feature engineering, model selection, hyper-parameter tuning, overfitting protection
• Driverless AI automates model scoring and deployment using your recipes
4 Automatic Model Optimization
• Model recipes: i.i.d. data, time-series, more on the way
• Advanced feature engineering + algorithm + model tuning
• Survival of the fittest
5 Automatic Scoring Pipelines
• Model documentation
• Deploy low-latency scoring to production
Driverless AI: Supervised Learning
Regression:
How much will a customer spend?
Classification:
Will a customer make a purchase? Yes or No
[Figure: a regression plot of y against X, and a classification plot of xi against xj with the classes yes and no]
Typical Enterprise ML Workflow
Stages: Data Integration + Data Quality and Transformation, Modeling Table (Features, Target), Model Building, Model
[Figure: workflow diagram marking where Driverless AI fits in these stages]
Driverless AI Modeling
Stages: Modeling Table (Features, Target), Model Building, Model
Data Types
• Numeric
• Categorical
• Time/Date
• Text
• Missing values allowed
Model Types
• Regression
• Classification
– Binary
– Multinomial
Build Process
• Feature engineering
– Including NLP (text)
• Automated hyperparameter tuning
Both i.i.d. & Time Series
• Single
• Grouped
• Gap between last observation and prediction
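The "gap between last observation and prediction" idea can be illustrated with a small sketch. This is a generic illustration, not Driverless AI's own splitting logic; the function name `split_with_gap` and its parameters are hypothetical.

```python
from typing import List, Sequence, Tuple


def split_with_gap(rows: Sequence, n_test: int, gap: int) -> Tuple[List, List]:
    """Split time-ordered rows into train and test, skipping `gap` rows between them."""
    if n_test <= 0 or gap < 0 or n_test + gap >= len(rows):
        raise ValueError("n_test and gap leave no training data")
    test_start = len(rows) - n_test   # the test set is the most recent n_test rows
    train_end = test_start - gap      # rows inside the gap are used by neither set
    return list(rows[:train_end]), list(rows[test_start:])


# Ten consecutive days, predicting the last three with a two-day gap.
days = list(range(1, 11))
train, test = split_with_gap(days, n_test=3, gap=2)
print(train, test)
```

Days 6 and 7 fall in the gap, so the model never sees the observations immediately before the prediction window.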
How Well Does Driverless AI Work?
Top 10 Finish in BNP Kaggle Competition
Single run, fully automated: 2 hours on a DGX Station, 6 hours on a PC
Driverless AI: 10th place on the Kaggle private leaderboard (out of 2926)
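To put that rank in perspective, the arithmetic below converts a leaderboard position into a percentage of all entries. The helper name `leaderboard_percentile` is hypothetical; the numbers 10 and 2926 come from the slide.

```python
def leaderboard_percentile(rank: int, total: int) -> float:
    """Return the rank as a percentage of all entries (lower is better)."""
    if not 1 <= rank <= total:
        raise ValueError("rank must be between 1 and total")
    return 100.0 * rank / total


pct = leaderboard_percentile(10, 2926)
print(f"Top {pct:.2f}% of entries")
```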
Top 5% in Amazon Kaggle Competition
Other Kaggle Competitions: Driverless AI Results
[Chart: relative error (lower is better) on a scale from 0 to 1 for the Allstate, BNP Paribas, Amazon, Homesite and Otto Group competitions, comparing Kaggle Grandmaster Best, Driverless AI and GBM Baseline]
Credit Card Example
Credit Card Payment Default
• Dataset:
– Comes from a lender in Taiwan (April – September, 2005)
– Information on default payments, demographic factors, credit data, history of payment, etc.
– Source: UCI Machine Learning Repository
– kaggle.com/uciml/default-of-credit-card-clients-dataset
• Our Goal:
– Predict whether someone will default on their next credit card payment.
The Data
Column               Description
ID                   ID of each customer
Default              Defaulted on next payment (1 = yes, 0 = no)
CreditLimit          Credit limit in NT dollars
Sex                  Gender (M, F)
Education            Education level (1: graduate school, 2: university, 3: high school, 4: others, 5-6: unknown)
Marriage             Marital status (M, S, D, O)
Age                  Age in years
Status1 … Status6    Repayment status, September 2005 back to April 2005
BillAmt1 … BillAmt6  Amount of bill statement, September 2005 back to April 2005 (NT dollars)
PayAmt1 … PayAmt6    Amount of previous payment, September 2005 back to April 2005 (NT dollars)
Payment History Data
Period         Status values            Bill amount  Payment amount
1 Month Ago    Status1: ≤0, 1           BillAmt1     PayAmt1
2 Months Ago   Status2: ≤0, 1, 2        BillAmt2     PayAmt2
3 Months Ago   Status3: ≤0, 1, 2, 3     BillAmt3     PayAmt3
...
6 Months Ago   Status6: ≤0, 1, ..., 6   BillAmt6     PayAmt6
Status:
-2: No balance
-1: Paid in full
0: Minimum balance paid
1: One month late
2: Two months late
etc.
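The status codes can be decoded with a small lookup. The sketch below assumes the "etc." means every positive code n stands for "n months late"; the function name `describe_status` is hypothetical.

```python
def describe_status(code: int) -> str:
    """Translate a repayment status code into the wording used on the slide."""
    named = {-2: "No balance", -1: "Paid in full", 0: "Minimum balance paid"}
    if code in named:
        return named[code]
    if code >= 1:
        # Assumption: positive codes count the number of months a payment is late.
        unit = "month" if code == 1 else "months"
        return f"{code} {unit} late"
    raise ValueError(f"unknown status code: {code}")


for code in (-2, -1, 0, 1, 2):
    print(code, describe_status(code))
```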
Automatic Visualizations
Automatic Visualization (AutoViz)
• Scalable outlier detection
• Contains novel statistical algorithms to only show “relevant” aspects of the data
• Coming soon: automated data cleaning
Machine Learning Experimentation
Experiment Settings
3 KEY SETTINGS: Accuracy, Time, Interpretability
ACCURACY
• Relative accuracy – higher values should lead to higher confidence in model performance (accuracy)
• Impacts things such as level of data sampling, how many models are used in the final ensemble, parameter tuning level, among others
TIME
• Relative time for completing the experiment
• Higher settings mean:
– More iterations are performed to find the best set of features
– Longer “early stopping” threshold
INTERPRETABILITY
• Relative interpretability – higher values favor more interpretable models
• The higher the interpretability setting, the lower the complexity of the engineered features and of the final model(s)
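A minimal sketch of how the three settings could be held and validated in code. The class `ExperimentSettings` is hypothetical and is not part of any Driverless AI client; the 1 to 10 range is taken from the per-setting tables in this deck.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentSettings:
    """The three experiment dials, each an integer from 1 to 10."""
    accuracy: int
    time: int
    interpretability: int

    def __post_init__(self) -> None:
        # Reject values outside the range shown in the deck's tables.
        for name in ("accuracy", "time", "interpretability"):
            value = getattr(self, name)
            if not 1 <= value <= 10:
                raise ValueError(f"{name} must be between 1 and 10, got {value}")


settings = ExperimentSettings(accuracy=7, time=5, interpretability=8)
print(settings)
```

Keeping the three values together makes it easy to log or compare the settings used for different experiments.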
Accuracy
Accuracy  Max Rows x Cols  Ensemble Level  Target Transformation  Parameter Tuning Level  Num Folds  Only First Fold Model  Distribution Check
1         100K             0               False                  0                       3          True                   No
2         1M               0               False                  0                       3          True                   No
3         50M              0               True                   1                       3          True                   No
4         100M             0               True                   1                       3-4        True                   No
5         200M             1               True                   1                       3-4        True                   Yes
6         500M             2               True                   1                       3-5        True                   Yes
7         750M             <=3             True                   2                       3-10       Auto                   Yes
8         1B               <=3             True                   2                       4-10       Auto                   Yes
9         2B               <=3             True                   3                       4-10       Auto                   Yes
10        10B              <=4             True                   3                       4-10       Auto                   Yes
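The "Max Rows x Cols" column can be turned into a quick size check. The sketch below copies the caps from the table; reading the cap as the point above which data would be sampled is an assumption, and the helper names and the example dataset size are hypothetical.

```python
# Caps copied from the "Max Rows x Cols" column of the accuracy table.
MAX_ROWS_X_COLS = {
    1: "100K", 2: "1M", 3: "50M", 4: "100M", 5: "200M",
    6: "500M", 7: "750M", 8: "1B", 9: "2B", 10: "10B",
}
SUFFIX = {"K": 10**3, "M": 10**6, "B": 10**9}


def max_cells(accuracy: int) -> int:
    """Return the rows-times-columns cap for an accuracy setting as an integer."""
    label = MAX_ROWS_X_COLS[accuracy]
    return int(label[:-1]) * SUFFIX[label[-1]]


def exceeds_cap(n_rows: int, n_cols: int, accuracy: int) -> bool:
    """True when the dataset has more cells than the cap for this setting."""
    return n_rows * n_cols > max_cells(accuracy)


# Hypothetical dataset of 30,000 rows and 25 columns (750,000 cells).
print(max_cells(5))
print(exceeds_cap(30_000, 25, accuracy=1))
print(exceeds_cap(30_000, 25, accuracy=2))
```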
Time
Time  Iterations  Early Stopping Rounds
1     1-5         None
2     10          5
3     30          5
4     40          5
5     50          10
6     100         10
7     150         15
8     200         20
9     300         30
10    500         50
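The time table reads naturally as a lookup. The sketch below copies its values into a dictionary; the names `TIME_SETTINGS` and `describe_time` are hypothetical and only illustrate how the two columns relate to the setting.

```python
from typing import Dict, Optional, Tuple

# setting -> ((min iterations, max iterations), early stopping rounds), from the table
TIME_SETTINGS: Dict[int, Tuple[Tuple[int, int], Optional[int]]] = {
    1: ((1, 5), None),
    2: ((10, 10), 5),
    3: ((30, 30), 5),
    4: ((40, 40), 5),
    5: ((50, 50), 10),
    6: ((100, 100), 10),
    7: ((150, 150), 15),
    8: ((200, 200), 20),
    9: ((300, 300), 30),
    10: ((500, 500), 50),
}


def describe_time(setting: int) -> str:
    """Summarize the iteration budget and early stopping rounds for a time setting."""
    (lo, hi), rounds = TIME_SETTINGS[setting]
    iterations = str(hi) if lo == hi else f"{lo}-{hi}"
    stopping = "no early stopping" if rounds is None else f"early stopping after {rounds} rounds"
    return f"time={setting}: {iterations} iterations, {stopping}"


print(describe_time(1))
print(describe_time(7))
```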
Interpretability
Interpretability  Ensemble Level  Target Transformation  Feature Engineering                                                              Feature Pre-Pruning  Monotonicity Constraints
1 - 3             <= 3            -                      -                                                                                None                 Disabled
4                 <= 3            Inverse                -                                                                                None                 Disabled
5                 <= 3            Anscombe               Clustering (ID, distance), Truncated SVD                                         None                 Disabled
6                 <= 2            Logit, Sigmoid         -                                                                                Feature selection    Disabled
7                 <= 2            -                      Frequency Encoding                                                               Feature selection    Enabled
8                 <= 1            4th Root               -                                                                                Feature selection    Enabled
9                 <= 1            Square, Square Root    Bulk Interactions (add, subtract, multiply, divide), Weight of Evidence          Feature selection    Enabled
10                0               Identity, Unit Box, Log  Date Decompositions, Number Encoding, Target Encoding, Text (TF-IDF, Frequency)  Feature selection    Enabled
Slide annotation: "Good start"
Scoring Options
[Figure: scorer options for Classification and Regression, with the labels "Best For Imbalanced Data", "Precision" and "Recall"]
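The slide's labels mention precision and recall alongside imbalanced data. As a generic illustration of why those scorers matter there (not Driverless AI's scorer code), the sketch below computes both from 0/1 labels; the function name and the toy data are hypothetical.

```python
from typing import Sequence, Tuple


def precision_recall(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Compute precision and recall for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Imbalanced toy data: 8 negatives, 2 positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall = precision_recall(y_true, y_pred)
print(accuracy, precision, recall)
```

On this toy data plain accuracy is 0.8 while precision and recall are both 0.5, which is the kind of gap that makes accuracy misleading when positives are rare.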
Driverless AI - Machine Learning Interpretability
Gain confidence in models before deploying them!
Why is Machine Learning Interpretability Difficult?
Linear Models
For a given well-understood dataset there is usually one best model.
Machine Learning
For a given well-understood dataset there are usually many good models. This is often referred to as “the multiplicity of good models.”
-- Leo Breiman. “Statistical modeling: The two cultures (with comments and a rejoinder by the author).” Statistical Science. 2001. http://bit.ly/2pwz6m5
Interpretability
Complexity of learned functions:
• Linear, monotonic
• Nonlinear, monotonic
• Nonlinear, non-monotonic
Scope of interpretability:
Global vs. local
Application domain:
Model-agnostic vs. model-specific
Understanding and trust:
Enhancing trust and understanding: the mechanisms and results of an interpretable model should be both transparent AND dependable.
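The three complexity classes can be made concrete with toy functions. The sketch below checks monotonicity numerically on a grid, which is an illustration rather than a proof; the helper `is_monotonic` and the example functions are hypothetical.

```python
from typing import Callable, Sequence


def is_monotonic(f: Callable[[float], float], xs: Sequence[float]) -> bool:
    """Check on an ordered grid whether f never changes direction."""
    ys = [f(x) for x in xs]
    diffs = [b - a for a, b in zip(ys, ys[1:])]
    return all(d >= 0 for d in diffs) or all(d <= 0 for d in diffs)


grid = [x / 10 for x in range(-30, 31)]  # -3.0 to 3.0 in steps of 0.1

print(is_monotonic(lambda x: 2 * x + 1, grid))  # linear, monotonic
print(is_monotonic(lambda x: x ** 3, grid))     # nonlinear, monotonic
print(is_monotonic(lambda x: x ** 2, grid))     # nonlinear, non-monotonic
```

A monotonic response always moves in one direction as the input grows, which is what makes the first two classes easier to explain than the third.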
Global and Local Interpretability
Linear Models
Exact explanations for
approximate models.
Machine Learning
Approximate explanations
for exact models.

Dive into H2O: NYC

  • 1. Introduction to Driverless AI Chemere Davis
  • 2. Confidential2 Please Create an Account on Aquarium
  • 4. Confidential4 H2O.ai Product Suite Automatic feature engineering, machine learning and interpretability • 100% open source – Apache V2 licensed • Built for data scientists – interface using R, Python, Scala, H2O Flow (interactive notebook interface) • Enterprise support subscriptions • Enterprise software • Built for domain users, analysts and data scientists – GUI-based interface for end-to-end data science • Fully automated machine learning from ingest to deployment • User licenses on a per seat basis (annual subscription) H2O AI open source engine integration with Spark Lightning fast machine learning on GPUs In-memory, distributed machine learning algorithms with H2O Flow GUI Open Source
  • 5. Confidential5 The Workflow of Driverless AI SQL HDFS X Y Automatic Model Optimization Automatic Scoring Pipeline Deploy Low-latency Scoring to Production Modelling Dataset Model Recipes • i.i.d. data • Time-series • More on the way Advanced Feature Engineering Algorithm Model Tuning+ + Survival of the Fittest 1 Drag and Drop Data 2 Automatic Visualization 4 Automatic Model Optimization 5 Automatic Scoring Pipelines Snowflake Model Documentation  Upload your own recipe(s) Transformations Algorithms Scorers 3 Bring Your Own Recipes  Driverless AI executes automation on your recipes Feature engineering, model selection, hyper-parameter tuning, overfitting protection  Driverless AI automates model scoring and deployment using your recipes Amazon S3 Google BigQuery Azure Blog Storage
  • 6. Confidential6 Driverless AI: Supervised Learning Regression: How much will a customers spend? Classification: Will a customer make a purchase? Yes or No X y xi xj yes no
  • 7. Confidential7 Confidential7 Confidential7 Driverless AI Features Target Data Quality and Transformation Modeling Table Model Building Model Data Integration + Typical Enterprise ML Workflow
  • 8. Confidential8 Confidential8 Confidential8 Features Target Modeling Table Model Building Model Driverless AI Modeling Data Types • Numeric • Categorical • Time/Date • Text • Missing values allowed Model Types • Regression • Classification – Binary – Multinomial Build Process • Feature engineering – Including NLP (text) • Automated hyperparameter tuning Both iid & Time Series • Single • Grouped • Gap between last observation and prediction
  • 9. Confidential9 Confidential9 How Well Does Driverless AI Work?
  • 10. Confidential10 Top 10 Finish in BNP Kaggle Competition single run, fully automated: 2h on DGX Station! 6h on PC Driverless AI: 10th place in private LB at Kaggle (out of 2926)
  • 11. Confidential11 Top 5% in Amazon Kaggle competition
  • 12. Confidential12 Other Kaggle Competitions: Driverless AI Results 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Allstate BNP Paribas Amazon Homesite Otto Group Relative error: Lower is Better Kaggle Grandmaster Best AutoDL GBM BaselineRelative Error (Lower is Better) Kaggle Grandmaster Best Driverless AI GBM Baseline
  • 14. Confidential14 • Dataset: – Comes from a lender in Taiwan (April – August, 2005) – Information on default payments, demographic factors, credit data, history of payment, etc. – Source: – UCI Machine Learning Library – kaggle.com/uciml/default-of-credit-card-clients-dataset • Our Goal: – Predict whether someone will default on their next credit card payment. Credit Card Payment Default 14
  • 15. Confidential15 The Data
      Column | Description
      ID | ID of each customer
      Default | Defaulted on next payment (1 = yes, 0 = no)
      CreditLimit | Credit limit in NT dollars
      Sex | Gender (M, F)
      Education | 1: graduate school, 2: university, 3: high school, 4: others, 5-6: unknown
      Marriage | Marital status (M, S, D, O)
      Age | Age in years
      Status1 … Status6 | Repayment status in September 2005 back to April 2005
      BillAmt1 … BillAmt6 | Amount of bill statement in September 2005 back to April 2005 (NT dollar)
      PayAmt1 … PayAmt6 | Amount of previous payment in September 2005 back to April 2005 (NT dollar)
  • 16. Confidential16 Payment History Data. 1 month ago: Status1 (≤0, 1), BillAmt1, PayAmt1. 2 months ago: Status2 (≤0, 1, 2), BillAmt2, PayAmt2. 3 months ago: Status3 (≤0, 1, 2, 3), BillAmt3, PayAmt3. ... 6 months ago: Status6 (≤0, 1, ..., 6), BillAmt6, PayAmt6. Status legend: -2: no balance; -1: paid in full; 0: minimum balance paid; 1: one month late; 2: two months late; etc.
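The status legend on this slide can be decoded with a small helper. This is only a sketch: the function name `describe_status` is hypothetical and not part of Driverless AI, and the mapping simply follows the slide's legend.

```python
def describe_status(code: int) -> str:
    """Translate a Status1..Status6 value into the legend text from the slide."""
    # Non-positive codes have fixed meanings in the legend.
    legend = {-2: "No balance", -1: "Paid in full", 0: "Minimum balance paid"}
    if code in legend:
        return legend[code]
    # Positive codes count the number of months a payment is late.
    if code >= 1:
        unit = "month" if code == 1 else "months"
        return f"{code} {unit} late"
    raise ValueError(f"unexpected status code: {code}")


if __name__ == "__main__":
    for value in (-2, -1, 0, 1, 2):
        print(value, "->", describe_status(value))
```

Codes below -2 do not appear in the legend, so the helper rejects them rather than guessing a meaning.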
  • 19. Confidential19 Automatic Visualizations Scalable outlier detection Contains novel statistical algorithms to only show “relevant” aspects of the data (coming soon: automated data cleaning)
  • 21. Confidential21 Experiment Settings 3 KEY SETTINGS Accuracy Time Interpretability
  • 22. Confidential22 Experiment Settings. Accuracy: relative accuracy; higher values should lead to higher confidence in model performance (accuracy). It impacts things such as the level of data sampling, how many models are used in the final ensemble, and the parameter tuning level, among others. Time: relative time for completing the experiment. Higher settings mean more iterations are performed to find the best set of features, and a longer "early stopping" threshold. Interpretability: relative interpretability; higher values favor more interpretable models. The higher the interpretability setting, the lower the complexity of the engineered features and of the final model(s).
  • 23. Confidential23 Accuracy
      Accuracy | Max Rows x Cols | Ensemble Level | Target Transformation | Parameter Tuning Level | Num Folds | Only First Fold Model | Distribution Check
      1 | 100K | 0 | False | 0 | 3 | True | No
      2 | 1M | 0 | False | 0 | 3 | True | No
      3 | 50M | 0 | True | 1 | 3 | True | No
      4 | 100M | 0 | True | 1 | 3-4 | True | No
      5 | 200M | 1 | True | 1 | 3-4 | True | Yes
      6 | 500M | 2 | True | 1 | 3-5 | True | Yes
      7 | 750M | <=3 | True | 2 | 3-10 | Auto | Yes
      8 | 1B | <=3 | True | 2 | 4-10 | Auto | Yes
      9 | 2B | <=3 | True | 3 | 4-10 | Auto | Yes
      10 | 10B | <=4 | True | 3 | 4-10 | Auto | Yes
  • 24. Confidential24 Time
      Time | Iterations | Early Stopping Rounds
      1 | 1-5 | None
      2 | 10 | 5
      3 | 30 | 5
      4 | 40 | 5
      5 | 50 | 10
      6 | 100 | 10
      7 | 150 | 15
      8 | 200 | 20
      9 | 300 | 30
      10 | 500 | 50
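The Time table on this slide can be read as a lookup from the setting to an iteration budget. The sketch below transcribes it into a dictionary; the names `TimeBudget`, `TIME_SETTINGS`, and `time_budget` are hypothetical and are not a Driverless AI API. For setting 1 the slide gives a range of 1-5 iterations, so only the upper bound is stored here.

```python
from typing import NamedTuple, Optional


class TimeBudget(NamedTuple):
    max_iterations: int                    # upper bound on feature-search iterations
    early_stopping_rounds: Optional[int]   # None means no early stopping is listed


# Values transcribed from the slide's Time table (setting -> budget).
TIME_SETTINGS = {
    1: TimeBudget(5, None),   # slide lists "1-5" iterations and "None"
    2: TimeBudget(10, 5),
    3: TimeBudget(30, 5),
    4: TimeBudget(40, 5),
    5: TimeBudget(50, 10),
    6: TimeBudget(100, 10),
    7: TimeBudget(150, 15),
    8: TimeBudget(200, 20),
    9: TimeBudget(300, 30),
    10: TimeBudget(500, 50),
}


def time_budget(setting: int) -> TimeBudget:
    """Return the budget for a Time setting between 1 and 10."""
    if setting not in TIME_SETTINGS:
        raise ValueError(f"Time setting must be between 1 and 10, got {setting}")
    return TIME_SETTINGS[setting]


if __name__ == "__main__":
    print(time_budget(5))
```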
  • 25. Confidential25 Interpretability
      Interpretability | Ensemble Level | Target Transformation | Feature Engineering | Feature Pre-Pruning | Monotonicity Constraints
      1-3 | <= 3 | - | - | None | Disabled
      4 | <= 3 | Inverse | - | None | Disabled
      5 | <= 3 | Anscombe | Clustering (ID, distance), Truncated SVD | None | Disabled
      6 | <= 2 | Logit, Sigmoid | - | Feature selection | Disabled
      7 | <= 2 | - | Frequency Encoding | Feature selection | Enabled
      8 | <= 1 | 4th Root | - | Feature selection | Enabled
      9 | <= 1 | Square, Square Root | Bulk Interactions (add, subtract, multiply, divide), Weight of Evidence | Feature selection | Enabled
      10 | 0 | Identity, Unit Box, Log | Date Decompositions, Number Encoding, Target Encoding, Text (TF-IDF, Frequency) | Feature selection | Enabled
      Slide annotation: "Good start"
  • 27. Confidential27 Driverless AI - Machine Learning Interpretability Gain confidence in models before deploying them!
  • 28. Confidential28 Why is Machine Learning Interpretability Difficult? Linear Models: For a given well-understood dataset there is usually one best model. Machine Learning: For a given well-understood dataset there are usually many good models. This is often referred to as “the multiplicity of good models.” -- Leo Breiman. “Statistical modeling: The two cultures (with comments and a rejoinder by the author).” Statistical Science. 2001. http://bit.ly/2pwz6m5
  • 29. Confidential29 Interpretability. Complexity of learned functions: linear, monotonic; nonlinear, monotonic; nonlinear, non-monotonic. Scope of interpretability: global vs. local. Application domain: model-agnostic vs. model-specific. Enhancing trust and understanding: the mechanisms and results of an interpretable model should be both transparent AND dependable.
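The three complexity classes named on this slide can be illustrated with a numerical check on a grid. This is a sketch with hypothetical example functions (`linear`, `math.tanh`, `math.sin`), chosen only to show one instance of each class; the check is approximate because it only looks at the sampled points.

```python
import math
from typing import Callable, Sequence


def is_monotonic(f: Callable[[float], float], xs: Sequence[float]) -> bool:
    """Return True if f never changes direction over the sorted sample points xs."""
    ys = [f(x) for x in xs]
    diffs = [b - a for a, b in zip(ys, ys[1:])]
    return all(d >= 0 for d in diffs) or all(d <= 0 for d in diffs)


def linear(x: float) -> float:
    """Linear, monotonic: a straight line with positive slope."""
    return 2.0 * x + 1.0


# Sample points from -3.0 to 3.0 in steps of 0.1.
grid = [i / 10 for i in range(-30, 31)]

if __name__ == "__main__":
    print("linear:", is_monotonic(linear, grid))
    print("tanh (nonlinear, monotonic):", is_monotonic(math.tanh, grid))
    print("sin (nonlinear, non-monotonic):", is_monotonic(math.sin, grid))
```

A monotonic response is easier to explain because increasing an input can only ever push the prediction in one direction.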
  • 30. Confidential30 Global and Local Interpretability. Linear Models: exact explanations for approximate models. Machine Learning: approximate explanations for exact models.
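One way to read "exact explanations" for a linear model is that each prediction splits exactly into per-feature contributions. The sketch below shows this additive decomposition. The function name `linear_contributions` is hypothetical, and the intercept and coefficients are made-up illustrative numbers, not values fitted to the credit card dataset.

```python
from typing import Dict, Tuple


def linear_contributions(
    intercept: float, coefs: Dict[str, float], row: Dict[str, float]
) -> Tuple[float, Dict[str, float]]:
    """Return the linear prediction and each feature's exact contribution to it."""
    # Each feature contributes coefficient * value; nothing is approximated.
    contributions = {name: coefs[name] * row[name] for name in coefs}
    prediction = intercept + sum(contributions.values())
    return prediction, contributions


if __name__ == "__main__":
    # Illustrative (not fitted) coefficients for two columns of the dataset.
    coefs = {"Age": 0.25, "Status1": 2.0}
    row = {"Age": 40.0, "Status1": 1.0}
    prediction, parts = linear_contributions(0.5, coefs, row)
    print(prediction, parts)
```

The contributions plus the intercept always sum to the prediction, which is the local explanation for that row, while the coefficients themselves serve as the global explanation.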