The era of Big Data has passed, and the era of sensory overload – that is, the proliferation of sensor data – is upon us. The challenge today is how to create the next generation of business and consumer applications that transform how we interact with sensors themselves. Applications need to learn from every user interaction and data point and predict what can happen next. The future depends on Machine Learning, as much as it depends on the data itself, to change the way we interact with these systems.
In this talk, we explain H2O’s scalable distributed in-memory math architecture and its design principles. The platform was built alongside (and on top of) both Hadoop and Spark clusters and includes interfaces for R, Python, Scala, Java, JavaScript and JSON, along with an interactive graphical Flow interface that makes it easier for non-engineers to stitch together complete analytic workflows. We outline the implementation of distributed machine learning algorithms such as Elastic Net, Random Forest, Gradient Boosting and Deep Learning. We present a broad range of use cases and live demos that include world-record deep learning models, anomaly detection tools and approaches for Kaggle data science competitions. We also demonstrate the applicability of H2O in enterprise environments for real-world customer production use cases. By the end of this presentation, you will know how to create your own machine learning workflows on your data using R, Python (iPython Notebooks) or the Flow GUI.
1. SCALABLE DATA SCIENCE AND DEEP LEARNING WITH H2O
Arno Candel, H2O.ai
Open Data Science Conference, Boston 2015
@opendatasci
2. H2O.ai
Machine Intelligence
Who Am I?
Arno Candel
Chief Architect,
Physicist & Hacker at H2O.ai
PhD Physics, ETH Zurich 2005
10+ yrs Supercomputing (HPC)
6 yrs at SLAC (Stanford Lin. Accel.)
3.5 yrs Machine Learning
1.5 yrs at H2O.ai
Fortune Magazine
Big Data All Star 2014
Follow me @ArnoCandel
3. Outline
• Introduction
• H2O Deep Learning Architecture
• Live Demos:
  ◦ Flow GUI - Airline Dataset
  ◦ R - MNIST World Record + Anomaly Detection
  ◦ Flow GUI - Higgs Boson Classification
  ◦ Sparkling Water - Chicago Crime Prediction
  ◦ iPython - CitiBike Demand Prediction
  ◦ Scoring Engine - Million Songs Classification
• Outlook
5. H2O - Product Overview
• In-Memory ML
  Memory-Efficient Data Structures
  Cutting-Edge Algorithms
• Distributed
  Use all your Data (No Sampling)
  Accuracy with Speed and Scale
• Open Source
  Ownership of Methods - Apache V2
  Easy to Deploy: Bare, Hadoop, Spark, etc.
• APIs
  Java, Scala, R, Python, JavaScript, JSON
  NanoFast Scoring Engine (POJO)
9. Actual Customer Use Cases
Ad Optimization (200% CPA Lift with H2O)
P2B Model Factory (60k models, 15x faster with H2O than before)
Fraud Detection (11% higher accuracy with H2O Deep Learning - saves millions)
…and many large insurance and financial services companies!
Real-time marketing (H2O is 10x faster than anything else)
12. Results in Seconds on Big Data
Logistic Regression: ~20s
elastic net, alpha=0.5, lambda=1.379e-4 (auto)
Deep Learning: ~70s
4 hidden ReLU layers of 20 neurons, 2 epochs
8-node EC2 cluster: 64 virtual cores, 1GbE
[Figure annotations: Year, Month, Sched. Dep. Time have non-linear impact; Chicago, Atlanta, Dallas: often delayed; all cores maxed out; +9% AUC]
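The logistic regression run above uses an elastic net penalty with alpha=0.5 and lambda=1.379e-4. As a rough illustration of what those two numbers control, the sketch below evaluates the penalty in the common glmnet-style form, lambda * (alpha * |beta|_1 + (1 - alpha)/2 * |beta|_2^2). That parameterization, the function name and the coefficient values are assumptions for illustration; the slides do not spell out the formula.

```python
def elastic_net_penalty(beta, alpha=0.5, lam=1.379e-4):
    """Elastic net penalty in the glmnet-style form (assumed parameterization).

    alpha=1 gives a pure L1 (lasso) penalty, alpha=0 a pure L2 (ridge) penalty.
    """
    l1 = sum(abs(b) for b in beta)        # sum of absolute coefficients
    l2_sq = sum(b * b for b in beta)      # sum of squared coefficients
    return lam * (alpha * l1 + (1.0 - alpha) / 2.0 * l2_sq)


# Hypothetical coefficient vector, not taken from the airline model.
example_beta = [0.8, -1.5, 0.0, 2.0]
penalty = elastic_net_penalty(example_beta)
print(f"penalty = {penalty:.6e}")
```

With alpha=0.5 the penalty mixes both terms equally, which is why some coefficients can be driven to exactly zero while the rest are only shrunk.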
13. H2O Deep Learning
Multi-layer feed-forward Neural Network
Trained with back-propagation (SGD, ADADELTA)
+ distributed processing for big data
(fine-grain in-memory MapReduce on distributed data)
+ multi-threaded speedup
(async fork/join worker threads operate at FORTRAN speeds)
+ smart algorithms for fast & accurate results
(automatic standardization, one-hot encoding of categoricals, missing value imputation, weight & bias initialization, adaptive learning rate, momentum, dropout/L1/L2 regularization, grid search, N-fold cross-validation, checkpointing, load balancing, auto-tuning, model averaging, etc.)
= powerful tool for (un)supervised machine learning on real-world data
[Figure annotation: all 320 cores maxed out]
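Two of the preprocessing steps listed above, automatic standardization and one-hot encoding of categoricals, can be sketched in a few lines of plain Python. This is a minimal illustration only: the function names, the use of the population standard deviation and the sample columns are assumptions, and the slides do not describe how H2O implements these steps internally.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale a numeric column to zero mean and unit variance."""
    mu = mean(values)
    sigma = pstdev(values)  # population std dev (an assumption of this sketch)
    if sigma == 0:
        return [0.0 for _ in values]  # constant column carries no signal
    return [(v - mu) / sigma for v in values]


def one_hot(values):
    """Expand a categorical column into one 0/1 indicator per level."""
    levels = sorted(set(values))
    index = {level: i for i, level in enumerate(levels)}
    rows = []
    for v in values:
        row = [0] * len(levels)
        row[index[v]] = 1
        rows.append(row)
    return levels, rows


# Hypothetical columns for illustration.
distances = [100.0, 200.0, 300.0]
origins = ["ORD", "ATL", "ORD"]

scaled = standardize(distances)
levels, encoded = one_hot(origins)
print(scaled)
print(levels, encoded)
```

Standardizing inputs keeps all features on a comparable scale for gradient descent, and one-hot encoding lets a neural network consume categorical columns as numeric inputs.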
14. H2O Deep Learning Architecture
[Diagram labels: nodes/JVMs: sync communication; threads: async; K-V; HTTPD]
Main Loop:
initial model: weights and biases w
(H2O in-memory non-blocking hash map: K-V store)
map: each node trains a copy of the weights and biases with (some* or all of) its local data with asynchronous F/J threads, producing w1, w2, w3, w4 in the four-node diagram
reduce: model averaging: average weights and biases from all nodes (w1+w3 and w2+w4 are combined into w* = (w1+w2+w3+w4)/4), speedup is at least #nodes/log(#rows)
http://arxiv.org/abs/1209.4129
updated model: w*
Keep iterating over the data (“epochs”), score at user-given times
Query & display the model via JSON, WWW
*auto-tuned (default) or user-specified number of rows per MapReduce iteration
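The map/reduce loop on this slide can be mimicked on a toy problem. The sketch below is not H2O code: it assumes a tiny linear model, plain per-row SGD as the local training step, four in-process "nodes" and invented data, purely to show each node refining its own copy of the weights and the reduce step averaging them into w*. All function names are hypothetical.

```python
def local_sgd(weights, shard, lr=0.1):
    """Map step: one node refines its own copy of the weights on its local rows."""
    w = list(weights)  # each node works on a private copy
    for x, y in shard:
        pred = sum(wi * xi for wi, xi in zip(w, x))
        err = pred - y
        w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w


def average_models(models):
    """Reduce step: element-wise mean of the per-node weight vectors."""
    n = len(models)
    return [sum(ws) / n for ws in zip(*models)]


def train(shards, n_features, epochs=200):
    """Main loop: broadcast w, train locally on every shard, average, repeat."""
    w = [0.0] * n_features
    for _ in range(epochs):
        local_models = [local_sgd(w, shard) for shard in shards]  # map
        w = average_models(local_models)                          # reduce
    return w


# Invented data following y = 2*x0 + 1 (the second input is a constant bias of 1.0),
# split across four hypothetical nodes.
shards = [
    [([1.0, 1.0], 3.0), ([2.0, 1.0], 5.0)],
    [([0.0, 1.0], 1.0), ([3.0, 1.0], 7.0)],
    [([1.5, 1.0], 4.0), ([0.5, 1.0], 2.0)],
    [([2.5, 1.0], 6.0), ([1.0, 1.0], 3.0)],
]

w_star = train(shards, n_features=2)
print(w_star)
```

In this toy setup the averaged weights approach the generating coefficients (2 and 1) as the epochs accumulate; the real system additionally runs the local step with many asynchronous threads per node.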
15. H2O Deep Learning beats MNIST
MNIST: Handwritten digits: 28^2=784 gray-scale pixel values
full run: 10 hours on 10-node cluster
2 hours on desktop gets to 0.9% test set error
Just supervised training on original 60k/10k dataset:
No data augmentation
No distortions
No convolutions
No pre-training
No ensemble
0.83% test set error: current world record
1-liner: call h2o.deeplearning() in R
17. Higgs Boson - Classification Problem
Higgs vs Background
Large Hadron Collider: Largest experiment of mankind!
$13+ billion, 16.8 miles long, 120 MegaWatts, -456F, 1PB/day, etc.
Higgs boson discovery (July ’12) led to 2013 Nobel prize!
Images courtesy CERN / LHC
18. UCI Higgs Dataset: 11M rows, 29 cols
C2-C22: 21 low-level features (detector data)
7 high-level features (physics formulae)
Assume we don’t know Physics…
19. Deep Learning for Higgs Detection?
Former CERN baseline for AUC: 0.733 and 0.816

H2O Algorithm              | low-level H2O AUC | all features H2O AUC
Generalized Linear Model   | 0.596             | 0.684
Random Forest              | 0.764             | 0.840
Gradient Boosted Trees     | 0.753             | 0.839
Neural Net 1 hidden layer  | 0.760             | 0.830
H2O Deep Learning          | ?                 | ?

[Arrow label between the two AUC columns: add derived features]
Q: Can Deep Learning learn Physics for us?
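The comparison above is expressed in AUC, which can be read as the probability that a randomly chosen signal event receives a higher score than a randomly chosen background event. The sketch below computes it directly from that definition. It is a quadratic-time illustration for small inputs, with invented labels and scores, and it is not how H2O computes the metric.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Invented example: 1 = Higgs signal, 0 = background.
labels = [1, 0, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
print(auc(labels, scores))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so moving from about 0.6 to above 0.8 in the table is a substantial gain.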
21. H2O Deep Learning Higgs Demo
Deep DL model on low-level features only
train 10M rows
valid 500k rows
test 500k rows
8 EC2 nodes:
AUC = 0.86 after 100 mins
AUC = 0.87+ overnight
H2O: same results as Nature paper
Deep Learning just learned Particle Physics!
29. Build H2O Deep Learning Model
Train an H2O Deep Learning Model on Data obtained by Spark SQL Query
Predict whether Arrest will be made with AUC of 0.90+
36. POJO Scoring Engine
Standalone Java scoring code is auto-generated!
Fast and easy path to Production (batch or real-time)!
Example: First GBM tree
Note: no heap allocation, pure decision-making
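The "pure decision-making" idea can be pictured as a tree compiled into nested conditionals. The sketch below is a hypothetical stand-in, written in Python for brevity, whereas the generated scoring code is Java. The feature names, thresholds and leaf values are invented, and the "no heap allocation" property of the Java code does not carry over to Python; only the branching structure does.

```python
def score_tree(dep_hour, distance, is_weekend):
    """Hypothetical single tree expressed as nested conditionals.

    Every call walks one root-to-leaf path and returns that leaf's value;
    there are no loops, lookups or intermediate data structures.
    """
    if dep_hour < 15.5:
        if distance < 800.0:
            return -0.12
        return -0.05
    if is_weekend:
        return 0.03
    return 0.18


# Score one made-up row: a late-afternoon weekday flight.
print(score_tree(dep_hour=18.0, distance=500.0, is_weekend=False))
```

A boosted model would sum the outputs of many such functions, one per tree, which is what keeps per-row scoring cheap enough for real-time use.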
43. Outlook - Algorithm Roadmap
• Ensembles (Erin LeDell et al.)
• Automated Hyper-Parameter Tuning
• Convolutional Layers for Deep Learning
• Natural Language Processing: tf-idf, Word2Vec, …
• Generalized Low Rank Models
• PCA, SVD, K-Means, Matrix Factorization
• Recommender Systems
And many more!
Public JIRAs - Join H2O!
44. Key Take-Aways
H2O is an open source predictive analytics platform for data scientists and business analysts who need scalable, fast and accurate machine learning.
H2O Deep Learning is ready to take your advanced analytics to the next level.
Try it on your data!
https://github.com/h2oai
H2O Google Group
http://h2o.ai
@h2oai
Thank You!