Machine Learning with H2O, Spark, and Python at Strata SJ 2015, by Cliff Click and Michal Malohlava
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
2. H2O.ai
Machine Intelligence
Who Am I?
Cliff Click
CTO, Co-Founder H2O.ai
cliff@h2o.ai
40 yrs coding
35 yrs building compilers
30 yrs distributed computation
20 yrs OS, device drivers, HPC, HotSpot
10 yrs Low-latency GC, custom java hardware
NonBlockingHashMap
20 patents, dozens of papers
100s of public talks
PhD Computer Science
1995 Rice University
HotSpot JVM Server Compiler
“showed the world JITing is possible”
3. H2O Open Source In-Memory
Machine Learning for Big Data
Distributed In-Memory Math Platform
GLM, GBM, RF, K-Means, PCA, Deep Learning
Easy to use SDK & API
Java, R (CRAN), Scala, Spark, Python, JSON, Browser GUI
Use ALL your data
Modeling without sampling
HDFS, S3, NFS, NoSQL
Big Data & Better Algorithms
Better Predictions!
5. Practical Machine Learning
Value → Requirement
● Fast & Interactive → In-Memory
● Big Data (No Sampling) → Distributed
● Ownership → Open Source
● Extensibility → API/SDK
● Portability → Java, REST/JSON
● Infrastructure → Cloud or On-Premise (Hadoop or Private Cluster)
6. H2O Architecture
● Interfaces: Prediction Engine, R & Exec Engine, Web Interface, Spark Scala REPL
● Nano-Fast Scoring Engine
● Core: Distributed In-Memory K/V Store, Column-Compressed Data, Map/Reduce, Memory Manager
● Algorithms: GBM, Random Forest, GLM, PCA, K-Means, Deep Learning
● Data sources: HDFS, S3, NFS, Real-Time DataFlow
8. Python & Sparkling Water
● CitiBike of NYC
● Predict bikes-per-hour-per-station
– From per-trip logs
● 10M rows of data
● Group-By, date/time feature-munging
Demo!
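The group-by and date/time munging the demo performs (counting trips per station per day from raw per-trip timestamps) can be sketched in plain Python. The sample rows and column layout below are hypothetical; the real demo does this distributed with H2O and Spark:

```python
from collections import Counter
from datetime import datetime

# Hypothetical per-trip log rows: (starttime, start_station_id)
trips = [
    ("2013-07-01 00:00:02", 164),
    ("2013-07-01 00:05:41", 164),
    ("2013-07-02 09:15:00", 237),
]

# Group-By: trips (bikes) per station per day
bikes_per_day = Counter()
for starttime, station in trips:
    day = datetime.strptime(starttime, "%Y-%m-%d %H:%M:%S").date()
    bikes_per_day[(day, station)] += 1

print(bikes_per_day[(datetime(2013, 7, 1).date(), 164)])  # 2
```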
9. H2O: A Platform for Big Math
● Most Any Java on Big 2-D Tables
– Write like it's single-threaded POJO code
– Runs distributed & parallel by default
● Fast: a billion-row logistic regression runs in 4 sec
● World's first parallel & distributed GBM
– Plus Deep Learning / Neural Nets, RF, PCA, K-Means...
● R integration: use terabyte datasets from R
● Sparkling Water: Direct Spark integration
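The "write it single-threaded, run it parallel" model is at heart a map/reduce over the row chunks of a column-compressed frame: the user's code sees one chunk at a time, and the framework fans out the map and combines the partials. A toy Python sketch of the pattern (illustrative only, not H2O's actual Java API):

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# A "big 2-D table" column, stored as row chunks (as H2O stores frames)
chunks = [[1, 2, 3], [4, 5, 6], [7, 8, 9, 10]]

def map_chunk(chunk):
    # User writes this as if single-threaded: sum one chunk of a column
    return sum(chunk)

def reduce_results(a, b):
    # Framework combines per-chunk partial results
    return a + b

# Framework runs the map over chunks in parallel, then reduces
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(map_chunk, chunks))
total = reduce(reduce_results, partials)
print(total)  # 55
```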
10. H2O: A Platform for Big Math
● Easy launch: “java -jar h2o.jar”
– No GC tuning: -Xmx as big as you like
● Production ready:
– Private on-premise cluster OR
In the Cloud
– Hadoop, Yarn, EC2, or standalone cluster
– HDFS, S3, NFS, URI & other data sources
– Open Source, Apache v2
11. Can I call H2O's algorithms from my Spark workflow?
14. Sparkling Water Provides
● Transparent integration into the Spark ecosystem
● Pure H2ORDD encapsulating an H2O DataFrame
● Transparent use of H2O data structures and algorithms with the Spark API
● Excels in Spark workflows requiring advanced machine learning algorithms
16. Data Distribution
[Diagram: a Sparkling Water cluster of Spark Executor JVMs, each also hosting an H2O instance; a data source (e.g. HDFS) loads into a Spark RDD and an H2O RDD; RDDs and DataFrames share the same memory space]
22. LOAD CITIBIKE DATA
USING H2O API
val dataFiles = Array[String](
"2013-07.csv", "2013-08.csv", "2013-09.csv", "2013-10.csv",
"2013-11.csv", "2013-12.csv").map(f => new java.io.File(DIR_PREFIX, f))
// Load and parse data
val bikesDF = new DataFrame(dataFiles:_*)
// Rename columns and remove all spaces in header
val colNames = bikesDF.names().map( n => n.replace(' ', '_'))
bikesDF._names = colNames
bikesDF.update(null)
23. USER-DEFINED COLUMN TRANSFORMATION
// Select column 'starttime'
val startTimeF = bikesDF('starttime)
// Invoke column transformation and append the created column
bikesDF.add(new TimeSplit().doIt(startTimeF))
// Do not forget to update frame in K/V store
bikesDF.update(null)
24. OPEN H2O FLOW UI
openFlow
AND EXPLORE DATA...
> getFrames
...
26. USE SPARK SQL
// Register the RDD as a SQL table
sqlContext.registerRDDAsTable(bikesRdd, "bikesRdd")
// Perform SQL group operation
val bikesPerDayRdd = sql(
"""SELECT Days, start_station_id, count(*) bikes
|FROM bikesRdd
|GROUP BY Days, start_station_id """.stripMargin)
27. FROM RDD TO H2O'S DATAFRAME
val bikesPerDayDF:DataFrame = bikesPerDayRdd
AND PERFORM ADDITIONAL COLUMN TRANSFORMATION
// Select "Days" column
val daysVec = bikesPerDayDF('Days)
// Refine column into "Month" and "DayOfWeek"
val finalBikeDF = bikesPerDayDF.add(new TimeTransform().doIt(daysVec))
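The TimeTransform step refines the Days column into Month and DayOfWeek. Assuming Days counts days since the Unix epoch (an assumption about the demo's encoding), the refinement can be sketched in Python:

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)

def time_transform(days: int) -> tuple:
    """Refine an epoch day index into (Month, DayOfWeek)."""
    d = EPOCH + timedelta(days=days)
    return d.month, d.weekday()  # weekday(): Monday=0 .. Sunday=6

# Day 15887 since the epoch is 2013-07-01, a Monday
print(time_transform(15887))  # (7, 0)
```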
29. GBM MODEL BUILDER
def buildModel(df: DataFrame, trees: Int = 200, depth: Int = 6): R2 = {
// Split into train and test parts
val frs = splitFrame(df, Seq("train.hex", "test.hex", "hold.hex"), Seq(0.6, 0.3, 0.1))
val (train, test, hold) = (frs(0), frs(1), frs(2))
// Configure GBM parameters
val gbmParams = new GBMParameters()
gbmParams._train = train
gbmParams._valid = test
gbmParams._response_column = 'bikes
gbmParams._ntrees = trees
gbmParams._max_depth = depth
// Build a model
val gbmModel = new GBM(gbmParams).trainModel.get
// Score datasets
Seq(train,test,hold).foreach(gbmModel.score(_).delete)
// Collect R2 metrics
val result = R2("Model #1", r2(gbmModel, train), r2(gbmModel, test), r2(gbmModel, hold))
// Perform clean-up
Seq(train, test, hold).foreach(_.delete())
result
}
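The splitFrame call above partitions the frame's rows into train/test/holdout parts by the given ratios. A minimal Python sketch of that kind of ratio split (H2O's real implementation is distributed, and its randomization scheme differs):

```python
import random

def split_frame(rows, ratios, seed=42):
    """Randomly partition rows into one part per ratio (ratios sum to 1)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    # Turn cumulative ratios into slice boundaries
    bounds, acc = [], 0.0
    for r in ratios[:-1]:
        acc += r
        bounds.append(int(round(acc * len(shuffled))))
    bounds.append(len(shuffled))
    parts, start = [], 0
    for end in bounds:
        parts.append(shuffled[start:end])
        start = end
    return parts

train, test, hold = split_frame(list(range(100)), [0.6, 0.3, 0.1])
print(len(train), len(test), len(hold))  # 60 30 10
```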
30. BUILD A GBM MODEL
val result1 = buildModel(finalBikeDF)
31. CAN WE IMPROVE THE MODEL BY USING INFORMATION ABOUT WEATHER?
32. LOAD WEATHER DATA
USING SPARK API
// Load weather data in NY 2013
val weatherData = sc.textFile(DIR_PREFIX + "31081_New_York_City__Hourly_2013.csv")
// Parse and filter the data
val weatherRdd = weatherData.map(_.split(",")).
map(row => NYWeatherParse(row)).
filter(!_.isWrongRow()).
filter(_.HourLocal == Some(12)).setName("weather").cache()
33. CREATE A JOINED TABLE
USING H2O'S DATAFRAME AND SPARK'S RDD
// Join with bike table
sqlContext.registerRDDAsTable(weatherRdd, "weatherRdd")
sqlContext.registerRDDAsTable(asSchemaRDD(finalBikeDF), "bikesRdd")
val bikesWeatherRdd = sql(
"""SELECT b.Days, b.start_station_id, b.bikes,
|b.Month, b.DayOfWeek,
|w.DewPoint, w.HumidityFraction, w.Prcp1Hour,
|w.Temperature, w.WeatherCode1
| FROM bikesRdd b
| JOIN weatherRdd w
| ON b.Days = w.Days
""".stripMargin)
34. BUILD A NEW MODEL
USING SPARK'S RDD IN H2O'S API
val result2 = buildModel(bikesWeatherRdd)
35. Check out H2O.ai Training Books
http://learn.h2o.ai/
Check out the H2O.ai Blog
http://h2o.ai/blog/
Check out the H2O.ai YouTube Channel
https://www.youtube.com/user/0xdata
Check out GitHub
https://github.com/h2oai
More info