H2O – The Open Source Math Engine
Big Data Science
with H2O in R
4/23/13
H2O – Open Source Math & Machine Learning for Big Data
Anqi Fu, August 2013
Universe is sparse. Life is messy.
Data is sparse & messy.
- Lao Tzu
Introduction to Big Data
• There are about as many bits of information in our digital
universe as there are stars in our actual universe.
• The process to decode the human genome took 10 years.
It can now be done in a week.
• Big data means more than “lots of data”
H2O – The Open Source Math Engine
Better Predictions
Same Interface
Installation
1. Install and run H2O
• Command line: java -Xmx2g -jar h2o.jar
• Pull up http://localhost:54321 in browser
2. Install the R package
• install.packages(c("RCurl", "rjson", "bitops"))
• install.packages("Path/To/Package/h2o_1.2.3.tar.gz", repos = NULL, type = "source")
  (Replace Path/To/Package with the location of the downloaded package.)
3. In R console, type library(h2o)
• demo(package = "h2o")
• demo(h2o.glm)
Always have H2O running first!
Basic R Script
1. Tell R where H2O is running:
localH2O = new("H2OClient", ip = "127.0.0.1", port = 54321)
2. Check connection:
h2o.checkClient(localH2O)
3. Pass H2OClient as parameter to import:
h2o.importFile(localH2O, path = "Path/To/Data", …)
Overview of Objects
• H2OClient: ip=character, port=numeric
• H2OParsedData: h2o=H2OClient, key=character
• H2OGLMModel: key=character, data=H2OParsedData,
model=list(coefficients, deviance, aic, etc)
Example: myModel@model$coefficients
[Diagram: data sets held in H2O under their keys, key="prostate.hex" and key="airlines.hex"]
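The object layout above can be tried out without a running H2O. The sketch below is only an illustration: the Mock* class names, the model key, and the coefficient values are made up, and only the slot names and types are taken from the slide. It shows how @ reads a slot and $ reads an element of the model list.

```r
library(methods)

# Hypothetical stand-ins: slot names and types follow the slide, but these
# are not the h2o package's own class definitions.
setClass("MockClient", representation(ip = "character", port = "numeric"))
setClass("MockParsedData", representation(h2o = "MockClient", key = "character"))
setClass("MockGLMModel", representation(key = "character",
                                        data = "MockParsedData",
                                        model = "list"))

client <- new("MockClient", ip = "127.0.0.1", port = 54321)
prostate <- new("MockParsedData", h2o = client, key = "prostate.hex")

# Made-up coefficient values, only there to show the access pattern.
myModel <- new("MockGLMModel", key = "glm_example", data = prostate,
               model = list(coefficients = c(Intercept = -1.0, PSA = 0.05)))

# Slots are read with @; elements of the model list with $.
print(myModel@model$coefficients)
print(myModel@data@key)
```

The parsed-data object carries only a client and a key, which is why it can act as a lightweight handle to data that stays in H2O.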
Overview of Methods
Standard R                   H2O
read.csv, read.table, etc    h2o.importFile, h2o.importURL
summary                      summary (limited to data only)
glm, glmnet                  h2o.glm(y, x, data, family, nfolds, alpha, lambda)
kmeans                       h2o.kmeans(data, centers, cols, iter.max)
randomForest, cforest        h2o.randomForest(y, x_ignore, data, ntree, depth, classwt)
Demo 1: Basic GLM in H2O through R
Demo 1: Prostate Cancer Data
• Prostate cancer data set from Ohio State University
Comprehensive Cancer Center
• N = 380 patients, ages ranging from 43 to 79
• Goal: Predict presence of tumor from the patient's baseline exam (age, race, PSA, total Gleason score, etc.)
Prostate Cancer
[Plot] Data: y = CAPSULE (0 = no tumor, 1 = tumor); x = PSA (prostate-specific antigen)
Prostate Cancer
Logistic Regression Fit
Family: Binomial, Link: Logit
[Plot] Data: y = CAPSULE (0 = no tumor, 1 = tumor); x = PSA (prostate-specific antigen)
Goal: Estimate probability that CAPSULE = 1
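With a logit link, the fitted probability is the inverse logit of a linear function of the predictor. A minimal sketch of that mapping, using hypothetical coefficients b0 and b1 that were not fitted to the prostate data:

```r
# Inverse logit: maps a linear predictor to a probability in (0, 1).
inv_logit <- function(eta) 1 / (1 + exp(-eta))

# Hypothetical intercept and slope, chosen only for illustration.
b0 <- -1.0
b1 <- 0.05

# Estimated P(CAPSULE = 1) at a few example PSA values.
psa <- c(0, 10, 20, 40)
p_capsule <- inv_logit(b0 + b1 * psa)
print(round(p_capsule, 3))
```

With a positive slope the estimated probability rises with PSA but always stays strictly between 0 and 1.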
GLM Parameters
• y = response variable
• x = predictor variables (vector)
• family = binomial (default link = logit)
• data = H2OParsedData object
• nfolds = number of cross-validation folds
• lambda = weight on the penalty term
• alpha = elastic net mixing parameter
  • alpha = 0 is ridge penalty (squared L2 norm)
  • alpha = 1 is lasso penalty (L1 norm)
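To see how lambda and alpha interact, here is a small sketch of the elastic net penalty in the common glmnet-style form, lambda * (alpha * L1 + (1 - alpha) / 2 * squared L2). That H2O uses exactly this scaling is an assumption here; the function name and coefficient values are hypothetical.

```r
# Elastic net penalty, glmnet-style scaling (assumed, not taken from the slides).
elastic_net_penalty <- function(beta, lambda, alpha) {
  l1 <- sum(abs(beta))   # lasso part: L1 norm
  l2 <- sum(beta^2)      # ridge part: squared L2 norm
  lambda * (alpha * l1 + (1 - alpha) / 2 * l2)
}

beta <- c(0.5, -2, 0)    # made-up coefficients
ridge <- elastic_net_penalty(beta, lambda = 0.1, alpha = 0)
lasso <- elastic_net_penalty(beta, lambda = 0.1, alpha = 1)
mixed <- elastic_net_penalty(beta, lambda = 0.1, alpha = 0.5)
print(c(ridge = ridge, lasso = lasso, mixed = mixed))
```

Any alpha between 0 and 1 blends the two penalties, and lambda scales the total.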
Under the Hood: Hacking R for H2O
Under the Hood
[Diagram: Data Scientist, Analyst, etc. | REST API, data as JSON | H2O: Import, Parse]
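Talking to a REST API from R comes down to requesting a URL and reading back JSON. The sketch below only builds such a URL in base R. The helper name build_request_url, the column names AGE and PSA, and the exact URL layout are assumptions for illustration; only the GLM.json page and the key, y, x parameter names appear in these slides.

```r
# Hypothetical helper: assembles a request URL from an address, a page name,
# and named query parameters. The real H2O URL layout may differ.
build_request_url <- function(ip, port, page, ...) {
  params <- list(...)
  values <- vapply(params,
                   function(v) URLencode(as.character(v), reserved = TRUE),
                   character(1))
  query <- paste0(names(params), "=", values, collapse = "&")
  paste0("http://", ip, ":", port, "/", page, "?", query)
}

url <- build_request_url("127.0.0.1", 54321, "GLM.json",
                         key = "prostate.hex", y = "CAPSULE",
                         x = paste(c("AGE", "PSA"), collapse = ","))
print(url)
```

Fetching the URL and parsing the JSON reply is where the RCurl and rjson packages from the installation step would come in.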
GLM Code Snippet
• Create an object to represent model ("H2OGLMModel" is the class name; key, data and model are its slots)
setClass("H2OGLMModel", representation(key="character", data="H2OParsedData", model="list"))
• Declare new method for algorithm (nfolds, alpha and lambda are parameters with initial values)
setGeneric("h2o.glm", function(x, y, data, family, nfolds = 10, alpha = 0.5, lambda = 1.0e-5) { standardGeneric("h2o.glm") })
GLM Code Snippet
setMethod("h2o.glm", signature(x="character", y="character", data="H2OParsedData", …), function(x, y, data, …) {
• Send parameters to GLM.json page → GLM job started
res = h2o.__remoteSend(data@h2o, h2o.__PAGE_GLM, key = data@key, y = y, x = paste(x, sep="", collapse=","), …)
• Keep polling and wait until job completed
while(h2o.__poll(data@h2o, res$response$redirect_request_args$job) != -1) { Sys.sleep(1) }
• Query Inspect.json page with GLM model key to get results
res = h2o.__remoteSend(data@h2o, h2o.__PAGE_INSPECT, key=res$destination_key)
S4 tutorial: http://cran.r-project.org/doc/contrib/Genolini-S4tutorialV0-5en.pdf
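The start, poll, fetch pattern in the snippet can be tried without a server. In the sketch below, make_mock_job and wait_for_job are hypothetical helpers: the mock reports progress on the first polls and -1 once done, mirroring the != -1 check in the slide. The max_polls cap is an addition so that a stuck job cannot loop forever.

```r
# Hypothetical stand-in for a server-side job: returns a progress fraction
# until it has been polled polls_until_done times, then -1.
make_mock_job <- function(polls_until_done) {
  calls <- 0
  function() {
    calls <<- calls + 1
    if (calls >= polls_until_done) -1 else calls / polls_until_done
  }
}

# Poll until the job reports -1, sleeping between polls; give up after max_polls.
wait_for_job <- function(poll, interval = 0.01, max_polls = 100) {
  for (i in seq_len(max_polls)) {
    if (poll() == -1) return(i)   # number of polls it took
    Sys.sleep(interval)
  }
  stop("job did not finish within max_polls polls")
}

poll_job <- make_mock_job(polls_until_done = 3)
n_polls <- wait_for_job(poll_job)
print(n_polls)
```

Once the wait returns, the real code goes on to query the result page with the model key.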
Demo 2: Data Munging and Remote H2O
Demo 2: Airlines Data
• Airlines data set 1987-2013 from RITA (25%)
• Goal: Predict if flight’s arrival will be delayed
• Examine slices of data directly
head(airlines.hex, n = 10); tail(airlines.hex)
summary(airlines.hex$DepTime)
• Take a subset of data to play with in R
airlines.small = as.data.frame(airlines.hex[1:1000,])
glm(IsArrDelayed ~ Dest + Origin, family = binomial, data = airlines.small)
http://www.transtats.bts.gov/Fields.asp?Table_ID=236
Connecting to H2O Remotely
• Your slip of paper contains IP/port of your assigned cluster
• Point R to remote H2O client
remoteH2O = new("H2OClient", ip = "192.168.1.161", port = 54321)
• All data operations occur on cluster
h2o.importFile(remoteH2O, path = "Path/On/Remote/Server/To/Data", …)
• Objects/methods operate just like before!
Roadmap
• Long-term Goal: Full H2O/R Integration
• Subset col by name/index: df[,c(1,2)]; df[,"name"]
• Add/Remove cols: df[,-c(1,2)]; df[,3] = df[,2] + 1
• Filter rows: df[df$cName < 5,]
• Combine data frames by row/col: rbind, cbind
• Apply functions: tapply, sapply, lapply
• Support for R libraries (plyr, ggplot2, etc)
• More Algorithms: GBM, PCA, Neural Networks
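For reference, this is what the data frame operations on the roadmap do in standard R. The sketch runs on an ordinary in-memory data frame with made-up values, not on an H2O object; the column names a, b and cName are placeholders.

```r
# Small made-up data frame standing in for df in the roadmap bullets.
df <- data.frame(a = 1:6, b = seq(2, 12, by = 2), cName = c(1, 3, 5, 7, 9, 11))

first_two <- df[, c(1, 2)]                 # subset columns by index
by_name   <- df[, "cName"]                 # subset a column by name
dropped   <- df[, -c(1, 2), drop = FALSE]  # remove columns 1 and 2

df2 <- df
df2[, 3] <- df2[, 2] + 1                   # overwrite a column from another one

small   <- df[df$cName < 5, ]              # filter rows
stacked <- rbind(df, df)                   # combine by row
wide    <- cbind(df, d = df$a * 2)         # combine by column

col_means   <- sapply(df, mean)            # apply a function to each column
group_means <- tapply(df$b, df$a %% 2, mean)  # mean of b for even and odd a
print(col_means)
print(group_means)
```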
Questions and Suggestions?


Editor's notes

  1. http://docs.0xdata.com/quickstart/quickstart_R.html Packages → Install package(s) → Select CRAN mirror (US CA1) → Search for RCurl, rjson and bitops
  2. Pull up R and demo this in the console, making sure everyone can follow along
  3. H2OParsedData: Each data set/calculation is associated with a unique hex key; the object acts like a "pointer". Model: coefficients, deviance, aic, df.residual, etc.
  4. As the penalty factor increases, lasso gives sparser results (more coefficients exactly zero), while ridge shrinks all coefficients (though not necessarily to zero)