Massive distributed processing with H2O
Codemotion,
Milan, 10 November 2017
Gabriele Nocco, Senior Data Scientist
● H2O Introduction
● GBM
● Demo
AGENDA
H2O INTRODUCTION
H2O is an open-source, in-memory machine learning engine. It is written in Java and
exposes convenient APIs in Java, Scala, Python and R. It also has a notebook-like user
interface called Flow.
Because it can be used from several languages, the framework is accessible to many
different professional roles, from analysts to programmers to more “academic”
data scientists. H2O can therefore serve as a complete infrastructure, from the prototype
model to the engineered solution.
H2O INTRODUCTION - GARTNER
In 2017, Gartner named H2O.ai a Visionary in
the Magic Quadrant for Data Science
Platforms:
STRENGTHS
● Market awareness
● Customer satisfaction
● Flexibility and scalability
CAUTIONS
● Data access and preparation
● High technical bar for use
● Visualization and data exploration
● Sales execution
https://www.gartner.com/doc/reprints?id=1-3TKPVG1&ct=170215&st=sb
H2O INTRODUCTION - FEATURES
● H2O Eco-System Benefits:
○ Scalable to massive datasets on large clusters, fully parallelized
○ Low-latency Java (“POJO”) scoring code is auto-generated
○ Easy to deploy on a laptop, server, Hadoop cluster, Spark cluster or HPC system
○ APIs include R, Python, Flow, Scala, Java, JavaScript, REST
● Regularization techniques: Dropout, L1/L2
● Early stopping, N-fold cross-validation, Grid search
● Handling of categorical, missing and sparse data
● Gaussian/Laplace/Poisson/Gamma/Tweedie regression with offsets, observation weights,
various loss functions
● Unsupervised mode for nonlinear dimensionality reduction, outlier detection
● Supported file types: CSV, ORC, SVMLight, ARFF, XLS, XLSX, Avro, Parquet
H2O INTRODUCTION - ALGORITHMS
H2O INTRODUCTION - ENSEMBLES
In statistics and machine learning, ensemble methods use multiple models to obtain
better predictive performance than could be obtained from any of the constituent
models alone.
Even if your set of base learners does not contain the true prediction function, an ensemble
can give a good approximation of that function. Ensembles often perform better than the
individual base algorithms.
You can use an ensemble of weak learners, or combine the predictions from multiple
models (Generalized Model Stacking).
Ensembles
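The idea of combining constituent models can be sketched with the simplest possible ensemble, an unweighted average. The three "models" and all numbers below are made up for illustration. The average wins here because the individual errors point in different directions and partly cancel, and models with strongly correlated errors would gain much less. Stacking replaces the plain average with a learned combination.

```python
# Contrived illustration: averaging three imperfect models.
# All predictions below are invented for the example.

def mse(predictions, targets):
    """Mean squared error between two equally long sequences."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

targets = [3.0, 5.0, 7.0, 9.0]

# Each hypothetical base model errs in a different direction.
base_predictions = {
    "model_a": [4.0, 4.0, 8.0, 8.0],
    "model_b": [2.0, 6.0, 7.0, 9.0],
    "model_c": [3.5, 5.5, 6.0, 10.0],
}

# The simplest ensemble: the unweighted mean of the base predictions.
ensemble = [sum(column) / len(column) for column in zip(*base_predictions.values())]

base_errors = {name: mse(preds, targets) for name, preds in base_predictions.items()}
ensemble_error = mse(ensemble, targets)

print(base_errors)
print(ensemble_error)
```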
H2O INTRODUCTION - DRIVERLESS AI
At the research level, machine learning problems are complex and unpredictable,
but in reality many companies today use machine learning for relatively
predictable problems.
Driverless AI is the latest product from H2O.ai, aimed at lowering the barrier to
making data science work in a corporate context.
Driverless AI
H2O INTRODUCTION - ARCHITECTURE
H2O can build Deep Neural Networks natively, or through integration with
TensorFlow. It is now possible to produce very deep networks (5 to 1000 layers!) and to
handle huge amounts of data, on the order of GBs or TBs.
Another great advantage is the ability to exploit GPUs to perform
computations.
H2O INTRODUCTION - H2O + TENSORFLOW
With the release of TensorFlow, H2O has embraced the wave of enthusiasm for the
growth of Deep Learning.
Thanks to Deep Water, H2O lets us interact in a direct and simple way with Deep
Learning tools like TensorFlow, MXNet and Caffe.
H2O INTRODUCTION - H2O + SPARK
One of the first plugins developed for H2O was the one for Apache Spark, named
Sparkling Water.
Binding to a rising open-source project such as Spark, with the computing power that
distributed computing allows, has been a great driving force for the growth of H2O.
A Sparkling Water application runs like a job that can be started with spark-submit.
At that point the Spark driver builds the DAG and divides the execution among the
Workers, where the H2O libraries are loaded into the Java process.
H2O INTRODUCTION - H2O + SPARK
Sparkling Water is certified on the main Spark distributions: Hortonworks, Cloudera
and MapR.
Databricks provides Spark clusters in the cloud, and H2O works in this environment
as well: H2O Rains with Databricks Cloud!
Gradient Boosting Machine (GBM) is one of the most powerful techniques for building predictive models. It
is a supervised algorithm that can be applied to both classification and regression.
It is one of the most widely used algorithms in the Kaggle community, performing better
than SVMs, Decision Trees and Neural Networks in a large number of cases.
http://blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/
GBM can be a good choice when the size of the dataset or the available computing power
does not allow training a Deep Neural Network.
GBM
Gradient Boosting Machine
Kaggle is the biggest platform for Machine
Learning contests in the world.
https://www.kaggle.com/
In early March 2017, Google announced
the acquisition of Kaggle.
GBM - KAGGLE
GBM - GRADIENT BOOSTING
In summary, GBM requires three components to be specified:
● A loss function, which each new weak learner is trained to reduce.
● The specific form of the weak learner (e.g., short decision trees).
● An additive technique for combining the weak learners so as to minimize the loss function.
How Gradient Boosting Works
GBM - GRADIENT BOOSTING
The loss function determines the behavior of the
algorithm.
The only requirement is differentiability, in order to
allow gradient descent on it. Although you can define
arbitrary losses, in practice only a handful are used.
For example, regression may use a squared error and
classification may use logarithmic loss.
Loss Function
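The differentiability requirement can be made concrete with the two losses mentioned above. This is a minimal sketch in plain Python, not H2O code. The function names are chosen for illustration, and the logarithmic loss is assumed to take a raw score (log-odds) as the prediction. A finite-difference helper is included only to check the analytic gradients.

```python
import math

def squared_error(y, f):
    """Squared-error loss for one observation (the 1/2 simplifies the gradient)."""
    return 0.5 * (y - f) ** 2

def squared_error_gradient(y, f):
    """Derivative of squared_error with respect to the prediction f."""
    return f - y

def log_loss(y, f):
    """Logarithmic loss for a label y in {0, 1}; f is the raw score (log-odds)."""
    p = 1.0 / (1.0 + math.exp(-f))
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def log_loss_gradient(y, f):
    """Derivative of log_loss with respect to the raw score f."""
    p = 1.0 / (1.0 + math.exp(-f))
    return p - y

def numerical_gradient(loss, y, f, eps=1e-6):
    """Central finite difference, used only to sanity-check the formulas."""
    return (loss(y, f + eps) - loss(y, f - eps)) / (2 * eps)

print(squared_error_gradient(3.0, 1.5), numerical_gradient(squared_error, 3.0, 1.5))
print(log_loss_gradient(1, 0.3), numerical_gradient(log_loss, 1, 0.3))
```

For squared error the negative gradient is simply the residual y - f, which is why boosting with this loss amounts to fitting residuals.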
GBM - GRADIENT BOOSTING
In H2O, the weak learners are implemented as decision trees. In order to
allow the addition of their outputs, regression trees (having real values in
output) are used.
When building each decision tree, the algorithm iteratively
selects a split point in order to minimize the loss. It is
possible to increase the depth of the trees to handle more
complex problems.
Conversely, to limit overfitting we can constrain the
topology of the trees, e.g. by limiting the depth, the number of
splits, or the number of leaf nodes.
Weak Learner
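Split selection can be illustrated with the smallest possible regression tree, a single split on one feature. The sketch below tries every candidate threshold and keeps the one with the lowest sum of squared errors. It is a simplified exhaustive search written for illustration, with made-up data, and should not be read as H2O's actual split-finding implementation.

```python
def best_split(xs, ys):
    """Return (threshold, left_value, right_value, sse) of the best single split.

    Tries every midpoint between consecutive distinct feature values and keeps
    the one with the lowest sum of squared errors (SSE).
    """
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate identical feature values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[3]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (threshold, left_mean, right_mean, sse)
    return best

# Made-up data with an obvious jump between x = 3 and x = 10.
xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
threshold, left_value, right_value, sse = best_split(xs, ys)
print(threshold, left_value, right_value, sse)
```

A deeper tree repeats the same search inside each resulting region, which is where depth and leaf-count limits come in.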
GBM - GRADIENT BOOSTING
In a GBM with squared loss, the resulting algorithm is
extremely simple: at each step we train a new tree on
the “residual errors” with respect to the previous weak
learners.
This can be seen as a gradient descent step with respect
to our loss, where all previous weak learners are kept
fixed and the gradient is approximated by the new tree (in other
words, an optimization in function space). This generalizes easily
to different losses.
Additive Model
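The link between "residual errors" and a gradient step can be written out for the squared loss. The notation is introduced here for illustration and does not come from the slides: F_{m-1} is the model after m-1 trees, r_i^{(m)} is the residual of observation i at step m, and h_m is the new tree.

```latex
\begin{align*}
L\bigl(y, F(x)\bigr) &= \tfrac{1}{2}\bigl(y - F(x)\bigr)^2 \\
-\frac{\partial L\bigl(y_i, F(x_i)\bigr)}{\partial F(x_i)}\bigg|_{F = F_{m-1}}
  &= y_i - F_{m-1}(x_i) = r_i^{(m)} \\
h_m &\approx \arg\min_h \sum_i \bigl(r_i^{(m)} - h(x_i)\bigr)^2 \\
F_m(x) &= F_{m-1}(x) + h_m(x)
\end{align*}
```

For another differentiable loss, only the second line changes: the tree is fit to that loss's negative gradient instead of the plain residual.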
GBM - GRADIENT BOOSTING
The output for the new tree is then added to the
output of the existing sequence of trees in an effort to
correct or improve the final output of the model. In
particular, we associate a different weighting
parameter with each decision region of the newly
constructed tree.
Either a fixed number of trees is added, or training stops
once the loss reaches an acceptable level or no longer
improves on an external validation dataset.
Output and Stop Condition
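The additive update and the stop condition can be sketched together in a few lines of plain Python. This is an illustrative toy, not H2O's implementation: the weak learner is a single-split tree on one feature, the loss is squared error, and the names `boost`, `fit_stump`, `learn_rate` and `patience` are chosen for the example. Training stops at `max_trees`, or earlier once the validation error has not improved for `patience` consecutive trees.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression tree by minimizing squared error.

    Assumes xs contains at least two distinct values.
    """
    pairs = sorted(zip(xs, residuals))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        left = [r for _, r in pairs[:i]]
        right = [r for _, r in pairs[i:]]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, (pairs[i - 1][0] + pairs[i][0]) / 2, lm, rm)
    _, threshold, lm, rm = best
    # lm and rm are the output values of the two decision regions.
    return lambda x: lm if x < threshold else rm


def mse(model, xs, ys):
    """Mean squared error of a callable model on (xs, ys)."""
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


def boost(train, valid, max_trees=200, learn_rate=0.1, patience=5):
    """Squared-loss boosting that stops when validation error stops improving."""
    (tx, ty), (vx, vy) = train, valid
    base = sum(ty) / len(ty)  # initial prediction: the mean target
    trees = []

    def predict(x):
        return base + learn_rate * sum(tree(x) for tree in trees)

    best_error, best_size = mse(predict, vx, vy), 0
    while len(trees) < max_trees and len(trees) - best_size < patience:
        residuals = [y - predict(x) for x, y in zip(tx, ty)]
        trees.append(fit_stump(tx, residuals))  # new tree corrects the current model
        error = mse(predict, vx, vy)
        if error < best_error:
            best_error, best_size = error, len(trees)
    del trees[best_size:]  # keep only the trees up to the best validation round
    return predict, best_size, best_error


# Made-up data: y = x^2, with validation points between the training points.
train_x = [i / 2 for i in range(20)]
train_y = [x * x for x in train_x]
valid_x = [x + 0.25 for x in train_x[:-1]]
valid_y = [x * x for x in valid_x]

mean_y = sum(train_y) / len(train_y)
baseline_error = mse(lambda x: mean_y, valid_x, valid_y)
model, n_trees, best_error = boost((train_x, train_y), (valid_x, valid_y))
print(n_trees, baseline_error, best_error)
```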
GBM - GRADIENT BOOSTING
Gradient boosting is a greedy algorithm and can overfit a training dataset quickly.
It can benefit from regularization methods that penalize various parts of the algorithm
and generally improve the performance of the algorithm by reducing overfitting.
Four enhancements to basic gradient boosting are commonly used:
● Tree Constraints
● Learning Rate
● Stochastic Gradient Boosting
● Penalized Learning (L1 or L2 regularization of the regression trees' outputs)
Improvements to Basic Gradient Boosting
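As a rough sketch of how the first three enhancements map onto H2O's Python API, the parameters below configure an `H2OGradientBoostingEstimator` for a binary classification problem. The values are arbitrary starting points, not recommendations, and the `train_gbm` helper, the CSV path and the target column name are hypothetical. The sketch leaves out penalized learning. Running `train_gbm` requires the `h2o` package and a Java runtime, so the imports are kept inside the function.

```python
# Illustrative settings only; tune them on your own data.
GBM_PARAMS = {
    # Tree constraints
    "ntrees": 500,
    "max_depth": 4,
    "min_rows": 10,
    # Learning rate (shrinkage)
    "learn_rate": 0.05,
    # Stochastic gradient boosting: row and column subsampling
    "sample_rate": 0.8,
    "col_sample_rate": 0.8,
    # Early stopping on the validation metric
    "stopping_rounds": 5,
    "stopping_metric": "logloss",
    "stopping_tolerance": 1e-3,
    "seed": 42,
}


def train_gbm(csv_path, target):
    """Train a GBM classifier on a CSV file with an 80/20 train/validation split.

    csv_path and target are placeholders supplied by the caller.
    """
    import h2o
    from h2o.estimators.gbm import H2OGradientBoostingEstimator

    h2o.init()  # starts or connects to a local H2O cluster
    frame = h2o.import_file(csv_path)
    frame[target] = frame[target].asfactor()  # categorical target means classification
    train, valid = frame.split_frame(ratios=[0.8], seed=42)
    features = [c for c in frame.columns if c != target]

    model = H2OGradientBoostingEstimator(**GBM_PARAMS)
    model.train(x=features, y=target, training_frame=train, validation_frame=valid)
    return model
```

With early stopping enabled, `ntrees` acts as an upper bound: a small `learn_rate` combined with a generous tree budget lets the validation metric decide when to stop.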
Q&A
mail: gabriele.nocco@gmail.com
meetup: https://www.meetup.com/it-IT/Machine-Learning-Data-Science-Meetup/
IAML - Italian Association for Machine Learning: https://www.iaml.it/

Real Time Object Detection Using Open CV
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 

Gabriele Nocco - Massive distributed processing with H2O - Codemotion Milan 2017

  • 1. Massive distributed processing with H2O Codemotion, Milan, 10 November 2017 Gabriele Nocco, Senior Data Scientist
  • 2. AGENDA ● H2O Introduction ● GBM ● Demo
  • 3. AGENDA ● H2O Introduction ● GBM ● Demo
  • 4. H2O INTRODUCTION H2O is an open-source, in-memory Machine Learning engine. It is Java-based and exposes convenient APIs in Java, Scala, Python and R. It also has a notebook-like user interface called Flow. This range of languages makes the framework accessible to many different professional roles, from analysts to programmers to more “academic” data scientists. H2O can therefore serve as a complete infrastructure, from the prototype model to the engineered solution.
  • 5. H2O INTRODUCTION - GARTNER In 2017, H2O.ai became a Visionary in the Magic Quadrant for Data Science Platforms: STRENGTHS ● Market awareness ● Customer satisfaction ● Flexibility and scalability CAUTIONS ● Data access and preparation ● High technical bar for use ● Visualization and data exploration ● Sales execution https://www.gartner.com/doc/reprints?id=1-3TKPVG1&ct=170215&st=sb
  • 6. H2O INTRODUCTION - FEATURES ● H2O Eco-System Benefits: ○ Scalable to massive datasets on large clusters, fully parallelized ○ Low-latency Java (“POJO”) scoring code is auto-generated ○ Easy to deploy on Laptop, Server, Hadoop cluster, Spark cluster, HPC ○ APIs include R, Python, Flow, Scala, Java, JavaScript, REST ● Regularization techniques: Dropout, L1/L2 ● Early stopping, N-fold cross-validation, Grid search ● Handling of categorical, missing and sparse data ● Gaussian/Laplace/Poisson/Gamma/Tweedie regression with offsets, observation weights, various loss functions ● Unsupervised mode for nonlinear dimensionality reduction, outlier detection ● Supported file types: CSV, ORC, SVMLight, ARFF, XLS, XLSX, Avro, Parquet
  • 7. H2O INTRODUCTION - ALGORITHMS
  • 8. H2O INTRODUCTION - ENSEMBLES In statistics and machine learning, ensemble methods use multiple models to obtain better predictive performance than could be obtained from any of the constituent models alone. If your set of base learners does not contain the true prediction function, an ensemble can still give a good approximation of that function. Ensembles often perform better than the individual base algorithms. You can use an ensemble of weak learners or combine the predictions from multiple models (Generalized Model Stacking).
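The idea of combining predictions from multiple models can be illustrated with the simplest possible ensemble, a plain average. The sketch below is not H2O code; the three "models" are hypothetical hard-coded prediction lists chosen so that their errors partly cancel. With squared error, the averaged prediction is never worse than the average error of its members, and in this constructed case it beats every one of them.

```python
def mse(predictions, targets):
    """Mean squared error between two equally long lists."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def average_ensemble(model_predictions):
    """Average the predictions of several models, row by row."""
    return [sum(row) / len(row) for row in zip(*model_predictions)]


targets = [1.0, 2.0, 3.0, 4.0]
model_a = [1.5, 2.5, 3.5, 4.5]  # hypothetical model that overestimates
model_b = [0.5, 1.5, 2.5, 3.5]  # hypothetical model that underestimates
model_c = [1.0, 2.5, 2.5, 4.0]  # hypothetical model with mixed errors

ensemble = average_ensemble([model_a, model_b, model_c])
member_errors = [mse(m, targets) for m in (model_a, model_b, model_c)]
ensemble_error = mse(ensemble, targets)
```

Stacking goes one step further and learns the combination weights from data instead of fixing them to an equal average.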
  • 9. H2O INTRODUCTION - DRIVERLESS AI At the research level, machine learning problems are complex and unpredictable, but in reality many companies today use machine learning for relatively predictable problems. Driverless AI is the latest product from H2O.ai, aimed at lowering the barrier to making data science work in a corporate context.
  • 10. H2O INTRODUCTION - ARCHITECTURE
  • 11. H2O INTRODUCTION - ARCHITECTURE
  • 12. H2O INTRODUCTION - H2O + TENSORFLOW H2O can build Deep Neural Networks natively or through integration with TensorFlow. It can now produce very deep networks (5 to 1000 layers!) and handle huge amounts of data, on the order of GBs or TBs. Another great advantage is the ability to exploit GPUs to perform computations.
  • 13. H2O INTRODUCTION - H2O + TENSORFLOW With the release of TensorFlow, H2O has embraced the wave of enthusiasm for the growth of Deep Learning. Thanks to Deep Water, H2O allows us to interact in a direct and simple way with Deep Learning tools like TensorFlow, MXNet and Caffe.
  • 14. H2O INTRODUCTION - ARCHITECTURE
  • 15. H2O INTRODUCTION - H2O + SPARK One of the first plugins developed for H2O was the one for Apache Spark, named Sparkling Water. Binding to a rising open-source project such as Spark, with the computing power that distributed processing provides, has been a great driving force for the growth of H2O.
  • 16. H2O INTRODUCTION - H2O + SPARK A Sparkling Water application runs as a job that can be started with spark-submit. The Spark Master then produces the DAG and divides the execution among the Workers, each of which loads the H2O libraries into its Java process.
  • 17. H2O INTRODUCTION - H2O + SPARK The Sparkling Water solution is certified for the major Spark distributions: Hortonworks, Cloudera, MapR. Databricks provides a Spark cluster in the cloud, and H2O works well in this environment. H2O Rains with Databricks Cloud!
  • 18. AGENDA ● H2O Introduction ● GBM ● Demo
  • 19. GBM - Gradient Boosting Machine Gradient Boosting Machine is one of the most powerful techniques for building predictive models. It can be applied to classification or regression, so it is a supervised algorithm. It is one of the most widely used algorithms in the Kaggle community, performing better than SVMs, Decision Trees and Neural Networks in a large number of cases. http://blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/ GBM can be a good choice when the size of the dataset or the available computing power does not allow training a Deep Neural Network.
  • 20. GBM - KAGGLE Kaggle is the biggest platform for Machine Learning contests in the world. https://www.kaggle.com/ At the beginning of March 2017, Google announced the acquisition of the Kaggle community.
  • 21. GBM - GRADIENT BOOSTING How Gradient Boosting Works. In summary, GBM requires specifying three components: ● The loss function that each new weak learner is trained to reduce. ● The specific form of the weak learner (e.g., short decision trees). ● An additive technique that combines the weak learners so as to minimize the loss function.
  • 22. GBM - GRADIENT BOOSTING Loss Function. The loss function determines the behavior of the algorithm. The only requirement is differentiability, in order to allow gradient descent on it. Although you can define arbitrary losses, in practice only a handful are used. For example, regression may use a squared error and classification may use logarithmic loss.
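The two losses named on this slide, together with the negative gradients that boosting actually fits, can be sketched in a few lines of plain Python. This is an illustration rather than H2O's implementation: the function names are made up for the example, and the classification loss assumes labels in {0, 1} and a raw score `f` on the log-odds scale.

```python
import math


def squared_error(y, f):
    """Squared-error loss for one observation; f is the current prediction."""
    return 0.5 * (y - f) ** 2


def squared_error_negative_gradient(y, f):
    """-dL/df for squared error: the ordinary residual."""
    return y - f


def sigmoid(f):
    """Map a raw score (log-odds) to a probability."""
    return 1.0 / (1.0 + math.exp(-f))


def log_loss(y, f):
    """Logarithmic loss for a label y in {0, 1} and a raw score f."""
    p = sigmoid(f)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def log_loss_negative_gradient(y, f):
    """-dL/df for log loss: label minus predicted probability."""
    return y - sigmoid(f)
```

In both cases the negative gradient has the form "target minus current prediction", which is why the residual intuition carries over from regression to classification.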
  • 23. GBM - GRADIENT BOOSTING Weak Learner. In H2O, the weak learners are implemented as decision trees. So that their outputs can be added together, regression trees (which output real values) are used. When building each decision tree, the algorithm iteratively selects a split point in order to minimize the loss. It is possible to increase the depth of the trees to handle more complex problems. Conversely, to limit overfitting we can constrain the topology of the tree by, e.g., limiting the depth, the number of splits, or the number of leaf nodes.
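To make the split selection concrete, here is a minimal sketch of the most constrained weak learner possible: a depth-1 regression tree (a "stump") on a single numeric feature. It tries every midpoint between consecutive distinct feature values and keeps the one with the lowest sum of squared errors. The function names are hypothetical, and real implementations such as H2O's use histogram-based, distributed split finding rather than this exhaustive loop.

```python
def fit_stump(xs, ys):
    """Fit a depth-1 regression tree on one numeric feature.

    Returns (threshold, left_value, right_value), chosen to minimize the
    sum of squared errors over all candidate thresholds.
    """
    def mean(values):
        return sum(values) / len(values)

    def sse(values):
        m = mean(values)
        return sum((v - m) ** 2 for v in values)

    best = None
    candidates = sorted(set(xs))
    # Candidate thresholds: midpoints between consecutive distinct values.
    for lo, hi in zip(candidates, candidates[1:]):
        threshold = (lo + hi) / 2.0
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        error = sse(left) + sse(right)
        if best is None or error < best[0]:
            best = (error, threshold, mean(left), mean(right))
    if best is None:
        # All x values identical: no split possible, predict the mean.
        m = mean(ys)
        return (float("inf"), m, m)
    return best[1:]


def predict_stump(stump, x):
    """Return the leaf value for a single feature value x."""
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


stump = fit_stump([1, 2, 3, 4], [1.0, 1.0, 5.0, 5.0])
```

Deeper trees repeat the same search recursively inside each leaf; limiting the depth or the number of leaves simply stops that recursion early.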
  • 24. GBM - GRADIENT BOOSTING Additive Model. In a GBM with squared loss, the resulting algorithm is extremely simple: at each step we train a new tree on the “residual errors” with respect to the previous weak learners. This can be seen as a gradient descent step with respect to our loss, where all previous weak learners are kept fixed and the gradient is approximated (equivalently, it can be seen as optimization in a function space). This generalizes easily to different losses.
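The link between residuals and gradients can be written out explicitly. The notation below is an assumption introduced for this sketch and does not appear on the slides: $F_m$ is the model after $m$ trees, $h_m$ is the new tree, $\nu$ is a step size, and $L$ is the loss.

```latex
% Additive update: earlier trees stay fixed, one new tree is added.
F_m(x) = F_{m-1}(x) + \nu \, h_m(x)

% The new tree h_m is fitted to the negative gradient of the loss,
% evaluated at the current model (the "pseudo-residuals"):
r_i = -\left[ \frac{\partial L\bigl(y_i, F(x_i)\bigr)}{\partial F(x_i)} \right]_{F = F_{m-1}}

% For squared loss the pseudo-residual is the ordinary residual:
L(y, F) = \tfrac{1}{2}\,(y - F)^2
\quad\Longrightarrow\quad
r_i = y_i - F_{m-1}(x_i)
```

For any other differentiable loss only the expression for $r_i$ changes; the additive update stays the same.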
  • 25. GBM - GRADIENT BOOSTING Output and Stop Condition. The output of the new tree is then added to the output of the existing sequence of trees in an effort to correct or improve the final output of the model. In particular, we associate a different weighting parameter with each decision region of the newly constructed tree. Either a fixed number of trees is added, or training stops once the loss reaches an acceptable level or no longer improves on an external validation dataset.
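The additive update and both stop conditions can be sketched together in one small training loop. This is a toy in plain Python under several assumptions: squared loss, a single numeric feature with at least two distinct values, depth-1 trees, and hypothetical parameter names (`max_trees`, `learn_rate`, `patience`) that only loosely mirror the options a real library exposes.

```python
def fit_stump(xs, residuals):
    """Best single-split regression tree on one feature (squared error)."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2.0
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    return best[1:]


def predict(model, x):
    """Initial constant plus the shrunken sum of all tree outputs."""
    base, rate, stumps = model
    return base + rate * sum(lv if x <= t else rv for t, lv, rv in stumps)


def mse(model, xs, ys):
    return sum((y - predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


def fit_gbm(train, valid, max_trees=50, learn_rate=0.5, patience=3):
    """Add trees until max_trees is reached or the validation loss has not
    improved for `patience` consecutive trees; keep the best-sized model."""
    xs, ys = train
    base = sum(ys) / len(ys)  # start from the mean of the targets
    stumps = []
    best_loss, best_size = mse((base, learn_rate, stumps), *valid), 0
    for _ in range(max_trees):
        model = (base, learn_rate, stumps)
        residuals = [y - predict(model, x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
        loss = mse((base, learn_rate, stumps), *valid)
        if loss < best_loss - 1e-12:
            best_loss, best_size = loss, len(stumps)
        elif len(stumps) - best_size >= patience:
            break
    return (base, learn_rate, stumps[:best_size])


train = ([1, 2, 3, 4, 5, 6], [1.0, 1.0, 3.0, 3.0, 6.0, 6.0])
valid = ([1.5, 3.5, 5.5], [1.0, 3.0, 6.0])
model = fit_gbm(train, valid)
```

Returning `stumps[:best_size]` rather than every tree built means the model is rolled back to the point where the validation loss was lowest.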
  • 26. GBM - GRADIENT BOOSTING Improvements to Basic Gradient Boosting. Gradient boosting is a greedy algorithm and can overfit a training dataset quickly. It can benefit from regularization methods that penalize various parts of the algorithm and generally improve its performance by reducing overfitting. There are four enhancements to basic gradient boosting: ● Tree Constraints ● Learning Rate ● Stochastic Gradient Boosting ● Penalized Learning (L1 or L2 regularization of the regression trees' output)
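Three of these enhancements reduce to very small pieces of arithmetic, sketched below in plain Python. The helper names are hypothetical, and the penalized leaf formula assumes squared loss with an L2 penalty on the leaf value; other losses and L1 penalties lead to different closed forms.

```python
import random


def subsample_rows(n_rows, sample_rate, rng):
    """Stochastic gradient boosting: draw a random subset of row indices
    (without replacement) on which to train the next tree."""
    k = max(1, int(round(sample_rate * n_rows)))
    return sorted(rng.sample(range(n_rows), k))


def penalized_leaf_value(residuals, l2=0.0):
    """Leaf output for squared loss with an L2 penalty on the leaf value.

    Minimizing sum((r - w)^2) + l2 * w^2 over w gives w = sum(r) / (n + l2),
    so l2 = 0 recovers the plain mean and larger l2 shrinks w toward zero.
    """
    return sum(residuals) / (len(residuals) + l2)


def shrink(tree_output, learn_rate):
    """Learning rate: scale each tree's contribution before adding it."""
    return learn_rate * tree_output


rng = random.Random(0)  # fixed seed so the sketch is reproducible
rows = subsample_rows(10, 0.5, rng)
```

Tree constraints, the fourth item, act earlier: they limit how the tree is grown (depth, number of leaves) rather than how its output is used.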
  • 27. AGENDA ● H2O Introduction ● GBM ● Demo
  • 28. Q&A