SlideShare ist ein Scribd-Unternehmen logo
1 von 30
Scalable Machine 
Learning
me: 
Sam Bessalah 
Software Engineer, Freelance 
Big Data, Distributed Computing, Machine Learning 
Paris Data Geek Co-organizer 
@samklr @DataParis
Machine Learning Land 
VOWPAL WABBIT
Some Observations in Big Data Land 
● New use cases push towards faster execution platforms and real 
time predictions engines. 
● Traditional MapReduce on Hadoop is fading away, especially for 
Machine Learning 
● Apache Spark has become the darling of the Big Data world, 
thanks to its high level API and performances. 
● Rise of Machine Learning public APIs to easily integrate models 
into application and other data processing workflows.
● Used to be the only Hadoop MapReduce Framework 
● Moved from MapReduce towards modern and faster 
backends, namely 
● Now provide a fluent DSL that integrates with Scala and 
Spark
Mahout Example 
Simple Co-occurence analysis in Mahout 
val A = 
drmFromHDFS (“ hdfs://nivdul/babygirl.txt“) 
val cooccurencesMatrix = A.t %*% A 
val numInteractions = 
drmBroadcast(A.colsums) 
val I = C.mapBlock(){ 
case (keys, block) => 
val indicatorBlock = sparse(row, col) 
for (r <- block ) 
indicatorBlock = computeLLR (row, nbInt) 
keys <- indicatorblock 
}
Dataflow system, materialized by immutable and lazy, in-memory distributed 
collections suited for iterative and complex transformations, like in most Machine 
Learning algorithms. 
Those in-memory collections are called Resilient Distributed Datasets (RDD) 
They provide : 
● Partitioned data 
● High level operations (map, filter, collect, reduce, zip, join, sample, etc …) 
● No side effects 
● Fault recovery via lineage
Some operations on RDDs
Spark 
Ecosystem
MLlib 
Machine Learning library within Spark : 
● Provides an integrated predictive and data analysis 
workflow 
● Broad collections of algorithms and applications 
● Integrates with the whole Spark Ecosystem 
Three APIs in :
Algorithms in MLlib
Example: Clustering via K-means 
// Load and parse data 
val data = sc.textFile(“hdfs://bbgrl/dataset.txt”) 
val parsedData = data.map { x => 
Vectors.dense(x.split(“ “).map.(_.toDouble )) 
}.cache() 
//Cluster data into 5 classes using K-means 
val clusters = Kmeans.train(parsedData, k=5, numIterations=20 ) 
//Evaluate model error 
val cost = clusters.computeCost(parsedData)
Coming to Spark 1.2 
● Ensembles of decision trees : Random Forests 
● Boosting 
● Topic modeling 
● Streaming Kmeans 
● A pipeline interface for machine workflows 
A lot of contributions from the community
Machine Learning Pipeline 
Typical machine learning workflows are complex ! 
Coming in next iterations of MLLib
● H20 is a fast (really fast), statistics, Machine Learning 
and maths engine on the JVM. 
● Edited by 0xdata (commercial entity) and focus on 
bringing robust and highly performant machine learning 
algorithms to popular Big Data workloads. 
● Has APIs in R, Java, Scala and Python and integrates 
to third parties tools like Tableau and Excel.
Example in R 
library(h2o) 
localH2O = h2o.init(ip = 'localhost', port = 54321) 
irisPath = system.file("extdata", "iris.csv", package="h2o") 
iris.hex = h2o.importFile(localH2O, path = irisPath, key = "iris.hex") 
iris.data.frame <- as.data.frame(iris.hex) 
> colnames(iris.hex) 
[1] "C1" "C2" "C3" "C4" "C5" 
>
Simple Logistic Regressioon to predict prostate cancer outcomes: 
> prostate.hex = h2o.importFile(localH2O, 
path="https://raw.github.com/0xdata/h2o/../prostate.csv", 
key = "prostate.hex") 
> prostate.glm = h2o.glm(y = "CAPSULE", x =c("AGE","RACE","PSA","DCAPS"), 
data = prostate.hex,family = "binomial", nfolds = 10, alpha = 0.5) 
> prostate.fit = h2o.predict(object=prostate.glm, newdata = prostate.hex)
> (prostate.fit) 
IP Address: 127.0.0.1 
Port : 54321 
Parsed Data Key: GLM2Predict_8b6890653fa743be9eb3ab1668c5a6e9 
predict X0 X1 
1 0 0.7452267 0.2547732 
2 1 0.3969807 0.6030193 
3 1 0.4120950 0.5879050 
4 1 0.3726134 0.6273866 
5 1 0.6465137 0.3534863 
6 1 0.4331880 0.5668120
Sparkling Water 
Transparent use of H2O data and algorithms with the Spark API. 
Provides a custom RDD : H2ORDD
val sqlContext = new SQLContext(sc) 
import sqlContext._ 
airlinesTable.registerTempTable("airlinesTable") //H20 methods 
val query = “SELECT * FROM airlinesTable WHERE Dest LIKE 'SFO' OR Dest 
LIKE 'SJC' OR Dest LIKE 'OAK'“ 
val result = sql(query) 
result.count
Same but with Spark API 
// H2O Context provide useful implicits for conversions 
val h2oContext = new H2OContext(sc) 
import h2oContext._ 
// Create RDD wrapper around DataFrame 
val airlinesTable : RDD[Airlines] = toRDD[Airlines](airlinesData) 
airlinesTable.count 
// And use Spark RDD API directly 
val flightsOnlyToSF = airlinesTable.filter(f => 
f.Dest==Some("SFO") || f.Dest==Some("SJC") || f.Dest==Some("OAK") 
) 
flightsOnlyToSF.count
Build a model 
import hex.deeplearning._ 
import hex.deeplearning.DeepLearningModel.DeepLearningParameters 
val dlParams = new DeepLearningParameters() 
dlParams._training_frame = result( 'Year, 'Month, 'DayofMonth, DayOfWeek, 
'CRSDepTime, 'CRSArrTime,'UniqueCarrier, 
FlightNum, 'TailNum, 'CRSElapsedTime, 
'Origin, 'Dest,'Distance,‘IsDepDelayed) 
dlParams.response_column = 'IsDepDelayed.name 
// Create a new model builder 
val dl = new DeepLearning(dlParams) 
val dlModel = dl.train.get
Predict 
// Use model to score data 
val prediction = dlModel.score(result)(‘predict) 
// Collect predicted values via the RDD API 
val predictionValues = toRDD[DoubleHolder](prediction) 
.collect 
.map ( _.result.getOrElse("NaN") )
Slides: http://speakerdeck.com/samklr/

Weitere ähnliche Inhalte

Was ist angesagt?

Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
 
Debugging & Tuning in Spark
Debugging & Tuning in SparkDebugging & Tuning in Spark
Debugging & Tuning in SparkShiao-An Yuan
 
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...CloudxLab
 
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLabMapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
 
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...CloudxLab
 
Valerii Vasylkov Erlang. measurements and benefits.
Valerii Vasylkov Erlang. measurements and benefits.Valerii Vasylkov Erlang. measurements and benefits.
Valerii Vasylkov Erlang. measurements and benefits.Аліна Шепшелей
 
InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...
InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...
InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...InfluxData
 
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...CloudxLab
 
Building a Fast, Resilient Time Series Store with Cassandra (Alex Petrov, Dat...
Building a Fast, Resilient Time Series Store with Cassandra (Alex Petrov, Dat...Building a Fast, Resilient Time Series Store with Cassandra (Alex Petrov, Dat...
Building a Fast, Resilient Time Series Store with Cassandra (Alex Petrov, Dat...DataStax
 
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...Chester Chen
 
SORT & JOIN IN SPARK 2.0
SORT & JOIN IN SPARK 2.0SORT & JOIN IN SPARK 2.0
SORT & JOIN IN SPARK 2.0Sigmoid
 
Sparkling Water Meetup
Sparkling Water MeetupSparkling Water Meetup
Sparkling Water MeetupSri Ambati
 
Mastering Hadoop Map Reduce - Custom Types and Other Optimizations
Mastering Hadoop Map Reduce - Custom Types and Other OptimizationsMastering Hadoop Map Reduce - Custom Types and Other Optimizations
Mastering Hadoop Map Reduce - Custom Types and Other Optimizationsscottcrespo
 
Lambda Architecture with Cassandra (Vaibhav Puranik, GumGum) | C* Summit 2016
Lambda Architecture with Cassandra (Vaibhav Puranik, GumGum) | C* Summit 2016Lambda Architecture with Cassandra (Vaibhav Puranik, GumGum) | C* Summit 2016
Lambda Architecture with Cassandra (Vaibhav Puranik, GumGum) | C* Summit 2016DataStax
 
Interactive Session on Sparkling Water
Interactive Session on Sparkling WaterInteractive Session on Sparkling Water
Interactive Session on Sparkling WaterSri Ambati
 
SF Big Analytics 2019112: Uncovering performance regressions in the TCP SACK...
 SF Big Analytics 2019112: Uncovering performance regressions in the TCP SACK... SF Big Analytics 2019112: Uncovering performance regressions in the TCP SACK...
SF Big Analytics 2019112: Uncovering performance regressions in the TCP SACK...Chester Chen
 
Advanced Apache Cassandra Operations with JMX
Advanced Apache Cassandra Operations with JMXAdvanced Apache Cassandra Operations with JMX
Advanced Apache Cassandra Operations with JMXzznate
 

Was ist angesagt? (20)

Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
 
Debugging & Tuning in Spark
Debugging & Tuning in SparkDebugging & Tuning in Spark
Debugging & Tuning in Spark
 
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
 
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLabMapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab
 
Ordered Record Collection
Ordered Record CollectionOrdered Record Collection
Ordered Record Collection
 
Cascalog internal dsl_preso
Cascalog internal dsl_presoCascalog internal dsl_preso
Cascalog internal dsl_preso
 
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
 
Valerii Vasylkov Erlang. measurements and benefits.
Valerii Vasylkov Erlang. measurements and benefits.Valerii Vasylkov Erlang. measurements and benefits.
Valerii Vasylkov Erlang. measurements and benefits.
 
InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...
InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...
InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...
 
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
 
Building a Fast, Resilient Time Series Store with Cassandra (Alex Petrov, Dat...
Building a Fast, Resilient Time Series Store with Cassandra (Alex Petrov, Dat...Building a Fast, Resilient Time Series Store with Cassandra (Alex Petrov, Dat...
Building a Fast, Resilient Time Series Store with Cassandra (Alex Petrov, Dat...
 
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
 
SORT & JOIN IN SPARK 2.0
SORT & JOIN IN SPARK 2.0SORT & JOIN IN SPARK 2.0
SORT & JOIN IN SPARK 2.0
 
Sparkling Water Meetup
Sparkling Water MeetupSparkling Water Meetup
Sparkling Water Meetup
 
Scalding
ScaldingScalding
Scalding
 
Mastering Hadoop Map Reduce - Custom Types and Other Optimizations
Mastering Hadoop Map Reduce - Custom Types and Other OptimizationsMastering Hadoop Map Reduce - Custom Types and Other Optimizations
Mastering Hadoop Map Reduce - Custom Types and Other Optimizations
 
Lambda Architecture with Cassandra (Vaibhav Puranik, GumGum) | C* Summit 2016
Lambda Architecture with Cassandra (Vaibhav Puranik, GumGum) | C* Summit 2016Lambda Architecture with Cassandra (Vaibhav Puranik, GumGum) | C* Summit 2016
Lambda Architecture with Cassandra (Vaibhav Puranik, GumGum) | C* Summit 2016
 
Interactive Session on Sparkling Water
Interactive Session on Sparkling WaterInteractive Session on Sparkling Water
Interactive Session on Sparkling Water
 
SF Big Analytics 2019112: Uncovering performance regressions in the TCP SACK...
 SF Big Analytics 2019112: Uncovering performance regressions in the TCP SACK... SF Big Analytics 2019112: Uncovering performance regressions in the TCP SACK...
SF Big Analytics 2019112: Uncovering performance regressions in the TCP SACK...
 
Advanced Apache Cassandra Operations with JMX
Advanced Apache Cassandra Operations with JMXAdvanced Apache Cassandra Operations with JMX
Advanced Apache Cassandra Operations with JMX
 

Ähnlich wie scalable machine learning

High Performance Processing of Streaming Data
High Performance Processing of Streaming DataHigh Performance Processing of Streaming Data
High Performance Processing of Streaming DataGeoffrey Fox
 
End-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkEnd-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkDatabricks
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache SparkAmir Sedighi
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsDataStax Academy
 
Mobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in SwitzerlandMobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in SwitzerlandFrançois Garillot
 
Spark Summit EU talk by Francois Garillot and Mohamed Kafsi
Spark Summit EU talk by Francois Garillot and Mohamed KafsiSpark Summit EU talk by Francois Garillot and Mohamed Kafsi
Spark Summit EU talk by Francois Garillot and Mohamed KafsiSpark Summit
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkC4Media
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingPaco Nathan
 
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & AlluxioUltra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & AlluxioAlluxio, Inc.
 
Project Hydrogen: State-of-the-Art Deep Learning on Apache Spark
Project Hydrogen: State-of-the-Art Deep Learning on Apache SparkProject Hydrogen: State-of-the-Art Deep Learning on Apache Spark
Project Hydrogen: State-of-the-Art Deep Learning on Apache SparkDatabricks
 
Big Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computingBig Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computingAnimesh Chaturvedi
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Databricks
 
Spark streaming State of the Union - Strata San Jose 2015
Spark streaming State of the Union - Strata San Jose 2015Spark streaming State of the Union - Strata San Jose 2015
Spark streaming State of the Union - Strata San Jose 2015Databricks
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25thSneha Challa
 
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-AirflowPyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-AirflowChetan Khatri
 
Spark streaming state of the union
Spark streaming state of the unionSpark streaming state of the union
Spark streaming state of the unionDatabricks
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study NotesRichard Kuo
 

Ähnlich wie scalable machine learning (20)

High Performance Processing of Streaming Data
High Performance Processing of Streaming DataHigh Performance Processing of Streaming Data
High Performance Processing of Streaming Data
 
End-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkEnd-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache Spark
 
Dev Ops Training
Dev Ops TrainingDev Ops Training
Dev Ops Training
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache Spark
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
 
Mobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in SwitzerlandMobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in Switzerland
 
Spark Summit EU talk by Francois Garillot and Mohamed Kafsi
Spark Summit EU talk by Francois Garillot and Mohamed KafsiSpark Summit EU talk by Francois Garillot and Mohamed Kafsi
Spark Summit EU talk by Francois Garillot and Mohamed Kafsi
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
 
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & AlluxioUltra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
 
Project Hydrogen: State-of-the-Art Deep Learning on Apache Spark
Project Hydrogen: State-of-the-Art Deep Learning on Apache SparkProject Hydrogen: State-of-the-Art Deep Learning on Apache Spark
Project Hydrogen: State-of-the-Art Deep Learning on Apache Spark
 
Big Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computingBig Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computing
 
Spark ML Pipeline serving
Spark ML Pipeline servingSpark ML Pipeline serving
Spark ML Pipeline serving
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
Spark streaming State of the Union - Strata San Jose 2015
Spark streaming State of the Union - Strata San Jose 2015Spark streaming State of the Union - Strata San Jose 2015
Spark streaming State of the Union - Strata San Jose 2015
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25th
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-AirflowPyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
 
Spark streaming state of the union
Spark streaming state of the unionSpark streaming state of the union
Spark streaming state of the union
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study Notes
 

Mehr von Samir Bessalah

Machine Learning In Production
Machine Learning In ProductionMachine Learning In Production
Machine Learning In ProductionSamir Bessalah
 
Tuning tips for Apache Spark Jobs
Tuning tips for Apache Spark JobsTuning tips for Apache Spark Jobs
Tuning tips for Apache Spark JobsSamir Bessalah
 
Eventual Consitency with CRDTS
Eventual Consitency with CRDTSEventual Consitency with CRDTS
Eventual Consitency with CRDTSSamir Bessalah
 
Deep learning for mere mortals - Devoxx Belgium 2015
Deep learning for mere mortals - Devoxx Belgium 2015Deep learning for mere mortals - Devoxx Belgium 2015
Deep learning for mere mortals - Devoxx Belgium 2015Samir Bessalah
 
High Performance RPC with Finagle
High Performance RPC with FinagleHigh Performance RPC with Finagle
High Performance RPC with FinagleSamir Bessalah
 
Algebird : Abstract Algebra for big data analytics. Devoxx 2014
Algebird : Abstract Algebra for big data analytics. Devoxx 2014Algebird : Abstract Algebra for big data analytics. Devoxx 2014
Algebird : Abstract Algebra for big data analytics. Devoxx 2014Samir Bessalah
 
Structures de données exotiques
Structures de données exotiquesStructures de données exotiques
Structures de données exotiquesSamir Bessalah
 

Mehr von Samir Bessalah (7)

Machine Learning In Production
Machine Learning In ProductionMachine Learning In Production
Machine Learning In Production
 
Tuning tips for Apache Spark Jobs
Tuning tips for Apache Spark JobsTuning tips for Apache Spark Jobs
Tuning tips for Apache Spark Jobs
 
Eventual Consitency with CRDTS
Eventual Consitency with CRDTSEventual Consitency with CRDTS
Eventual Consitency with CRDTS
 
Deep learning for mere mortals - Devoxx Belgium 2015
Deep learning for mere mortals - Devoxx Belgium 2015Deep learning for mere mortals - Devoxx Belgium 2015
Deep learning for mere mortals - Devoxx Belgium 2015
 
High Performance RPC with Finagle
High Performance RPC with FinagleHigh Performance RPC with Finagle
High Performance RPC with Finagle
 
Algebird : Abstract Algebra for big data analytics. Devoxx 2014
Algebird : Abstract Algebra for big data analytics. Devoxx 2014Algebird : Abstract Algebra for big data analytics. Devoxx 2014
Algebird : Abstract Algebra for big data analytics. Devoxx 2014
 
Structures de données exotiques
Structures de données exotiquesStructures de données exotiques
Structures de données exotiques
 

Kürzlich hochgeladen

Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 

Kürzlich hochgeladen (20)

Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 

scalable machine learning

  • 2. me: Sam Bessalah Software Engineer, Freelance Big Data, Distributed Computing, Machine Learning Paris Data Geek Co-organizer @samklr @DataParis
  • 3. Machine Learning Land VOWPAL WABBIT
  • 4. Some Observations in Big Data Land ● New use cases push towards faster execution platforms and real time predictions engines. ● Traditional MapReduce on Hadoop is fading away, especially for Machine Learning ● Apache Spark has become the darling of the Big Data world, thanks to its high level API and performances. ● Rise of Machine Learning public APIs to easily integrate models into application and other data processing workflows.
  • 5. ● Used to be the only Hadoop MapReduce Framework ● Moved from MapReduce towards modern and faster backends, namely ● Now provide a fluent DSL that integrates with Scala and Spark
  • 6.
  • 7. Mahout Example Simple Co-occurence analysis in Mahout val A = drmFromHDFS (“ hdfs://nivdul/babygirl.txt“) val cooccurencesMatrix = A.t %*% A val numInteractions = drmBroadcast(A.colsums) val I = C.mapBlock(){ case (keys, block) => val indicatorBlock = sparse(row, col) for (r <- block ) indicatorBlock = computeLLR (row, nbInt) keys <- indicatorblock }
  • 8. Dataflow system, materialized by immutable and lazy, in-memory distributed collections suited for iterative and complex transformations, like in most Machine Learning algorithms. Those in-memory collections are called Resilient Distributed Datasets (RDD) They provide : ● Partitioned data ● High level operations (map, filter, collect, reduce, zip, join, sample, etc …) ● No side effects ● Fault recovery via lineage
  • 11. MLlib Machine Learning library within Spark : ● Provides an integrated predictive and data analysis workflow ● Broad collections of algorithms and applications ● Integrates with the whole Spark Ecosystem Three APIs in :
  • 13. Example: Clustering via K-means // Load and parse data val data = sc.textFile(“hdfs://bbgrl/dataset.txt”) val parsedData = data.map { x => Vectors.dense(x.split(“ “).map.(_.toDouble )) }.cache() //Cluster data into 5 classes using K-means val clusters = Kmeans.train(parsedData, k=5, numIterations=20 ) //Evaluate model error val cost = clusters.computeCost(parsedData)
  • 14.
  • 15. Coming to Spark 1.2 ● Ensembles of decision trees : Random Forests ● Boosting ● Topic modeling ● Streaming Kmeans ● A pipeline interface for machine workflows A lot of contributions from the community
  • 16. Machine Learning Pipeline Typical machine learning workflows are complex ! Coming in next iterations of MLLib
  • 17. ● H20 is a fast (really fast), statistics, Machine Learning and maths engine on the JVM. ● Edited by 0xdata (commercial entity) and focus on bringing robust and highly performant machine learning algorithms to popular Big Data workloads. ● Has APIs in R, Java, Scala and Python and integrates to third parties tools like Tableau and Excel.
  • 18.
  • 19. Example in R library(h2o) localH2O = h2o.init(ip = 'localhost', port = 54321) irisPath = system.file("extdata", "iris.csv", package="h2o") iris.hex = h2o.importFile(localH2O, path = irisPath, key = "iris.hex") iris.data.frame <- as.data.frame(iris.hex) > colnames(iris.hex) [1] "C1" "C2" "C3" "C4" "C5" >
  • 20. Simple Logistic Regressioon to predict prostate cancer outcomes: > prostate.hex = h2o.importFile(localH2O, path="https://raw.github.com/0xdata/h2o/../prostate.csv", key = "prostate.hex") > prostate.glm = h2o.glm(y = "CAPSULE", x =c("AGE","RACE","PSA","DCAPS"), data = prostate.hex,family = "binomial", nfolds = 10, alpha = 0.5) > prostate.fit = h2o.predict(object=prostate.glm, newdata = prostate.hex)
  • 21. > (prostate.fit) IP Address: 127.0.0.1 Port : 54321 Parsed Data Key: GLM2Predict_8b6890653fa743be9eb3ab1668c5a6e9 predict X0 X1 1 0 0.7452267 0.2547732 2 1 0.3969807 0.6030193 3 1 0.4120950 0.5879050 4 1 0.3726134 0.6273866 5 1 0.6465137 0.3534863 6 1 0.4331880 0.5668120
  • 22. Sparkling Water Transparent use of H2O data and algorithms with the Spark API. Provides a custom RDD : H2ORDD
  • 23.
  • 24.
  • 25. val sqlContext = new SQLContext(sc) import sqlContext._ airlinesTable.registerTempTable("airlinesTable") //H20 methods val query = “SELECT * FROM airlinesTable WHERE Dest LIKE 'SFO' OR Dest LIKE 'SJC' OR Dest LIKE 'OAK'“ val result = sql(query) result.count
  • 26. Same but with Spark API // H2O Context provide useful implicits for conversions val h2oContext = new H2OContext(sc) import h2oContext._ // Create RDD wrapper around DataFrame val airlinesTable : RDD[Airlines] = toRDD[Airlines](airlinesData) airlinesTable.count // And use Spark RDD API directly val flightsOnlyToSF = airlinesTable.filter(f => f.Dest==Some("SFO") || f.Dest==Some("SJC") || f.Dest==Some("OAK") ) flightsOnlyToSF.count
  • 27. Build a model import hex.deeplearning._ import hex.deeplearning.DeepLearningModel.DeepLearningParameters val dlParams = new DeepLearningParameters() dlParams._training_frame = result( 'Year, 'Month, 'DayofMonth, DayOfWeek, 'CRSDepTime, 'CRSArrTime,'UniqueCarrier, FlightNum, 'TailNum, 'CRSElapsedTime, 'Origin, 'Dest,'Distance,‘IsDepDelayed) dlParams.response_column = 'IsDepDelayed.name // Create a new model builder val dl = new DeepLearning(dlParams) val dlModel = dl.train.get
  • 28. Predict // Use model to score data val prediction = dlModel.score(result)(‘predict) // Collect predicted values via the RDD API val predictionValues = toRDD[DoubleHolder](prediction) .collect .map ( _.result.getOrElse("NaN") )
  • 29.

Hinweis der Redaktion

  1. c’est où le chat ?