4. Some Observations in Big Data Land
● New use cases push towards faster execution platforms and real-time
prediction engines.
● Traditional MapReduce on Hadoop is fading away, especially for
Machine Learning.
● Apache Spark has become the darling of the Big Data world,
thanks to its high-level API and performance.
● Rise of public Machine Learning APIs to easily integrate models
into applications and other data processing workflows.
5. Apache Mahout
● Used to be the only Machine Learning framework for Hadoop MapReduce
● Has moved from MapReduce towards modern and faster
backends, namely Spark
● Now provides a fluent DSL that integrates with Scala and
Spark
7. Mahout Example
Simple co-occurrence analysis in Mahout
// Load a distributed row matrix (DRM) from HDFS
val A = drmFromHDFS("hdfs://nivdul/babygirl.txt")

// Co-occurrence matrix: A' * A
val cooccurrenceMatrix = A.t %*% A

// Broadcast the per-item interaction counts to all workers
val numInteractions = drmBroadcast(A.colSums)

// Turn raw co-occurrence counts into LLR-based indicators, block by block
val I = cooccurrenceMatrix.mapBlock() {
  case (keys, block) =>
    val indicatorBlock = sparse(block.nrow, block.ncol)
    for (r <- 0 until block.nrow)
      indicatorBlock(r, ::) = computeLLR(block(r, ::), numInteractions) // computeLLR: user-defined scoring
    keys -> indicatorBlock
}
8. Apache Spark
Dataflow system materialized by immutable, lazy, in-memory distributed collections,
well suited for the iterative and complex transformations found in most Machine
Learning algorithms.
Those in-memory collections are called Resilient Distributed Datasets (RDDs).
They provide:
● Partitioned data
● High-level operations (map, filter, collect, reduce, zip, join, sample, etc.), as sketched below
● No side effects
● Fault recovery via lineage
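Not part of the original deck: a minimal sketch of these RDD properties in the Spark shell, with made-up data (sc is the shell's SparkContext).

import org.apache.spark.SparkContext._

// Partitioned, immutable collection built from a local sequence
val ratings = sc.parallelize(Seq(("user1", 4.0), ("user2", 2.0), ("user1", 5.0)), numSlices = 4)

// High-level, side-effect-free transformations; nothing is computed yet (lazy)
val goodRatings = ratings.filter { case (_, score) => score >= 4.0 }
val perUser = goodRatings.map { case (user, _) => (user, 1) }.reduceByKey(_ + _)

// The action triggers execution; lost partitions are rebuilt from the lineage above
perUser.collect().foreach(println)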
11. MLlib
Machine Learning library within Spark:
● Provides an integrated predictive and data analysis
workflow
● Broad collection of algorithms and applications
● Integrates with the whole Spark ecosystem
Three APIs: Scala, Java and Python
13. Example: Clustering via K-means
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Load and parse data
val data = sc.textFile("hdfs://bbgrl/dataset.txt")
val parsedData = data.map { x =>
  Vectors.dense(x.split(" ").map(_.toDouble))
}.cache()

// Cluster data into 5 classes using K-means
val clusters = KMeans.train(parsedData, k = 5, maxIterations = 20)

// Evaluate model error (within-set sum of squared errors)
val cost = clusters.computeCost(parsedData)
15. Coming in Spark 1.2
● Ensembles of decision trees: Random Forests (see the sketch after this list)
● Boosting
● Topic modeling
● Streaming K-means
● A pipeline interface for machine learning workflows
A lot of contributions from the community
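Not from the deck, but as an illustration of the new tree ensembles, here is a minimal sketch of the MLlib 1.2 Random Forest API; trainingData and every parameter value are hypothetical.

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.rdd.RDD

// trainingData is assumed to be an RDD[LabeledPoint] prepared elsewhere
def trainForest(trainingData: RDD[LabeledPoint]): RandomForestModel =
  RandomForest.trainClassifier(
    trainingData,
    numClasses = 2,
    categoricalFeaturesInfo = Map[Int, Int](), // all features treated as continuous
    numTrees = 50,
    featureSubsetStrategy = "auto",            // MLlib picks the feature subset per node
    impurity = "gini",
    maxDepth = 5,
    maxBins = 32,
    seed = 42)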
16. Machine Learning Pipeline
Typical machine learning workflows are complex!
Coming in the next iterations of MLlib; a sketch of the new API follows.
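As an illustration of what that pipeline interface looks like in the new spark.ml package, here is a minimal sketch for a text classification task; the column names and the trainingData/testData tables are assumptions, not part of the original deck.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Each stage declares which columns it reads and which it adds
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
val hashingTF = new HashingTF()
  .setInputCol("words")
  .setOutputCol("features")
val lr = new LogisticRegression()
  .setMaxIter(10)

// Chain the stages and fit them as a single estimator
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(trainingData)   // trainingData: a table with "text" and "label" columns

// The fitted PipelineModel replays the same steps at prediction time
val predictions = model.transform(testData)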
17. ● H2O is a fast (really fast) statistics, Machine Learning
and maths engine on the JVM.
● Developed by 0xdata (a commercial entity), focused on
bringing robust, high-performance machine learning
algorithms to popular Big Data workloads.
● Has APIs in R, Java, Scala and Python, and integrates
with third-party tools like Tableau and Excel.
19. Example in R
library(h2o)
# Start (or connect to) a local H2O instance
localH2O = h2o.init(ip = 'localhost', port = 54321)
# Import the bundled iris dataset into the H2O cluster
irisPath = system.file("extdata", "iris.csv", package = "h2o")
iris.hex = h2o.importFile(localH2O, path = irisPath, key = "iris.hex")
# Convert the H2O frame back to a plain R data frame
iris.data.frame <- as.data.frame(iris.hex)
> colnames(iris.hex)
[1] "C1" "C2" "C3" "C4" "C5"
22. Sparkling Water
Transparent use of H2O data and algorithms with the Spark API.
Provides a custom RDD: H2ORDD.
25. val sqlContext = new SQLContext(sc)
import sqlContext._

// Register the H2O-backed airlines data as a Spark SQL temp table
airlinesTable.registerTempTable("airlinesTable")

// Keep only flights landing in the Bay Area
val query = """SELECT * FROM airlinesTable
               WHERE Dest LIKE 'SFO' OR Dest LIKE 'SJC' OR Dest LIKE 'OAK'"""
val result = sql(query)
result.count
26. Same but with the Spark API
// H2OContext provides useful implicits for conversions
val h2oContext = new H2OContext(sc)
import h2oContext._

// Create an RDD wrapper around the H2O DataFrame
val airlinesTable: RDD[Airlines] = toRDD[Airlines](airlinesData)
airlinesTable.count

// And use the Spark RDD API directly
val flightsOnlyToSF = airlinesTable.filter(f =>
  f.Dest == Some("SFO") || f.Dest == Some("SJC") || f.Dest == Some("OAK")
)
flightsOnlyToSF.count
27. Build a model
import hex.deeplearning._
import hex.deeplearning.DeepLearningModel.DeepLearningParameters

val dlParams = new DeepLearningParameters()
dlParams._training_frame = result('Year, 'Month, 'DayofMonth, 'DayOfWeek,
                                  'CRSDepTime, 'CRSArrTime, 'UniqueCarrier,
                                  'FlightNum, 'TailNum, 'CRSElapsedTime,
                                  'Origin, 'Dest, 'Distance, 'IsDepDelayed)
dlParams._response_column = 'IsDepDelayed.name

// Create a new model builder and train it
val dl = new DeepLearning(dlParams)
val dlModel = dl.train.get
28. Predict
// Use the model to score the data
val prediction = dlModel.score(result)('predict)

// Collect predicted values via the RDD API
val predictionValues = toRDD[DoubleHolder](prediction)
  .collect
  .map(_.result.getOrElse("NaN"))