4. “Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.” — Gartner
3Vs

• NSA, Baidu: 10-100 PB
• eBay: 100 PB
• Google: 100 PB
* Estimated data processed per day, circa 2014
11. Machine learning

“The field of machine learning is concerned with the question of how to construct computer programs that automatically improve with experience.” — Mitchell, Tom M., “Machine Learning”

ML is extremely broad and involves several domains:
• computer science
• probability and statistics
• optimisation
• linear algebra
12. Basic terminology

• Observation - an object used for learning or evaluation (e.g. a house)
• Features - a representation of the observation (e.g. square meters, number of rooms, location)
• Labels - a value assigned to an observation (not always used)
• System - a set of related objects forming a complex whole (e.g. a set of observations)
• Model (math) - a description of a system using mathematical concepts/language
• Data:
  • the training set gets us our candidate parameters =>
  • the validation set (optional) gets us the optimal parameter set =>
  • the test set checks how good the model is
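The split described in the last bullet can be done directly in Spark. A minimal sketch, assuming `data` is an already-loaded RDD (the weights and seed are illustrative):

// Randomly split into 60% training, 20% validation, 20% test
val Array(training, validation, test) =
  data.randomSplit(Array(0.6, 0.2, 0.2), seed = 11L)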
13. Types of ML problems

• regression - when you want to predict a real number
• classification - when you want to assign to a category
• clustering - when you want to group data or have too much data
• association analysis - when you want to find relations between data
16. So what’s the problem?

• Lack of distributed/scalable solutions
• Not enough data and/or computing power
• A false conviction that we:
  • need to read hard research papers
  • have to use “weird” programming languages
19. Still not good enough…

• Not designed for big data
• Didn’t fit machine learning computation models
20. ML, JVM and (iterative) distribution?
21. New (distributed) kids on the block

• MLlib (+ Spark)
• TridentML (+ Storm)
• Apache FlinkML (+ Flink)
• Mahout Samsara
  • Mahout R-like DSL
  • Mahout on Spark
• H2O
  • back-end agnostic (but with native APIs)
  • open-source machine learning platform
22. What is Spark?

• A distributed, fast, in-memory computational framework
• Based on the RDD (Resilient Distributed Dataset): an abstract, immutable, distributed, easily rebuilt data format
• Support for Scala, Java, Python and R
• Focuses on well-known methods (map(), flatMap(), filter(), reduce() …)
23. What is Spark?

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

val conf = new SparkConf().setAppName("Spark App")
val sc = new SparkContext(conf)

// Classic word count: split lines into words, pair each word with 1, sum per word
val textFile: RDD[String] = sc.textFile("hdfs://...")
val counts: RDD[(String, Int)] = textFile.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
println(s"Found ${counts.count()} distinct words")
counts.saveAsTextFile("hdfs://...")
24. What is Spark?

The Spark stack:
• Spark SQL, Spark Streaming, MLlib, GraphX
• Apache Spark (core)
• Mesos/YARN/Standalone (cluster management)
25. What is MLlib?

• Machine learning library for Spark (scalable by definition)
• Around since September 2013, initially created at AMPLab (UC Berkeley)
• Contains common, well-established machine learning algorithms and utilities
26. Is it for me?

PROS:
• extensive community, part of Spark (Databricks support)
• Java, Scala, Python, R APIs
• solid implementations of the most popular algorithms
• easy to use, well documented, multitude of examples
• fast and robust

CONS:
• only runs on Spark
• very young, still missing algorithms
• still pretty “low level”
27. Any problems left?

• Young projects, still require a lot of work
• Plenty of ML algorithms are, by definition, not a good fit for distribution
• Simply throwing more machines at the problem won’t always work (e.g. too much data movement, too many operations)
28. What can we do?

1. Go to Spark’s JIRA
2. Add a ticket for MLlib
3. Relax
29. Go smart(er)

• Compromise:
  • approximate
  • Lambda architecture (diagram: data in → … → serving layer)
• Compose algorithms:
  • e.g. clustering + an actual similarity check
• Use different algorithms:
  • for instance, instead of a closed-form solution use an iterative one
• Come up with new algorithms :-)
31. What we’ll see

• End-to-end example: similarity search
• Built-in algorithm/utility examples:
  • clustering
  • recommender systems (collaborative filtering)
  • logistic regression
  • model evaluation
32. Similarity search

• Problem: given an object (document, image), find all objects similar to it in a given set.
• Solution: similarity is a well-researched topic in mathematics!
• Why:
  • find the most popular objects
  • aggregate similar objects to declutter a view
  • find the k most similar objects
34. Similarity search - pipeline

Input data → Data preprocessing (e.g. tokenization, text normalization) → Vectorization → Similarity check (similarity algorithm) → Result

Example:
“This’s a Short test” → [“short”, “test”]
“This’s a not so long Test” → [“long”, “test”]
→ [1,1,0], [1,0,1] …
35. Similarity search - distributed pipeline

[Diagram: the same pipeline, with each stage (data preprocessing, vectorization, similarity check) running on multiple nodes of the cluster between input data and result.]
36. Similarity search I

• Brute-force solution:
  • pre-process the text
  • vectorize (in our case TF-IDF)
  • compute all possible pairs
  • compute the cosine similarity of each pair
37. Vectorization: TF-IDF

• Term Frequency-Inverse Document Frequency:
  • how important a word is to a document in a collection
  • higher when the word occurs frequently in a document
  • lower when the word is also common across the whole collection

“This’s a Short test” → [“short”, “test”]
“This’s a not so long Test” → [“long”, “test”]
→ [1/6, 1/3, 0], [1/6, 0, 1/3] …
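For reference, a common smoothed formulation (the variant implemented by MLlib’s IDF):

TFIDF(t, d, D) = TF(t, d) * IDF(t, D), where IDF(t, D) = log((|D| + 1) / (DF(t, D) + 1))

Here TF(t, d) is the number of occurrences of term t in document d, DF(t, D) the number of documents containing t, and |D| the size of the collection.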
38. TF-IDF

import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Tokenize each document into a sequence of terms
val documents: RDD[Seq[String]] = sc.textFile("...")
  .map(_.split(" ").toSeq)
// Hash terms into fixed-size term-frequency vectors
val hashingTF = new HashingTF()
val tf: RDD[Vector] = hashingTF.transform(documents)
tf.cache() // IDF makes two passes: one to fit the model, one to transform
val idf = new IDF().fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)
39. Similarity search I

type DocSimilarity = (String, Seq[(String, Double)])
case class Document(id: Long, doc: String)

def similarity(docs: RDD[Document]): RDD[DocSimilarity] = {
  val normalized: RDD[(Document, Set[String])] = docs.map(Normalizer.normalize(_))
  val vectorized: RDD[(Document, Vector)] = TfIdf.vectorize(normalized).cache()

  // Brute-force similarity: compare every unordered pair exactly once
  vectorized
    .cartesian(vectorized)
    .filter { case ((doc1, _), (doc2, _)) => doc1.id < doc2.id }
    .flatMap { case ((doc1, v1), (doc2, v2)) =>
      val similarity: Double = cosine(v1, v2)
      // Emit the pair in both directions so each document collects its matches
      Seq(
        (doc1.doc, (doc2.doc, similarity)),
        (doc2.doc, (doc1.doc, similarity))
      )
    }
    .combineByKey[Seq[(String, Double)]](
      (x: (String, Double)) => Seq(x),
      (acc: Seq[(String, Double)], y: (String, Double)) => y +: acc,
      (acc1: Seq[(String, Double)], acc2: Seq[(String, Double)]) => acc1 ++ acc2
    )
}
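The cosine helper used above is not shown on the slide; a minimal sketch (assuming MLlib’s Vector, densified via toArray) could look like:

import org.apache.spark.mllib.linalg.Vector

// Cosine similarity: dot product divided by the product of the norms
def cosine(v1: Vector, v2: Vector): Double = {
  val (a, b) = (v1.toArray, v2.toArray)
  val dot = (a, b).zipped.map(_ * _).sum
  val norms = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
  if (norms == 0.0) 0.0 else dot / norms
}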
40. Similarity search I - problems

• Computing all-pairs similarity:
  • O(n^2) comparisons
  • 10^6 documents
  • ~5*10^11 comparisons =>
  • ~6 days (at 10^3 comparisons/ms)
• Data shuffle size: O(nL^2)
• Largest reduce key: O(n)

n — # of docs, L — # of unique words in a doc
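As a sanity check on those numbers: 10^6 * (10^6 - 1) / 2 ≈ 5*10^11 pairs, and 5*10^11 comparisons / (10^3 comparisons/ms) = 5*10^8 ms ≈ 5.8 days.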
41. Why is data shuffle so bad?

[Diagram comparing hardware bandwidths - roughly 50 GB/s vs. 1 GB/s vs. 0.3 GB/s vs. 100-600 MB/s vs. 100 MB/s: memory access is orders of magnitude faster than disk and network, so shuffling data across the cluster dominates the cost.]
42. Similarity search II

[Diagram: the same distributed pipeline, with a “group by feature(s)” step inserted between vectorization and the similarity check.]
43. Similarity search II

• Problems:
  • What if there are no features to group by?
  • What if it produces clusters that are too big?
• Solution: cluster anyway, but smartly!
44. Locality sensitive hashing

• Similar objects end up in the same bucket (maximizes the % of collisions)
• A group of algorithms (for different similarity measures):
  • random projection for cosine
  • min-hash for Jaccard
  • …
• Problems:
  • possibility of false positives and false negatives
  • double-check the former, minimize the latter
  • might produce duplicate pairs!
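To make the random-projection idea concrete, a minimal, non-distributed sketch (all names hypothetical, not an MLlib API): each random hyperplane contributes one sign bit, so vectors pointing in similar directions tend to share the same bit signature, i.e. the same bucket.

import scala.util.Random

// Hypothetical sketch of random-projection LSH for cosine similarity
class RandomProjectionLSH(dim: Int, numBits: Int, seed: Long = 42L) {
  private val rnd = new Random(seed)
  // One random hyperplane (its normal vector) per signature bit
  private val planes: Array[Array[Double]] =
    Array.fill(numBits)(Array.fill(dim)(rnd.nextGaussian()))

  // The bucket key: one sign bit per hyperplane
  def signature(v: Array[Double]): String =
    planes.map { p =>
      if ((p, v).zipped.map(_ * _).sum >= 0) '1' else '0'
    }.mkString
}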
45. Similarity search III

type DocSimilarity = (String, Seq[(String, Double)])
case class Document(id: Long, doc: String)

def similarity(docs: RDD[Document]): RDD[DocSimilarity] = {
  val normalized: RDD[(Document, Set[String])] = docs.map(Normalizer.normalize(_))
  val vectorized: RDD[(Document, Vector)] = TfIdf.extract(normalized).cache()
  // LSH buckets candidate documents together, so cosine similarity only has to
  // be computed within each (much smaller) cluster
  val lsh = new LSH(data = vectorized, p = 65537, m = 1000, numRows = 1000, numBands = 25, minClusterSize = 2)
  val model = lsh.run
  val clusters: RDD[(Long, Iterable[SparseVector])] = model.clusters
  clusters.map { case (id, cluster) => cosines(cluster) }
}

• Sample implementations:
  • https://github.com/mrsqueeze/spark-hash (min-hash)
  • https://github.com/marufaytekin/lsh-spark (Charikar’s LSH for cosine)
46. Similarity search - results

INPUT
“パウダーファンデーションのパフがすぐに汚れてしまう。” (“Powder foundation’s puff gets dirty really fast”)

OUTPUT
0.80 “パウダーをつけるパフがすぐに汚れる。” (“The puff gets dirty really fast after applying the powder.”)
0.53 “パフがすぐに汚くなってしまう。” (“The puff gets dirty really fast.”)
0.30 “パウダリーファンデーションをつけるためのスポンジというかパフ、すぐに汚れて、ファンデをつける時にきれいに伸ばせなくなる。” (“The sponge for applying the powdery foundation gets dirty really fast, when using the foundation it doesn’t spread nicely.”)
48. Clustering

• An unsupervised learning problem which tries to group subsets of objects with one another based on some notion of similarity.
• Supported algorithms: k-means, Gaussian mixture, power iteration clustering (PIC), latent Dirichlet allocation (LDA)

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val data = sc.textFile("...")
// Parse each line of space-separated numbers into an MLlib vector
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()
// k = 2 clusters, at most 20 iterations
val clusters = KMeans.train(parsedData, 2, 20)
val prediction = clusters.predict(point)
49. Recommender systems

• Collaborative filtering
• Predicts the missing entries of a user/product ratings matrix

import org.apache.spark.mllib.recommendation.{ALS, Rating}

val data = sc.textFile("...")
val ratings = data.map(_.split(',') match {
  case Array(user, item, rate) =>
    Rating(user.toInt, item.toInt, rate.toDouble)
})
// Train ALS with rank 1, 20 iterations, regularization 0.01
val model = ALS.train(ratings, 1, 20, 0.01)
val usersProducts = ratings.map { case Rating(user, product, rate) =>
  (user, product)
}
val predictions = model.predict(usersProducts)
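One way to sanity-check the model, sketched along the lines of the MLlib collaborative filtering guide: join the predictions back with the actual ratings and compute the mean squared error.

// Key both actual ratings and predictions by (user, product), join them,
// and average the squared differences
val ratesAndPreds = ratings.map { case Rating(user, product, rate) =>
  ((user, product), rate)
}.join(predictions.map { case Rating(user, product, rate) =>
  ((user, product), rate)
})
val mse = ratesAndPreds.map { case (_, (actual, predicted)) =>
  math.pow(actual - predicted, 2)
}.mean()
println(s"Mean Squared Error = $mse")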
50. (Logistic) Regression

• An iterative algorithm - greatly benefits from caching
• Often used for binary classification (can be generalised to multiclass)

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.util.MLUtils

// <label> <idx1>:<val1> <idx2>:<val2> ...
val data = MLUtils.loadLibSVMFile(sc, "...").cache()
val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(10)
  .run(data)
model.predict(pointToPredict)
52. Supervised learning workflow

Raw data → cleaned/scaled data → split into training set and validating set → model creation (on the training set) → validation (on the validating set) → final model, which is then applied to incoming new data.
53. Model evaluation

• Certain ML algorithms create models
• How do we know if the model we got is good (enough)?
• Different types of evaluation depending on the type of ML algorithm:
  • classification: precision and recall (based on true/false positives/negatives)
  • regression: various methods based on the difference between predicted and actual values
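For reference, the standard definitions (TP/FP/FN = true positives, false positives, false negatives):

precision = TP / (TP + FP)
recall = TP / (TP + FN)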
54. Model evaluation

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils

val data = MLUtils.loadLibSVMFile(sc, "...")
// 60% for training, 40% held out for testing
val Array(training, test) = data.randomSplit(Array(0.6, 0.4), seed = 11L)
training.cache()
val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(2)
  .run(training)
// Output raw scores instead of 0/1 labels so we can sweep thresholds
model.clearThreshold()
val predictionAndLabels = test.map { case LabeledPoint(label, features) =>
  (model.predict(features), label)
}
val metrics = new BinaryClassificationMetrics(predictionAndLabels)
metrics.precisionByThreshold.foreach { case (t, p) =>
  println(s"Threshold: $t, Precision: $p")
}
metrics.recallByThreshold.foreach { case (t, r) =>
  println(s"Threshold: $t, Recall: $r")
}
56. Common pitfalls

1. Try to avoid groupByKey()
  • instead, try reduceByKey() (see the sketch below)
2. Don’t collect all the data in the driver:
  • collect() will copy all the elements to the driver node
  • instead persist it (file, DB)
3. Use cache()/persist() where necessary (check Spark’s WebUI)!
4. Code for failure and handle malformed input!
5. Remember about Serializable (everything shipped to the executors must be serializable)!
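A minimal sketch (assuming a hypothetical pairs RDD) of why the first point matters: reduceByKey combines values on each node before shuffling, while groupByKey ships every raw value across the network first.

// pairs: RDD[(String, Int)], e.g. (word, 1) tuples from a word count
// groupByKey shuffles every single value, then sums on the reducer side:
val slow = pairs.groupByKey().mapValues(_.sum)
// reduceByKey pre-aggregates locally (map-side combine), shuffling only
// one partial sum per key per partition:
val fast = pairs.reduceByKey(_ + _)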
57. Performance recap

1. Parallelism (not concurrency!) makes us faster
2. Network traffic makes us (really) slow:
  • keep data close to the processing units (stay local)
  • take note of operation order (see the sketch below)
  • don’t iterate more than necessary
3. In-memory computation/caching helps a lot (especially for iterative machine learning!)
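A small sketch (hypothetical RDDs and field names) of the operation-order point: filtering before a shuffle-heavy operation moves far less data than filtering after it.

// events: RDD[(Int, String)] keyed by user id; users: RDD[(Int, String)]
// Bad: joins everything, then throws most of it away
val bad = events.join(users).filter { case (_, (event, _)) => event == "click" }
// Better: shrink the data set before the shuffle
val good = events.filter { case (_, event) => event == "click" }.join(users)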
58. Where to go from here

• Get ideas: https://www.kaggle.com/wiki/DataScienceUseCases
• Get started with Spark:
  • http://spark.apache.org/docs/latest/quick-start.html
  • https://www.edx.org/course/introduction-big-data-apache-spark-uc-berkeleyx-cs100-1x
• Get started with MLlib:
  • http://spark.apache.org/docs/latest/mllib-guide.html
  • https://www.edx.org/course/scalable-machine-learning-uc-berkeleyx-cs190-1x
• Try out other frameworks and courses:
  • https://github.com/h2oai/sparkling-water
  • https://www.coursera.org/course/mmds
• Learn the basics:
  • https://www.coursera.org/learn/machine-learning
• Practical books:
  • “Advanced Analytics with Spark” — Sandy Ryza et al., O’Reilly Media
  • “Data Algorithms: Recipes for Scaling Up with Hadoop and Spark” — Mahmoud Parsian, O’Reilly Media
61. Can I has stream?

• Linear models (regression) can be trained in a streaming fashion (Spark 1.1+)
• Clustering can be done on streams (with streaming k-means)
• What if the data changes over time? — MLlib supports “forgetfulness” (see the decay factor below)
62. Can I has stream?

import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// ssc is an existing StreamingContext
val trainingData = ssc.textFileStream("...").map(Vectors.parse)
val testData = ssc.textFileStream("...").map(LabeledPoint.parse)
val model = new StreamingKMeans()
  .setK(2)
  .setDecayFactor(1.0) // 1.0 = remember everything; lower values forget old data
  .setRandomCenters(3, 0.0)
model.trainOn(trainingData)
model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()
ssc.start()
ssc.awaitTermination()
64. Seldon.io

• an open predictive platform
• provides content recommendation and predictive functionality
65. Prediction.io

• open-source ML server for building predictive engines
• event collection, algorithms, evaluation and querying of predictive results via REST
• uses Hadoop, HBase, Spark and Elasticsearch