1. LARGE-SCALE ANALYTICS WITH
APACHE SPARK
THOMSON REUTERS R&D
TWIN CITIES HADOOP USER GROUP
FRANK SCHILDER
SEPTEMBER 22, 2014
2. THOMSON REUTERS
• The Thomson Reuters Corporation
– 50,000+ employees
– 2,000+ journalists at news desks worldwide
– Offices in more than 100 countries
– $12 billion revenue/year
• Products: intelligent information for professionals and enterprises
– Legal: WestlawNext legal search engine
– Financial: Eikon financial platform; Datastream real-time share price data
– News: Reuters News
– Science: EndNote, ISI journal impact factor, Derwent World Patent Index
– Tax & Accounting: OneSource tax information
• Corporate R&D
– Around 40 researchers and developers (NLP, IR, ML)
– Three R&D sites in the US and one in the UK: Eagan, MN; Rochester, NY; NYC; and London
– We are hiring… email me at frank.schilder@thomsonreuters.com
3. OVERVIEW
• Speed
– Data locality, scalability, fault tolerance
• Ease of Use
– Scala, interactive shell
• Generality
– Spark SQL, MLlib
• Comparing ML frameworks
– Vowpal Wabbit (VW)
– Sparkling Water
• The Future
4. WHAT IS SPARK?
Apache Spark is a fast and general engine
for large-scale data processing.
• Speed: runs iterative MapReduce jobs faster through in-memory computation: Resilient Distributed Datasets (RDDs)
• Ease of use: enables interactive data analysis in Scala, Python, or Java; interactive shell
• Generality: offers libraries for SQL, streaming, and large-scale analytics (graph processing and machine learning)
• Integrated with Hadoop: runs on Hadoop 2's YARN cluster
5. ACKNOWLEDGMENTS
• Matei Zaharia and the AMPLab and Databricks teams for fantastic learning material and tutorials on Spark
• Hiroko Bretz, Thomas Vacek, Dezhao Song, Terry Heinze for Spark and Scala support and running experiments
• Adam Glaser for his time as a TSAP intern
• Mahadev Wudali and Mike Edwards for letting us play in the "sandbox" (cluster)
7. PRIMARY GOALS OF SPARK
• Extend the MapReduce model to better support two common classes of analytics apps:
– Iterative algorithms (machine learning, graphs)
– Interactive data mining (R, Python)
• Enhance programmability:
– Integrate into the Scala programming language
– Allow interactive use from the Scala interpreter
– Make Spark easily accessible from other languages (Python, Java)
8. MOTIVATION
• Acyclic data flow is inefficient for applications that repeatedly reuse a working set of data:
– Iterative algorithms (machine learning, graphs)
– Interactive data mining tools (R, Python)
• With current frameworks, apps reload data from stable storage on each query
10. SOLUTION: Resilient
Distributed Datasets (RDDs)
• Allow apps to keep working sets in memory for efficient reuse
• Retain the attractive properties of MapReduce:
– Fault tolerance, data locality, scalability
• Support a wide range of applications
11. PROGRAMMING MODEL
Resilient distributed datasets (RDDs)
– Immutable, partitioned collections of objects
– Created through parallel transformations (map, filter, groupBy, join, …) on data in stable storage
– Functions follow the same patterns as Scala operations on lists (see the sketch below)
– Can be cached for efficient reuse
80+ actions on RDDs
– count, reduce, save, take, first, …
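To make the list analogy concrete, a minimal sketch (sc.parallelize and collect are standard Spark API; the numbers are made up):

// The same map call works on a local Scala list and on a distributed RDD:
List(1, 2, 3).map(_ * 2)                    // local: List(2, 4, 6)

val rdd = sc.parallelize(List(1, 2, 3))     // distribute the list as an RDD
rdd.map(_ * 2)                              // lazy transformation: no job yet
   .collect()                               // action: returns Array(2, 4, 6)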
12. EXAMPLE: LOG MINING
Load error messages from a log into memory, then interactively search for various patterns:

val lines = spark.textFile("hdfs://...")            // base RDD
val errors = lines.filter(_.startsWith("ERROR"))    // transformed RDD
val messages = errors.map(_.split("\t")(2))
val cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("timeout")).count      // action: triggers the computation
cachedMsgs.filter(_.contains("license")).count
. . .

[Diagram: the driver ships tasks to the workers; each worker reads one HDFS block, caches its partition of the messages RDD, and sends results back to the driver]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
13. BEHAVIOR WITH NOT ENOUGH RAM
[Chart: iteration time (s) vs. % of working set in memory]

% of working set in memory    Iteration time (s)
Cache disabled                68.8
25%                           58.1
50%                           40.7
75%                           29.7
Fully cached                  11.5
14. RDD Fault Tolerance
RDDs maintain lineage information that can be used to reconstruct lost partitions.

Ex:
messages = textFile(...).filter(_.startsWith("ERROR"))
                        .map(_.split("\t")(2))

Lineage: HDFS File --filter(func = _.contains(...))--> Filtered RDD --map(func = _.split(...))--> Mapped RDD
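A small sketch of inspecting lineage from the shell (toDebugString is a standard RDD method; the HDFS path is hypothetical):

// Each RDD remembers the chain of transformations that built it;
// a lost partition is recomputed from this lineage, not restored from a replica.
val messages = sc.textFile("hdfs://logs/app.log")
  .filter(_.startsWith("ERROR"))
  .map(_.split("\t")(2))
println(messages.toDebugString)   // prints the lineage graph of the RDD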
15. Fault Recovery Results
[Chart: iteration time (s) for 10 iterations, comparing "No Failure" with "Failure in the 6th Iteration". The first iteration takes 119 s; normal iterations take 56-59 s; the 6th iteration with a failure takes 81 s while lost partitions are recomputed from lineage, after which times return to 57-59 s.]
17. INTERACTIVE SHELL
• Data analysis can be done in the interactive shell
– Start from your local machine or on a cluster
– Access a multi-core processor with local[n]
– A Spark context is already set up for you: SparkContext sc
• Load data from anywhere (local files, HDFS, Cassandra, Amazon S3, etc.)
• Start analyzing your data: processing starts only when an action is called (see the shell sketch below)
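A minimal shell session sketch (the file name data.txt is made up; spark-shell, local[n], and sc are as in Spark 1.x):

$ ./bin/spark-shell --master local[4]       # use 4 cores on the local machine

scala> val lines = sc.textFile("data.txt")  // local file; could also be hdfs://, s3n://, ...
scala> lines.count()                        // processing starts here: count is an action
scala> lines.first()                        // first line of the file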
18. ANALYZE YOUR DATA
• Word count in one line
• List the word counts
• Broadcast variables (e.g. a dictionary or a stop word list), because local variables need to be distributed to the workers
(a sketch of all three steps follows below)
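A sketch of the three steps (the file name and stop word list are made up; sc.broadcast and .value are the standard Spark broadcast API):

// Word count in one line:
val counts = sc.textFile("data.txt").flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

// List the word counts on the driver:
counts.collect().foreach(println)

// Broadcast a stop word list once to every worker instead of
// shipping it inside each task's closure:
val stopWords = sc.broadcast(Set("the", "a", "of"))
counts.filter { case (word, _) => !stopWords.value.contains(word) }.collect()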
20. PYTHON SHELL & IPYTHON
• The interactive shell can also be started as a Python shell, called pySpark
• Start analyzing your data in Python now
• Since it's Python, you may want to use IPython
– (a command shell for interactive programming in your browser)
21. IPYTHON AND SPARK
• The IPython notebook environment and pySpark:
– Document data analysis results
– Carry out machine learning experiments
– Visualize results with matplotlib or other visualization libraries
– Combine with NLP libraries such as NLTK
• PySpark does not offer the full functionality of the Spark shell in Scala (yet)
• Some bugs (e.g. problems with Unicode)
23. PROJECTS AT R&D USING SPARK
• Entity linking
– Alternative name extraction from Wikipedia, Freebase, free text, and ClueWeb12, a several-TB web collection (planned)
• Large-scale text data analysis:
– Creating fingerprints for entities/events
– Temporal slot filling: assigning a begin and an end time stamp to a slot filler (e.g. A is an employee of company B from BEGIN to END)
– Large-scale text classification of Reuters News Archive articles (10 years)
• Language model computation used for search query analysis
24. SPARK MODULES
• Spark Streaming:
– Processing real-time data streams
• Spark SQL:
– Support for structured data (JSON, Parquet) and relational queries (SQL)
• MLlib:
– Machine learning library
• GraphX:
– New graph processing API
26. SPARK SQL
• Relational queries expressed in
– SQL
– HiveQL
– a Scala domain-specific language (DSL)
• New type of RDD: SchemaRDD
– An RDD composed of Row objects
– Schema defined explicitly, or inferred from a Parquet file, a JSON data set, or data stored in Hive
• Spark SQL is in alpha: the API may change in the future!
(a sketch follows below)
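A minimal sketch against the Spark 1.1-era API (the Person case class and people.txt are made up; SQLContext, the createSchemaRDD implicit, and registerTempTable follow the Spark SQL programming guide of that time):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD   // implicitly converts RDDs of case classes to SchemaRDDs

case class Person(name: String, age: Int)

// infer the schema from the case class and register the RDD as a table
val people = sc.textFile("people.txt").map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
people.registerTempTable("people")

// SQL queries return another SchemaRDD of Row objects
val teens = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
teens.map(row => "Name: " + row(0)).collect().foreach(println)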
29. MLLIB
• A machine learning module that comes with Spark
• Shipped since Spark 0.8.0
• Provides various machine learning algorithms for classification and clustering
• Sparse vector representation since 1.0.0
• New features in the recently released version 1.1.0:
– Includes a standard statistics library (e.g. correlation, hypothesis testing, sampling)
– More algorithms ported to Java and Python
– More feature engineering: TF-IDF, singular value decomposition (SVD)
30. MLLIB
• Provides various machine learning algorithms:
– Classification:
• Logistic regression, support vector machines (SVM), naïve Bayes, decision trees
– Regression:
• Linear regression, regression trees
– Collaborative filtering:
• Alternating least squares (ALS)
– Clustering:
• K-means
– Decomposition:
• Singular value decomposition (SVD), principal component analysis (PCA)
(a training sketch follows below)
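A hedged sketch of training one of these algorithms with the MLlib 1.x API (the input format "label f1 f2 ..." is made up; LogisticRegressionWithSGD, LabeledPoint, and Vectors are MLlib classes):

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// parse lines of the form "label f1 f2 f3 ..." (hypothetical format)
val data = sc.textFile("train.txt").map { line =>
  val parts = line.split(' ').map(_.toDouble)
  LabeledPoint(parts.head, Vectors.dense(parts.tail))
}.cache()   // cached: SGD makes many passes over the data

val model = LogisticRegressionWithSGD.train(data, 100)   // 100 iterations

// training accuracy
val correct = data.filter(p => model.predict(p.features) == p.label).count()
println("Accuracy: " + correct.toDouble / data.count())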
31. OTHER ML FRAMEWORKS
• Mahout
• LIBLINEAR
• MATLAB
• scikit-learn
• GraphLab
• R
• Weka
• Vowpal Wabbit
• BigML
32. LARGE-SCALE ML INFRASTRUCTURE
• More data implies bigger training sets and richer feature sets
• More data with a simple ML algorithm often beats small data with a complicated ML algorithm
• Large-scale ML requires big data infrastructure:
– Faster processing: Hadoop, Spark
– Feature engineering: principal component analysis, the hashing trick (see the sketch below), Word2Vec
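As an illustration of the hashing trick mentioned above, a minimal sketch in plain Scala (the bucket count is an arbitrary choice, not a recommendation):

// map arbitrarily many raw text features into a fixed-width vector
// without building or shipping a feature dictionary
def hashFeatures(tokens: Seq[String], numBuckets: Int = 1 << 18): Array[Double] = {
  val v = new Array[Double](numBuckets)
  for (t <- tokens) {
    val idx = ((t.hashCode % numBuckets) + numBuckets) % numBuckets  // non-negative bucket
    v(idx) += 1.0
  }
  v
}

hashFeatures(Seq("spark", "hashing", "trick", "spark"))   // the "spark" bucket gets 2.0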
34. PREDICTIVE ANALYTICS WITH MLLIB
http://databricks.com/blog/2014/03/26/spark-sql-manipulating-structured-data-using-spark-2.html
35. VW AND MLLIB COMPARISON
• We compared Vowpal Wabbit and MLlib in December 2013 (work with Tom Vacek)
• Vowpal Wabbit (VW) is a large-scale ML tool developed by John Langford (Microsoft)
• Task: binary text classification on Reuters articles, compared on:
– Ease of implementation
– Feature extraction
– Parameter tuning
– Speed
– Accessibility from programming languages
36. VW VS. MLLIB
• Ease of implementation
– VW: a user tool designed for ML, not a programming language
– MLlib: used from a programming language; some built-in support now (e.g. regularization)
• Feature extraction
– VW: specific capabilities for bi-grams, prefixes, etc.
– MLlib: no limit in terms of creating features
• Parameter tuning
– VW: no parameter search capability, but multiple parameters can be hand-tuned
– MLlib: offers cross-validation
• Speed
– VW: highly optimized, very fast even on a single machine with multiple cores
– MLlib: fast with lots of machines
• Accessibility from programming languages
– VW: written in C++, a few wrappers (e.g. Python)
– MLlib: Scala, Python, Java
• Conclusion at the end of 2013: VW had a slight advantage, but MLlib has caught up in at least some areas (e.g. sparse feature representation)
37. FINDINGS SO FAR
• Large-scale extraction is a great fit for Spark when working with large data sets (> 1 GB)
• Ease of use makes Spark an ideal framework for rapid prototyping
• MLlib is a fast-growing ML library, but "under development"
• Vowpal Wabbit has been shown to crunch even large data sets with ease

[Chart: training time and 0/1 loss for VW, LIBLINEAR, and Spark local[4]]
38. OTHER ML FRAMEWORKS
• An internship project by Adam Glaser compared various ML frameworks on 5 standard data sets (NIPS)
– Mass-spectrometric data (cancer), handwritten digit detection, Reuters news classification, synthetic data sets
– The data sets were not very big, but had up to 1,000,000 features
• Evaluated accuracy of the generated models and training speed
• H2O, GraphLab, and Microsoft Azure showed strong performance in terms of accuracy and training time
41. WHAT IS NEXT?
• 0xdata plans to release Sparkling Water in October 2014
• Microsoft Azure also offers a strong platform with multiple ML algorithms and an intuitive user interface
• GraphLab has GraphLab Canvas™ for visualizing your data and plans to incorporate more ML algorithms
44. CONCLUSIONS
• Apache Spark is the most active project in the Hadoop ecosystem
• Spark offers speed and ease of use because of
– RDDs
– the interactive shell, and
– easy integration of Scala, Java, and Python scripts
• Integrated in Spark are modules for
– easy data access via Spark SQL
– large-scale analytics via MLlib
• Other ML frameworks enable analytics as well
• Evaluate which framework is the best fit for your data problem
45. THE FUTURE?
• Apache Spark will be a unified platform to run various workloads:
– Batch
– Streaming
– Interactive
• And connect with different runtime systems:
– Hadoop
– Cassandra
– Mesos
– Cloud
– …
46. THE FUTURE?
• Spark will extend its offering of large-scale algorithms for doing complex analytics:
– Graph processing
– Classification
– Clustering
– …
• Other frameworks will continue to offer similar capabilities.
• If you can't beat them, join them.
49. Example: Logistic Regression
Goal: find the best line separating two sets of points

[Diagram: "+" and "-" points in the plane; a random initial line is iteratively moved toward the target separating line]
50. Example: Logistic Regression
// load the points once and cache them in memory across iterations
val data = spark.textFile(...).map(readPoint).cache()

var w = Vector.random(D)   // random initial separating plane

for (i <- 1 to ITERATIONS) {
  // compute the logistic-loss gradient over all points in parallel
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient   // gradient-descent step on the driver
}
println("Final w: " + w)
51. Logistic Regression Performance
[Chart: running time (s) vs. number of iterations (1, 5, 10, 20, 30). Hadoop: 127 s per iteration; Spark: 174 s for the first iteration, 6 s for each further iteration.]
52. Spark Scheduler
Dryad-like DAGs
Pipelines functions within a stage
Cache-aware work reuse & locality
Partitioning-aware to avoid shuffles

[Diagram: an RDD lineage graph (RDDs A-G) built from map, union, groupBy, and join, cut into three stages at shuffle boundaries; partitions already in cache are skipped]
53. Spark Operations
Transformations (define a new RDD):
map, filter, sample, groupByKey, reduceByKey, sortByKey, flatMap, union, join, cogroup, cross, mapValues

Actions (return a result to the driver program):
collect, reduce, count, save, lookupKey

(a sketch of the lazy behavior follows below)
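A small sketch of the difference (the file name is made up): transformations only build the RDD graph; an action triggers a job and returns a result to the driver.

val words = sc.textFile("data.txt").flatMap(_.split(" "))   // transformation: no job yet
val longWords = words.filter(_.length > 5)                  // still no job
longWords.count()                                           // action: the job runs now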