This document discusses Apache Spark, a fast and general cluster computing system. It summarizes Spark's capabilities for machine learning workflows, including feature preparation, model training, evaluation, and production use. It also outlines new high-level APIs for data science in Spark, including DataFrames, machine learning pipelines, and an R interface, with the goal of making Spark more similar to single-machine libraries like SciKit-Learn. These new APIs are designed to make Spark easier to use for machine learning and interactive data analysis.
2. What is Apache Spark?
Fast and general cluster computing engine that extends Google’s MapReduce model
Improves efficiency through:
– In-memory data sharing
– General computation graphs
Improves usability through:
– Rich APIs in Java, Scala, Python
– Interactive shell
Up to 100× faster than Hadoop MapReduce; 2-5× less code
4. About Databricks
Founded by the creators of Spark; remains the largest contributor
Offers a hosted service, Databricks Cloud
– Spark on EC2 with notebooks, dashboards, scheduled jobs
6. Spark Programming Model
Write programs in terms of parallel transformations on distributed datasets
Resilient Distributed Datasets (RDDs)
– Collections of objects that can be stored in memory or on disk across a cluster
– Built via parallel transformations (map, filter, …)
– Automatically rebuilt on failure
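A minimal PySpark sketch of this model (the SparkContext setup and the numbers are illustrative assumptions, not from the slides):

from pyspark import SparkContext

sc = SparkContext(appName="RDDBasics")

nums = sc.parallelize(range(1, 1000001))      # distributed dataset
squares = nums.map(lambda x: x * x)           # transformation (lazy)
evens = squares.filter(lambda x: x % 2 == 0)  # transformation (lazy)
evens.persist()                               # keep partitions in memory
print(evens.count())                          # action: triggers the job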
7. Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns
lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()
messages.filter(lambda s: "foo" in s).count()
messages.filter(lambda s: "bar" in s).count()
. . .
[Diagram: the driver ships tasks to workers; each worker reads one HDFS block (Block 1-3) of lines, caches its partition of messages (Cache 1-3), and returns results. lines is the base RDD, errors and messages are transformed RDDs, and count() is an action.]
Full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
9. Example: Logistic Regression
data = spark.textFile(...).map(readPoint).cache()
w = numpy.random.rand(D)  # random initial weight vector

for i in range(iterations):
    # gradient of the logistic loss, summed over all points
    gradient = data.map(lambda p:
        (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x
    ).reduce(lambda a, b: a + b)
    w -= gradient
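The snippet above assumes a parser readPoint and a dimension D that the slide does not define. A minimal sketch of those pieces (the names and file format are hypothetical):

from collections import namedtuple
from math import exp
import numpy

D = 10  # number of features (hypothetical)
Point = namedtuple("Point", ["x", "y"])

def readPoint(line):
    # Hypothetical format: D feature values followed by a label in {-1, +1}
    nums = [float(v) for v in line.split()]
    return Point(x=numpy.array(nums[:D]), y=nums[D])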
10. Example: Logistic Regression
[Chart: running time (s) vs. number of iterations (1-30) for Hadoop and Spark. Hadoop: 110 s per iteration; Spark: 80 s for the first iteration, 1 s for later iterations]
15. Machine Learning Workflow
Machine learning isn’t just about training a model!
– In many cases most of the work is in feature preparation
– Important to test ideas interactively
– Must then evaluate the model and use it in production
Spark includes tools to perform this whole workflow
16. Machine Learning Workflow
Step                 Traditional               Spark
Feature preparation  MapReduce, Hive           RDDs, Spark SQL
Model training       Mahout, custom code       MLlib
Model evaluation     Custom code               MLlib
Production use       Export (e.g. to Storm)    model.predict()
(All of the Spark components operate on RDDs)
17. Short Example
# Load data using SQL
ctx.jsonFile("tweets.json").registerTempTable("tweets")
points = ctx.sql("select latitude, longitude from tweets")

# Train a machine learning model
model = KMeans.train(points, 10)

# Apply it to a stream
sc.twitterStream(...) \
    .map(lambda t: (model.predict(t.location), 1)) \
    .reduceByWindow("5s", lambda a, b: a + b)
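The batch half of this example can be made runnable roughly as follows (a sketch assuming Spark 1.3-era APIs, a SQLContext named ctx, and a tweets.json file with latitude and longitude fields; the streaming part is left as on the slide):

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="TweetClusters")
ctx = SQLContext(sc)

ctx.jsonFile("tweets.json").registerTempTable("tweets")
points = ctx.sql("select latitude, longitude from tweets") \
    .map(lambda row: [row.latitude, row.longitude])  # rows -> feature vectors
model = KMeans.train(points, 10)  # cluster tweet locations into 10 groups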
21. Goal for 2015
Augment Spark with higher-level data science APIs similar to single-machine libraries
DataFrames, ML Pipelines, R interface
22. DataFrames
Collections of structured data similar to R and pandas data frames
Automatically optimized via Spark SQL:
– Columnar storage
– Code-generated execution
df = jsonFile("tweets.json")
df[df["user"] == "matei"] \
    .groupBy("date") \
    .sum("retweets")
[Chart: running time of the same computation using the Python RDD API, the Scala RDD API, and the DataFrame API]
Out now in Spark 1.3
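Made self-contained, the snippet above would look roughly like this (a sketch assuming Spark 1.3-era APIs, a SQLContext, and a tweets.json file with user, date, and retweets fields):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="DataFrameDemo")
ctx = SQLContext(sc)

df = ctx.jsonFile("tweets.json")   # schema inferred from the JSON
(df[df["user"] == "matei"]         # filter, optimized by Spark SQL
    .groupBy("date")
    .sum("retweets")
    .show())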
23. Machine Learning Pipelines
High-level API similar to SciKit-Learn
Operates on DataFrames
Grid search and cross-validation to tune parameters (see the sketch below)
tokenizer = Tokenizer()
tf = HashingTF(numFeatures=1000)
lr = LogisticRegression()
pipe = Pipeline(stages=[tokenizer, tf, lr])
model = pipe.fit(df)
[Diagram: DataFrame → tokenizer → TF → LR → model]
Out now in Spark 1.3
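The grid search and cross-validation mentioned above can be sketched with the pyspark.ml tuning API (the column names, parameter values, and evaluator are illustrative assumptions; these classes landed in Python after Spark 1.3):

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import BinaryClassificationEvaluator

tokenizer = Tokenizer(inputCol="text", outputCol="words")
tf = HashingTF(inputCol="words", outputCol="features", numFeatures=1000)
lr = LogisticRegression(maxIter=10)
pipe = Pipeline(stages=[tokenizer, tf, lr])

# Try several regularization strengths via 3-fold cross-validation
grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()
cv = CrossValidator(estimator=pipe, estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(), numFolds=3)
model = cv.fit(df)  # df assumed to have "text" and "label" columns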
24. Spark R Interface
Exposes DataFrames and ML pipelines in R
Parallelizes calls to R code
df = jsonFile("tweets.json")
summarize(
  group_by(
    df[df$user == "matei", ],
    "date"),
  sum("retweets"))
Target: Spark 1.4 (June)
25. To Learn More
Downloads & docs: spark.apache.org
Try Spark in Databricks Cloud: databricks.com
Spark Summit: spark-summit.org