Lambda architecture is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch- and stream-processing methods. A Lambda architecture system has three layers: batch processing, speed (or real-time) processing, and a serving layer that responds to queries; each comes with its own set of requirements.
The batch layer aims at perfect accuracy by processing the entire available dataset, an immutable, append-only set of raw data, with a distributed processing system. Output is typically stored in a read-only database, with each result completely replacing the existing precomputed views. Apache Hadoop, Pig, and Hive are the de facto batch-processing systems.
The speed layer processes data in a streaming fashion and provides real-time views based on the most recent data. It is responsible for filling the "gap" caused by the batch layer's lag in producing views of the latest data. The speed layer's views may not be as accurate as the batch layer's views computed from the full dataset, so they are eventually replaced by them. Traditionally, Apache Storm is used in this layer.
The serving layer stores the results from the batch and speed layers and responds to queries in a low-latency, ad-hoc way.
One example of Lambda architecture in a machine learning context is a fraud detection system. In the speed layer, incoming streaming data can be used for online learning, updating the model learned in the batch layer to incorporate recent events. Periodically, the model is rebuilt using the full dataset.
Why Spark for Lambda architecture? Traditionally, different technologies are used in the batch and speed layers. If your batch system is implemented with Apache Pig and your speed layer with Apache Storm, you have to write and maintain the same logic in Pig Latin and in Java/Scala. This very quickly becomes a maintenance nightmare. With Spark, we have a unified development framework for the batch and speed layers at scale. In this talk, an end-to-end example implemented in Spark will be shown, and we will discuss the development, testing, maintenance, and deployment of a Lambda architecture system with Apache Spark.
Lambda Architecture with Apache Spark, Next.ML Conference, 2015-01-17
Lambda Architecture with Apache Spark
DB Tsai
dbtsai@alpinenow.com
Machine Learning Engineering Lead @ Alpine Data Labs
Next.ML Conference
Jan 17, 2015
• Batch layer: manages the entire available dataset, an immutable, append-only set of raw data, using a distributed processing system.
• Speed layer: processes data in a streaming fashion with low latency, providing real-time views of the most recent data.
• Serving layer: stores the results from the batch and speed layers and responds to queries in a low-latency, ad-hoc way.
Lambda Architecture
Diagram: https://www.mapr.com/developercentral/lambda-architecture
• Traditionally, different technologies are used in the batch layer and the speed layer.
• If your batch system is implemented with Apache Pig and your speed layer with Apache Storm, you have to write and maintain the same logic in Pig Latin and in Java/Scala.
• This very quickly becomes a maintenance nightmare.
Traditional Lambda Architecture
Unified Development Framework
Batch Layer
• Empower users to iterate through the data by utilizing the in-memory cache.
• Logistic regression runs up to 100x faster than Hadoop M/R in memory.
• We're able to train exact models without doing any approximation.
Apache Spark: Utilizing In-Memory Cache for M/R Jobs
Iterative algorithms scan through the data each time; with Spark, data is cached in memory after the first iteration; quasi-Newton methods enhance the in-memory benefits.
[Chart: 150M rows; 921s vs. 97s per iteration.]
Speed Layer
• An extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams.
• Spark Streaming receives streaming input and divides the data into micro-batches, which are then processed by the Spark engine.
• As a result, developers can maintain the same Java/Scala code in the batch and speed layers.
MapReduce Review
• MapReduce: Simplified Data Processing on Large Clusters, 2004.
• Scales Linearly
• Data Locality
• Fault Tolerance in Data and Computation
Hard Disk Failures from Google's 2007 Study
• 1.7% of disks failed in the first
year of their life.
• Three-year-old disks were
failing at a rate of 8.6%.
• For a hypothetical eight-disk server, the probability that none of the disks fails in the first year is 81%.
• The key contributions of the MapReduce framework are not
the actual map and reduce functions, but the scalability and
fault-tolerance achieved with commodity hardware.
Hadoop MapReduce Review
• Mapper: loads the data and emits a set of key-value pairs.
• Reducer: collects the key-value pairs with the same key, processes them, and outputs the result.
• Combiner: can reduce shuffle traffic by combining key-value pairs locally before they go to the reducer.
• Good: built-in fault tolerance, scalable, and production-proven in industry.
• Bad: optimized for disk I/O without leveraging memory well; iterative algorithms go through disk I/O again and again; the primitive API is not easy or clean to develop with.
Spark MapReduce
• Spark also uses MapReduce as a programming model, but with much richer APIs in Java, Scala, and Python.
• With Scala's expressive APIs, 5-10x less code.
• More than just a distributed computation framework, Spark provides several pre-built components empowering users to implement applications faster and more easily:
- Spark Streaming
- Spark SQL
- MLlib (Machine Learning)
- GraphX (Graph Processing)
Hadoop M/R vs Spark M/R
• Hadoop
• Spark (the code contrasted on the original slide is not in this transcript; see the sketch below)
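A minimal sketch of the contrast, assuming the Scala spark-shell (which provides sc, the SparkContext): the same map/reduce job in Hadoop M/R needs a Mapper class, a Reducer class, and a driver program, typically dozens of lines of Java, while Spark expresses it in a few lines.

  // Sum of squares with Spark's RDD API: a "map" phase and a "reduce" phase.
  val input = sc.parallelize(1 to 100)
  val sumOfSquares = input.map(x => x * x).reduce(_ + _)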
Supervised Learning
• Binary classification: linear SVMs (SGD), logistic regression (L-BFGS and SGD), decision trees, random forests (Spark 1.2), and naïve Bayes.
• Multiclass classification: decision trees, naïve Bayes (coming soon: multinomial logistic regression in GLMNET).
• Regression: linear least squares (SGD), Lasso (SGD + soft-threshold), ridge regression (SGD), decision trees, and random forests (Spark 1.2).
• Currently, the regularization in linear models penalizes all the weights, including the intercept, which is not desired in some use cases. Alpine has a GLMNET implementation using OWL-QN that can exactly reproduce R's GLMNET package results at scale. We're in the process of merging it into the MLlib community.
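As a hedged illustration (not from the slides), training a binary logistic regression model with L-BFGS in MLlib's RDD-based API might look like the following; the data path is hypothetical:

  import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
  import org.apache.spark.mllib.util.MLUtils

  // Load labeled points in LIBSVM format (hypothetical path) and train.
  val training = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")
  val model = new LogisticRegressionWithLBFGS().run(training)
  // Predict the label of the first example's feature vector.
  val prediction = model.predict(training.first().features)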
Unsupervised Learning
• K-Means
• Collaborative filtering (ALS)
• SVD
• PCA
• Feature extraction and transformation
http://spark.apache.org/docs/1.2.0/mllib-guide.html
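For example (a sketch, not from the slides), clustering points read from a whitespace-separated text file; the path is hypothetical:

  import org.apache.spark.mllib.clustering.KMeans
  import org.apache.spark.mllib.linalg.Vectors

  // Parse each line into a dense feature vector and cache for iteration.
  val points = sc.textFile("data/points.txt")
    .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
    .cache()
  // Train with k = 3 clusters and at most 20 iterations.
  val model = KMeans.train(points, 3, 20)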
Resilient Distributed Datasets (RDDs)
• An RDD is a fault-tolerant collection of elements that can be operated on in parallel.
• RDDs can be created by parallelizing an existing collection in your driver program, or by referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, Hive, or any data source offering a Hadoop InputFormat.
• RDDs can be cached in memory or on disk.
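A minimal sketch of both creation paths, assuming the spark-shell's sc (the HDFS path is hypothetical):

  val fromCollection = sc.parallelize(Seq(1, 2, 3, 4, 5))    // parallelize a local collection
  val fromStorage = sc.textFile("hdfs:///path/to/data.txt")  // reference an external dataset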
RDD Persistence/Cache
• An RDD can be persisted using the persist() or cache() methods. The first time it is computed in an action, it will be kept in memory on the nodes. Spark's cache is fault-tolerant: if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it.
• A persisted RDD can be stored using a different storage level, allowing you, for example, to persist the dataset on disk, persist it in memory as serialized Java objects (to save space), replicate it across nodes, or store it off-heap in Tachyon.
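For example (a sketch with hypothetical paths), choosing between the default cache() and an explicit storage level:

  import org.apache.spark.storage.StorageLevel

  val lines = sc.textFile("hdfs:///path/to/data.txt").cache()  // default: MEMORY_ONLY
  val other = sc.textFile("hdfs:///path/to/other.txt")
    .persist(StorageLevel.MEMORY_AND_DISK_SER)  // serialized in memory, spills to disk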
RDD Operations: Two Types of Operations
• Transformations: create a new dataset from an existing one. They are lazy, in that they do not compute their results right away. By default, each transformed RDD may be recomputed each time you run an action on it. You may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. (Note: after transformations, the data can be imbalanced across executors; this can be addressed with repartition.)
• Actions: return a value to the driver program after running a computation on the dataset.
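A small sketch of the laziness just described (input path hypothetical): nothing is computed until the action runs.

  val words = sc.textFile("hdfs:///path/to/data.txt")
    .flatMap(_.split(" "))   // transformation: lazy
    .filter(_.nonEmpty)      // transformation: lazy
  words.persist()            // marks the RDD for caching; still nothing computed
  val n = words.count()      // action: triggers the whole computation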
Transformations
• map(func) - Return a new distributed dataset formed by passing
each element of the source through a function func.
• filter(func) - Return a new dataset formed by selecting those
elements of the source on which func returns true.
• flatMap(func) - Similar to map, but each input item can be
mapped to 0 or more output items (so func should return a Seq
rather than a single item).
• mapPartitions(func) - Similar to map, but runs separately on
each partition (block) of the RDD, so func must be of type
Iterator<T> => Iterator<U> when running on an RDD of type T.
http://spark.apache.org/docs/latest/programming-guide.html#transformations
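For instance, a quick sketch of these four transformations on toy data:

  val nums = sc.parallelize(1 to 10)
  val doubled = nums.map(_ * 2)         // one output element per input
  val evens = nums.filter(_ % 2 == 0)   // keeps elements where the predicate is true
  val tokens = sc.parallelize(Seq("a b", "c")).flatMap(_.split(" "))  // 0..n outputs per input
  val partSums = nums.mapPartitions(it => Iterator(it.sum))           // one sum per partition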
Actions
• reduce(func) - Aggregate the elements of the dataset
using a function func (which takes two arguments and
returns one). The function should be commutative and
associative so that it can be computed correctly in
parallel.
• collect() - Return all the elements of the dataset as an
array at the driver program. This is usually useful after a
filter or other operation that returns a sufficiently small
subset of the data.
• count(), first(), take(n), saveAsTextFile(path), etc.
http://spark.apache.org/docs/latest/programming-guide.html#actions
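And a matching sketch of the actions (the output path is hypothetical):

  val nums = sc.parallelize(1 to 100)
  val total = nums.reduce(_ + _)   // the function is commutative and associative
  val everything = nums.collect()  // brings all elements to the driver: keep it small
  val n = nums.count()
  val firstFive = nums.take(5)
  nums.saveAsTextFile("hdfs:///tmp/nums")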
Computing the Mean of Data
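The slide's code is not in this transcript; a minimal sketch of computing the mean with map and reduce, carrying (sum, count) pairs, might be:

  val data = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0))
  // Pair each value with a count of 1, then sum both components.
  val (sum, count) = data.map(x => (x, 1L))
    .reduce((a, b) => (a._1 + b._1, a._2 + b._2))
  val mean = sum / count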
Lab 1)
Spark Streaming: Discretized Streams
• A DStream is the basic abstraction provided by Spark Streaming on top of Spark's RDDs.
• Each RDD in a DStream contains data from a certain interval. Any operation applied on a DStream translates internally to operations on the underlying RDDs.
Word Count in Batch Processing
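The original code slide is missing from this transcript; the classic batch word count in Spark (paths hypothetical) looks like:

  val counts = sc.textFile("hdfs:///path/to/input.txt")
    .flatMap(_.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
  counts.saveAsTextFile("hdfs:///path/to/output")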
Word Count in Streaming Processing
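Again the slide's code is missing; a sketch of the streaming version, reading from a socket (host and port are assumptions matching the Netcat setup in Lab 2):

  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val ssc = new StreamingContext(sc, Seconds(2))       // 2-second micro-batches
  val lines = ssc.socketTextStream("localhost", 9999)  // assumed host/port
  val wordCounts = lines.flatMap(_.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
  wordCounts.print()
  ssc.start()
  ssc.awaitTermination()

Note how the per-batch logic is identical to the batch version, which is exactly the point of using Spark in both layers.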
Lab 2)
• Need another bash shell in Docker to run Netcat as a data server.
• In production, people often use Kafka as the data server.
• docker ps // to find the current Docker container ID
• docker exec -it <CONTAINER_ID> bash // to launch a new shell
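• In the new shell, start Netcat as the data server, for example: nc -lk 9999 // port 9999 is an assumption; it must match the port the streaming code connects to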
UpdateStateByKey Operation
The updateStateByKey operation allows you to maintain arbitrary state while continuously updating it with new information.
• Define the state: the state can be of an arbitrary data type.
• Define the state update function: specify with a function how to update the state using the previous state and the new values from the input stream.
UpdateStateByKey Operation
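The slide's code is not in the transcript; a minimal sketch, continuing the streaming word-count example above (ssc and wordCounts as defined there), keeps a running count per word:

  // Update function: combine this batch's counts with the previous state.
  def updateCount(newValues: Seq[Int], state: Option[Int]): Option[Int] =
    Some(newValues.sum + state.getOrElse(0))

  ssc.checkpoint("hdfs:///tmp/checkpoint")  // required for stateful operations; path hypothetical
  val runningCounts = wordCounts.updateStateByKey[Int](updateCount _)
  runningCounts.print()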
Computing the Mean of Streaming Data
• The current sum and count at time t have to be accessible at time t + 1 to compute the new mean of the stream.
• Without updateStateByKey, the operations at time t and t + 1 are independent.
• A checkpoint directory has to be configured so that the state persists across time.
Computing the Mean of Streaming Data
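The slide's code is also missing; one way to sketch it, assuming a DStream[Double] called numbers parsed from the input (an assumption, not from the slides), is to carry a (sum, count) state and derive the mean from it:

  // Fold each batch's values into the running (sum, count) state.
  def updateSumCount(values: Seq[Double],
                     state: Option[(Double, Long)]): Option[(Double, Long)] = {
    val (s, c) = state.getOrElse((0.0, 0L))
    Some((s + values.sum, c + values.size))
  }

  ssc.checkpoint("hdfs:///tmp/checkpoint")   // persists state across batches; path hypothetical
  val means = numbers.map(x => ("mean", x))  // single key, so all values share one state
    .updateStateByKey[(Double, Long)](updateSumCount _)
    .mapValues { case (s, c) => s / c }
  means.print()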