Learn more about Advanced Analytics at http://www.alpinenow.com
Lambda Architecture with Apache Spark
DB Tsai
dbtsai@alpinenow.com
Machine Learning Engineering Lead @ Alpine Data Labs
Next.ML Conference
Jan 17, 2015
Lambda Architecture
•  Batch layer: manages the entire dataset, an immutable,
append-only set of raw data, using a distributed processing
system.
•  Speed layer: processes data in a streaming fashion with low
latency, providing real-time views of the most recent data.
•  Serving layer: stores the results from the batch and speed
layers and responds to ad-hoc queries with low latency.
Lambda Architecture
https://www.mapr.com/developercentral/lambda-architecture
Traditional Lambda Architecture
•  Traditionally, the batch layer and the speed layer use
different technologies.
•  If your batch system is implemented with Apache Pig and
your speed layer with Apache Storm, you have to write and
maintain the same logic twice: once in Pig Latin and once
in Java/Scala.
•  This quickly becomes a maintenance nightmare.
Unified Development Framework
Batch Layer
•  Empowers users to iterate through the data by utilizing
the in-memory cache.
•  With data cached in memory, logistic regression runs up
to 100x faster than on Hadoop M/R.
•  We’re able to train exact models without any approximation.
Apache Spark: Utilizing the In-Memory Cache for M/R Jobs
•  Iterative algorithms scan through the data on every iteration.
•  With Spark, data is cached in memory after the first iteration.
•  Quasi-Newton methods enhance the in-memory benefits.
[Chart: running times of 921s and 97s on 150m rows]
Speed Layer
•  Spark Streaming is an extension of the core Spark API that enables
scalable, high-throughput, fault-tolerant processing of live data streams.
•  Spark Streaming receives streaming input and divides the data
into batches, which are then processed by the Spark engine.
•  As a result, developers can maintain the same Java/Scala code
in both the batch and speed layers.
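The micro-batching idea above can be sketched in plain Python. This is an illustration only, not Spark Streaming code: the `discretize` and `count_by_key` helpers and the sample events are hypothetical, and no Spark API is used. It shows one function serving as the shared logic for both layers.

```python
from collections import Counter, defaultdict

def count_by_key(records):
    """Shared business logic: count occurrences of each key."""
    return Counter(key for key, _ts in records)

def discretize(records, interval):
    """Group (key, timestamp) records into batches of `interval` seconds."""
    batches = defaultdict(list)
    for key, ts in records:
        batches[int(ts // interval)].append((key, ts))
    return [batches[i] for i in sorted(batches)]

# Hypothetical (key, timestamp-in-seconds) events.
events = [("a", 0.2), ("b", 0.7), ("a", 1.1), ("a", 2.5), ("b", 2.9)]

# Batch layer: run the logic once over the full dataset.
batch_view = count_by_key(events)

# Speed layer: run the same function on each micro-batch and merge the results.
speed_view = Counter()
for micro_batch in discretize(events, interval=1.0):
    speed_view += count_by_key(micro_batch)
```

Because both layers call `count_by_key`, there is only one piece of logic to maintain, and the two views agree once every batch has been processed.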
MapReduce Review
•  MapReduce: Simplified Data Processing on Large
Clusters, 2004.
•  Scales Linearly
•  Data Locality
•  Fault Tolerance in Data and Computation
Hard Disk Failures from Google’s 2007 Study
•  1.7% of disks failed in the first
year of their life.
•  Three-year-old disks were
failing at a rate of 8.6%.
•  For a hypothetical eight-disk server, the probability that
no disk fails in the first year is roughly 87% (0.983^8),
assuming disks fail independently.
•  The key contributions of the MapReduce framework are not
the actual map and reduce functions, but the scalability and
fault-tolerance achieved with commodity hardware.
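The eight-disk estimate can be checked with a short calculation. This sketch assumes disks fail independently, which the slide does not state, and `prob_no_failure` is a hypothetical helper. With the two failure rates quoted above it gives roughly 0.87 for first-year disks and 0.49 for three-year-old disks.

```python
def prob_no_failure(annual_failure_rate, n_disks):
    """Probability that none of n_disks fails within a year,
    assuming failures are independent (an assumption, not from the study)."""
    return (1.0 - annual_failure_rate) ** n_disks

p_first_year = prob_no_failure(0.017, 8)   # first-year failure rate from the slide
p_three_year = prob_no_failure(0.086, 8)   # three-year-old failure rate from the slide

print(f"first-year disks: {p_first_year:.3f}")
print(f"three-year-old disks: {p_three_year:.3f}")
```

Even a single small server is therefore likely to lose a disk over its lifetime, which is why fault tolerance has to be built into the framework.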
Hadoop MapReduce Review
•  Mapper: Loads the data and emits a set of key-value pairs.
•  Reducer: Collects the key-value pairs that share a key,
processes them, and outputs the result.
•  Combiner: Can reduce shuffle traffic by combining key-value
pairs locally before they are sent to the reducer.
•  Good: Built-in fault tolerance, scalable, and production-proven
in industry.
•  Bad: Optimized for disk IO without leveraging memory well;
iterative algorithms go through disk IO again and again;
the low-level API is verbose and hard to develop against.
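The mapper, combiner, and reducer roles can be sketched as a word count in plain Python. This is a single-process illustration, not Hadoop code: the function names, the `shuffle` helper, and the two input splits are all hypothetical.

```python
from collections import defaultdict
from itertools import chain

def mapper(line):
    """Emit a (word, 1) pair for every word in a line."""
    return [(word, 1) for word in line.split()]

def combiner(pairs):
    """Pre-aggregate pairs locally so fewer pairs cross the shuffle."""
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return list(local.items())

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    """Collect all values for one key and output the result."""
    return key, sum(values)

# Two hypothetical input splits, each handled by one map task.
splits = [["to be or not to be"], ["to do is to be"]]

map_outputs = [list(chain.from_iterable(mapper(line) for line in split))
               for split in splits]
combined = [combiner(pairs) for pairs in map_outputs]

# Number of key-value pairs that would be shuffled in each case.
pairs_without_combiner = sum(len(pairs) for pairs in map_outputs)
pairs_with_combiner = sum(len(pairs) for pairs in combined)

grouped = shuffle(chain.from_iterable(combined))
word_counts = dict(reducer(key, values) for key, values in grouped.items())
```

The combiner changes only how many pairs are shuffled, not the final counts, because addition is associative and commutative.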
Spark MapReduce
•  Spark also uses MapReduce as a programming model, but
with much richer APIs in Java, Scala, and Python.
•  With Scala’s expressive APIs, 5-10x less code.
•  More than a distributed computation framework, Spark provides
several pre-built components that help users implement
applications faster and more easily.
- Spark Streaming
- Spark SQL
- MLlib (Machine Learning)
- GraphX (Graph Processing)
Hadoop M/R vs Spark M/R
•  Hadoop
•  Spark
Supervised Learning
•  Binary Classification: linear SVMs (SGD), logistic regression
(L-BFGS and SGD), decision trees, random forests (Spark 1.2), and
naïve Bayes.
•  Multiclass Classification: decision trees and naïve Bayes
(coming soon: multinomial logistic regression in GLMNET).
•  Regression: linear least squares (SGD), Lasso (SGD +
soft-threshold), ridge regression (SGD), decision trees, and random
forests (Spark 1.2).
•  Currently, the regularization in linear models penalizes all the
weights, including the intercept, which is undesirable in some
use cases. Alpine has a GLMNET implementation using OWLQN that
reproduces the results of R’s GLMNET package exactly while scaling out.
We’re in the process of contributing it to MLlib.
Unsupervised Learning
•  K-Means
•  Collaborative filtering (ALS)
•  SVD
•  PCA
•  Feature extraction and transformation
http://spark.apache.org/docs/1.2.0/mllib-guide.html
Resilient Distributed Datasets (RDDs)
•  RDD is a fault-tolerant collection of elements that can be
operated on in parallel.
•  RDDs can be created by parallelizing an existing
collection in your driver program, or by referencing a dataset
in an external storage system, such as a shared
filesystem, HDFS, HBase, Hive, or any data source
offering a Hadoop InputFormat.
•  RDDs can be cached in memory or on disk.
RDD Persistence/Cache
•  An RDD can be persisted by calling its persist() or cache()
method. The first time it is computed in an action, it
is kept in memory on the nodes. Spark’s cache is
fault-tolerant: if any partition of an RDD is lost, it is
automatically recomputed using the transformations
that originally created it.
•  A persisted RDD can be stored at a different storage
level, allowing you, for example, to persist the dataset on
disk, persist it in memory as serialized Java objects
(to save space), replicate it across nodes, or store it
off-heap in Tachyon.
RDD Operations - two types of operations
•  Transformations: Create a new dataset from an existing
one. They are lazy, in that they do not compute their
results right away. By default, each transformed RDD may
be recomputed each time you run an action on it. You
may also persist an RDD in memory using the persist (or
cache) method, in which case Spark keeps the
elements around on the cluster for much faster access
the next time you query it. (Note: after transformations,
the data can be unevenly distributed across executors;
repartition can rebalance it.)
•  Actions: Return a value to the driver program after
running a computation on the dataset.
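The difference between lazy transformations, actions, and caching can be illustrated with a toy class in plain Python. `LazyDataset` is a hypothetical stand-in written for this sketch, not Spark's RDD implementation. It only records how a dataset was derived and counts how often it is materialized.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records lineage, computes only on an action."""

    def __init__(self, source, parent=None, func=None):
        self._source = source      # base data (root dataset only)
        self._parent = parent      # dataset this one was derived from
        self._func = func          # function applied to the parent's elements
        self._persist = False
        self._cached = None
        self.compute_count = 0     # how many times this dataset was materialized

    def map(self, func):
        # Transformation: returns a new dataset and does no work yet.
        return LazyDataset(None, parent=self, func=func)

    def cache(self):
        # Mark the dataset so its first materialization is kept.
        self._persist = True
        return self

    def _materialize(self):
        if self._cached is not None:
            return self._cached
        self.compute_count += 1
        if self._parent is None:
            data = list(self._source)
        else:
            data = [self._func(x) for x in self._parent._materialize()]
        if self._persist:
            self._cached = data
        return data

    def collect(self):
        # Action: triggers the computation and returns the values.
        return self._materialize()

    def count(self):
        # Action: triggers the computation and returns the number of elements.
        return len(self._materialize())


base = LazyDataset(range(5))
squared = base.map(lambda x: x * x)          # nothing is computed yet
lazy_before_action = squared.compute_count

squared.collect()
squared.count()                              # second action recomputes
uncached_runs = squared.compute_count

cached = base.map(lambda x: x + 1).cache()
cached.collect()
cached.count()                               # second action reuses the cache
cached_runs = cached.compute_count
```

Without `cache()`, every action walks the lineage again. With it, the work is done once and later actions reuse the stored result.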
Transformations
•  map(func) - Return a new distributed dataset formed by passing
each element of the source through a function func.
•  filter(func) - Return a new dataset formed by selecting those
elements of the source on which func returns true.
•  flatMap(func) - Similar to map, but each input item can be
mapped to 0 or more output items (so func should return a Seq
rather than a single item).
•  mapPartitions(func) - Similar to map, but runs separately on
each partition (block) of the RDD, so func must be of type
Iterator<T> => Iterator<U> when running on an RDD of type T.
http://spark.apache.org/docs/latest/programming-guide.html#transformations
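The four transformations above have simple single-machine analogues. This plain-Python sketch models a dataset as a list of partitions; the `rdd_*` helper names and the sample data are hypothetical, and the code does not use Spark's API.

```python
from itertools import chain

# A hypothetical dataset split into two partitions.
partitions = [[1, 2, 3], [4, 5]]

def rdd_map(parts, func):
    # Apply func to every element, keeping the partitioning.
    return [[func(x) for x in part] for part in parts]

def rdd_filter(parts, func):
    # Keep only the elements for which func returns True.
    return [[x for x in part if func(x)] for part in parts]

def rdd_flat_map(parts, func):
    # func returns a sequence per element; the sequences are flattened.
    return [list(chain.from_iterable(func(x) for x in part)) for part in parts]

def rdd_map_partitions(parts, func):
    # func receives an iterator over one whole partition and returns an iterator.
    return [list(func(iter(part))) for part in parts]

doubled = rdd_map(partitions, lambda x: x * 2)
evens = rdd_filter(partitions, lambda x: x % 2 == 0)
repeated = rdd_flat_map(partitions, lambda x: [x] * (x % 3))
partition_sums = rdd_map_partitions(partitions, lambda it: iter([sum(it)]))
```

Here `rdd_flat_map` can emit zero, one, or several outputs per input, while `rdd_map_partitions` sees each partition as a whole, which is useful when per-partition setup is expensive.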
Actions
•  reduce(func) - Aggregate the elements of the dataset
using a function func (which takes two arguments and
returns one). The function should be commutative and
associative so that it can be computed correctly in
parallel.
•  collect() - Return all the elements of the dataset as an
array at the driver program. This is usually useful after a
filter or other operation that returns a sufficiently small
subset of the data.
•  count(), first(), take(n), saveAsTextFile(path), etc.
http://spark.apache.org/docs/latest/programming-guide.html#actions
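A small plain-Python experiment shows why the function passed to reduce should be commutative and associative. `parallel_reduce` is a hypothetical helper that mimics reducing each partition separately and then combining the partial results; it is not Spark code.

```python
from functools import reduce

def parallel_reduce(partitions, func):
    """Reduce each partition independently, then reduce the partial results."""
    partials = [reduce(func, part) for part in partitions if part]
    return reduce(func, partials)

data = [1, 2, 3, 4, 5, 6]
layout_a = [[1, 2, 3], [4, 5, 6]]      # the same data in two partitions
layout_b = [[1, 2], [3, 4], [5, 6]]    # the same data in three partitions

def add(x, y):
    return x + y    # commutative and associative

def sub(x, y):
    return x - y    # neither commutative nor associative

sum_a = parallel_reduce(layout_a, add)
sum_b = parallel_reduce(layout_b, add)
diff_a = parallel_reduce(layout_a, sub)
diff_b = parallel_reduce(layout_b, sub)
sequential_diff = reduce(sub, data)
```

Addition gives the same answer for both layouts, while subtraction gives a different answer for each layout and for the sequential run, so its result would depend on how the data happens to be partitioned.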
Computing the mean of data
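One way to compute a mean in the map/reduce style is to map each element to a (sum, count) pair and reduce the pairs component-wise. The sketch below is plain Python with hypothetical sample data, not the code from the talk.

```python
from functools import reduce

def mean(partitions):
    # map: turn every element into a (sum, count) pair
    pairs = [(float(x), 1) for part in partitions for x in part]
    # reduce: add pairs component-wise (commutative and associative)
    total, count = reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]), pairs)
    return total / count

# Hypothetical data in two partitions of unequal size.
partitions = [[1, 2, 3], [4, 5, 6, 7]]
result = mean(partitions)

# Averaging the per-partition means is wrong when partition sizes differ.
naive = sum(sum(p) / len(p) for p in partitions) / len(partitions)
```

Carrying the count alongside the sum keeps the reduction associative, so the result does not depend on how the data is partitioned.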
Lab 1)
Spark Streaming: Discretized Streams
•  DStream is the basic abstraction provided by Spark
Streaming over Spark’s RDDs.
•  Each RDD in a DStream contains data from a certain
interval. Any operation applied on a DStream translates
to operations on the underlying RDDs internally.
Word Count in Batch Processing
Word Count in Streaming Processing
Lab 2)
Lab 2)
•  You need another bash shell in the Docker container to run
Netcat as a data server.
•  In production, people often use Kafka as the data server.
•  docker ps // to find the ID of the running container
•  docker exec -it <container ID> bash // to launch a new shell
Lab 2)
UpdateStateByKey Operation
The updateStateByKey operation allows you to maintain
arbitrary state while continuously updating it with new
information.
•  Define the state - The state can be of arbitrary data type.
•  Define the state update function - Specify with a function
how to update the state using the previous state and the
new values from input stream.
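The two steps above can be sketched in plain Python. The `update_state_by_key` function here is a hypothetical single-process imitation written for this sketch, not Spark's implementation. It assumes the update function receives the new values for a key plus the previous state, with `None` standing in for a key that has no state yet.

```python
def update_state_by_key(state, batch, update_func):
    """Apply update_func(new_values, previous_state) to every key seen so far."""
    new_values = {}
    for key, value in batch:
        new_values.setdefault(key, []).append(value)
    keys = set(state) | set(new_values)
    return {k: update_func(new_values.get(k, []), state.get(k)) for k in keys}

def running_count(new_values, previous):
    # previous is None the first time a key appears
    return (previous or 0) + sum(new_values)

# Three hypothetical micro-batches of (key, value) pairs.
batches = [[("a", 1), ("b", 1)], [("a", 1)], [("b", 1), ("b", 1)]]

state = {}
for batch in batches:
    state = update_state_by_key(state, batch, running_count)
```

The state here is a plain integer per key, but any type works as long as the update function knows how to fold new values into it.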
UpdateStateByKey Operation
Computing the Mean of Streaming Data
•  The current sum and count at time t have to be accessible
at time (t + 1) to compute the new mean of the stream.
•  Without updateStateByKey, the operations at time t
and (t + 1) are independent.
•  A checkpoint directory has to be configured to
persist the state across time intervals.
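A minimal plain-Python sketch of the idea: the state carried from one interval to the next is a (sum, count) pair, and the mean is derived from it. The `update_sum_count` function and the sample batches are hypothetical, and checkpointing is not modeled.

```python
def update_sum_count(new_values, previous):
    """State update: fold this interval's values into the running (sum, count)."""
    total, count = previous if previous is not None else (0.0, 0)
    return total + sum(new_values), count + len(new_values)

# Hypothetical values arriving in three consecutive intervals.
batches = [[1.0, 2.0, 3.0], [10.0], [4.0, 4.0]]

state = None
running_means = []
for values in batches:
    state = update_sum_count(values, state)
    running_means.append(state[0] / state[1])

# Without carried state, each interval only sees its own data.
independent_means = [sum(v) / len(v) for v in batches]
```

The running means reflect all data seen so far, whereas the independent per-interval means jump around with each batch.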
Computing the Mean of Streaming Data
Lab 3)
Online Learning Example
Thank you.

Weitere ähnliche Inhalte

Was ist angesagt?

Apache Spark 1.6 with Zeppelin - Transformations and Actions on RDDs
Apache Spark 1.6 with Zeppelin - Transformations and Actions on RDDsApache Spark 1.6 with Zeppelin - Transformations and Actions on RDDs
Apache Spark 1.6 with Zeppelin - Transformations and Actions on RDDsTimothy Spann
 
Apache Spark Streaming: Architecture and Fault Tolerance
Apache Spark Streaming: Architecture and Fault ToleranceApache Spark Streaming: Architecture and Fault Tolerance
Apache Spark Streaming: Architecture and Fault ToleranceSachin Aggarwal
 
Big Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and ZeppelinBig Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and Zeppelinprajods
 
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...Lucidworks
 
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, ScalaLambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, ScalaHelena Edelson
 
Lambda architecture: from zero to One
Lambda architecture: from zero to OneLambda architecture: from zero to One
Lambda architecture: from zero to OneSerg Masyutin
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkRahul Jain
 
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch AnalysisNoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch AnalysisHelena Edelson
 
Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015Evan Chan
 
Sa introduction to big data pipelining with cassandra &amp; spark west mins...
Sa introduction to big data pipelining with cassandra &amp; spark   west mins...Sa introduction to big data pipelining with cassandra &amp; spark   west mins...
Sa introduction to big data pipelining with cassandra &amp; spark west mins...Simon Ambridge
 
Kick-Start with SMACK Stack
Kick-Start with SMACK StackKick-Start with SMACK Stack
Kick-Start with SMACK StackKnoldus Inc.
 
Kappa Architecture on Apache Kafka and Querona: datamass.io
Kappa Architecture on Apache Kafka and Querona: datamass.ioKappa Architecture on Apache Kafka and Querona: datamass.io
Kappa Architecture on Apache Kafka and Querona: datamass.ioPiotr Czarnas
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache SparkMammoth Data
 
Building real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark StreamingBuilding real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark Streamingdatamantra
 
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...Spark Summit
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streamingdatamantra
 
Container Orchestrator Smackdown @ContinousLifecycle
Container Orchestrator Smackdown @ContinousLifecycleContainer Orchestrator Smackdown @ContinousLifecycle
Container Orchestrator Smackdown @ContinousLifecycleMichael Mueller
 
Spark Summit EU talk by Herman van Hovell
Spark Summit EU talk by Herman van HovellSpark Summit EU talk by Herman van Hovell
Spark Summit EU talk by Herman van HovellSpark Summit
 
Real-time personal trainer on the SMACK stack
Real-time personal trainer on the SMACK stackReal-time personal trainer on the SMACK stack
Real-time personal trainer on the SMACK stackAnirvan Chakraborty
 
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Helena Edelson
 

Was ist angesagt? (20)

Apache Spark 1.6 with Zeppelin - Transformations and Actions on RDDs
Apache Spark 1.6 with Zeppelin - Transformations and Actions on RDDsApache Spark 1.6 with Zeppelin - Transformations and Actions on RDDs
Apache Spark 1.6 with Zeppelin - Transformations and Actions on RDDs
 
Apache Spark Streaming: Architecture and Fault Tolerance
Apache Spark Streaming: Architecture and Fault ToleranceApache Spark Streaming: Architecture and Fault Tolerance
Apache Spark Streaming: Architecture and Fault Tolerance
 
Big Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and ZeppelinBig Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and Zeppelin
 
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
 
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, ScalaLambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
 
Lambda architecture: from zero to One
Lambda architecture: from zero to OneLambda architecture: from zero to One
Lambda architecture: from zero to One
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
 
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch AnalysisNoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
 
Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015
 
Sa introduction to big data pipelining with cassandra &amp; spark west mins...
Sa introduction to big data pipelining with cassandra &amp; spark   west mins...Sa introduction to big data pipelining with cassandra &amp; spark   west mins...
Sa introduction to big data pipelining with cassandra &amp; spark west mins...
 
Kick-Start with SMACK Stack
Kick-Start with SMACK StackKick-Start with SMACK Stack
Kick-Start with SMACK Stack
 
Kappa Architecture on Apache Kafka and Querona: datamass.io
Kappa Architecture on Apache Kafka and Querona: datamass.ioKappa Architecture on Apache Kafka and Querona: datamass.io
Kappa Architecture on Apache Kafka and Querona: datamass.io
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Building real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark StreamingBuilding real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark Streaming
 
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streaming
 
Container Orchestrator Smackdown @ContinousLifecycle
Container Orchestrator Smackdown @ContinousLifecycleContainer Orchestrator Smackdown @ContinousLifecycle
Container Orchestrator Smackdown @ContinousLifecycle
 
Spark Summit EU talk by Herman van Hovell
Spark Summit EU talk by Herman van HovellSpark Summit EU talk by Herman van Hovell
Spark Summit EU talk by Herman van Hovell
 
Real-time personal trainer on the SMACK stack
Real-time personal trainer on the SMACK stackReal-time personal trainer on the SMACK stack
Real-time personal trainer on the SMACK stack
 
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
 

Andere mochten auch

∂u∂u Multi-Tenanted Framework: Distributed Near Duplicate Detection for Big Data
∂u∂u Multi-Tenanted Framework: Distributed Near Duplicate Detection for Big Data∂u∂u Multi-Tenanted Framework: Distributed Near Duplicate Detection for Big Data
∂u∂u Multi-Tenanted Framework: Distributed Near Duplicate Detection for Big DataPradeeban Kathiravelu, Ph.D.
 
powerpoint feb
powerpoint febpowerpoint feb
powerpoint febimu409
 
Adaptive Intrusion Detection Using Learning Classifiers
Adaptive Intrusion Detection Using Learning ClassifiersAdaptive Intrusion Detection Using Learning Classifiers
Adaptive Intrusion Detection Using Learning ClassifiersPatrick Nicolas
 
ViTeNA: An SDN-Based Virtual Network Embedding Algorithm for Multi-Tenant Dat...
ViTeNA: An SDN-Based Virtual Network Embedding Algorithm for Multi-Tenant Dat...ViTeNA: An SDN-Based Virtual Network Embedding Algorithm for Multi-Tenant Dat...
ViTeNA: An SDN-Based Virtual Network Embedding Algorithm for Multi-Tenant Dat...Pradeeban Kathiravelu, Ph.D.
 
machine learning in the age of big data: new approaches and business applicat...
machine learning in the age of big data: new approaches and business applicat...machine learning in the age of big data: new approaches and business applicat...
machine learning in the age of big data: new approaches and business applicat...Armando Vieira
 
Intrusion detection using data mining
Intrusion detection using data miningIntrusion detection using data mining
Intrusion detection using data miningbalbeerrawat
 
Analysis and Design for Intrusion Detection System Based on Data Mining
Analysis and Design for Intrusion Detection System Based on Data MiningAnalysis and Design for Intrusion Detection System Based on Data Mining
Analysis and Design for Intrusion Detection System Based on Data MiningPritesh Ranjan
 
Apache Hive 0.13 Performance Benchmarks
Apache Hive 0.13 Performance BenchmarksApache Hive 0.13 Performance Benchmarks
Apache Hive 0.13 Performance BenchmarksHortonworks
 
Using Machine Learning in Networks Intrusion Detection Systems
Using Machine Learning in Networks Intrusion Detection SystemsUsing Machine Learning in Networks Intrusion Detection Systems
Using Machine Learning in Networks Intrusion Detection SystemsOmar Shaya
 
Enterprise Mobility Transforming Public Service and Citizen Engagement
Enterprise Mobility Transforming Public Service and Citizen EngagementEnterprise Mobility Transforming Public Service and Citizen Engagement
Enterprise Mobility Transforming Public Service and Citizen EngagementSAP Asia Pacific
 
Strata Singapore: Gearpump Real time DAG-Processing with Akka at Scale
Strata Singapore: GearpumpReal time DAG-Processing with Akka at ScaleStrata Singapore: GearpumpReal time DAG-Processing with Akka at Scale
Strata Singapore: Gearpump Real time DAG-Processing with Akka at ScaleSean Zhong
 
Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep divet3rmin4t0r
 
Efficient Duplicate Detection Over Massive Data Sets
Efficient Duplicate Detection Over Massive Data SetsEfficient Duplicate Detection Over Massive Data Sets
Efficient Duplicate Detection Over Massive Data SetsPradeeban Kathiravelu, Ph.D.
 
Big Data Solutions Executive Overview
Big Data Solutions Executive OverviewBig Data Solutions Executive Overview
Big Data Solutions Executive OverviewRCG Global Services
 
DataStax | Building a Spark Streaming App with DSE File System (Rocco Varela)...
DataStax | Building a Spark Streaming App with DSE File System (Rocco Varela)...DataStax | Building a Spark Streaming App with DSE File System (Rocco Varela)...
DataStax | Building a Spark Streaming App with DSE File System (Rocco Varela)...DataStax
 
Data Mining and Intrusion Detection
Data Mining and Intrusion Detection Data Mining and Intrusion Detection
Data Mining and Intrusion Detection amiable_indian
 

Andere mochten auch (20)

∂u∂u Multi-Tenanted Framework: Distributed Near Duplicate Detection for Big Data
∂u∂u Multi-Tenanted Framework: Distributed Near Duplicate Detection for Big Data∂u∂u Multi-Tenanted Framework: Distributed Near Duplicate Detection for Big Data
∂u∂u Multi-Tenanted Framework: Distributed Near Duplicate Detection for Big Data
 
powerpoint feb
powerpoint febpowerpoint feb
powerpoint feb
 
Adaptive Intrusion Detection Using Learning Classifiers
Adaptive Intrusion Detection Using Learning ClassifiersAdaptive Intrusion Detection Using Learning Classifiers
Adaptive Intrusion Detection Using Learning Classifiers
 
ViTeNA: An SDN-Based Virtual Network Embedding Algorithm for Multi-Tenant Dat...
ViTeNA: An SDN-Based Virtual Network Embedding Algorithm for Multi-Tenant Dat...ViTeNA: An SDN-Based Virtual Network Embedding Algorithm for Multi-Tenant Dat...
ViTeNA: An SDN-Based Virtual Network Embedding Algorithm for Multi-Tenant Dat...
 
February 2014 HUG : Pig On Tez
February 2014 HUG : Pig On TezFebruary 2014 HUG : Pig On Tez
February 2014 HUG : Pig On Tez
 
machine learning in the age of big data: new approaches and business applicat...
machine learning in the age of big data: new approaches and business applicat...machine learning in the age of big data: new approaches and business applicat...
machine learning in the age of big data: new approaches and business applicat...
 
Intrusion detection using data mining
Intrusion detection using data miningIntrusion detection using data mining
Intrusion detection using data mining
 
Hadoop to spark-v2
Hadoop to spark-v2Hadoop to spark-v2
Hadoop to spark-v2
 
Ids presentation
Ids presentationIds presentation
Ids presentation
 
Analysis and Design for Intrusion Detection System Based on Data Mining
Analysis and Design for Intrusion Detection System Based on Data MiningAnalysis and Design for Intrusion Detection System Based on Data Mining
Analysis and Design for Intrusion Detection System Based on Data Mining
 
Continuous Analytics & Optimisation using Apache Spark (Big Data Analytics, L...
Continuous Analytics & Optimisation using Apache Spark (Big Data Analytics, L...Continuous Analytics & Optimisation using Apache Spark (Big Data Analytics, L...
Continuous Analytics & Optimisation using Apache Spark (Big Data Analytics, L...
 
Apache Hive 0.13 Performance Benchmarks
Apache Hive 0.13 Performance BenchmarksApache Hive 0.13 Performance Benchmarks
Apache Hive 0.13 Performance Benchmarks
 
Using Machine Learning in Networks Intrusion Detection Systems
Using Machine Learning in Networks Intrusion Detection SystemsUsing Machine Learning in Networks Intrusion Detection Systems
Using Machine Learning in Networks Intrusion Detection Systems
 
Enterprise Mobility Transforming Public Service and Citizen Engagement
Enterprise Mobility Transforming Public Service and Citizen EngagementEnterprise Mobility Transforming Public Service and Citizen Engagement
Enterprise Mobility Transforming Public Service and Citizen Engagement
 
Strata Singapore: Gearpump Real time DAG-Processing with Akka at Scale
Strata Singapore: GearpumpReal time DAG-Processing with Akka at ScaleStrata Singapore: GearpumpReal time DAG-Processing with Akka at Scale
Strata Singapore: Gearpump Real time DAG-Processing with Akka at Scale
 
Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep dive
 
Efficient Duplicate Detection Over Massive Data Sets
Efficient Duplicate Detection Over Massive Data SetsEfficient Duplicate Detection Over Massive Data Sets
Efficient Duplicate Detection Over Massive Data Sets
 
Big Data Solutions Executive Overview
Big Data Solutions Executive OverviewBig Data Solutions Executive Overview
Big Data Solutions Executive Overview
 
DataStax | Building a Spark Streaming App with DSE File System (Rocco Varela)...
DataStax | Building a Spark Streaming App with DSE File System (Rocco Varela)...DataStax | Building a Spark Streaming App with DSE File System (Rocco Varela)...
DataStax | Building a Spark Streaming App with DSE File System (Rocco Varela)...
 
Data Mining and Intrusion Detection
Data Mining and Intrusion Detection Data Mining and Intrusion Detection
Data Mining and Intrusion Detection
 

Ähnlich wie 2015 01-17 Lambda Architecture with Apache Spark, NextML Conference

2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...DB Tsai
 
2014-08-14 Alpine Innovation to Spark
2014-08-14 Alpine Innovation to Spark2014-08-14 Alpine Innovation to Spark
2014-08-14 Alpine Innovation to SparkDB Tsai
 
Spark Overview and Performance Issues
Spark Overview and Performance IssuesSpark Overview and Performance Issues
Spark Overview and Performance IssuesAntonios Katsarakis
 
Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsHeadaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsDatabricks
 
A Step to programming with Apache Spark
A Step to programming with Apache SparkA Step to programming with Apache Spark
A Step to programming with Apache SparkKnoldus Inc.
 
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference

  • 1. Lambda Architecture with Apache Spark. DB Tsai (dbtsai@alpinenow.com), Machine Learning Engineering Lead @ Alpine Data Labs. Next.ML Conference, Jan 17, 2015. Learn more about Advanced Analytics at http://www.alpinenow.com
  • 2. Lambda Architecture
      • Batch layer: manages the full dataset, an immutable, append-only set of raw data, using a distributed processing system.
      • Speed layer: processes data in a streaming fashion with low latency; the real-time views are computed from the most recent data.
      • Serving layer: stores the results from the batch and speed layers and responds to ad-hoc queries with low latency.
  • 3. Lambda Architecture. https://www.mapr.com/developercentral/lambda-architecture
  • 4. Traditional Lambda Architecture
      • Traditionally, the batch layer and the speed layer use different technologies.
      • If your batch system is implemented with Apache Pig and your speed layer with Apache Storm, you have to write and maintain the same logic twice: in a SQL-like language (Pig Latin) and in Java/Scala.
      • This quickly becomes a maintenance nightmare.
  • 5. Unified Development Framework
  • 6. Batch Layer
      • The in-memory cache lets users iterate over the data.
      • Logistic regression runs up to 100x faster than Hadoop M/R when the data is in memory.
      • We are able to train exact models without any approximation.
  • 7. Apache Spark: Using the In-Memory Cache for M/R Jobs
      • Iterative algorithms scan through the data on every iteration.
      • With Spark, data is cached in memory after the first iteration.
      • Quasi-Newton methods enhance the in-memory benefits.
      • [Chart: 921 s vs. 97 s on 150 million rows]
  • 8. Speed Layer
      • Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams.
      • Spark Streaming receives streaming input and divides the data into batches, which are then processed by the Spark engine.
      • As a result, developers can maintain the same Java/Scala code in the batch and speed layers.
  • 9. MapReduce Review
      • "MapReduce: Simplified Data Processing on Large Clusters", 2004.
      • Scales linearly
      • Data locality
      • Fault tolerance in data and computation
  • 10. Hard Disk Failures from Google's 2007 Study
      • 1.7% of disks failed in their first year of life.
      • Three-year-old disks were failing at a rate of 8.6%.
      • For a hypothetical eight-disk server, the probability that none of the disks fails in the first year is 81%.
      • The key contributions of the MapReduce framework are not the map and reduce functions themselves, but the scalability and fault tolerance achieved on commodity hardware.
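A quick way to sanity-check the eight-disk figure is sketched below. It assumes disk failures are independent and that every disk fails at the quoted annual rate; both are simplifications not stated on the slide, and the function name is made up for this illustration. Under those assumptions the 1.7% first-year rate gives roughly 87% rather than 81% (81% would correspond to a per-disk rate of about 2.6%), so the slide's number presumably rests on a different rate or model.

```python
# Probability that no disk in a server fails within one year,
# ASSUMING independent failures at a fixed annual rate (a simplification).
def prob_no_failures(annual_failure_rate: float, num_disks: int) -> float:
    return (1.0 - annual_failure_rate) ** num_disks


# Rates quoted on the slide, applied to a hypothetical eight-disk server.
p_first_year = prob_no_failures(0.017, 8)   # disks in their first year
p_third_year = prob_no_failures(0.086, 8)   # three-year-old disks

print(f"first-year disks: {p_first_year:.3f}")
print(f"three-year-old disks: {p_third_year:.3f}")
```

With three-year-old disks the same calculation drops to roughly 49%, which is the practical argument for building fault tolerance into the framework.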
  • 11. Hadoop MapReduce Review
      • Mapper: loads the data and emits a set of key-value pairs.
      • Reducer: collects the key-value pairs that share a key, processes them, and outputs the result.
      • Combiner: can reduce shuffle traffic by combining key-value pairs locally before they go to the reducer.
      • Good: built-in fault tolerance, scalable, and production-proven in industry.
      • Bad: optimized for disk I/O without making good use of memory; iterative algorithms go through disk I/O again and again; the primitive API makes it hard to write clean code.
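The mapper, combiner, and reducer roles above can be illustrated with a word count in plain Python. This is only a single-process sketch of the data flow, not Hadoop's API; the function names and the two input splits are invented for the example.

```python
from collections import defaultdict
from itertools import chain


def mapper(line):
    # Emit a (word, 1) pair for every word in the input line.
    return [(word, 1) for word in line.split()]


def combine(pairs):
    # Sum the values per key. Used locally as the combiner and
    # globally as the reducer, which works because addition is associative.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())


# Two hypothetical input splits, each handled by one map task.
splits = [["a b a", "b a"], ["c a", "c c"]]

# Map, then combine locally per split so fewer pairs are shuffled.
raw_pairs = sum(len(mapper(line)) for split in splits for line in split)
combined = [
    combine(chain.from_iterable(mapper(line) for line in split))
    for split in splits
]
shuffled_pairs = sum(len(partial) for partial in combined)

# Reduce: merge the partial counts from all splits.
counts = dict(combine(chain.from_iterable(combined)))

print(counts)
print(f"pairs before combining: {raw_pairs}, after: {shuffled_pairs}")
```

The combiner does not change the result; it only shrinks the number of pairs that would cross the network during the shuffle.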
  • 12. Spark MapReduce
      • Spark also uses MapReduce as a programming model, but with much richer APIs in Java, Scala, and Python.
      • With Scala's expressive APIs, 5-10x less code.
      • Spark is not just a distributed computation framework; it provides several pre-built components that help users implement applications faster and more easily:
          - Spark Streaming
          - Spark SQL
          - MLlib (Machine Learning)
          - GraphX (Graph Processing)
  • 13. Hadoop M/R vs. Spark M/R (panels: Hadoop, Spark)
  • 14. Supervised Learning
      • Binary classification: linear SVMs (SGD), logistic regression (L-BFGS and SGD), decision trees, random forests (Spark 1.2), and naïve Bayes.
      • Multiclass classification: decision trees and naïve Bayes (coming soon: multinomial logistic regression in GLMNET).
      • Regression: linear least squares (SGD), Lasso (SGD + soft-threshold), ridge regression (SGD), decision trees, and random forests (Spark 1.2).
      • Currently, regularization in the linear models penalizes all the weights including the intercept, which is undesirable in some use cases. Alpine has a GLMNET implementation using OWLQN that exactly reproduces the results of R's GLMNET package while scaling out. We are in the process of contributing it to MLlib.
  • 15. Unsupervised Learning
      • K-Means
      • Collaborative filtering (ALS)
      • SVD
      • PCA
      • Feature extraction and transformation
      http://spark.apache.org/docs/1.2.0/mllib-guide.html
  • 16. Resilient Distributed Datasets (RDDs)
      • An RDD is a fault-tolerant collection of elements that can be operated on in parallel.
      • RDDs can be created by parallelizing an existing collection in your driver program, or by referencing a dataset in an external storage system such as a shared filesystem, HDFS, HBase, Hive, or any data source offering a Hadoop InputFormat.
      • RDDs can be cached in memory or on disk.
  • 17. RDD Persistence/Cache
      • An RDD can be persisted by calling its persist() or cache() method. The first time it is computed in an action, it is kept in memory on the nodes. Spark's cache is fault-tolerant: if any partition of an RDD is lost, it is automatically recomputed using the transformations that originally created it.
      • A persisted RDD can be stored at a different storage level, allowing you, for example, to persist the dataset on disk, persist it in memory as serialized Java objects (to save space), replicate it across nodes, or store it off-heap in Tachyon.
  • 18. RDD Operations: Two Types of Operations
      • Transformations: create a new dataset from an existing one. They are lazy, in that they do not compute their results right away. By default, each transformed RDD may be recomputed each time you run an action on it. You may also persist an RDD in memory using the persist (or cache) method, in which case Spark keeps the elements around on the cluster for much faster access the next time you query it. (Note: after transformations, the data can become unevenly distributed across executors; this can be addressed with repartition.)
      • Actions: return a value to the driver program after running a computation on the dataset.
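The difference between lazy transformations, actions, and caching can be mimicked in a few lines of plain Python. The `LazyDataset` class below is a toy stand-in written for this illustration, not Spark's RDD implementation: transformations only wrap a function, actions trigger the computation, and `cache()` keeps the first computed result.

```python
class LazyDataset:
    """Toy stand-in for an RDD (hypothetical, single process)."""

    def __init__(self, compute):
        self._compute = compute   # zero-argument function producing a list
        self._persist = False
        self._cached = None

    # Transformations: return a new dataset, compute nothing yet.
    def map(self, func):
        return LazyDataset(lambda: [func(x) for x in self._materialize()])

    def filter(self, pred):
        return LazyDataset(lambda: [x for x in self._materialize() if pred(x)])

    def cache(self):
        self._persist = True
        return self

    def _materialize(self):
        if self._persist:
            if self._cached is None:
                self._cached = self._compute()
            return self._cached
        return self._compute()

    # Actions: run the computation and return a value to the caller.
    def collect(self):
        return list(self._materialize())

    def count(self):
        return len(self._materialize())


calls = []

def expensive_square(x):
    calls.append(x)               # record every evaluation
    return x * x

squares = LazyDataset(lambda: [1, 2, 3, 4]).map(expensive_square)
evaluated_before_action = len(calls)   # still 0: map() was lazy
squares.count()
squares.count()
recomputed = len(calls)                # each action recomputed the map

calls.clear()
cached = LazyDataset(lambda: [1, 2, 3, 4]).map(expensive_square).cache()
cached.count()
result = cached.collect()
computed_once = len(calls)             # the second action reused the cache
print(evaluated_before_action, recomputed, computed_once, result)
```

For iterative algorithms that scan the same data many times, this is the behaviour that makes caching pay off.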
  • 19. Transformations
      • map(func): return a new distributed dataset formed by passing each element of the source through a function func.
      • filter(func): return a new dataset formed by selecting those elements of the source on which func returns true.
      • flatMap(func): similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
      • mapPartitions(func): similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator<T> => Iterator<U> when running on an RDD of type T.
      http://spark.apache.org/docs/latest/programming-guide.html#transformations
  • 20. Actions
      • reduce(func): aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.
      • collect(): return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
      • count(), first(), take(n), saveAsTextFile(path), etc.
      http://spark.apache.org/docs/latest/programming-guide.html#actions
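To make the semantics of these operations concrete, here are plain-Python analogues on a small in-memory list. Nothing here uses Spark; the two-element `partitions` list is an assumed stand-in for how an RDD might be split.

```python
from functools import reduce
from itertools import chain

lines = ["to be or", "not to be"]

# map: exactly one output per input (here, a list of words per line).
mapped = list(map(str.split, lines))

# flatMap: each input yields zero or more outputs, flattened together.
flat = list(chain.from_iterable(map(str.split, lines)))

# filter: keep only the elements for which the predicate is true.
long_words = [word for word in flat if len(word) > 2]

# mapPartitions: the function sees a whole partition as an iterator
# and returns an iterator, here yielding one count per partition.
partitions = [flat[:3], flat[3:]]

def count_partition(iterator):
    yield sum(1 for _ in iterator)

per_partition = list(
    chain.from_iterable(count_partition(iter(p)) for p in partitions)
)

# reduce: an associative, commutative function merges the partial results.
total = reduce(lambda a, b: a + b, per_partition)

print(mapped, flat, long_words, per_partition, total)
```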
  • 21. Computing the mean of data
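The code for this slide did not survive in the transcript. One common way to compute a mean with map and reduce, sketched here in plain Python rather than Spark and not necessarily what the slide showed, is to map each value to a (sum, count) pair and merge the pairs with an associative, commutative function. The partition layout is made up for the example.

```python
from functools import reduce


def to_pair(x):
    # Each value contributes itself to the sum and 1 to the count.
    return (x, 1)


def merge(a, b):
    # Associative and commutative, so partial results can be merged in any order.
    return (a[0] + b[0], a[1] + b[1])


# Hypothetical partitions of unequal size.
partitions = [[1.0, 2.0, 3.0], [4.0], [5.0, 6.0]]

# Reduce within each partition, then across partitions.
partials = [reduce(merge, map(to_pair, part)) for part in partitions]
total, count = reduce(merge, partials)
mean = total / count

# Averaging the per-partition means is wrong when partition sizes differ.
mean_of_means = sum(s / c for s, c in partials) / len(partials)

print(mean, mean_of_means)
```

Carrying the count alongside the sum is what keeps the result correct regardless of how the data is partitioned.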
  • 22. [No transcribed text]
  • 23. [No transcribed text]
  • 24. Lab 1)
  • 25. Spark Streaming: Discretized Streams
      • DStream is the basic abstraction provided by Spark Streaming over Spark's RDDs.
      • Each RDD in a DStream contains data from a certain interval. Any operation applied on a DStream translates to operations on the underlying RDDs internally.
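The discretization idea can be sketched without Spark: slice timestamped events into fixed intervals and apply the same batch function to each slice. The `discretize` helper, the event list, and the one-second interval are assumptions made for this example; a real DStream would also produce (empty) batches for intervals with no data, which this sketch skips.

```python
from collections import Counter, defaultdict


def word_count(lines):
    # Ordinary batch logic: count the words in a collection of lines.
    return Counter(word for line in lines for word in line.split())


def discretize(events, batch_interval):
    # Group (timestamp, line) events into consecutive time intervals.
    batches = defaultdict(list)
    for timestamp, line in events:
        batches[int(timestamp // batch_interval)].append(line)
    return [batches[index] for index in sorted(batches)]


# Hypothetical stream of (seconds since start, text) events.
events = [
    (0.2, "spark streaming"),
    (0.9, "spark"),
    (1.4, "batch spark"),
    (2.5, "speed layer"),
]

# Streaming view: the batch function is applied to every micro-batch.
per_batch = [word_count(batch) for batch in discretize(events, 1.0)]

# Batch view: the same function applied to all data at once.
overall = word_count(line for _, line in events)

print(per_batch)
print(overall)
```

Because the per-interval logic is the same function used for the full dataset, the batch and speed layers can share code.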
  • 26. Word Count in Batch Processing
  • 27. Word Count in Streaming Processing
  • 28. Lab 2)
  • 29. Lab 2)
      • You need another bash shell in the Docker container to run Netcat as a data server.
      • In production, people often use Kafka as the data server.
      • docker ps // find the ID of the running container
      • docker exec -it <container ID> bash // launch a new shell
  • 30. Lab 2)
  • 31. UpdateStateByKey Operation
      The updateStateByKey operation allows you to maintain arbitrary state while continuously updating it with new information.
      • Define the state: the state can be of an arbitrary data type.
      • Define the state update function: specify with a function how to update the state using the previous state and the new values from the input stream.
  • 32. UpdateStateByKey Operation
  • 33. Computing the Mean of Streaming Data
      • The current sum and count at time t have to be accessible at time (t + 1) to compute the new mean of the stream.
      • Without updateStateByKey, the operations at time t and (t + 1) are independent.
      • A checkpoint directory has to be configured so that the state is persisted over time.
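A plain-Python sketch of the idea follows. The update function takes the new values for a key and the previous state, mirroring the shape of the function passed to updateStateByKey, but `run_stream` is a made-up driver for this illustration: it keeps state in an ordinary dict, does no checkpointing, and only updates keys that received new values in a batch.

```python
def update_sum_count(new_values, state):
    # State is a (sum, count) pair; None means the key has not been seen yet.
    total, count = state if state is not None else (0.0, 0)
    return (total + sum(new_values), count + len(new_values))


def run_stream(batches, update):
    # Hypothetical driver: carries per-key state from one batch to the next.
    state = {}
    means = []
    for batch in batches:
        grouped = {}
        for key, value in batch:
            grouped.setdefault(key, []).append(value)
        for key, values in grouped.items():
            state[key] = update(values, state.get(key))
        # Running mean per key after this batch.
        means.append({key: s / c for key, (s, c) in state.items()})
    return means


# Three micro-batches of (key, value) pairs; keys and values are invented.
batches = [
    [("temp", 10.0), ("temp", 20.0)],
    [("temp", 30.0)],
    [("temp", 40.0), ("load", 1.0)],
]

running_means = run_stream(batches, update_sum_count)
for step, snapshot in enumerate(running_means):
    print(step, snapshot)
```

Keeping (sum, count) as the state, rather than the mean itself, is what allows each new batch to be folded in exactly.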
  • 34. Computing the Mean of Streaming Data
  • 35. [No transcribed text]
  • 36. [No transcribed text]
  • 37. Lab 3)
  • 38. Online Learning Example
  • 39. Thank you.