SlideShare ist ein Scribd-Unternehmen logo
1 von 27
Downloaden Sie, um offline zu lesen
Why Spark
Secrets of Apache Spark’s Success
This Talk
● RDDs
● Spark vs InMemory Data Grids (IMDG)
● Programming model
● Reuse - DRY
● Higher level abstraction
● Scala
● Interactive Shells
● Other
● Future
About Me
● Solution Architect/Dev Manager/Developer/Market Risk
SME at a tier 1 investment bank
● 20 years of JVM experience
● 2011 - Hadoop + Map Reduce
● 2012 - Hive, then Shark
● 2013 - Spark, Scala, Play and Spray
● 2014 - Spark Streaming, Spark as a compute grid,
Spark ML
● 2015 - Independent Apache Spark consultant
Map Reduce
● Good
○ High level abstraction (Map and Reduce)
○ Distribution and fault tolerance
● Not so Good
● Lack abstractions for leveraging distributed memory
● Not efficient for iterative algorithms and interactive
data mining (SQL)
Solution - use shared memory
Challenges
● not abstracted for general use
● fault tolerant and resiliency
Existing In memory solutions
● Distributed shared memory (Coherence, key
value stores, database, etc)
● Allow fine grained updated to mutable state
● Fault tolerance hard to achieve - requires
replication, logging and checkpointing
● network bandwidth < memory bandwidth
● substantial storage overheads
Spark RDDs - what’s different
● RDD is a read-only, partitioned collection of records
● Interface based on coarse grained transformations
(map, filter and join)
● Fault tolerance using lineage rather than actual data
● if a partition is lost, the RDD has enough information to
recreate it from other RDDs to recompute the partition
without requiring replication
● Immutable RDDs
Spark - what’s different from IMDGs
. Property RDDs IM Data Grids
Reads Coarse- or fine-grained Fine-grained
Writes Coarse-grained Fine-grained
Fault recovery Fine-grained and low
overhead using lineage
Requires checkpoints and
program rollback
Straggler mitigation Possible using backup tasks Difficult
Work placement Automatic based on data
locality
Up to app (runtimes aim for
transparency)
Behavior if not enough RAM Similar to existing data flow
systems
Poor performance
(swapping?)
RDDs - what’s different
● only the lost partitions of an RDD need to be
recomputed upon failure, and they can be
recomputed in parallel on different nodes,
without having to roll back the whole
program.
RDDs - Straggler Migration
● A second benefit of RDDs is that their
immutable nature lets a system mitigate slow
nodes (stragglers) by running backup copies
of slow tasks as in MapReduce. Backup
tasks would be hard to implement with DSM,
as the two copies of a task would access the
same memory locations and interfere with
each other’s updates..
RDD Representation
● Set of partitions (“splits”)
● List of dependencies on parent RDDs
○ narrow, e.g. map, filter
○ wide, e.g. groupBy, require shuffle
● Function to compute a partition given parents
● Optional preferred locations
● Optional partitioning information
Hadoop RDD
partitions : one per HDFS block
dependencies : none
compute(partitions) : read corresponding block
preferred locations : HDFS block locations
partitioner : none
Filtered RDD
partitions : same as parent RDD
dependencies : “one-to-one” on parent
compute(partition) : compute parent and filter it
preferred locations(part) : none (ask parent)
partitioner = none
Joined RDD
partitions : one per reduce task
dependencies : shuffle on each parent
compute(partition) : read and join shuffled data
preferred locations(part) : none
partitioner = HashPartitioner(numTasks)
RDDs - Memory not essential
RDDs degrade gracefully when there is not
enough memory to store them, as long as they
are only being used in scan-based operations.
Partitions that do not fit in RAM can be stored
on disk and will provide similar performance to
current data-parallel systems.
RDDs - generic abstraction
● Coarse grained transformation only are a good fit for
many parallel applications
● RDDs efficiently express many programming models -
Map Reduce, SQL, Graph, MLLib
● many parallel programs naturally apply the same
operation to many records, making them easy to
express
● immutability of RDDs is not an obstacle because one
can create multiple RDDs to represent versions of the
same dataset
RDDs - persistence and partitioning
● Users can control two other aspects of RDDs:
persistence and partitioning. Users can indicate which
RDDs they will reuse and choose a storage strategy for
them (e.g., in-memory storage). They can also ask that
an RDD’s elements be partitioned across machines
based on a key in each record. This is useful for
placement optimizations, such as ensuring that two
datasets that will be joined together are hash-partitioned
in the same way
● inspects RDD’s lineage graph to build a DAG of stages
to execute. Each stage contains as many pipelined
transformations with narrow dependencies as possible.
● The boundaries of the stages are the shuffle operations
required for wide dependencies, or any already
computed partitions that can shortcircuit the
computation of a parent RDD. The scheduler then
launches tasks to compute missing partitions from each
stage until it has computed the target RDD.
● Cached RDDs not recomputed.
Scheduler
Dont reinvent the wheel
● reuse Hadoop APIs - InputOutput formats,
codecs
● Hive QL and data types (Serdes)
● Hive Server
● Spark’s scheduler uses our representation of
RDDs, making it fault tolerant and scalable
● Productivity - Spark Shell
● Compatible with JVM ecosystem. Massive
legacy codebase in big data
● DSL support Newer Spark API’s are
effectively DSL’s
● Concise syntax
● Rapid prototyping, but still type safe
● Thinking functionally Encourages
immutability and good practices
Written in Scala
● Smart team
● Dont bite more than what you can chew
(Spark Core, ML + SQL, Streaming next)
● Open
● Community
● Process driven - build automation, test
coverage, api compatibility checks
Other secrets
Don’t use Spark when you need -
● asynchronous finegrained updates to shared
state, such as a storage system for a web
application or an incremental web crawler.
For these applications, it is more efficient to
use systems that perform traditional update
logging and data checkpointing, such as
databases
More to come - Project Tungsten
● Project Tungsten (overcome JVM limitations)
○ Memory management and binary processing leveraging application
semantics to manage memory explicitly and eliminate the overhead of
JVM object model and garbage collection
○ Cache-aware computation: algorithms and data structures to exploit
memory hierarchy
○ Code generation: using code generation to exploit modern compilers
and CPUs
● Data Frames
○ write less code
○ read less data (predicate push down)
More to come - Data Frames
● write less code
● read less data
○ convert to efficient formats
○ columnar formats
○ use partitioning
○ skip data using statistics
○ predicate pushdown
● let the optimiser (Catalyst) do the hard work

Weitere ähnliche Inhalte

Was ist angesagt?

Choosing the Right Big Data Tools for the Job - A Polyglot Approach
Choosing the Right Big Data Tools for the Job - A Polyglot ApproachChoosing the Right Big Data Tools for the Job - A Polyglot Approach
Choosing the Right Big Data Tools for the Job - A Polyglot ApproachDATAVERSITY
 
MySQL Cluster Schema management (2014)
MySQL Cluster Schema management (2014)MySQL Cluster Schema management (2014)
MySQL Cluster Schema management (2014)Frazer Clement
 
Database Architecture & Scaling Strategies, in the Cloud & on the Rack
Database Architecture & Scaling Strategies, in the Cloud & on the Rack Database Architecture & Scaling Strategies, in the Cloud & on the Rack
Database Architecture & Scaling Strategies, in the Cloud & on the Rack Clustrix
 
Lambda usecase
Lambda usecaseLambda usecase
Lambda usecaseDavid Tung
 
Data has a better idea the in-memory data grid
Data has a better idea   the in-memory data gridData has a better idea   the in-memory data grid
Data has a better idea the in-memory data gridBogdan Dina
 
Bn 1016 demo postgre sql-online-training
Bn 1016 demo  postgre sql-online-trainingBn 1016 demo  postgre sql-online-training
Bn 1016 demo postgre sql-online-trainingconline training
 
Learning from google megastore (Part-1)
Learning from google megastore (Part-1)Learning from google megastore (Part-1)
Learning from google megastore (Part-1)Schubert Zhang
 
No sql bigdata and postgresql
No sql bigdata and postgresqlNo sql bigdata and postgresql
No sql bigdata and postgresqlZaid Shabbir
 
Srimanta_Maji_Oracle_DBA
Srimanta_Maji_Oracle_DBASrimanta_Maji_Oracle_DBA
Srimanta_Maji_Oracle_DBASRIMANTA MAJI
 
Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014
Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014
Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014cdmaxime
 
Map Reduce
Map ReduceMap Reduce
Map Reduceopenak
 
Apache Cassandra training. Overview and Basics
Apache Cassandra training. Overview and BasicsApache Cassandra training. Overview and Basics
Apache Cassandra training. Overview and BasicsOleg Magazov
 

Was ist angesagt? (20)

Intro to Cassandra
Intro to CassandraIntro to Cassandra
Intro to Cassandra
 
Choosing the Right Big Data Tools for the Job - A Polyglot Approach
Choosing the Right Big Data Tools for the Job - A Polyglot ApproachChoosing the Right Big Data Tools for the Job - A Polyglot Approach
Choosing the Right Big Data Tools for the Job - A Polyglot Approach
 
MySQL Cluster Schema management (2014)
MySQL Cluster Schema management (2014)MySQL Cluster Schema management (2014)
MySQL Cluster Schema management (2014)
 
Database Architecture & Scaling Strategies, in the Cloud & on the Rack
Database Architecture & Scaling Strategies, in the Cloud & on the Rack Database Architecture & Scaling Strategies, in the Cloud & on the Rack
Database Architecture & Scaling Strategies, in the Cloud & on the Rack
 
Spark 101
Spark 101Spark 101
Spark 101
 
Lambda usecase
Lambda usecaseLambda usecase
Lambda usecase
 
Cassandra
CassandraCassandra
Cassandra
 
Data has a better idea the in-memory data grid
Data has a better idea   the in-memory data gridData has a better idea   the in-memory data grid
Data has a better idea the in-memory data grid
 
Cassandra training
Cassandra trainingCassandra training
Cassandra training
 
Bn 1016 demo postgre sql-online-training
Bn 1016 demo  postgre sql-online-trainingBn 1016 demo  postgre sql-online-training
Bn 1016 demo postgre sql-online-training
 
Rdd transformations
Rdd transformationsRdd transformations
Rdd transformations
 
Learning from google megastore (Part-1)
Learning from google megastore (Part-1)Learning from google megastore (Part-1)
Learning from google megastore (Part-1)
 
Cassandra Database
Cassandra DatabaseCassandra Database
Cassandra Database
 
No sql bigdata and postgresql
No sql bigdata and postgresqlNo sql bigdata and postgresql
No sql bigdata and postgresql
 
Srimanta_Maji_Oracle_DBA
Srimanta_Maji_Oracle_DBASrimanta_Maji_Oracle_DBA
Srimanta_Maji_Oracle_DBA
 
NoSQL Seminer
NoSQL SeminerNoSQL Seminer
NoSQL Seminer
 
Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014
Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014
Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014
 
Apache Cassandra
Apache CassandraApache Cassandra
Apache Cassandra
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Apache Cassandra training. Overview and Basics
Apache Cassandra training. Overview and BasicsApache Cassandra training. Overview and Basics
Apache Cassandra training. Overview and Basics
 

Ähnlich wie Secrets of Spark's success - Deenar Toraskar, Think Reactive

Apache Spark for Beginners
Apache Spark for BeginnersApache Spark for Beginners
Apache Spark for BeginnersAnirudh
 
Geek Night - Functional Data Processing using Spark and Scala
Geek Night - Functional Data Processing using Spark and ScalaGeek Night - Functional Data Processing using Spark and Scala
Geek Night - Functional Data Processing using Spark and ScalaAtif Akhtar
 
BDAS RDD study report v1.2
BDAS RDD study report v1.2BDAS RDD study report v1.2
BDAS RDD study report v1.2Stefanie Zhao
 
Spark Driven Big Data Analytics
Spark Driven Big Data AnalyticsSpark Driven Big Data Analytics
Spark Driven Big Data Analyticsinoshg
 
Spark cluster computing with working sets
Spark cluster computing with working setsSpark cluster computing with working sets
Spark cluster computing with working setsJinxinTang
 
Architecting and productionising data science applications at scale
Architecting and productionising data science applications at scaleArchitecting and productionising data science applications at scale
Architecting and productionising data science applications at scalesamthemonad
 
Unit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxUnit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxRahul Borate
 
Spark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, StreamingSpark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, StreamingPetr Zapletal
 
Resilient Distributed DataSets - Apache SPARK
Resilient Distributed DataSets - Apache SPARKResilient Distributed DataSets - Apache SPARK
Resilient Distributed DataSets - Apache SPARKTaposh Roy
 
Apache spark - Spark's distributed programming model
Apache spark - Spark's distributed programming modelApache spark - Spark's distributed programming model
Apache spark - Spark's distributed programming modelMartin Zapletal
 
Study Notes: Apache Spark
Study Notes: Apache SparkStudy Notes: Apache Spark
Study Notes: Apache SparkGao Yunzhong
 
Spark Overview and Performance Issues
Spark Overview and Performance IssuesSpark Overview and Performance Issues
Spark Overview and Performance IssuesAntonios Katsarakis
 
Spark Unveiled Essential Insights for All Developers
Spark Unveiled Essential Insights for All DevelopersSpark Unveiled Essential Insights for All Developers
Spark Unveiled Essential Insights for All DevelopersKnoldus Inc.
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekVenkata Naga Ravi
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark FundamentalsZahra Eskandari
 
Scala Days Highlights | BoldRadius
Scala Days Highlights | BoldRadiusScala Days Highlights | BoldRadius
Scala Days Highlights | BoldRadiusBoldRadius Solutions
 

Ähnlich wie Secrets of Spark's success - Deenar Toraskar, Think Reactive (20)

Apache Spark for Beginners
Apache Spark for BeginnersApache Spark for Beginners
Apache Spark for Beginners
 
Geek Night - Functional Data Processing using Spark and Scala
Geek Night - Functional Data Processing using Spark and ScalaGeek Night - Functional Data Processing using Spark and Scala
Geek Night - Functional Data Processing using Spark and Scala
 
BDAS RDD study report v1.2
BDAS RDD study report v1.2BDAS RDD study report v1.2
BDAS RDD study report v1.2
 
Spark Driven Big Data Analytics
Spark Driven Big Data AnalyticsSpark Driven Big Data Analytics
Spark Driven Big Data Analytics
 
Spark cluster computing with working sets
Spark cluster computing with working setsSpark cluster computing with working sets
Spark cluster computing with working sets
 
Apache Spark Core
Apache Spark CoreApache Spark Core
Apache Spark Core
 
Architecting and productionising data science applications at scale
Architecting and productionising data science applications at scaleArchitecting and productionising data science applications at scale
Architecting and productionising data science applications at scale
 
APACHE SPARK.pptx
APACHE SPARK.pptxAPACHE SPARK.pptx
APACHE SPARK.pptx
 
Unit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxUnit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptx
 
Spark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, StreamingSpark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, Streaming
 
Resilient Distributed DataSets - Apache SPARK
Resilient Distributed DataSets - Apache SPARKResilient Distributed DataSets - Apache SPARK
Resilient Distributed DataSets - Apache SPARK
 
Apache spark - Spark's distributed programming model
Apache spark - Spark's distributed programming modelApache spark - Spark's distributed programming model
Apache spark - Spark's distributed programming model
 
Study Notes: Apache Spark
Study Notes: Apache SparkStudy Notes: Apache Spark
Study Notes: Apache Spark
 
Spark
SparkSpark
Spark
 
Spark Overview and Performance Issues
Spark Overview and Performance IssuesSpark Overview and Performance Issues
Spark Overview and Performance Issues
 
Apache Spark
Apache SparkApache Spark
Apache Spark
 
Spark Unveiled Essential Insights for All Developers
Spark Unveiled Essential Insights for All DevelopersSpark Unveiled Essential Insights for All Developers
Spark Unveiled Essential Insights for All Developers
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
 
Scala Days Highlights | BoldRadius
Scala Days Highlights | BoldRadiusScala Days Highlights | BoldRadius
Scala Days Highlights | BoldRadius
 

Mehr von huguk

Data Wrangling on Hadoop - Olivier De Garrigues, Trifacta
Data Wrangling on Hadoop - Olivier De Garrigues, TrifactaData Wrangling on Hadoop - Olivier De Garrigues, Trifacta
Data Wrangling on Hadoop - Olivier De Garrigues, Trifactahuguk
 
ether.camp - Hackathon & ether.camp intro
ether.camp - Hackathon & ether.camp introether.camp - Hackathon & ether.camp intro
ether.camp - Hackathon & ether.camp introhuguk
 
Google Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoop
Google Cloud Dataproc - Easier, faster, more cost-effective Spark and HadoopGoogle Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoop
Google Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoophuguk
 
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...huguk
 
Extracting maximum value from data while protecting consumer privacy. Jason ...
Extracting maximum value from data while protecting consumer privacy.  Jason ...Extracting maximum value from data while protecting consumer privacy.  Jason ...
Extracting maximum value from data while protecting consumer privacy. Jason ...huguk
 
Intelligence Augmented vs Artificial Intelligence. Alex Flamant, IBM Watson
Intelligence Augmented vs Artificial Intelligence. Alex Flamant, IBM WatsonIntelligence Augmented vs Artificial Intelligence. Alex Flamant, IBM Watson
Intelligence Augmented vs Artificial Intelligence. Alex Flamant, IBM Watsonhuguk
 
Streaming Dataflow with Apache Flink
Streaming Dataflow with Apache Flink Streaming Dataflow with Apache Flink
Streaming Dataflow with Apache Flink huguk
 
Lambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale MLLambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale MLhuguk
 
Today’s reality Hadoop with Spark- How to select the best Data Science approa...
Today’s reality Hadoop with Spark- How to select the best Data Science approa...Today’s reality Hadoop with Spark- How to select the best Data Science approa...
Today’s reality Hadoop with Spark- How to select the best Data Science approa...huguk
 
Jonathon Southam: Venture Capital, Funding & Pitching
Jonathon Southam: Venture Capital, Funding & PitchingJonathon Southam: Venture Capital, Funding & Pitching
Jonathon Southam: Venture Capital, Funding & Pitchinghuguk
 
Signal Media: Real-Time Media & News Monitoring
Signal Media: Real-Time Media & News MonitoringSignal Media: Real-Time Media & News Monitoring
Signal Media: Real-Time Media & News Monitoringhuguk
 
Dean Bryen: Scaling The Platform For Your Startup
Dean Bryen: Scaling The Platform For Your StartupDean Bryen: Scaling The Platform For Your Startup
Dean Bryen: Scaling The Platform For Your Startuphuguk
 
Peter Karney: Intro to the Digital catapult
Peter Karney: Intro to the Digital catapultPeter Karney: Intro to the Digital catapult
Peter Karney: Intro to the Digital catapulthuguk
 
Cytora: Real-Time Political Risk Analysis
Cytora:  Real-Time Political Risk AnalysisCytora:  Real-Time Political Risk Analysis
Cytora: Real-Time Political Risk Analysishuguk
 
Cubitic: Predictive Analytics
Cubitic: Predictive AnalyticsCubitic: Predictive Analytics
Cubitic: Predictive Analyticshuguk
 
Bird.i: Earth Observation Data Made Social
Bird.i: Earth Observation Data Made SocialBird.i: Earth Observation Data Made Social
Bird.i: Earth Observation Data Made Socialhuguk
 
Aiseedo: Real Time Machine Intelligence
Aiseedo: Real Time Machine IntelligenceAiseedo: Real Time Machine Intelligence
Aiseedo: Real Time Machine Intelligencehuguk
 
TV Marketing and big data: cat and dog or thick as thieves? Krzysztof Osiewal...
TV Marketing and big data: cat and dog or thick as thieves? Krzysztof Osiewal...TV Marketing and big data: cat and dog or thick as thieves? Krzysztof Osiewal...
TV Marketing and big data: cat and dog or thick as thieves? Krzysztof Osiewal...huguk
 
Hadoop - Looking to the Future By Arun Murthy
Hadoop - Looking to the Future By Arun MurthyHadoop - Looking to the Future By Arun Murthy
Hadoop - Looking to the Future By Arun Murthyhuguk
 
Fast real-time approximations using Spark streaming
Fast real-time approximations using Spark streamingFast real-time approximations using Spark streaming
Fast real-time approximations using Spark streaminghuguk
 

Mehr von huguk (20)

Data Wrangling on Hadoop - Olivier De Garrigues, Trifacta
Data Wrangling on Hadoop - Olivier De Garrigues, TrifactaData Wrangling on Hadoop - Olivier De Garrigues, Trifacta
Data Wrangling on Hadoop - Olivier De Garrigues, Trifacta
 
ether.camp - Hackathon & ether.camp intro
ether.camp - Hackathon & ether.camp introether.camp - Hackathon & ether.camp intro
ether.camp - Hackathon & ether.camp intro
 
Google Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoop
Google Cloud Dataproc - Easier, faster, more cost-effective Spark and HadoopGoogle Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoop
Google Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoop
 
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
 
Extracting maximum value from data while protecting consumer privacy. Jason ...
Extracting maximum value from data while protecting consumer privacy.  Jason ...Extracting maximum value from data while protecting consumer privacy.  Jason ...
Extracting maximum value from data while protecting consumer privacy. Jason ...
 
Intelligence Augmented vs Artificial Intelligence. Alex Flamant, IBM Watson
Intelligence Augmented vs Artificial Intelligence. Alex Flamant, IBM WatsonIntelligence Augmented vs Artificial Intelligence. Alex Flamant, IBM Watson
Intelligence Augmented vs Artificial Intelligence. Alex Flamant, IBM Watson
 
Streaming Dataflow with Apache Flink
Streaming Dataflow with Apache Flink Streaming Dataflow with Apache Flink
Streaming Dataflow with Apache Flink
 
Lambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale MLLambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale ML
 
Today’s reality Hadoop with Spark- How to select the best Data Science approa...
Today’s reality Hadoop with Spark- How to select the best Data Science approa...Today’s reality Hadoop with Spark- How to select the best Data Science approa...
Today’s reality Hadoop with Spark- How to select the best Data Science approa...
 
Jonathon Southam: Venture Capital, Funding & Pitching
Jonathon Southam: Venture Capital, Funding & PitchingJonathon Southam: Venture Capital, Funding & Pitching
Jonathon Southam: Venture Capital, Funding & Pitching
 
Signal Media: Real-Time Media & News Monitoring
Signal Media: Real-Time Media & News MonitoringSignal Media: Real-Time Media & News Monitoring
Signal Media: Real-Time Media & News Monitoring
 
Dean Bryen: Scaling The Platform For Your Startup
Dean Bryen: Scaling The Platform For Your StartupDean Bryen: Scaling The Platform For Your Startup
Dean Bryen: Scaling The Platform For Your Startup
 
Peter Karney: Intro to the Digital catapult
Peter Karney: Intro to the Digital catapultPeter Karney: Intro to the Digital catapult
Peter Karney: Intro to the Digital catapult
 
Cytora: Real-Time Political Risk Analysis
Cytora:  Real-Time Political Risk AnalysisCytora:  Real-Time Political Risk Analysis
Cytora: Real-Time Political Risk Analysis
 
Cubitic: Predictive Analytics
Cubitic: Predictive AnalyticsCubitic: Predictive Analytics
Cubitic: Predictive Analytics
 
Bird.i: Earth Observation Data Made Social
Bird.i: Earth Observation Data Made SocialBird.i: Earth Observation Data Made Social
Bird.i: Earth Observation Data Made Social
 
Aiseedo: Real Time Machine Intelligence
Aiseedo: Real Time Machine IntelligenceAiseedo: Real Time Machine Intelligence
Aiseedo: Real Time Machine Intelligence
 
TV Marketing and big data: cat and dog or thick as thieves? Krzysztof Osiewal...
TV Marketing and big data: cat and dog or thick as thieves? Krzysztof Osiewal...TV Marketing and big data: cat and dog or thick as thieves? Krzysztof Osiewal...
TV Marketing and big data: cat and dog or thick as thieves? Krzysztof Osiewal...
 
Hadoop - Looking to the Future By Arun Murthy
Hadoop - Looking to the Future By Arun MurthyHadoop - Looking to the Future By Arun Murthy
Hadoop - Looking to the Future By Arun Murthy
 
Fast real-time approximations using Spark streaming
Fast real-time approximations using Spark streamingFast real-time approximations using Spark streaming
Fast real-time approximations using Spark streaming
 

Kürzlich hochgeladen

FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Introduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDMIntroduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDMKumar Satyam
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityWSO2
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)Samir Dash
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontologyjohnbeverley2021
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Zilliz
 

Kürzlich hochgeladen (20)

FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Introduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDMIntroduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDM
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 

Secrets of Spark's success - Deenar Toraskar, Think Reactive

  • 1. Why Spark Secrets of Apache Spark’s Success
  • 2. This Talk ● RDDs ● Spark vs InMemory Data Grids (IMDG) ● Programming model ● Reuse - DRY ● Higher level abstraction ● Scala ● Interactive Shells ● Other ● Future
  • 3. About Me ● Solution Architect/Dev Manager/Developer/Market Risk SME at a tier 1 investment bank ● 20 years of JVM experience ● 2011 - Hadoop + Map Reduce ● 2012 - Hive, then Shark ● 2013 - Spark, Scala, Play and Spray ● 2014 - Spark Streaming, Spark as a compute grid, Spark ML ● 2015 - Independent Apache Spark consultant
  • 4. Map Reduce ● Good ○ High level abstraction (Map and Reduce) ○ Distribution and fault tolerance ● Not so Good ● Lack abstractions for leveraging distributed memory ● Not efficient for iterative algorithms and interactive data mining (SQL)
  • 5. Solution - use shared memory Challenges ● not abstracted for general use ● fault tolerant and resiliency
  • 6. Existing In memory solutions ● Distributed shared memory (Coherence, key value stores, database, etc) ● Allow fine grained updated to mutable state ● Fault tolerance hard to achieve - requires replication, logging and checkpointing ● network bandwidth < memory bandwidth ● substantial storage overheads
  • 7. Spark RDDs - what’s different ● RDD is a read-only, partitioned collection of records ● Interface based on coarse grained transformations (map, filter and join) ● Fault tolerance using lineage rather than actual data ● if a partition is lost, the RDD has enough information to recreate it from other RDDs to recompute the partition without requiring replication ● Immutable RDDs
  • 8. Spark - what’s different from IMDGs . Property RDDs IM Data Grids Reads Coarse- or fine-grained Fine-grained Writes Coarse-grained Fine-grained Fault recovery Fine-grained and low overhead using lineage Requires checkpoints and program rollback Straggler mitigation Possible using backup tasks Difficult Work placement Automatic based on data locality Up to app (runtimes aim for transparency) Behavior if not enough RAM Similar to existing data flow systems Poor performance (swapping?)
  • 9. RDDs - what’s different ● only the lost partitions of an RDD need to be recomputed upon failure, and they can be recomputed in parallel on different nodes, without having to roll back the whole program.
  • 10. RDDs - Straggler Migration ● A second benefit of RDDs is that their immutable nature lets a system mitigate slow nodes (stragglers) by running backup copies of slow tasks as in MapReduce. Backup tasks would be hard to implement with DSM, as the two copies of a task would access the same memory locations and interfere with each other’s updates..
  • 11. RDD Representation ● Set of partitions (“splits”) ● List of dependencies on parent RDDs ○ narrow, e.g. map, filter ○ wide, e.g. groupBy, require shuffle ● Function to compute a partition given parents ● Optional preferred locations ● Optional partitioning information
  • 12. Hadoop RDD partitions : one per HDFS block dependencies : none compute(partitions) : read corresponding block preferred locations : HDFS block locations partitioner : none
  • 13. Filtered RDD partitions : same as parent RDD dependencies : “one-to-one” on parent compute(partition) : compute parent and filter it preferred locations(part) : none (ask parent) partitioner = none
  • 14. Joined RDD partitions : one per reduce task dependencies : shuffle on each parent compute(partition) : read and join shuffled data preferred locations(part) : none partitioner = HashPartitioner(numTasks)
  • 15. RDDs - Memory not essential RDDs degrade gracefully when there is not enough memory to store them, as long as they are only being used in scan-based operations. Partitions that do not fit in RAM can be stored on disk and will provide similar performance to current data-parallel systems.
  • 16. RDDs - generic abstraction ● Coarse grained transformation only are a good fit for many parallel applications ● RDDs efficiently express many programming models - Map Reduce, SQL, Graph, MLLib ● many parallel programs naturally apply the same operation to many records, making them easy to express ● immutability of RDDs is not an obstacle because one can create multiple RDDs to represent versions of the same dataset
  • 17.
  • 18.
  • 19. RDDs - persistence and partitioning ● Users can control two other aspects of RDDs: persistence and partitioning. Users can indicate which RDDs they will reuse and choose a storage strategy for them (e.g., in-memory storage). They can also ask that an RDD’s elements be partitioned across machines based on a key in each record. This is useful for placement optimizations, such as ensuring that two datasets that will be joined together are hash-partitioned in the same way
  • 20. ● inspects RDD’s lineage graph to build a DAG of stages to execute. Each stage contains as many pipelined transformations with narrow dependencies as possible. ● The boundaries of the stages are the shuffle operations required for wide dependencies, or any already computed partitions that can shortcircuit the computation of a parent RDD. The scheduler then launches tasks to compute missing partitions from each stage until it has computed the target RDD. ● Cached RDDs not recomputed. Scheduler
  • 21.
  • 22. Dont reinvent the wheel ● reuse Hadoop APIs - InputOutput formats, codecs ● Hive QL and data types (Serdes) ● Hive Server ● Spark’s scheduler uses our representation of RDDs, making it fault tolerant and scalable ● Productivity - Spark Shell
  • 23. ● Compatible with JVM ecosystem. Massive legacy codebase in big data ● DSL support Newer Spark API’s are effectively DSL’s ● Concise syntax ● Rapid prototyping, but still type safe ● Thinking functionally Encourages immutability and good practices Written in Scala
  • 24. ● Smart team ● Dont bite more than what you can chew (Spark Core, ML + SQL, Streaming next) ● Open ● Community ● Process driven - build automation, test coverage, api compatibility checks Other secrets
  • 25. Don’t use Spark when you need - ● asynchronous finegrained updates to shared state, such as a storage system for a web application or an incremental web crawler. For these applications, it is more efficient to use systems that perform traditional update logging and data checkpointing, such as databases
  • 26. More to come - Project Tungsten ● Project Tungsten (overcome JVM limitations) ○ Memory management and binary processing leveraging application semantics to manage memory explicitly and eliminate the overhead of JVM object model and garbage collection ○ Cache-aware computation: algorithms and data structures to exploit memory hierarchy ○ Code generation: using code generation to exploit modern compilers and CPUs ● Data Frames ○ write less code ○ read less data (predicate push down)
  • 27. More to come - Data Frames ● write less code ● read less data ○ convert to efficient formats ○ columnar formats ○ use partitioning ○ skip data using statistics ○ predicate pushdown ● let the optimiser (Catalyst) do the hard work