© 2016 MapR Technologies
Is Spark Replacing MapReduce? Hadoop?
Keys Botzum, Senior Principal Technologist
March 2016
Last update: March 29, 2016
Companies With Spark on MapR In Production
•  Fortune 500 Global Telecom
•  Fortune 500 Health Care
•  Global Financial Services
Cisco: Security Intelligence Operations
•  Sensor data lands in Hadoop
•  Streaming for real-time detection and threat alerts
•  Data next processed on GraphX and Mahout to build threat detection models and accelerated reporting
•  Additional SQL querying for end-customer reporting and threat detection
Circa 2014 …
Next-Gen Genomics
•  Existing process takes several weeks to align chemical compounds with genes
•  ADAM on Spark allows realignment in a few hours
•  Geneticists can minimize engineering dependency
Is Spark replacing Hadoop?
How about a product manager’s favorite tool: the checkbox list!
DAG
Persistent Store
Machine Learning
Graph
Streaming
Batch SQL
Interactive SQL
Security
Resource Management
Multitenancy
Others
Wait. What’s Hadoop?
Pluggable data parallel framework
HDFS and HBase API based Persistent Store
•  Proven MapReduce, Hive, Pig
•  YARN introduces pluggability
•  Allows for multiple frameworks
•  Standard for scale-out big data store
•  Stores data as files and tables
•  Secure
•  Includes resource management
Spark and MapReduce are …
•  Scalable frameworks for executing custom code on a cluster
•  Nodes in the cluster work independently to process fragments of
data and also combine those fragments together when
appropriate to yield a final result
•  Can tolerate loss of a node during a computation
•  Require a distributed storage layer that gives every node a common view of the data
What’s MapReduce?
•  Map
–  Loads the data and defines a set of keys
•  Reduce
–  Collects the data grouped by key, processes it, and writes the output
•  Performance can be tuned using known details of your source files and cluster shape (size, total number)
MapReduce Processing Model
•  Define mappers
•  Shuffling is automatic
•  Define reducers
•  For complex work, chain jobs together
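The processing model above can be sketched in plain Java, with no Hadoop classes involved. This is only an in-memory illustration of the map, shuffle, and reduce phases under simplifying assumptions (one process, everything fits in memory); the class and method names (MapReduceSketch, map, shuffle, reduce, wordCount) are invented for this sketch and are not part of the Hadoop API.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, single-process illustration of the three MapReduce phases.
class MapReduceSketch {

    // Map phase: emit a (word, 1) pair for every token in one input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<String, Integer>(token, 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: group emitted values by key. In a real framework this
    // step is automatic; it is written out here only to make it visible.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: collapse all values for one key into a single result.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Driver: map every line, shuffle the pairs, then reduce each key.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : lines) {
            emitted.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(emitted).entrySet()) {
            counts.put(e.getKey(), reduce(e.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("spark and hadoop", "hadoop and yarn");
        System.out.println(wordCount(lines)); // {and=2, hadoop=2, spark=1, yarn=1}
    }
}
```

Chaining jobs for complex work would correspond to feeding the output of one such driver into the map step of the next.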
MapReduce: The Good
•  Built-in fault tolerance
•  Optimized IO path
•  Scalable
•  Developer focuses on Map/Reduce, not infrastructure
•  Simple(?) API
MapReduce: The Bad
•  Batch oriented
•  Optimized for disk IO
–  Doesn’t leverage memory well
–  Iterative algorithms go through disk IO path again and again
•  Primitive API
–  Developers have to build on a very simple abstraction
–  Key/Value in/out
–  Even basic things like join require extensive code
•  Result is often many files that need to be combined appropriately
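To give a feel for why a join takes extra work on a key/value-only abstraction, here is a minimal sketch of a reduce-side join written in plain Java. It is an assumption-laden illustration, not Hadoop code: the class ReduceSideJoinSketch, the "U:" and "O:" tags, and the sample users/orders data are all hypothetical. Each record is tagged with its source, records are grouped by key, and the "reducer" pairs up the two sides.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process illustration of a reduce-side (inner) join.
class ReduceSideJoinSketch {

    // users:  {userId, name}    orders: {userId, item}
    static List<String> join(List<String[]> users, List<String[]> orders) {
        // "Map" + "shuffle": emit every record under its join key, tagged
        // with the input it came from, and group the tagged values by key.
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] u : users) {
            grouped.computeIfAbsent(u[0], k -> new ArrayList<>()).add("U:" + u[1]);
        }
        for (String[] o : orders) {
            grouped.computeIfAbsent(o[0], k -> new ArrayList<>()).add("O:" + o[1]);
        }

        // "Reduce": for each key, split the values back into the two sides
        // and emit one output row per (user, order) combination.
        List<String> joined = new ArrayList<>();
        for (List<String> values : grouped.values()) {
            List<String> names = new ArrayList<>();
            List<String> items = new ArrayList<>();
            for (String v : values) {
                if (v.startsWith("U:")) {
                    names.add(v.substring(2));
                } else {
                    items.add(v.substring(2));
                }
            }
            for (String name : names) {
                for (String item : items) {
                    joined.add(name + " bought " + item);
                }
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String[]> users = Arrays.asList(new String[] {"1", "ann"}, new String[] {"2", "bob"});
        List<String[]> orders = Arrays.asList(
                new String[] {"1", "disk"}, new String[] {"2", "cable"}, new String[] {"1", "ram"});
        // [ann bought disk, ann bought ram, bob bought cable]
        System.out.println(join(users, orders));
    }
}
```

The tagging and regrouping is bookkeeping the developer has to write by hand here; a higher-level API can hide it behind a single join call.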
What’s Spark?
Batch Interactive Streaming Framework
•  Powerful API
•  Leverages memory aggressively
•  Batch and streaming
Pluggable Persistent Store
•  MapR-FS, HDFS
•  MapR-DB, HBase, Cassandra
•  MapR-Streams, Kafka
•  S3
Apache Spark
•  spark.apache.org
•  Originally developed in 2009 at UC Berkeley’s AMPLab
•  Fully open sourced in 2010; now at the Apache Software Foundation
Spark: Ease of Use and Performance
•  Easy to Develop
–  Rich APIs in Java, Scala,
Python, R
–  Interactive shell
•  Fast to Run
–  General execution graphs
–  In-memory storage
Less code, simpler code
Resilient Distributed Datasets (RDD)
•  Spark revolves around RDDs
•  Fault-tolerant, read-only collection of elements that can be operated on in parallel
•  Cached in memory or on disk
http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
A newer API is based on DataFrames, but for this presentation the difference isn’t important
RDD Operations - Expressive
•  Transformations
–  Creation of a new RDD dataset from an existing one
•  map, filter, distinct, union, sample, groupByKey, join, reduceByKey, etc…
•  Actions
–  Return a value after running a computation
•  collect, count, first, takeSample, foreach, etc…
Check the documentation for a complete list
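The split between transformations and actions has a rough analogy in the standard java.util.stream API, which can be run without a cluster. This is an analogy only, not the Spark API: Java streams are single-use, local, and not cached or fault tolerant like RDDs. The class and method names below (TransformationVsActionSketch and its methods) are invented for the sketch.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical analogy: intermediate stream operations play the role of
// transformations, terminal operations play the role of actions.
class TransformationVsActionSketch {

    // Transformation-like: derives a new pipeline from an existing collection.
    // Nothing is computed yet; filter and distinct only describe the work.
    static Stream<String> distinctLongWords(List<String> words, int minLength) {
        return words.stream()
                .filter(w -> w.length() >= minLength)
                .distinct();
    }

    // Action-like: count() runs the pipeline and returns a value.
    static long countDistinctLongWords(List<String> words, int minLength) {
        return distinctLongWords(words, minLength).count();
    }

    // Action-like: collect() runs the pipeline and materializes the elements.
    static List<String> collectDistinctLongWords(List<String> words, int minLength) {
        return distinctLongWords(words, minLength).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("spark", "yarn", "hadoop", "spark", "pig");
        System.out.println(countDistinctLongWords(words, 5));   // 2
        System.out.println(collectDistinctLongWords(words, 5)); // [spark, hadoop]
    }
}
```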
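Easy: Example – Word Count
•  Hadoop MapReduce Java
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Both classes are nested inside a job driver class that the slide does not show.
// Mapper: emits (word, 1) for every token in the input line.
public static class WordCountMapClass extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            output.collect(word, one);
        }
    }
}

// Reducer: sums the counts collected for each word.
public static class WordCountReduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}
•  Spark Java
import java.util.Arrays;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

// sc is an existing JavaSparkContext (Spark 1.x API, where FlatMapFunction.call returns an Iterable).
JavaRDD<String> textFile = sc.textFile("hdfs://...");
JavaRDD<String> words = textFile.flatMap(new FlatMapFunction<String, String>() {
    public Iterable<String> call(String s) { return Arrays.asList(s.split(" ")); }
});
JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
    public Tuple2<String, Integer> call(String s) { return new Tuple2<String, Integer>(s, 1); }
});
JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
    public Integer call(Integer a, Integer b) { return a + b; }
});
counts.saveAsTextFile("hdfs://...");
•  Spark Scala
// sc is an existing SparkContext.
val textFile = sc.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
Source: http://spark.apache.org/examples.html#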
Faster for Iterative: PageRank Performance
Iteration time (s) by number of machines:
Machines   Hadoop   Spark
30         171      23
60         80       14
Spark vs. MapReduce
•  Spark is faster than MR for iterative algorithms whose data fits in memory
•  Spark code is easier to write and easier to understand than MR
–  Your programming is closer to the correct abstraction
•  Spark supports both batch and streaming models
•  Advantage: Spark
–  Caution: not all applications run faster on Spark, and Spark may have limitations for some scenarios
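The iterative case can be illustrated with a toy sketch in plain Java. It does not use Spark or Hadoop and makes no performance claim; it only counts how often a simulated storage read happens when each iteration re-reads its input versus when the input is loaded once and kept in memory. All names (IterativeReuseSketch, loadFromStorage, step, and the sample data) are assumptions made for this illustration.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration: re-reading input every iteration versus
// reusing an in-memory copy across iterations.
class IterativeReuseSketch {

    // Simulated read from the storage layer; the counter records each read.
    static double[] loadFromStorage(AtomicInteger loadCounter) {
        loadCounter.incrementAndGet();
        return new double[] {1.0, 2.0, 3.0, 4.0};
    }

    // One iteration of a toy algorithm: move the estimate halfway
    // toward the mean of the data.
    static double step(double estimate, double[] data) {
        double sum = 0.0;
        for (double d : data) {
            sum += d;
        }
        double mean = sum / data.length;
        return estimate + 0.5 * (mean - estimate);
    }

    // Disk-oriented style: every iteration goes back to storage.
    static double runRereadingEachIteration(int iterations, AtomicInteger loadCounter) {
        double estimate = 0.0;
        for (int i = 0; i < iterations; i++) {
            estimate = step(estimate, loadFromStorage(loadCounter));
        }
        return estimate;
    }

    // In-memory style: load once, then iterate over the cached copy.
    static double runWithInMemoryCopy(int iterations, AtomicInteger loadCounter) {
        double[] cached = loadFromStorage(loadCounter);
        double estimate = 0.0;
        for (int i = 0; i < iterations; i++) {
            estimate = step(estimate, cached);
        }
        return estimate;
    }

    public static void main(String[] args) {
        AtomicInteger rereadLoads = new AtomicInteger();
        AtomicInteger cachedLoads = new AtomicInteger();
        runRereadingEachIteration(10, rereadLoads);
        runWithInMemoryCopy(10, cachedLoads);
        System.out.println("re-read each iteration: " + rereadLoads.get() + " loads"); // 10 loads
        System.out.println("in-memory copy: " + cachedLoads.get() + " load(s)");       // 1 load(s)
    }
}
```

Both variants compute the same result; the difference is only in how many times the input is fetched, which is where the in-memory approach saves time when the data fits.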
Is Spark replacing Hadoop?
Is Spark replacing MapReduce?
Quite possibly... with time... with caveats
Hadoop is more than MapReduce
Spark
•  Unified Easy Batch Interactive Streaming Framework
•  Pluggable Persistent Store
•  Needs a resource manager
Hadoop
•  Pluggable data parallel framework
•  HDFS and HBase Persistent Store
•  Includes a resource manager (YARN)
Hadoop Supports so Much
•  Alternative batch models: Pig, Cascading, Spark
•  Machine learning: Mahout, SparkML
•  SQL: Hive, Drill, Hive on Tez, Impala, SparkSQL
•  Stream processing: Storm, Flink, Spark, DataTorrent
•  ETL: Sqoop, Flume
•  Storage: file (HDFS/MapR-FS), table (HBase/MapR-DB/Accumulo), messaging (Kafka/MapR-Streams)
•  Data exploration: Hue
•  And too many excellent commercial tools to list
•  Hypothesis:
–  Infrastructure and data tend to be sticky while execution frameworks evolve rapidly
–  Hadoop’s infrastructure and storage support a vigorous and growing ecosystem of “competing” execution engines
Perspective
Unified Easy Batch Interactive Streaming Framework
Pluggable data parallel framework
HDFS and HBase Persistent Store
Interactive SQL (Drill, Impala, Hive.next)
Streaming (Flink, Storm, DataTorrent)
RDBMS (e.g., SpliceMachine)
Ecosystem
SLA (YARN resource reservation, distro mgmt tools, Pepperdata, …)
Security (Drill Views, Ranger, Sentry, BlueTalon…)
Data Wrangling, discovery and governance (Trifacta, Paxata, Waterline…)
Perspective
Unified Easy Batch Interactive Streaming Framework
Pluggable Persistent Store
Ecosystem/Environments
Resource Management – YARN, Mesos, Kubernetes
Deployment – Private OpenStack, Public Cloud, Hybrid
NoSQL/Search (Cassandra, ES)
In-memory (SAP HANA, MemSQL)
RDBMS (MySQL, Oracle, etc.)
Hadoop (HBase, HDFS)
Which is More Realistic?
What about classic applications and data sharing?
Spark becomes primary execution framework
Hadoop remains primary storage and execution framework
Is Spark replacing Hadoop?
Seems improbable
Hadoop grows to embrace new execution frameworks
Is Spark replacing MapReduce?
Quite possibly... with time... with caveats
MapR Platform Services: Open API Architecture
Assures Interoperability, Avoids Lock-in
HDFS API
POSIX NFS
SQL, HBase API
JSON API
Kafka API
Q&A
maprtech
kbotzum@mapr.com
Engage with us!
MapR
maprtech
mapr-technologies
References
•  Spark vs. MapReduce:
–  https://www.mapr.com/blog/apache-spark-vs-mapreduce-whiteboard-walkthrough
–  http://www.vldb.org/pvldb/vol8/p2110-shi.pdf
–  http://aptuz.com/blog/is-apache-spark-going-to-replace-hadoop/
•  Spark: http://spark.apache.org/
•  Spark on MapR: http://maprdocs.mapr.com/51/index.html#Spark/Spark_26984599.html

Weitere ähnliche Inhalte

Was ist angesagt?

Big Data and Hadoop Ecosystem
Big Data and Hadoop EcosystemBig Data and Hadoop Ecosystem
Big Data and Hadoop EcosystemRajkumar Singh
 
Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1Rohit Agrawal
 
Apache hadoop introduction and architecture
Apache hadoop  introduction and architectureApache hadoop  introduction and architecture
Apache hadoop introduction and architectureHarikrishnan K
 
Introduction To Hadoop Ecosystem
Introduction To Hadoop EcosystemIntroduction To Hadoop Ecosystem
Introduction To Hadoop EcosystemInSemble
 
Introduction to Hadoop Technology
Introduction to Hadoop TechnologyIntroduction to Hadoop Technology
Introduction to Hadoop TechnologyManish Borkar
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...Debraj GuhaThakurta
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture EMC
 
Real time hadoop + mapreduce intro
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce introGeoff Hendrey
 
Hadoop Presentation - PPT
Hadoop Presentation - PPTHadoop Presentation - PPT
Hadoop Presentation - PPTAnand Pandey
 
Big data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irBig data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irdatastack
 
Time series database by Harshil Ambagade
Time series database by Harshil AmbagadeTime series database by Harshil Ambagade
Time series database by Harshil AmbagadeSigmoid
 
Hadoop project design and a usecase
Hadoop project design and  a usecaseHadoop project design and  a usecase
Hadoop project design and a usecasesudhakara st
 
Hadoop: Distributed Data Processing
Hadoop: Distributed Data ProcessingHadoop: Distributed Data Processing
Hadoop: Distributed Data ProcessingCloudera, Inc.
 
Hadoop Architecture Options for Existing Enterprise DataWarehouse
Hadoop Architecture Options for Existing Enterprise DataWarehouseHadoop Architecture Options for Existing Enterprise DataWarehouse
Hadoop Architecture Options for Existing Enterprise DataWarehouseAsis Mohanty
 
Hadoop And Their Ecosystem
 Hadoop And Their Ecosystem Hadoop And Their Ecosystem
Hadoop And Their Ecosystemsunera pathan
 

Was ist angesagt? (20)

Big Data and Hadoop Ecosystem
Big Data and Hadoop EcosystemBig Data and Hadoop Ecosystem
Big Data and Hadoop Ecosystem
 
Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1
 
Apache hadoop introduction and architecture
Apache hadoop  introduction and architectureApache hadoop  introduction and architecture
Apache hadoop introduction and architecture
 
Introduction To Hadoop Ecosystem
Introduction To Hadoop EcosystemIntroduction To Hadoop Ecosystem
Introduction To Hadoop Ecosystem
 
Introduction to Hadoop Technology
Introduction to Hadoop TechnologyIntroduction to Hadoop Technology
Introduction to Hadoop Technology
 
Hadoop Ecosystem
Hadoop EcosystemHadoop Ecosystem
Hadoop Ecosystem
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Real time hadoop + mapreduce intro
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce intro
 
Big data and Hadoop
Big data and HadoopBig data and Hadoop
Big data and Hadoop
 
Hadoop Presentation - PPT
Hadoop Presentation - PPTHadoop Presentation - PPT
Hadoop Presentation - PPT
 
Big data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irBig data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.ir
 
Time series database by Harshil Ambagade
Time series database by Harshil AmbagadeTime series database by Harshil Ambagade
Time series database by Harshil Ambagade
 
Hadoop project design and a usecase
Hadoop project design and  a usecaseHadoop project design and  a usecase
Hadoop project design and a usecase
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
Hadoop: Distributed Data Processing
Hadoop: Distributed Data ProcessingHadoop: Distributed Data Processing
Hadoop: Distributed Data Processing
 
Hadoop Architecture Options for Existing Enterprise DataWarehouse
Hadoop Architecture Options for Existing Enterprise DataWarehouseHadoop Architecture Options for Existing Enterprise DataWarehouse
Hadoop Architecture Options for Existing Enterprise DataWarehouse
 
Hadoop And Their Ecosystem
 Hadoop And Their Ecosystem Hadoop And Their Ecosystem
Hadoop And Their Ecosystem
 

Ähnlich wie Is Spark Replacing Hadoop

Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerMark Kromer
 
Advanced Threat Detection on Streaming Data
Advanced Threat Detection on Streaming DataAdvanced Threat Detection on Streaming Data
Advanced Threat Detection on Streaming DataCarol McDonald
 
Functional programming
 for optimization problems 
in Big Data
Functional programming
  for optimization problems 
in Big DataFunctional programming
  for optimization problems 
in Big Data
Functional programming
 for optimization problems 
in Big DataPaco Nathan
 
Big Data Hoopla Simplified - TDWI Memphis 2014
Big Data Hoopla Simplified - TDWI Memphis 2014Big Data Hoopla Simplified - TDWI Memphis 2014
Big Data Hoopla Simplified - TDWI Memphis 2014Rajan Kanitkar
 
Back to School - St. Louis Hadoop Meetup September 2016
Back to School - St. Louis Hadoop Meetup September 2016Back to School - St. Louis Hadoop Meetup September 2016
Back to School - St. Louis Hadoop Meetup September 2016Adam Doyle
 
Cloud Services for Big Data Analytics
Cloud Services for Big Data AnalyticsCloud Services for Big Data Analytics
Cloud Services for Big Data AnalyticsGeoffrey Fox
 
Cloud Services for Big Data Analytics
Cloud Services for Big Data AnalyticsCloud Services for Big Data Analytics
Cloud Services for Big Data AnalyticsGeoffrey Fox
 
Apache spark - History and market overview
Apache spark - History and market overviewApache spark - History and market overview
Apache spark - History and market overviewMartin Zapletal
 
Cleveland Hadoop Users Group - Spark
Cleveland Hadoop Users Group - SparkCleveland Hadoop Users Group - Spark
Cleveland Hadoop Users Group - SparkVince Gonzalez
 
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
 Let Spark Fly: Advantages and Use Cases for Spark on Hadoop Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
Let Spark Fly: Advantages and Use Cases for Spark on HadoopMapR Technologies
 
Putting Apache Drill into Production
Putting Apache Drill into ProductionPutting Apache Drill into Production
Putting Apache Drill into ProductionMapR Technologies
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache SparkAmir Sedighi
 
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...Agile Testing Alliance
 
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...Mathieu Dumoulin
 
Intro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco VasquezIntro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco VasquezMapR Technologies
 
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...Debraj GuhaThakurta
 
CEP - simplified streaming architecture - Strata Singapore 2016
CEP - simplified streaming architecture - Strata Singapore 2016CEP - simplified streaming architecture - Strata Singapore 2016
CEP - simplified streaming architecture - Strata Singapore 2016Mathieu Dumoulin
 

Ähnlich wie Is Spark Replacing Hadoop (20)

Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL Server
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
Apache Spark & Hadoop
Apache Spark & HadoopApache Spark & Hadoop
Apache Spark & Hadoop
 
Advanced Threat Detection on Streaming Data
Advanced Threat Detection on Streaming DataAdvanced Threat Detection on Streaming Data
Advanced Threat Detection on Streaming Data
 
Functional programming
 for optimization problems 
in Big Data
Functional programming
  for optimization problems 
in Big DataFunctional programming
  for optimization problems 
in Big Data
Functional programming
 for optimization problems 
in Big Data
 
Big Data Hoopla Simplified - TDWI Memphis 2014
Big Data Hoopla Simplified - TDWI Memphis 2014Big Data Hoopla Simplified - TDWI Memphis 2014
Big Data Hoopla Simplified - TDWI Memphis 2014
 
Back to School - St. Louis Hadoop Meetup September 2016
Back to School - St. Louis Hadoop Meetup September 2016Back to School - St. Louis Hadoop Meetup September 2016
Back to School - St. Louis Hadoop Meetup September 2016
 
Cloud Services for Big Data Analytics
Cloud Services for Big Data AnalyticsCloud Services for Big Data Analytics
Cloud Services for Big Data Analytics
 
Cloud Services for Big Data Analytics
Cloud Services for Big Data AnalyticsCloud Services for Big Data Analytics
Cloud Services for Big Data Analytics
 
Apache spark - History and market overview
Apache spark - History and market overviewApache spark - History and market overview
Apache spark - History and market overview
 
Cleveland Hadoop Users Group - Spark
Cleveland Hadoop Users Group - SparkCleveland Hadoop Users Group - Spark
Cleveland Hadoop Users Group - Spark
 
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
 Let Spark Fly: Advantages and Use Cases for Spark on Hadoop Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
 
Putting Apache Drill into Production
Putting Apache Drill into ProductionPutting Apache Drill into Production
Putting Apache Drill into Production
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache Spark
 
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...
 
Introduction to Spark
Introduction to SparkIntroduction to Spark
Introduction to Spark
 
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
 
Intro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco VasquezIntro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco Vasquez
 
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
 
CEP - simplified streaming architecture - Strata Singapore 2016
CEP - simplified streaming architecture - Strata Singapore 2016CEP - simplified streaming architecture - Strata Singapore 2016
CEP - simplified streaming architecture - Strata Singapore 2016
 

Mehr von MapR Technologies

Converging your data landscape
Converging your data landscapeConverging your data landscape
Converging your data landscapeMapR Technologies
 
ML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & EvaluationML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & EvaluationMapR Technologies
 
Self-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your DataSelf-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your DataMapR Technologies
 
Enabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureEnabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureMapR Technologies
 
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...MapR Technologies
 
ML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning LogisticsML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning LogisticsMapR Technologies
 
Machine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model ManagementMachine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model ManagementMapR Technologies
 
Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action MapR Technologies
 
Live Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIsLive Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIsMapR Technologies
 
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale StorageBringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale StorageMapR Technologies
 
Live Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn PredictionLive Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn PredictionMapR Technologies
 
An Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data PlatformAn Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data PlatformMapR Technologies
 
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...MapR Technologies
 
Best Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in HealthcareBest Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in HealthcareMapR Technologies
 
Geo-Distributed Big Data and Analytics
Geo-Distributed Big Data and AnalyticsGeo-Distributed Big Data and Analytics
Geo-Distributed Big Data and AnalyticsMapR Technologies
 
MapR Product Update - Spring 2017
MapR Product Update - Spring 2017MapR Product Update - Spring 2017
MapR Product Update - Spring 2017MapR Technologies
 
3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data AnalyticsMapR Technologies
 
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA DeploymentsCisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA DeploymentsMapR Technologies
 
MapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR Technologies
 
Evolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQLEvolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQLMapR Technologies
 

Mehr von MapR Technologies (20)

Converging your data landscape
Converging your data landscapeConverging your data landscape
Converging your data landscape
 
ML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & EvaluationML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & Evaluation
 
Self-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your DataSelf-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your Data
 
Enabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureEnabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data Capture
 
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
 
ML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning LogisticsML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning Logistics
 
Machine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model ManagementMachine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model Management
 
Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action
 
Live Tutorial – Streaming Real-Time Events Using Apache APIs
Is Spark Replacing Hadoop

  • 1. Is Spark Replacing MapReduce? Hadoop?
       Keys Botzum, Senior Principal Technologist
       March 2016 (last update: March 29, 2016)
       ® © 2016 MapR Technologies
  • 2. Companies With Spark on MapR In Production
       •  Fortune 500 Global Telecom
       •  Fortune 500 Health Care
       •  Global Financial Services
  • 3. Cisco: Security Intelligence Operations
       •  Sensor data lands in Hadoop
       •  Streaming for real-time detection and threat alerts
       •  Data next processed on GraphX and Mahout to build threat detection models and accelerated reporting
       •  Additional SQL querying for end-customer reporting and threat detection
  • 4. Circa 2014 …
  • 5. Next-Gen Genomics
       •  Existing process takes several weeks to align chemical compounds with genes
       •  ADAM on Spark allows realignment in a few hours
       •  Geneticists can minimize engineering dependency
  • 6. Is [Spark] replacing [Hadoop]?
  • 7. How about Prod. Mgr’s favorite tool – checkbox list!
       •  DAG
       •  Persistent Store
       •  Machine Learning
       •  Graph
       •  Streaming
       •  Batch SQL
       •  Interactive SQL
       •  Security
       •  Resource Management
       •  Multitenancy
       •  Others
  • 8. Wait. What’s Hadoop?
       Pluggable data parallel framework; HDFS and HBase API based Persistent Store
       •  Proven MapReduce, Hive, Pig
       •  YARN introduces pluggability
       •  Allows for multiple frameworks
       •  Standard for scale out big data store
       •  Stores data as files and tables
       •  Secure
       •  Includes resource management
  • 9. Spark and MapReduce are …
       •  Scalable frameworks for executing custom code on a cluster
       •  Nodes in the cluster work independently to process fragments of data and also combine those fragments together when appropriate to yield a final result
       •  Can tolerate loss of a node during a computation
       •  Require a distributed storage layer for common data view
  • 10. What’s MapReduce
       •  Map
          –  Loading of the data and defining a set of keys
       •  Reduce
          –  Collects the organized key-based data to process and output
       •  Performance can be tweaked based on known details of your source files and cluster shape (size, total number)
  • 11. MapReduce Processing Model
       •  Define mappers
       •  Shuffling is automatic
       •  Define reducers
       •  For complex work, chain jobs together
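
The processing model on slides 10 and 11 (define a mapper, let the framework shuffle, define a reducer) can be illustrated without a cluster. The following is a minimal single-process sketch in plain Java, not Hadoop code: the class `MapReduceSketch` and its methods `map`, `shuffle`, `reduce` and `run` are hypothetical names chosen for illustration, and the shuffle that Hadoop performs automatically across nodes is modelled here as a simple in-memory grouping.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MapReduceSketch {

    // Map phase: emit a (word, 1) pair for every token in one input line.
    public static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(Map.entry(token, 1));
            }
        }
        return out;
    }

    // Shuffle phase: group all emitted values by key.
    // In Hadoop the framework does this; here it is an in-memory stand-in.
    public static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: collapse the values collected for one key into a single count.
    public static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs the three phases in order over all input lines.
    public static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : lines) {
            emitted.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        shuffle(emitted).forEach((key, values) -> result.put(key, reduce(key, values)));
        return result;
    }

    public static void main(String[] args) {
        // prints {be=2, do=1, not=1, or=1, to=3}
        System.out.println(run(List.of("to be or not to be", "to do")));
    }
}
```

Only `map` and `reduce` carry application logic; everything else is plumbing that a real framework supplies, which is the division of labour the slides describe.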
  • 12. MapReduce: The Good
       •  Built in fault tolerance
       •  Optimized IO path
       •  Scalable
       •  Developer focuses on Map/Reduce, not infrastructure
       •  Simple? API
  • 13. MapReduce: The Bad
       •  Batch oriented
       •  Optimized for disk IO
          –  Doesn’t leverage memory well
          –  Iterative algorithms go through the disk IO path again and again
       •  Primitive API
          –  Developers have to build on a very simple abstraction
          –  Key/Value in/out
          –  Even basic things like a join require extensive code
       •  Result is often many files that need to be combined appropriately
  • 14. What’s Spark?
       Batch Interactive Streaming Framework; Pluggable Persistent Store
       •  Powerful API
       •  Leverages memory aggressively
       •  Batch and streaming
       •  MapR-FS, HDFS
       •  MapR-DB, HBase, Cassandra
       •  MapR-Streams, Kafka
       •  S3
  • 15. Apache Spark
       •  spark.apache.org
       •  Originally developed in 2009 in UC Berkeley’s AMP Lab
       •  Fully open sourced in 2010 – now at Apache Software Foundation
  • 16. Spark: Ease of Use and Performance
       •  Easy to Develop
          –  Rich APIs in Java, Scala, Python, R
          –  Interactive shell
       •  Fast to Run
          –  General execution graphs
          –  In-memory storage
       Less code, simpler code
  • 17. Resilient Distributed Datasets (RDD)
       •  Spark revolves around RDDs
       •  Fault-tolerant, read-only collection of elements that can be operated on in parallel
       •  Cached in memory or on disk
       http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
       A newer API is based around DataFrames, but for this presentation the difference isn’t important
  • 18. RDD Operations - Expressive
       •  Transformations
          –  Creation of a new RDD dataset from an existing one
             •  map, filter, distinct, union, sample, groupByKey, join, etc…
       •  Actions
          –  Return a value after running a computation
             •  reduce, collect, count, first, takeSample, foreach, etc…
       Check the documentation for a complete list
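
The split between transformations and actions on slide 18 matters because Spark evaluates transformations lazily: they only describe a new dataset, and work happens when an action asks for a result. The following is a minimal sketch of that behaviour using Java's own `java.util.stream` API as a stand-in. It is not Spark code, and `LazyPipelineSketch`, `transform` and `mapCalls` are hypothetical names. Stream intermediate operations (`filter`, `map`, `distinct`) play the role of transformations and the terminal `collect` plays the role of an action.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LazyPipelineSketch {

    // Counts how many times the map step actually executes.
    public static final AtomicInteger mapCalls = new AtomicInteger();

    // "Transformations": build a pipeline description; no element is processed here.
    public static Stream<Integer> transform(List<String> words) {
        return words.stream()
                .filter(w -> !w.isEmpty())
                .map(w -> {
                    mapCalls.incrementAndGet();
                    return w.length();
                })
                .distinct();
    }

    public static void main(String[] args) {
        Stream<Integer> pipeline = transform(List.of("spark", "hadoop", "", "drill", "hive"));

        // prints: map calls before action: 0
        System.out.println("map calls before action: " + mapCalls.get());

        // "Action": collecting the result forces the whole pipeline to run.
        List<Integer> lengths = pipeline.collect(Collectors.toList());

        // prints: result: [5, 6, 4], map calls after action: 4
        System.out.println("result: " + lengths + ", map calls after action: " + mapCalls.get());
    }
}
```

One difference worth keeping in mind: a Java stream can be consumed only once, whereas an RDD can be reused by several actions and cached in memory or on disk, as slide 17 notes.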
  • 19. Easy: Example – Word Count

       Hadoop MapReduce (Java):

           public static class WordCountMapClass extends MapReduceBase
                   implements Mapper<LongWritable, Text, Text, IntWritable> {
               private final static IntWritable one = new IntWritable(1);
               private Text word = new Text();

               // Emit (word, 1) for every token in the input line.
               public void map(LongWritable key, Text value,
                               OutputCollector<Text, IntWritable> output,
                               Reporter reporter) throws IOException {
                   String line = value.toString();
                   StringTokenizer itr = new StringTokenizer(line);
                   while (itr.hasMoreTokens()) {
                       word.set(itr.nextToken());
                       output.collect(word, one);
                   }
               }
           }

           public static class WordCountReduce extends MapReduceBase
                   implements Reducer<Text, IntWritable, Text, IntWritable> {
               // Sum the counts collected for one word.
               public void reduce(Text key, Iterator<IntWritable> values,
                                  OutputCollector<Text, IntWritable> output,
                                  Reporter reporter) throws IOException {
                   int sum = 0;
                   while (values.hasNext()) {
                       sum += values.next().get();
                   }
                   output.collect(key, new IntWritable(sum));
               }
           }

       Spark (Java):

           JavaRDD<String> textFile = sc.textFile("hdfs://...");
           // Spark 1.x signature: call() returns an Iterable (Spark 2.x and later return an Iterator).
           JavaRDD<String> words = textFile.flatMap(new FlatMapFunction<String, String>() {
               public Iterable<String> call(String s) { return Arrays.asList(s.split(" ")); }
           });
           JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
               public Tuple2<String, Integer> call(String s) { return new Tuple2<String, Integer>(s, 1); }
           });
           JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
               public Integer call(Integer a, Integer b) { return a + b; }
           });
           counts.saveAsTextFile("hdfs://...");

       Spark (Scala):

           val textFile = sc.textFile("hdfs://...")
           val counts = textFile.flatMap(line => line.split(" "))
                                .map(word => (word, 1))
                                .reduceByKey(_ + _)
           counts.saveAsTextFile("hdfs://...")

       Source: http://spark.apache.org/examples.html#
  • 20. Faster for Iterative: PageRank Performance
       [Bar chart: iteration time (s) by number of machines]
       Number of machines    Hadoop    Spark
       30                    171       23
       60                    80        14
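
PageRank is a useful benchmark here because every iteration re-reads the same link structure. The following is a minimal single-machine sketch of the iteration in plain Java. It is not the benchmarked Hadoop or Spark code, and `PageRankSketch`, `step` and `pageRank` are hypothetical names. It assumes every page has at least one out-link and that every link target is itself a key in the map (no dangling pages), and the damping factor 0.85 is the conventional choice rather than a value taken from the slides.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PageRankSketch {

    // One iteration: each page splits its current rank evenly among its out-links.
    public static Map<String, Double> step(Map<String, List<String>> links,
                                           Map<String, Double> ranks,
                                           double damping) {
        Map<String, Double> contributions = new HashMap<>();
        for (String page : links.keySet()) {
            contributions.put(page, 0.0);
        }
        for (Map.Entry<String, List<String>> entry : links.entrySet()) {
            double share = ranks.get(entry.getKey()) / entry.getValue().size();
            for (String target : entry.getValue()) {
                contributions.merge(target, share, Double::sum);
            }
        }
        int n = links.size();
        Map<String, Double> next = new HashMap<>();
        for (Map.Entry<String, Double> entry : contributions.entrySet()) {
            next.put(entry.getKey(), (1 - damping) / n + damping * entry.getValue());
        }
        return next;
    }

    // Starts from a uniform rank and applies `iterations` passes over the same links.
    public static Map<String, Double> pageRank(Map<String, List<String>> links,
                                               int iterations,
                                               double damping) {
        Map<String, Double> ranks = new HashMap<>();
        for (String page : links.keySet()) {
            ranks.put(page, 1.0 / links.size());
        }
        // The link structure is loaded once and reused on every pass.
        for (int i = 0; i < iterations; i++) {
            ranks = step(links, ranks, damping);
        }
        return ranks;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = Map.of(
                "a", List.of("b", "c"),
                "b", List.of("c"),
                "c", List.of("a"));
        System.out.println(pageRank(links, 50, 0.85));
    }
}
```

In this sketch `links` stays in memory across the loop, which is roughly what a cached RDD gives Spark. A chain of MapReduce jobs would instead write each iteration's output to the distributed store and read it back, the repeated disk IO path that slide 13 points out.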
  • 21. Spark vs. MapReduce
       •  Spark is faster than MR for iterative algorithms that fit data in memory
       •  Spark code is easier to write and easier to understand than MR
          –  Your programming is closer to the correct abstraction
       •  Spark supports both batch and streaming models
       •  Advantage Spark
          –  Caution: not all applications run faster on Spark, and Spark may have limitations for some scenarios
  • 22. Is [Spark] replacing [Hadoop]?
       •  Is [Spark] replacing MapReduce?
          –  Quite possibly… with time… with caveats
  • 23. Hadoop is more than MapReduce
       •  [Spark] Unified, easy batch/interactive/streaming framework; pluggable persistent store
          –  Needs a resource manager
       •  [Hadoop] Pluggable data parallel framework; HDFS and HBase persistent store
          –  Includes a resource manager (YARN)
  • 24. Hadoop Supports so Much
       •  Alternative batch models: Pig, Cascading, Spark
       •  Machine learning: Mahout, SparkML
       •  SQL: Hive, Drill, Hive on Tez, Impala, SparkSQL
       •  Stream processing: Storm, Flink, Spark, DataTorrent
       •  ETL: Sqoop, Flume
       •  Storage: file (HDFS/MapR-FS), table (HBase/MapR-DB/Accumulo), messaging (Kafka/MapR-Streams)
       •  Data exploration: Hue
       •  And too many excellent commercial tools to list
       •  Hypothesis:
          –  Infrastructure and data tend to be sticky while execution frameworks evolve rapidly
          –  Hadoop’s infrastructure and storage support a vigorous and growing ecosystem of “competing” execution engines
  • 25. Perspective
       •  Pluggable data parallel framework; HDFS and HBase Persistent Store
       •  Unified, easy batch/interactive/streaming framework
       •  Interactive SQL (Drill, Impala, Hive.next)
       •  Streaming (Flink, Storm, DataTorrent)
       •  RDBMS (e.g. SpliceMachine)
       •  Ecosystem
          –  SLA (YARN resource reservation, distro mgmt tools, Pepperdata, …)
          –  Security (Drill Views, Ranger, Sentry, BlueTalon, …)
          –  Data wrangling, discovery and governance (Trifacta, Paxata, Waterline, …)
  • 26. Perspective
       •  Unified, easy batch/interactive/streaming framework; Pluggable Persistent Store
       •  Persistent stores
          –  NoSQL/Search (Cassandra, ES)
          –  In Mem (SAP HANA, MemSQL)
          –  RDBMS (MySQL, Oracle, etc.)
          –  Hadoop (HBase, HDFS)
       •  Ecosystem/Environments
          –  Resource Management – YARN, Mesos, Kubernetes
          –  Deployment – Private OpenStack, Public Cloud, Hybrid
  • 27. Which is More Realistic?
       •  Spark becomes primary execution framework
       •  Hadoop remains primary storage and execution framework
       What about classic applications and data sharing?
  • 28. Is [Spark] replacing [Hadoop]?
       •  Seems improbable
          –  Hadoop grows to embrace new execution frameworks
       •  Is [Spark] replacing MapReduce?
          –  Quite possibly… with time… with caveats
  • 30. MapR Platform Services: Open API Architecture Assures Interoperability, Avoids Lock-in
       •  HDFS API
       •  POSIX
       •  NFS
       •  SQL, HBase API
       •  JSON API
       •  Kafka API
  • 31. Q&A
       •  kbotzum@mapr.com
       •  Engage with us! maprtech, MapR, mapr-technologies
  • 32. References
       •  Spark vs. MapReduce:
          –  https://www.mapr.com/blog/apache-spark-vs-mapreduce-whiteboard-walkthrough
          –  http://www.vldb.org/pvldb/vol8/p2110-shi.pdf
          –  http://aptuz.com/blog/is-apache-spark-going-to-replace-hadoop/
       •  Spark: http://spark.apache.org/
       •  Spark on MapR: http://maprdocs.mapr.com/51/index.html#Spark/Spark_26984599.html