SlideShare ist ein Scribd-Unternehmen logo
1 von 35
Downloaden Sie, um offline zu lesen
Shark: SQL and Rich
Analytics at Scale	

Reynold Xin, Josh Rosen, Matei Zaharia, Michael Franklin, Scott
Shenker, Ion Stoica	

	

AMPLab, UC Berkeley	

	

June 25 @ SIGMOD 2013
Challenges	

Data size growing	

» Processing has to scale out over large
clusters	

» Faults and stragglers complicate DB design	

	

Complexity of analysis increasing	

» Massive ETL (web crawling)	

» Machine learning, graph processing	

» Leads to long running jobs
The Rise of MapReduce
What’s good about
MapReduce?	

1.  Scales out to thousands of nodes in a fault-
tolerant manner	

2.  Good for analyzing semi-structured data and
complex analytics	

3.  Elasticity (cloud computing)	

4.  Dynamic, multi-tenant resource sharing
“parallel relational database systems are
signiïŹcantly faster than those that rely on the
use of MapReduce for their query engines”	

“I totally agree.”
This Research	

1.  Shows MapReduce model can be extended to
support SQL efïŹciently	

»  Started from a powerful MR-like engine (Spark)	

»  Extended the engine in various ways	

2.  The artifact: Shark, a fast engine on top of MR	

»  Performant SQL	

»  Complex analytics in the same engine	

»  Maintains MR beneïŹts, e.g. fault-tolerance
MapReduce Fundamental Properties?	

Data-parallel operations	

» Apply the same operations on a deïŹned set of data	

	

Fine-grained, deterministic tasks	

» Enables fault-tolerance  straggler mitigation
Why Were Databases Faster?	

Data representation	

» Schema-aware, column-oriented, etc	

» Co-partition  co-location of data	

Execution strategies	

» Scheduling/task launching overhead (~20s in Hadoop)	

» Cost-based optimization	

» Indexing	

Lack of mid-query fault tolerance	

» MR’s pull model costly compared to DBMS “push”	

See Pavlo 2009, Xin 2013.
Why Were Databases Faster?	

Data representation	

» Schema-aware, column-oriented, etc	

» Co-partition  co-location of data	

Execution strategies	

» Scheduling/task launching overhead (~20s in Hadoop)	

» Cost-based optimization	

» Indexing	

Lack of mid-query fault tolerance	

» MR’s pull model costly compared to DBMS “push”	

See Pavlo 2009, Xin 2013.	

Not fundamental to
“MapReduce”	

Can be
surprisingly
cheap
Introducing Shark	

MapReduce-based architecture	

» Uses Spark as the underlying execution engine	

» Scales out and tolerate worker failures	

Performant	

» Low-latency, interactive queries	

» (Optionally) in-memory query processing	

Expressive and ïŹ‚exible	

» Supports both SQL and complex analytics	

» Hive compatible (storage, UDFs, types, metadata, etc)
Spark Engine	

Fast MapReduce-like engine	

» In-memory storage for fast iterative computations	

» General execution graphs	

» Designed for low latency (~100ms jobs)	

Compatible with Hadoop storage APIs	

» Read/write to any Hadoop-supported systems, including
HDFS, Hbase, SequenceFiles, etc	

Growing open source platform	

» 17 companies contributing code
More Powerful MR Engine	

General task DAG	

Pipelines functions
within a stage	

Cache-aware data
locality  reuse	

Partitioning-aware
to avoid shufïŹ‚es	

join	
  
union	
  
groupBy	
  
map	
  
Stage	
  3	
  
Stage	
  1	
  
Stage	
  2	
  
A:	
   B:	
  
C:	
   D:	
  
E:	
  
F:	
  
G:	
  
=	
  previously	
  computed	
  partition	
  
Client	

 CLI	

 JDBC	

Hive Architecture	

Meta
store	

Hadoop Storage (HDFS, S3, 
)	

Driver	

SQL
Parser	

Query
Optimizer	

Physical Plan	

Execution	

MapReduce
Client	

 CLI	

 JDBC	

Shark Architecture	

Meta
store	

Hadoop Storage (HDFS, S3, 
)	

Driver	

SQL
Parser	

Spark	

Cache Mgr.	

Physical Plan	

Execution	

Query
Optimizer
Extending Spark for SQL	

Columnar memory store	

Dynamic query optimization	

Miscellaneous other optimizations (distributed
top-K, partition statistics  pruning a.k.a. coarse-
grained indexes, co-partitioned joins, 
)
Columnar Memory Store	

Simply caching records as JVM objects is inefïŹcient
(huge overhead in MR’s record-oriented model)	

Shark employs column-oriented storage, a
partition of columns is one MapReduce “record”.	

1	
  
Column	
  Storage	
  
2	
   3	
  
john	
   mike	
   sally	
  
4.1	
   3.5	
   6.4	
  
Row	
  Storage	
  
1	
   john	
   4.1	
  
2	
   mike	
   3.5	
  
3	
   sally	
   6.4	
  
BeneïŹt: compact representation, CPU efïŹcient
compression, cache locality.
How do we optimize:

SELECT * FROM table1 a JOIN table2 b ON a.key=b.key
WHERE my_crazy_udf(b.field1, b.field2) = true;
Hard to estimate cardinality!
Partial DAG Execution (PDE)	

Lack of statistics for fresh data and the prevalent
use of UDFs necessitate dynamic approaches to
query optimization.	

	

PDE allows dynamic alternation of query plans
based on statistics collected at run-time.
ShufïŹ‚e Join	

Stage 3Stage 2
Stage 1
Join
Result
Stage 1
Stage 2
Join
Result
Map Join (Broadcast Join)	

minimizes network trafïŹc
PDE Statistics	

Gather customizable statistics at per-partition
granularities while materializing map output.	

» partition sizes, record counts (skew detection)	

Â»â€Żâ€œheavy hitters”	

» approximate histograms	

Can alter query plan based on such statistics	

» map join vs shufïŹ‚e join	

» symmetric vs non-symmetric hash join	

» skew handling
Complex Analytics Integration	

UniïŹed system for SQL,
machine learning	

	

Both share the same set
of workers and caches	

def logRegress(points: RDD[Point]): Vector {
var w = Vector(D, _ = 2 * rand.nextDouble - 1)
for (i - 1 to ITERATIONS) {
val gradient = points.map { p =
val denom = 1 + exp(-p.y * (w dot p.x))
(1 / denom - 1) * p.y * p.x
}.reduce(_ + _)
w -= gradient
}
w
}
val users = sql2rdd(SELECT * FROM user u
JOIN comment c ON c.uid=u.uid)
val features = users.mapRows { row =
new Vector(extractFeature1(row.getInt(age)),
extractFeature2(row.getStr(country)),
...)}
val trainedVector = logRegress(features.cache())
Pavlo Benchmark	

Selection
0 22.5 45 67.5 90
Shark Shark5(disk) Hive
1.1
0 150 300 450 600
Aggregation
1K5Groups
32
Hive
Shark5(disk)
Shark
Shark5Copartitioned
0 500 1000 1500 2000
Runtime5(seconds)
Machine Learning Performance	

KMeans(Clustering
0 36 72 108 144 180
157
4.1
Logistic(Regression
0 24 48 72 96 120
110
0.96
Shark Hadoop
Runtime per iteration (secs)
Real Warehouse Benchmark	

0
25
50
75
100
Q1 Q2 Q3 Q4
Runtime0(seconds)
Shark Shark0(disk) Hive
1.1 0.8 0.7 1.0
1.7TB Real Warehouse Data on 100 EC2 nodes
New Benchmark	

Impala
Impala(mem)
Redshift
Shark(disk)
Shark(mem)
0 5 10 15 20
Runtime(seconds)
http://tinyurl.com/bigdata-benchmark
Other beneïŹts of MapReduce	

Elasticity	

» Query processing can scale up and down dynamically	

StragglerTolerance	

Schema-on-read  Easier ETL	

Engineering	

» MR handles task scheduling / dispatch / launch	

» Simpler query processing code base (~10k LOC)
Berkeley Data Analytics Stack	

Spark	

Shark	

SQL	

HDFS / Hadoop Storage	

Mesos Resource Manager	

Spark
Streaming	

GraphX	

 MLBase
Community	

3000 people attended
online training	

800 meetup members	

17 companies contributing
Conclusion	

Leveraging a modern MapReduce engine and
techniques from databases, Shark supports both
SQL and complex analytics efïŹciently, while
maintaining fault-tolerance.	

Growing open source community	

» Users observe similar speedups in real use cases	

» http://shark.cs.berkeley.edu	

» http://www.spark-project.org
MapReduce	

 DBMSs	

Shark

Weitere Àhnliche Inhalte

Was ist angesagt?

Apache Hadoop & Friends at Utah Java User's Group
Apache Hadoop & Friends at Utah Java User's GroupApache Hadoop & Friends at Utah Java User's Group
Apache Hadoop & Friends at Utah Java User's GroupCloudera, Inc.
 
MapReduce: A useful parallel tool that still has room for improvement
MapReduce: A useful parallel tool that still has room for improvementMapReduce: A useful parallel tool that still has room for improvement
MapReduce: A useful parallel tool that still has room for improvementKyong-Ha Lee
 
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Andrey Vykhodtsev
 
Application of MapReduce in Cloud Computing
Application of MapReduce in Cloud ComputingApplication of MapReduce in Cloud Computing
Application of MapReduce in Cloud ComputingMohammad Mustaqeem
 
Hadoop & MapReduce
Hadoop & MapReduceHadoop & MapReduce
Hadoop & MapReduceNewvewm
 
dmapply: A functional primitive to express distributed machine learning algor...
dmapply: A functional primitive to express distributed machine learning algor...dmapply: A functional primitive to express distributed machine learning algor...
dmapply: A functional primitive to express distributed machine learning algor...Bikash Chandra Karmokar
 
BDM25 - Spark runtime internal
BDM25 - Spark runtime internalBDM25 - Spark runtime internal
BDM25 - Spark runtime internalDavid Lauzon
 
Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011Milind Bhandarkar
 
KIISE:SIGDB Workshop presentation.
KIISE:SIGDB Workshop presentation.KIISE:SIGDB Workshop presentation.
KIISE:SIGDB Workshop presentation.Kyong-Ha Lee
 
Why Spark Is the Next Top (Compute) Model
Why Spark Is the Next Top (Compute) ModelWhy Spark Is the Next Top (Compute) Model
Why Spark Is the Next Top (Compute) ModelDean Wampler
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Deanna Kosaraju
 
Hadoop installation and Running KMeans Clustering with MapReduce Program on H...
Hadoop installation and Running KMeans Clustering with MapReduce Program on H...Hadoop installation and Running KMeans Clustering with MapReduce Program on H...
Hadoop installation and Running KMeans Clustering with MapReduce Program on H...Titus Damaiyanti
 
Hadoop2.2
Hadoop2.2Hadoop2.2
Hadoop2.2Sreejith P
 
Scaling Big Data Mining Infrastructure Twitter Experience
Scaling Big Data Mining Infrastructure Twitter ExperienceScaling Big Data Mining Infrastructure Twitter Experience
Scaling Big Data Mining Infrastructure Twitter ExperienceDataWorks Summit
 
Beyond Map/Reduce: Getting Creative With Parallel Processing
Beyond Map/Reduce: Getting Creative With Parallel ProcessingBeyond Map/Reduce: Getting Creative With Parallel Processing
Beyond Map/Reduce: Getting Creative With Parallel ProcessingEd Kohlwey
 
Database Research on Modern Computing Architecture
Database Research on Modern Computing ArchitectureDatabase Research on Modern Computing Architecture
Database Research on Modern Computing ArchitectureKyong-Ha Lee
 
Real Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingReal Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingHari Shreedharan
 
Bigdata processing with Spark - part II
Bigdata processing with Spark - part IIBigdata processing with Spark - part II
Bigdata processing with Spark - part IIArjen de Vries
 
Transformations and actions a visual guide training
Transformations and actions a visual guide trainingTransformations and actions a visual guide training
Transformations and actions a visual guide trainingSpark Summit
 
Strata Presentation: One Billion Objects in 2GB: Big Data Analytics on Small ...
Strata Presentation: One Billion Objects in 2GB: Big Data Analytics on Small ...Strata Presentation: One Billion Objects in 2GB: Big Data Analytics on Small ...
Strata Presentation: One Billion Objects in 2GB: Big Data Analytics on Small ...randyguck
 

Was ist angesagt? (20)

Apache Hadoop & Friends at Utah Java User's Group
Apache Hadoop & Friends at Utah Java User's GroupApache Hadoop & Friends at Utah Java User's Group
Apache Hadoop & Friends at Utah Java User's Group
 
MapReduce: A useful parallel tool that still has room for improvement
MapReduce: A useful parallel tool that still has room for improvementMapReduce: A useful parallel tool that still has room for improvement
MapReduce: A useful parallel tool that still has room for improvement
 
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
 
Application of MapReduce in Cloud Computing
Application of MapReduce in Cloud ComputingApplication of MapReduce in Cloud Computing
Application of MapReduce in Cloud Computing
 
Hadoop & MapReduce
Hadoop & MapReduceHadoop & MapReduce
Hadoop & MapReduce
 
dmapply: A functional primitive to express distributed machine learning algor...
dmapply: A functional primitive to express distributed machine learning algor...dmapply: A functional primitive to express distributed machine learning algor...
dmapply: A functional primitive to express distributed machine learning algor...
 
BDM25 - Spark runtime internal
BDM25 - Spark runtime internalBDM25 - Spark runtime internal
BDM25 - Spark runtime internal
 
Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011
 
KIISE:SIGDB Workshop presentation.
KIISE:SIGDB Workshop presentation.KIISE:SIGDB Workshop presentation.
KIISE:SIGDB Workshop presentation.
 
Why Spark Is the Next Top (Compute) Model
Why Spark Is the Next Top (Compute) ModelWhy Spark Is the Next Top (Compute) Model
Why Spark Is the Next Top (Compute) Model
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
 
Hadoop installation and Running KMeans Clustering with MapReduce Program on H...
Hadoop installation and Running KMeans Clustering with MapReduce Program on H...Hadoop installation and Running KMeans Clustering with MapReduce Program on H...
Hadoop installation and Running KMeans Clustering with MapReduce Program on H...
 
Hadoop2.2
Hadoop2.2Hadoop2.2
Hadoop2.2
 
Scaling Big Data Mining Infrastructure Twitter Experience
Scaling Big Data Mining Infrastructure Twitter ExperienceScaling Big Data Mining Infrastructure Twitter Experience
Scaling Big Data Mining Infrastructure Twitter Experience
 
Beyond Map/Reduce: Getting Creative With Parallel Processing
Beyond Map/Reduce: Getting Creative With Parallel ProcessingBeyond Map/Reduce: Getting Creative With Parallel Processing
Beyond Map/Reduce: Getting Creative With Parallel Processing
 
Database Research on Modern Computing Architecture
Database Research on Modern Computing ArchitectureDatabase Research on Modern Computing Architecture
Database Research on Modern Computing Architecture
 
Real Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingReal Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark Streaming
 
Bigdata processing with Spark - part II
Bigdata processing with Spark - part IIBigdata processing with Spark - part II
Bigdata processing with Spark - part II
 
Transformations and actions a visual guide training
Transformations and actions a visual guide trainingTransformations and actions a visual guide training
Transformations and actions a visual guide training
 
Strata Presentation: One Billion Objects in 2GB: Big Data Analytics on Small ...
Strata Presentation: One Billion Objects in 2GB: Big Data Analytics on Small ...Strata Presentation: One Billion Objects in 2GB: Big Data Analytics on Small ...
Strata Presentation: One Billion Objects in 2GB: Big Data Analytics on Small ...
 

Andere mochten auch

Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305
Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305
Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305mjfrankli
 
C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...
C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...
C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...DataStax Academy
 
Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala
Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at OoyalaCassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala
Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at OoyalaDataStax Academy
 
Real-time Analytics with Cassandra, Spark, and Shark
Real-time Analytics with Cassandra, Spark, and SharkReal-time Analytics with Cassandra, Spark, and Shark
Real-time Analytics with Cassandra, Spark, and SharkEvan Chan
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSAmazon Web Services
 
Introduction to cloud computing and big data - part1
Introduction to cloud computing and big data - part1Introduction to cloud computing and big data - part1
Introduction to cloud computing and big data - part1Amir Payberah
 
Hive and Shark
Hive and SharkHive and Shark
Hive and SharkAmir Payberah
 
P2P Content Distribution Network
P2P Content Distribution NetworkP2P Content Distribution Network
P2P Content Distribution NetworkAmir Payberah
 

Andere mochten auch (8)

Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305
Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305
Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305
 
C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...
C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...
C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...
 
Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala
Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at OoyalaCassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala
Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala
 
Real-time Analytics with Cassandra, Spark, and Shark
Real-time Analytics with Cassandra, Spark, and SharkReal-time Analytics with Cassandra, Spark, and Shark
Real-time Analytics with Cassandra, Spark, and Shark
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWS
 
Introduction to cloud computing and big data - part1
Introduction to cloud computing and big data - part1Introduction to cloud computing and big data - part1
Introduction to cloud computing and big data - part1
 
Hive and Shark
Hive and SharkHive and Shark
Hive and Shark
 
P2P Content Distribution Network
P2P Content Distribution NetworkP2P Content Distribution Network
P2P Content Distribution Network
 

Ähnlich wie Shark SQL and Rich Analytics at Scale

Shark
SharkShark
SharkAlex Ivy
 
Big Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoBig Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoMark Kromer
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Databricks
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introductionHektor Jacynycz GarcĂ­a
 
Big Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computingBig Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computingAnimesh Chaturvedi
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsDatabricks
 
MAD skills for analysis and big data Machine Learning
MAD skills for analysis and big data Machine LearningMAD skills for analysis and big data Machine Learning
MAD skills for analysis and big data Machine LearningGianvito Siciliano
 
Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Advanced Data Science on Spark-(Reza Zadeh, Stanford)Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Advanced Data Science on Spark-(Reza Zadeh, Stanford)Spark Summit
 
hadoop-spark.ppt
hadoop-spark.ppthadoop-spark.ppt
hadoop-spark.pptNouhaElhaji1
 
Scalable Data Analytics and Visualization with Cloud Optimized Services
Scalable Data Analytics and Visualization with Cloud Optimized ServicesScalable Data Analytics and Visualization with Cloud Optimized Services
Scalable Data Analytics and Visualization with Cloud Optimized ServicesGlobus
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkC4Media
 
In Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkIn Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkVenkata Naga Ravi
 
Űč۔۱ Ú©Ù„Ű§Ù† ŰŻŰ§ŰŻÙ‡ŰŒ چ۱ۧ و Ú†ÚŻÙˆÙ†Ù‡ŰŸ
Űč۔۱ Ú©Ù„Ű§Ù† ŰŻŰ§ŰŻÙ‡ŰŒ چ۱ۧ و Ú†ÚŻÙˆÙ†Ù‡ŰŸŰč۔۱ Ú©Ù„Ű§Ù† ŰŻŰ§ŰŻÙ‡ŰŒ چ۱ۧ و Ú†ÚŻÙˆÙ†Ù‡ŰŸ
Űč۔۱ Ú©Ù„Ű§Ù† ŰŻŰ§ŰŻÙ‡ŰŒ چ۱ۧ و Ú†ÚŻÙˆÙ†Ù‡ŰŸdatastack
 
Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Massimo Schenone
 
20150716 introduction to apache spark v3
20150716 introduction to apache spark v3 20150716 introduction to apache spark v3
20150716 introduction to apache spark v3 Andrey Vykhodtsev
 
How Apache Spark fits in the Big Data landscape
How Apache Spark fits in the Big Data landscapeHow Apache Spark fits in the Big Data landscape
How Apache Spark fits in the Big Data landscapePaco Nathan
 
Brief Intro to Apache Spark @ Stanford ICME
Brief Intro to Apache Spark @ Stanford ICMEBrief Intro to Apache Spark @ Stanford ICME
Brief Intro to Apache Spark @ Stanford ICMEPaco Nathan
 
Building a unified data pipeline in Apache Spark
Building a unified data pipeline in Apache SparkBuilding a unified data pipeline in Apache Spark
Building a unified data pipeline in Apache SparkDataWorks Summit
 
SQL on Hadoop for the Oracle Professional
SQL on Hadoop for the Oracle ProfessionalSQL on Hadoop for the Oracle Professional
SQL on Hadoop for the Oracle ProfessionalMichael Rainey
 
Hadoop - A big data initiative
Hadoop - A big data initiativeHadoop - A big data initiative
Hadoop - A big data initiativeMansi Mehra
 

Ähnlich wie Shark SQL and Rich Analytics at Scale (20)

Shark
SharkShark
Shark
 
Big Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoBig Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with Pentaho
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introduction
 
Big Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computingBig Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computing
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
 
MAD skills for analysis and big data Machine Learning
MAD skills for analysis and big data Machine LearningMAD skills for analysis and big data Machine Learning
MAD skills for analysis and big data Machine Learning
 
Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Advanced Data Science on Spark-(Reza Zadeh, Stanford)Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Advanced Data Science on Spark-(Reza Zadeh, Stanford)
 
hadoop-spark.ppt
hadoop-spark.ppthadoop-spark.ppt
hadoop-spark.ppt
 
Scalable Data Analytics and Visualization with Cloud Optimized Services
Scalable Data Analytics and Visualization with Cloud Optimized ServicesScalable Data Analytics and Visualization with Cloud Optimized Services
Scalable Data Analytics and Visualization with Cloud Optimized Services
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
 
In Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkIn Memory Analytics with Apache Spark
In Memory Analytics with Apache Spark
 
Űč۔۱ Ú©Ù„Ű§Ù† ŰŻŰ§ŰŻÙ‡ŰŒ چ۱ۧ و Ú†ÚŻÙˆÙ†Ù‡ŰŸ
Űč۔۱ Ú©Ù„Ű§Ù† ŰŻŰ§ŰŻÙ‡ŰŒ چ۱ۧ و Ú†ÚŻÙˆÙ†Ù‡ŰŸŰč۔۱ Ú©Ù„Ű§Ù† ŰŻŰ§ŰŻÙ‡ŰŒ چ۱ۧ و Ú†ÚŻÙˆÙ†Ù‡ŰŸ
Űč۔۱ Ú©Ù„Ű§Ù† ŰŻŰ§ŰŻÙ‡ŰŒ چ۱ۧ و Ú†ÚŻÙˆÙ†Ù‡ŰŸ
 
Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?
 
20150716 introduction to apache spark v3
20150716 introduction to apache spark v3 20150716 introduction to apache spark v3
20150716 introduction to apache spark v3
 
How Apache Spark fits in the Big Data landscape
How Apache Spark fits in the Big Data landscapeHow Apache Spark fits in the Big Data landscape
How Apache Spark fits in the Big Data landscape
 
Brief Intro to Apache Spark @ Stanford ICME
Brief Intro to Apache Spark @ Stanford ICMEBrief Intro to Apache Spark @ Stanford ICME
Brief Intro to Apache Spark @ Stanford ICME
 
Building a unified data pipeline in Apache Spark
Building a unified data pipeline in Apache SparkBuilding a unified data pipeline in Apache Spark
Building a unified data pipeline in Apache Spark
 
SQL on Hadoop for the Oracle Professional
SQL on Hadoop for the Oracle ProfessionalSQL on Hadoop for the Oracle Professional
SQL on Hadoop for the Oracle Professional
 
Hadoop - A big data initiative
Hadoop - A big data initiativeHadoop - A big data initiative
Hadoop - A big data initiative
 

Mehr von DataWorks Summit

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash CourseDataWorks Summit
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

Mehr von DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

KĂŒrzlich hochgeladen

Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
Cyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfCyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfOverkill Security
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamUiPathCommunity
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKJago de Vreede
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 

KĂŒrzlich hochgeladen (20)

Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Cyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfCyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdf
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 

Shark SQL and Rich Analytics at Scale

  • 1. Shark: SQL and Rich Analytics at Scale Reynold Xin, Josh Rosen, Matei Zaharia, Michael Franklin, Scott Shenker, Ion Stoica AMPLab, UC Berkeley June 25 @ SIGMOD 2013
  • 2. Challenges Data size growing » Processing has to scale out over large clusters » Faults and stragglers complicate DB design Complexity of analysis increasing » Massive ETL (web crawling) » Machine learning, graph processing » Leads to long running jobs
  • 3. The Rise of MapReduce
  • 4. What’s good about MapReduce? 1.  Scales out to thousands of nodes in a fault- tolerant manner 2.  Good for analyzing semi-structured data and complex analytics 3.  Elasticity (cloud computing) 4.  Dynamic, multi-tenant resource sharing
  • 5.
  • 6. “parallel relational database systems are signiïŹcantly faster than those that rely on the use of MapReduce for their query engines” “I totally agree.”
  • 7.
  • 8. This Research 1.  Shows MapReduce model can be extended to support SQL efïŹciently »  Started from a powerful MR-like engine (Spark) »  Extended the engine in various ways 2.  The artifact: Shark, a fast engine on top of MR »  Performant SQL »  Complex analytics in the same engine »  Maintains MR beneïŹts, e.g. fault-tolerance
  • 9. MapReduce Fundamental Properties? Data-parallel operations » Apply the same operations on a deïŹned set of data Fine-grained, deterministic tasks » Enables fault-tolerance straggler mitigation
  • 10.
  • 11. Why Were Databases Faster? Data representation » Schema-aware, column-oriented, etc » Co-partition co-location of data Execution strategies » Scheduling/task launching overhead (~20s in Hadoop) » Cost-based optimization » Indexing Lack of mid-query fault tolerance » MR’s pull model costly compared to DBMS “push” See Pavlo 2009, Xin 2013.
  • 12. Why Were Databases Faster? Data representation » Schema-aware, column-oriented, etc » Co-partition co-location of data Execution strategies » Scheduling/task launching overhead (~20s in Hadoop) » Cost-based optimization » Indexing Lack of mid-query fault tolerance » MR’s pull model costly compared to DBMS “push” See Pavlo 2009, Xin 2013. Not fundamental to “MapReduce” Can be surprisingly cheap
  • 13. Introducing Shark MapReduce-based architecture » Uses Spark as the underlying execution engine » Scales out and tolerate worker failures Performant » Low-latency, interactive queries » (Optionally) in-memory query processing Expressive and ïŹ‚exible » Supports both SQL and complex analytics » Hive compatible (storage, UDFs, types, metadata, etc)
  • 14. Spark Engine Fast MapReduce-like engine » In-memory storage for fast iterative computations » General execution graphs » Designed for low latency (~100ms jobs) Compatible with Hadoop storage APIs » Read/write to any Hadoop-supported systems, including HDFS, Hbase, SequenceFiles, etc Growing open source platform » 17 companies contributing code
  • 15. More Powerful MR Engine General task DAG Pipelines functions within a stage Cache-aware data locality reuse Partitioning-aware to avoid shufïŹ‚es join   union   groupBy   map   Stage  3   Stage  1   Stage  2   A:   B:   C:   D:   E:   F:   G:   =  previously  computed  partition  
  • 16. Client CLI JDBC Hive Architecture Meta store Hadoop Storage (HDFS, S3, 
) Driver SQL Parser Query Optimizer Physical Plan Execution MapReduce
  • 17. Client CLI JDBC Shark Architecture Meta store Hadoop Storage (HDFS, S3, 
) Driver SQL Parser Spark Cache Mgr. Physical Plan Execution Query Optimizer
  • 18. Extending Spark for SQL Columnar memory store Dynamic query optimization Miscellaneous other optimizations (distributed top-K, partition statistics pruning a.k.a. coarse- grained indexes, co-partitioned joins, 
)
  • 19. Columnar Memory Store Simply caching records as JVM objects is inefïŹcient (huge overhead in MR’s record-oriented model) Shark employs column-oriented storage, a partition of columns is one MapReduce “record”. 1   Column  Storage   2   3   john   mike   sally   4.1   3.5   6.4   Row  Storage   1   john   4.1   2   mike   3.5   3   sally   6.4   BeneïŹt: compact representation, CPU efïŹcient compression, cache locality.
  • 20.
  • 21. How do we optimize: SELECT * FROM table1 a JOIN table2 b ON a.key=b.key WHERE my_crazy_udf(b.field1, b.field2) = true; Hard to estimate cardinality!
  • 22. Partial DAG Execution (PDE) Lack of statistics for fresh data and the prevalent use of UDFs necessitate dynamic approaches to query optimization. PDE allows dynamic alternation of query plans based on statistics collected at run-time.
  • 23. ShufïŹ‚e Join Stage 3Stage 2 Stage 1 Join Result Stage 1 Stage 2 Join Result Map Join (Broadcast Join) minimizes network trafïŹc
  • 24. PDE Statistics Gather customizable statistics at per-partition granularities while materializing map output. » partition sizes, record counts (skew detection) Â»â€Żâ€œheavy hitters” » approximate histograms Can alter query plan based on such statistics » map join vs shufïŹ‚e join » symmetric vs non-symmetric hash join » skew handling
  • 25. Complex Analytics Integration UniïŹed system for SQL, machine learning Both share the same set of workers and caches def logRegress(points: RDD[Point]): Vector { var w = Vector(D, _ = 2 * rand.nextDouble - 1) for (i - 1 to ITERATIONS) { val gradient = points.map { p = val denom = 1 + exp(-p.y * (w dot p.x)) (1 / denom - 1) * p.y * p.x }.reduce(_ + _) w -= gradient } w } val users = sql2rdd(SELECT * FROM user u JOIN comment c ON c.uid=u.uid) val features = users.mapRows { row = new Vector(extractFeature1(row.getInt(age)), extractFeature2(row.getStr(country)), ...)} val trainedVector = logRegress(features.cache())
  • 26. Pavlo Benchmark Selection 0 22.5 45 67.5 90 Shark Shark5(disk) Hive 1.1 0 150 300 450 600 Aggregation 1K5Groups 32 Hive Shark5(disk) Shark Shark5Copartitioned 0 500 1000 1500 2000 Runtime5(seconds)
  • 27. Machine Learning Performance KMeans(Clustering 0 36 72 108 144 180 157 4.1 Logistic(Regression 0 24 48 72 96 120 110 0.96 Shark Hadoop Runtime per iteration (secs)
  • 28. Real Warehouse Benchmark 0 25 50 75 100 Q1 Q2 Q3 Q4 Runtime0(seconds) Shark Shark0(disk) Hive 1.1 0.8 0.7 1.0 1.7TB Real Warehouse Data on 100 EC2 nodes
  • 29. New Benchmark Impala Impala(mem) Redshift Shark(disk) Shark(mem) 0 5 10 15 20 Runtime(seconds) http://tinyurl.com/bigdata-benchmark
  • 30. Other beneïŹts of MapReduce Elasticity » Query processing can scale up and down dynamically StragglerTolerance Schema-on-read Easier ETL Engineering » MR handles task scheduling / dispatch / launch » Simpler query processing code base (~10k LOC)
  • 31. Berkeley Data Analytics Stack Spark Shark SQL HDFS / Hadoop Storage Mesos Resource Manager Spark Streaming GraphX MLBase
  • 32. Community 3000 people attended online training 800 meetup members 17 companies contributing
  • 33. Conclusion Leveraging a modern MapReduce engine and techniques from databases, Shark supports both SQL and complex analytics efïŹciently, while maintaining fault-tolerance. Growing open source community » Users observe similar speedups in real use cases » http://shark.cs.berkeley.edu » http://www.spark-project.org
  • 34.