"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
Â
Shark: SQL and Rich Analytics at Scale
1. Shark: SQL and Rich Analytics at Scale
Reynold Xin, Josh Rosen, Matei Zaharia, Michael Franklin, Scott Shenker, Ion Stoica
AMPLab, UC Berkeley
June 25 @ SIGMOD 2013
2. Challenges
Data size growing
» Processing has to scale out over large clusters
» Faults and stragglers complicate DB design
Complexity of analysis increasing
» Massive ETL (web crawling)
» Machine learning, graph processing
» Leads to long-running jobs
4. What's good about MapReduce?
1. Scales out to thousands of nodes in a fault-tolerant manner
2. Good for analyzing semi-structured data and complex analytics
3. Elasticity (cloud computing)
4. Dynamic, multi-tenant resource sharing
6. "parallel relational database systems are significantly faster than those that rely on the use of MapReduce for their query engines"
"I totally agree."
8. This Research
1. Shows the MapReduce model can be extended to support SQL efficiently
» Started from a powerful MR-like engine (Spark)
» Extended the engine in various ways
2. The artifact: Shark, a fast engine on top of MR
» Performant SQL
» Complex analytics in the same engine
» Maintains MR benefits, e.g. fault tolerance
9. MapReduce Fundamental Properties?
Data-parallel operations
» Apply the same operations to a defined set of data
Fine-grained, deterministic tasks
» Enables fault tolerance and straggler mitigation
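These two properties can be seen in a minimal sketch of the MapReduce model itself (illustrative Python, not Shark's API): every record gets the same map function, and each task is deterministic and side-effect free, so a failed or straggling task can simply be re-run.

```python
from collections import defaultdict

def map_phase(records, map_fn):
    """Apply the same map function independently to every record
    (data-parallel), emitting (key, value) pairs."""
    pairs = []
    for r in records:
        pairs.extend(map_fn(r))
    return pairs

def reduce_phase(pairs, reduce_fn):
    """Group values by key, then reduce each group.
    Deterministic inputs -> safe to re-execute on failure."""
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return {k: reduce_fn(vs) for k, vs in groups.items()}

# Word count as the canonical example.
lines = ["to be or not", "to be"]
pairs = map_phase(lines, lambda line: [(w, 1) for w in line.split()])
counts = reduce_phase(pairs, sum)
# counts == {"to": 2, "be": 2, "or": 1, "not": 1}
```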
11. Why Were Databases Faster?
Data representation
» Schema-aware, column-oriented, etc.
» Co-partitioning / co-location of data
Execution strategies
» Scheduling / task-launching overhead (~20s in Hadoop)
» Cost-based optimization
» Indexing
Lack of mid-query fault tolerance
» MR's pull model costly compared to DBMS "push"
See Pavlo 2009, Xin 2013.
12. Why Were Databases Faster?
Annotated: data representation and execution strategies are not fundamental to "MapReduce", and mid-query fault tolerance can be surprisingly cheap.
13. Introducing Shark
MapReduce-based architecture
» Uses Spark as the underlying execution engine
» Scales out and tolerates worker failures
Performant
» Low-latency, interactive queries
» (Optionally) in-memory query processing
Expressive and flexible
» Supports both SQL and complex analytics
» Hive-compatible (storage, UDFs, types, metadata, etc.)
14. Spark Engine
Fast MapReduce-like engine
» In-memory storage for fast iterative computations
» General execution graphs
» Designed for low latency (~100ms jobs)
Compatible with Hadoop storage APIs
» Read/write to any Hadoop-supported system, including HDFS, HBase, SequenceFiles, etc.
Growing open source platform
» 17 companies contributing code
15. More Powerful MR Engine
» General task DAG
» Pipelines functions within a stage
» Cache-aware data locality and reuse
» Partitioning-aware to avoid shuffles
[Figure: a DAG of stages 1-3 over datasets A-G, combining map, union, groupBy, and join operators; shaded boxes mark previously computed partitions.]
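"Pipelines functions within a stage" means consecutive per-record operators run record-by-record in one pass, rather than materializing each intermediate dataset. A minimal sketch of this operator fusion (illustrative Python, not Spark's internals):

```python
def pipeline(*fns):
    """Fuse a chain of per-record functions into a single function,
    so a stage applies all of them to each record in one pass,
    without materializing intermediate collections."""
    def fused(record):
        for fn in fns:
            record = fn(record)
        return record
    return fused

# One fused stage instead of three materialized intermediates.
stage = pipeline(str.strip, str.lower, len)
result = [stage(x) for x in ["  Foo ", "BAR"]]
# result == [3, 3]
```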
16. Hive Architecture
Client: CLI, JDBC
Driver: SQL Parser → Query Optimizer → Physical Plan Execution via MapReduce
Metastore
Hadoop Storage (HDFS, S3, ...)
17. Shark Architecture
Client: CLI, JDBC
Driver: SQL Parser → Query Optimizer → Physical Plan Execution via Spark, with Cache Mgr.
Metastore
Hadoop Storage (HDFS, S3, ...)
18. Extending Spark for SQL
Columnar memory store
Dynamic query optimization
Miscellaneous other optimizations (distributed top-K, partition statistics pruning a.k.a. coarse-grained indexes, co-partitioned joins, ...)
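Partition statistics pruning acts as a coarse-grained index: per-partition summaries (e.g. a column's min/max) let the scan skip whole partitions that cannot satisfy a predicate. A hypothetical sketch of the idea, assuming min/max stats on one numeric column:

```python
# Each cached partition carries min/max stats for a column
# (a coarse-grained index over the partition, not per-row).
partitions = [
    {"rows": [1, 3, 7],    "min": 1,  "max": 7},
    {"rows": [12, 15],     "min": 12, "max": 15},
    {"rows": [20, 22, 25], "min": 20, "max": 25},
]

def scan_gt(partitions, threshold):
    """Evaluate `col > threshold`, skipping partitions whose max
    is <= threshold: no row there can possibly match."""
    hits = []
    for p in partitions:
        if p["max"] > threshold:  # statistics check before scanning
            hits.extend(r for r in p["rows"] if r > threshold)
    return hits

scan_gt(partitions, 14)  # first partition is never scanned
# -> [15, 20, 22, 25]
```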
19. Columnar Memory Store
Simply caching records as JVM objects is inefficient (huge overhead in MR's record-oriented model).
Shark employs column-oriented storage: a partition of columns is one MapReduce "record".
[Figure: row storage keeps whole records (1, john, 4.1), (2, mike, 3.5), (3, sally, 6.4); column storage keeps per-column arrays [1, 2, 3], [john, mike, sally], [4.1, 3.5, 6.4].]
Benefit: compact representation, CPU-efficient compression, cache locality.
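The row-vs-column trade-off from the figure can be sketched in a few lines (illustrative Python; Shark's actual store uses primitive JVM arrays per column). Each column becomes one compact, homogeneous array instead of one boxed object per record:

```python
import array

# Row storage: one tuple (object) per record.
rows = [(1, "john", 4.1), (2, "mike", 3.5), (3, "sally", 6.4)]

# Column storage: one compact array per column for the whole partition.
# Primitive typed arrays avoid per-value object headers ("boxing"),
# and a single-column scan touches contiguous memory.
ids    = array.array("i", (r[0] for r in rows))
names  = [r[1] for r in rows]
scores = array.array("d", (r[2] for r in rows))

# An aggregate over one column never deserializes the other columns.
avg_score = sum(scores) / len(scores)
```

Homogeneous columns also compress well (run-length or dictionary encoding per column), which is the "CPU-efficient compression" benefit on the slide.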
21. How do we optimize:
SELECT * FROM table1 a JOIN table2 b ON a.key=b.key
WHERE my_crazy_udf(b.field1, b.field2) = true;
Hard to estimate cardinality!
22. Partial DAG Execution (PDE)
Lack of statistics for fresh data and the prevalent use of UDFs necessitate dynamic approaches to query optimization.
PDE allows dynamic alteration of query plans based on statistics collected at run time.
24. PDE Statistics
Gather customizable statistics at per-partition granularity while materializing map output:
» partition sizes, record counts (skew detection)
» "heavy hitters"
» approximate histograms
Can alter the query plan based on such statistics:
» map join vs. shuffle join
» symmetric vs. non-symmetric hash join
» skew handling
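The map-join-vs-shuffle-join decision above can be sketched as a simple rule over the observed map-output sizes (a hypothetical threshold; Shark's actual policy and limits are its own):

```python
def choose_join(map_output_sizes_bytes, broadcast_threshold=32 << 20):
    """Pick a join strategy after the map stage, using the observed
    total size of one join side's map output, a PDE-style runtime
    statistic unavailable to a static optimizer for fresh data."""
    total = sum(map_output_sizes_bytes)
    if total <= broadcast_threshold:
        # Small enough to ship to every worker: a map (broadcast) join
        # avoids shuffling the large table entirely.
        return "map_join"
    return "shuffle_join"

choose_join([4 << 20, 8 << 20])     # 12 MB side fits -> "map_join"
choose_join([64 << 20, 128 << 20])  # 192 MB -> "shuffle_join"
```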
25. Complex Analytics Integration
Unified system for SQL and machine learning
Both share the same set of workers and caches

def logRegress(points: RDD[Point]): Vector = {
  var w = Vector(D, _ => 2 * rand.nextDouble - 1)
  for (i <- 1 to ITERATIONS) {
    val gradient = points.map { p =>
      val denom = 1 + exp(-p.y * (w dot p.x))
      (1 / denom - 1) * p.y * p.x
    }.reduce(_ + _)
    w -= gradient
  }
  w
}

val users = sql2rdd("SELECT * FROM user u JOIN comment c ON c.uid=u.uid")
val features = users.mapRows { row =>
  new Vector(extractFeature1(row.getInt("age")),
             extractFeature2(row.getStr("country")),
             ...)
}
val trainedVector = logRegress(features.cache())
30. Other benefits of MapReduce
Elasticity
» Query processing can scale up and down dynamically
Straggler tolerance
Schema-on-read, easier ETL
Engineering
» MR handles task scheduling / dispatch / launch
» Simpler query processing code base (~10k LOC)
33. Conclusion
Leveraging a modern MapReduce engine and techniques from databases, Shark supports both SQL and complex analytics efficiently, while maintaining fault tolerance.
Growing open source community
» Users observe similar speedups in real use cases
» http://shark.cs.berkeley.edu
» http://www.spark-project.org