SlideShare ist ein Scribd-Unternehmen logo
1 von 56
Downloaden Sie, um offline zu lesen
A look under the hood at Apache
Spark's API and engine evolutions
Reynold Xin @rxin
2017-02-08, Amsterdam Meetup
About Databricks
Founded by creators of Spark
Cloud data platform
- Spark
- Interactive analysis
- Cluster management
- Production pipelines
- Data governance & security
Databricks Amsterdam R&D Center
Started in January
Hiring distributed systems &
database engineers!
Email me: rxin@databricks.com
SQL Streaming MLlib
Spark Core (RDD)
GraphX
Spark stack diagram
Frontend
(user facing APIs)
Backend
(execution)
Spark stack diagram
(a different take)
Frontend
(RDD, DataFrame, ML pipelines, …)
Backend
(scheduler, shuffle, operators, …)
Spark stack diagram
(a different take)
Today’s Talk
Some archaeology
- IMS, relational databases
- MapReduce
- data frames
Last 6 years of Spark evolution
Databases
IBM IMS hierarchical database (1966)
Image from https://stratechery.com/2016/oracles-cloudy-future/
Hierarchical Database
- Improvement over file system: query language & catalog
- Lack of flexibility
- Difficult to query items in different parts of the hierarchy
- Relationships are pre-determined and difficult to change
“Future users of large data banks must be protected from having to
know how the data is organized in the machine. …
most application programs should remain unaffected when the
internal representation of data is changed and even when some
aspects of the external representation are changed.”
Era of relational databases (late 60s)
Two “new” important ideas
Physical Data Independence: The ability to change the physical data
layout without having to change the logical schema.
Declarative Query Language: Programmer specifies “what” rather than
“how”.
Why?
Business applications outlive the environments they were created in:
- New requirements might surface
- Underlying hardware might change
- Require physical layout changes (indexing, different storage medium, etc)
Enabled tremendous amount of innovation:
- Indexes, compression, column stores, etc
Relational Database Pros vs Cons
- Declarative and data independent
- SQL is the universal interface everybody knows
- Difficult to compose & build complex applications
- Too opinionated and inflexible
- Require data modeling before putting any data in
- SQL is the only programming language
Big Data, MapReduce,
Hadoop
Challenges Google faced
Data size growing (volume & velocity)
- Processing has to scale out over large clusters
Complexity of analysis increasing (variety)
- Massive ETL (web crawling)
- Machine learning, graph processing
The Big Data Problem
Semi-/Un-structured data doesn’t fit well with databases
Single machine can no longer process or even store all the data!
Only solution is to distribute general storage & processing over
clusters.
Google Datacenter
How do we program this thing?
19
Data-Parallel Models
Restrict the programming interface so that the system can do more
automatically
“Here’s an operation, run it on all of the data”
- I don’t care where it runs (you schedule that)
- In fact, feel free to run it twice on different nodes
- Siimlar to “declarative programming” in databases
MapReduce Pros vs Cons
- Massively parallel
- Very flexible programming model & schema-on-read
- Extremely verbose & difficult to learn
- Most real applications require multiple MR steps
- 21 MR steps -> 21 mapper and reducer classes
- Lots of boilerplate code per step
- Bad performance
R, Python, data frame
Data frames in R / Python
> head(filter(df, df$waiting < 50)) # an example in R
## eruptions waiting
##1 1.750 47
##2 1.750 47
##3 1.867 48
Developed by stats community & concise syntax for ad-hoc analysis
Procedural (not declarative)
R data frames Pros and Cons
- Easy to learn
- Pretty fast on a laptop (or one server)
- No parallelism & doesn’t work well on big data
- Lack sophisticated query optimization
“Are you going to talk
about Spark at all
tonight!?”
Which one is better?
Databases, R, MapReduce?
Declarative, procedural, data independence?
Spark’s initial focus: a better MapReduce
Language-integrated API (RDD): similar to Scala’s collection library
using functional programming; incredibly powerful and composable
lines = spark.textFile(“hdfs://...”) // RDD[String]
points = lines.map(line => parsePoint(line)) // RDD[Point]
points.filter(p => p.x > 100).count()
Better performance: through a more general DAG abstraction, faster
scheduling, and in-memory caching
Programmability
WordCount in 50+ lines of Java MR
WordCount in 3 lines of Spark
Challenge with Functional API
Looks high-level, but hides many semantics of computation
• Functions are arbitrary blocks of Java bytecode
• Data stored is arbitrary Java objects
Users can mix APIs in suboptimal ways
map
filter
groupBy
sort
union
join
leftOuterJoin
rightOuterJoin
reduce
count
fold
reduceByKey
cogroup
cross
zip
sample
take
first
partitionBy
mapWith
pipe
save
...
groupByKey
Which Operator Causes Most Tickets?
Example Problem
pairs = data.map(word => (word, 1))
groups = pairs.groupByKey()
groups.map((k, vs) => (k, vs.sum))
Physical API:
Materializes all groups
as Seq[Int] objects
Then promptly
aggregates them
Challenge: Data Representation
Java objects often many times larger than underlying fields
class User(name: String, friends: Array[Int])
new User(“Bobby”, Array(1, 2))
User 0x… 0x…
String
3
0
1 2
Bobby
5 0x…
int[]
char[] 5
Recap: two primary issues
1. Many APIs specifies the “physical” behavior, rather than the
“logical” intent, aka not declarative enough.
2. Closures (user-defined functions and types) are opaque to the
engine, an as a result little room for improvement.
Sort Benchmark
Originally sponsored by Jim Gray to measure advancements in
software and hardware in 1987
Participants often used purpose-built hardware/software to compete
• Large companies: IBM, Microsoft, Yahoo, …
• Academia: UC Berkeley, UCSD, MIT, …
Sort Benchmark
• Past winners: Microsoft, Yahoo, Samsung, UCSD, …
1MB -> 100MB -> 1TB (1998) -> 100TB (2009)
Winning Attempt
Built on low-level Spark API and:
- Put all data in off-heap memory using sun.misc.Unsafe
- Use tight low level while loops rather than iterators
~ 3000 lines of low level code on Spark written by Reynold
On-Disk Sort Record:
Time to sort 100TB
2100 machines2013 Record:
Hadoop
2014 Record:
Spark
Source: Daytona GraySort benchmark, sortbenchmark.org
72 minutes
207 machines
23 minutes
Also sorted 1PB in 4 hours
How do we enable the
average users to win a
world record, using a few
lines of code?
Goals of last two year’s API evolution
1. Simpler APIs bridging the gap between big data engineering and
data science.
2. Higher level, declarative APIs that are future proof (engine can
optimize programs automatically).
Taking the best ideas from databases, big data, and data science
Structured APIs:
DataFrames + Spark SQL
DataFrames and Spark SQL
Efficient library for structured data (data with a known schema)
• Two interfaces: SQL for analysts + apps, DataFrames for programmers
Optimized computation and storage, similar to RDBMS
SIGMOD 2015
Execution Steps
Logical
Plan
Physical
Plan
Catalog
Optimizer
RDDs
…
Data
Source
API
SQL
Code
Generator
Data
Frames
DataFrame API
DataFrames hold rows with a known schema and offer relational
operations on them through a DSL
val users = spark.sql(“select * from users”)
val massUsers = users(users(“country”) === “NL”)
massUsers.count()
massUsers.groupBy(“name”).avg(“age”)
Expression AST
Spark RDD Execution
Java/Scala
frontend
JVM
backend
Python
frontend
Python
backend
opaque closures
(user-defined functions)
Spark DataFrame Execution
DataFrame
frontend
Logical Plan
Physical
execution
Catalyst
optimizer
Intermediate representation for computation
Spark DataFrame Execution
Python
DF
Logical Plan
Physical
execution
Catalyst
optimizer
Java/Scala
DF
R
DF
Intermediate representation for computation
Simple wrappers to create logical plan
Structured API Example
events =
sc.read.json(“/logs”)
stats =
events.join(users)
.groupBy(“loc”,“status”)
.avg(“duration”)
errors = stats.where(
stats.status == “ERR”)
DataFrame API Optimized Plan Specialized Code
SCAN logs SCAN users
JOIN
AGG
FILTER
while(logs.hasNext) {
e = logs.next
if(e.status == “ERR”) {
u = users.get(e.uid)
key = (u.loc, e.status)
sum(key) += e.duration
count(key) += 1
}
}
...
Benefit of Logical Plan: Simpler Frontend
Python : ~2000 line of code (built over a weekend)
R : ~1000 line of code
i.e. much easier to add new language bindings (Julia, Clojure, …)
Performance
0 2 4 6 8 10
Java/Scala
Python
Runtime for an example aggregation workload
RDD
Benefit of Logical Plan:
Performance Parity Across Languages
0 2 4 6 8 10
Java/Scala
Python
Java/Scala
Python
R
SQL
Runtime for an example aggregation workload (secs)
DataFrame
RDD
What are Spark’s structured APIs?
Combination of:
- data frame from R as the “interface” – easy to learn
- declarativity & data independence from databases -- easy to optimize &
future-proof
- flexibility & parallelism from MapReduce -- massively scalable & flexible
Future possibilities
Spark as a fast, multi-core data collection library
Spark as a performant streaming engine
Spark as a GPU computation framework
All using the same API
Python Java/Scala RSQL …
DataFrame
Logical Plan
LLVMJVM SIMD GPUs
Unified API, One Engine, Automatically Optimized
Tungsten
backend
language
frontend
…
Recap
We learn from previous generation systems to understand what works,
and what can be improved on, and evolve Spark
Latest APIs take the best ideas out of earlier systems
- data frame from R as the “interface” – easy to learn
- declarativity & data independence from databases -- easy to optimize &
future-proof
- flexibility & parallelism from MapReduce -- massively scalable & flexible
Dank je wel
@rxin

Weitere ähnliche Inhalte

Was ist angesagt?

Monitor Apache Spark 3 on Kubernetes using Metrics and Plugins
Monitor Apache Spark 3 on Kubernetes using Metrics and PluginsMonitor Apache Spark 3 on Kubernetes using Metrics and Plugins
Monitor Apache Spark 3 on Kubernetes using Metrics and PluginsDatabricks
 
Consumer offset management in Kafka
Consumer offset management in KafkaConsumer offset management in Kafka
Consumer offset management in KafkaJoel Koshy
 
Introduction to Apache Flink
Introduction to Apache FlinkIntroduction to Apache Flink
Introduction to Apache Flinkdatamantra
 
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...Databricks
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks
 
Delta Lake: Optimizing Merge
Delta Lake: Optimizing MergeDelta Lake: Optimizing Merge
Delta Lake: Optimizing MergeDatabricks
 
Delta Lake Streaming: Under the Hood
Delta Lake Streaming: Under the HoodDelta Lake Streaming: Under the Hood
Delta Lake Streaming: Under the HoodDatabricks
 
Apache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationApache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationDatabricks
 
Designing Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDesigning Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDatabricks
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDatabricks
 
Productizing Structured Streaming Jobs
Productizing Structured Streaming JobsProductizing Structured Streaming Jobs
Productizing Structured Streaming JobsDatabricks
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsDatabricks
 
From Postgres to Event-Driven: using docker-compose to build CDC pipelines in...
From Postgres to Event-Driven: using docker-compose to build CDC pipelines in...From Postgres to Event-Driven: using docker-compose to build CDC pipelines in...
From Postgres to Event-Driven: using docker-compose to build CDC pipelines in...confluent
 
Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Databricks
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLDatabricks
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Databricks
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)James Serra
 
Apache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudApache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudDatabricks
 
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsFine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsDatabricks
 

Was ist angesagt? (20)

Monitor Apache Spark 3 on Kubernetes using Metrics and Plugins
Monitor Apache Spark 3 on Kubernetes using Metrics and PluginsMonitor Apache Spark 3 on Kubernetes using Metrics and Plugins
Monitor Apache Spark 3 on Kubernetes using Metrics and Plugins
 
Unified Stream and Batch Processing with Apache Flink
Unified Stream and Batch Processing with Apache FlinkUnified Stream and Batch Processing with Apache Flink
Unified Stream and Batch Processing with Apache Flink
 
Consumer offset management in Kafka
Consumer offset management in KafkaConsumer offset management in Kafka
Consumer offset management in Kafka
 
Introduction to Apache Flink
Introduction to Apache FlinkIntroduction to Apache Flink
Introduction to Apache Flink
 
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its Benefits
 
Delta Lake: Optimizing Merge
Delta Lake: Optimizing MergeDelta Lake: Optimizing Merge
Delta Lake: Optimizing Merge
 
Delta Lake Streaming: Under the Hood
Delta Lake Streaming: Under the HoodDelta Lake Streaming: Under the Hood
Delta Lake Streaming: Under the Hood
 
Apache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationApache Spark Core – Practical Optimization
Apache Spark Core – Practical Optimization
 
Designing Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDesigning Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things Right
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
 
Productizing Structured Streaming Jobs
Productizing Structured Streaming JobsProductizing Structured Streaming Jobs
Productizing Structured Streaming Jobs
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 
From Postgres to Event-Driven: using docker-compose to build CDC pipelines in...
From Postgres to Event-Driven: using docker-compose to build CDC pipelines in...From Postgres to Event-Driven: using docker-compose to build CDC pipelines in...
From Postgres to Event-Driven: using docker-compose to build CDC pipelines in...
 
Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQL
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)
 
Apache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudApache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the Cloud
 
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsFine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark Jobs
 

Ähnlich wie A look under the hood at Apache Spark's API and engine evolutions

Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkC4Media
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Databricks
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introductionHektor Jacynycz García
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Djamel Zouaoui
 
How Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapeHow Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapePaco Nathan
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache SparkAmir Sedighi
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingPaco Nathan
 
Composable Parallel Processing in Apache Spark and Weld
Composable Parallel Processing in Apache Spark and WeldComposable Parallel Processing in Apache Spark and Weld
Composable Parallel Processing in Apache Spark and WeldDatabricks
 
How Apache Spark fits in the Big Data landscape
How Apache Spark fits in the Big Data landscapeHow Apache Spark fits in the Big Data landscape
How Apache Spark fits in the Big Data landscapePaco Nathan
 
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...BigDataEverywhere
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in SparkDatabricks
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupNed Shawa
 
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...Chetan Khatri
 
Big data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irBig data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irdatastack
 
Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopApache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopAmanda Casari
 
Big Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computingBig Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computingAnimesh Chaturvedi
 
In Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkIn Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkVenkata Naga Ravi
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study NotesRichard Kuo
 
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...Debraj GuhaThakurta
 

Ähnlich wie A look under the hood at Apache Spark's API and engine evolutions (20)

Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introduction
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
 
How Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapeHow Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscape
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache Spark
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
 
Composable Parallel Processing in Apache Spark and Weld
Composable Parallel Processing in Apache Spark and WeldComposable Parallel Processing in Apache Spark and Weld
Composable Parallel Processing in Apache Spark and Weld
 
How Apache Spark fits in the Big Data landscape
How Apache Spark fits in the Big Data landscapeHow Apache Spark fits in the Big Data landscape
How Apache Spark fits in the Big Data landscape
 
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in Spark
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetup
 
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
 
Big data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irBig data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.ir
 
Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopApache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code Workshop
 
Big Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computingBig Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computing
 
In Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkIn Memory Analytics with Apache Spark
In Memory Analytics with Apache Spark
 
Spark meetup TCHUG
Spark meetup TCHUGSpark meetup TCHUG
Spark meetup TCHUG
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study Notes
 
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
 

Mehr von Databricks

Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionDatabricks
 

Mehr von Databricks (20)

Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack Detection
 

Kürzlich hochgeladen

Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxRemote DBA Services
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamUiPathCommunity
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontologyjohnbeverley2021
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Bhuvaneswari Subramani
 

Kürzlich hochgeladen (20)

Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 

A look under the hood at Apache Spark's API and engine evolutions

  • 1. A look under the hood at Apache Spark's API and engine evolutions Reynold Xin @rxin 2017-02-08, Amsterdam Meetup
  • 2. About Databricks Founded by creators of Spark Cloud data platform - Spark - Interactive analysis - Cluster management - Production pipelines - Data governance & security
  • 3. Databricks Amsterdam R&D Center Started in January Hiring distributed systems & database engineers! Email me: rxin@databricks.com
  • 4. SQL Streaming MLlib Spark Core (RDD) GraphX Spark stack diagram
  • 5. Frontend (user facing APIs) Backend (execution) Spark stack diagram (a different take)
  • 6. Frontend (RDD, DataFrame, ML pipelines, …) Backend (scheduler, shuffle, operators, …) Spark stack diagram (a different take)
  • 7. Today’s Talk Some archaeology - IMS, relational databases - MapReduce - data frames Last 6 years of Spark evolution
  • 9. IBM IMS hierarchical database (1966) Image from https://stratechery.com/2016/oracles-cloudy-future/
  • 10. Hierarchical Database - Improvement over file system: query language & catalog - Lack of flexibility - Difficult to query items in different parts of the hierarchy - Relationships are pre-determined and difficult to change
  • 11.
  • 12. “Future users of large data banks must be protected from having to know how the data is organized in the machine. … most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed.”
  • 13. Era of relational databases (late 60s) Two “new” important ideas Physical Data Independence: The ability to change the physical data layout without having to change the logical schema. Declarative Query Language: Programmer specifies “what” rather than “how”.
  • 14. Why? Business applications outlive the environments they were created in: - New requirements might surface - Underlying hardware might change - Require physical layout changes (indexing, different storage medium, etc) Enabled tremendous amount of innovation: - Indexes, compression, column stores, etc
  • 15. Relational Database Pros vs Cons - Declarative and data independent - SQL is the universal interface everybody knows - Difficult to compose & build complex applications - Too opinionated and inflexible - Require data modeling before putting any data in - SQL is the only programming language
  • 17. Challenges Google faced Data size growing (volume & velocity) - Processing has to scale out over large clusters Complexity of analysis increasing (variety) - Massive ETL (web crawling) - Machine learning, graph processing
  • 18. The Big Data Problem Semi-/Un-structured data doesn’t fit well with databases Single machine can no longer process or even store all the data! Only solution is to distribute general storage & processing over clusters.
  • 19. Google Datacenter How do we program this thing? 19
  • 20. Data-Parallel Models Restrict the programming interface so that the system can do more automatically “Here’s an operation, run it on all of the data” - I don’t care where it runs (you schedule that) - In fact, feel free to run it twice on different nodes - Siimlar to “declarative programming” in databases
  • 21.
  • 22. MapReduce Pros vs Cons - Massively parallel - Very flexible programming model & schema-on-read - Extremely verbose & difficult to learn - Most real applications require multiple MR steps - 21 MR steps -> 21 mapper and reducer classes - Lots of boilerplate code per step - Bad performance
  • 24. Data frames in R / Python > head(filter(df, df$waiting < 50)) # an example in R ## eruptions waiting ##1 1.750 47 ##2 1.750 47 ##3 1.867 48 Developed by stats community & concise syntax for ad-hoc analysis Procedural (not declarative)
  • 25. R data frames Pros and Cons - Easy to learn - Pretty fast on a laptop (or one server) - No parallelism & doesn’t work well on big data - Lack sophisticated query optimization
  • 26. “Are you going to talk about Spark at all tonight!?”
  • 27. Which one is better? Databases, R, MapReduce? Declarative, procedural, data independence?
  • 28. Spark’s initial focus: a better MapReduce Language-integrated API (RDD): similar to Scala’s collection library using functional programming; incredibly powerful and composable lines = spark.textFile(“hdfs://...”) // RDD[String] points = lines.map(line => parsePoint(line)) // RDD[Point] points.filter(p => p.x > 100).count() Better performance: through a more general DAG abstraction, faster scheduling, and in-memory caching
  • 29. Programmability WordCount in 50+ lines of Java MR WordCount in 3 lines of Spark
  • 30. Challenge with Functional API Looks high-level, but hides many semantics of computation • Functions are arbitrary blocks of Java bytecode • Data stored is arbitrary Java objects Users can mix APIs in suboptimal ways
  • 32. Example Problem pairs = data.map(word => (word, 1)) groups = pairs.groupByKey() groups.map((k, vs) => (k, vs.sum)) Physical API: Materializes all groups as Seq[Int] objects Then promptly aggregates them
  • 33. Challenge: Data Representation Java objects often many times larger than underlying fields class User(name: String, friends: Array[Int]) new User(“Bobby”, Array(1, 2)) User 0x… 0x… String 3 0 1 2 Bobby 5 0x… int[] char[] 5
  • 34. Recap: two primary issues 1. Many APIs specifies the “physical” behavior, rather than the “logical” intent, aka not declarative enough. 2. Closures (user-defined functions and types) are opaque to the engine, an as a result little room for improvement.
  • 35. Sort Benchmark Originally sponsored by Jim Gray to measure advancements in software and hardware in 1987 Participants often used purpose-built hardware/software to compete • Large companies: IBM, Microsoft, Yahoo, … • Academia: UC Berkeley, UCSD, MIT, …
  • 36. Sort Benchmark • Past winners: Microsoft, Yahoo, Samsung, UCSD, … 1MB -> 100MB -> 1TB (1998) -> 100TB (2009)
  • 37. Winning Attempt Built on low-level Spark API and: - Put all data in off-heap memory using sun.misc.Unsafe - Use tight low level while loops rather than iterators ~ 3000 lines of low level code on Spark written by Reynold
  • 38. On-Disk Sort Record: Time to sort 100TB 2100 machines2013 Record: Hadoop 2014 Record: Spark Source: Daytona GraySort benchmark, sortbenchmark.org 72 minutes 207 machines 23 minutes Also sorted 1PB in 4 hours
  • 39. How do we enable the average users to win a world record, using a few lines of code?
  • 40. Goals of last two year’s API evolution 1. Simpler APIs bridging the gap between big data engineering and data science. 2. Higher level, declarative APIs that are future proof (engine can optimize programs automatically). Taking the best ideas from databases, big data, and data science
  • 42. DataFrames and Spark SQL Efficient library for structured data (data with a known schema) • Two interfaces: SQL for analysts + apps, DataFrames for programmers Optimized computation and storage, similar to RDBMS SIGMOD 2015
  • 44. DataFrame API DataFrames hold rows with a known schema and offer relational operations on them through a DSL val users = spark.sql(“select * from users”) val massUsers = users(users(“country”) === “NL”) massUsers.count() massUsers.groupBy(“name”).avg(“age”) Expression AST
  • 46. Spark DataFrame Execution DataFrame frontend Logical Plan Physical execution Catalyst optimizer Intermediate representation for computation
  • 47. Spark DataFrame Execution Python DF Logical Plan Physical execution Catalyst optimizer Java/Scala DF R DF Intermediate representation for computation Simple wrappers to create logical plan
  • 48. Structured API Example events = sc.read.json(“/logs”) stats = events.join(users) .groupBy(“loc”,“status”) .avg(“duration”) errors = stats.where( stats.status == “ERR”) DataFrame API Optimized Plan Specialized Code SCAN logs SCAN users JOIN AGG FILTER while(logs.hasNext) { e = logs.next if(e.status == “ERR”) { u = users.get(e.uid) key = (u.loc, e.status) sum(key) += e.duration count(key) += 1 } } ...
  • 49. Benefit of Logical Plan: Simpler Frontend Python : ~2000 line of code (built over a weekend) R : ~1000 line of code i.e. much easier to add new language bindings (Julia, Clojure, …)
  • 50. Performance 0 2 4 6 8 10 Java/Scala Python Runtime for an example aggregation workload RDD
  • 51. Benefit of Logical Plan: Performance Parity Across Languages 0 2 4 6 8 10 Java/Scala Python Java/Scala Python R SQL Runtime for an example aggregation workload (secs) DataFrame RDD
  • 52. What are Spark’s structured APIs? Combination of: - data frame from R as the “interface” – easy to learn - declarativity & data independence from databases -- easy to optimize & future-proof - flexibility & parallelism from MapReduce -- massively scalable & flexible
  • 53. Future possibilities Spark as a fast, multi-core data collection library Spark as a performant streaming engine Spark as a GPU computation framework All using the same API
  • 54. Python Java/Scala RSQL … DataFrame Logical Plan LLVMJVM SIMD GPUs Unified API, One Engine, Automatically Optimized Tungsten backend language frontend …
  • 55. Recap We learn from previous generation systems to understand what works, and what can be improved on, and evolve Spark Latest APIs take the best ideas out of earlier systems - data frame from R as the “interface” – easy to learn - declarativity & data independence from databases -- easy to optimize & future-proof - flexibility & parallelism from MapReduce -- massively scalable & flexible