Building a modern Application
w/ DataFrames
Meetup @ [24]7 in Campbell, CA
Sept 8, 2015
Who am I?
Sameer
Farooqui
• Trainer @ Databricks
• 150+ trainings on Hadoop, C*,
HBase, Couchbase, NoSQL, etc
Google: “spark newcircle foundations” / code: SPARK-MEETUPS-15
Who are you?
1) I have used Spark hands on before…
2) I have used DataFrames before (in any language)…
Agenda
• Be able to smartly use DataFrames tomorrow!
• Spark Overview
• DataFrames (10 mins)
• Catalyst Internals
• Demo! (Intro + Advanced)
The Databricks team contributed more than 75% of the code added to Spark in the
past year
6
Diagram: the Spark stack. Spark Core (RDD API) with Spark SQL (DataFrames API), Spark Streaming, MLlib, and GraphX on top, and pluggable Data Sources ({JSON}, …) underneath.
7
Goal: unified engine across data sources,
workloads and environments
Spark – 100% open source and mature
Used in production by over 500 organizations, from Fortune 100 companies to small innovators
Chart: Contributors per Month to Spark, 2011–2015.
Most active project in big data
9
2014: an Amazing Year for Spark
Total contributors: 150 => 500
Lines of code: 190K => 370K
500+ active production deployments
10
Large-Scale Usage
Largest cluster: 8000 nodes
Largest single job: 1 petabyte
Top streaming intake: 1 TB/hour
2014 on-disk 100 TB sort record
12
On-Disk Sort Record:
Time to sort 100TB
Source: Daytona GraySort benchmark, sortbenchmark.org
2013 Record: Hadoop – 2100 machines, 72 minutes
2014 Record: Spark – 207 machines, 23 minutes
Spark Physical Cluster
Diagram: the Spark Driver (a JVM) coordinates four Executor JVMs, each running multiple Tasks.
Spark Data Model
logLinesRDD: an RDD with 4 partitions, holding log records such as (Error, ts, msg1), (Warn, ts, msg2), (Info, ts, msg8), (Error, ts, msg3), (Info, ts, msg5), (Error, ts, msg4), (Warn, ts, msg9), …
Spark Data Model
Diagram: an RDD's items (item-1 … item-10) are spread across partitions hosted on several Executors; more partitions = more parallelism.
16
DataFrame APIs
Spark Data Model
logLinesDF: a DataFrame with 4 partitions and columns Type (Str), Time (Int), Msg (Str):
Partition 1: (Error, ts, msg1), (Warn, ts, msg2), (Error, ts, msg1)
Partition 2: (Info, ts, msg7), (Warn, ts, msg2), (Error, ts, msg9)
Partition 3: (Warn, ts, msg0), (Warn, ts, msg2), (Info, ts, msg11)
Partition 4: (Error, ts, msg1), (Error, ts, msg3), (Error, ts, msg1)
df.rdd.partitions.size = 4
Spark Data Model
Diagram: a DataFrame's partitions (each holding Type/Time/Msg rows) are spread across several Executors; more partitions = more parallelism.
19
DataFrame Benefits
• Easier to program
• Significantly fewer Lines of Code
• Improved performance
• via intelligent optimizations and code-generation
Write Less Code: Compute an Average
private IntWritable one = new IntWritable(1);
private IntWritable output = new IntWritable();

protected void map(
    LongWritable key,
    Text value,
    Context context) {
  String[] fields = value.toString().split("\t");
  output.set(Integer.parseInt(fields[1]));
  context.write(one, output);
}

IntWritable one = new IntWritable(1);
DoubleWritable average = new DoubleWritable();

protected void reduce(
    IntWritable key,
    Iterable<IntWritable> values,
    Context context) {
  int sum = 0;
  int count = 0;
  for (IntWritable value : values) {
    sum += value.get();
    count++;
  }
  average.set(sum / (double) count);
  context.write(key, average);
}

data = sc.textFile(...).split("\t")
data.map(lambda x: (x[0], [int(x[1]), 1])) \
  .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
  .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
  .collect()
20
Write Less Code: Compute an Average
Using RDDs
data = sc.textFile(...).split("\t")
data.map(lambda x: (x[0], [int(x[1]), 1])) \
  .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
  .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
  .collect()
Using DataFrames
sqlCtx.table("people") \
  .groupBy("name") \
  .agg("name", avg("age")) \
  .collect()
Full API Docs
• Python
• Scala
• Java
• R
21
22
DataFrames are evaluated lazily
- -
-E T
ME T
M
- -
-E T
ME T
M
- -
-E T
ME T
M
DF-
1
- -
E T
E T
- -
E T
E T
- -
E T
E T
DF-
2
- -
E T
E T
- -
E T
E T
- -
E T
E T
DF-
3
Distributed
Storage
or
23
DataFrames are evaluated lazily
Distributed Storage
or
Catalyst +
Execute DAG!
24
DataFrames are evaluated lazily
- -
-E T
ME T
M
- -
-E T
ME T
M
- -
-E T
ME T
M
DF-
1
- -
E T
E T
- -
E T
E T
- -
E T
E T
DF-
2
- -
E T
E T
- -
E T
E T
- -
E T
E T
DF-
3
Distributed
Storage
or
Transformations, Actions, Laziness
Transformation examples: filter, select, drop, intersect, join
Action examples: count, collect, show, head, take
25
DataFrames are lazy. Transformations contribute to the query plan, but they don't execute anything.
Actions cause the execution of the query.
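A minimal PySpark sketch of this laziness (assuming a SQLContext named sqlContext, a hypothetical log-file path, and the Type column from the logLinesDF example above):

logs = sqlContext.read.json("/path/to/logs.json")    # transformation: nothing runs yet
errors = logs.filter(logs["Type"] == "Error")        # still nothing: only the query plan grows
errors.count()                                       # action: Catalyst optimizes and executes the DAG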
3 fundamental transformations on DataFrames:
- mapPartitions()
- new ShuffledRDD
- zipPartitions()
Graduated from Alpha in 1.3
Spark SQL – part of the core distribution since Spark 1.0 (April 2014)
27
Charts: # Of Commits Per Month and # of Contributors to Spark SQL, both growing steadily.
28
Which context?
SQLContext
• Basic functionality
HiveContext
• More advanced
• Superset of SQLContext
• More complete HiveQL parser
• Can read from Hive metastore
+ tables
• Access to Hive UDFs
Improved multi-version support in 1.4
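For example (a sketch, assuming an existing SparkContext named sc; in Spark 1.x both contexts live in pyspark.sql):

from pyspark.sql import SQLContext, HiveContext

sqlContext = SQLContext(sc)    # basic functionality
sqlContext = HiveContext(sc)   # superset: full HiveQL parser, Hive metastore tables, Hive UDFs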
Construct a DataFrame
29
# Construct a DataFrame from a "users" table in Hive.
df = sqlContext.read.table("users")
# Construct a DataFrame from a log file in S3.
df = sqlContext.read.json("s3n://someBucket/path/to/data.json")
val people = sqlContext.read.parquet("...")
DataFrame people = sqlContext.read().parquet("...")
Use DataFrames
30
# Create a new DataFrame that contains only "young" users
young = users.filter(users["age"] < 21)
# Alternatively, using a Pandas-like syntax
young = users[users.age < 21]
# Increment everybody's age by 1
young.select(young["name"], young["age"] + 1)
# Count the number of young users by gender
young.groupBy("gender").count()
# Join young users with another DataFrame, logs
young.join(logs, logs["userId"] == young["userId"], "left_outer")
DataFrames and Spark SQL
31
young.registerTempTable("young")
sqlContext.sql("SELECT count(*) FROM young")
Actions on a DataFrame
Functions on a DataFrame
Functions on a DataFrame
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame
Queries on a DataFrame
Operations on a DataFrame
Creating DataFrames
Creating DataFrames
Diagram: an RDD of (E, T, M) records, spread across Executors, is converted into a DataFrame holding the same data.
Data Sources
39
Data Sources API
• Provides a pluggable mechanism for accessing structured data
through Spark SQL
• Tight optimizer integration means filtering and column pruning
can often be pushed all the way down to data sources
• Supports mounting external sources as temp tables
• Introduced in Spark 1.2 via SPARK-3247
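A small sketch of mounting an external source as a temp table through the Data Sources API (the path and table name are hypothetical):

df = sqlContext.read.format("json").load("/data/events.json")
df.registerTempTable("events")                  # mount the source for SQL queries
sqlContext.sql("SELECT count(*) FROM events")   # filters and column pruning can be pushed to the source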
40
Write Less Code: Input & Output
Spark SQL’s Data Source API can read and write DataFrames using a variety of formats.
40
Built-in sources ({ JSON }, JDBC, and more) plus external packages.
Find more sources at http://spark-packages.org
41
Spark Packages
Supported Data Sources:
• Avro
• Redshift
• CSV
• MongoDB
• Cassandra
• Cloudant
• Couchbase
• ElasticSearch
• Mainframes (IBM z/OS)
• Many More!
42
DataFrames: Reading from JDBC
1.3
• Supports any JDBC compatible RDBMS: MySQL, PostGres, H2, etc
• Unlike the pure RDD implementation (JdbcRDD), this supports
predicate pushdown and auto-converts the data into a DataFrame
• Since you get a DataFrame back, it’s usable in Java/Python/R/Scala.
• JDBC server allows multiple users to share one Spark cluster
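A hedged sketch of a JDBC read (the connection URL, table, and column names are made up, and the JDBC driver must be on the classpath):

df = (sqlContext.read
      .format("jdbc")
      .option("url", "jdbc:mysql://dbhost:3306/mydb")
      .option("dbtable", "people")
      .load())
adults = df.filter(df["age"] > 21)   # this predicate can be pushed down into the database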
Read Less Data
The fastest way to process big data is to never read it.
Spark SQL can help you read less data automatically:
• Converting to more efficient formats
• Using columnar formats (i.e. parquet)
• Using partitioning (i.e., /year=2014/month=02/…)¹ – see the sketch below
• Skipping data using statistics (i.e., min, max)²
• Pushing predicates into storage systems (i.e., JDBC)
¹ Only supported for Parquet and Hive, more support coming in Spark 1.4
² Turned off by default in Spark 1.3
43
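A partitioning sketch (paths and column names are illustrative): writing with partitionBy lays the data out as /year=…/month=… folders, so a later filter on those columns only scans the matching folders.

df.write.partitionBy("year", "month").parquet("/data/events")
events = sqlContext.read.parquet("/data/events")
feb2014 = events.filter("year = 2014 AND month = 2")   # only /year=2014/month=2/ is scanned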
Apache Parquet
Fall 2012: started by Twitter & Cloudera
July 2013: 1.0 release
May 2014: enters the Apache Incubator, 40+ contributors
• Limits I/O: Scans/Reads only the columns that are needed
• Saves Space: Columnar layout compresses better
Diagram: the same logical table representation stored in a row layout vs. a column layout (source: parquet.apache.org).
Reading:
• Readers first read the file metadata to find all the column chunks they are interested in.
• The column chunks are then read sequentially.
Writing:
• Metadata is written after the data to allow for single-pass writing.
Parquet Features
1. Metadata merging
• Allows developers to easily add/remove columns in data files
• Spark will scan all metadata for files and merge the schemas
2. Auto-discovery of data that has been partitioned into folders
• Spark then prunes which folders are scanned based on predicates
So you can greatly speed up queries simply by breaking up data into folders (see the sketch below).
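A sketch of both features together (paths are hypothetical; in Spark 1.5 schema merging is requested with the mergeSchema option, while partition folders are discovered automatically):

df1.write.parquet("/data/table/key=1")    # older files, fewer columns
df2.write.parquet("/data/table/key=2")    # newer files add a column
merged = sqlContext.read.option("mergeSchema", "true").parquet("/data/table")
merged.printSchema()                      # union of both schemas, plus the partition column "key"
merged.filter("key = 2")                  # only the /key=2/ folder is scanned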
Write Less Code: Input & Output
Unified interface to reading/writing data in a variety of
formats:
df = sqlContext.read \
  .format("json") \
  .option("samplingRatio", "0.1") \
  .load("/home/michael/data.json")

df.write \
  .format("parquet") \
  .mode("append") \
  .partitionBy("year") \
  .saveAsTable("fasterData")
47
Write Less Code: Input & Output
Unified interface to reading/writing data in a variety of
formats:
df = sqlContext.read \
  .format("json") \
  .option("samplingRatio", "0.1") \
  .load("/home/michael/data.json")

df.write \
  .format("parquet") \
  .mode("append") \
  .partitionBy("year") \
  .saveAsTable("fasterData")
read and write
functions create new
builders for doing I/O
48
Write Less Code: Input & Output
Unified interface to reading/writing data in a variety of
formats:
Builder methods
specify:
• Format
• Partitioning
• Handling of
existing data
df = sqlContext.read \
  .format("json") \
  .option("samplingRatio", "0.1") \
  .load("/home/michael/data.json")

df.write \
  .format("parquet") \
  .mode("append") \
  .partitionBy("year") \
  .saveAsTable("fasterData")
49
Write Less Code: Input & Output
Unified interface to reading/writing data in a variety of
formats:
load(…), save(…)
or saveAsTable(…)
finish the I/O
specification
df = sqlContext.read \
  .format("json") \
  .option("samplingRatio", "0.1") \
  .load("/home/michael/data.json")

df.write \
  .format("parquet") \
  .mode("append") \
  .partitionBy("year") \
  .saveAsTable("fasterData")
50
51
How are statistics used to improve DataFrames performance?
• Statistics are logged when caching
• During reads, these statistics can be used to skip some
cached partitions
• InMemoryColumnarTableScan can now skip partitions that cannot
possibly contain any matching rows
Example: a cached DataFrame with 3 partitions and an integer column a, with per-partition statistics max(a) = 9, 7, and 8. For the predicate a = 8, the partition whose max(a) = 7 cannot contain matching rows and is skipped.
Reference:
• https://github.com/apache/spark/pull/1883
• https://github.com/apache/spark/pull/2188
Filters Supported:
• =, <, <=, >, >=
DataFrame # of Partitions after Shuffle
Diagram: a shuffle (e.g., behind a groupBy or join) repartitions DF-1 into DF-2; the number of post-shuffle partitions is configurable.
sqlContext.setConf(key, value)
spark.sql.shuffle.partitions defaults to 200
Spark 1.6: Adaptive Shuffle
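For example (a sketch; the groupBy column comes from the logLinesDF example earlier):

sqlContext.setConf("spark.sql.shuffle.partitions", "8")
counts = logLinesDF.groupBy("Type").count()   # the shuffle behind this groupBy produces 8 partitions
counts.rdd.getNumPartitions()                 # 8 instead of the default 200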
Caching a DataFrame
Diagram: a DataFrame (DF-1) whose partitions are spread across Executors is cached in memory on those Executors.
Spark SQL will re-encode the data into byte buffers before caching, so that there is less pressure on the GC.
.cache()
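A minimal caching sketch (DataFrame name reused from the earlier example):

logLinesDF.cache()        # mark for in-memory columnar caching (lazy)
logLinesDF.count()        # first action materializes the column buffers
logLinesDF.count()        # now served from memory, with partition skipping via the logged statistics
logLinesDF.unpersist()    # release the cached buffers when done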
Demo!
Schema Inference
What if your data file doesn’t have a schema? (e.g., You’re reading a
CSV file or a plain text file.)
You can create an RDD of a particular type and let Spark infer the
schema from that type. We’ll see how to do that in a moment.
You can use the API to specify the schema programmatically.
(It’s better to use a schema-oriented input source if you can, though.)
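A sketch of the programmatic route, for the same people.csv used below (the RDD-of-case-class route follows on the next slides):

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("first_name", StringType(), True),
    StructField("last_name",  StringType(), True),
    StructField("gender",     StringType(), True),
    StructField("age",        IntegerType(), True)])

rows = (sc.textFile("people.csv")
          .map(lambda line: line.split(","))
          .map(lambda c: (c[0], c[1], c[2], int(c[3]))))
df = sqlContext.createDataFrame(rows, schema)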
Schema Inference Example
Suppose you have a (text) file that looks like
this:
56
The file has no schema,
but it’s obvious there is
one:
First name: string
Last name: string
Gender: string
Age: integer
Let’s see how to get Spark to infer the schema.
Erin,Shannon,F,42
Norman,Lockwood,M,81
Miguel,Ruiz,M,64
Rosalita,Ramirez,F,14
Ally,Garcia,F,39
Claire,McBride,F,23
Abigail,Cottrell,F,75
José,Rivera,M,59
Ravi,Dasgupta,M,25
…
Schema Inference :: Scala
57
import sqlContext.implicits._
case class Person(firstName: String,
lastName: String,
gender: String,
age: Int)
val rdd = sc.textFile("people.csv")
val peopleRDD = rdd.map { line =>
val cols = line.split(",")
Person(cols(0), cols(1), cols(2), cols(3).toInt)
}
val df = peopleRDD.toDF
// df: DataFrame = [firstName: string, lastName: string, gender: string, age: int]
A brief look at spark-csv
Let’s assume our data file has a header:
58
first_name,last_name,gender,age
Erin,Shannon,F,42
Norman,Lockwood,M,81
Miguel,Ruiz,M,64
Rosalita,Ramirez,F,14
Ally,Garcia,F,39
Claire,McBride,F,23
Abigail,Cottrell,F,75
José,Rivera,M,59
Ravi,Dasgupta,M,25
…
A brief look at spark-csv
With spark-csv, we can simply create a DataFrame
directly from our CSV file.
59
// Scala
val df = sqlContext.read.format("com.databricks.spark.csv").
option("header", "true").
load("people.csv")
# Python
df = sqlContext.read.format("com.databricks.spark.csv") \
  .load("people.csv", header="true")
60
DataFrames: Under the hood
Pipeline: a SQL AST or a DataFrame becomes an Unresolved Logical Plan; Analysis (using the Catalog) produces a Logical Plan; Logical Optimization produces an Optimized Logical Plan; Physical Planning generates candidate Physical Plans; a Cost Model selects one; Code Generation turns the Selected Physical Plan into RDDs.
DataFrames and SQL share the same optimization/execution pipeline
61
DataFrames: Under the hood
DataFrame Operations → Catalyst Optimizations (Logical Optimizations, then Create Physical Plan & generate JVM bytecode) → Selected Physical Plan
• Push filter predicates down to data source,
so irrelevant data can be skipped
• Parquet: skip entire blocks, turn
comparisons on strings into cheaper
integer comparisons via dictionary
encoding
• RDBMS: reduce amount of data traffic by
pushing predicates down
• Catalyst compiles operations into physical
plans for execution and generates JVM
bytecode
• Intelligently choose between broadcast
joins and shuffle joins to reduce network
traffic
• Lower level optimizations: eliminate
expensive object allocations and reduce
virtual function calls
Not Just Less Code: Faster Implementations
63
Chart: Time to Aggregate 10 million int pairs (secs), comparing RDD Scala, RDD Python, DataFrame Scala, DataFrame Python, DataFrame R, and DataFrame SQL; DataFrame performance is similar across languages and beats the RDD implementations.
https://gist.github.com/rxin/c1592c133e4bccf515dd
Catalyst Goals
64
1) Make it easy to add new optimization techniques and features
to Spark SQL
2) Enable developers to extend the optimizer
• For example, to add data source specific rules that can push filtering
or aggregation into external storage systems
• Or to support new data types
Catalyst: Trees
65
• Tree: Main data type in Catalyst
• Tree is made of node objects
• Each node has type and 0 or
more children
• New node types are defined as
subclasses of TreeNode class
• Nodes are immutable and are
manipulated via functional
transformations
Imagine we have the following 3 node classes for a very simple expression language:
• Literal(value: Int): a constant value
• Attribute(name: String): an attribute from an input row, e.g., “x”
• Add(left: TreeNode, right: TreeNode): the sum of two expressions
Build a tree for the expression: x + (1+2)
In Scala code: Add(Attribute("x"), Add(Literal(1), Literal(2)))
Catalyst: Rules
66
• Rules: Trees are manipulated
using rules
• A rule is a function from a tree to
another tree
• Commonly, Catalyst will use a set
of pattern matching functions to
find and replace subtrees
• Trees offer a transform method
that applies a pattern matching
function recursively on all nodes
of the tree, transforming the ones
that match each pattern to a
result
Let's implement a rule that folds Add operations between constants:
tree.transform {
  case Add(Literal(c1), Literal(c2)) =>
    Literal(c1+c2)
}
Apply this to the tree: x + (1+2)
Yields: x + 3
• The rule may only match a subset of all possible input trees
• Catalyst tests which parts of a tree a given rule may apply to,
and skips over or descends into subtrees that do not match
• Rules don’t need to be modified as new types of operators are
added
Catalyst: Rules
67
Rules can match multiple patterns in the same transform call:
tree.transform {
  case Add(Literal(c1), Literal(c2)) =>
    Literal(c1+c2)
  case Add(left, Literal(0)) => left
  case Add(Literal(0), right) => right
}
Apply this to the tree: x + (1+2)
Still yields: x + 3
Apply this to the tree: (x+0) + (3+3)
Now yields: x + 6
Catalyst: Rules
68
• Rules may need to execute multiple times to fully transform a
tree
• Rules are grouped into batches
• Each batch is executed to a fixed point (until tree stops
changing)
Example:
• Constant fold larger trees
Example:
• First batch analyzes an expression to assign
types to all attributes
• Second batch uses the new types to do
constant folding
• Rule conditions and their bodies contain arbitrary Scala code
• Takeaway: Functional transformations on immutable trees (easy to reason &
debug)
• Coming soon: Enable parallelization in the optimizer
69
Using Catalyst in Spark SQL
Pipeline (recap): SQL AST / DataFrame → Unresolved Logical Plan → Analysis (with Catalog) → Logical Plan → Logical Optimization → Optimized Logical Plan → Physical Planning + Cost Model → Physical Plans → Selected Physical Plan → Code Generation → RDDs
Analysis: analyzing a logical plan to resolve references
Logical Optimization: logical plan optimization
Physical Planning: Physical planning
Code Generation: Compile parts of the query to Java
bytecode
Catalyst: Analysis
Unresolved Logical Plan → Analysis (using the Catalog) → Logical Plan
• A relation may contain unresolved attribute references or relations
• Example: “SELECT col FROM sales”
• The type of col is unknown
• Even whether col is a valid column name is unknown (until we look up the table)
Catalyst: Analysis
Unresolved Logical Plan → Analysis (using the Catalog) → Logical Plan
• Attribute is unresolved if:
• Catalyst doesn’t know its type
• Catalyst has not matched it to an input table
• Catalyst will use rules and a Catalog object (which tracks all the
tables in all data sources) to resolve these attributes
Step 1: Build “unresolved logical plan”
Step 2: Apply rules
Analysis Rules
• Look up relations by name in Catalog
• Map named attributes (like col) to the
input
• Determine which attributes refer to the
same value to give them a unique ID (for
later optimizations)
• Propagate and coerce types through
expressions
• We can’t know return type of 1 + col until we
have resolved col
Catalyst: Analyzer.scala
https://github.com/apache/spark/blob/fedbfc7074dd6d38dc5301d66d1ca097bc2a21e0/sql/catalyst/
src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
< 500 lines of code
Catalyst: Logical Optimizations
73
Logical Plan → Logical Optimization → Optimized Logical Plan
• Applies rule-based optimizations to the logical plan:
• Constant folding
• Predicate pushdown
• Projection pruning
• Null propagation
• Boolean expression simplification
• [Others]
• Example: a 12-line rule optimizes LIKE
expressions with simple regular expressions
into String.startsWith or String.contains calls.
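From the API side the rewrite is invisible; a query like the following (with a hypothetical column name) simply runs faster:

df.filter(df["name"].like("abc%"))    # optimized into a startsWith comparison
df.filter(df["name"].like("%abc%"))   # optimized into a contains comparison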
Catalyst: Optimizer.scala
https://github.com/apache/spark/blob/fedbfc7074dd6d38dc5301d66d1ca097bc2a21e0/sql/catalyst/
src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
< 700 lines of code
Catalyst: Physical Planning
75
• Spark SQL takes a logical plan and generates one or more physical plans using physical operators that match the Spark execution engine:
1. mapPartitions()
2. new ShuffledRDD
3. zipPartitions()
• Currently cost-based optimization is only used to select a join
algorithm
• Broadcast join
• Traditional join
• Physical planner also performs rule-based physical
optimizations like pipelining projections or filters into one Spark
map operation
• It can also push operations from logical plan into data sources
(predicate pushdown)
Optimized Logical Plan → Physical Planning → Physical Plans
Catalyst: SparkStrategies.scala
https://github.com/apache/spark/blob/fedbfc7074dd6d38dc5301d66d1ca097bc2a21e0/sql/core/src
/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala
< 400 lines of code
Catalyst: Code Generation
77
• Generates Java bytecode to run on each machine
• Catalyst relies on janino to make code generation simple
• (FYI – it used to use quasiquotes, but now uses janino)
Selected Physical Plan → Code Generation → RDDs
This code gen function converts an expression like (x+y) + 1 to a Scala AST.
Catalyst: CodeGenerator.scala
https://github.com/apache/spark/blob/fedbfc7074dd6d38dc5301d66d1ca097bc2a21e0/sql/catalyst/
src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
< 700 lines of code
Seamlessly Integrated
Intermix DataFrame operations with
custom Python, Java, R, or Scala code
zipToCity = udf(lambda zipCode: <custom logic here>)
def add_demographics(events):
    u = sqlCtx.table("users")
    events \
      .join(u, events.user_id == u.user_id) \
      .withColumn("city", zipToCity(df.zip))

Augments any DataFrame that contains user_id
79
Optimize Entire Pipelines
Optimization happens as late as possible, therefore
Spark SQL can optimize even across functions.
80
events = add_demographics(sqlCtx.load("/data/events", "json"))
training_data = events \
  .where(events.city == "San Francisco") \
  .select(events.timestamp) \
  .collect()
81
def add_demographics(events):
    u = sqlCtx.table("users")                    # Load Hive table
    events \
      .join(u, events.user_id == u.user_id) \    # Join on user_id
      .withColumn("city", zipToCity(df.zip))     # Run udf to add city column
events = add_demographics(sqlCtx.load("/data/events", "json"))
training_data = events.where(events.city == "San Francisco").select(events.timestamp).collect()
Logical Plan: filter applied after join(events file, users table) – expensive; we only want to join the relevant users.
Physical Plan: join(scan(events), filter(scan(users))) – the filter is pushed below the join.
81
82
def add_demographics(events):
    u = sqlCtx.table("users")                    # Load partitioned Hive table
    events \
      .join(u, events.user_id == u.user_id) \    # Join on user_id
      .withColumn("city", zipToCity(df.zip))     # Run udf to add city column
Optimized Physical Plan with Predicate Pushdown and Column Pruning: join(optimized scan(events), optimized scan(users))
events = add_demographics(sqlCtx.load("/data/events", "parquet"))
training_data = events.where(events.city == "San Francisco").select(events.timestamp).collect()
Logical Plan: filter after join(events file, users table)
Physical Plan: join(scan(events), filter(scan(users)))
82
Spark 1.5 – Speed / Robustness
Project Tungsten
– Tightly packed binary
structures
– Fully-accounted
memory with automatic
spilling
– Reduced serialization
costs
83
Chart: Average GC time per node (seconds) vs. data set size (relative, 1x–16x), comparing Default, Code Gen, Tungsten on-heap, and Tungsten off-heap.
Spark 1.5 – Improved Function Library
100+ native functions with optimized codegen implementations
– String manipulation – concat, format_string, lower, lpad
– Date/Time – current_timestamp, date_format, date_add
– Math – sqrt, randn
– Other – monotonicallyIncreasingId, sparkPartitionId
84
from pyspark.sql.functions import *
yesterday = date_sub(current_date(), 1)
df2 = df.filter(df.created_at > yesterday)
import org.apache.spark.sql.functions._
val yesterday = date_sub(current_date(), 1)
val df2 = df.filter(df("created_at") > yesterday)
Window Functions
Before Spark 1.4:
- 2 kinds of functions in Spark that could return a single
value:
• Built-in functions or UDFs (round)
• take values from a single row as input, and they
generate a single return value for every input row
• Aggregate functions (sum or max)
• operate on a group of rows and calculate a single
return value for every group
New with Spark 1.4:
• Window Functions (moving avg, cumulative sum)
• operate on a group of rows while still returning a single
value for every input row.
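A sketch of a moving average as a window function in PySpark 1.4+ (column names from the people example; in these releases window functions need a HiveContext):

from pyspark.sql.window import Window
from pyspark.sql.functions import avg

w = Window.partitionBy("gender").orderBy("age").rowsBetween(-1, 1)
df.select("first_name", "age", avg("age").over(w).alias("moving_avg"))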
Streaming DataFrames
Umbrella ticket to track what's needed to
make streaming DataFrame a reality:
https://issues.apache.org/jira/browse/SPARK-8360

Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 

Kürzlich hochgeladen

Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1ranjankumarbehera14
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制vexqp
 
Kings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about themKings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about themeitharjee
 
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...HyderabadDolls
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...nirzagarg
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareGraham Ware
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Klinik kandungan
 
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...HyderabadDolls
 
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...HyderabadDolls
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxchadhar227
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样wsppdmt
 
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...HyderabadDolls
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...Health
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...HyderabadDolls
 
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...gajnagarg
 
20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdf20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdfkhraisr
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteedamy56318795
 
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...HyderabadDolls
 

Kürzlich hochgeladen (20)

Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
 
Kings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about themKings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about them
 
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham Ware
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
 
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
 
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
 
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
 
20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdf20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdf
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
 

Building a modern Application with DataFrames

  • 1. Building a modern Application w/ DataFrames Meetup @ [24]7 in Campbell, CA Sept 8, 2015
  • 2. Who am I? Sameer Farooqui • Trainer @ Databricks • 150+ trainings on Hadoop, C*, HBase, Couchbase, NoSQL, etc Google: “spark newcircle foundations” / code: SPARK- MEETUPS-15
  • 3. Who are you? 1) I have used Spark hands on before… 2) I have used DataFrames before (in any language)…
  • 4. Agenda • Be able to smartly use DataFrames tomorrow! + Intro + Advanced Demo!• Spark Overview • Catalyst Internals • DataFrames (10 mins)
  • 5. The Databricks team contributed more than 75% of the code added to Spark in the past year
  • 7. 7 Goal: unified engine across data sources, workloads and environments
  • 8. Spark – 100% open source and mature Used in production by over 500 organizations. From fortune 100 to small innovators
  • 9. 0 20 40 60 80 100 120 140 2011 2012 2013 2014 2015 Contributors per Month to Spark Most active project in big data 9
  • 10. 2014: an Amazing Year for Spark Total contributors: 150 => 500 Lines of code: 190K => 370K 500+ active production deployments 10
  • 11. Large-Scale Usage Largest cluster: 8000 nodes Largest single job: 1 petabyte Top streaming intake: 1 TB/hour 2014 on-disk 100 TB sort record
  • 12. 12 On-Disk Sort Record: Time to sort 100TB Source: Daytona GraySort benchmark, sortbenchmark.org 2100 machines2013 Record: Hadoop 72 minutes 2014 Record: Spark 207 machines 23 minutes
  • 13. Spark Driver Executor Task Task Executor Task Task Executor Task Task Executor Task Task Spark Physical Cluster JVM JVM JVM JVM JVM
  • 14. Spark Data Model Error, ts, msg1 Warn, ts, msg2 Error, ts, msg1 RDD with 4 partitions Info, ts, msg8 Warn, ts, msg2 Info, ts, msg8 Error, ts, msg3 Info, ts, msg5 Info, ts, msg5 Error, ts, msg4 Warn, ts, msg9 Error, ts, msg1 logLinesRDD
  • 17. Spark Data Model DataFrame with 4 partitions logLinesDF Type Time Msg (Str ) (Int ) (Str ) Error ts msg1 Warn ts msg2 Error ts msg1 Type Time Msg (Str ) (Int ) (Str ) Info ts msg7 Warn ts msg2 Error ts msg9 Type Time Msg (Str ) (Int ) (Str ) Warn ts msg0 Warn ts msg2 Info ts msg11 Type Time Msg (Str ) (Int ) (Str ) Error ts msg1 Error ts msg3 Error ts msg1 df.rdd.partitions.size = 4
  • 18. Spark Data Model - - - Ex DF DF Ex DF DF Ex DF more partitions = more parallelism E T ME T M - - -E T ME T M - - -E T ME T M - - -E T ME T M - - -E T ME T M DataFrame
  • 19. 19 DataFrame Benefits • Easier to program • Significantly fewer Lines of Code • Improved performance • via intelligent optimizations and code-generation
  • 20. Write Less Code: Compute an Average private IntWritable one = new IntWritable(1) private IntWritable output = new IntWritable() proctected void map( LongWritable key, Text value, Context context) { String[] fields = value.split("t") output.set(Integer.parseInt(fields[1])) context.write(one, output) } IntWritable one = new IntWritable(1) DoubleWritable average = new DoubleWritable() protected void reduce( IntWritable key, Iterable<IntWritable> values, Context context) { int sum = 0 int count = 0 for(IntWritable value : values) { sum += value.get() count++ } average.set(sum / (double) count) context.Write(key, average) } data = sc.textFile(...).split("t") data.map(lambda x: (x[0], [x.[1], 1])) .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) .map(lambda x: [x[0], x[1][0] / x[1][1]]) .collect() 20
  • 21. Write Less Code: Compute an Average Using RDDs data = sc.textFile(...).map(lambda line: line.split("\t")) data.map(lambda x: (x[0], [int(x[1]), 1])) .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) .map(lambda x: [x[0], x[1][0] / x[1][1]]) .collect() Using DataFrames sqlCtx.table("people") .groupBy("name") .agg("name", avg("age")) .collect() Full API Docs • Python • Scala • Java • R 21
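A minimal runnable sketch of the two versions above, assuming an existing SparkContext sc, a SQLContext sqlContext with a registered "people" table, and a hypothetical tab-separated people.tsv of (name, age) lines:

  from pyspark.sql.functions import avg

  # RDD version: average age per name from a tab-separated text file
  rows = sc.textFile("people.tsv").map(lambda line: line.split("\t"))
  averages = (rows.map(lambda x: (x[0], (int(x[1]), 1)))
                  .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
                  .map(lambda kv: (kv[0], kv[1][0] / float(kv[1][1])))
                  .collect())

  # DataFrame version: the same aggregation, planned and optimized by Catalyst
  df = sqlContext.table("people")
  df.groupBy("name").agg(avg("age")).collect()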
  • 22. DataFrames are evaluated lazily. (Diagram: a lineage of DataFrames DF-1 → DF-2 → DF-3 built on top of distributed storage.)
  • 23. DataFrames are evaluated lazily. (Diagram: an action hands the lineage to Catalyst, which plans and executes the DAG against distributed storage.)
  • 24. DataFrames are evaluated lazily. (Diagram: the DF-1 → DF-2 → DF-3 lineage, repeated.)
  • 25. Transformations, Actions, Laziness. Action examples: count, collect, show, head, take. Transformation examples: filter, select, drop, intersect, join. DataFrames are lazy. Transformations contribute to the query plan, but they don't execute anything. Actions cause the execution of the query.
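A short illustration of that split, assuming an existing DataFrame df with hypothetical type, time and msg columns:

  # Transformations only build up the query plan; nothing runs yet
  errors = df.filter(df["type"] == "Error").select("time", "msg")

  # Actions trigger Catalyst planning and execution
  print(errors.count())   # runs the query
  errors.show(5)          # runs it again (unless the DataFrame is cached)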
  • 26. 3 fundamental transformations on DataFrames: - mapPartitions() - new ShuffledRDD - zipPartitions()
  • 27. Spark SQL – Part of the core distribution since Spark 1.0 (April 2014); graduated from Alpha in 1.3. (Charts: # Of Commits Per Month, # of Contributors.)
  • 28. 28 Which context? SQLContext • Basic functionality HiveContext • More advanced • Superset of SQLContext • More complete HiveQL parser • Can read from Hive metastore + tables • Access to Hive UDFs Improved multi-version support in 1.4
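A minimal sketch of constructing the two contexts in Spark 1.x (in shells and notebooks sc and sqlContext are usually pre-created; shown here only to make the example self-contained):

  from pyspark import SparkContext
  from pyspark.sql import SQLContext, HiveContext

  sc = SparkContext(appName="dataframes-demo")
  sqlContext = SQLContext(sc)     # basic functionality
  hiveContext = HiveContext(sc)   # superset: HiveQL parser, Hive metastore/tables, Hive UDFs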
  • 29. Construct a DataFrame 29 # Construct a DataFrame from a "users" table in Hive. df = sqlContext.read.table("users") # Construct a DataFrame from a log file in S3. df = sqlContext.read.json("s3n://someBucket/path/to/data.json") val people = sqlContext.read.parquet("...") DataFrame people = sqlContext.read().parquet("...")
  • 30. Use DataFrames 30 # Create a new DataFrame that contains only "young" users young = users.filter(users["age"] < 21) # Alternatively, using a Pandas-like syntax young = users[users.age < 21] # Increment everybody's age by 1 young.select(young["name"], young["age"] + 1) # Count the number of young users by gender young.groupBy("gender").count() # Join young users with another DataFrame, logs young.join(logs, logs["userId"] == users["userId"], "left_outer")
  • 31. DataFrames and Spark SQL 31 young.registerTempTable("young") sqlContext.sql("SELECT count(*) FROM young")
  • 32. Actions on a DataFrame
  • 33. Functions on a DataFrame
  • 34. Functions on a DataFrame
  • 36.
  • 37. Operations on a DataFrame
  • 38. Creating DataFrames. (Diagram: a DataFrame can be created from an existing RDD of (E, T, M) records or loaded from external Data Sources.)
  • 39. 39 Data Sources API • Provides a pluggable mechanism for accessing structured data through Spark SQL • Tight optimizer integration means filtering and column pruning can often be pushed all the way down to data sources • Supports mounting external sources as temp tables • Introduced in Spark 1.2 via SPARK-3247
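A sketch of the pluggable-source and temp-table ideas above (hypothetical path, assuming an existing sqlContext):

  # Read through the Data Sources API; format() selects the pluggable source
  events = sqlContext.read.format("json").load("/data/events.json")

  # Mount it as a temporary table so it can be queried with SQL
  events.registerTempTable("events")
  sqlContext.sql("SELECT COUNT(*) FROM events WHERE type = 'Error'").show()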
  • 40. Write Less Code: Input & Output. Spark SQL’s Data Source API can read and write DataFrames using a variety of formats, both built-in ({ JSON }, JDBC, and more…) and external. Find more sources at http://spark-packages.org
  • 41. 41 Spark Packages Supported Data Sources: • Avro • Redshift • CSV • MongoDB • Cassandra • Cloudant • Couchbase • ElasticSearch • Mainframes (IBM z/OS) • Many More!
  • 42. DataFrames: Reading from JDBC (since 1.3) • Supports any JDBC-compatible RDBMS: MySQL, Postgres, H2, etc. • Unlike the pure RDD implementation (JdbcRDD), this supports predicate pushdown and auto-converts the data into a DataFrame • Since you get a DataFrame back, it’s usable in Java/Python/R/Scala. • The JDBC server allows multiple users to share one Spark cluster
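A hedged sketch of the JDBC source in the Spark 1.4+ reader API (URL, table name and credentials are hypothetical; the JDBC driver jar must be on the classpath):

  df = (sqlContext.read.format("jdbc")
        .option("url", "jdbc:mysql://dbhost:3306/shop")
        .option("dbtable", "orders")
        .option("user", "reader")
        .option("password", "secret")
        .load())

  # Filters on the resulting DataFrame can be pushed down to the database as predicates
  df.filter(df["amount"] > 100).count()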
  • 43. Read Less Data. The fastest way to process big data is to never read it. Spark SQL can help you read less data automatically: • Converting to more efficient formats • Using columnar formats (i.e., Parquet) • Using partitioning (i.e., /year=2014/month=02/…)1 • Skipping data using statistics (i.e., min, max)2 • Pushing predicates into storage systems (i.e., JDBC). Footnotes: 1) Only supported for Parquet and Hive, more support coming in Spark 1.4. 2) Turned off by default in Spark 1.3.
  • 44. Parquet. Fall 2012: Twitter & Cloudera start the joint columnar-format effort. July 2013: 1.0 release. May 2014: enters the Apache Incubator, 40+ contributors. • Limits I/O: Scans/Reads only the columns that are needed • Saves Space: Columnar layout compresses better. (Diagram: logical table representation vs. Row Layout vs. Column Layout.)
  • 45. Source: parquet.apache.org Reading: • Readers first read the file metadata to find all the column chunks they are interested in. • The column chunks should then be read sequentially. Writing: • Metadata is written after the data to allow for single-pass writing.
  • 46. Parquet Features 1. Metadata merging • Allows developers to easily add/remove columns in data files • Spark will scan all metadata for files and merge the schemas 2. Auto-discover data that has been partitioned into folders • And then prune which folders are scanned based on predicates. So, you can greatly speed up queries simply by breaking up data into folders, as sketched below:
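A sketch of the folder-partitioning idea (hypothetical layout /data/events/year=2014/month=02/…; assumes an existing DataFrame df with year and month columns and an existing sqlContext):

  # Write partitioned by year/month; each distinct value becomes a folder
  df.write.format("parquet").partitionBy("year", "month").save("/data/events")

  # On read, Spark discovers year/month as columns and prunes folders by predicate
  events = sqlContext.read.parquet("/data/events")
  events.filter((events["year"] == 2014) & (events["month"] == 2)).count()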
  • 47. Write Less Code: Input & Output Unified interface to reading/writing data in a variety of formats: df = sqlContext.read .format("json") .option("samplingRatio", "0.1") .load("/home/michael/data.json") df.write .format("parquet") .mode("append") .partitionBy("year") .saveAsTable("fasterData") 47
  • 48. Write Less Code: Input & Output Unified interface to reading/writing data in a variety of formats: df = sqlContext.read .format("json") .option("samplingRatio", "0.1") .load("/home/michael/data.json") df.write .format("parquet") .mode("append") .partitionBy("year") .saveAsTable("fasterData") read and write functions create new builders for doing I/O 48
  • 49. Write Less Code: Input & Output Unified interface to reading/writing data in a variety of formats: Builder methods specify: • Format • Partitioning • Handling of existing data df = sqlContext.read .format("json") .option("samplingRatio", "0.1") .load("/home/michael/data.json") df.write .format("parquet") .mode("append") .partitionBy("year") .saveAsTable("fasterData") 49
  • 50. Write Less Code: Input & Output Unified interface to reading/writing data in a variety of formats: load(…), save(…) or saveAsTable(…) finish the I/O specification df = sqlContext.read .format("json") .option("samplingRatio", "0.1") .load("/home/michael/data.json") df.write .format("parquet") .mode("append") .partitionBy("year") .saveAsTable("fasterData") 50
  • 51. How are statistics used to improve DataFrames performance? • Statistics are logged when caching • During reads, these statistics can be used to skip some cached partitions • InMemoryColumnarTableScan can now skip partitions that cannot possibly contain any matching rows. (Diagram: three cached partitions with max(a) = 9, 7 and 8; for the predicate a = 8, the partition with max(a) = 7 can be skipped.) Reference: • https://github.com/apache/spark/pull/1883 • https://github.com/apache/spark/pull/2188 Filters Supported: • =, <, <=, >, >=
  • 52. DataFrame # of Partitions after Shuffle. (Diagram: DF-1 is shuffled into DF-2.) Set with sqlContext.setConf(key, value); spark.sql.shuffle.partitions defaults to 200. Spark 1.6: Adaptive Shuffle.
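Setting the shuffle parallelism described above (assuming an existing sqlContext and DataFrame df):

  # Default is 200 partitions after a shuffle (joins, groupBy aggregations)
  sqlContext.setConf("spark.sql.shuffle.partitions", "8")

  grouped = df.groupBy("type").count()     # the shuffle now produces 8 partitions
  print(grouped.rdd.getNumPartitions())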
  • 53. Caching a DataFrame with .cache(): Spark SQL will re-encode the data into byte buffers before caching so that there is less pressure on the GC.
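A small caching sketch (assuming an existing DataFrame df):

  df.cache()       # marks the DataFrame for in-memory columnar caching
  df.count()       # first action materializes the cached byte buffers
  df.count()       # served from the cache
  df.unpersist()   # release the memory when done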
  • 54. Demo!
  • 55. Schema Inference What if your data file doesn’t have a schema? (e.g., You’re reading a CSV file or a plain text file.) You can create an RDD of a particular type and let Spark infer the schema from that type. We’ll see how to do that in a moment. You can use the API to specify the schema programmatically. (It’s better to use a schema-oriented input source if you can, though.)
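The next slides show inference from a case class; for the other route mentioned above, here is a hedged sketch of specifying the schema programmatically (field names follow the example file on the next slide; assumes sc and sqlContext exist):

  from pyspark.sql.types import StructType, StructField, StringType, IntegerType

  schema = StructType([
      StructField("firstName", StringType(), True),
      StructField("lastName",  StringType(), True),
      StructField("gender",    StringType(), True),
      StructField("age",       IntegerType(), True),
  ])

  rows = (sc.textFile("people.csv")
            .map(lambda line: line.split(","))
            .map(lambda c: (c[0], c[1], c[2], int(c[3]))))

  df = sqlContext.createDataFrame(rows, schema)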
  • 56. Schema Inference Example Suppose you have a (text) file that looks like this: 56 The file has no schema, but it’s obvious there is one: First name:string Last name: string Gender: string Age: integer Let’s see how to get Spark to infer the schema. Erin,Shannon,F,42 Norman,Lockwood,M,81 Miguel,Ruiz,M,64 Rosalita,Ramirez,F,14 Ally,Garcia,F,39 Claire,McBride,F,23 Abigail,Cottrell,F,75 José,Rivera,M,59 Ravi,Dasgupta,M,25 …
  • 57. Schema Inference :: Scala 57 import sqlContext.implicits._ case class Person(firstName: String, lastName: String, gender: String, age: Int) val rdd = sc.textFile("people.csv") val peopleRDD = rdd.map { line => val cols = line.split(",") Person(cols(0), cols(1), cols(2), cols(3).toInt) } val df = peopleRDD.toDF // df: DataFrame = [firstName: string, lastName: string, gender: string, age: int]
  • 58. A brief look at spark-csv Let’s assume our data file has a header: 58 first_name,last_name,gender,age Erin,Shannon,F,42 Norman,Lockwood,M,81 Miguel,Ruiz,M,64 Rosalita,Ramirez,F,14 Ally,Garcia,F,39 Claire,McBride,F,23 Abigail,Cottrell,F,75 José,Rivera,M,59 Ravi,Dasgupta,M,25 …
  • 59. A brief look at spark-csv With spark-csv, we can simply create a DataFrame directly from our CSV file. 59 // Scala val df = sqlContext.read.format("com.databricks.spark.csv"). option("header", "true"). load("people.csv") # Python df = sqlContext.read.format("com.databricks.spark.csv"). load("people.csv", header="true")
  • 60. 60 DataFrames: Under the hood SQL AST DataFrame Unresolved Logical Plan Logical Plan Optimized Logical Plan RDDs Selected Physical Plan Analysis Logical Optimization Physical Planning CostModel Physical Plans Code Generation Catalog DataFrames and SQL share the same optimization/execution pipeline
  • 61. 61 DataFrames: Under the hood SQL AST DataFrame Unresolved Logical Plan Logical Plan Optimized Logical Plan RDDs Selected Physical Plan CostModel Physical Plans Catalog DataFrame Operations Selected Physical Plan
  • 62. Catalyst Optimizations Logical Optimizations Create Physical Plan & generate JVM bytecode • Push filter predicates down to data source, so irrelevant data can be skipped • Parquet: skip entire blocks, turn comparisons on strings into cheaper integer comparisons via dictionary encoding • RDBMS: reduce amount of data traffic by pushing predicates down • Catalyst compiles operations into physical plans for execution and generates JVM bytecode • Intelligently choose between broadcast joins and shuffle joins to reduce network traffic • Lower level optimizations: eliminate expensive object allocations and reduce virtual function calls
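One way to observe these optimizations is to ask Catalyst for its plans with explain() (a sketch, assuming an existing DataFrame df backed by a Parquet source):

  query = df.filter(df["type"] == "Error").select("time")

  # Prints the parsed, analyzed and optimized logical plans plus the physical plan;
  # with a Parquet source the filter typically appears as a pushed-down predicate.
  query.explain(True)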
  • 63. Not Just Less Code: Faster Implementations. (Chart: Time to Aggregate 10 million int pairs (secs) for RDD Scala, RDD Python, DataFrame Scala, DataFrame Python, DataFrame R, DataFrame SQL.) https://gist.github.com/rxin/c1592c133e4bccf515dd
  • 64. Catalyst Goals 64 1) Make it easy to add new optimization techniques and features to Spark SQL 2) Enable developers to extend the optimizer • For example, to add data source specific rules that can push filtering or aggregation into external storage systems • Or to support new data types
  • 65. Catalyst: Trees • Tree: Main data type in Catalyst • A tree is made of node objects • Each node has a node type and 0 or more children • New node types are defined as subclasses of the TreeNode class • Nodes are immutable and are manipulated via functional transformations. Imagine we have the following 3 node classes for a very simple expression language: • Literal(value: Int): a constant value • Attribute(name: String): an attribute from an input row, e.g., "x" • Add(left: TreeNode, right: TreeNode): sum of two expressions. Build a tree for the expression x + (1+2). In Scala code: Add(Attribute(x), Add(Literal(1), Literal(2)))
  • 66. Catalyst: Rules 66 • Rules: Trees are manipulated using rules • A rule is a function from a tree to another tree • Commonly, Catalyst will use a set of pattern matching functions to find and replace subtrees • Trees offer a transform method that applies a pattern matching function recursively on all nodes of the tree, transforming the ones that match each pattern to a result tree.transform { case Add(Literal(c1), Literal(c2)) => Literal(c1+c2) } Let’s implement a rule that folds Add operations between constants: Apply this to the tree: x + (1+2) Yields: x + 3 • The rule may only match a subset of all possible input trees • Catalyst tests which parts of a tree a given rule may apply to, and skips over or descends into subtrees that do not match • Rules don’t need to be modified as new types of operators are added
  • 67. Catalyst: Rules 67 tree.transform { case Add(Literal(c1), Literal(c2)) => Literal(c1+c2) case Add(left, Literal(0)) => left case Add(Literal(0), right) => right } Rules can match multiple patterns in the same transform call: Apply this to the tree: x + (1+2) Still yields: x + 3 Apply this to the tree: (x+0) + (3+3) Now yields: x + 6
  • 68. Catalyst: Rules 68 • Rules may need to execute multiple times to fully transform a tree • Rules are grouped into batches • Each batch is executed to a fixed point (until tree stops changing) Example: • Constant fold larger trees Example: • First batch analyzes an expression to assign types to all attributes • Second batch uses the new types to do constant folding • Rule conditions and their bodies contain arbitrary Scala code • Takeaway: Functional transformations on immutable trees (easy to reason & debug) • Coming soon: Enable parallelization in the optimizer
  • 69. 69 Using Catalyst in Spark SQL SQL AST DataFrame Unresolved Logical Plan Logical Plan Optimized Logical Plan RDDs Selected Physical Plan Analysis Logical Optimization Physical Planning CostModel Physical Plans Code Generation Catalog Analysis: analyzing a logical plan to resolve references Logical Optimization: logical plan optimization Physical Planning: Physical planning Code Generation: Compile parts of the query to Java bytecode
  • 70. Catalyst: Analysis SQL AST DataFrame Unresolved Logical Plan Logical Plan Analysis Catalog- - - - - - DF • Relation may contain unresolved attribute references or relations • Example: “SELECT col FROM sales” • Type of col is unknown • Even if it’s a valid col name is unknown (till we look up the table)
  • 71. Catalyst: Analysis SQL AST DataFrame Unresolved Logical Plan Logical Plan Analysis Catalog • Attribute is unresolved if: • Catalyst doesn’t know its type • Catalyst has not matched it to an input table • Catalyst will use rules and a Catalog object (which tracks all the tables in all data sources) to resolve these attributes Step 1: Build “unresolved logical plan” Step 2: Apply rules Analysis Rules • Look up relations by name in Catalog • Map named attributes (like col) to the input • Determine which attributes refer to the same value to give them a unique ID (for later optimizations) • Propagate and coerce types through expressions • We can’t know return type of 1 + col until we have resolved col
  • 73. Catalyst: Logical Optimizations 73 Logical Plan Optimized Logical Plan Logical Optimization • Applies rule-based optimizations to the logical plan: • Constant folding • Predicate pushdown • Projection pruning • Null propagation • Boolean expression simplification • [Others] • Example: a 12-line rule optimizes LIKE expressions with simple regular expressions into String.startsWith or String.contains calls.
  • 75. Catalyst: Physical Planning • Spark SQL takes a logical plan and generates one or more physical plans using physical operators that match the Spark execution engine: 1. mapPartitions() 2. new ShuffledRDD 3. zipPartitions() • Currently cost-based optimization is only used to select a join algorithm • Broadcast join • Traditional join • The physical planner also performs rule-based physical optimizations like pipelining projections or filters into one Spark map operation • It can also push operations from the logical plan into data sources (predicate pushdown). (Diagram: Optimized Logical Plan → Physical Planning → Physical Plans.)
  • 77. Catalyst: Code Generation 77 • Generates Java bytecode to run on each machine • Catalyst relies on janino to make code generation simple • (FYI - It used to be quasiquotes, but now is janino)RDDs Selected Physical Plan Code Generation This code gen function converts an expression like (x+y) + 1 to a Scala AST:
  • 79. Seamlessly Integrated Intermix DataFrame operations with custom Python, Java, R, or Scala code zipToCity = udf(lambda zipCode: <custom logic here>) def add_demographics(events): u = sqlCtx.table("users") events .join(u, events.user_id == u.user_id) .withColumn("city", zipToCity(df.zip)) Augments any DataFrame that contains user_id 79
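A concrete, hedged version of the zipToCity pattern above (the lookup table and column names are made up; PySpark UDFs also take a return type; sqlCtx is assumed to be an existing SQLContext/HiveContext as in the slide):

  from pyspark.sql.functions import udf
  from pyspark.sql.types import StringType

  zip_to_city = {"95008": "Campbell", "94105": "San Francisco"}   # toy lookup
  zipToCity = udf(lambda z: zip_to_city.get(z, "unknown"), StringType())

  def add_demographics(events):
      u = sqlCtx.table("users")                      # assumes a "users" table with a zip column
      return (events
              .join(u, events.user_id == u.user_id)  # join on user_id
              .withColumn("city", zipToCity(u.zip))) # run the UDF to add a city column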
  • 80. Optimize Entire Pipelines Optimization happens as late as possible, therefore Spark SQL can optimize even across functions. 80 events = add_demographics(sqlCtx.load("/data/events", "json")) training_data = events .where(events.city == "San Francisco") .select(events.timestamp) .collect()
  • 81. def add_demographics(events): u = sqlCtx.table("users") # Load Hive table events .join(u, events.user_id == u.user_id) # Join on user_id .withColumn("city", zipToCity(df.zip)) # Run udf to add city column events = add_demographics(sqlCtx.load("/data/events", "json")) training_data = events.where(events.city == "San Francisco").select(events.timestamp).collect() (Logical plan: a filter on top of a join of the events file with the users table; the join is expensive, so ideally only the relevant users are joined. Physical plan: join of scan (events) with a filter over scan (users).)
  • 82. def add_demographics(events): u = sqlCtx.table("users") # Load partitioned Hive table events .join(u, events.user_id == u.user_id) # Join on user_id .withColumn("city", zipToCity(df.zip)) # Run udf to add city column events = add_demographics(sqlCtx.load("/data/events", "parquet")) training_data = events.where(events.city == "San Francisco").select(events.timestamp).collect() (Logical plan: filter over a join of the events file with the users table. Physical plan: join of scan (events) with a filter over scan (users). Optimized physical plan with predicate pushdown and column pruning: join of optimized scan (events) with optimized scan (users).)
  • 83. Spark 1.5 – Speed / Robustness. Project Tungsten – Tightly packed binary structures – Fully-accounted memory with automatic spilling – Reduced serialization costs. (Chart: Average GC time per node (seconds) vs. data set size (relative, 1x–16x) for Default, Code Gen, Tungsten on-heap, Tungsten off-heap.)
  • 84. Spark 1.5 – Improved Function Library. 100+ native functions with optimized codegen implementations – String manipulation – concat, format_string, lower, lpad – Date/Time – current_timestamp, date_format, date_add – Math – sqrt, randn – Other – monotonicallyIncreasingId, sparkPartitionId. from pyspark.sql.functions import * yesterday = date_sub(current_date(), 1) df2 = df.filter(df.created_at > yesterday) import org.apache.spark.sql.functions._ val yesterday = date_sub(current_date(), 1) val df2 = df.filter(df("created_at") > yesterday)
  • 85. Window Functions Before Spark 1.4: - 2 kinds of functions in Spark that could return a single value: • Built-in functions or UDFs (round) • take values from a single row as input, and they generate a single return value for every input row • Aggregate functions (sum or max) • operate on a group of rows and calculate a single return value for every group New with Spark 1.4: • Window Functions (moving avg, cumulative sum) • operate on a group of rows while still returning a single value for every input row.
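A minimal window-function sketch (Spark 1.4+ API; assumes a DataFrame df with hypothetical category, date and value columns):

  from pyspark.sql.window import Window
  from pyspark.sql import functions as F

  # Moving average over the current row and the two preceding rows, per category
  w = Window.partitionBy("category").orderBy("date").rowsBetween(-2, 0)
  df.withColumn("moving_avg", F.avg("value").over(w)).show()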
  • 86.
  • 87. Streaming DataFrames Umbrella ticket to track what's needed to make streaming DataFrame a reality: https://issues.apache.org/jira/browse/SPARK-8360

Editor's Notes

  1. This saturated both disk and network layers
  2. The old Spark API (T&A) is based on Java/Python objects - this makes it hard for the engine to store data compactly (Java objects in memory carry a lot of extra overhead: class metadata, pointers to various things, etc.) - and it cannot understand the semantics of user functions - so if you run a map function over just one field of the data, it still has to read the entire object into memory. Spark doesn't know you only cared about one field.
  3. DataFrames were inspired by previous distributed data frame efforts, including Adatao’s DDF and Ayasdi’s BigDF. However, the main difference from these projects is that DataFrames go through the Catalyst optimizer, enabling optimized execution similar to that of Spark SQL queries.
  4. a DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood.  I’d say that DataFrame is a result of transformation of any other RDD. Your input RDD might contains strings and numbers. But as a result of transformation you end up with RDD that contains GenericRowWithSchema, which is what DataFrame actually is. So, I’d say that DataFrame is just sort of wrapper around simple RDD, which provides some additional and pretty useful stuff.
  5. To compute an average. I have a dataset that is a list of names and ages. Want to figure out the average age for a given name. So, age distribution for a name…
  6. Head is non-deterministic, could change between jobs. Just the first partition that materialized returns results.
  7. Head is non-deterministic, could change between jobs. Just the first partition that materialized returns results.
  8. Spark SQL is the only project (1.4+) that can read from multiple versions of Hive. Spark 1.5 can read from 0.12 – 1.2. A lot of the Hive functionality is useful even if you don’t have a Hive installation! Spark will automatically create a local copy of the Hive metastore so users can use window functions, Hive UDFs, and create persistent tables. To use a HiveContext, you do not need to have an existing Hive setup, and all of the data sources available to a SQLContext are still available. HiveContext is only packaged separately to avoid including all of Hive’s dependencies in the default Spark build. The specific variant of SQL that is used to parse queries can also be selected using the spark.sql.dialect option. This parameter can be changed using either the setConf method on a SQLContext or by using a SET key=value command in SQL. For a SQLContext, the only dialect available is “sql”, which uses a simple SQL parser provided by Spark SQL. In a HiveContext, the default is “hiveql”, though “sql” is also available.
  9. The following example shows how to construct DataFrames in Python. A similar API is available in Scala and Java.
  10. Once built, DataFrames provide a domain-specific language for distributed data manipulation.  Here is an example of using DataFrames to manipulate the demographic data of a large population of users:
  11. You can also incorporate SQL while working with DataFrames, using Spark SQL. This example counts the number of users in the young DataFrame.
  12. But it's not the same as if you called .cache() on an RDD[Row], since we re-encode the data into byte buffers before caching so that there is less pressure on the GC.
  13. https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame The full RDD API is not exposed on DataFrames. Per Michael, absolute freedom for users restricts the types of optimizations that we can do.
  14. Finally, a Data Source for reading from JDBC has been added as built-in source for Spark SQL.  Using this library, Spark SQL can extract data from any existing relational databases that supports JDBC.  Examples include mysql, postgres, H2, and more.  Reading data from one of these systems is as simple as creating a virtual table that points to the external table.  Data from this table can then be easily read in and joined with any of the other sources that Spark SQL supports. This functionality is a great improvement over Spark’s earlier support for JDBC (i.e.,JdbcRDD).  Unlike the pure RDD implementation, this new DataSource supports automatically pushing down predicates, converts the data into a DataFrame that can be easily joined, and is accessible from Python, Java, and SQL in addition to Scala.
  15. Twitter and Cloudera merged efforts in 2012 to develop a columnar format Parquet is a column based storage format. It gets its name from the patterns in parquet flooring. Optimized use case for parquet is when you only need a subset of the total columns. Avro is better if you typically scan/read all of the fields in a row in each query. Typically, one of the most expensive parts of reading and writing data is (de)serialization. Parquet supports predicate push-down and schema projection to target specific columns in your data for filtering and reading — keeping the cost of deserialization to a minimum. Parquet compresses better because columns have a fixed data type (like string, integer, Boolean, etc).  it is easier to apply any encoding schemes on columnar data which could even be column specific such as delta encoding for integers and prefix/dictionary encoding for strings. Also, due to the homogeneity in data, there is a lot more redundancy and duplicates in the values in a given column. This allows better compression in comparison to data stored in row format. The format is explicitly designed to separate the metadata from the data. This allows splitting columns into multiple files, as well as having a single metadata file reference multiple parquet files. Ideal row group size: 512 MB – 1 GB. Since an entire row group might need to be read, we want it to completely fit on one HDFS block. Therefore, HDFS block sizes should also be set to be larger. An optimized read setup would be: 1GB row groups, 1GB HDFS block size, 1 HDFS block per HDFS file. Larger row groups allow for larger column chunks which makes it possible to do larger sequential IO. Larger groups also require more buffering in the write path (or a two pass write).  Data page size: 8 KB recommended. Data pages should be considered indivisible so smaller data pages allow for more fine grained reading (e.g. single row lookup). Larger page sizes incur less space overhead (less page headers) and potentially less parsing overhead (processing headers).  https://parquet.apache.org/documentation/latest/
  16. Row group: A logical horizontal partitioning of the data into rows. There is no physical structure that is guaranteed for a row group. A row group consists of a column chunk for each column in the dataset. Column chunk: A chunk of the data for a particular column. These live in a particular row group and is guaranteed to be contiguous in the file. Page: Column chunks are divided up into pages. A page is conceptually an indivisible unit (in terms of compression and encoding). There can be multiple page types which is interleaved in a column chunk. Hierarchically, a file consists of one or more row groups. A row group contains exactly one column chunk per column. Column chunks contain one or more pages. Metadata is written after the data to allow for single pass writing. Readers are expected to first read the file metadata to find all the column chunks they are interested in. The columns chunks should then be read sequentially. There are three types of metadata: file metadata, column (chunk) metadata and page header metadata. All thrift structures are serialized using the TCompactProtocol. The format is explicitly designed to separate the metadata from the data. This allows splitting columns into multiple files, as well as having a single metadata file reference multiple parquet files. Ideal row group size: 512 MB – 1 GB. Since an entire row group might need to be read, we want it to completely fit on one HDFS block. Therefore, HDFS block sizes should also be set to be larger. An optimized read setup would be: 1GB row groups, 1GB HDFS block size, 1 HDFS block per HDFS file. Larger row groups allow for larger column chunks which makes it possible to do larger sequential IO. Larger groups also require more buffering in the write path (or a two pass write).  Data page size: 8 KB recommended. Data pages should be considered indivisible so smaller data pages allow for more fine grained reading (e.g. single row lookup). Larger page sizes incur less space overhead (less page headers) and potentially less parsing overhead (processing headers).  https://parquet.apache.org/documentation/latest/
  17. First, organizations that store lots of data in Parquet often find themselves evolving the schema over time by adding or removing columns. With this release we add a new feature that will scan the metadata for all files, merging the schemas to come up with a unified representation of the data. This functionality allows developers to read data where the schema has changed over time, without the need to perform expensive manual conversions. - In Spark 1.4, we plan to provide an interface that will allow other formats, such as ORC, JSON and CSV, to take advantage of this partitioning functionality.
  18. On the builder you can specify methods… like, do you want to overwrite data already there? load, save, and saveAsTable are actions.
  19. Note that by default in Spark SQL, there is a parameter called spark.sql.shuffle.partitions, which sets the # of partitions in a DataFrame after a shuffle (in case the user didn't manually specify it). Currently, Spark does not do any automatic determination of partitions, it just uses the # in that parameter. Doing more automatic determination is on our roadmap though. You can change this parameter using: sqlContext.setConf(key, value). 1.6 = adaptive shuffle: look at the output of the map side, then pick the # of reducers. Matei and Yin's hack day project.
  20. Case classes are used when creating classes that primarily hold data. When your class is basically a data-holder, case classes simplify your code and perform common work. With case classes, unlike regular classes, we don’t have to use the new keyword when creating an object. The Scala interface for Spark SQL supports automatically converting an RDD containing case classes to a DataFrame. The case class defines the schema of the table. The names of the arguments to the case class are read using reflection (seen in green) and become the names of the columns. Case classes can also be nested or contain complex types such as Sequences or Arrays. This RDD can be implicitly converted to a DataFrame and then be registered as a table. Note: Case classes in Scala 2.10 can support only up to 22 fields. To work around this limit, // you can use custom classes that implement the Product interface. - - - - What is a case class (vs a normal class)? Original purpose was used for matching, but it’s used for more now Scala’s version of a java bean (java has classes primary for data (gettings and settings) and there’s classes mostly for operations Case classes are mostly for data Scala can do reflection to establish/infer the schema of the df (seen in green). You’d want to be more robust about parsing CSV in real life. peopleRDD.toDF uses (a) Scala implicits and (b) the type of the RDD (RDD[Person]) to infer the schema Mention that a case class, in Scala, is basically a Scala bean: A container for data, augmented with useful things by the Scala compiler.
  21. Catalyst is a powerful new optimization framework. The Catalyst framework allows the developers behind Spark SQL to rapidly add new optimizations, enabling us to build a faster system more quickly.  Unlike the eagerly evaluated data frames in R and Python, DataFrames in Spark have their execution automatically optimized by a query optimizer called Catalyst. Before any computation on a DataFrame starts, the Catalyst optimizer compiles the operations that were used to build the DataFrame into a physical plan for execution.
  22. Unlike the eagerly evaluated data frames in R and Python, DataFrames in Spark have their execution automatically optimized by a query optimizer called Catalyst. Before any computation on a DataFrame starts, the Catalyst optimizer compiles the operations that were used to build the DataFrame into a physical plan for execution.
  23. At a high level, there are two kinds of optimizations. First, Catalyst applies logical optimizations such as predicate pushdown. The optimizer can push filter predicates down into the data source, enabling the physical execution to skip irrelevant data. In the case of Parquet files, entire blocks can be skipped and comparisons on strings can be turned into cheaper integer comparisons via dictionary encoding. In the case of relational databases, predicates are pushed down into the external databases to reduce the amount of data traffic. Second, Catalyst compiles operations into physical plans for execution and generates JVM bytecode for those plans that is often more optimized than hand-written code. For example, it can choose intelligently between broadcast joins and shuffle joins to reduce network traffic. It can also perform lower level optimizations such as eliminating expensive object allocations and reducing virtual function calls. As a result, we expect performance improvements for existing Spark programs when they migrate to DataFrames. Since the optimizer generates JVM bytecode for execution, Python users will experience the same high performance as Scala and Java users.
  24. Since the optimizer generates JVM bytecode for execution, Python users will experience the same high performance as Scala and Java users. The above chart compares the runtime performance of running group-by-aggregation on 10 million integer pairs on a single machine (source code). Since both Scala and Python DataFrame operations are compiled into JVM bytecode for execution, there is little difference between the two languages, and both outperform the vanilla Python RDD variant by a factor of 5 and Scala RDD variant by a factor of 2.
  25. At its core, Catalyst contains a general library for representing trees and applying rules to manipulate them.  A tree is just a Scala object. - - These classes can be used to build up trees; for example, the tree for the expression x+(1+2), would be represented in Scala code as follows: (See Scala code and Diagram)
  26. Pattern matching is a feature of many functional languages that allows extracting values from potentially nested structures of algebraic data types. - The case keyword here is Scala’s standard pattern matching syntax, and can be used to match on the type of an object as well as give names to extracted values (c1 and c2 here). - The pattern matching expression that is passed to transform is a partial function, meaning that it only needs to match to a subset of all possible input trees.  This ability means that rules only need to reason about the trees where a given optimization applies and not those that do not match. Thus, rules do not need to be modified as new types of operators are added to the system.
  27. Rules (and Scala pattern matching in general) can match… -  In the example above, repeated application would constant-fold larger trees, such as (x+0)+(3+3)
  28. Examples below show why you may want to run a rule multiple times Running rules to fixed point means that each rule can be simple and self-contained, and yet still eventually have larger global effects on a tree.  - In our experience, functional transformations on immutable trees make the whole optimizer very easy to reason about and debug. They also enable parallelization in the optimizer, although we do not yet exploit this.
  29.  In the physical planning phase, Catalyst may generate multiple plans and compare them based on cost. All other phases are purely rule-based. Each phase uses different types of tree nodes; Catalyst includes libraries of nodes for expressions, data types, and logical and physical operators. - Does Catalyst currently have the capability to generate multiple physical plans? You had mentioned at TtT last week that costing is done eagerly to prune branches that are not allowed (aka greedy algorithm).  - Greedy Algorithm works by making the decision that seems most promising at any moment; it never reconsiders this decision, whatever situation may arise later. A greedy algorithm is an algorithm that follows the problem solving heuristic of making the locally optimal choice at each stage with the hope of finding a global optimum.
  30. Spark SQL begins with a relation to be computed, either from an abstract syntax tree (AST) returned by a SQL parser, or from a DataFrame object constructed using the API. A syntax tree is a tree representation of the structure of the source code. Each node of the tree denotes a construct occurring in the source code. The syntax is "abstract" in not representing every detail appearing in the real syntax. For instance, grouping parentheses are implicit in the tree structure, and a syntactic construct like an if-condition-then expression may be denoted by means of a single node with three branches. - An abstract syntax tree for the following code for the Euclidean algorithm: while b ≠ 0: if a > b then a := a − b else b := b − a; return a
  31. Spark SQL begins with a relation to be computed, either from an abstract syntax tree (AST) returned by a SQL parser, or from a DataFrame object constructed using the API. A syntax tree is a tree representation of the structure of the source code. Each node of the tree denotes a construct occurring in the source code. The syntax is "abstract" in not representing every detail appearing in the real syntax. For instance, grouping parentheses are implicit in the tree structure, and a syntactic construct like an if-condition-then expression may be denoted by means of a single node with three branches. - An abstract syntax tree for the following code for the Euclidean algorithm: while b ≠ 0: if a > b then a := a − b else b := b − a; return a
  32. A trivial [[Analyzer]] with an [[EmptyCatalog]] and [[EmptyFunctionRegistry]]. Used for testing when all relations are already filled in and the analyser needs only to resolve attribute references. Provides a logical query plan analyzer, which translates [[UnresolvedAttribute]]s and [[UnresolvedRelation]]s into fully typed objects using information in a schema [[Catalog]] and a [[FunctionRegistry]].
  33. Note, this is not cost based. Cost-based optimization is performed by generating multiple plans using rules, and then computing their costs.
  34. A trivial [[Analyzer]] with an [[EmptyCatalog]] and [[EmptyFunctionRegistry]]. Used for testing when all relations are already filled in and the analyser needs only to resolve attribute references. Provides a logical query plan analyzer, which translates [[UnresolvedAttribute]]s and [[UnresolvedRelation]]s into fully typed objects using information in a schema [[Catalog]] and a [[FunctionRegistry]].
  35. The framework supports broader use of cost-based optimization, however, as costs can be estimated recursively for a whole tree using a rule. We thus intend to implement richer cost-based optimization in the future.
37. As a simple example, consider the Add, Attribute and Literal tree nodes introduced in Section 4.2, which allowed us to write expressions such as (x+y)+1. Without code generation, such expressions would have to be interpreted for each row of data, by walking down a tree of Add, Attribute and Literal nodes. This introduces large numbers of branches and virtual function calls that slow down execution. With code generation, we can write a function to translate a specific expression tree to a Scala AST (see the sketch after this note). The strings beginning with q are quasiquotes, meaning that although they look like strings, they are parsed by the Scala compiler at compile time and represent ASTs for the code within. Quasiquotes can have variables or other ASTs spliced into them, indicated using $ notation. For example, Literal(1) would become the Scala AST for 1, while Attribute("x") becomes row.get("x"). In the end, a tree like Add(Literal(1), Attribute("x")) becomes an AST for a Scala expression like 1+row.get("x").
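A minimal sketch of such a translation function, along the lines of the example in the Spark SQL paper. The tree classes here are simplified stand-ins rather than Catalyst's real expression classes, and the generated code assumes a row object with a get method is in scope wherever the produced AST is eventually compiled.

  import scala.reflect.runtime.universe._   // quasiquotes (scala-reflect, Scala 2.11)

  // Simplified expression nodes, standing in for Catalyst's Literal/Attribute/Add.
  sealed trait Node
  case class Literal(value: Int) extends Node
  case class Attribute(name: String) extends Node
  case class Add(left: Node, right: Node) extends Node

  // Translate an expression tree into a Scala AST using quasiquotes.
  def compile(node: Node): Tree = node match {
    case Literal(value)   => q"$value"                                // AST for the constant itself
    case Attribute(name)  => q"row.get($name)"                        // AST for row.get("x")
    case Add(left, right) => q"${compile(left)} + ${compile(right)}"  // splice sub-trees and add them
  }

  // compile(Add(Literal(1), Attribute("x"))) yields an AST for: 1 + row.get("x")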
39. Sometimes you want to call complex functions to do additional work inside your queries. UDFs can be inlined directly in DataFrame code. The UDF zipToCity just invokes a lambda function that takes a zip code and runs some custom logic to figure out which city that zip code is located in. I also have a function called add_demographics, which takes a DataFrame with a userId column and automatically computes a bunch of demographic information: it does a join based on userId and then adds a new column with .withColumn, whose values come from the UDF. This function returns a new DataFrame (a sketch follows this note).
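A hedged sketch of what such code might look like in Scala (written here as addDemographics). The table name, column names, and the city-lookup logic are all assumptions for illustration, not the presenter's actual demo code; it assumes a SQLContext named sqlContext and a users table registered in the catalog.

  import org.apache.spark.sql.DataFrame
  import org.apache.spark.sql.functions.udf

  // Placeholder lookup logic; a real version would consult a zip-code-to-city table.
  def lookupCity(zip: String): String =
    if (zip != null && zip.startsWith("941")) "San Francisco" else "Other"

  // UDF that wraps the lambda so it can be used inside DataFrame expressions.
  val zipToCity = udf((zipCode: String) => lookupCity(zipCode))

  // Takes a DataFrame with a userId column, joins in the users table,
  // and adds a derived "city" column computed by the UDF.
  def addDemographics(events: DataFrame): DataFrame = {
    val users = sqlContext.table("users")                      // assumed table: userId, zip, ...
    events
      .join(users, events("userId") === users("userId"))       // join based on userId
      .withColumn("city", zipToCity(users("zip")))              // UDF result becomes a new column
  }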
40. All of this is lazy, so Spark SQL can do its optimizations much later. For the type of machine learning I'm doing, I may only need the ts column for San Francisco. Note that add_demographics does not itself have any functionality to filter down to just SF and ts; those filters are applied by the caller.
41. So maybe add_demographics was written by a co-worker and I just want to use it. All we construct is a logical query plan. Since this planning happens at the logical level, optimizations can even occur across function calls, as shown in the example below. In this example, Spark SQL is able to push the filtering of users by their location through the join, greatly reducing the cost of execution. This optimization is possible even though the original author of the add_demographics function did not provide a parameter for specifying how to filter users! - Ideally we want to filter the users ahead of time based on the extra predicates, and only do the join on the relevant users.
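A sketch of what the calling code might look like, continuing the assumed addDemographics example above; the input path and column names are made up, and explain(true) simply prints the logical and physical plans so you can check that the location filter has been pushed below the join.

  // Build the query lazily: nothing runs until an action is called.
  val events = sqlContext.read.json("/data/events.json")       // assumed input path
  val df = addDemographics(events)

  // The caller adds the filters; addDemographics knows nothing about them.
  val sfTimestamps = df.where(df("city") === "San Francisco").select("timestamp")

  sfTimestamps.explain(true)   // inspect the plans Catalyst produced
  sfTimestamps.collect()       // only now does the optimized plan actually execute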
42. Even cooler, if I want to optimize this later on… here I changed users to a partitioned Hive table and stored events as Parquet instead of JSON. With Parquet, Spark SQL notices there are new optimizations it can apply, such as reading only the columns the query needs; a sketch follows this note.
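A hedged sketch of swapping the storage format, with made-up paths and a hypothetical "state" partitioning column; the point is only that the downstream query code stays the same while the data sources change.

  // One-time conversion (assumed paths and partition column).
  sqlContext.read.json("/data/events.json")
    .write.parquet("/data/events.parquet")                      // events stored as columnar Parquet

  sqlContext.table("users")
    .write.partitionBy("state").parquet("/data/users.parquet")  // users written partitioned by state

  // Downstream query code is unchanged; it just reads the new inputs.
  val events = sqlContext.read.parquet("/data/events.parquet")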
43. The idea of Project Tungsten is to reimagine the execution engine for Spark SQL. As a user, when you move to 1.5 you will see significant robustness and speed improvements.
44. Before, you had to resort to Hive UDFs or drop into SQL… but now 100+ native functions have been added, and at runtime Java bytecode is generated to evaluate whatever you need (see the example after this note).
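A small illustration of the idea, using a couple of built-in functions from org.apache.spark.sql.functions instead of a Hive UDF. The DataFrame df and its email and signupDate columns are made up, and the exact set of functions available depends on the Spark version (these assume 1.5+).

  import org.apache.spark.sql.functions.{regexp_extract, datediff, current_date, lower}

  // Built-in, codegen-friendly functions instead of a Hive UDF or raw SQL.
  val enriched = df
    .withColumn("domain", regexp_extract(lower(df("email")), "@(.+)$", 1))       // extract email domain
    .withColumn("daysSinceSignup", datediff(current_date(), df("signupDate")))   // days since signup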
  45. Pass a physical plan generated by Catalyst into Streaming