Introduction to Spark
Datasets
Functional and relational together at last
Who am I?
Holden
● I prefer she/her for pronouns
● Co-author of the Learning Spark book
● Software Engineer at IBM’s Spark Technology Center
● @holdenkarau
● http://www.slideshare.net/hkarau
● https://www.linkedin.com/in/holdenkarau
Who do I think you all are?
● Nice people*
● Some knowledge of Apache Spark core
● Interested in using Spark Datasets
● Familiar-ish with Scala or Java or Python
Amanda
What we are going to explore together!
● What is Spark SQL
● Where it fits into the Spark ecosystem
● How DataFrames & Datasets are different from RDDs
● How Datasets are different from DataFrames
Ryan McGilchrist
The different pieces of Spark
Apache Spark core, surrounded by:
● SQL & DataFrames
● Streaming
● Language APIs: Scala, Java, Python, & R
● Graph tools: bagel & GraphX
● Spark ML
● MLlib
● Community packages
Jon Ross
The different pieces of Spark
Apache Spark core, surrounded by:
● SQL, DataFrames & Datasets
● Structured Streaming
● Streaming (Scala, Java, Python)
● Language APIs: Scala, Java, Python, & R
● Spark ML
● bagel & GraphX
● MLlib
● GraphFrames
Jon Ross
Why should we consider Spark SQL?
● Performance
○ Smart optimizer
○ More efficient storage
○ Faster serialization
● Simplicity
○ Windowed operations
○ Multi-column & multi-type aggregates
Rikki's Refuge
Why are Datasets so awesome?
● Get to mix functional style and relational style
● The performance of Spark SQL with the flexibility of RDDs
● Strongly typed
Will Folsom
What is the performance like?
Andrew Skudder
How is it so fast?
● Optimizer has more information (schema & operations)
● More efficient storage formats
● Faster serialization
● Some operations directly on serialized data formats
Andrew Skudder
Cat photo from http://galato901.deviantart.com/art/Cat-on-Work-Break-173043455
Getting started:
Our window to the world:
● Core Spark has the SparkContext
● Spark Streaming has the StreamingContext
● SQL has:
○ SQLContext and HiveContext (pre-2.0)
○ Unified into SparkSession in 2.0 (see the sketch below)
Petful
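A minimal sketch of both entry points (Spark on the classpath is assumed; the app name is illustrative):

    // Pre-2.0: build a SQLContext (or HiveContext) on top of a SparkContext.
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("datasets-intro"))
    val sqlContext = new SQLContext(sc)

    // 2.0+: SparkSession unifies SQLContext & HiveContext.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("datasets-intro")
      .getOrCreate()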
DataFrames, Datasets, and RDDs oh my!
Spark Core:
● RDDs
○ Templated on type, lazily evaluated, distributed collections, arbitrary
data types
Spark SQL:
● DataFrames (i.e. Dataset[Row])
○ Lazily evaluated data, eagerly evaluated schema, relational
● Datasets
○ Templated on type, have a matching schema, support both relational
and functional operations (contrasted in the sketch below)
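To make the distinction concrete, a quick sketch (assuming the RawPanda case class defined later in the talk, plus a SparkContext and 2.0-style session implicits in scope):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{DataFrame, Dataset}
    import spark.implicits._ // enables toDF/toDS & derives the RawPanda encoder

    // RDD: typed, but opaque to the optimizer.
    val rdd: RDD[RawPanda] =
      sc.parallelize(Seq(RawPanda(1, "94110", "giant", true, Array(0.4, 0.5))))

    // DataFrame: Dataset[Row] - the schema is known, the element type is erased to Row.
    val df: DataFrame = rdd.toDF()

    // Dataset: a schema *and* a compile-time element type.
    val ds: Dataset[RawPanda] = rdd.toDS()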
Spark SQL Data Types
● Requires that types have a Spark SQL encoder
○ Many common basic types already have encoders, and nested classes of
common types don’t require their own encoder
○ RDDs, by contrast, support any serializable object
● Many common data types are directly supported
● Can add encoders for others (see the sketch below)
loiez Deniel
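A sketch of how encoders show up in practice (Spark 1.6+; RawPanda is a case class of supported field types, so its encoder is derived automatically once the implicits are imported):

    import org.apache.spark.sql.{Encoder, Encoders}
    import spark.implicits._ // derives encoders for case classes & common types

    val ds = Seq(RawPanda(1, "94110", "giant", true, Array(0.4, 0.5))).toDS()

    // We can also ask for an encoder explicitly:
    val pandaEncoder: Encoder[RawPanda] = Encoders.product[RawPanda]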
Where to start?
● Load Data in DataFrames & Datasets - use
SparkSession
○ Using the new DataSource API, raw SQL queries, etc.
● Register tables
○ Run SQL queries against them
● Write programmatic queries against DataFrames
● Apply functional transformations to Datasets
U-nagi
Loading with Spark SQL
sqlContext.read returns a DataFrameReader
We can specify general properties & data specific options:
● option("key", "value")
○ the spark-csv ones we will use are header & inferSchema (see the sketch below)
● format("formatName")
○ built-in formats include parquet, jdbc, etc.
● load("path")
Jess Johnson
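For example, with the header & inferSchema options above (a sketch; the path is illustrative, and pre-2.0 the format name is "com.databricks.spark.csv" from the spark-csv package):

    val csvDf = sqlContext.read
      .format("csv") // "com.databricks.spark.csv" pre-2.0
      .option("header", "true") // first line holds the column names
      .option("inferSchema", "true") // sample the data to guess column types
      .load("pandas.csv")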
Loading with Spark SQL
val df = sqlContext.read
  .format("json")
  .load("sample.json")
Jess Johnson
What about other data formats?
● Built in
○ Parquet
○ JDBC
○ Json (which is amazing!)
○ Orc
○ Hive
● Available as packages
○ csv*
○ Avro, Redshift, Mongo, Cassandra, Cloudant, Couchbase, etc.
○ +34 at http://spark-packages.org/?q=tags%3A%22Data%20Sources%22
Michael Coghlan
*pre-2.0 package, 2.0+ built in hopefully
Sample json record
{"name":"mission",
"pandas":[{"id":1,"zip":"94110","pt":"giant",
"happy":true, "attributes":[0.4,0.5]}]}
Xiahong Chen
Getting the schema
● printSchema() for human readable
● schema for machine readable
Resulting schema:
root
|-- name: string (nullable = true)
|-- pandas: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: long (nullable = false)
| | |-- zip: string (nullable = true)
| | |-- pt: string (nullable = true)
| | |-- happy: boolean (nullable = false)
| | |-- attributes: array (nullable = true)
| | | |-- element: double (containsNull = false)
Simon Götz
Sample case class for schema:
case class RawPanda(id: Long, zip: String, pt: String,
  happy: Boolean, attributes: Array[Double])

case class PandaPlace(name: String, pandas: Array[RawPanda])
Orangeaurochs
Then from DF to DS
val pandas: Dataset[RawPanda] = df.as[RawPanda]
We can also convert RDDs
import sqlContext.implicits._ // spark.implicits._ in 2.0+; provides rdd.toDS

def fromRDD(rdd: RDD[RawPanda]): Dataset[RawPanda] = {
  rdd.toDS
}
Nesster
So what can we do with a DataFrame
● Relational style transformations
● Register it as a table and write raw SQL queries
○ df.registerTempTable("murh"); sqlContext.sql("select * from murh")
● Write it out (with a similar API as for loading - see the sketch below)
● Turn it into an RDD (& back again if needed)
● Turn it into a Dataset
● If you are coming from R or Pandas, adjust your expectations
sebastien batardy
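Writing out uses the mirror-image DataFrameWriter API (a sketch; the format, mode, and path are illustrative):

    df.write
      .format("parquet")
      .mode("overwrite") // or "append", "ignore", "error"
      .save("/tmp/pandas.parquet")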
What do our relational queries look like?
Many familiar faces are back with a twist:
● filter
● join
● groupBy - Now safe!
And some new ones:
● select
● window
● etc.
How do we write a relational query?
SQL expressions:
df.select(df("place"))
df.filter(df("happyPandas") >= minHappyPandas)
So what's this new groupBy?
● No longer causes explosions like RDD groupBy
○ Able to introspect and pipeline the aggregation
● Returns a GroupedData (or GroupedDataset)
● Makes it super easy to perform multiple aggregations at
the same time
● Built in shortcuts for aggregates like avg, min, max
● Longer list at http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$
Sherrie Thai
Computing some aggregates by age:
df.groupBy("age").min("hours-per-week")
OR
import org.apache.spark.sql.functions._
df.groupBy("age").agg(min("hours-per-week"))
Easily compute multiple aggregates:
df.groupBy("age").agg(min("hours-per-week"),
avg("hours-per-week"),
max("capital-gain"))
Windowed operations
● Can compute over the past K and next J
● Really hard to do in regular Spark, super easy in SQL
Lucie Provencher
Window specs
import org.apache.spark.sql.expressions.Window
val spec = Window.partitionBy("age")
  .orderBy("capital-gain")
  .rowsBetween(-10, 10)
val rez = df.select(avg("capital-gain").over(spec))
Ryo Chijiiwa
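rowsBetween counts physical rows around the current one; to instead take every row whose ordering key falls within a value range of the current row's, rangeBetween is the variant to reach for (a sketch; the bounds are illustrative):

    val rangeSpec = Window.partitionBy("age")
      .orderBy("capital-gain")
      .rangeBetween(-1000, 1000) // rows within 1000 of this row's capital-gain
    val smoothed = df.select(avg("capital-gain").over(rangeSpec))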
UDFS: Adding custom code
Scala:
sqlContext.udf.register("strLen", (s: String) => s.length())

Python:
sqlCtx.registerFunction("strLen", lambda x: len(x), IntegerType())
Yağmur Adam
Using UDF on a table:
First Register the table:
df.registerTempTable("myTable")
sqlContext.sql("SELECT firstCol, strLen(stringCol) from myTable")
Using UDFs Programmatically
import java.sql.Timestamp
import org.apache.spark.sql.expressions.UserDefinedFunction

def dateTimeFunction(format: String): UserDefinedFunction = {
  import org.apache.spark.sql.functions.udf
  udf((time: Long) => new Timestamp(time * 1000))
}
val format = "dd-mm-yyyy"
df.select(df(firstCol), dateTimeFunction(format)(df(unixTimeStamp)))
Introducing Datasets
● New in Spark 1.6
● Provide a templated, compile-time strongly typed version of DataFrames
● DataFrames are essentially Datasets of Row objects (i.e. not strongly typed)
with fewer operations
● Make it easier to intermix functional & relational code
○ Do you hate writing UDFs? So do I!
● Still an experimental component (the API will change in future versions)
○ Although the next major version seems likely to be 2.0 anyway, so lots of things may change
regardless
Daisyree Bakker
Using Datasets to mix functional & relational style:
val ds: Dataset[RawPanda] = ...
val happiness = ds.toDF().filter($"happy" === true).as[RawPanda]
  .select($"attributes"(0).as[Double])
  .reduce((x, y) => x + y)
So what was that?
ds.toDF().filter($"happy" === true).as[RawPanda]
  .select($"attributes"(0).as[Double])
  .reduce((x, y) => x + y)

● toDF(): converts the Dataset to a DataFrame to access more DataFrame functions (pre-2.0)
● as[RawPanda]: converts the DataFrame back to a Dataset
● select($"attributes"(0).as[Double]): a typed query (specifies the return type)
● reduce((x, y) => x + y): traditional functional reduction - arbitrary scala code :)
And functional style maps:
/**
 * Functional map + Dataset, sums the positive attributes for the pandas
 */
def funMap(ds: Dataset[RawPanda]): Dataset[Double] = {
  ds.map{rp => rp.attributes.filter(_ > 0).sum}
}
Chris Isherwood
What is functional perf like?
● Generally not as good - the optimizer can't introspect arbitrary functions
● SPARK-14083 is working on doing bytecode analysis
● Can still be faster than RDD transformations because of
serialization improvements
Why we should consider Datasets:
● We can solve problems tricky to solve with RDDs
○ Window operations
○ Multiple aggregations
● Fast
○ Awesome optimizer
○ Super space efficient data format
● We can solve problems tricky/frustrating to solve with
DataFrames
○ Writing UDFs and UDAFs can really break your flow
Where to go from here?
● SQL docs
● DataFrame & Dataset API
● High Performance Spark Early Release
Learning Spark
Fast Data Processing with Spark (out of date)
Fast Data Processing with Spark (2nd edition)
Advanced Analytics with Spark
Coming soon: Spark in Action
Early Release: High Performance Spark
And the next book…..
First four chapters are available in “Early Release”*:
● Buy from O’Reilly - http://bit.ly/highPerfSpark
● Chapter 2 free preview thanks to Pepper Data (until
May 21st)
Get notified when updated & finished:
● http://www.highperformancespark.com
● https://twitter.com/highperfspark
* Early Release means extra mistakes, but also a chance to help us make a more awesome
book.
And some upcoming talks & office hours
● Tomorrow - Workshop
● June
○ Strata London - Spark Performance
○ Datapalooza Tokyo
○ Scala Days Berlin
● July
○ Data Day Seattle
Cat wave photo by Quinn Dombrowski
k thnx bye!
If you <3 testing & want to fill out
survey: http://bit.ly/holdenTestingSpark
Will use the updated results in the
Strata presentation & tweet them
eventually at @holdenkarau