This session of the workshop introduces Spark SQL along with DataFrames and Datasets. Datasets give us the ability to easily intermix relational and functional style programming. So that we can explore the new Dataset API, this iteration will be focused on Scala.
2. Who am I?
Holden
● I prefer she/her for pronouns
● Co-author of the Learning Spark book
● Software Engineer at IBM’s Spark Technology Center
● @holdenkarau
● http://www.slideshare.net/hkarau
● https://www.linkedin.com/in/holdenkarau
3. Who are our TAs?
● Rachel Warren
● Anya Bida
● Pranav Honrao
● Anandha Ranganathan
● Michael Lyubinin
● Matt Gibb
4. What we are going to explore together!
● What is Spark SQL
● Where it fits into the Spark ecosystem
● How DataFrames & Datasets are different from RDDs
● Simple query
● Schemas
● Loading data
● Mixing functional transformations
Ryan McGilchrist
5. The different pieces of Spark
On top of core Apache Spark: SQL & DataFrames, Streaming, Spark ML & MLLib, graph tools (Bagel & GraphX), language APIs (Scala, Java, Python, & R), and community packages.
Jon Ross
6. Some pages to keep open
http://bit.ly/sparkDocs
http://bit.ly/sparkScalaDoc
http://bit.ly/sparkSQLFunctions
http://bit.ly/highPerfSparkExamples
Or
https://github.com/high-performance-spark/high-performance-spark-examples
JOHNNY LAI
9. How is it so fast?
● Optimizer has more information (schema & operations)
● More efficient storage formats
● Faster serialization
● Some operations directly on serialized data formats
Andrew Skudder
10. Cat photo from http://galato901.deviantart.com/art/Cat-on-Work-Break-173043455
11. Getting started:
Our window to the world:
● Core Spark has the SparkContext
● Spark Streaming has the StreamingContext
● SQL has the SQLContext and HiveContext
For today, if you want to explore Datasets, use Scala
Petful
12. Launching our shell
./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.4.0

IPYTHON_OPTS="notebook" ./bin/pyspark --packages com.databricks:spark-csv_2.11:1.4.0
More packages at
http://www.spark-packages.org
Moyan Brenn
13. You (most likely) want the HiveContext
● It doesn’t require an existing Hive installation
● If you have a Hive metastore you can connect to it
● Gives you better UDFs
● More extensive SQL parser in earlier versions of Spark
● If building from source you will need to add “-Phive”
● If you have Hive conflicts you can’t shade, use the SQLContext
Noel Reynolds
14. So what can we do with our context?
● Load Data in DataFrames & Datasets (we will start
here)
○ Using the new DataSource API, raw SQL queries, etc.
● Register tables*
● Start a Hive Thrift Server
● Add jars
○ E.g. add UDFs
● Set configuration variables
○ Like parquet writer, etc.
U-nagi
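For example, two of the items above sketched against a 1.x HiveContext (the table name and config value are illustrative, and `df` is a DataFrame we have already loaded):

```scala
// Register a DataFrame as a temporary table and query it with raw SQL.
df.registerTempTable("adult")
sqlContext.sql("SELECT COUNT(*) FROM adult")

// Set a SQL configuration variable.
sqlContext.setConf("spark.sql.shuffle.partitions", "10")
```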
15. Loading our Data
● I’m really lazy so we are going to start with the same
data as we did for our ML example last time
● https://github.com/holdenk/spark-intro-ml-pipeline-workshop
● We will add the spark-csv package to load the data
○ --packages com.databricks:spark-csv_2.11:1.4.0
● But this time let's look more at what we are doing
Jess Johnson
16. Loading with sparkSQL & spark-csv
sqlContext.read returns a DataFrameReader
We can specify general properties & data specific options
● option(“key”, “value”)
○ spark-csv ones we will use are header & inferSchema
● format(“formatName”)
○ built in formats include parquet, jdbc, etc.; today we will use com.databricks.spark.csv
● load(“path”)
Jess Johnson
17. Loading with sparkSQL & spark-csv
val df = sqlContext.read.
  format("com.databricks.spark.csv").
  option("header", "true").
  option("inferSchema", "true").
  load("resources/adult.data")
Jess Johnson
18. What about other data formats?
● Built in
○ Parquet
○ JDBC
○ JSON (which is amazing!)
○ ORC
○ Hive
● Available as packages
○ csv*
○ Avro, Redshift, Mongo, Cassandra, Cloudant, Couchbase, etc.
○ 34+ more at http://spark-packages.org/?q=tags%3A%22Data%20Sources%22
Michael Coghlan
*pre-2.0 package, 2.0+ built in hopefully
19. Ok so we’ve got our Data, what now?
● We can inspect the Schema
● We can start to apply some transformations (relational)
● We can do some machine learning
● We can jump into an RDD or a Dataset for functional
transformations
20. Getting the schema
● printSchema() for human readable
● schema for machine readable
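With the `df` we loaded above, both forms look roughly like this (a sketch; the column names come from whatever was inferred from the adult CSV):

```scala
df.printSchema()  // human readable tree, e.g. "|-- age: integer (nullable = true)"

val schema = df.schema  // machine readable StructType we can walk programmatically
schema.fields.foreach(f => println(s"${f.name}: ${f.dataType}"))
```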
21. Spark SQL Data Types
● Requires that types have a Spark SQL encoder
○ Many common basic types already have encoders, nested classes of
common types don’t require their own encoder
○ RDDs support any serializable object
● Many common data types are directly supported
● Can add encoders for others
● Datasets are templated on type, DataFrames are not
● Both have schema information
loiez Deniel
22. Sample case class for schema:
case class RawPanda(id: Long, zip: String, pt: String,
  happy: Boolean, attributes: Array[Double])

case class PandaPlace(name: String, pandas: Array[RawPanda])
Orangeaurochs
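A sketch of turning these case classes into a DataFrame, so Spark derives the (nested) schema for us (`sqlContext` and its implicits assumed from the shell; the sample values are made up):

```scala
import sqlContext.implicits._

val damao = RawPanda(1L, "94110", "giant", true, Array(0.1, 0.2))
val df = Seq(PandaPlace("sf", Array(damao))).toDF()
df.printSchema()  // pandas shows up as an array of nested structs
```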
25. Exercise 1:
● Load the adult CSV data
● Print its schema
● Write it out to parquet
● Finished quickly?
○ Try loading some data that doesn’t exist - does this behave differently
than Spark Core?
○ Help your neighbor (if they want)
26. Results:
● What does your schema look like?
● Note: since it’s CSV the schema is flat - but as we showed with JSON it can easily be nested
● What if we don’t like that schema?
● Why was reading the nonexistent file different than with Spark Core?
27. So what can we do with a DataFrame
● Relational style transformations
● Register it as a table and write raw SQL queries
○ df.registerTempTable("murh"); sqlContext.sql("select * from murh")
● Write it out (with a similar API as for loading)
● Turn it into an RDD (& back again if needed)
● Turn it into a Dataset
● If you are coming from R or Pandas adjust your
expectations
sebastien batardy
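Sketches of the conversions listed above (Spark 1.6 APIs; `RawPanda` as defined earlier, implicits from the shell):

```scala
import sqlContext.implicits._

val rdd = df.rdd                                      // DataFrame -> RDD[Row]
val df2 = sqlContext.createDataFrame(rdd, df.schema)  // ...and back again
val ds = df.as[RawPanda]                              // DataFrame -> Dataset (1.6+)
```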
28. What do our relational queries look like?
Many familiar faces are back with a twist:
● filter
● join
● groupBy - Now safe!
And some new ones:
● select
● window
● etc.
29. How do we write a relational query?
SQL expressions:
df.select(df("place"))
df.filter(df("happyPandas") >= minHappyPandas)
30. So what’s this new groupBy?
● No longer causes explosions like RDD groupBy
○ Able to introspect and pipeline the aggregation
● Returns a GroupedData (or GroupedDataset)
● Makes it super easy to perform multiple aggregations at
the same time
● Built in shortcuts for aggregates like avg, min, max
● Longer list at http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$
Sherrie Thai
31. Computing some aggregates by age code:
df.groupBy("age").min("hours-per-week")
OR
import org.apache.spark.sql.functions.min
df.groupBy("age").agg(min("hours-per-week"))
32. Exercise 2: find the avg, min, etc.
Load in the parquet data from exercise 1
● if you didn’t get there it’s cool, just work from the csv
Grouped by
● Age
● Sex
● Native country
Of the following fields:
● Hours per week
● capital-gain
Clarissa Butelli
33. What were your results?
● How would we have done that with RDDs?
● Can we do aggregates without grouping first?
Clarissa Butelli
34. Windowed operations
● Can compute over the past K and next J
● Really hard to do in regular Spark, super easy in SQL
Lucie Provencher
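A sketch of a windowed aggregate over the 2 rows before and after each row, using `org.apache.spark.sql.expressions.Window` (column names assumed from the adult data we loaded earlier):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.avg

// Rolling average of hours-per-week, partitioned by sex and ordered by age.
val windowSpec = Window.partitionBy("sex").orderBy("age").rowsBetween(-2, 2)

df.select(df("age"),
  avg(df("hours-per-week")).over(windowSpec).as("rollingAvgHours"))
```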
37. UDFS: Adding custom code
Scala:
sqlContext.udf.register("strLen", (s: String) => s.length())

Python:
sqlCtx.registerFunction("strLen", lambda x: len(x), IntegerType())
Yağmur Adam
38. Using UDF on a table:
First register the table:
df.registerTempTable("myTable")
Then use the UDF in SQL:
sqlContext.sql("SELECT firstCol, strLen(stringCol) FROM myTable")
39. Using UDFs programmatically
import java.sql.Timestamp
import org.apache.spark.sql.functions.udf

def dateTimeFunction(format: String): UserDefinedFunction = {
  udf((time: Long) => new Timestamp(time * 1000))
}

val format = "dd-mm-yyyy"
df.select(df(firstCol), dateTimeFunction(format)(df(unixTimeStamp)))
40. Introducing Datasets
● New in Spark 1.6
● Provide templated compile time strongly typed version of DataFrames
● DataFrames are essentially Datasets of Row objects (i.e. not strongly typed) with fewer operations
● Make it easier to intermix functional & relational code
○ Do you hate writing UDFs? So do I!
● Still an experimental component (API will change in future versions)
○ Although the next major version seems likely to be 2.0 anyways so lots of things may change
regardless
Daisyree Bakker
41. Using Datasets to mix functional & relational style:
val ds: Dataset[RawPanda] = ...
val happiness = ds.toDF().filter($"happy" === true).as[RawPanda].
  select($"attributes"(0).as[Double]).
  reduce((x, y) => x + y)
42. So what was that?
ds.toDF().filter($"happy" === true).as[RawPanda].
  select($"attributes"(0).as[Double]).
  reduce((x, y) => x + y)
● toDF(): convert the Dataset to a DataFrame to access more DataFrame functions
● as[RawPanda]: convert the DataFrame back to a Dataset
● select($"attributes"(0).as[Double]): a typed query (specifies the return type)
● reduce((x, y) => x + y): traditional functional reduction: arbitrary scala code :)
43. And functional style maps:
/**
 * Functional map + Dataset, sums the positive attributes for the pandas
 */
def funMap(ds: Dataset[RawPanda]): Dataset[Double] = {
  ds.map{rp => rp.attributes.filter(_ > 0).sum}
}
Chris Isherwood
44. Exercise 3: Tokenize with “-”s
● Convert our DataFrame to a Dataset (we will need to make a case class)
● We could make a UDF but let’s use a Dataset if we are working in Scala
● Split on “-” tokens (we don’t have regular spaces in our data)
● Python users: UDF time (or build from source)
● Count the average # of tokens
Nina A.J.
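The per-row logic of the exercise is just plain Scala; a minimal sketch of the tokenize step and the average count (the Dataset version would `map` `tokenize` over each row, here a local Seq stands in for the Dataset):

```scala
// Split a string field on "-" tokens.
def tokenize(s: String): Array[String] = s.split("-")

// Average number of tokens across rows.
def avgTokens(rows: Seq[String]): Double =
  rows.map(tokenize(_).length).sum.toDouble / rows.size
```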
46. What is functional perf like?
● Generally not as good - the optimizer can’t introspect arbitrary functions
● SPARK-14083 is working on doing bytecode analysis
● Can still be faster than RDD transformations because of
serialization improvements
47. Where to go from here?
● SQL docs
● DataFrame & Dataset API
● High Performance Spark Early Release
49. Books
● Learning Spark
● Fast Data Processing with Spark (out of date)
● Fast Data Processing with Spark (2nd edition)
● Advanced Analytics with Spark
● Coming soon: Spark in Action
● Early release: High Performance Spark
50. And the next book…..
First four chapters are available in “Early Release”*:
● Buy from O’Reilly - http://bit.ly/highPerfSpark
Get notified when updated & finished:
● http://www.highperformancespark.com
● https://twitter.com/highperfspark
* Early Release means extra mistakes, but also a chance to help us make a more awesome
book.
51. And some upcoming talks & office hours
● April
○ Local workshops (this workshop) & South Bay (Intro to Spark)
● May
○ Apache Con Big Data (Vancouver)
● June
○ Strata London - Spark Performance
○ Datapalooza Tokyo
○ Scala Days Berlin
● July
○ Data Day Seattle
52. Cat wave photo by Quinn Dombrowski
k thnx bye!
If you want to fill out the survey: http://bit.ly/holdenTestingSpark
Will use updated results in the Strata presentation & tweet eventually at @holdenkarau