Introduction to Dataset
1. Introduction to Dataset API
Overcoming the limitations of DataFrames
https://github.com/shashankgowdal/introduction_to_dataset
2. ● Shashank L
● Big data consultant and trainer at
datamantra.io
● www.shashankgowda.com
3. Agenda
● History of Spark APIs
● Limitations of DataFrames
● Dataset
● Encoders
● Dataset hierarchy
● Performance
● Roadmap
4. RDD API (2011)
● Distributed collection of JVM objects
● Immutable and fault-tolerant
● Processes structured and unstructured data
● Functional transformations
5. Limitations of RDD API
● No schema associated
● Optimization must be done on the user's end
● Reading from multiple sources is difficult
● Combining multiple sources is difficult
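The schema and optimization gaps above can be illustrated with a small sketch (assumes a SparkContext `sc` is in scope, as in the later slides; the data is made up):

```scala
// An RDD of raw tuples: no schema, so fields are addressed by
// position, and Spark cannot optimize the filter -- the closure
// runs exactly as written.
val people = sc.makeRDD(Seq(("A", 10), ("B", 20), ("C", 30)))

// The user must remember that _2 is the age; a typo such as
// _1 > 25 would still compile but compare the wrong field.
val adults = people.filter(_._2 > 25)
```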
6. DataFrame API (2013)
● Distributed collection of Row objects
● Immutable and fault-tolerant
● Processes structured data
● Optimization from Catalyst optimizer
● Data source API
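The Data source API mentioned above gives one reader interface for many formats, which also makes combining sources straightforward. A minimal sketch (file paths are hypothetical):

```scala
// One uniform reader covers different structured sources.
val jsonDF    = sqlContext.read.json("people.json")
val parquetDF = sqlContext.read.parquet("people.parquet")

// Both are DataFrames with known schemas, so combining them is a
// single relational operation that the Catalyst optimizer can plan.
val combined = jsonDF.unionAll(parquetDF)
```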
7. Limitations of DataFrame
● No compile-time type safety
● Cannot operate on domain objects
● No functional programming API
8. Compile time safety
val dataframe = sqlContext.read.json("people.json")
dataframe.filter("salary > 1000").show()
Throws a runtime exception, because the column reference is checked only at execution time:
org.apache.spark.sql.AnalysisException: cannot resolve 'salary' given input columns age, name;
9. Operating on domain objects
val personRDD = sc.makeRDD(Seq(Person("A", 10), Person("B", 20)))
// Create an RDD[Person]
val personDF = sqlContext.createDataFrame(personRDD)
// Create a DataFrame from an RDD[Person]
personDF.rdd
// We get back RDD[Row], not RDD[Person] -- the domain type is lost
11. Dataset
An extension of the DataFrame API that provides a type-safe,
object-oriented programming interface.
12. Dataset API
● Type-safe: Operate on domain objects with compiled
lambda functions
● Fast: Code-generated encoders for fast serialization
● Interoperable: Easily convert between DataFrames and
Datasets without boilerplate code
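The DataFrame/Dataset round trip mentioned above needs no boilerplate. A sketch, assuming the `Person` case class used in the following slides and a `people.json` file:

```scala
case class Person(name: String, age: Long)
import sqlContext.implicits._          // brings in encoders for case classes

val df = sqlContext.read.json("people.json")   // untyped DataFrame
val ds: Dataset[Person] = df.as[Person]        // typed view over the same plan
val back: DataFrame = ds.toDF()                // and back again, no copying
```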
14. Encoders
● An Encoder converts a JVM object into a Dataset row
● Code-generated encoders for fast serialization
[Diagram: JVM Object → Encoder → Dataset row]
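In Spark 1.6 the encoder is usually supplied implicitly via `sqlContext.implicits._`, but one can also be obtained explicitly from the `Encoders` factory. A sketch:

```scala
import org.apache.spark.sql.{Dataset, Encoders}

case class Person(name: String, age: Long)

// Encoders.product derives a code-generated encoder from the case
// class structure, mapping Person fields to Dataset row columns.
val personEncoder = Encoders.product[Person]

val ds: Dataset[Person] =
  sqlContext.createDataset(Seq(Person("A", 10)))(personEncoder)
```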
15. Compile time safety check
case class Person(name: String, age: Long)
val dataframe = sqlContext.read.json("people.json")
val ds : Dataset[Person] = dataframe.as[Person]
ds.filter(p => p.age > 25)       // compiles: age is a field of Person
ds.filter(p => p.salary > 12500)
// error: value salary is not a member of Person -- caught at compile time
16. Operating on domain objects
val personRDD = sc.makeRDD(Seq(Person("A",10), Person("B",20)))
//Create RDD[Person]
val personDS = sqlContext.createDataset(personRDD)
// Create a Dataset from an RDD
personDS.rdd
// We get back RDD[Person], not RDD[Row] as with a DataFrame
17. Functional programming
case class Person(name: String, age: Int)
val dataframe = sqlContext.read.json("people.json")
val ds: Dataset[Person] = dataframe.as[Person]

// Compute a histogram of ages, bucketed by decade, for each name
val hist = ds.groupBy(_.name).mapGroups {
  case (name, people) =>
    val buckets = new Array[Int](10)
    people.map(_.age).foreach { a =>
      buckets(a / 10) += 1
    }
    (name, buckets)
}
23. Roadmap
● The name "Dataset" itself may change
● Performance optimizations
● Unification of DataFrames with Dataset
● Public API for Encoders
● Support for most of the RDD operators on Dataset
24. Unification of DataFrames with Dataset
class Dataset[T](
    val sqlContext: SQLContext,
    val queryExecution: QueryExecution)(
    implicit val encoder: Encoder[T])

class DataFrame(
    sqlContext: SQLContext,
    queryExecution: QueryExecution)
  extends Dataset[Row](sqlContext, queryExecution)(new RowEncoder)
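For reference, this unification did eventually ship: since Spark 2.0, DataFrame is no longer a separate class but a plain type alias, defined in the `org.apache.spark.sql` package object:

```scala
// Spark 2.x, org.apache.spark.sql package object (paraphrased)
type DataFrame = Dataset[Row]
```

Every DataFrame operation is thus just a Dataset[Row] operation, exactly as the sketch above anticipated.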