Making sense of the RDD, DataFrame, Spark SQL and Dataset APIs

Motivation
An overview of the different Spark APIs for working with structured data.
Timeline of Spark APIs
Spark 1.0 used the RDD API. A Resilient Distributed Dataset (RDD) is the basic abstraction in Spark: an immutable, partitioned collection of elements that can be operated on in parallel.
Spark 1.3 introduced the DataFrame API: a distributed collection of data organized into named columns. This is a concept well known from R and Python (Pandas).
Spark 1.6 introduced an experimental Dataset API: an extension of the DataFrame API that provides a type-safe, object-oriented programming interface.
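The difference between the three generations is easiest to see on one small example. A minimal sketch (not part of the original notebook), assuming a Spark 1.6 Scala shell where sc and sqlContext are predefined:

```scala
// The same filter expressed in each of the three APIs.
import sqlContext.implicits._

case class Person(name: String, age: Int)
val people = sc.parallelize(Seq(Person("Lars", 37), Person("Sven", 38)))

people.filter(_.age > 37)        // RDD: opaque lambda, no optimizer support
people.toDF.filter("age > 37")   // DataFrame: declarative, Catalyst-optimized
people.toDS.filter(_.age > 37)   // Dataset: typed lambda, still planned by Catalyst
```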
RDD
RDD - Resilient Distributed Dataset
Functional transformations on partitioned collections of opaque objects.

Define a case class representing the schema of our data. Each field represents a column of the table.

> case class Person(name: String, age: Int)

defined class Person

Create a parallelized collection (RDD):

> val peopleRDD = sc.parallelize(Array(
    Person("Lars", 37),
    Person("Sven", 38),
    Person("Florian", 39),
    Person("Dieter", 37)
  ))

peopleRDD: org.apache.spark.rdd.RDD[Person] = ParallelCollectionRDD[5885] at parallelize at <console>:95

Filtering the collection yields an RDD of type Person:

> val rdd = peopleRDD
    .filter(_.age > 37)

rdd: org.apache.spark.rdd.RDD[Person] = MapPartitionsRDD[5886] at filter at <console>:98

NB: Person objects are returned:

> rdd
    .collect
    .foreach(println(_))

Person(Sven,38)
Person(Florian,39)
DataFrames
Declarative transformations on partitioned collections of tuples. Create DataFrames from RDDs; an implicit conversion is also available.

> val peopleDF = peopleRDD.toDF

peopleDF: org.apache.spark.sql.DataFrame = [name: string, age: int]

> peopleDF.show()

+-------+---+
|   name|age|
+-------+---+
|   Lars| 37|
|   Sven| 38|
|Florian| 39|
| Dieter| 37|
+-------+---+

> peopleDF.printSchema()

root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = false)

Show only the age column:

> peopleDF.select("age").show()

+---+
|age|
+---+
| 37|
| 38|
| 39|
| 37|
+---+

NB: The result set consists of Rows of Strings and Ints:

> peopleDF
    .filter("age > 37")
    .collect
    .foreach(row => println(row))

[Sven,38]
[Florian,39]

DataSets

> val peopleDS = peopleRDD.toDS

peopleDS: org.apache.spark.sql.Dataset[Person] = [name: string, age: int]

NB: The result set consists of Person objects:

> val ds = peopleDS
    .filter(_.age > 37)

ds: org.apache.spark.sql.Dataset[Person] = [name: string, age: int]

> ds.collect
    .foreach(println(_))

Person(Sven,38)
Person(Florian,39)

> ds.queryExecution.analyzed

res294: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
MapPartitions <function1>, class[name[0]: string, age[0]: int], class[name[0]: string, age[0]: int], [name#11044,age#11045]
+- LogicalRDD [name#11041,age#11042], MapPartitionsRDD[5894] at rddToDatasetHolder at <console>:97

> ds.queryExecution.optimizedPlan

res295: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
MapPartitions <function1>, class[name[0]: string, age[0]: int], class[name[0]: string, age[0]: int], [name#11044,age#11045]
+- LogicalRDD [name#11041,age#11042], MapPartitionsRDD[5894] at rddToDatasetHolder at <console>:97

Spark SQL

// Get SQL context from Spark context
// NB: "In Databricks, developers should utilize the shared HiveContext instead of creating one using the constructor. In Scala and Python notebooks, the shared context can be accessed as sqlContext. When running a job, you can access the shared context by calling SQLContext.getOrCreate(SparkContext.getOrCreate())."

> import org.apache.spark.sql.SQLContext
  val sqlContext = SQLContext.getOrCreate(SparkContext.getOrCreate())

import org.apache.spark.sql.SQLContext
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.hive.HiveContext@50ec5129

Register the DataFrame for usage via SQL:

> peopleDF.registerTempTable("sparkPeopleTbl")

The results of SQL queries are DataFrames and support all the usual RDD operations.

> sqlContext.sql("SELECT * FROM sparkPeopleTbl WHERE age > 37")

res298: org.apache.spark.sql.DataFrame = [name: string, age: int]

Print the execution plan:

> sql("SELECT * FROM sparkPeopleTbl WHERE age > 37").queryExecution.analyzed

res299: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
Project [name#11037,age#11038]
+- Filter (age#11038 > 37)
   +- Subquery sparkpeopletbl
      +- LogicalRDD [name#11037,age#11038], MapPartitionsRDD[5887] at rddToDataFrameHolder at <console>:97

The execution plan after the built-in (Catalyst) optimizer:

> sql("SELECT * FROM sparkPeopleTbl WHERE age > 37").queryExecution.optimizedPlan

res300: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
Project [name#11037,age#11038]
+- Filter (age#11038 > 37)
   +- LogicalRDD [name#11037,age#11038], MapPartitionsRDD[5887] at rddToDataFrameHolder at <console>:97

> %sql SELECT * FROM sparkPeopleTbl WHERE age > 37

name    age
Sven    38
Florian 39

> sql("SELECT * FROM sparkPeopleTbl WHERE age > 37").collect

res301: Array[org.apache.spark.sql.Row] = Array([Sven,38], [Florian,39])

> sql("SELECT * FROM sparkPeopleTbl WHERE age > 37").map(row => "NAME: " + row(0)).collect

res302: Array[String] = Array(NAME: Sven, NAME: Florian)

> sql("SELECT * FROM sparkPeopleTbl WHERE age > 37").rdd.map(row => "NAME: " + row(0)).collect()

res303: Array[String] = Array(NAME: Sven, NAME: Florian)

> sql("SELECT * FROM sparkPeopleTbl WHERE age > 37")

res304: org.apache.spark.sql.DataFrame = [name: string, age: int]

Running SQL queries against Parquet files directly

> ls3("s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/")

s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/_SUCCESS [0.00 MiB]
s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/_common_metadata [0.00 MiB]
s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/_metadata [0.03 MiB]
s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/part-r-00000-5df3eb09-ce03-4ac6-8770-287ce4782749.gz.parquet [53.92 MiB]
s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/part-r-00001-5df3eb09-ce03-4ac6-8770-287ce4782749.gz.parquet [53.87 MiB]
s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/part-r-00002-5df3eb09-ce03-4ac6-8770-287ce4782749.gz.parquet [53.84 MiB]
s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/part-r-00003-5df3eb09-ce03-4ac6-8770-287ce4782749.gz.parquet [53.92 MiB]
s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/part-r-00004-5df3eb09-ce03-4ac6-8770-287ce4782749.gz.parquet [53.89 MiB]
s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/part-r-00005-5df3eb09-ce03-4ac6-8770-287ce4782749.gz.parquet [53.87 MiB]
s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/part-r-00006-5df3eb09-ce03-4ac6-8770-287ce4782749.gz.parquet [53.89 MiB]
s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/part-r-00007-5df3eb09-ce03-4ac6-8770-287ce4782749.gz.parquet [53.89 MiB]
s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/part-r-00008-5df3eb09-ce03-4ac6-8770-287ce4782749.gz.parquet [53.89 MiB]
s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/part-r-00009-5df3eb09-ce03-4ac6-8770-287ce4782749.gz.parquet [53.95 MiB]

> %sql SELECT * FROM parquet.`s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/` WHERE trip_time_in_secs <= 60

Showing the first 1000 rows (remaining columns omitted in this export):

medallion                          hack_license                       vendor_id  rate_code  pickup_datetime
0E0157EB3E6F927FE7FA86C3A0762B3B   4E0569072023608FFEB72F454CCF408B   VTS        1          2013-01-13T11:08:00.000+0000
19BF1BB516C4E992EA3FBAEDA73D6262   E4CAC9101BFE631554B57906364761D3   VTS        2          2013-01-13T10:33:00.000+0000
D57C7392455C38D9404660F7BC63D1B6   EC5837D805127379D72FF6C35279890B   VTS        1          2013-01-13T04:58:00.000+0000
67108EDF8123623806A1DAFE8811EE63   36C9437BD2FF31940BEBED44DDDDDB8A   VTS        5          2013-01-13T04:58:00.000+0000
5F78CC6D4ECD0541B765FECE17075B6F   39703F5449DADC0AEFFAFEFB1A6A7E64   VTS        1          2013-01-13T08:56:00.000+0000
Conclusions
RDDs
RDDs remain the core abstraction for Spark's native distributed collections. But due to the lack of built-in optimization, DataFrames and Datasets should be preferred.
DataFrames
DataFrames and Spark SQL are very flexible and bring built-in optimization even to dynamic languages like Python and R. Besides this, they allow combining declarative and functional ways of working with structured data. This is regarded as the most stable and flexible API.
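That mix of styles means a query can start declaratively, so the optimizer sees the predicate, and then drop into arbitrary Scala. A small sketch (not from the original notebook), reusing the peopleDF DataFrame from the examples above:

```scala
// Declarative part first, so Catalyst can optimize the predicate;
// then switch to the functional RDD API for arbitrary Scala logic.
val upperNames = peopleDF
  .filter("age > 37")                        // declarative, optimizable
  .rdd                                       // leave the optimized plan
  .map(row => row.getString(0).toUpperCase)  // functional, opaque to Catalyst
upperNames.collect.foreach(println)
```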
Datasets
Datasets unify the best of both worlds: the type safety of RDDs and the built-in optimization available to DataFrames. The API is still experimental and, as of Spark 1.6, is available in Scala and Java. Datasets enable even further optimizations (memory compaction and faster serialization using encoders). Datasets are not yet mature, but it is expected that they will quickly become a common way to work with structured data, and Spark plans to converge the APIs even further.
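The encoders mentioned above can also be used explicitly when building a Dataset from local data. A hedged sketch (not from the original notebook), assuming the Spark 1.6 shell and the Person case class defined earlier:

```scala
// Encoders translate JVM objects to and from Tungsten's compact binary
// format, replacing generic Java/Kryo serialization.
import sqlContext.implicits._  // implicit Encoders for case classes and primitives

val ds = sqlContext.createDataset(Seq(Person("Lars", 37), Person("Sven", 38)))
val adults = ds.filter(_.age > 37)  // compile-time checked, Catalyst-planned
adults.show()
```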
Here are some benchmarks of Datasets (figure "Transformation of Data Types in Spark", not reproduced in this export).
Spark RDD-DF-SQL-DS-Spark Hadoop User Group Munich Meetup 2016
Kings of Saudi Arabia, information about themeitharjee
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...nirzagarg
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNKTimothy Spann
 
Statistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbersStatistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numberssuginr1
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubaikojalkojal131
 
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...kumargunjan9515
 
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptxRESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptxronsairoathenadugay
 
7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.pptibrahimabdi22
 
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...HyderabadDolls
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...HyderabadDolls
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Klinik kandungan
 
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...SOFTTECHHUB
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...nirzagarg
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制vexqp
 
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...kumargunjan9515
 

Kürzlich hochgeladen (20)

Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
 
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1
 
Kings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about themKings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about them
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Statistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbersStatistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbers
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubai
 
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
 
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptxRESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
 
7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt
 
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
 
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
 

Spark RDD-DF-SQL-DS-Spark Hadoop User Group Munich Meetup 2016

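The typed RDD transformation from the notebook can be tried without a cluster. Below is a minimal sketch on a plain local Scala collection, using the same `Person` case class and sample data as above; no Spark APIs are involved, `Seq` simply stands in for the parallelized collection:

```scala
// Same schema as in the notebook: each field represents a column.
case class Person(name: String, age: Int)

// Local stand-in for the parallelized collection peopleRDD.
val people = Seq(
  Person("Lars", 37),
  Person("Sven", 38),
  Person("Florian", 39),
  Person("Dieter", 37)
)

// The same functional transformation as peopleRDD.filter(_.age > 37):
// the predicate is an ordinary, compile-time-checked Scala function.
val over37 = people.filter(_.age > 37)

over37.foreach(println)  // Person(Sven,38) and Person(Florian,39)
```

The point of the analogy: on an RDD, `filter` takes exactly this kind of typed function, so the compiler catches a misspelled field or a wrong type before the job runs.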
| 38|
| 39|
| 37|
+---+

The code for the steps above:

val peopleDF = peopleRDD.toDF
peopleDF.show()
peopleDF.printSchema()
peopleDF.select("age").show()

NB: The result set consists of Rows of Strings and Ints.

> 
[Sven,38]
[Florian,39]

peopleDF
.filter("age > 37")
.collect
.foreach(row => println(row))

DataSets

Create Datasets from RDDs. An implicit conversion is also available.

> 
peopleDS: org.apache.spark.sql.Dataset[Person] = [name: string, age: int]

val peopleDS = peopleRDD.toDS

NB: The result set consists of Person objects.
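The practical difference between the Row-based DataFrame results and the typed Dataset results can be sketched locally. In the sketch below a plain `Seq[Any]` stands in for `org.apache.spark.sql.Row` (it is not the real Row API): untyped access by position only fails at runtime, while case-class access is checked by the compiler.

```scala
case class Person(name: String, age: Int)

// A DataFrame result row is untyped: as with this stand-in, fields come
// back as Any and must be fetched by position and cast.
val row: Seq[Any] = Seq("Sven", 38)        // mimics the printed [Sven,38]
val ageFromRow = row(1).asInstanceOf[Int]  // a wrong index or wrong cast
                                           // would only fail at runtime

// A Dataset result is typed: fields are accessed by name, and the
// compiler rejects a misspelled field or treating age as a String.
val p = Person("Sven", 38)
val ageFromDs = p.age

println(ageFromRow == ageFromDs)  // true
```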
Datasets support the same functional transformations as RDDs, but on a typed, optimizable representation.

> 
ds: org.apache.spark.sql.Dataset[Person] = [name: string, age: int]

val ds = peopleDS
.filter(_.age > 37)

NB: The result set again consists of Person objects.

> 
Person(Sven,38)
Person(Florian,39)

ds.collect
.foreach(println(_))

Print the analyzed and the optimized logical plan - for this simple query both are identical.

> 
res294: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = MapPartitions <function1>, class[name[0]: string, age[0]: int], class[name[0]: string, age[0]: int], [name#11044,age#11045]
+- LogicalRDD [name#11041,age#11042], MapPartitionsRDD[5894] at rddToDatasetHolder at <console>:97

ds.queryExecution.analyzed

> 
res295: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = MapPartitions <function1>, class[name[0]: string, age[0]: int], class[name[0]: string, age[0]: int], [name#11044,age#11045]
+- LogicalRDD [name#11041,age#11042], MapPartitionsRDD[5894] at rddToDatasetHolder at <console>:97

ds.queryExecution.optimizedPlan

Spark SQL

Get the SQL context from the Spark context.

NB: "In Databricks, developers should utilize the shared HiveContext instead of creating one using the constructor. In Scala and Python notebooks, the shared context can be accessed as sqlContext. When running a job, you can access the shared context by calling SQLContext.getOrCreate(SparkContext.getOrCreate())."

> 
import org.apache.spark.sql.SQLContext
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.hive.HiveContext@50ec5129

import org.apache.spark.sql.SQLContext
val sqlContext = SQLContext.getOrCreate(SparkContext.getOrCreate())

Register the DataFrame for usage via SQL.

peopleDF.registerTempTable("sparkPeopleTbl")

The results of SQL queries are DataFrames and support all the usual RDD operations.

> 
res298: org.apache.spark.sql.DataFrame = [name: string, age: int]

sqlContext.sql("SELECT * FROM sparkPeopleTbl WHERE age > 37")

Print the execution plan.

> 
res299: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = Project [name#11037,age#11038]
+- Filter (age#11038 > 37)
+- Subquery sparkpeopletbl
+- LogicalRDD [name#11037,age#11038], MapPartitionsRDD[5887] at rddToDataFrameHolder at <console>:97

sql("SELECT * FROM sparkPeopleTbl WHERE age > 37").queryExecution.analyzed
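The step from the analyzed to the optimized plan can be sketched with a toy plan model. The case classes below only mirror the node names in the printed plans; they are hypothetical and not Catalyst's real `org.apache.spark.sql.catalyst` API. The single rewrite rule removes `Subquery` nodes, which is the effect visible when comparing the analyzed and the optimized plan of the SQL query:

```scala
// Toy model of a logical plan tree - NOT the real Catalyst classes.
sealed trait Plan
case class LogicalRDD(columns: Seq[String])            extends Plan
case class Subquery(alias: String, child: Plan)        extends Plan
case class Filter(condition: String, child: Plan)      extends Plan
case class Project(columns: Seq[String], child: Plan)  extends Plan

// One rewrite rule: a Subquery node only carries a name, so the
// optimizer can splice it out of the tree.
def eliminateSubqueries(plan: Plan): Plan = plan match {
  case Subquery(_, child)   => eliminateSubqueries(child)
  case Filter(c, child)     => Filter(c, eliminateSubqueries(child))
  case Project(cols, child) => Project(cols, eliminateSubqueries(child))
  case leaf                 => leaf
}

// Roughly the analyzed plan printed by the notebook:
val analyzed: Plan =
  Project(Seq("name", "age"),
    Filter("age > 37",
      Subquery("sparkpeopletbl",
        LogicalRDD(Seq("name", "age")))))

// After the rewrite the Subquery is gone, matching the optimized output.
val optimized = eliminateSubqueries(analyzed)
println(optimized)
```

Real Catalyst applies many such tree-rewrite rules (predicate pushdown, constant folding, and so on) until the plan stabilizes; subquery-alias elimination is just the one visible in this example.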
The plan as optimized by the built-in optimizer - the Subquery node has been eliminated:

> 
res300: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = Project [name#11037,age#11038]
+- Filter (age#11038 > 37)
+- LogicalRDD [name#11037,age#11038], MapPartitionsRDD[5887] at rddToDataFrameHolder at <console>:97

sql("SELECT * FROM sparkPeopleTbl WHERE age > 37").queryExecution.optimizedPlan

SQL can also be run directly in a notebook cell:

> 
Sven 38
Florian 39

%sql SELECT * FROM sparkPeopleTbl WHERE age > 37

> 
res301: Array[org.apache.spark.sql.Row] = Array([Sven,38], [Florian,39])

sql("SELECT * FROM sparkPeopleTbl WHERE age > 37").collect

> 
res302: Array[String] = Array(NAME: Sven, NAME: Florian)

sql("SELECT * FROM sparkPeopleTbl WHERE age > 37").map(row => "NAME: " + row(0)).collect

> 
res303: Array[String] = Array(NAME: Sven, NAME: Florian)

sql("SELECT * FROM sparkPeopleTbl WHERE age > 37").rdd.map(row => "NAME: " + row(0)).collect()

> 
res304: org.apache.spark.sql.DataFrame = [name: string, age: int]

sql("SELECT * FROM sparkPeopleTbl WHERE age > 37")

Running SQL queries against Parquet files directly

> 
? s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/_SUCCESS [0.00 MiB]
? s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/_common_metadata [0.00 MiB]
? s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/_metadata [0.03 MiB]
? s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/part-r-00000-5df3eb09-ce03-4ac6-8770-287ce4782749.gz.parquet [53.92 MiB]
? s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/part-r-00001-5df3eb09-ce03-4ac6-8770-287ce4782749.gz.parquet [53.87 MiB]
? s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/part-r-00002-5df3eb09-ce03-4ac6-8770-287ce4782749.gz.parquet [53.84 MiB]
? s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/part-r-00003-5df3eb09-ce03-4ac6-8770-287ce4782749.gz.parquet [53.92 MiB]
? s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/part-r-00004-5df3eb09-ce03-4ac6-8770-287ce4782749.gz.parquet [53.89 MiB]
? s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/part-r-00005-5df3eb09-ce03-4ac6-8770-287ce4782749.gz.parquet [53.87 MiB]
? s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/part-r-00006-5df3eb09-ce03-4ac6-8770-287ce4782749.gz.parquet [53.89 MiB]
? s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/part-r-00007-5df3eb09-ce03-4ac6-8770-287ce4782749.gz.parquet [53.89 MiB]
? s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/part-r-00008-5df3eb09-ce03-4ac6-8770-287ce4782749.gz.parquet [53.89 MiB]
? s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/part-r-00009-5df3eb09-ce03-4ac6-8770-287ce4782749.gz.parquet [53.95 MiB]

ls3("s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/")

Sample of the result (columns truncated; the notebook shows the first 1000 rows):

> 
medallion                          hack_license                       vendor_id  rate_code  pickup_datetime
0E0157EB3E6F927FE7FA86C3A0762B3B   4E0569072023608FFEB72F454CCF408B   VTS        1          2013-01-13T11:08:00.000+0000
19BF1BB516C4E992EA3FBAEDA73D6262   E4CAC9101BFE631554B57906364761D3   VTS        2          2013-01-13T10:33:00.000+0000
D57C7392455C38D9404660F7BC63D1B6   EC5837D805127379D72FF6C35279890B   VTS        1          2013-01-13T04:58:00.000+0000
67108EDF8123623806A1DAFE8811EE63   36C9437BD2FF31940BEBED44DDDDDB8A   VTS        5          2013-01-13T04:58:00.000+0000
5F78CC6D4ECD0541B765FECE17075B6F   39703F5449DADC0AEFFAFEFB1A6A7E64   VTS        1          2013-01-13T08:56:00.000+0000

%sql SELECT * FROM parquet.`s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/` WHERE trip_time_in_secs <= 60

Conclusions

RDDs

RDDs remain the core abstraction for native distributed collections. But because they lack built-in optimization, DataFrames and Datasets should be preferred.

DataFrames

DataFrames and Spark SQL are very flexible and bring built-in optimization also to dynamic languages like Python and R. Besides this, they allow combining the declarative and the functional way of working with structured data. This is regarded as the most stable and flexible API.

Datasets

Datasets unify the best of both worlds: the type safety of RDDs and the built-in optimization available to DataFrames. Datasets allow even further optimizations (memory compaction plus faster serialization using encoders). However, the API is still experimental and can currently be used only from Scala and Java. Datasets are not yet mature, but it is expected that they will quickly become a common way to work with structured data, and Spark plans to converge the APIs even further.

Here are some benchmarks of Datasets:

[Figure: Transformation of Data Types in Spark]